COLING Workshop

International Joint Workshop on Natural Language Processing in Biomedicine and its Applications

BioNLP/NLPBA 2004

Shared Task

Geneva, Switzerland

August 28-29, 2004

Downloads

Shared task training data 2,296,096 bytes Genia4EReval.tar.gz
Shared task evaluation data 861,152 bytes Genia4ERtaskV2.tar.gz

We thank all those groups who participated in the shared task. The final report on the systems is available here.

Bio-Entity Recognition

This year we propose to have a special shared task: bio-medical named entity recognition from the GENIA corpus. The purpose of this track is essentially to investigate the integration of statistical machine learning methods with deep natural language processing resources such as parsers and various knowledge sources from the bio-medical domain such as ontologies, thesauri and lexicons.

The task is essentially an open challenge task and we encourage the development of systems that explore any kind of external resource. The aim of this task is not simply to find the system with the highest F-score, as ranking of systems in this way may not be particularly insightful. Rather we encourage papers that present careful analysis and discussion of results and explore previously under-utilized resources in interesting ways. We also encourage the comparison of systems with and without the resources to motivate discussion of their merits.

Authors of accepted papers will be invited to give a short talk outlining their methodology (10 minutes + 5 minutes Q/A) at a special afternoon session of the workshop.

Data

The training data will consist of 2000 MEDLINE abstracts of the GENIA version 3 corpus provided in column format with named entities in IOB2 notation. These abstracts were collected using the search terms "human", "blood cell", "transcription factor". The testing data will consist of a further number of previously unpublished MEDLINE abstracts which will be released 10 days prior to the submission date of the papers. The testing data will come from a super-domain of the training data ("blood cell", "transcription factor") which we hope will encourage the use of generalizable methods. See the important dates for further information.

The named entity classes used in the evaluation will be a conflated version of those already annotated: DNA, RNA, protein, cell_line and cell_type. The conversion from the GENIA taxonomy to the conflated set of classes was done as follows:

1) new class: DNA
from GENIA classes: DNA_domain_or_region & DNA_family_or_group & DNA_molecule
2) new class: RNA
from GENIA classes: RNA_domain_or_region & RNA_family_or_group & RNA_molecule
3) new class: cell_line
from GENIA class: cell_line
4) new class: cell_type
from GENIA class: cell_type
5) new class: protein
from GENIA classes: protein_complex & protein_domain_or_region & protein_family_or_group & protein_molecule & protein_substructure & protein_subunit

All other GENIA classes were removed in the conversion.
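
The conversion listed above can be written down as a simple lookup table. The following is only an illustrative sketch, not the script used by the organizers: the names GENIA_TO_TASK and conflate are made up for this example, and the protein class is spelled in lower case to agree with the B-protein tag used in the format example.

```python
# Mapping from GENIA taxonomy classes to the five shared-task classes,
# transcribed from the conversion list above (names are illustrative).
GENIA_TO_TASK = {
    "DNA_domain_or_region": "DNA",
    "DNA_family_or_group": "DNA",
    "DNA_molecule": "DNA",
    "RNA_domain_or_region": "RNA",
    "RNA_family_or_group": "RNA",
    "RNA_molecule": "RNA",
    "cell_line": "cell_line",
    "cell_type": "cell_type",
    "protein_complex": "protein",
    "protein_domain_or_region": "protein",
    "protein_family_or_group": "protein",
    "protein_molecule": "protein",
    "protein_substructure": "protein",
    "protein_subunit": "protein",
}


def conflate(genia_class):
    """Return the shared-task class, or None if the class was removed."""
    return GENIA_TO_TASK.get(genia_class)


if __name__ == "__main__":
    print(conflate("protein_subunit"))           # a class that is kept
    print(conflate("hypothetical_other_class"))  # placeholder for a removed class
```

Any GENIA class that is not a key of the table maps to None, which corresponds to the removal of all other classes in the conversion.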

Format

We follow the tradition used in other evaluation exercises such as CoNLL of using column formatted data with named entities in IOB2 format. IOB2 is used where named entities are not nested and therefore do not overlap. Words outside of named entities are tagged with "O", while the first word in a named entity is tagged with "B-k" for begin class k, and further named entity words receive tag "I-k" for inside. Columns are separated by spaces. An example is as follows:

TAR    B-RNA
independent    O
transactivation    O
by    O
Tat    B-protein
in    O
cells    O
derived    O
from    O
the    O
CNS    B-cell_type
a    O
novel    O
mechanism    O
of    O
HIV-1    B-DNA
gene    I-DNA
regulation    O
.    O

While the use of non-overlapping and non-nested named entities ignores many interesting research issues in the internal semantics of terms, we felt that this simplifying assumption was necessary to maintain some level of consistency for comparison purposes. Moreover, the use of any particular tokenization scheme will also be controversial to some extent, but again for purposes of comparison and evaluation we have considered this as a necessary simplifying assumption.

Since the task is essentially an open challenge we have not provided any additional feature information beyond the tokenized surface forms of words.
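
For participants who want to turn the IOB2 column format described above into entity spans, a minimal reader might look like the sketch below. It is not part of the distributed data bundle; the function names read_iob2 and extract_entities are invented for this illustration, and the sample lines are a fragment taken from the example above.

```python
def read_iob2(lines):
    """Split column-formatted lines into (token, tag) pairs.

    Blank lines are skipped. The tag is taken to be the last
    whitespace-separated column of each line.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        token, tag = line.rsplit(None, 1)
        pairs.append((token, tag))
    return pairs


def extract_entities(tags):
    """Return (start, end, class) spans, with end exclusive."""
    entities = []
    start, label = None, None
    # A trailing "O" sentinel closes an entity that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = (
            tag.startswith("I-") and label is not None and tag[2:] == label
        )
        # An open entity ends at "O", at a new "B-", or at an "I-" of
        # a different class.
        if start is not None and not continues:
            entities.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities


if __name__ == "__main__":
    sample = [
        "Tat    B-protein",
        "in    O",
        "HIV-1    B-DNA",
        "gene    I-DNA",
        "regulation    O",
    ]
    pairs = read_iob2(sample)
    print(extract_entities([tag for _, tag in pairs]))
```

An "I-k" tag that does not follow a "B-k" or "I-k" of the same class is ignored in this sketch; whether to repair or reject such sequences is a design choice left to each system.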

Evaluation

Evaluation will be done by the authors using the provided script, which shows F-scores for each class. We will be releasing the test data 10 days prior to the final submission date for papers in order to allow time for good analysis of results. The evaluation script will show three F-scores: one for exact matching of boundaries, and two for relaxed matching schemes in which a system is given a full point for a named entity even though the left (or right) boundary is out by one word position. We include the relaxed schemes because we believe that the results of such systems are still realistic and useful in practical tasks. Results from both strict and relaxed matching should be included in the paper.
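
To make the difference between strict and relaxed matching concrete, the sketch below computes precision, recall and F-score under three matching rules. It is not the provided evaluation script, and it reflects only one reading of the description above: the relaxed rules are assumed to require the same class and one exactly matching boundary, while the other boundary may be off by at most one word position. Entities are assumed to be (start, end, class) tuples, and all function names are invented for this example.

```python
def f_score(gold, predicted, match):
    """Return (precision, recall, F-score) under the given match rule."""
    unused = list(gold)  # each gold entity may be credited at most once
    correct = 0
    for p in predicted:
        for g in unused:
            if match(g, p):
                unused.remove(g)
                correct += 1
                break
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def exact(g, p):
    # Strict matching: both boundaries and the class must agree.
    return g == p


def left_relaxed(g, p):
    # Assumption: class and right boundary agree, left boundary within one word.
    return g[2] == p[2] and g[1] == p[1] and abs(g[0] - p[0]) <= 1


def right_relaxed(g, p):
    # Assumption: class and left boundary agree, right boundary within one word.
    return g[2] == p[2] and g[0] == p[0] and abs(g[1] - p[1]) <= 1


if __name__ == "__main__":
    gold = [(0, 1, "RNA"), (4, 5, "protein"), (15, 17, "DNA")]
    predicted = [(0, 1, "RNA"), (4, 5, "DNA"), (14, 17, "DNA")]
    for rule in (exact, left_relaxed, right_relaxed):
        print(rule.__name__, f_score(gold, predicted, rule))
```

Per-class scores can be obtained by filtering both lists to a single class before calling f_score.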

We require that authors submit a copy of their final evaluation data along with their papers so that we can confirm results and compile a summary of systems.

Please note that evaluation will be qualitative as well as quantitative (based on F-scores). We are aiming to select approximately the best 6 submissions which "present careful analysis and discussion of results and explore previously under-utilized resources in interesting ways".

Resources

Since we want to focus on the use and re-use of deep knowledge resources (NLP and domain) we expect systems to explore more than widely used lexical-level features (POS, lemma, orthographic, etc.). This does not however exclude comparison against such models in the paper, or preclude the re-use of corpus-external terminological resources for lexical tasks such as stemming, etc.

Possible areas which could be explored include: resolution of local syntactic ambiguities in coordinated NEs or NEs containing embedded abbreviations; use of dependency relations; re-use of database term lists such as SwissProt or Protein Information Resource; re-use of alias lists such as LocusLink; use of special lexicons such as UMLS Specialist; use of Gene Ontology or MeSH; combining coreference resolution with NE recognition to obtain improved accuracy; use of in-domain part of speech tagging using the GENIA POS corpus, etc.

Organization

Nigel Collier, National Institute of Informatics, Japan

Jindong Kim, University of Tokyo, Japan (contact person)

Yuka Tateisi, University of Tokyo, Japan

Tomoko Ohta, University of Tokyo, Japan

Yoshimasa Tsuruoka, University of Tokyo, Japan

Shared task contacts: bio04sharedtask@nii.ac.jp

General organization: jnlpba-request@sim.hcuge.ch

Main page: BioNLP/NLPBA 2004

Participation and Important Dates

If you are interested in participating please notify the shared task organizers by sending an email to bio04sharedtask@nii.ac.jp. You should include your name(s), affiliation and contact address(es).

Submission Instructions

Please note that the deadline for submission of papers, data and the questionnaire is 21st April 2004 (midnight GMT). Due to the very tight scheduling for reviews we regret that submissions received after this cannot be accepted.

1. Submission of papers

You should send your shared task paper formatted in the COLING project style (up to 4 pages) to jnlpba-submit@david.hcuge.ch with the subject "BioNLP/NLPBA Shared Task". Please note that this email address is different to the shared task organizers' email given above.

2. Submission of data and questionnaire

You should submit the result produced by your system on the evaluation data to the shared task organizers. This can be done either by compressing it (e.g. zip, gzip) and emailing the file to bio04sharedtask@nii.ac.jp with the subject header "BioNLP/NLPBA Shared Task Data", or by using the web-based CGI script which you can find (coming soon) at: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/submit.html. Either method is acceptable but the web-based method is preferred. Along with the data you should submit the questionnaire given in the evaluation file bundle so that we can keep track of your data and, if you agree, make it available to other researchers.

If you have any problems or questions then please let us know (bio04sharedtask@nii.ac.jp). Thank you again and good luck with your systems!

Nigel Collier (last modified: September 21st 2004).