COLING Workshop
International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
BioNLP/NLPBA 2004
Shared Task
Geneva, Switzerland
August 28-29, 2004
We thank all those groups who
participated in the shared task. The final report on the systems is available
here.
Bio-Entity Recognition
This year we propose to have a special shared task: bio-medical named entity
recognition from the GENIA
corpus. The purpose of this track is essentially to investigate the integration
of statistical machine learning methods with deep natural language processing
resources such as parsers and various knowledge sources from the bio-medical
domain such as ontologies, thesauri and lexicons.
The task is essentially an open challenge task and we encourage the development of
systems that explore any kind of external resource. The aim of this task is not
simply to find the system with the highest F-score, as ranking systems in
this way may not be particularly insightful. Rather, we encourage papers that
present careful analysis and discussion of results and explore previously
under-utilized resources in interesting ways.
We also encourage the comparison of systems with and without these
resources to motivate discussion of their merits.
Authors of accepted papers will be invited to give a short talk outlining their
methodology (10 minutes plus 5 minutes for Q&A) at a special afternoon session of the workshop.
Data
The training data will consist of 2000 MEDLINE abstracts of the GENIA version 3
corpus provided in column format with named entities in IOB2 notation. These
abstracts were collected using the search terms "human", "blood cell" and
"transcription factor". The testing data will consist of a further number of
previously unpublished MEDLINE abstracts which will be released 10 days prior to
the submission date of the papers. The testing data will come from a
super-domain of the training data ("blood cell", "transcription factor"),
which we hope will encourage the use of generalizable methods. See the important
dates for further information.
The named entity classes used in the evaluation will be a conflated version of
those already annotated: DNA, RNA, protein, cell_line and cell_type.
The conversion from the GENIA taxonomy to the conflated set of classes was done
as follows:
1) new class: DNA
from GENIA classes: DNA_domain_or_region & DNA_family_or_group & DNA_molecule
2) new class: RNA
from GENIA classes: RNA_domain_or_region & RNA_family_or_group & RNA_molecule
3) new class: cell_line
from GENIA class: cell_line
4) new class: cell_type
from GENIA class: cell_type
5) new class: protein
from GENIA classes: protein_complex & protein_domain_or_region &
protein_family_or_group & protein_molecule & protein_substructure &
protein_subunit
All other GENIA classes were removed in the conversion.
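The conversion listed above can be written down as a simple lookup table. The following is only a minimal sketch in Python, not the script the organizers used; the names GENIA_TO_TASK and conflate are hypothetical, and the lower-case spelling "protein" is assumed from the tags used in the example data.

```python
# Hypothetical lookup table built from the conversion list above.
GENIA_TO_TASK = {
    "DNA_domain_or_region": "DNA",
    "DNA_family_or_group": "DNA",
    "DNA_molecule": "DNA",
    "RNA_domain_or_region": "RNA",
    "RNA_family_or_group": "RNA",
    "RNA_molecule": "RNA",
    "cell_line": "cell_line",
    "cell_type": "cell_type",
    "protein_complex": "protein",
    "protein_domain_or_region": "protein",
    "protein_family_or_group": "protein",
    "protein_molecule": "protein",
    "protein_substructure": "protein",
    "protein_subunit": "protein",
}


def conflate(genia_class):
    """Return the shared-task class, or None if the GENIA class is removed."""
    return GENIA_TO_TASK.get(genia_class)


if __name__ == "__main__":
    for name in ("DNA_molecule", "protein_subunit", "other_name"):
        print(name, "->", conflate(name))
```

Any class that is not a key of the table, such as "other_name" in the usage lines, maps to None, which corresponds to removal in the conversion.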
Format
We follow the tradition used in other evaluation exercises such as CoNLL of using column formatted data
with named entities in IOB2 format. IOB2 is used where named entities are not
nested and therefore do not overlap. Words outside of named entities are tagged
with "O", while the first word in a named entity is tagged with "B-k"
for begin class k, and further named entity words receive tag "I-k" for inside.
Columns are separated by spaces. An example is as follows:
TAR B-RNA
independent O
transactivation O
by O
Tat B-protein
in O
cells O
derived O
from O
the O
CNS B-cell_type
a O
novel O
mechanism O
of O
HIV-1 B-DNA
gene I-DNA
regulation O
. O
While the use of non-overlapping and non-nested named entities ignores many interesting research
issues in the internal semantics of terms, we felt that this simplifying
assumption was necessary to maintain some level of consistency for comparison
purposes. Moreover, the use of any
particular tokenization scheme will also be controversial to some extent, but
again, for purposes of comparison and evaluation, we have considered this a
necessary simplifying assumption.
Since the task is essentially an open challenge, we have not provided any additional feature
information beyond the tokenized surface forms of words.
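To make the IOB2 column format concrete, the following minimal Python sketch reads space-separated token/tag lines and recovers the entity spans. It is an illustration only, not part of the distributed data or scripts; the function names read_iob2 and extract_entities are hypothetical, and treating a stray "I-k" tag without a preceding "B-k" as the start of an entity is an assumption made here for robustness.

```python
def read_iob2(lines):
    """Return (token, tag) pairs from space-separated column lines; skip blanks."""
    pairs = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        # First column is the token, last column is the IOB2 tag.
        pairs.append((parts[0], parts[-1]))
    return pairs


def extract_entities(tags):
    """Return (start, end, cls) spans, end exclusive, from a list of IOB2 tags."""
    entities = []
    start, cls = None, None
    for i, tag in enumerate(tags):
        # An open entity ends at "O", at any "B-", or at an "I-" of another class.
        if start is not None and (
            tag == "O" or tag.startswith("B-") or tag[2:] != cls
        ):
            entities.append((start, i, cls))
            start, cls = None, None
        if tag.startswith("B-"):
            start, cls = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Assumption: an "I-" with no open entity starts a new one.
            start, cls = i, tag[2:]
    if start is not None:
        entities.append((start, len(tags), cls))
    return entities


if __name__ == "__main__":
    sample = ["of O", "HIV-1 B-DNA", "gene I-DNA", "regulation O", ". O"]
    pairs = read_iob2(sample)
    print(extract_entities([tag for _, tag in pairs]))  # [(1, 3, 'DNA')]
```

Spans are token offsets with an exclusive end, so (1, 3, 'DNA') covers the two tokens "HIV-1 gene" from the example above.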
Evaluation
Evaluation will be done by the authors using the provided script that will show F-scores
for each class. We will be releasing the test data 10 days prior to the final
submission date for papers in order to allow time for good analysis of results.
The evaluation script will show three F-scores: one for exact matching of
boundaries, and two for relaxed matching schemes in which
systems will be given a full point for a named entity even though the left (or
right) boundary is out by one word position. We include the relaxed schemes because we believe that the results
of such systems are still realistic and useful in practical tasks.
Results from both strict and relaxed boundary matching should be
included in the paper.
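The difference between exact and relaxed boundary matching can be illustrated with a short Python sketch. This is not the provided evaluation script, and its exact matching rules may differ; here an entity is assumed to be a (start, end, class) span with an exclusive end, and the names f_score, exact, left_relaxed and right_relaxed are hypothetical.

```python
def f_score(gold, pred, match):
    """Return (precision, recall, F-score) for entity spans under a match rule."""
    matched_pred = sum(1 for p in pred if any(match(g, p) for g in gold))
    matched_gold = sum(1 for g in gold if any(match(g, p) for p in pred))
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def exact(g, p):
    # Same class and identical boundaries.
    return g == p


def left_relaxed(g, p):
    # Same class and right boundary; left boundary may be out by one word.
    return g[2] == p[2] and g[1] == p[1] and abs(g[0] - p[0]) <= 1


def right_relaxed(g, p):
    # Same class and left boundary; right boundary may be out by one word.
    return g[2] == p[2] and g[0] == p[0] and abs(g[1] - p[1]) <= 1


if __name__ == "__main__":
    gold = [(0, 1, "protein"), (5, 7, "DNA")]
    pred = [(0, 1, "protein"), (4, 7, "DNA")]  # second span starts one word early
    for name, rule in (("exact", exact), ("left", left_relaxed), ("right", right_relaxed)):
        print(name, f_score(gold, pred, rule))
```

In the usage lines the second predicted span starts one word too early, so it counts only under the left-relaxed rule; the per-class scores described above would be obtained by filtering both lists by class before calling f_score.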
We require that authors submit a copy of their final evaluation data along with
their papers so that we can confirm results and compile a summary of systems.
Please note that evaluation will be qualitative as well as quantitative (based on
F-scores). We are aiming to select approximately the best 6 submissions which
"present careful analysis and discussion of results and explore previously
under-utilized resources in interesting ways".
Resources
Since we want to focus on the use and re-use of deep knowledge resources (NLP and
domain), we expect systems to explore more than widely used lexical-level
features (POS, lemma, orthographic, etc.). This
does not, however, exclude comparison against such models in the paper, or
preclude the re-use of corpus-external terminological resources for
lexical tasks such as stemming, etc.
Possible areas which could be explored include: resolution of local syntactic
ambiguities in coordinated NEs or NEs containing embedded abbreviations; use of
dependency relations; re-use of database term lists such as SwissProt or Protein
Information Resource; re-use of alias lists such as LocusLink; use of special
lexicons such as UMLS Specialist; use of Gene Ontology or MeSH; combining
coreference resolution with NE to obtain improved accuracy; use of in-domain
part-of-speech information from the GENIA POS corpus, etc.
Organization
Nigel Collier, National Institute of Informatics, Japan
Jindong Kim, University of Tokyo, Japan (contact person)
Yuka Tateisi, University of Tokyo, Japan
Tomoko Ohta, University of Tokyo, Japan
Yoshimasa Tsuruoka, University of Tokyo, Japan
Shared task contacts: bio04sharedtask@nii.ac.jp
General organization: jnlpba-request@sim.hcuge.ch
Main page: BioNLP/NLPBA 2004
Participation and Important Dates
If you are interested in participating please
notify the shared task organizers by sending an email to bio04sharedtask@nii.ac.jp.
You should include your name(s), affiliation and contact address(es).
- Release of development data and evaluation scripts: March 4th
- Release of test data: April 12th
- Submission deadline for shared task workshop papers and final evaluation data: April 21st
- Notification of accepted papers: May 14th
- Deadline for camera ready copies: June 6th
- Workshop date: August 28th and August 29th (morning only)
Please note that the deadline for submission of papers, data and the
questionnaire is 21st April 2004 (midnight GMT). Due to the very tight
scheduling for reviews we regret that submissions received after this cannot be
accepted.
1. Submission of papers
You should send your shared task paper formatted in the COLING project style
(up to 4 pages) to jnlpba-submit@david.hcuge.ch with
the subject "BioNLP/NLPBA Shared Task". Please note that this email
address is different from the shared task organizers' email given above.
2. Submission of data and questionnaire
You should submit the result produced by your system on
the evaluation data to the shared task organizers. This can be done either by
compressing it (e.g. zip, gzip) and emailing the file to bio04sharedtask@nii.ac.jp with
the subject header "BioNLP/NLPBA Shared Task Data", or by
using the web based CGI-script which you can find (coming soon) at:
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/submit.html
Either method is acceptable but the web based method is preferred. Along with the data you should submit the
questionnaire given in the evaluation file bundle so that we can keep track of
your data and, if you agree, make it available for other researchers.
If you have any problems or
questions then please let us know (bio04sharedtask@nii.ac.jp).
Thank you again and good luck with your systems!
Nigel Collier (last modified: September 21st 2004).