NTCIR-13 MedWeb (Medical Natural Language Processing for Web Document)
The NTCIR-13 MedWeb (Medical Natural Language Processing for Web Document) task is a multi-label classification task: each tweet must be assigned labels for eight diseases/symptoms. Given pseudo-tweets, the output is a Positive:p or Negative:n label for each of the eight diseases/symptoms. The results of this task can be applied almost directly as a core engine for real-world applications.
This task provides pseudo-Twitter messages as a cross-language, multi-label corpus covering three languages (Japanese, English, and Chinese) and annotated with eight labels: influenza, diarrhea/stomachache, hay fever, cough/sore throat, headache, fever, runny nose, and cold. For more details, please refer to the Task data section and Overview of the NTCIR-13: MedWeb Task [PDF].
References
Shoko Wakamiya, Mizuki Morita, Yoshinobu Kano, Tomoko Ohkuma and Eiji Aramaki: Overview of the NTCIR-13 MedWeb Task, In Proceedings of the 13th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR-13), pp. 40-49, 2017. [PDF]
The NTCIR-13 MedWeb (Medical Natural Language Processing for Web Document) task provides pseudo-Twitter messages (in Japanese, English, and Chinese) with labels for eight diseases/symptoms: influenza, diarrhea/stomachache, hay fever, cough/sore throat, headache, fever, runny nose, and cold.
Creating Pseudo-Tweets
Owing to the Twitter developer policy on data redistribution, tweet data crawled using the Twitter API cannot be made publicly available. Therefore, we created Japanese pseudo-tweets through a crowdsourcing service. The Japanese pseudo-tweets were then translated into English and Chinese by first-language practitioners of each target language. Note that each ID corresponds across the language corpora (e.g., the tweet "135en" corresponds to the tweets "135ja" and "135zh", as shown in the table below).
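The ID correspondence across languages can be sketched as follows. This is a minimal illustration, assuming every ID consists of a numeric part followed by a two-letter language suffix (`ja`, `en`, or `zh`), as in the "135en" example; the function name `parallel_ids` is hypothetical and not part of the test collection.

```python
import re

# Language suffixes seen in the example IDs ("135ja", "135en", "135zh").
LANGS = ("ja", "en", "zh")

def parallel_ids(tweet_id):
    """Return the IDs of the same pseudo-tweet in all three languages."""
    # Assumption: an ID is digits followed by one of the language suffixes.
    match = re.fullmatch(r"(\d+)(ja|en|zh)", tweet_id)
    if match is None:
        raise ValueError(f"unexpected ID format: {tweet_id!r}")
    number = match.group(1)
    return {lang: f"{number}{lang}" for lang in LANGS}

print(parallel_ids("135en"))
# {'ja': '135ja', 'en': '135en', 'zh': '135zh'}
```

A lookup like this is useful when aligning the three corpora for cross-language experiments.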
Symptom Labeling
Two annotators each attached Positive:p or Negative:n labels for the eight symptoms to the tweets. For more information, please check the annotation guideline [figshare].
Corpus Size
The Japanese, English, and Chinese corpora each consist of 2,560 tweet texts. Each corpus is divided into training data consisting of 1,920 tweet texts (75% of the whole corpus) and test data consisting of 640 tweet texts (25% of the whole corpus).
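The split sizes can be checked with a few lines of arithmetic. This is only a sanity check of the figures quoted above; the constant names are chosen for illustration.

```python
# Figures quoted in the corpus description (per language).
TOTAL_PER_LANGUAGE = 2560
TRAIN_SIZE = 1920
TEST_SIZE = 640
LANGUAGES = ("ja", "en", "zh")

# The training and test portions should add up to the whole corpus.
assert TRAIN_SIZE + TEST_SIZE == TOTAL_PER_LANGUAGE

train_ratio = TRAIN_SIZE / TOTAL_PER_LANGUAGE
test_ratio = TEST_SIZE / TOTAL_PER_LANGUAGE
total_all_languages = TOTAL_PER_LANGUAGE * len(LANGUAGES)

print(f"train {train_ratio:.0%}, test {test_ratio:.0%}, "
      f"all languages {total_all_languages}")
# train 75%, test 25%, all languages 7680
```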
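|ID||Tweet||Influenza||Diarrhea/Stomachache||Hay fever||Cough/Sore throat||Headache||Fever||Runny nose||Cold|
|135en||I have a cold, which makes my nose stuffy like crazy.||n||n||n||n||n||n||p||p|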
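For classification, the p/n flags of a tweet are typically converted into a binary label vector. The sketch below assumes the eight flags follow the order in which the labels are listed in the task description (influenza through cold), which is consistent with the example row, where only the last two flags are positive. The names `LABELS` and `encode_labels` are hypothetical.

```python
# Assumed label order, taken from the task description.
LABELS = [
    "influenza", "diarrhea/stomachache", "hay fever", "cough/sore throat",
    "headache", "fever", "runny nose", "cold",
]

def encode_labels(flags):
    """Convert eight 'p'/'n' flags into a {label: 0 or 1} mapping."""
    if len(flags) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} flags, got {len(flags)}")
    if any(flag not in ("p", "n") for flag in flags):
        raise ValueError("flags must be 'p' or 'n'")
    return {label: int(flag == "p") for label, flag in zip(LABELS, flags)}

# The example tweet "135en" from the table.
example = {
    "id": "135en",
    "text": "I have a cold, which makes my nose stuffy like crazy.",
    "flags": ["n", "n", "n", "n", "n", "n", "p", "p"],
}

encoded = encode_labels(example["flags"])
positives = [label for label, value in encoded.items() if value]
print(positives)
# ['runny nose', 'cold']
```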
The test collection and data are available from NII free of charge.
NTCIR-13 MedWeb Test Collection is downloadable from NII/IDR at:
NII IDR: http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html
NTCIR-13 MedWeb Test Collection is licensed under a Creative Commons Attribution 4.0 International License.
- Task overview: Overview of the NTCIR-13: MedWeb Task [PDF]
- NTCIR-13 MedWeb website
Contact us: ntc-secretariat
- Updated on : 2018-07-23