NTCIR-13 MedWeb (Medical Natural Language Processing for Web Document)
The NTCIR-13 MedWeb (Medical Natural Language Processing for Web Document) task is a multi-label classification task: each tweet must be assigned a label for each of eight diseases/symptoms. Given pseudo-tweets as input, the output is a Positive:p or Negative:n label for each of the eight diseases/symptoms. The results of this task can be applied almost directly as a fundamental engine for actual applications.
This task provides pseudo-Twitter messages in a cross-language, multi-label corpus covering three languages (Japanese, English, and Chinese) and annotated with eight labels: influenza, diarrhea/stomachache, hay fever, cough/sore throat, headache, fever, runny nose, and cold. For more details, please refer to the Task data section and Overview of the NTCIR-13: MedWeb Task [PDF].
References
Shoko Wakamiya, Mizuki Morita, Yoshinobu Kano, Tomoko Ohkuma and Eiji Aramaki: Overview of the NTCIR-13 MedWeb Task, In Proceedings of the 13th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR-13), pp. 40-49, 2017. [PDF]
The NTCIR-13 MedWeb (Medical Natural Language Processing for Web Document) task provides pseudo-Twitter messages (in Japanese, English, and Chinese) with labels for eight diseases/symptoms: influenza, diarrhea/stomachache, hay fever, cough/sore throat, headache, fever, runny nose, and cold.
Creating Pseudo-Tweets
Owing to the Twitter developer policy on data redistribution, tweet data crawled using the Twitter API cannot be made publicly available. Therefore, we created Japanese pseudo-tweets using a crowdsourcing service. The Japanese pseudo-tweets were then translated into English and Chinese by relevant first-language practitioners. Note that the IDs correspond across the language corpora (e.g., the tweet "135en" corresponds to the tweets "135ja" and "135zh", as shown in the table below).
Symptom Labeling
Two annotators attached a Positive:p or Negative:n label for each of the eight symptoms to each tweet. For more information, please check the annotation guideline [figshare].
Corpus Size
The Japanese, English, and Chinese corpora each consist of 2,560 tweets. Each corpus is divided into training data consisting of 1,920 tweets (75% of the whole corpus) and test data consisting of 640 tweets (25% of the whole corpus).
ID | Tweet | Influenza | Diarrhea | Hayfever | Cough | Headache | Fever | Runnynose | Cold |
---|---|---|---|---|---|---|---|---|---|
135ja | 風邪で鼻づまりがやばい。 | n | n | n | n | n | n | p | p |
135en | I have a cold, which makes my nose stuffy like crazy. | n | n | n | n | n | n | p | p |
135zh | 感冒引起的鼻塞很烦人。 | n | n | n | n | n | n | p | p |
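The row layout in the table above can be read into a multi-label record with a short script. The following is a minimal sketch in Python, assuming rows are pipe-delimited exactly as displayed here; the distributed files may use a different format. The names `LABELS` and `parse_row` are hypothetical and not part of the test collection.

```python
import re

# Column order follows the header row of the example table.
LABELS = ["Influenza", "Diarrhea", "Hayfever", "Cough",
          "Headache", "Fever", "Runnynose", "Cold"]


def parse_row(line):
    """Parse one pipe-delimited row into (number, language, text, labels).

    Assumption: the tweet text itself contains no '|' character.
    """
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    tweet_id, text, flags = cells[0], cells[1], cells[2:]
    if len(flags) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} labels, got {len(flags)}")
    if any(flag not in ("p", "n") for flag in flags):
        raise ValueError(f"labels must be 'p' or 'n': {flags}")

    # IDs such as "135en" combine a shared number with a language suffix,
    # so the same number identifies the same tweet in all three corpora.
    match = re.fullmatch(r"(\d+)(ja|en|zh)", tweet_id)
    if match is None:
        raise ValueError(f"unexpected ID format: {tweet_id!r}")
    number, lang = int(match.group(1)), match.group(2)

    # Map each symptom name to True (Positive:p) or False (Negative:n).
    labels = {name: flag == "p" for name, flag in zip(LABELS, flags)}
    return number, lang, text, labels


row = ("135en | I have a cold, which makes my nose stuffy like crazy. "
       "| n | n | n | n | n | n | p | p |")
number, lang, text, labels = parse_row(row)
positive = [name for name, is_positive in labels.items() if is_positive]
print(number, lang, positive)
```

Because the numeric part of the ID is shared across languages, grouping parsed rows by `number` would align the Japanese, English, and Chinese versions of the same pseudo-tweet.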
The test collection and data are available from NII free of charge.
NTCIR-13 MedWeb Test Collection is downloadable from NII/IDR at:
NII IDR: http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html
NTCIR-13 MedWeb Test Collection is licensed under a Creative Commons Attribution 4.0 International License.
Reference
- The terms of use [PDF]
- Task overview: Overview of the NTCIR-13: MedWeb Task [PDF]
- NTCIR-13 MedWeb website
Contact us: ntc-secretariat
- Updated on : 2018-07-23
- ntc-admin