NTCIR Project
NTCIR-13 MedWeb
Research Purpose Use of Test Collection



NTCIR-13 MedWeb (Medical Natural Language Processing for Web Document)

Test Collection

The NTCIR-13 MedWeb (Medical Natural Language Processing for Web Document) task is a multi-label classification task: each tweet must be assigned labels for eight diseases/symptoms. Given pseudo-tweets as input, the output is a Positive:p or Negative:n label for each of the eight diseases/symptoms. Results from this task are expected to be almost directly applicable as a core engine for real applications.

This task provides pseudo-Twitter messages in a cross-language, multi-label corpus covering three languages (Japanese, English, and Chinese) and annotated with eight labels: influenza, diarrhea/stomachache, hay fever, cough/sore throat, headache, fever, runny nose, and cold. For more details, please refer to the Task data section and the Overview of the NTCIR-13 MedWeb Task [PDF].

References

Shoko Wakamiya, Mizuki Morita, Yoshinobu Kano, Tomoko Ohkuma and Eiji Aramaki: Overview of the NTCIR-13 MedWeb Task, In Proceedings of the 13th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR-13), pp. 40-49, 2017. [PDF]

Task data

The NTCIR-13 MedWeb (Medical Natural Language Processing for Web Document) task provides pseudo-Twitter messages (in Japanese, English, and Chinese) with labels for eight diseases/symptoms: influenza, diarrhea/stomachache, hay fever, cough/sore throat, headache, fever, runny nose, and cold.

Creating Pseudo-Tweets

Owing to the Twitter developer policy on data redistribution, tweet data crawled using the Twitter API cannot be made publicly available. Therefore, we created Japanese pseudo-tweets using a crowdsourcing service. The Japanese pseudo-tweets were then translated into English and Chinese by relevant first-language practitioners. Note that IDs correspond across the language corpora (e.g., the tweet "135en" corresponds to the tweets "135ja" and "135zh", as shown in the table of examples).
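The ID correspondence described above can be sketched in Python. This is a minimal illustration, assuming each ID is a number followed by a two-letter language suffix (ja, en, or zh), as in "135en"; the helper names split_id and align_by_number are hypothetical and are not part of the test collection.

```python
import re
from collections import defaultdict

# Assumed ID shape: digits followed by a language suffix, e.g. "135en".
ID_PATTERN = re.compile(r"^(\d+)(ja|en|zh)$")


def split_id(tweet_id):
    """Split an ID such as '135en' into (135, 'en')."""
    match = ID_PATTERN.match(tweet_id)
    if match is None:
        raise ValueError(f"unexpected ID format: {tweet_id!r}")
    return int(match.group(1)), match.group(2)


def align_by_number(rows):
    """Group (id, text) pairs by their shared numeric part."""
    aligned = defaultdict(dict)
    for tweet_id, text in rows:
        number, lang = split_id(tweet_id)
        aligned[number][lang] = text
    return dict(aligned)


# The three example tweets published on this page.
rows = [
    ("135ja", "風邪で鼻づまりがやばい。"),
    ("135en", "I have a cold, which makes my nose stuffy like crazy."),
    ("135zh", "感冒引起的鼻塞很烦人。"),
]
aligned = align_by_number(rows)
print(sorted(aligned[135]))
```

Grouping by the numeric part gives one entry per pseudo-tweet with its Japanese, English, and Chinese versions side by side.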

Symptom Labeling

Two annotators each attached Positive:p or Negative:n labels for the eight symptoms to the tweets. For more information, please check the annotation guideline [figshare].

Corpus Size

The Japanese, English, and Chinese corpora each consist of 2,560 tweet texts. Each corpus is divided into training data consisting of 1,920 tweet texts (75% of the whole corpus) and test data consisting of 640 tweet texts (25% of the whole corpus).
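After loading a corpus, the sizes stated above can be used as a quick sanity check. The sketch below only compares counts; the function name check_split is hypothetical, and how the rows are loaded is left out because the file layout of the distribution is not described here.

```python
# Sizes stated for each language corpus.
EXPECTED_TRAIN = 1920
EXPECTED_TEST = 640
EXPECTED_TOTAL = 2560


def check_split(n_train, n_test):
    """Compare loaded row counts against the documented corpus sizes."""
    total = n_train + n_test
    return {
        "train_ok": n_train == EXPECTED_TRAIN,
        "test_ok": n_test == EXPECTED_TEST,
        "total_ok": total == EXPECTED_TOTAL,
        # Fraction of the corpus used for training (0.0 if nothing was loaded).
        "train_fraction": n_train / total if total else 0.0,
    }


result = check_split(1920, 640)
print(result)
```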

Table. Examples of pseudo-tweets with labels
ID | Tweet | Influenza | Diarrhea | Hayfever | Cough | Headache | Fever | Runnynose | Cold
135ja | 風邪で鼻づまりがやばい。 | n | n | n | n | n | n | p | p
135en | I have a cold, which makes my nose stuffy like crazy. | n | n | n | n | n | n | p | p
135zh | 感冒引起的鼻塞很烦人。 | n | n | n | n | n | n | p | p
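For classification, the eight p/n labels of a tweet are typically turned into a binary vector. The following is a minimal sketch, assuming the labels arrive as eight whitespace-separated p/n flags in the column order of the table header; the function name encode_labels is hypothetical, and the distributed files may store the labels differently.

```python
# Column order taken from the table header on this page.
LABELS = [
    "Influenza", "Diarrhea", "Hayfever", "Cough",
    "Headache", "Fever", "Runnynose", "Cold",
]


def encode_labels(flags):
    """Map a string of eight p/n flags to a {label: 0 or 1} dict."""
    tokens = flags.split()
    if len(tokens) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} flags, got {len(tokens)}")
    if any(token not in ("p", "n") for token in tokens):
        raise ValueError(f"flags must be 'p' or 'n': {flags!r}")
    return {name: int(token == "p") for name, token in zip(LABELS, tokens)}


# Flags of example tweet 135en.
encoded = encode_labels("n n n n n n p p")
positive = [name for name, value in encoded.items() if value == 1]
print(positive)
```

For the example tweet, only the runny nose and cold labels are positive, which matches the tweet text.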

To obtain the test collection

The test collection and data are available from NII free of charge.

Creative Commons License (CC BY 4.0)
NTCIR-13 MedWeb Test Collection is licensed under a Creative Commons Attribution 4.0 International License.

Reference

Contact us: ntc-secretariat

