NTCIR-13 MedWeb

The NTCIR-13 MedWeb (Medical Natural Language Processing for Web Document) task is a multi-label classification task: each tweet must be assigned labels for eight diseases/symptoms. Given pseudo-tweets, the output is a Positive (p) or Negative (n) label for each of the eight diseases/symptoms. The results of this task are intended to be applicable, almost directly, as a core engine for real-world applications.

The task provides pseudo-Twitter messages as a cross-language, multi-label corpus covering three languages (Japanese, English, and Chinese) and annotated with eight labels: influenza, diarrhea/stomachache, hay fever, cough/sore throat, headache, fever, runny nose, and cold. For more details, please refer to the Task data section and Overview of the NTCIR-13 MedWeb Task [PDF].
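To make the label format concrete, here is a minimal Python sketch that turns one tweet's eight p/n labels into a binary vector, a common input format for multi-label classifiers. The label names and their order are taken from the column headers of the example table in the task data. The helper names `encode_labels` and `decode_labels` are hypothetical and are not part of any official task tooling.

```python
# Label order follows the column headers of the example table in the task data.
LABELS = ["Influenza", "Diarrhea", "Hayfever", "Cough",
          "Headache", "Fever", "Runnynose", "Cold"]

def encode_labels(tags):
    """Map a sequence of eight 'p'/'n' tags to a list of 1/0 values."""
    tags = list(tags)
    if len(tags) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} tags, got {len(tags)}")
    if any(t not in ("p", "n") for t in tags):
        raise ValueError("each tag must be 'p' or 'n'")
    return [1 if t == "p" else 0 for t in tags]

def decode_labels(vector):
    """Return the names of the labels marked positive in a 0/1 vector."""
    return [name for name, v in zip(LABELS, vector) if v == 1]

# Tweet 135en from the example table: runny nose and cold are positive.
vec = encode_labels("n n n n n n p p".split())
print(vec)                 # [0, 0, 0, 0, 0, 0, 1, 1]
print(decode_labels(vec))  # ['Runnynose', 'Cold']
```

Because a tweet can be positive for several labels at once (here, both runny nose and cold), each label is a separate binary decision, not one choice among eight classes.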

References

Shoko Wakamiya, Mizuki Morita, Yoshinobu Kano, Tomoko Ohkuma and Eiji Aramaki: Overview of the NTCIR-13 MedWeb Task, In Proceedings of the 13th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR-13), pp. 40-49, 2017. [PDF]

The NTCIR-13 MedWeb (Medical Natural Language Processing for Web Document) task provides pseudo-Twitter messages (in Japanese, English, and Chinese) labeled for eight diseases/symptoms: influenza, diarrhea/stomachache, hay fever, cough/sore throat, headache, fever, runny nose, and cold.

Creating Pseudo-Tweets

Owing to the Twitter developer policy on data redistribution, tweet data crawled using the Twitter API cannot be made publicly available. Therefore, we created Japanese pseudo-tweets through a crowdsourcing service. The Japanese pseudo-tweets were then translated into English and Chinese by first-language speakers of the respective languages. Note that IDs correspond across the language corpora (e.g., the tweet "135en" corresponds to the tweets "135ja" and "135zh", as shown in the table below).
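The ID correspondence can be used to align the three corpora. The following Python sketch assumes that every ID is a number followed by a two-letter language suffix (ja, en, or zh), which is inferred only from the examples "135ja", "135en", and "135zh". The function names `split_id` and `group_by_number` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed ID shape: digits followed by a language suffix, e.g. "135en".
ID_PATTERN = re.compile(r"(\d+)(ja|en|zh)")

def split_id(tweet_id):
    """Split an ID such as '135en' into (135, 'en')."""
    match = ID_PATTERN.fullmatch(tweet_id)
    if match is None:
        raise ValueError(f"unexpected tweet ID: {tweet_id!r}")
    return int(match.group(1)), match.group(2)

def group_by_number(tweets):
    """Group (tweet_id, text) pairs so that translations share one entry."""
    groups = defaultdict(dict)
    for tweet_id, text in tweets:
        number, lang = split_id(tweet_id)
        groups[number][lang] = text
    return dict(groups)

tweets = [
    ("135ja", "風邪で鼻づまりがやばい。"),
    ("135en", "I have a cold, which makes my nose stuffy like crazy."),
    ("135zh", "感冒引起的鼻塞很烦人。"),
]
groups = group_by_number(tweets)
print(split_id("135en"))    # (135, 'en')
print(sorted(groups[135]))  # ['en', 'ja', 'zh']
```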

Symptom Labeling

Two annotators each attached Positive (p) or Negative (n) labels for the eight symptoms to every tweet. For more information, please check the annotation guideline [figshare].

Corpus Size

The Japanese, English, and Chinese corpora each consist of 2,560 tweet texts. Each corpus is divided into training data consisting of 1,920 tweet texts (75% of the whole corpus) and test data consisting of 640 tweet texts (25% of the whole corpus).

Table. Examples of pseudo-tweets with labels
ID	Tweet	Influenza	Diarrhea	Hayfever	Cough	Headache	Fever	Runnynose	Cold
135ja	風邪で鼻づまりがやばい。	n	n	n	n	n	n	p	p
135en	I have a cold, which makes my nose stuffy like crazy.	n	n	n	n	n	n	p	p
135zh	感冒引起的鼻塞很烦人。	n	n	n	n	n	n	p	p
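As an illustration of how rows in this layout could be read, the following Python sketch parses the tab-separated example table with the standard `csv` module. The file format of the distributed collection is not described here, so the tab-separated layout is an assumption based on the table above, and `read_rows` is a hypothetical helper.

```python
import csv
import io

# The example table above, rebuilt as tab-separated text (assumed layout).
HEADER = ["ID", "Tweet", "Influenza", "Diarrhea", "Hayfever", "Cough",
          "Headache", "Fever", "Runnynose", "Cold"]
ROWS = [
    ["135ja", "風邪で鼻づまりがやばい。"] + list("nnnnnnpp"),
    ["135en", "I have a cold, which makes my nose stuffy like crazy."]
    + list("nnnnnnpp"),
    ["135zh", "感冒引起的鼻塞很烦人。"] + list("nnnnnnpp"),
]
SAMPLE = "\n".join("\t".join(row) for row in [HEADER] + ROWS) + "\n"

def read_rows(handle):
    """Read tab-separated rows into dicts with boolean labels."""
    # QUOTE_NONE keeps quotation marks inside tweets as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        tweet_id = row.pop("ID")
        text = row.pop("Tweet")
        # Every remaining column is a symptom label: 'p' means positive.
        labels = {name: tag == "p" for name, tag in row.items()}
        rows.append({"id": tweet_id, "text": text, "labels": labels})
    return rows

rows = read_rows(io.StringIO(SAMPLE))
positives = [name for name, on in rows[1]["labels"].items() if on]
print(rows[1]["id"], positives)  # 135en ['Runnynose', 'Cold']
```

Using a tab as the delimiter means that commas inside a tweet, as in the English example, do not split the text column.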

The test collection and data are available from NII free of charge.

The NTCIR-13 MedWeb Test Collection can be downloaded from NII IDR:
http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html

The NTCIR-13 MedWeb Test Collection is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

Reference

The terms of use [PDF]
Task overview: Overview of the NTCIR-13 MedWeb Task [PDF]
NTCIR-13 MedWeb website

Contact us: ntc-secretariat
