Researchers can obtain the Task Data, including 701 articles selected from Mainichi
Shinbun articles published between 1998 and 2001, from NII.
The NTCIR-7 MuST Test Collection is intended as material for research on summarizing and visualizing trends (multimodal summarization for trend information) and on its component technologies, such as text summarization, information extraction, and information visualization. It includes not only the materials to be processed but also a corpus annotated with the intermediate results of compiling trends.
The untagged MuST Corpus is the source of information; it consists of 701 articles from a Japanese newspaper. The articles were collected on several topics concerning industries, economics, and natural disasters.
The Task Data, apart from the untagged MuST Corpus, consists of three parts: annotated documents, which correspond to the results of key sentence selection, named entity extraction, and temporal processing; data in table format obtained by manual information extraction of statistical values and their changes; and a set of information extraction queries (for the T2N subtask) pertaining to the document data, together with the lists of data correctly extracted for the specified queries.
|# of docs||
|Filename||Lang.||# of docs||Size||Lang.||
|(A)||Available from NII for research purposes|
|(B)||For non-participants, the Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research use from Nichigai Associates and Mainichi Shinbun|
Mainichi: Mainichi Newspaper Japanese Article Data 1998-2001
For information extraction experiments, the MuST Corpus alone (the 701 selected articles) is sufficient: the tagged documents can be used as the correct answers.
If you have the whole newspaper article data for 1998-2001, the document data can also be used for further experiments, such as:
・to analyze the proportion of extracted patterns across the whole collection
・to combine the documents with unsupervised methods
We use four years of Japanese newspaper articles, published from 1998 to 2001.
Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research use from Nichigai Associates and Mainichi Newspaper Co. The document records on the CD-ROMs must be converted into the NTCIR standard record format with the script mai2.pl (information is currently available in Japanese only).
・To obtain the script mai2ntc-r.pl：http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/mai2ntc-r.pl_txt
(1) MuST Corpus
701 articles selected from Mainichi Shinbun articles published between 1998 and 2001, with annotations that correspond to the results of key sentence selection, named entity extraction, and temporal processing. The annotation specification is given in the overview of MuST at the NTCIR-6 Workshop.
NTCIR-6 MuST Overview:
Untagged MuST Corpus: 701 articles selected from Mainichi Shinbun articles published between 1998 and 2001
The newspaper articles relate to 31 topics concerning industries, economics, and natural disasters. Examples of the topics are gasoline and oil prices, the beer industry, and earthquakes.
(2) Change Expressions
Data in table format obtained by manual information extraction of statistical values and their changes. 219 articles on 9 topics in the MuST Corpus were processed. The information extraction specification is given in the overview of MuST at the NTCIR-7 Workshop.
NTCIR-7 MuST Overview:
(3) T2N TestSet
A set of queries for an information extraction task (the T2N subtask) pertaining to the document data, together with the lists of data correctly extracted for the specified queries. The following is a sample query. The task specification and the format of the answer files are described in README_E and T2NSpec_E, which are included in this task data.
<task id="MuSTT2N0101" name_j="ガソリン" name_e="Gasoline">
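As a rough illustration of how such a query header might be read, the sketch below extracts the attributes of a `<task>` opening tag with Python's standard `html.parser`, which tolerates the missing closing tag in the sample. Only the opening tag shown above is taken from the task data; the class and function names are hypothetical, and the full query format should be checked against README_E and T2NSpec_E.

```python
from html.parser import HTMLParser


class TaskTagReader(HTMLParser):
    """Collects the attributes of every <task ...> start tag it sees.

    Hypothetical helper: not part of the NTCIR task data or its tools.
    """

    def __init__(self):
        super().__init__()
        self.tasks = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "task":
            self.tasks.append(dict(attrs))


def read_task_tags(text):
    """Return one attribute dictionary per <task> start tag in text."""
    reader = TaskTagReader()
    reader.feed(text)
    reader.close()
    return reader.tasks


# The sample opening tag quoted above; the rest of the query is not shown here.
sample = '<task id="MuSTT2N0101" name_j="ガソリン" name_e="Gasoline">'
tasks = read_task_tags(sample)
print(tasks[0]["id"], tasks[0]["name_e"])  # MuSTT2N0101 Gasoline
```

A lenient tag reader is used here only because the sample is a single unclosed tag; if the distributed query files are well-formed XML, a standard XML parser would be the more natural choice.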
The following is the procedure for obtaining this test collection. The test collection and data available from NII are free of charge.
NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
The release of new test collections and correction information will be announced through the NTCIR mailing list.
The test collection was constructed and used for NTCIR. It may be
used for research purposes only.
The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee. The providers of the document data kindly understood the importance of the test collection to research on information access technologies and granted use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain our good and reliable relationship with the data producers and providers, we researchers must behave as reliable partners: use the data only for research purposes under the user agreement, and handle it carefully so as not to violate any of their rights.