The NTCIR-7 MuST Test Collection is intended as material for research on
summarizing and visualizing trends (multi-modal summarization for trend
information) and on its component technologies, such as text
summarization, information extraction, and information visualization. It
includes not only the materials to be processed but also a corpus
annotated with the intermediate results of compiling trends.
The untagged MuST Corpus is the source of information and consists of
701 articles from a Japanese newspaper. The articles were collected on
several topics in industry, economics, and natural disasters.
The Task Data other than the untagged MuST Corpus consists of three parts:
annotated documents, which correspond to the results of key sentence
selection, named entity extraction, and temporal processing; data in table
format obtained by manual information extraction of statistical values and
their changes; and a set of information extraction queries (for the T2N
subtask) pertaining to the document data, together with the lists of data
correctly extracted for the specified queries.
Documents
Collection | Task | Genre | Filename | Lang. | Year | # of docs | Size
NTCIR-7 MuST | IE | Newspaper | Mainichi (B) | J | 1998-2001 | 419,759 | 535MB

Task Data
MuST Corpus: Filename | MuST Corpus: Lang. | MuST Corpus: # of docs | MuST Corpus: Size | IE: Lang. | IE: # | Relevance Judgement
Untagged MuST Corpus (A) | J | 701 | 2.9MB | J | 25 (8 topics) | N/A
(A) | Available from NII for research purposes
(B) | For non-participants, the Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research use from Nichigai Associates and Mainichi Shimbun
Mainichi: Mainichi Newspaper Japanese Article Data 1998-2001
Information extraction experiments can be carried out with the MuST Corpus
alone (the 701 selected articles), using the tagged documents as correct
answers.
If you have the whole newspaper article data for 1998-2001, the document data can also be used for further experiments, for example:
・to analyse the ratio of extracted patterns over the whole collection
・to combine the documents with unsupervised methods
We use four years of Japanese newspaper articles, published in 1998-2001.
The Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research use from Nichigai Associates and Mainichi Newspaper Co. The document records on the CD-ROMs must be converted into the NTCIR standard record format with the script mai2ntc-r.pl (information is currently available in Japanese only).
・To obtain the script mai2ntc-r.pl: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/mai2ntc-r.pl_txt
・README for mai2ntc-r.pl: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/READMEforMainichiScript-r.txt
(1) MuST Corpus
701 articles selected from the Mainichi Shimbun articles of 1998-2001, with
annotations that correspond to the results of key sentence selection, named
entity extraction, and temporal processing. The specification of the
annotation is given in the overview of MuST at the NTCIR-6 Workshop.
NTCIR-6 MuST Overview:
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/82.pdf
Untagged MuST Corpus: 701 articles selected from Mainichi Shimbun articles
published between 1998 and 2001
The newspaper articles relate to 31 topics in industry, economics, and natural disasters. Examples of the topics are gasoline and oil prices, the beer industry, and earthquakes.
(2) Change Expressions
Data in table format obtained by manual information extraction of statistical
values and their changes. 219 articles on 9 topics in the MuST Corpus
were processed. The specification of the information extraction is
given in the overview of MuST at the NTCIR-7 Workshop.
NTCIR-7 MuST Overview:
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings7/pdf/NTCIR7/C4/01-NTCIR7-OV-MuST-KatoT.pdf
(3) T2N TestSet
A set of queries for an information extraction task (the T2N subtask) pertaining to the document data, together with the lists of data correctly extracted for the specified queries. The specification of the task and the format of the answer files, README_E and T2NSpec_E, are included in this task data. The following is a sample of the queries.
<task id="MuSTT2N0101" name_j="ガソリン" name_e="Gasoline">
<stats>
<stat id="MuSTT2N010101">
<name>"レギュラーガソリンの全国平均店頭価格"</name>
<alias>"レギュラーガソリンの小売価格(1リットル当たり)"</alias>
<alias>"ガソリン価格"</alias>
<v_unit>"円"</v_unit>
</stat>
<stat id="MuSTT2N010102">
<name>"ドバイ原油価格"</name>
<alias>"原油価格"</alias>
<v_unit>"ドル"</v_unit>
</stat>
</stats>
<docs>
<doc>"000306043"</doc>
…
<doc>"001129088"</doc>
</docs>
</task>
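As an illustration only, the sample query above can be read with a few lines of Python. This is a minimal sketch, not part of the distributed task data: the function names `parse_task` and `strip_quotes` are hypothetical, and the code assumes that real query files follow the same structure as the sample and have already been decoded into a Unicode string. The elided document IDs in the sample ("…") are kept elided. The authoritative format is described in README_E and T2NSpec_E.

```python
import xml.etree.ElementTree as ET

# The sample query from this page, copied as-is. The "…" marks document IDs
# that are elided in the sample; they are not reconstructed here.
SAMPLE = """<task id="MuSTT2N0101" name_j="ガソリン" name_e="Gasoline">
<stats>
<stat id="MuSTT2N010101">
<name>"レギュラーガソリンの全国平均店頭価格"</name>
<alias>"レギュラーガソリンの小売価格(1リットル当たり)"</alias>
<alias>"ガソリン価格"</alias>
<v_unit>"円"</v_unit>
</stat>
<stat id="MuSTT2N010102">
<name>"ドバイ原油価格"</name>
<alias>"原油価格"</alias>
<v_unit>"ドル"</v_unit>
</stat>
</stats>
<docs>
<doc>"000306043"</doc>
…
<doc>"001129088"</doc>
</docs>
</task>"""


def strip_quotes(text):
    """Remove surrounding whitespace and the double quotes used in the sample."""
    return (text or "").strip().strip('"')


def parse_task(xml_text):
    """Parse one <task> element into a plain dictionary (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    stats = []
    for stat in root.iter("stat"):
        stats.append({
            "id": stat.get("id"),
            "name": strip_quotes(stat.findtext("name")),
            "aliases": [strip_quotes(a.text) for a in stat.findall("alias")],
            "unit": strip_quotes(stat.findtext("v_unit")),
        })
    # Only explicit <doc> elements are collected; elided entries are ignored.
    docs = [strip_quotes(d.text) for d in root.iter("doc")]
    return {
        "id": root.get("id"),
        "name_j": root.get("name_j"),
        "name_e": root.get("name_e"),
        "stats": stats,
        "docs": docs,
    }
```

Calling `parse_task(SAMPLE)` returns the task ID, the Japanese and English topic names, one entry per statistic (name, aliases, and value unit), and the list of document IDs that appear explicitly in the sample.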
The following is the procedure to obtain this test collection. The test collection and data available from NII are free of charge.
NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN
PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat
The release of new test collections and correction information will be announced through the NTCIR mailing list.
The test collection has been constructed and used for NTCIR. It may be
used for research purposes only.
The document collections included in the test collection were provided
to NII for use in NTCIR, either free of charge or for a fee. The providers
of the document data kindly understood the importance of the test
collection for research on information access technologies and granted
the use of the data for research purposes. Please remember that the
document data in the NTCIR test collection is copyrighted and has
commercial value as data. To maintain a good and reliable relationship
with the data producers and providers, we researchers must behave as
reliable partners: use the data only for research purposes under the
user agreement, and handle it carefully so as not to violate any rights
in it.