NTCIR Project
NTCIR-7 MuST
Research Purpose Use of Test Collection




NTCIR-7 MuST (Multimodal Summarization for Trend Information) Test Collection

[Overview of the Test Collection]


The NTCIR-7 MuST Test Collection is intended as material for research on summarizing and visualizing trends (multimodal summarization for trend information) and on its component technologies, such as text summarization, information extraction, and information visualization. It includes not only the materials to be processed but also a corpus annotated with the intermediate results of compiling trends.

The untagged MuST Corpus is the source of information and consists of 701 articles from a Japanese newspaper. The articles were collected on several topics in industry, economics, and natural disasters.

The Task Data other than the untagged MuST Corpus consists of three parts: annotated documents, which correspond to the results of key sentence selection, named entity extraction, and temporal processing; data in table format obtained by manual information extraction of statistical values and their changes; and a set of information extraction queries (for the T2N subtask) pertaining to the document data, together with the lists of data correctly extracted for those queries.

Researchers can obtain the Task Data, including the 701 articles selected from Mainichi Shinbun articles published between 1998 and 2001, from NII.

Collection: NTCIR-7 MuST

Task: IE, Summarization, Visualization

Documents
  Genre: Newspaper articles
  Filename: Mainichi (B)
  Lang.: J
  Year: 1998-2001
  # of docs: 419,759
  Size: 535MB

Task Data
  MuST Corpus
    Filename: Untagged MuST Corpus (A)
    Lang.: J
    # of docs: 701
    Size: 2.9MB
  IE
    Lang.: J
    #: 25 (8 topics)
  Relevance Judgement: N/A

(A) Available from NII for research purposes.
(B) For non-participants, the Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research use from Nichigai Associates and Mainichi Shinbun.


Documents, Topics and Questions

 Documents 

Mainichi: Mainichi Newspaper Japanese Article Data 1998-2001

For information extraction experiments, the MuST Corpus alone (the 701 selected articles) is sufficient: the tagged documents can be used as the correct answers.


If you have the whole newspaper article data for 1998-2001, the document data can also be used for various experiments, for example:
・to analyze the proportion of extracted patterns across the whole collection
・to combine the documents with unsupervised methods


We use four years of Japanese newspaper articles, published in 1998-2001.
The Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research use from Nichigai Associates and Mainichi Newspaper Co. The document records on the CD-ROMs must be converted into the NTCIR standard record format with the script mai2ntc-r.pl (information is currently available in Japanese only).
  ・To obtain the script mai2ntc-r.pl: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/mai2ntc-r.pl_txt
  ・README for mai2ntc-r.pl: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/READMEforMainichiScript-r.txt


 Task Data 


(1) MuST Corpus

701 articles selected from Mainichi Shinbun articles from 1998-2001, with annotations that correspond to the results of key sentence selection, named entity extraction, and temporal processing. The specification of the annotation is given in the overview of MuST at the NTCIR-6 Workshop.


NTCIR-6 MuST Overview:
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/82.pdf

Untagged MuST Corpus: 701 articles selected from Mainichi Shinbun articles published between 1998 and 2001

Newspaper articles related to 31 topics in industry, economics, and natural disasters. Examples of the topics are gasoline and oil prices, the beer industry, and earthquakes.

(2) Change Expressions

Data in table format obtained by manual information extraction of statistical values and their changes. 219 articles on 9 topics in the MuST Corpus were processed. The specification of the information extraction is given in the overview of MuST at the NTCIR-7 Workshop.


NTCIR-7 MuST Overview:
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings7/pdf/NTCIR7/C4/01-NTCIR7-OV-MuST-KatoT.pdf

(3) T2N TestSet

A set of queries for an information extraction task (the T2N subtask) pertaining to the document data, together with the lists of data correctly extracted for those queries. The following is a sample query. The specification of the task and the format of the answer files, README_E and T2NSpec_E, are included in this task data.

<task id="MuSTT2N0101" name_j="ガソリン" name_e="Gasoline">
    <stats>
        <stat id="MuSTT2N010101">
            <name>"レギュラーガソリンの全国平均店頭価格"</name>
            <alias>"レギュラーガソリンの小売価格(1リットル当たり)"</alias>
            <alias>"ガソリン価格"</alias>
            <v_unit>"円"</v_unit>
        </stat>
        <stat id="MuSTT2N010102">
            <name>"ドバイ原油価格"</name>
            <alias>"原油価格"</alias>
            <v_unit>"ドル"</v_unit>
        </stat>
    </stats>
    <docs>
        <doc>"000306043"</doc>
        …
        <doc>"001129088"</doc>
    </docs>
</task>
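
The sample query above can be read with a standard XML parser. The following Python sketch is only an illustration under stated assumptions: it assumes the query files are well-formed XML shaped like the sample, and that values are wrapped in literal double quotes as shown. The names `parse_task` and `unquote` are hypothetical helpers, not part of the task data, and the elided `<doc>` entries are simply left out. The authoritative description of the format is in README_E and T2NSpec_E.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample query; the elided <doc> entries are omitted.
SAMPLE = """<task id="MuSTT2N0101" name_j="ガソリン" name_e="Gasoline">
    <stats>
        <stat id="MuSTT2N010101">
            <name>"レギュラーガソリンの全国平均店頭価格"</name>
            <alias>"レギュラーガソリンの小売価格(1リットル当たり)"</alias>
            <alias>"ガソリン価格"</alias>
            <v_unit>"円"</v_unit>
        </stat>
        <stat id="MuSTT2N010102">
            <name>"ドバイ原油価格"</name>
            <alias>"原油価格"</alias>
            <v_unit>"ドル"</v_unit>
        </stat>
    </stats>
    <docs>
        <doc>"000306043"</doc>
        <doc>"001129088"</doc>
    </docs>
</task>"""


def unquote(text):
    """Strip surrounding whitespace and the literal double quotes seen in the sample."""
    return (text or "").strip().strip('"')


def parse_task(xml_text):
    """Convert one <task> element into a plain dict (hypothetical helper)."""
    task = ET.fromstring(xml_text)
    stats = []
    for stat in task.iter("stat"):
        stats.append({
            "id": stat.get("id"),
            "name": unquote(stat.findtext("name")),
            "aliases": [unquote(a.text) for a in stat.findall("alias")],
            "unit": unquote(stat.findtext("v_unit")),
        })
    return {
        "id": task.get("id"),
        "name_j": task.get("name_j"),
        "name_e": task.get("name_e"),
        "stats": stats,
        "docs": [unquote(d.text) for d in task.iter("doc")],
    }


task = parse_task(SAMPLE)
for stat in task["stats"]:
    # Each statistic has a canonical name, optional aliases, and a value unit.
    print(stat["id"], stat["name"], stat["aliases"], stat["unit"])
print(task["docs"])
```

Keeping the result as a plain dictionary makes it straightforward to match each statistic's name and aliases against article text, and to restrict the search to the document IDs listed for the task.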


To obtain the Test Collection

The following describes how to obtain this test collection. The test collection and data available from NII are free of charge.


Address

NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN
PHONE: +81-3-4212-2750

FAX: +81-3-4212-2751
Email: ntc-secretariat


Mailing List

Releases of new test collections and correction information are announced through the NTCIR mailing list.

Notice

The test collection was constructed and used for NTCIR. It may be used for research purposes only.
The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee. The providers of the document data kindly understood the importance of the test collection for research on information access technologies and granted use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a reliable and good relationship with the data producers and providers, it is important that we researchers behave as reliable partners, use the data only for research purposes under the user agreement, and take care not to violate any rights in the data.