NTCIR-7 MuST (Multimodal Summarization for Trend Information Test Collection)

The NTCIR-7 MuST Test Collection is intended as material for research on summarizing and visualizing trends (multi-modal summarization for trend information) and on its component technologies, such as text summarization, information extraction and information visualization. It includes not only the materials to be processed but also a corpus annotated with the intermediate results of compiling trends.

The untagged MuST Corpus is the source of information. It consists of 701 articles from a Japanese newspaper, collected on several topics in industries, economics and natural disasters.

Apart from the untagged MuST Corpus, the Task Data consists of three parts: annotated documents, which correspond to the results of key sentence selection, named entity extraction and temporal processing; data in a table format obtained by manual information extraction of statistical values and their changes; and a set of information extraction queries (for the T2N subtask) pertaining to the document data, together with the lists of data correctly extracted for those queries.

Researchers can obtain the Task Data, including the 701 articles selected from Mainichi Shinbun articles published between 1998 and 2001, from NII.

Collection: NTCIR-7 MuST
Task: IE, Summarization, Visualization

Documents
  Genre: Newspaper articles
  Filename: Mainichi (B)
  Lang.: (no value given)
  Year: 1998-2001
  # of docs: 419,759
  Size: 535MB

Task Data
  MuST Corpus
    Filename: Untagged MuST Corpus (A)
    Lang.: (no value given)
    # of docs: 701
    Size: 2.9MB
  Relevance Judgement
    Lang.
  Remaining Task Data cells, in source order: 25 (8 topics); N/A

(A)	Available from NII for research purposes.
(B)	For non-participants, the Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research use from Nichigai Associates and Mainichi Shinbun.

Mainichi: Mainichi Newspaper Japanese Article Data 1998-2001

Information extraction experiments can be run with the MuST Corpus alone (the 701 selected articles), using the tagged documents as correct answers.

If you have the whole newspaper article data for 1998-2001, the document data can be used for various experiments, such as:
・to analyse the ratio of extracted patterns over the whole collection
・to combine the documents with an unsupervised method

We use four years of Japanese newspaper articles, published from 1998 to 2001.
The Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research use from Nichigai Associates and Mainichi Newspaper Co. The document records on the CD-ROMs must be converted into the NTCIR standard record format with the script mai2ntc-r.pl (information is currently available in Japanese only).
　　・To obtain the script mai2ntc-r.pl: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/mai2ntc-r.pl_txt
　　・README [mai2ntc-r.pl]: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/READMEforMainichiScript-r.txt

(1) MuST Corpus

701 articles selected from Mainichi Shinbun articles from 1998-2001, with annotations that correspond to the results of key sentence selection, named entity extraction and temporal processing. The specification of the annotation is contained in the overview of MuST at the NTCIR-6 Workshop.

NTCIR-6 MuST Overview:
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/82.pdf

Untagged MuST Corpus: 701 articles selected from Mainichi Shinbun articles published between 1998 and 2001

Newspaper articles related to 31 topics in industries, economics and natural disasters. Examples of the topics are gasoline and oil prices, the beer industry and earthquakes.

(2) Change Expressions

Data in a table format obtained by manual information extraction of statistical values and their changes. 219 articles on 9 topics in the MuST Corpus were processed. The specification of the information extraction is contained in the overview of MuST at the NTCIR-7 Workshop.

NTCIR-7 MuST Overview:
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings7/pdf/NTCIR7/C4/01-NTCIR7-OV-MuST-KatoT.pdf

(3) T2N TestSet

A set of queries for an information extraction task (the T2N subtask) pertaining to the document data, together with the lists of data correctly extracted for those queries. The following is a sample of the queries. The specification of the task and the format of the answer files, README_E and T2NSpec_E, are included in this task data.

<task id="MuSTT2N0101" name_j="ガソリン" name_e="Gasoline">
    <stats>
        <stat id="MuSTT2N010101">
            <name>"レギュラーガソリンの全国平均店頭価格"</name>
            <alias>"レギュラーガソリンの小売価格（１リットル当たり）"</alias>
            <alias>"ガソリン価格"</alias>
            <v_unit>"円"</v_unit>
        </stat>
        <stat id="MuSTT2N010102">
            <name>"ドバイ原油価格"</name>
            <alias>"原油価格"</alias>
            <v_unit>"ドル"</v_unit>
        </stat>
    </stats>
    <docs>
        <doc>"000306043"</doc>
        …
        <doc>"001129088"</doc>
    </docs>
</task>
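
A query file in the format sampled above could be read along the following lines. This is only a minimal sketch using Python's standard xml.etree.ElementTree module: the function name parse_task and the dictionary layout are illustrative assumptions, not part of the distributed task data, and the authoritative description of the format remains README_E and T2NSpec_E. The sketch assumes that element values are wrapped in double quotes, as in the sample, and strips them.

```python
import xml.etree.ElementTree as ET

# The sample query from this page, including its elided document list.
SAMPLE = """<task id="MuSTT2N0101" name_j="ガソリン" name_e="Gasoline">
    <stats>
        <stat id="MuSTT2N010101">
            <name>"レギュラーガソリンの全国平均店頭価格"</name>
            <alias>"レギュラーガソリンの小売価格（１リットル当たり）"</alias>
            <alias>"ガソリン価格"</alias>
            <v_unit>"円"</v_unit>
        </stat>
        <stat id="MuSTT2N010102">
            <name>"ドバイ原油価格"</name>
            <alias>"原油価格"</alias>
            <v_unit>"ドル"</v_unit>
        </stat>
    </stats>
    <docs>
        <doc>"000306043"</doc>
        …
        <doc>"001129088"</doc>
    </docs>
</task>"""


def unquote(text):
    """Strip surrounding whitespace and the double quotes used in the sample."""
    return (text or "").strip().strip('"')


def parse_task(xml_text):
    """Hypothetical helper: turn one <task> element into a plain dict."""
    root = ET.fromstring(xml_text)
    stats = []
    for stat in root.iter("stat"):
        stats.append({
            "id": stat.get("id"),
            "name": unquote(stat.findtext("name")),
            "aliases": [unquote(a.text) for a in stat.findall("alias")],
            "unit": unquote(stat.findtext("v_unit")),
        })
    # Only <doc> elements are collected; the elision mark is ignored.
    docs = [unquote(d.text) for d in root.iter("doc")]
    return {
        "id": root.get("id"),
        "name_e": root.get("name_e"),
        "stats": stats,
        "docs": docs,
    }


task = parse_task(SAMPLE)
for stat in task["stats"]:
    print(stat["id"], stat["name"], stat["unit"], len(stat["aliases"]))
```

Each statistic carries one canonical name, any number of aliases and a unit, so a system answering a T2N query would presumably match any of these surface forms in the listed documents.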

The following is the procedure to obtain this test collection. The test collection and data available from NII are free of charge.

The application form for the test collection must be filled out and sent by e-mail to ntc-secretariat.
A User Agreement (memorandum on Permission to Use Test Collection) is required.

The user agreement form for each test collection that you would like to obtain must be filled out and sent by postal mail or courier to the address below.

Please download and make two copies of the form in double-sided print.

Signatures are needed on both agreement forms.

After both forms have been countersigned by NII, one copy will be sent back to you and the other will be kept by NII.

Documents to submit

Application Form [txt]
User agreement form (sent by email)

Reference

Task Overview of NTCIR-7 MuST
Overview of MuST at the NTCIR-7 Workshop – Challenges to Multi-modal Summarization for Trend Information –

Address

NTCIR Project (Rm.1309)

National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN
PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat

Mailing List

The release of new test collections and correction information will be announced through the NTCIR mailing list.

Notice

The test collection has been constructed and used for NTCIR. It may be used for research purposes only.
The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee. The providers of the document data kindly understand the importance of the test collection to research on information access technologies and have therefore granted use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good and reliable relationship with the data producers and providers, we researchers must behave as reliable partners, use the data only for research purposes under the user agreement, and take care not to violate any rights in the data.

Updated on : 2009-04-14
ntc-admin
