National Institute of Informatics

The 4th NTCIR Workshop

NTCIR-4 is over. For information on data see the NTCIR data page.


NTCIR-4 Test Collections: Documents

The following document collections are used for the 4th NTCIR Workshop. They are available to participating research groups free of charge for task participation and system evaluation within the 4th NTCIR Workshop. To obtain the data, signed user agreement forms must be submitted to the NTCIR Project Office at the NII. Please note that the Xinhua collection in the NTCIR-4 CLIR test collection requires a different procedure and a separate user agreement form.

| Task | Test collection | Genre | Language | File name | Number of documents (size) | Year |
|---|---|---|---|---|---|---|
| CLIR | NTCIR-4 CLIR | news articles | Chinese (traditional) | CIRB020 | 249,508 | 1998-1999 |
| | | | Japanese | Mainichi | 220,078 | |
| | | | Korean | Hankookilbo<*> | 149,498 | |
| | | | English | EIRB010 | 10,204 | |
| | | | | Mainichi Daily News | 12,723 | |
| | | | | Korea Times<*> | 21,377 | |
| | | | | Hong Kong Standard<*> | ca. 60K | |
| | | | | Xinhua<*> | 208,168 | |
| PATENT | NTCIR-4 PATENT | patent full | Japanese | Publication of unexamined patent applications<*> | ca. 3500K | 1993-2002 |
| | | patent abstract | English | Patent Abstracts of Japan (PAJ)<*> | ca. 3500K | |
| QAC | NTCIR-4 QA | news articles | Japanese | Mainichi | 220,078 | 1998-1999 |
| | | | | Yomiuri<*> | ca. 260K | |
| TSC | NTCIR-4 SUMM | news articles | Japanese | Mainichi | 220,078 | 1998-1999 |
| | | | | Yomiuri<*> | ca. 260K | |
| WEB | NTCIR-4 WEB | Web | multiple languages <4> | NW100G-01 | 11,038,720 | crawled in 2001 |

1: For details of the task data (topics and relevance judgments, questions and answers, summaries, etc.), please consult the CFP of each task.

2: New data (the addition to the NTCIR-3 test collections) is indicated by <*>.

3: Please note that the document collections shall be used only for accomplishing the tasks set out in the NTCIR Workshop 4 and for research related to those tasks. The documents cannot be used for "commercial purpose" or "information purpose".

4: Mostly Japanese or English (some in other languages).

Last updated: 2003-03-01