The NTCIR-5 CLIR test collection can be used for experiments in cross-lingual information retrieval between Chinese (Traditional), Japanese, Korean and English (CJKE), such as multilingual IR (MLIR), bilingual IR (BLIR), and single language IR (SLIR).

* Note that the only document sets usable for MLIR are CJKE.
Chinese documents use Traditional Chinese (Big5).

The documents are the full text of news articles in the CJKE languages, published in Asia from 2000 to 2001. The test collection also includes 50 search topics in CJKE and relevance judgment information.

- The entire collection is provided by NII.

Documents, Topics and Questions


(1) List of document sets

The document sets included in the NTCIR-5 CLIR test collection are as follows.

Doc language  Files                                         No. of docs
                                                                 2000     2001    Total
Chinese       CIRB040r (revised) (581.7 MB)
                United Daily News (udn)                       244,038  222,526  466,564
                United Express (ude)                           40,445   51,851   92,296
                Ming Hseng News (mhn)                          84,437   85,302  169,739
                Economic Daily News (edn)                      79,380   93,467  172,847
                Total                                         448,300  453,146  901,446
Japanese        Mainichi Newspaper 2000-2001 (118.8 MB) (B)    99,207  100,474  199,681
                Yomiuri Newspaper 2000-2001 (343.3 MB) (B)    306,709  352,010  658,719
                Total                                         405,916  452,484  858,400
Korean          Hankookilbo 2000-2001 (52.1 MB) (A)            40,306   44,944   85,250
                Chosunilbo 2000-2001 (88.7 MB) (A)             67,711   67,413  135,124
                Total                                         108,017  112,357  220,374
English         Mainichi Daily News 2000-2001 (9.9 MB) (A)      6,608    5,547   12,155
                Korea Times 2000-2001 (25.3 MB) (A)            16,461   14,069   30,530
                Daily Yomiuri 2000-2001 (22.9 MB) (B)           9,081    8,660   17,741
                Xinhua 2000-2001 (from LDC) (B)               107,956   90,668  198,624
                Total                                         140,106  118,944  259,050

(B) -- For NTCIR Workshop participants, the data is available from NII.
    -- For non-participants, the data is available from third parties (newspaper companies, LDC, etc.).

Yomiuri Newspaper Japanese Article Data is available for research use from Nihon Database Kaihatsu Co. Ltd. (information is currently available in Japanese only). The document records in the data must be converted into the NTCIR standard record format with the script yomi2ntcir.pl.

Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research use from Nichigai Associates and Mainichi Newspaper Co. (information is currently available in Japanese only). The document records on the CD-ROMs must be converted into the NTCIR standard record format with the script mai2.pl.

Xinhua News Service English Articles File (2000-2001) is available for research purposes from the Linguistic Data Consortium (LDC). If you want to obtain the data for Opinion Analysis research, please visit the following page for more information.
How to obtain the Newspaper Article data
(2) Tags used in document records

Every news article follows the same format, marked up with the following set of tags.
Mandatory tags
<DOC> </DOC> The tag for each document
<DOCNO> </DOCNO> Document identifier
<LANG> </LANG> Language code: CH, EN, JA, KR
<HEADLINE> </HEADLINE> Title of this news article
<DATE> </DATE> Issue date
<TEXT> </TEXT> Text of news article
Optional tags
<P> </P> Paragraph marker
<SECTION> </SECTION> Section identifier in original newspapers
<AE> </AE> Whether the article contains figures
<WORDS> </WORDS> Number of words in 2 bytes (for Mainichi Newspaper)
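
As a rough illustration of how records in this tag format might be read, the following Python sketch extracts the mandatory fields and the optional <P> paragraphs. It is not part of the distributed collection. The regex-based approach, the function names, and the sample record are all assumptions made for illustration; real files should first be decoded with the appropriate encoding (for example Big5 for the Chinese documents).

```python
import re

# Mandatory tags from the list above, excluding the <DOC> wrapper itself.
MANDATORY = ("DOCNO", "LANG", "HEADLINE", "DATE", "TEXT")

# Assumption: records are SGML-like and may not be well-formed XML,
# so they are matched with regular expressions instead of an XML parser.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def field(record, tag):
    """Return the stripped content of <tag>...</tag>, or None if absent."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", record, re.DOTALL)
    return m.group(1).strip() if m else None


def parse_docs(raw):
    """Yield one dict per <DOC> record found in the decoded text `raw`."""
    for m in DOC_RE.finditer(raw):
        record = m.group(1)
        doc = {tag: field(record, tag) for tag in MANDATORY}
        # <P> is optional: fall back to the whole TEXT as one paragraph.
        text = doc["TEXT"] or ""
        paras = re.findall(r"<P>(.*?)</P>", text, re.DOTALL)
        doc["PARAGRAPHS"] = [p.strip() for p in paras] if paras else [text]
        yield doc


# Hypothetical record, invented for illustration only.
sample = """<DOC>
<DOCNO>example_0001</DOCNO>
<LANG>EN</LANG>
<HEADLINE>Example headline</HEADLINE>
<DATE>2000-01-01</DATE>
<TEXT>
<P>First paragraph.</P>
<P>Second paragraph.</P>
</TEXT>
</DOC>"""

docs = list(parse_docs(sample))
print(docs[0]["DOCNO"], docs[0]["LANG"], len(docs[0]["PARAGRAPHS"]))
```
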

(1) List of topic files

The NTCIR-5 CLIR test collection includes a set of topics for Chinese, Japanese, Korean and English. The file names are as follows:

Language File Name
(1) Chinese NTCIR5CLIRTopicCH.txt
(2) Japanese NTCIR5CLIRTopicJA.txt
(3) Korean NTCIR5CLIRTopicKR.txt
(4) English NTCIR5CLIRTopicEN.txt

(2) A sample of topics

<TITLE>NBA labor dispute</TITLE>
<DESC>To retrieve the labor dispute between the two parties of the US National Basketball Association at the end of 1998 and the agreement that they reached. </DESC>
<REL>The content of the related documents should include the causes of NBA labor dispute, the relations between the players and the management, main controversial issues of both sides, compromises after negotiation and content of the new agreement, etc. The document will be regarded as irrelevant if it only touched upon the influences of closing the court on each game of the season.</REL>
<CONC>NBA (National Basketball Association), union, team, league, labor dispute, league and union, negotiation, to sign an agreement, salary, lockout, Stern, Bird Regulation.</CONC>

(3) Tags used in topics

The tags used in topics are as follows.

<TOPIC> </TOPIC> The tag for each topic
<NUM> </NUM> Topic identifier
<SLANG> </SLANG> Source language code: CH, EN, JA, KR
<TLANG> </TLANG> Target language code: CH, EN, JA, KR
<TITLE> </TITLE> A concise representation of the information request, composed of a noun or noun phrase.
<DESC> </DESC> A short description of the topic: a brief statement of the information need in one or two sentences.
<NARR> </NARR> A much longer description of the topic. The <NARR> may have three parts:
(1) <BACK>...</BACK>: background information about the topic.
(2) <REL>...</REL>: further interpretation of the request and proper nouns, a list of relevant or irrelevant items, specific requirements or limitations for relevant documents, and so on.
(3) <TERM>...</TERM>: definitions or explanations of proper nouns, scientific terms, and so on.
<CONC> </CONC> Keywords relevant to the whole topic.
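
The topic tags can be read in much the same way. The sketch below is only an illustration under assumptions: the function names are hypothetical, the <NUM>, <SLANG> and <TLANG> values in the sample are invented, and the <REL> part is assumed to sit inside <NARR> as the tag list describes.

```python
import re

TOPIC_FIELDS = ("NUM", "SLANG", "TLANG", "TITLE", "DESC", "CONC")
NARR_PARTS = ("BACK", "REL", "TERM")  # optional parts of <NARR>


def get_tag(text, tag):
    """Return the stripped content of <tag>...</tag>, or None if absent."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return m.group(1).strip() if m else None


def parse_topic(block):
    """Parse one <TOPIC> block into a dict; NARR becomes a nested dict."""
    topic = {tag: get_tag(block, tag) for tag in TOPIC_FIELDS}
    narr = get_tag(block, "NARR") or ""
    topic["NARR"] = {tag: get_tag(narr, tag) for tag in NARR_PARTS}
    return topic


# Hypothetical topic: the title comes from the sample topic in this
# document; the identifier and language codes are invented.
sample_topic = """<TOPIC>
<NUM>001</NUM>
<SLANG>EN</SLANG>
<TLANG>JA</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>To retrieve the labor dispute between the two parties.</DESC>
<NARR>
<REL>Related documents should include the causes of the dispute.</REL>
</NARR>
<CONC>NBA, union, lockout</CONC>
</TOPIC>"""

topic = parse_topic(sample_topic)
print(topic["NUM"], topic["SLANG"], "->", topic["TLANG"], topic["TITLE"])
```
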

Relevance Judgments

(1) List of judgment files

Type of run                Topics            Docs                 Relevance judgment file
Single Language IR (SLIR)  Chinese (C)       Chinese (C)          CLIR5FormalRunRJ-C-Rigid.txt
                           Japanese (J)      Japanese (J)         CLIR5FormalRunRJ-J-Rigid.txt
                           Korean (K)        Korean (K)           CLIR5FormalRunRJ-K-Rigid.txt
                           English (E)       English (E)          CLIR5FormalRunRJ-E-Rigid.txt
Bilingual IR (BLIR)        C or J or K       E                    CLIR5FormalRunRJ-E-Rigid.txt
                           C or K            J                    CLIR5FormalRunRJ-J-Rigid.txt
                           K or J            C                    CLIR5FormalRunRJ-C-Rigid.txt
                           C or J            K                    CLIR5FormalRunRJ-K-Rigid.txt
Multilingual IR (MLIR)     C or J or K or E  C and J and K and E  CLIR5FormalRunRJ-CJKE-Rigid.txt

(2) Two kinds of relevance judgment file

In this test collection, four categories of relevance are used for the judgments: "Highly Relevant," "Relevant," "Partially Relevant," and "Irrelevant." However, since the trec_eval scoring program we use assumes binary relevance, we have to set thresholds over the four categories. For this reason, we provide two kinds of relevance judgment file:
(a) "Rigid" relevance - "Highly Relevant" and "Relevant" are regarded as relevant.
(b) "Relaxed" relevance - "Highly Relevant", "Relevant" and "Partially Relevant" are regarded as relevant.
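
The two thresholds can be expressed as a small mapping from the four categories to binary relevance. This Python sketch only illustrates the rule stated in (a) and (b). The function and constant names are hypothetical, and the distributed judgment files already contain the binarized values.

```python
# Categories counted as relevant under each scheme, per (a) and (b) above.
RIGID = {"Highly Relevant", "Relevant"}
RELAXED = RIGID | {"Partially Relevant"}

SCHEMES = {"rigid": RIGID, "relaxed": RELAXED}


def to_binary(grade, scheme="rigid"):
    """Map a four-level relevance category to 1 (relevant) or 0."""
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme!r}")
    return 1 if grade in SCHEMES[scheme] else 0


for grade in ("Highly Relevant", "Relevant", "Partially Relevant", "Irrelevant"):
    print(grade, to_binary(grade, "rigid"), to_binary(grade, "relaxed"))
```

Only "Partially Relevant" differs between the two schemes: it counts as irrelevant under rigid relevance and as relevant under relaxed relevance.
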

(3) Format of relevance judgment files

The format of a relevance judgment file is as follows:

(topic-ID) (dummy) (document-ID) (relevance judgment "0" or "1") (comment)
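
A minimal Python sketch of reading lines in this format follows. It assumes the fields are separated by whitespace and that the trailing comment may be absent; neither detail is stated above. The function name and sample lines are invented for illustration.

```python
from collections import defaultdict


def parse_judgments(lines):
    """Return {topic_id: set of doc_ids judged relevant ("1")}."""
    relevant = defaultdict(set)
    for line in lines:
        # Assumption: whitespace-separated fields; anything after the
        # fourth field is treated as the free-text comment.
        parts = line.split(None, 4)
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        topic_id, _dummy, doc_id, judgment = parts[:4]
        if judgment == "1":
            relevant[topic_id].add(doc_id)
    return relevant


# Hypothetical lines, invented for illustration only.
sample_lines = [
    "001 0 doc_a 1",
    "001 0 doc_b 0 borderline case",
    "002 0 doc_c 1",
    "",
]

judgments = parse_judgments(sample_lines)
print({topic: sorted(docs) for topic, docs in judgments.items()})
```
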

The following are the procedures for obtaining the test collection. The test collection and data available from NII are free of charge.

Task@data ( without document data )

Document data

Please read "The terms of use" carefully before you apply for the data.

Documents to submit

Application Form [txt]
User agreement Form [PDF]


The terms of use [PDF]
Task Overview of NTCIR 5 CLQA
Overview of CLIR Task at the Fifth NTCIR Workshop


