NTCIR Project
Research Purpose Use of Test Collection


NTCIR-4 CLIR (IR Test Collection)

Test Collection

The NTCIR-4 CLIR test collection can be used for experiments in cross-lingual information retrieval among Chinese (traditional), Japanese, Korean and English (CJKE).

*Note that the document sets provided for MLIR are CJE and CJKE only.

The documents are full-text news articles in the CJKE languages, published in Asia from 1998 to 1999. The test collection also includes 60 search topics in CJKE and relevance judgment information.

*The entire collection is provided by NII.

Documents, Topics and Questions


(1) List of document sets

The document sets included in the NTCIR-4 CLIR test collection are as follows.

Language | Collection | No. of Docs | Note
Chinese 1998-99 | CIRB020 (United Daily News) | 249,508 | Used in NTCIR-3
Chinese 1998-99 | CIRB011 (China Times, China Times Express, Commercial Times, China Daily News, and Central Daily News) | 132,173 | Used in NTCIR-3
Chinese 1998-99 | Total | 381,681 |
Japanese 1998-99 | Yomiuri | 375,980 | New
Japanese 1998-99 | Mainichi | 220,078 | Used in NTCIR-3
Japanese 1998-99 | Total | 596,058 |
Korean 1998-99 | Hankookilbo | 149,921 | New
Korean 1998-99 | Chosunilbo | 104,517 | New
Korean 1998-99 | Total | 254,438 |
English 1998-99 | EIRB010 Taiwan News | 7,489 | Used in NTCIR-3
English 1998-99 | China Times English News (Taiwan) | 2,715 | Used in NTCIR-3
English 1998-99 | Mainichi Daily News (Japan) | 12,723 | Used in NTCIR-3
English 1998-99 | Korea Times | 19,599 | New
English 1998-99 | Xinhua (AQUAINT) | 208,168 | New
English 1998-99 | Hong Kong Standard | 96,856 | New
English 1998-99 | Total | 347,550 |
-- Data is available from NII.
-- Data is available to NTCIR Workshop participants only.
-- For NTCIR Workshop participants, data is available from NII.
For non-participants, data is available from a third party (newspaper company, LDC, etc.).

(2) Tags used in document records

Each news article is formatted consistently using a fixed set of tags.

Mandatory tags
<DOC> </DOC> The tag for each document
<DOCNO> </DOCNO> Document identifier
<LANG> </LANG> Language code: CH, EN, JA, KR
<HEADLINE> </HEADLINE> Title of this news article
<DATE> </DATE> Issue date
<TEXT> </TEXT> Text of news article
Optional tags
<P> </P> Paragraph marker
<SECTION> </SECTION> Section identifier in original newspapers
<AE> </AE> Whether the article contains figures
<WORDS> </WORDS> Number of words in 2 bytes (for Mainichi Newspaper)
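
As an illustration of how the mandatory tags above might be read, the following is a minimal Python sketch. It assumes the file has already been decoded to a string and that each tag appears at most once per record; the function name, the sample DOCNO and the date format are hypothetical and are not taken from the collection.

```python
import re

# Tag names come from the mandatory tag list; everything else here
# (function name, sample record, date format) is illustrative only.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
FIELDS = ("DOCNO", "LANG", "HEADLINE", "DATE", "TEXT")


def parse_documents(raw):
    """Yield one dict per <DOC> record, keyed by mandatory tag name."""
    for match in DOC_RE.finditer(raw):
        body = match.group(1)
        record = {}
        for tag in FIELDS:
            field = re.search(rf"<{tag}>(.*?)</{tag}>", body, re.DOTALL)
            value = field.group(1) if field else ""
            # Drop the optional paragraph markers and collapse whitespace.
            value = re.sub(r"</?P>", " ", value)
            record[tag] = " ".join(value.split())
        yield record


# Hypothetical record, written only to exercise the parser.
sample = """<DOC>
<DOCNO>example-0001</DOCNO>
<LANG>EN</LANG>
<HEADLINE>Example headline</HEADLINE>
<DATE>1998-12-01</DATE>
<TEXT>
<P>First paragraph.</P>
<P>Second paragraph.</P>
</TEXT>
</DOC>"""

docs = list(parse_documents(sample))
print(docs[0]["DOCNO"], "|", docs[0]["TEXT"])
```

A regular expression is used here rather than an XML parser because the records are not guaranteed to be well-formed XML; that choice is an assumption of this sketch, not a statement about the actual files.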

(1) List of topic files

The NTCIR-4 CLIR test collection includes a set of topics for Chinese, Japanese, Korean and English. The file names are as follows:

Language File Name
(1) Chinese NTCIR4CLIRFormalRunTopic-CH.txt
(2) Japanese NTCIR4CLIRFormalRunTopic-JA_mod20031203.txt
(3) Korean NTCIR4CLIRFormalRunTopic-KR.txt
(4) English NTCIR4CLIRFormalRunTopic-RN.txt

(2) A sample of topics

<TITLE>NBA labor dispute</TITLE>
<DESC>To retrieve the labor dispute between the two parties of the US National Basketball Association at the end of 1998 and the agreement that they reached. </DESC>
<REL>The content of the related documents should include the causes of NBA labor dispute, the relations between the players and the management, main controversial issues of both sides, compromises after negotiation and content of the new agreement, etc. The document will be regarded as irrelevant if it only touched upon the influences of closing the court on each game of the season.</REL>
<CONC>NBA (National Basketball Association), union, team, league, labor dispute, league and union, negotiation, to sign an agreement, salary, lockout, Stern, Bird Regulation.</CONC>

(3) Tags used in topics

The tags used in topics are as follows.

<TOPIC> </TOPIC> The tag for each topic
<NUM> </NUM> Topic identifier
<SLANG> </SLANG> Source language code: CH, EN, JA, KR
<TLANG> </TLANG> Target language code: CH, EN, JA, KR
<TITLE> </TITLE> A concise representation of the information request, consisting of a noun or noun phrase.
<DESC> </DESC> A brief description of the information need, consisting of one or two sentences.
<NARR> </NARR> A much longer description of the topic. The <NARR> may have three parts:
(1) <BACK>...</BACK>: background information about the topic.
(2) <REL>...</REL>: further interpretation of the request and proper nouns, the list of relevant or irrelevant items, the specific requirements or limitations of relevant documents, and so on.
(3) <TERM>...</TERM>: definitions or explanations of proper nouns, scientific terms and so on.
<CONC> </CONC> Keywords relevant to the whole topic.
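
The tag structure above, including the optional parts nested inside <NARR>, could be read along the following lines. This is a sketch only: the helper names are hypothetical, the sample topic is abbreviated and invented for illustration (its NUM, SLANG and TLANG values are placeholders), and splitting <CONC> on commas is an assumption based on the sample topic shown earlier.

```python
import re


def tag_text(block, tag):
    """Return the whitespace-normalised content of <tag>...</tag>, or None."""
    found = re.search(rf"<{tag}>(.*?)</{tag}>", block, re.DOTALL)
    return " ".join(found.group(1).split()) if found else None


def parse_topic(block):
    """Turn one <TOPIC> block into a dict; <NARR> sub-parts are nested."""
    topic = {tag: tag_text(block, tag)
             for tag in ("NUM", "SLANG", "TLANG", "TITLE", "DESC", "CONC")}
    narr = re.search(r"<NARR>(.*?)</NARR>", block, re.DOTALL)
    narr_body = narr.group(1) if narr else ""
    # Any of the three parts may be absent; its value is then None.
    topic["NARR"] = {tag: tag_text(narr_body, tag)
                     for tag in ("BACK", "REL", "TERM")}
    return topic


# Abbreviated, hypothetical topic used only to exercise the parser.
sample = """<TOPIC>
<NUM>000</NUM>
<SLANG>EN</SLANG>
<TLANG>JA</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>A shortened description for illustration.</DESC>
<NARR>
<REL>A shortened relevance statement for illustration.</REL>
</NARR>
<CONC>NBA, union, lockout</CONC>
</TOPIC>"""

topic = parse_topic(sample)
# Assumption: concept keywords are comma-separated, as in the sample topic.
concepts = [c.strip() for c in topic["CONC"].split(",")]
print(topic["TITLE"], concepts)
```

Keeping the <NARR> parts in a nested dict makes it easy to build runs that use only selected fields, for example TITLE only or TITLE plus DESC.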

Relevance Judgments

(1) List of judgment files

Type of run | Topics | Docs | Relevance Judgments File
Single Language IR (SLIR) | Chinese (C) | Chinese (C) | CLIR4FormalRunRJ-C-Rigid.txt
Single Language IR (SLIR) | Japanese (J) | Japanese (J) | CLIR4FormalRunRJ-J-Rigid.txt
Single Language IR (SLIR) | Korean (K) | Korean (K) | CLIR4FormalRunRJ-K-Rigid.txt
Single Language IR (SLIR) | English (E) | English (E) | CLIR4FormalRunRJ-E-Rigid.txt
Bilingual IR (BLIR) | C or J or K | E | CLIR4FormalRunRJ-E-Rigid.txt
Bilingual IR (BLIR) | C or K | J | CLIR4FormalRunRJ-J-Rigid.txt
Bilingual IR (BLIR) | K or J | C | CLIR4FormalRunRJ-C-Rigid.txt
Bilingual IR (BLIR) | C or J | K | CLIR4FormalRunRJ-K-Rigid.txt
Pivot Bilingual IR (PLIR) | C or J or K | E | CLIR4FormalRunRJ-E-Rigid.txt
Pivot Bilingual IR (PLIR) | C or K or E | J | CLIR4FormalRunRJ-J-Rigid.txt
Pivot Bilingual IR (PLIR) | J or K or E | C | CLIR4FormalRunRJ-C-Rigid.txt
Pivot Bilingual IR (PLIR) | C or J or E | K | CLIR4FormalRunRJ-K-Rigid.txt
Multilingual IR (MLIR) | C or J or K or E | C and J and K and E | CLIR4FormalRunRJ-CJKE-Rigid.txt
Multilingual IR (MLIR) | C or J or K or E | C and J and E | CLIR4FormalRunRJ-CJE-Rigid.txt

(2) Two kinds of relevance judgment file

In this test collection, four categories of relevance are used for the judgment, i.e., "Highly Relevant," "Relevant," "Partially Relevant," and "Irrelevant." However, since the trec_eval scoring program we use adopts binary relevance, we have to decide on thresholds for the four categories. For this reason, we provide two kinds of relevance judgment file:
(a) "Rigid" relevance - "Highly Relevant" and "Relevant" are regarded as relevant.
(b) "Relaxed" relevance - "Highly Relevant", "Relevant" and "Partially Relevant" are regarded as relevant.
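
The two thresholds can be expressed as a small mapping from the four categories to binary values. The sketch below uses the category names as labels purely for illustration; how the grades are actually encoded during assessment is not stated here, and the function name is hypothetical.

```python
# The four category names are from the text above; using them directly as
# labels, and the function name, are assumptions made for this sketch.
RIGID = {"Highly Relevant", "Relevant"}
RELAXED = RIGID | {"Partially Relevant"}


def to_binary(grade, relevant_grades):
    """Map a four-level grade to the 0/1 value used for binary scoring."""
    return 1 if grade in relevant_grades else 0


grades = ["Highly Relevant", "Relevant", "Partially Relevant", "Irrelevant"]
rigid = [to_binary(g, RIGID) for g in grades]
relaxed = [to_binary(g, RELAXED) for g in grades]
print(rigid)    # [1, 1, 0, 0]
print(relaxed)  # [1, 1, 1, 0]
```

The only difference between the two files is therefore how "Partially Relevant" documents are counted.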

(3) Format of relevance judgment file

The format of a relevance judgment file is as follows:

(topic-ID) (dummy) (document-ID) (relevance judgment "0" or "1") (comment)
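
A line in this format might be read as sketched below. The sketch assumes that fields are separated by whitespace and that the trailing comment is optional and may itself contain spaces; neither point is confirmed by the format description above, and the function names and sample lines are hypothetical.

```python
from collections import defaultdict


def parse_judgment_line(line):
    """Split one line into (topic_id, doc_id, relevance, comment)."""
    # At most five fields, so a comment containing spaces stays in one piece.
    parts = line.split(None, 4)
    if len(parts) < 4:
        raise ValueError(f"too few fields: {line!r}")
    topic_id, _dummy, doc_id, relevance = parts[:4]
    comment = parts[4].strip() if len(parts) == 5 else ""
    return topic_id, doc_id, int(relevance), comment


def relevant_by_topic(lines):
    """Collect the IDs of documents judged relevant ("1") for each topic."""
    relevant = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        topic_id, doc_id, relevance, _comment = parse_judgment_line(line)
        if relevance == 1:
            relevant[topic_id].add(doc_id)
    return relevant


# Hypothetical lines, written only to exercise the parser.
lines = [
    "001 0 doc-a 1 example comment",
    "001 0 doc-b 0",
    "002 0 doc-c 1",
]
rel = relevant_by_topic(lines)
print(sorted(rel["001"]), sorted(rel["002"]))
```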

If a topic has more than 1000 relevant documents, the upper limit of the average precision computed by the trec_eval scoring program is less than 1.0. For example, when the number of relevant documents is 1072, the upper limit seems to be 1000/1072 (= 0.9328).
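
The arithmetic behind this limit can be checked with a short sketch. It assumes that only the top 1000 retrieved documents per topic are scored (the cutoff is inferred from the example above, not taken from trec_eval documentation) and uses the standard non-interpolated definition of average precision; the function names are hypothetical.

```python
def average_precision(ranked_relevance, num_relevant):
    """Non-interpolated average precision over a ranked list of booleans."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / num_relevant


def ap_upper_bound(num_relevant, cutoff=1000):
    """Best achievable AP when at most `cutoff` documents are scored."""
    return min(num_relevant, cutoff) / num_relevant


# Ideal run: every one of the 1000 scored documents is relevant,
# but the topic has 1072 relevant documents in total.
best = average_precision([True] * 1000, 1072)
print(round(best, 4))                  # 0.9328
print(round(ap_upper_bound(1072), 4))  # 0.9328
```

Even a perfect ranking cannot recover the 72 relevant documents that fall beyond the cutoff, so each contributes zero to the sum and the score tops out at 1000/1072.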

The test collection was constructed for and used in NTCIR. It may be used for research purposes only.
The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee. The providers of the document data kindly understood the importance of the test collection for research on information access technologies and granted use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good and reliable relationship with the data producers and providers, we researchers must act as reliable partners: use the data only for research purposes under the user agreement, and handle it carefully so as not to violate any of their rights.

The following are the procedures for obtaining the test collection. The test collection and data available from NII are free of charge.

Task data (without document data)

Document data

Please read "The terms of use" carefully before you apply for the data.

Documents to submit



NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN

PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat

Mailing List

Releases of new test collections and correction information will be announced through the NTCIR mailing list.