NTCIR Project
NTCIR-5 CLIR
Research Purpose Use of Test Collection

NTCIR-5 CLIR (IR Test Collection)

The NTCIR-5 CLIR test collection can be used for cross-lingual information retrieval experiments among Chinese (Traditional), Japanese, Korean, and English (CJKE), such as:

Multilingual CLIR (MLIR)
Bilingual CLIR (BLIR)
Single Language (Monolingual) IR (SLIR).

*Note that the document sets usable for MLIR are limited to CJKE.
Chinese documents use Traditional Chinese (Big5).

The documents are the full text of news articles in the CJKE languages, published in Asia from 2000 to 2001. The test collection also includes 50 search topics in CJKE and relevance judgment information.

*The entire collection is provided by NII.

(1) List of document sets

The document sets included in the NTCIR-5 CLIR test collection are as follows.

Doc language	Files	No. of docs (2000)	No. of docs (2001)	No. of docs (total)
Chinese	CIRB040r (revised) (581.7 MB) (A)			
	United Daily News (udn)	244,038	222,526	466,564
	United Express (ude)	40,445	51,851	92,296
	Ming Hseng News (mhn)	84,437	85,302	169,739
	Economic Daily News (edn)	79,380	93,467	172,847
	Total	448,300	453,146	901,446
Japanese	Mainichi Newspaper 2000-2001 (118.8 MB) (B)	99,207	100,474	199,681
	Yomiuri Newspaper 2000-2001 (343.3 MB) (B)	306,709	352,010	658,719
	Total	405,916	452,484	858,400
Korean	Hankookilbo 2000-2001 (52.1 MB) (A)	40,306	44,944	85,250
	Chosunilbo 2000-2001 (88.7 MB) (A)	67,711	67,413	135,124
	Total	108,017	112,357	220,374
English	Mainichi Daily News 2000-2001 (9.9 MB) (A)	6,608	5,547	12,155
	Korea Times 2000-2001 (25.3 MB) (A)	16,461	14,069	30,530
	Daily Yomiuri 2000-2001 (22.9 MB) (B)	9,081	8,660	17,741
	Xinhua 2000-2001 (from LDC) (B)	107,956	90,668	198,624
	Total	140,106	118,944	259,050

(A)

-- Data is available from NII.

(B)

-- For NTCIR Workshop participants, data is available from NII.
-- For non-participants, data is available from third parties (newspaper companies, LDC, etc.).
Yomiuri Newspaper Japanese Article Data is available for research purpose use from Nihon Database Kaihatsu Co. Ltd. (information is currently available in Japanese only). The document records in the data must be converted into the NTCIR standard record format with the script yomi2ntcir.pl.

README for Yomiuri Newspaper Japanese Article Data: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/READMEforYomiuri98+99.txt
Script yomi2ntcir.pl: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/yomi2ntc.pl
README for script yomi2ntcir.pl: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/READMEforYomiuriScript.txt

Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research purpose use from Nichigai Associates and Mainichi Newspaper Co. (information is currently available in Japanese only). The document records on the CD-ROMs must be converted into the NTCIR standard record format with the script mai2ntc-r.pl.

Script mai2ntc-r.pl: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/mai2ntc-r.pl_txt
README for script mai2ntc-r.pl: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/READMEforMainichiScript-r.txt

Xinhua News Service English article files (2000-2001) are available for research purposes from the Linguistic Data Consortium (LDC). If you intend to obtain the data for Opinion Analysis research, please visit the following page for more information:
http://research.nii.ac.jp/ntcir/permission/ntcir-6/ntcir6xinhua-research.html
How to obtain the newspaper article data:
http://research.nii.ac.jp/ntcir/permission/ntcir-4/mainichi-en.html#Xinhua


(2) Tags used in document records

The format of each news article is kept consistent by using a set of tags.

Mandatory tags
<DOC>	</DOC>	The tag for each document
<DOCNO>	</DOCNO>	Document identifier
<LANG>	</LANG>	Language code: CH, EN, JA, KR
<HEADLINE>	</HEADLINE>	Title of this news article
<DATE>	</DATE>	Issue date
<TEXT>	</TEXT>	Text of news article
Optional tags
<P>	</P>	Paragraph marker
<SECTION>	</SECTION>	Section identifier in original newspapers
<AE>	</AE>	Whether the article contains figures
<WORDS>	</WORDS>	Number of words in 2 bytes (for Mainichi Newspaper)
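
As an illustration of how records in this tag format might be read, here is a minimal Python sketch. It assumes each record is well-formed text wrapped in <DOC>...</DOC> and already decoded to a string (for example from Big5 for the Chinese files). The helper name parse_docs and the sample record, including its DOCNO and DATE values, are hypothetical and not taken from the collection.

```python
import re

# Hypothetical helper; only the tag names come from the table above.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
MANDATORY_TAGS = ("DOCNO", "LANG", "HEADLINE", "DATE", "TEXT")


def parse_docs(raw):
    """Yield one dict per <DOC> record, keyed by mandatory tag name."""
    for match in DOC_RE.finditer(raw):
        body = match.group(1)
        record = {}
        for tag in MANDATORY_TAGS:
            m = re.search(rf"<{tag}>(.*?)</{tag}>", body, re.DOTALL)
            record[tag] = m.group(1).strip() if m else None
        if record["TEXT"] is not None:
            # Remove the optional <P> markers but keep paragraph breaks.
            parts = re.split(r"</?P>", record["TEXT"])
            record["TEXT"] = "\n".join(p.strip() for p in parts if p.strip())
        yield record


# Made-up record for illustration only; real DOCNO and DATE formats may differ.
sample = """<DOC>
<DOCNO>example_doc_0001</DOCNO>
<LANG>EN</LANG>
<HEADLINE>Example headline</HEADLINE>
<DATE>2000-01-01</DATE>
<TEXT>
<P>First paragraph.</P>
<P>Second paragraph.</P>
</TEXT>
</DOC>"""

docs = list(parse_docs(sample))
print(docs[0]["DOCNO"], docs[0]["LANG"])
print(docs[0]["TEXT"])
```

A regular expression is used rather than an XML parser because the records are not guaranteed to be valid XML (for example, unescaped ampersands may occur in news text); that is an assumption about the data, not a documented property.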

(1) List of topic files

The NTCIR-5 CLIR test collection includes a set of topics for Chinese, Japanese, Korean, and English. The file names are as follows:

Language	File Name
(1) Chinese	NTCIR5CLIRTopicCH.txt
(2) Japanese	NTCIR5CLIRTopicJA.txt
(3) Korean	NTCIR5CLIRTopicKR.txt
(4) English	NTCIR5CLIRTopicEN.txt

(2) A sample of topics

<TOPIC>
<NUM>013</NUM>
<SLANG>CH</SLANG>
<TLANG>EN</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>To retrieve the labor dispute between the two parties of the US National Basketball Association at the end of 1998 and the agreement that they reached. </DESC>
<NARR>
<REL>The content of the related documents should include the causes of NBA labor dispute, the relations between the players and the management, main controversial issues of both sides, compromises after negotiation and content of the new agreement, etc. The document will be regarded as irrelevant if it only touched upon the influences of closing the court on each game of the season.</REL>
</NARR>
<CONC>NBA (National Basketball Association), union, team, league, labor dispute, league and union, negotiation, to sign an agreement, salary, lockout, Stern, Bird Regulation.</CONC>
</TOPIC>

(3) Tags used in topics

The tags used in topics are as follows.

<TOPIC>	</TOPIC>	The tag for each topic
<NUM>	</NUM>	Topic identifier
<SLANG>	</SLANG>	Source language code: CH, EN, JA, KR
<TLANG>	</TLANG>	Target language code: CH, EN, JA, KR
<TITLE>	</TITLE>	A concise representation of the information request, composed of a noun or noun phrase.
<DESC>	</DESC>	A short description of the topic: a brief statement of the information need in one or two sentences.
<NARR>	</NARR>	A much longer description of the topic. The <NARR> may have three parts: (1) <BACK>...</BACK>: background information about the topic. (2) <REL>...</REL>: further interpretation of the request and proper nouns, the list of relevant or irrelevant items, the specific requirements or limitations of relevant documents, and so on. (3) <TERM>...</TERM>: definitions or explanations of proper nouns, scientific terms, and so on.
<CONC>	</CONC>	Keywords relevant to the whole topic.
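
The topic structure above could be read along the following lines. This is a sketch only: the function name parse_topic is hypothetical, the sample is an abridged copy of the sample topic shown earlier, and the code assumes each tag appears at most once per topic.

```python
import re


def _field(tag, text):
    """Return the stripped content of <tag>...</tag> in text, or None."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return m.group(1).strip() if m else None


def parse_topic(block):
    """Parse one <TOPIC> block into a dict (hypothetical helper)."""
    topic = {
        tag: _field(tag, block)
        for tag in ("NUM", "SLANG", "TLANG", "TITLE", "DESC", "CONC")
    }
    # <NARR> may hold up to three optional parts: BACK, REL and TERM.
    narr = _field("NARR", block) or ""
    topic["NARR"] = {part: _field(part, narr) for part in ("BACK", "REL", "TERM")}
    return topic


# Abridged from the sample topic on this page.
sample_topic = """<TOPIC>
<NUM>013</NUM>
<SLANG>CH</SLANG>
<TLANG>EN</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>To retrieve the labor dispute between the two parties of the US National Basketball Association.</DESC>
<NARR>
<REL>The content of the related documents should include the causes of NBA labor dispute.</REL>
</NARR>
<CONC>NBA (National Basketball Association), union, team, league</CONC>
</TOPIC>"""

topic = parse_topic(sample_topic)
print(topic["NUM"], topic["SLANG"], topic["TLANG"], topic["TITLE"])
print(topic["NARR"])
```

Parts of <NARR> that are absent come back as None, so a caller can tell a missing <BACK> or <TERM> from an empty one.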

Relevance Judgments

(1) List of judgment files

Type of run	Topics	Docs	Relevance Judgments Files
Single Language IR (SLIR)	Chinese(C)	Chinese(C)	CLIR5FormalRunRJ-C-Rigid.txt CLIR5FormalRunRJ-C-Relax.txt
	Japanese(J)	Japanese(J)	CLIR5FormalRunRJ-J-Rigid.txt CLIR5FormalRunRJ-J-Relax.txt
	Korean(K)	Korean(K)	CLIR5FormalRunRJ-K-Rigid.txt CLIR5FormalRunRJ-K-Relax.txt
	English(E)	English(E)	CLIR5FormalRunRJ-E-Rigid.txt CLIR5FormalRunRJ-E-Relax.txt
Bilingual IR (BLIR)	C or J or K	E	CLIR5FormalRunRJ-E-Rigid.txt CLIR5FormalRunRJ-E-Relax.txt
	C or K	J	CLIR5FormalRunRJ-J-Rigid.txt CLIR5FormalRunRJ-J-Relax.txt
	K or J	C	CLIR5FormalRunRJ-C-Rigid.txt CLIR5FormalRunRJ-C-Relax.txt
	C or J	K	CLIR5FormalRunRJ-K-Rigid.txt CLIR5FormalRunRJ-K-Relax.txt
Multilingual IR (MLIR)	C or J or K or E	C and J and K and E	CLIR5FormalRunRJ-CJKE-Rigid.txt CLIR5FormalRunRJ-CJKE-Relax.txt
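
In the table above, the judgment file depends only on the document language(s) and the relevance criterion, not on the topic language. A small sketch of that naming pattern, with a hypothetical helper name judgment_file and the file-name template inferred from the listed names:

```python
def judgment_file(doc_langs, criterion):
    """Build a judgment file name from document language codes.

    doc_langs: iterable of codes from "C", "J", "K", "E"
               (one code for SLIR/BLIR, all four for MLIR).
    criterion: "Rigid" or "Relax", as spelled in the file names.
    """
    if criterion not in ("Rigid", "Relax"):
        raise ValueError(f"unknown criterion: {criterion!r}")
    # Normalise to the C, J, K, E order used in the listed file names.
    code = "".join(c for c in "CJKE" if c in doc_langs)
    if code not in ("C", "J", "K", "E", "CJKE"):
        raise ValueError(f"no judgment file listed for: {doc_langs!r}")
    return f"CLIR5FormalRunRJ-{code}-{criterion}.txt"


# A J-to-E bilingual run is scored against the English judgments.
print(judgment_file("E", "Rigid"))
# A multilingual run is scored against the combined CJKE judgments.
print(judgment_file({"C", "J", "K", "E"}, "Relax"))
```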

(2) Two kinds of relevance judgment file

In this test collection, four categories of relevance are used for the judgment: "Highly Relevant," "Relevant," "Partially Relevant," and "Irrelevant." However, since the trec_eval scoring program we use adopts binary relevance, we have to decide on thresholds for the four categories. For this reason, we provide two kinds of relevance judgment file:
(a) "Rigid" relevance - "Highly Relevant" and "Relevant" are regarded as relevant.
(b) "Relaxed" relevance - "Highly Relevant", "Relevant" and "Partially Relevant" are regarded as relevant.
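
The two thresholds can be written as a simple mapping. The sketch below uses the letter codes that the judgment files use for the four categories (S, A, B, C); the names RIGID, RELAXED, and to_binary are illustrative only.

```python
# Grades counted as relevant under each criterion.
# S = Highly Relevant, A = Relevant, B = Partially Relevant, C = Irrelevant.
RIGID = frozenset({"S", "A"})
RELAXED = frozenset({"S", "A", "B"})
ALL_GRADES = frozenset({"S", "A", "B", "C"})


def to_binary(grade, relevant_grades):
    """Map a four-level grade to the binary value used for scoring."""
    if grade not in ALL_GRADES:
        raise ValueError(f"unknown grade: {grade!r}")
    return 1 if grade in relevant_grades else 0


for grade in "SABC":
    print(grade, to_binary(grade, RIGID), to_binary(grade, RELAXED))
```

Only "Partially Relevant" (B) differs between the two criteria: it counts as irrelevant under Rigid and as relevant under Relaxed.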

(3) Format of the relevance judgment file

The format of a relevance judgment file is as follows:

(topic-ID) (dummy) (document-ID) (relevance judgment "0" or "1") (comment)

Fields are separated by a TAB.
The list is sorted by topic ID in ascending order, then by document ID in ascending order. A "topic-ID" is the three-digit number of the search topic.
The "dummy" field contains the four-category judgment: S - "Highly Relevant," A - "Relevant," B - "Partially Relevant," and C - "Irrelevant."
The "relevance judgment" field contains a binary value: "1" means "Relevant" and "0" means "Irrelevant," converted according to the "Rigid" or "Relaxed" criterion described above.
The "comment" field contains phrases or passages that each assessor extracted from the document to show why it was judged relevant or partially relevant.
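
A file in this format might be loaded as follows. This is a minimal sketch assuming five TAB-separated fields per line with an optional comment; the function name load_judgments and the document IDs in the sample lines are made up for illustration.

```python
from collections import defaultdict


def load_judgments(lines):
    """Return {topic_id: set of relevant document IDs} from judgment lines.

    Each line: topic-ID, dummy (S/A/B/C), document-ID, "0" or "1", comment.
    """
    relevant = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        topic_id, _grade, doc_id, binary = fields[:4]
        if binary == "1":
            relevant[topic_id].add(doc_id)
    return relevant


# Hypothetical lines as they might appear in a Rigid file.
sample_lines = [
    "013\tS\tdoc_a\t1\tpassage quoted by the assessor\n",
    "013\tC\tdoc_b\t0\t\n",
    "014\tB\tdoc_c\t0\t\n",
]

judgments = load_judgments(sample_lines)
print(dict(judgments))
```

In practice the lines would come from an opened file, for example `open(path, encoding=...)`, with the encoding chosen to match the file; the encoding is not stated on this page.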

The procedure for obtaining the test collection is as follows. The test collection and the data available from NII are free of charge.

Task data (without document data)

NTCIR-5 CLIR Task data can be downloaded from NII/IDR at:
http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html

Document data

Please read "The terms of use" carefully before you apply for the data.

1. Please fill out "The application form of the test collection" and send it by e-mail to ntc-secretariat as an attachment.
2. After NII reviews your application, we will notify you of the result within a few days.

< If your application is accepted >
Please send us the "user agreement (memorandum on Permission to Use Test Collection)" by postal mail, as follows.
- Please download it and print two copies (double-sided printing).
- Please fill out and sign both agreement forms. (Signatures are needed on both forms.)
- Please send them by postal mail or courier to the address below.
3. After we receive the agreement forms, we will distribute the data.

After being countersigned by NII, one copy of the User Agreement Form will be sent to you and one copy will be kept by NII.
Because it shows that you have permission to use the data, please keep it in a safe place for the duration of the permitted term.

Documents to submit

Application Form [txt]
User agreement Form (sent by email)

Reference

The terms of use [PDF]
Task Overview of NTCIR 5 CLQA
Overview of CLIR Task at the Fifth NTCIR Workshop

Address

NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN

PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat

Mailing List

Releases of new test collections and correction information will be announced through the NTCIR mailing list.

Notice

The test collection has been constructed and used for NTCIR. It may be used only for research purposes.
The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee. The providers of the document data kindly understood the importance of the test collection for research on information access technologies and granted the use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. For a continued, reliable, and good relationship with the data producers and providers, it is important that we researchers behave as reliable partners, use the data only for research purposes under the user agreement, and handle it carefully so as not to violate any of their rights.

Updated on : 2016-11-04
