README for Topics and Relevance Assessments of NTCIR-3 CLIR Test Collection - <Formal runs>

Apr./01/2003

1. Two sets of topics

The NTCIR-3 CLIR test collection includes two sets of topics: one for the Chinese, Japanese and English 1998-99 document sets, and the other for the Korean 1994 document set. The following table gives an outline (see section 3 for the document sets).

Document sets (target language)	Topic sets (source language): Year & Language	File Name
1998-99 Chinese docs, Japanese docs, English docs	(1) 1998-99 Chinese	CLIRFormalRunTopic-CH98
	(2) 1998-99 Japanese	CLIRFormalRunTopic-JA98, CLIRFormalRunTopic-JA98r (*see below)
	(3) 1998-99 Korean	CLIRFormalRunTopic-KR98
	(4) 1998-99 English	CLIRFormalRunTopic-EN98
1994 Korean docs	(5) 1994 Chinese	CLIRFormalRunTopic-CH94
	(6) 1994 Japanese	CLIRFormalRunTopic-JA94
	(7) 1994 Korean	CLIRFormalRunTopic-KR94
	(8) 1994 English	CLIRFormalRunTopic-EN94

* NOTE: The preliminary version of the 1998-99 Japanese topic set distributed to NTCIR WS 3 participants, named "CLIRFormalRunTopic-JA98.*", contains a mistranslation in topic "010". The error was found and corrected during the relevance assessments, and the assessments were then carried out for the corrected topic.
The error had not yet been found when participants submitted their search results or when documents were pooled from those results. Pooling for topic "010" was therefore not sufficiently effective, since some of the searches for it might have failed. This may be one reason why the relevance assessments for this topic might not be exhaustive.
"NTCIR-3 CLIR test collection for the research purpose" includes both the preliminary version of the topic set and the revised (error-corrected) version. The revised Japanese topic set is named "CLIRFormalRunTopic-JA98r.*".
If you use topic "010" in your experiments, please keep this point in mind.
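
The naming pattern in the table above can be captured in a small helper. The following is only a sketch: the function name `topic_file_stem` and its parameters are illustrative assumptions, and it returns file-name stems without an extension, because the extension is not specified in this README.

```python
# Hypothetical helper (not part of the distribution): builds topic-file
# stems from the naming pattern shown in the table in section 1.
TOPIC_LANGS = {"CH", "JA", "KR", "EN"}  # language codes used in the file names
TOPIC_YEARS = {"98", "94"}              # 1998-99 and 1994 topic sets


def topic_file_stem(lang: str, year: str, use_revised: bool = True) -> str:
    """Return a stem such as 'CLIRFormalRunTopic-CH98' (no extension)."""
    lang = lang.upper()
    if lang not in TOPIC_LANGS:
        raise ValueError(f"unknown topic language: {lang!r}")
    if year not in TOPIC_YEARS:
        raise ValueError(f"unknown topic year: {year!r}")
    stem = f"CLIRFormalRunTopic-{lang}{year}"
    # Only the 1998-99 Japanese set has a revised ("r") version.
    if use_revised and lang == "JA" and year == "98":
        stem += "r"
    return stem


print(topic_file_stem("JA", "98"))                     # revised Japanese set
print(topic_file_stem("JA", "98", use_revised=False))  # preliminary version
print(topic_file_stem("KR", "94"))
```

Defaulting to the revised Japanese set is a design choice of this sketch, made because the relevance assessments were carried out for the corrected topic "010".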

2. Format of Topics

2.1 A sample of topics

<TOPIC>
<NUM>013</NUM>
<SLANG>CH</SLANG>
<TLANG>EN</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>
To retrieve the labor dispute between the two parties of the US National Basketball Association at the end of 1998 and the agreement that they reached.
</DESC>
<NARR>
The content of the related documents should include the causes of NBA labor dispute, the relations between the players and the management, main controversial issues of both sides, compromises after negotiation and content of the new agreement, etc. The document will be regarded as irrelevant if it only touched upon the influences of closing the court on each game of the season.
</NARR>
<CONC>
NBA (National Basketball Association), union, team, league, labor dispute, league and union, negotiation, to sign an agreement, salary, lockout, Stern, Bird Regulation.
</CONC>
</TOPIC>

2.2 Tags used in topics

The tags used in topics are as follows.

<TOPIC>	</TOPIC>	The tag for each topic
<NUM>	</NUM>	Topic identifier
<SLANG>	</SLANG>	Source language code: CH, EN, JA, KR
<TLANG>	</TLANG>	Target language code: CH, EN, JA, KR
<TITLE>	</TITLE>	A concise representation of the information request, composed of a noun or noun phrase.
<DESC>	</DESC>	A short description of the topic: a brief statement of the information need in one or two sentences.
<NARR>	</NARR>	A much longer description of the topic. The <NARR> should be detailed, covering, for example, further interpretation of the request and of proper nouns, a list of relevant or irrelevant items, and specific requirements or limitations for relevant documents.
<CONC>	</CONC>	Keywords relevant to the topic.
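
As an illustration of the format described above, the following sketch extracts the tagged fields from topic text. It is a minimal example under stated assumptions: the function name `parse_topics` is hypothetical, the input is assumed to be already decoded into a Python string (the character encoding of the topic files is not described in this section), and the embedded sample is an abbreviated copy of topic 013 with <NARR> and <CONC> omitted.

```python
import re

# Field tags listed in section 2.2, in the order they appear in a topic.
FIELDS = ("NUM", "SLANG", "TLANG", "TITLE", "DESC", "NARR", "CONC")
TOPIC_RE = re.compile(r"<TOPIC>(.*?)</TOPIC>", re.DOTALL)


def parse_topics(text: str) -> list:
    """Return one dict per <TOPIC> block; missing fields map to None."""
    topics = []
    for block in TOPIC_RE.findall(text):
        topic = {}
        for tag in FIELDS:
            match = re.search(rf"<{tag}>(.*?)</{tag}>", block, re.DOTALL)
            # Collapse line breaks and repeated spaces inside a field.
            topic[tag] = " ".join(match.group(1).split()) if match else None
        topics.append(topic)
    return topics


# Abbreviated copy of the sample topic in section 2.1.
sample = """<TOPIC>
<NUM>013</NUM>
<SLANG>CH</SLANG>
<TLANG>EN</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>
To retrieve the labor dispute between the two parties of the US National Basketball Association at the end of 1998 and the agreement that they reached.
</DESC>
</TOPIC>"""

for topic in parse_topics(sample):
    print(topic["NUM"], topic["SLANG"], topic["TLANG"], topic["TITLE"])
```

Topic numbers are kept as strings so that leading zeros such as "013" or "010" are preserved.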

3. Combinations of document sets with each topic set

3.1 Document Sets

The topics and relevance judgments were created for searching the following document collections. To use the topics and relevance judgments, you need to obtain these document sets.

A. 1998-99 Document sets
No.	Language	Name	Docs
1.	Chinese(C)	CIRB011 (1998-99)	132,173
1.	Chinese(C)	United Daily News (CIRB020, 1998-99)	249,508
2.	Japanese(J)	Mainichi Newspaper (1998-99)	220,078
3.	English(E)	Taiwan News and China Times English News (EIRB010, 1998-99)	10,204
3.	English(E)	Mainichi Daily News	12,723
B. 1994 Document set
4.	Korean(K)	Korea Economic Daily (1994)	66,146
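
The per-language and combined collection sizes quoted in section 4 follow from summing the rows of the table above. The short sketch below reproduces that arithmetic; the names `DOC_COUNTS` and `total_docs` are illustrative assumptions, not part of the collection.

```python
# Document counts copied from the table in section 3.1,
# grouped by the one-letter language code used in this README.
DOC_COUNTS = {
    "C": [132_173, 249_508],  # CIRB011, United Daily News (CIRB020)
    "J": [220_078],           # Mainichi Newspaper
    "E": [10_204, 12_723],    # EIRB010, Mainichi Daily News
    "K": [66_146],            # Korea Economic Daily
}


def total_docs(doc_set: str) -> int:
    """Sum the counts for a document set code such as 'C', 'CJ' or 'CJE'."""
    return sum(sum(DOC_COUNTS[lang]) for lang in doc_set)


for code in ("C", "J", "E", "CJ", "CE", "JE", "CJE", "K"):
    print(f"{code}: {total_docs(code):,}")
```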

3.2 Combinations of three components

Use the correct combination of document set, topic set and relevance judgment files, as follows.

A. 1998-99 Document sets
Type of run	Topics	Docs	Relevance Judgments Files
Monolingual IR	C(1998-99)	C	CLIRFormalRunRJ-C-Rigid CLIRFormalRunRJ-C-Relax
	J(1998-99)	J	CLIRFormalRunRJ-J-Rigid CLIRFormalRunRJ-J-Relax
	E(1998-99)	E	CLIRFormalRunRJ-E-Rigid CLIRFormalRunRJ-E-Relax
Bilingual IR	C or K or E (1998-99)	J	CLIRFormalRunRJ-J-Rigid CLIRFormalRunRJ-J-Relax
	C or K or J (1998-99)	E	CLIRFormalRunRJ-E-Rigid CLIRFormalRunRJ-E-Relax
	K or E or J (1998-99)	C	CLIRFormalRunRJ-C-Rigid CLIRFormalRunRJ-C-Relax
Multilingual IR	C or J or E or K (1998-99)	C and J	CLIRFormalRunRJ-CJ-Rigid CLIRFormalRunRJ-CJ-Relax
	C or J or E or K (1998-99)	C and E	CLIRFormalRunRJ-CE-Rigid CLIRFormalRunRJ-CE-Relax
	C or J or E or K (1998-99)	J and E	CLIRFormalRunRJ-JE-Rigid CLIRFormalRunRJ-JE-Relax
	C or J or E or K (1998-99)	C and J and E	CLIRFormalRunRJ-CJE-Rigid CLIRFormalRunRJ-CJE-Relax
B. 1994 Document set
Monolingual IR	K(1994)	K	CLIRFormalRunRJ-K-Rigid CLIRFormalRunRJ-K-Relax
Bilingual IR	C or J or E (1994)	K	CLIRFormalRunRJ-K-Rigid CLIRFormalRunRJ-K-Relax
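
The tables above can be read as a lookup from the target document set to the topic period and the pair of relevance judgment files. The following sketch encodes that lookup; the function names are hypothetical, and only file-name stems as written in the tables are returned.

```python
# Document set codes as used in the "Docs" column of section 3.2.
DOC_SETS_1998_99 = {"C", "J", "E", "CJ", "CE", "JE", "CJE"}
DOC_SETS_1994 = {"K"}


def topic_period_for(doc_set: str) -> str:
    """Return which topic set period goes with a target document set."""
    if doc_set in DOC_SETS_1998_99:
        return "1998-99"
    if doc_set in DOC_SETS_1994:
        return "1994"
    raise ValueError(f"unknown document set: {doc_set!r}")


def rj_file_stems(doc_set: str) -> tuple:
    """Return the (rigid, relaxed) relevance judgment file stems."""
    topic_period_for(doc_set)  # validates the document set code
    return (
        f"CLIRFormalRunRJ-{doc_set}-Rigid",
        f"CLIRFormalRunRJ-{doc_set}-Relax",
    )


print(topic_period_for("CJE"), rj_file_stems("CJE"))
print(topic_period_for("K"), rj_file_stems("K"))
```

The lookup is keyed on the document set only, since the same judgment files are listed for monolingual and bilingual runs against the same documents.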

3.3 Two kinds of relevance judgment file

In this test collection, relevance is judged using four categories: "Highly Relevant," "Relevant," "Partially Relevant," and "Irrelevant." However, since the trec_eval scoring program we use assumes binary relevance, a threshold has to be set over the four categories. For this reason, we provide two kinds of relevance judgment file:
(1) "Rigid" relevance - "Highly Relevant" and "Relevant" are regarded as relevant.
(2) "Relaxed" relevance - "Highly Relevant", "Relevant" and "Partially Relevant" are regarded as relevant.
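
The two thresholds can be expressed as a mapping from the four categories to a binary value. The sketch below is illustrative only: it uses the category names from this section as plain strings, which is an assumption, because the encoding of the categories inside the judgment files is not described here.

```python
# Categories counted as relevant under each threshold (section 3.3).
RIGID_RELEVANT = frozenset({"Highly Relevant", "Relevant"})
RELAXED_RELEVANT = RIGID_RELEVANT | {"Partially Relevant"}
ALL_CATEGORIES = RELAXED_RELEVANT | {"Irrelevant"}


def to_binary(category: str, mode: str) -> int:
    """Map a four-level category to 1 (relevant) or 0 (not relevant)."""
    if category not in ALL_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if mode == "rigid":
        return int(category in RIGID_RELEVANT)
    if mode == "relaxed":
        return int(category in RELAXED_RELEVANT)
    raise ValueError(f"unknown mode: {mode!r}")


for category in ("Highly Relevant", "Relevant", "Partially Relevant", "Irrelevant"):
    print(category, to_binary(category, "rigid"), to_binary(category, "relaxed"))
```

Only "Partially Relevant" changes value between the two modes, so the relaxed judgments contain at least as many relevant documents per topic as the rigid ones.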

4. The sets of topics used for each sub-task

4.1 Overview

The NTCIR-3 CLIR task has many sub-tasks, which differ in the combination of languages used. Note carefully that the topic set to be used is determined by the languages of the target document set, not by the language of the topics.

4.2 Searching the Chinese document set (C)

The Chinese (C) document set contains 381,681 Chinese documents.
The set of 42 topics (for 1998-99) should be used for retrieval experiments using the C collection:
Topic No. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 27, 32, 33, 34, 35, 36, 37, 38, 39, 40, 42, 43, 45, 46, 47, 48, 49, and 50.

4.3 Searching the Japanese document set (J)

The Japanese (J) document set contains 220,078 Japanese documents.
The set of 42 topics (for 1998-99) should be used for retrieval experiments using the J collection:
Topic No. 2, 4, 5, 7, 8, 10, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, and 50.

4.4 Searching the English document set (E)

The English (E) document set contains 22,927 English documents.
The set of 32 topics (for 1998-99) should be used for retrieval experiments using the E collection:
Topic No. 2, 4, 5, 7, 9, 12, 13, 14, 18, 19, 20, 21, 23, 24, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 39, 42, 43, 45, 46, and 50.

4.5 Searching the Chinese and Japanese document set (CJ)

The Chinese (C) and Japanese (J) document set contains 601,759 Chinese and Japanese documents.
All 50 topics (for 1998-99) should be used for retrieval experiments using the CJ collection.

4.6 Searching the Chinese and English document set (CE)

The Chinese (C) and English (E) document set contains 404,608 Chinese and English documents.
The set of 46 topics (for 1998-99) should be used for retrieval experiments using the CE collection:
Topic No. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 42, 43, 45, 46, 47, 48, 49, and 50.

4.7 Searching the Japanese and English document set (JE)

The Japanese (J) and English (E) document set contains 243,005 Japanese and English documents.
The set of 45 topics (for 1998-99) should be used for retrieval experiments using the JE collection:
Topic No. 2, 4, 5, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, and 50.

4.8 Searching the Chinese, Japanese and English document set (CJE)

The Chinese, Japanese and English (CJE) document set contains 624,686 Chinese, Japanese, and English documents.
All 50 topics (for 1998-99) should be used for retrieval experiments using the CJE collection.
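
For scripted experiments, the topic lists in sections 4.2 to 4.8 can be kept in a lookup table so that runs are evaluated only on the topics defined for the target document set. The sketch below transcribes the lists exactly as printed above, in a compact range notation; the names `expand` and `TOPICS` are illustrative assumptions.

```python
def expand(spec: str) -> set:
    """Expand a spec such as '1-3,5' into the set {1, 2, 3, 5}."""
    topics = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            topics.update(range(int(low), int(high) + 1))
        else:
            topics.add(int(part))
    return topics


# Topic numbers per target document set, transcribed from sections 4.2-4.8.
TOPICS = {
    "C": expand("1-15,17-25,27,32-40,42,43,45-50"),
    "J": expand("2,4,5,7,8,10,12,14-47,50"),
    "E": expand("2,4,5,7,9,12-14,18-21,23,24,26-29,31-39,42,43,45,46,50"),
    "CJ": expand("1-50"),
    "CE": expand("1-15,17-29,31-40,42,43,45-50"),
    "JE": expand("2,4,5,7-10,12-48,50"),
    "CJE": expand("1-50"),
}

for code, topics in TOPICS.items():
    print(code, len(topics))

# Compare each combined list with the union of its single-language lists.
print(TOPICS["CE"] == TOPICS["C"] | TOPICS["E"])
print(sorted(TOPICS["JE"] - (TOPICS["J"] | TOPICS["E"])))
```

As transcribed, the set sizes match the counts stated in each subsection, and the CE list equals the union of the C and E lists. The JE list as printed also contains topic 48, which appears in neither the J list nor the E list; the sketch keeps every list exactly as given in this README.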