README for Topics and Relevance Assessments of NTCIR-3 CLIR Test Collection - <Formal runs>
Apr./01/2003
1. Two sets of topics
The test collection of NTCIR-3 CLIR includes two sets of topics: one for the Chinese, Japanese and English 1998-99 document sets, and the other for the Korean 1994 document set. The outline is shown in the following table (see Section 3 for the document sets).
Target document sets: 1998-99 Chinese, Japanese and English documents

  Topic set (year & source language)   File Name
  (1) 1998-99 Chinese                  CLIRFormalRunTopic-CH98
  (2) 1998-99 Japanese                 CLIRFormalRunTopic-JA98
                                       CLIRFormalRunTopic-JA98r (*see NOTE below)
  (3) 1998-99 Korean                   CLIRFormalRunTopic-KR98
  (4) 1998-99 English                  CLIRFormalRunTopic-EN98

Target document set: 1994 Korean documents

  Topic set (year & source language)   File Name
  (5) 1994 Chinese                     CLIRFormalRunTopic-CH94
  (6) 1994 Japanese                    CLIRFormalRunTopic-JA94
  (7) 1994 Korean                      CLIRFormalRunTopic-KR94
  (8) 1994 English                     CLIRFormalRunTopic-EN94
* NOTE: There is a mistranslation in the 1998 Japanese topic set named
"CLIRFormalRunTopic-JA98.*" in the preliminary version distributed to the
NTCIR WS 3 participants. The topic containing the error is topic "010".
The error was found during the relevance assessments and corrected, and
the relevance assessments were then carried out for the corrected topic.
However, the error had not yet been found when the participants submitted
their search results or when the documents were pooled from those results.
The pooling for topic "010" may therefore not have been sufficiently
effective, since some of the searches for it might have failed, and this
may be one reason why the relevance assessments for this topic might not
be exhaustive.
The "NTCIR-3 CLIR test collection for the research purpose" includes both
the preliminary version of the topic set and the revised (error-corrected)
version. The revised Japanese topic set is named "CLIRFormalRunTopic-JA98r.*".
If you use topic "010" in your experiments, please keep this point in mind.
2. Format of Topics
2.1 A sample of topics
<TOPIC>
<NUM>013</NUM>
<SLANG>CH</SLANG>
<TLANG>EN</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>
To retrieve the labor dispute between the two parties of the US National
Basketball Association at the end of 1998 and the agreement that they reached.
</DESC>
<NARR>
The content of the related documents should include the causes of NBA labor
dispute, the relations between the players and the management, main
controversial issues of both sides, compromises after negotiation and content
of the new agreement, etc. The document will be regarded as irrelevant if it
only touched upon the influences of closing the court on each game of the
season.
</NARR>
<CONC>
NBA (National Basketball Association), union, team, league, labor dispute,
league and union, negotiation, to sign an agreement, salary, lockout, Stern,
Bird Regulation.
</CONC>
</TOPIC>
2.2 Tags used in topics
The tags used in the topics are as follows.

  <TOPIC>  </TOPIC>   Encloses each topic.
  <NUM>    </NUM>     Topic identifier.
  <SLANG>  </SLANG>   Source language code: CH, EN, JA, KR.
  <TLANG>  </TLANG>   Target language code: CH, EN, JA, KR.
  <TITLE>  </TITLE>   A concise representation of the information request,
                      composed of a noun or noun phrase.
  <DESC>   </DESC>    A short description of the topic: a brief statement of
                      the information need, composed of one or two sentences.
  <NARR>   </NARR>    A longer description of the topic. The <NARR> gives
                      details such as further interpretation of the request
                      and proper nouns, lists of relevant or irrelevant items,
                      and specific requirements or limitations of relevant
                      documents.
  <CONC>   </CONC>    Keywords relevant to the topic.
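Since each topic is a plain SGML-style record built from these tags, a topic
file can be read with a few regular expressions. The following is a minimal
sketch in Python, assuming one value per field tag inside each <TOPIC>
element; the function name, file name, and encoding in the usage comment are
examples of ours and must be adapted to the distributed files.

import re

# Minimal parsing sketch (not part of the official distribution).
# Assumptions: one value per field tag inside each <TOPIC> element, and a
# file encoding that must be chosen to match the actual topic file.
TOPIC_RE = re.compile(r"<TOPIC>(.*?)</TOPIC>", re.DOTALL)
FIELD_RE = re.compile(r"<(NUM|SLANG|TLANG|TITLE|DESC|NARR|CONC)>(.*?)</\1>",
                      re.DOTALL)

def read_topics(path, encoding):
    """Return a list of dicts, one per <TOPIC> element in the file."""
    with open(path, encoding=encoding) as f:
        text = f.read()
    topics = []
    for body in TOPIC_RE.findall(text):
        topics.append({tag: value.strip()
                       for tag, value in FIELD_RE.findall(body)})
    return topics

# Hypothetical usage (file name and encoding are examples only):
# for t in read_topics("CLIRFormalRunTopic-EN98", "utf-8"):
#     print(t["NUM"], t["TITLE"])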
3. Combinations of document sets with each topic set
3.1 Document Sets
Topics and relevance judgment information were created for searching the following document collections. To use the topics and relevance judgments, you need to obtain the document sets.
A. 1998-99 Document sets

  No.  Language      Name                                                          Docs
  1.   Chinese (C)   CIRB011 (1998-99)                                             132,173
                     United Daily News (CIRB020, 1998-99)                          249,508
  2.   Japanese (J)  Mainichi Newspaper (1998-99)                                  220,078
  3.   English (E)   Taiwan News and China Times English News (EIRB010, 1998-99)   10,204
                     Mainichi Daily News                                           12,723

B. 1994 Document set

  No.  Language      Name                                                          Docs
  4.   Korean (K)    Korea Economic Daily (1994)                                   66,146
3.2 Combinations of three components
You must use the correct combinations of document sets, topic sets, and relevance judgment files, as follows.
A. 1998-99 Document sets

  Type of run       Topics                       Docs           Relevance Judgment Files
  Monolingual IR    C (1998-99)                  C              CLIRFormalRunRJ-C-Rigid, CLIRFormalRunRJ-C-Relax
                    J (1998-99)                  J              CLIRFormalRunRJ-J-Rigid, CLIRFormalRunRJ-J-Relax
                    E (1998-99)                  E              CLIRFormalRunRJ-E-Rigid, CLIRFormalRunRJ-E-Relax
  Bilingual IR      C or K or E (1998-99)        J              CLIRFormalRunRJ-J-Rigid, CLIRFormalRunRJ-J-Relax
                    C or K or J (1998-99)        E              CLIRFormalRunRJ-E-Rigid, CLIRFormalRunRJ-E-Relax
                    K or E or J (1998-99)        C              CLIRFormalRunRJ-C-Rigid, CLIRFormalRunRJ-C-Relax
  Multilingual IR   C or J or E or K (1998-99)   C and J        CLIRFormalRunRJ-CJ-Rigid, CLIRFormalRunRJ-CJ-Relax
                    C or J or E or K (1998-99)   C and E        CLIRFormalRunRJ-CE-Rigid, CLIRFormalRunRJ-CE-Relax
                    C or J or E or K (1998-99)   J and E        CLIRFormalRunRJ-JE-Rigid, CLIRFormalRunRJ-JE-Relax
                    C or J or E or K (1998-99)   C and J and E  CLIRFormalRunRJ-CJE-Rigid, CLIRFormalRunRJ-CJE-Relax

B. 1994 Document set

  Type of run       Topics                       Docs           Relevance Judgment Files
  Monolingual IR    K (1994)                     K              CLIRFormalRunRJ-K-Rigid, CLIRFormalRunRJ-K-Relax
  Bilingual IR      C or J or E (1994)           K              CLIRFormalRunRJ-K-Rigid, CLIRFormalRunRJ-K-Relax
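In practice the table above is a lookup from the target document combination
to the matching pair of relevance judgment files. A minimal Python sketch of
such a lookup is shown below; the names are the file stems listed above, and
any file extension or directory layout is an assumption that must be adapted
to the distributed files.

# Lookup from the "Docs" column above to the (Rigid, Relax) relevance
# judgment file stems. File extensions and directories are assumptions.
QRELS_BY_TARGET = {
    "C":   ("CLIRFormalRunRJ-C-Rigid",   "CLIRFormalRunRJ-C-Relax"),
    "J":   ("CLIRFormalRunRJ-J-Rigid",   "CLIRFormalRunRJ-J-Relax"),
    "E":   ("CLIRFormalRunRJ-E-Rigid",   "CLIRFormalRunRJ-E-Relax"),
    "K":   ("CLIRFormalRunRJ-K-Rigid",   "CLIRFormalRunRJ-K-Relax"),
    "CJ":  ("CLIRFormalRunRJ-CJ-Rigid",  "CLIRFormalRunRJ-CJ-Relax"),
    "CE":  ("CLIRFormalRunRJ-CE-Rigid",  "CLIRFormalRunRJ-CE-Relax"),
    "JE":  ("CLIRFormalRunRJ-JE-Rigid",  "CLIRFormalRunRJ-JE-Relax"),
    "CJE": ("CLIRFormalRunRJ-CJE-Rigid", "CLIRFormalRunRJ-CJE-Relax"),
}

def qrels_for(target_docs):
    """Return the (rigid, relaxed) judgment file stems for a run whose
    target document combination is e.g. "C", "JE", or "CJE"."""
    return QRELS_BY_TARGET[target_docs]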
3.3 Two kinds of relevance judgment file
In this test collection, four categories of relevance are used for the
judgments: "Highly Relevant," "Relevant," "Partially Relevant," and
"Irrelevant." However, since the trec_eval scoring program we use adopts
binary relevance, thresholds have to be set over the four categories. For
this reason, we provide two kinds of relevance judgment file:
(1) "Rigid" relevance - "Highly Relevant" and "Relevant" are regarded as
relevant.
(2) "Relaxed" relevance - "Highly Relevant," "Relevant," and "Partially
Relevant" are regarded as relevant.
4. The sets of topics used for each sub-task
4.1 Overview
The NTCIR-3 CLIR task has many sub-tasks, each with a different combination of languages. Please note carefully that the topic set to be used is determined by the languages of the target 'document' set, not by the language of the 'topics.'
4.2 Searching the Chinese document set (C)
The Chinese (C) document set contains 381,681 Chinese documents.
The set of 42 topics (1998-99) should be used for retrieval experiments
using the C collection:
Topic No.1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19,
20, 21, 22, 23, 24, 25, 27, 32, 33, 34, 35, 36, 37, 38, 39, 40, 42, 43,
45, 46, 47, 48, 49, and 50.
4.3 Searching the Japanese document set (J)
The Japanese (J) document set contains 220,078 Japanese documents.
The set of 42 topics (for 1998-99) should be used for retrieval experiments
using the J collection:
Topic No. 2, 4, 5, 7, 8, 10, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
42, 43, 44, 45, 46, 47, and 50.
4.4 Searching the English document set (E)
The English (E) document set contains 22,927 English documents.
The set of 32 topics (for 1998-99) should be used for retrieval experiments
using the E collection:
Topic No. 2, 4, 5, 7, 9, 12, 13, 14, 18, 19, 20, 21, 23, 24, 26, 27, 28,
29, 31, 32, 33, 34, 35, 36, 37, 38, 39, 42, 43, 45, 46, and 50.
4.5 Searching the Chinese and Japanese document set (CJ)
The Chinese (C) and Japanese (J) document set contains 601,759 Chinese and Japanese documents.
All 50 topics (for 1998-99) should be used for retrieval experiments using
the CJ collection.
4.6 Searching the Chinese and English document set (CE)
The Chinese (C) and English (E) document set contains 404,608 Chinese and
English documents. The set of 46 topics (for 1998-99) should be used for
retrieval experiments using the CE collection:
Topic No. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 42, 43, 45, 46, 47, 48, 49, and 50.
4.7 Searching the Japanese and English document set (JE)
The Japanese (J) and English (E) document set contains 243,005 Japanese and English documents.
The set of 45 topics (for 1998-99) should be used for retrieval experiments
using the JE collection:
Topic No. 2, 4, 5, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
42, 43, 44, 45, 46, 47, 48, and 50.
4.8 Searching the Chinese, Japanese and English document set (CJE)
The Chinese, Japanese and English (CJE) document set contains
624,686 Chinese, Japanese, and English documents.
All 50 topics (for 1998-99) should be used for retrieval experiments using
the CJE collection.
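The topic subsets listed in Sections 4.2 through 4.8 can be applied
mechanically when preparing runs or evaluations. The Python sketch below
keeps only the valid topics for a given collection; the dictionary is filled
in only for the C collection of Section 4.2 as an example, and the helper
reuses the topic dictionaries produced by the parsing sketch in Section 2.2.

# Valid topic numbers per target collection (Sections 4.2-4.8). Only the
# C collection of Section 4.2 is filled in here as an example; the other
# entries should be completed from the lists in this section.
VALID_TOPICS = {
    "C": {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19,
          20, 21, 22, 23, 24, 25, 27, 32, 33, 34, 35, 36, 37, 38, 39, 40,
          42, 43, 45, 46, 47, 48, 49, 50},
    "CJ": set(range(1, 51)),   # all 50 topics (Section 4.5)
    "CJE": set(range(1, 51)),  # all 50 topics (Section 4.8)
}

def filter_topics(topics, collection):
    """Keep only the topics whose <NUM> is valid for the given collection.

    `topics` is a list of dicts with a "NUM" field, as produced by the
    parsing sketch in Section 2.2."""
    valid = VALID_TOPICS[collection]
    return [t for t in topics if int(t["NUM"]) in valid]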