README for Topics and Relevance Assessments of NTCIR-3 CLIR Test Collection - <Dry runs>
Apr./01/2003
1. Two sets of topics
The NTCIR-3 CLIR test collection includes two sets of topics: one for the 1998-99 Chinese, Japanese and English document sets, and another for the 1994 Korean document set. The following table gives an outline (see Section 3 for the document sets).
Topic set (source language): Year & Language | Topic set: File Name | Document set (target language) |
(1) 1998-99 Chinese | CLIRDryRunTopic-CH98 | 1998-99 Chinese, Japanese and English docs |
(2) 1998-99 Japanese | CLIRDryRunTopic-JA98 | 1998-99 Chinese, Japanese and English docs |
(3) 1998-99 Korean | CLIRDryRunTopic-KR98 | 1998-99 Chinese, Japanese and English docs |
(4) 1998-99 English | CLIRDryRunTopic-EN98 | 1998-99 Chinese, Japanese and English docs |
(5) 1994 Chinese | CLIRDryRunTopic-CH94 | 1994 Korean docs |
(6) 1994 Japanese | CLIRDryRunTopic-JA94 | 1994 Korean docs |
(7) 1994 Korean | CLIRDryRunTopic-KR94 | 1994 Korean docs |
(8) 1994 English | CLIRDryRunTopic-EN94 | 1994 Korean docs |
2. Format of Topics
<TOPIC>
<NUM>013</NUM>
<SLANG>CH</SLANG>
<TLANG>EN</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>
To retrieve the labor dispute between the two parties of the US National Basketball Association at the end of 1998 and the agreement that they reached.
</DESC>
<NARR>
The content of the related documents should include the causes of NBA labor dispute, the relations between the players and the management, main controversial issues of both sides, compromises after negotiation and content of the new agreement, etc. The document will be regarded as irrelevant if it only touched upon the influences of closing the court on each game of the season.
</NARR>
<CONC>
NBA (National Basketball Association), union, team, league, labor dispute, league and union, negotiation, to sign an agreement, salary, lockout, Stern, Bird Regulation.
</CONC>
</TOPIC>
2.2 Tags used in topics
<TOPIC> | </TOPIC> | The tag for each topic |
<NUM> | </NUM> | Topic identifier |
<SLANG> | </SLANG> | Source language code: CH, EN, JA, KR |
<TLANG> | </TLANG> | Target language code: CH, EN, JA, KR |
<TITLE> | </TITLE> | A concise representation of the information request, composed of a noun or noun phrase. |
<DESC> | </DESC> | A short description of the topic: a brief statement of the information need in one or two sentences. |
<NARR> | </NARR> | A much longer description of the topic. It has to be detailed and may include further interpretation of the request and of proper nouns, a list of relevant or irrelevant items, specific requirements or limitations on relevant documents, and so on. |
<CONC> | </CONC> | Keywords relevant to the topic. |
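The tag layout above can be read with a few lines of Python. The following is a minimal sketch, not an official NTCIR tool: the function name `parse_topics` is hypothetical, the input is assumed to be text that has already been decoded (the character encoding of the topic files is not stated in this README), and each field is assumed to appear at most once per topic.

```python
import re

# Field tags listed in section 2.2, in the order they appear in a topic.
FIELDS = ("NUM", "SLANG", "TLANG", "TITLE", "DESC", "NARR", "CONC")


def parse_topics(text):
    """Return one dict per <TOPIC>...</TOPIC> block found in `text`.

    Assumption: tags are not nested and are spelled exactly as in
    section 2.2. Missing fields are simply left out of the dict.
    """
    topics = []
    for block in re.findall(r"<TOPIC>(.*?)</TOPIC>", text, flags=re.DOTALL):
        topic = {}
        for tag in FIELDS:
            match = re.search(rf"<{tag}>(.*?)</{tag}>", block, flags=re.DOTALL)
            if match:
                # Collapse line breaks and repeated spaces inside a field.
                topic[tag] = " ".join(match.group(1).split())
        topics.append(topic)
    return topics


sample = """<TOPIC>
<NUM>013</NUM>
<SLANG>CH</SLANG>
<TLANG>EN</TLANG>
<TITLE>NBA labor dispute</TITLE>
</TOPIC>"""

for topic in parse_topics(sample):
    print(topic["NUM"], topic["SLANG"], topic["TLANG"], topic["TITLE"])
```

A regular expression is used here instead of an XML parser because a topic file holding several <TOPIC> blocks has no single root element.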
3. Combinations of document sets with each topic set
3.1 Document Sets
The topics and relevance judgments were created for searching the following document collections. To use them, you need to obtain the document sets.
A. 1998-99 Document sets
No. | Language | Name | Docs |
1. | Chinese (C) | CIRB011 (1998-99) | 132,173 |
1. | Chinese (C) | United Daily News (CIRB020, 1998-99) | 249,508 |
2. | Japanese (J) | Mainichi Newspaper (1998-99) | 220,078 |
3. | English (E) | Taiwan News and China Times English News (EIRB010, 1998-99) | 10,204 |
3. | English (E) | Mainichi Daily News | 12,723 |
B. 1994 Document set
4. | Korean (K) | Korea Economic Daily (1994) | 66,146 |
3.2 Combinations of three components
Use only the following combinations of document sets, topic sets and relevance judgment files.
A. 1998-99 Document sets
Type of run | Topics | Docs | Relevance Judgments Files |
Monolingual IR | C (1998-99) | C | CLIRDryRunRJ-C-Rigid, CLIRDryRunRJ-C-Relax |
Monolingual IR | J (1998-99) | J | CLIRDryRunRJ-J-Rigid, CLIRDryRunRJ-J-Relax |
Monolingual IR | E (1998-99) | E | CLIRDryRunRJ-E-Rigid, CLIRDryRunRJ-E-Relax |
Bilingual IR | C or K or E (1998-99) | J | CLIRDryRunRJ-J-Rigid, CLIRDryRunRJ-J-Relax |
Bilingual IR | C or K or J (1998-99) | E | CLIRDryRunRJ-E-Rigid, CLIRDryRunRJ-E-Relax |
Bilingual IR | K or E or J (1998-99) | C | CLIRDryRunRJ-C-Rigid, CLIRDryRunRJ-C-Relax |
Multilingual IR | C or J or E or K (1998-99) | C and J | CLIRDryRunRJ-CJ-Rigid, CLIRDryRunRJ-CJ-Relax |
Multilingual IR | C or J or E or K (1998-99) | C and E | CLIRDryRunRJ-CE-Rigid, CLIRDryRunRJ-CE-Relax |
Multilingual IR | C or J or E or K (1998-99) | J and E | CLIRDryRunRJ-JE-Rigid, CLIRDryRunRJ-JE-Relax |
Multilingual IR | C or J or E or K (1998-99) | C and J and E | CLIRDryRunRJ-CJE-Rigid, CLIRDryRunRJ-CJE-Relax |
B. 1994 Document set
Monolingual IR | K (1994) | K | CLIRDryRunRJ-K-Rigid, CLIRDryRunRJ-K-Relax |
Bilingual IR | C or J or E (1994) | K | CLIRDryRunRJ-K-Rigid, CLIRDryRunRJ-K-Relax |
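The combination rules in the two tables above can be checked mechanically before a run is scored. The sketch below is only an illustration of those rules. The function name `judgment_files` and its return shape are hypothetical, and languages are written with the single-letter codes used in this section (C, J, E, K), not the two-letter codes used in the topic file names.

```python
def judgment_files(topic_lang, doc_langs):
    """Return (run type, topic set year, relevance judgment file names).

    topic_lang: one of "C", "J", "E", "K".
    doc_langs:  iterable of document language codes, e.g. ["C", "E"].
    Raises ValueError for combinations not listed in section 3.2.
    """
    docs = set(doc_langs)
    if topic_lang not in {"C", "J", "E", "K"}:
        raise ValueError(f"unknown topic language: {topic_lang!r}")

    if docs == {"K"}:
        # The 1994 Korean docs are searched only with the 1994 topics.
        year, suffix = "1994", "K"
    elif docs and docs <= {"C", "J", "E"}:
        # The 1998-99 docs; file names list languages in the order C, J, E.
        year = "1998-99"
        suffix = "".join(code for code in "CJE" if code in docs)
    else:
        raise ValueError(f"unsupported document set combination: {sorted(docs)}")

    if len(docs) > 1:
        run_type = "Multilingual IR"
    elif topic_lang in docs:
        run_type = "Monolingual IR"
    else:
        run_type = "Bilingual IR"

    files = [f"CLIRDryRunRJ-{suffix}-Rigid", f"CLIRDryRunRJ-{suffix}-Relax"]
    return run_type, year, files


print(judgment_files("K", ["C", "E"]))
print(judgment_files("J", ["K"]))
```

Mixing the Korean 1994 documents with any 1998-99 document set raises `ValueError`, because no relevance judgment file is listed for such a run.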
3.3 Two kinds of relevance judgment file
In this test collection, four categories of relevance are used for the judgments:
"Highly Relevant," "Relevant," "Partially Relevant," and "Irrelevant."
However, the trec_eval scoring program we use assumes binary relevance, so a
threshold is needed to reduce the four categories to two. For this reason, we
provide two kinds of relevance judgment file:
(1) "Rigid" relevance (files ending in -Rigid): "Highly Relevant" and "Relevant"
are regarded as relevant.
(2) "Relaxed" relevance (files ending in -Relax): "Highly Relevant", "Relevant"
and "Partially Relevant" are regarded as relevant.
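The two thresholds can be written out as a small mapping from the four categories to a binary value. This is a sketch of the rule only: how the categories are encoded inside the judgment files is not described in this README, so the category strings and the function name `to_binary` are assumptions made for illustration.

```python
# The four relevance categories named in this section.
CATEGORIES = ("Highly Relevant", "Relevant", "Partially Relevant", "Irrelevant")

# Categories counted as relevant under each threshold.
RIGID = {"Highly Relevant", "Relevant"}
RELAXED = RIGID | {"Partially Relevant"}


def to_binary(category, threshold="rigid"):
    """Map a four-level category to 1 (relevant) or 0 (not relevant)."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if threshold == "rigid":
        relevant = RIGID
    elif threshold == "relaxed":
        relevant = RELAXED
    else:
        raise ValueError(f"unknown threshold: {threshold!r}")
    return 1 if category in relevant else 0


for category in CATEGORIES:
    print(category, to_binary(category, "rigid"), to_binary(category, "relaxed"))
```

Only "Partially Relevant" changes value between the two thresholds, so a Rigid file and a Relax file for the same document set differ only in documents judged at that level.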