Call For Participation

Cross-Lingual Information Retrieval Task in NTCIR Workshop 4

(October 31, 2003)

Apr. 17 2003 - The numbers of documents for Hankookilbo and Korea Times were modified (see 5.1(a)).
Jun. 05 2003 - A new document set, Hong Kong Standard, is added (see 5.1(a)).
Jul. 26 2003 - A new Korean document set, Chosunilbo, is added (see 5.1(a)).
Oct. 01 2003 - New document sets, Yomiuri and CIRB011, are added for evaluation (see 5.1(a))
Oct. 01 2003 - Formal run topics are available (in the password-protected CLIR download area).
Oct. 10 2003 - FAQ is available. Please confirm you have all the document collections you need!
Oct. 15 2003 - Submission Guideline is available.
Oct. 28 2003 - Wrong specification in the table on the document languages for Small-MLIR was corrected. (see 2. MLIR)
Oct. 29 2003 - Abnormal document lists were added: [CIRB011] [CIRB020] [Yomiuri] [xinhua] [all]
Oct. 29 2003 - Numbers of documents in CIRB011, CIRB020, Yomiuri, and xinhua were corrected after discarding the abnormal documents. (see 5.1 (a))
Oct. 29 2003 - Submission Deadline is extended to: Saturday, Nov.8, 2003.
Oct. 29 2003 - Announcement on "HK Standard duplication" was posted.
Oct. 30 2003 - The number of documents for "total of English 1998-99" was corrected. (see 5.1 (a))
Oct. 30 2003 - Policy on the Treatment of the Abnormal Document Records was posted.
Oct. 31 2003 - Submission Guideline was updated. Instructions on how to submit were added.

1. Introduction

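The cross-lingual information retrieval (CLIR) task of NTCIR Workshop 4 includes four subtasks, Multilingual CLIR (MLIR), Bilingual CLIR (BLIR), Pivot Bilingual CLIR (PLIR), and Single Language IR (SLIR), for promoting research on East Asian languages (Chinese, Japanese, and Korean).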
a) See also the official web site (http://research.nii.ac.jp/ntcir/) for further information.
b) Online registration is available at the web site (http://research.nii.ac.jp/ntcir/ntcir4-ws/howto-en.html).

2. Subtasks

The CLIR task provides four subtasks. Participants may take part in any number of them, from one to all four.

Multilingual CLIR (MLIR)
The topic set and document set of MLIR subtask consist of more than two languages. In the case of NTCIR Workshop 4, the participants are allowed to submit results of runs for two types of multilingual document collection,

For the topic set, participants may use any of the four languages (Chinese, Japanese, Korean, or English). The following depicts the MLIR subtask.

Topic set Document set

Bilingual CLIR (BLIR)
The topic set and document set of the BLIR subtask consist of two languages. For example, a K -> J run (Korean topics, Japanese documents) requires translating the topics into Japanese (or the documents into Korean).
For BLIR at NTCIR Workshop 4, participants are not allowed to submit runs using topics written in English, except when trying the pivot language approach (see below).
The following depicts the BLIR subtask.

Topic set Document set
C -> J
C -> K
C -> E
J -> C
J -> K
J -> E
K -> C
K -> J
K -> E

Pivot Bilingual CLIR (PLIR)
This subtask is a new challenge at NTCIR-4. The pivot language approach (or trans-lingual approach) is a special form of BLIR that consists of two steps. For example, a C -> J search is carried out as C -> E followed by E -> J (i.e., C -> E -> J); in this case, English serves as an intermediate, or pivot, language.
Participants submitting runs for this subtask are also allowed to submit BLIR runs using English topics (i.e., E -> C, E -> J, or E -> K) in order to analyze the performance of the approach.

Topic set Document set
C -> J
C -> K
C -> E
J -> C
J -> K
J -> E
K -> C
K -> J
K -> E
E -> C
E -> J
E -> K
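
The two-step idea behind the pivot language approach can be sketched with dictionary-based term translation. This is only a minimal illustration, not a method prescribed by the task organizers: the function name pivot_translate and the two toy term lists are hypothetical, and a real system would add weighting and disambiguation on top of it.

```python
def pivot_translate(terms, src_to_pivot, pivot_to_tgt):
    """Translate terms in two steps: source -> pivot, then pivot -> target."""
    translated = []
    for term in terms:
        for pivot in src_to_pivot.get(term, []):
            for target in pivot_to_tgt.get(pivot, []):
                # Keep every candidate once, in first-seen order.
                if target not in translated:
                    translated.append(target)
    return translated


# Toy term lists, invented for illustration (C -> E and E -> J).
zh_en = {"籃球": ["basketball"], "罷工": ["strike"]}
en_ja = {"basketball": ["バスケットボール"], "strike": ["ストライキ", "ストライク"]}

ja_terms = pivot_translate(["籃球", "罷工"], zh_en, en_ja)
print(ja_terms)
```

In this toy example the English pivot "strike" yields both the labor sense and the baseball sense in Japanese, which shows how ambiguity can grow at each translation step.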

Single Language IR (SLIR)
The topic set and document set of SLIR subtask consist of a single language. The following depicts the SLIR subtask.

Topic set Document set
C -> C
J -> J
K -> K
E -> E

3. Special Interests in NTCIR Workshop 4

NTCIR Workshop 4 encourages participants to explore the following special issues.
Pivot Language Approach
Since the approach seems to be realistic, participants are strongly recommended to try it for BLIR using English as a pivot language.
More MLIR!
MLIR should be studied more intensively in order to develop search engines that work in real situations on the Internet. The submission of MLIR runs is strongly recommended.
Term Disambiguation in a More Realistic Situation
Needless to say, term disambiguation is an important issue for enhancing CLIR performance. In order to promote the development of term disambiguation techniques, the "title-only run" becomes mandatory in NTCIR-4.
Sharing Knowledge on Language Models
Recently, the so-called "language model" has often been used for CLIR. The model may give us a good opportunity to enhance the theoretical or practical aspects of CLIR.

4. Language Resources

Links to some resources are available from the list of language resources.
The task organizers will continue working to provide useful resources such as bilingual term lists, parallel corpora, or translation probability tables. If you have a language resource that can be shared among all participants, please tell us about it.

5. Test Collection

5.1 Document set

The test collection used in the CLIR task is composed of a document set and a topic set. The following gives a brief description of each set. Note that a new document set may be added later.

(a) Document collection for evaluation

Language | Collection | No. of Docs | Note
Chinese 1998-99 | CIRB020 (United Daily News) + | 249,508 | Used in NTCIR-3 (modified on 2003-10-29)
Chinese 1998-99 | CIRB011 (China Times, China Times Express, Commercial Times, China Daily News, Central and Daily News) ***, + | 132,173 | Used in NTCIR-3 (modified on 2003-10-29)
Chinese 1998-99 | total | 381,681 | (modified on 2003-10-29)
Japanese 1998-99 | Mainichi | 220,078 | Used in NTCIR-3
Japanese 1998-99 | Yomiuri ***, + | 375,980 | New (modified on 2003-10-29)
Japanese 1998-99 | total | 596,058 | (modified on 2003-10-29)
Korean 1998-99 | Hankookilbo | 149,921 | New
Korean 1998-99 | Chosunilbo ** | 104,517 | New
Korean 1998-99 | total | 254,438 |
English 1998-99 | EIRB010 Taiwan News | 7,489 | Used in NTCIR-3
English 1998-99 | China Times English News (Taiwan) | 2,715 | Used in NTCIR-3
English 1998-99 | Mainichi Daily News (Japan) | 12,723 | Used in NTCIR-3
English 1998-99 | Korea Times | 19,599 | New
English 1998-99 | Xinhua (AQUAINT) + | 208,168 | New (modified on
English 1998-99 | Hong Kong Standard * | 96,856 | New (modified on
English 1998-99 | total | 347,550 | (modified on 2003-10-29) (modified on 2003-10-30)

* Hong Kong Standard was added on Jun. 05 2003.
** Chosunilbo was added on Jul. 26 2003.
*** Yomiuri and CIRB011 were added on Oct. 1 2003.
+ : Number of documents after discarding the abnormal records.
     The lists of abnormal document IDs are available here [CIRB011] [CIRB020] [Yomiuri] [xinhua] [all]

(b) Document collection for training

Language | Collection | No. of Docs
Chinese 1998-99 | CIRB020 (United Daily News) | 249,508
Chinese 1998-99 | CIRB011 (China Times, China Times Express, Commercial Times, China Daily News, Central and Daily News) | 132,173
Japanese 1998-99 | Mainichi | 220,078
Korean 1994 | *Korea Economic Daily (1994) | 66,146
English 1998-99 | EIRB010 Taiwan News | 7,489
English 1998-99 | China Times English News (Taiwan) | 2,715
English 1998-99 | Mainichi Daily News (Japan) | 12,723

*This document set is not used for evaluation in NTCIR Workshop 4.

The contract form is available at http://research.nii.ac.jp/ntcir/permission/perm-en.html. After signing the required contracts, participants have to send them to NII. See the URL above for details.

Every news article is formatted consistently using a common set of tags. Sample documents are shown in the Appendix.

Mandatory tags
<DOC> </DOC> The tag for each document
<DOCNO> </DOCNO> Document identifier
<LANG> </LANG> Language code: CH, EN, JA, KR
<HEADLINE> </HEADLINE> Title of this news article
<DATE> </DATE> Issue date
<TEXT> </TEXT> Text of news article
Optional tags
<P> </P> Paragraph marker
<SECTION> </SECTION> Section identifier in original newspapers
<AE> </AE> Whether the article contains figures
<WORDS> </WORDS> Number of words in 2 bytes (for Mainichi Newspaper)
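
As an illustration of how the tag set above can be used, the following is a minimal sketch that splits raw text into <DOC> records and extracts the mandatory fields. It assumes the tags are not nested inside themselves and that a regular expression is sufficient; the function name parse_docs and the sample record (including its identifier and date format) are hypothetical and are not taken from the actual collections.

```python
import re

# Mandatory tags listed above (the <DOC> wrapper is handled separately).
MANDATORY_TAGS = ("DOCNO", "LANG", "HEADLINE", "DATE", "TEXT")


def parse_docs(raw):
    """Split raw text into <DOC> records and pull out the mandatory fields."""
    records = []
    for body in re.findall(r"<DOC>(.*?)</DOC>", raw, flags=re.DOTALL):
        record = {}
        for tag in MANDATORY_TAGS:
            match = re.search(rf"<{tag}>(.*?)</{tag}>", body, flags=re.DOTALL)
            record[tag] = match.group(1).strip() if match else None
        # <P> is optional: drop the markers but keep paragraph breaks.
        if record["TEXT"] is not None:
            text = re.sub(r"</P>\s*<P>", "\n", record["TEXT"])
            record["TEXT"] = re.sub(r"</?P>", "", text).strip()
        records.append(record)
    return records


# Hypothetical record, invented for illustration only.
SAMPLE = """<DOC>
<DOCNO>example-0001</DOCNO>
<LANG>EN</LANG>
<HEADLINE>Example headline</HEADLINE>
<DATE>1998-01-01</DATE>
<TEXT>
<P>First paragraph.</P>
<P>Second paragraph.</P>
</TEXT>
</DOC>"""

docs = parse_docs(SAMPLE)
print(docs[0]["DOCNO"], docs[0]["LANG"])
```

A missing mandatory tag is reported as None rather than raising an error, so that irregular records can be inspected instead of stopping the whole pass.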

5.2 Topics

Each topic has four fields: 'T' (TITLE), 'D' (DESC), 'N' (NARR), and 'C' (CONC). The following shows a sample topic.

<TITLE>NBA labor dispute</TITLE>
<DESC>To retrieve the labor dispute between the two parties of the US National Basketball Association at the end of 1998 and the agreement that they reached. </DESC>
<REL>The content of the related documents should include the causes of NBA labor dispute, the relations between the players and the management, main controversial issues of both sides, compromises after negotiation and content of the new agreement, etc. The document will be regarded as irrelevant if it only touched upon the influences of closing the court on each game of the season.</REL>
<CONC>NBA (National Basketball Association), union, team, league, labor dispute, league and union, negotiation, to sign an agreement, salary, lockout, Stern, Bird Regulation.</CONC>

The tags used in topic are shown as follows.

<TOPIC> </TOPIC> The tag for each topic
<NUM> </NUM> Topic identifier
<SLANG> </SLANG> Source language code: CH, EN, JA, KR
<TLANG> </TLANG> Target language code: CH, EN, JA, KR
<TITLE> </TITLE> A concise representation of the information request, composed of a noun or noun phrase.
<DESC> </DESC> A brief description of the information need, composed of one or two sentences.
<NARR> </NARR> A much longer description of the topic. The <NARR> field may have three parts:
(1)<BACK>...</BACK>: background information about the topic.
(2)<REL>...</REL>: further interpretation of the request and proper nouns, the list of relevant or irrelevant items, the specific requirements or limitations of relevant documents, and so on.
(3)<TERM>...</TERM>: definition or explanation of proper nouns, scientific terms and so on.
<CONC> </CONC> Keywords relevant to the whole topic.

Note that the three subfields <BACK>, <REL>, and <TERM> are newly added to the <NARR> field in NTCIR-4.
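
The field letters 'T', 'D', 'N', and 'C' map directly onto topic tags, so the query text for a given combination of fields can be assembled mechanically. The following is a rough sketch under that assumption; the function names parse_topic and query_text are hypothetical, and the sample topic (its number, language codes, and narrative text) is made up for illustration, with only the title taken from the sample above.

```python
import re

# Field letters and the topic tags they stand for.
FIELD_TAGS = {"T": "TITLE", "D": "DESC", "N": "NARR", "C": "CONC"}


def parse_topic(raw):
    """Extract the topic fields; subfield markers inside <NARR> are dropped."""
    topic = {}
    for tag in ("NUM", "SLANG", "TLANG", "TITLE", "DESC", "NARR", "CONC"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", raw, flags=re.DOTALL)
        if match:
            text = re.sub(r"</?(BACK|REL|TERM)>", " ", match.group(1))
            topic[tag] = " ".join(text.split())  # normalize whitespace
    return topic


def query_text(topic, fields):
    """Concatenate the fields named by letters such as 'T', 'D', or 'DN'."""
    return " ".join(topic[FIELD_TAGS[letter]] for letter in fields)


# Hypothetical topic, invented for illustration only.
SAMPLE = """<TOPIC>
<NUM>001</NUM>
<SLANG>EN</SLANG>
<TLANG>JA</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>Find reports on the NBA labor dispute.</DESC>
<NARR><REL>Causes of the dispute are relevant.</REL></NARR>
<CONC>NBA, union, lockout</CONC>
</TOPIC>"""

topic = parse_topic(SAMPLE)
print(query_text(topic, "T"))
print(query_text(topic, "DN"))
```

Raising a KeyError for a missing field is deliberate here: a run that silently omits a required field would be harder to detect later.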

*Topics for evaluation are available from the CLIR download site. (2003/10/01)

6. Types of Runs

Mandatory Runs: T-run and D-run
Each participant must submit two types of run for each combination of topic language and document language(s): a T-run, which uses only the <TITLE> field, and a D-run, which uses only the <DESC> field.

The purpose of asking participants to submit these mandatory runs is to make research findings clear by comparing systems or methods under a unified condition.

Recommended Runs: DN-run
The task organizers also strongly recommend a DN-run, that is, a run using the <DESC> and <NARR> fields.

Optional Runs
Any other combination of fields may be submitted as an optional run according to each participant's research interests, e.g. TDN-run, DC-run, TDNC-run and so on.

Number of Runs
Each participant may submit up to 5 runs in total for each language pair, regardless of run type; these 5 runs may include at most two T-runs and at most two D-runs. A language pair is the combination of topic language and document language(s). For example,
Language combination -> Topic: C and Docs: CJE (C->CJE)
Submission -> two T-runs, a D-run, a DN-run and a TDNC-run (5 runs in total).
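
The limits above, together with the mandatory T-run and D-run, can be checked mechanically before submission. The following is a small sketch, not an official validation tool; the function name check_run_limits and the wording of the messages are hypothetical.

```python
from collections import Counter

MAX_RUNS_PER_PAIR = 5        # total runs allowed per language pair
MAX_PER_MANDATORY_TYPE = 2   # at most two T-runs and at most two D-runs
MANDATORY_TYPES = ("T", "D")


def check_run_limits(run_types):
    """Return a list of rule violations for one language pair (empty if none)."""
    counts = Counter(run_types)
    problems = []
    if len(run_types) > MAX_RUNS_PER_PAIR:
        problems.append(
            f"{len(run_types)} runs submitted; at most {MAX_RUNS_PER_PAIR} allowed"
        )
    for run_type in MANDATORY_TYPES:
        if counts[run_type] == 0:
            problems.append(f"mandatory {run_type}-run is missing")
        elif counts[run_type] > MAX_PER_MANDATORY_TYPE:
            problems.append(
                f"{counts[run_type]} {run_type}-runs submitted; "
                f"at most {MAX_PER_MANDATORY_TYPE} allowed"
            )
    return problems


# The C->CJE example above: two T-runs, a D-run, a DN-run and a TDNC-run.
example = ["T", "T", "D", "DN", "TDNC"]
print(check_run_limits(example))
```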

Identification and Priority of Runs
Each run must be associated with a RunID, which identifies the run. The RunID format is as follows.

GroupID-TopicLanguage-DocumentLanguage(s)-RunType-pp

The 'pp' is two digits representing the priority of the run. It will be used as a parameter for pooling. Participants have to decide the priority of each submitted run separately for each language pair. "01" means the highest priority. For example, suppose a participating group, LIPS, submits 3 runs for C-->CJE: the first is a T-run, the second is a D-run and the third is a DN-run. The RunIDs are then LIPS-C-CJE-T-01, LIPS-C-CJE-D-02, and LIPS-C-CJE-DN-03, respectively. Or, if the group uses two different ranking techniques in T-runs for C-->CJE, the RunIDs have to be LIPS-C-CJE-T-01, LIPS-C-CJE-T-02, and LIPS-C-CJE-D-03.
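
The RunID examples above (e.g. LIPS-C-CJE-DN-03) suggest a simple way to build and check identifiers. The sketch below is inferred from those examples only: it assumes that group IDs contain no hyphen, that languages are written as the single letters C, J, K, and E, and that run types use the letters T, D, N, and C. The names make_run_id and parse_run_id are hypothetical.

```python
import re

# Pattern inferred from the examples: group, topic language,
# document language(s), run type, two-digit priority.
RUN_ID = re.compile(
    r"^(?P<group>[^-]+)-(?P<topic>[CJKE])-(?P<docs>[CJKE]+)"
    r"-(?P<run_type>[TDNC]+)-(?P<priority>\d{2})$"
)


def make_run_id(group, topic_lang, doc_langs, run_type, priority):
    """Build a RunID with the priority zero-padded to two digits."""
    return f"{group}-{topic_lang}-{doc_langs}-{run_type}-{priority:02d}"


def parse_run_id(run_id):
    """Split a RunID into its parts, or raise ValueError if it is malformed."""
    match = RUN_ID.match(run_id)
    if match is None:
        raise ValueError(f"malformed RunID: {run_id!r}")
    fields = match.groupdict()
    fields["priority"] = int(fields["priority"])
    return fields


run_id = make_run_id("LIPS", "C", "CJE", "DN", 3)
print(run_id)
print(parse_run_id(run_id))
```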

7. Evaluation

Relevance judgments will be made in four grades: Highly Relevant, Relevant, Partially Relevant, and Irrelevant. Evaluation will be done using trec_eval, which is run at two different thresholds of relevance level. In addition, newly proposed metrics for multi-grade relevance judgments, weighted R precision and weighted average precision, which basically reward systems that retrieve more relevant documents at higher ranks, may be employed.
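
To illustrate what evaluating at two relevance thresholds means, the sketch below converts four-grade judgments into binary relevance at a stricter and a looser cut-off and computes (non-interpolated) average precision for each. It is not trec_eval itself, and where exactly the organizers place the two thresholds is an assumption made here for illustration; the document IDs and judgments are invented.

```python
# Numeric levels for the four grades (the numbers are an assumption).
GRADES = {
    "Highly Relevant": 3,
    "Relevant": 2,
    "Partially Relevant": 1,
    "Irrelevant": 0,
}


def binarize(judgments, min_grade):
    """Return the set of documents judged at or above the given grade level."""
    return {doc for doc, grade in judgments.items() if GRADES[grade] >= min_grade}


def average_precision(ranking, relevant):
    """Non-interpolated average precision of a ranked list."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Invented judgments and ranking for one topic.
judgments = {
    "d1": "Highly Relevant",
    "d2": "Partially Relevant",
    "d3": "Irrelevant",
    "d4": "Relevant",
}
ranking = ["d2", "d1", "d3", "d4"]

strict = average_precision(ranking, binarize(judgments, min_grade=2))
relaxed = average_precision(ranking, binarize(judgments, min_grade=1))
print(strict, relaxed)
```

The same ranking scores differently under the two cut-offs because the partially relevant document at rank 1 counts only under the looser one.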

8. Schedule

2003-03-20 Application Due
2003-03-30 Data (Document sets) Release
2003-10-01 Distribution of Search Topics
2003-11-01 Submission of Search Results* ---> Extended to 2003-11-08
2004-02-20 Delivery of Evaluation Results
2004-03-19 Paper Due (for Working Notes)
2004-05 NTCIR Workshop 4 (Conference)
2004-09-01 Paper Due (for Formal Proceedings)

*Please see the Submission Guideline.

9. CLIR Task Executive Committee (Task Organizers)

Hsin-Hsi Chen, Taiwan
Kuang-hua Chen, Taiwan (co-chair)
Koji Eguchi, Japan
Noriko Kando, Japan
Kazuaki Kishida, Japan (co-chair)
Kazuko Kuriyama, Japan
Suk-Hoon Lee, Korea (co-chair)
Sung Hyon Myaeng, Korea
(in alphabetical order of family names)

10. Contact Information

If you have a question, please contact the task organizers.

E-mail: ntcadm-clir

1. Sample of Chinese Document Record

2. Sample of Japanese Document Record

3. Sample of Korean Document Record

