
Call For Participation
Cross-Language Retrieval Task in NTCIR Workshop 3

The 3rd NTCIR Workshop is over.
For the current NTCIR Workshop (the latest in the series), please visit the NTCIR Workshop site.

NTCIR: http://research.nii.ac.jp/ntcir/
CLIR: http://research.nii.ac.jp/workshop/clir/

(Jan. 30, 2002)


For participants (access controlled):
Submission Instructions

1. Introduction

In order to promote research on cross-language information retrieval (CLIR), the program committee of the NTCIR 3 workshop is happy to announce a CLIR task. The CLIR task comprises three tracks: 1) Multilingual CLIR, 2) Bilingual CLIR, and 3) Single Language IR (non-English IR). Participants from all over the world are welcome, as are both automatic and interactive systems. Participants are asked to report their system specifications and technical details. Please visit the official web site (http://research.nii.ac.jp/ntcir/) for further information. Online registration is available at http://research.nii.ac.jp/ntcir/workshop/application-en.html.

Basically, four languages are involved in the CLIR task: Chinese, English, Japanese, and Korean. Please note that other languages may be added later if their preparation is completed on schedule.

2. CLIR Task Executives Committee

Dr. Hsin-Hsi Chen, Taiwan (Co-chair)
Dr. Kuang-hua Chen, Taiwan (Co-chair)
Dr. Koji Eguchi, Japan
Dr. Noriko Kando, Japan
Dr. Hyeon Kim, Korea
Dr. Kazuaki Kishida, Japan
Dr. Kazuko Kuriyama, Japan
Dr. Suk-Hoon Lee, Korea
Dr. Sung Hyon Myaeng, Korea

(in alphabetical order of family name)

3. Tracks

The CLIR task provides three tracks. Participants may take part in any one, any two, or all three of them.

Multilingual CLIR (MLIR)

The topic set and document set of the MLIR track involve more than two languages, so participants have to resolve the complications of handling multiple languages at once. Since the publication dates of the Korean documents are not parallel to those of the Chinese, English, and Japanese documents (see the Document Set section below), the document set used in MLIR consists of Chinese, English, and Japanese only. However, the topics may be in Chinese, English, Japanese, or Korean: we will translate the topics of the other languages into Korean, so participants can also run the K-->CEJ track. The following table depicts the MLIR track.

Topic Set     Document Set          Topic Set     Document Set
C         --> C, J, E               C         --> J, E
E         --> C, J, E               E         --> J, E
J         --> C, J, E               J         --> J, E
K         --> C, J, E               K         --> J, E
C         --> C, J                  C         --> C, E
E         --> C, J                  E         --> C, E
J         --> C, J                  J         --> C, E
K         --> C, J                  K         --> C, E

Bilingual CLIR (BLIR)

The topic set and document set of the BLIR track consist of two languages. This track is less complex than MLIR, but participants still have to resolve cross-language issues. Please note that the topics in the K-->J and K-->C tracks are translated from the original Japanese and Chinese topics, while the topics in the C-->K, E-->K, and J-->K tracks are translated from the Korean topics. The following table depicts the BLIR track.

Topic Set     Document Set
C         --> J
E         --> J
K         --> J
C         --> K
E         --> K
J         --> K
J         --> C
K         --> C
E         --> C

Single Language IR (SLIR)

The topic set and document set of the SLIR track are in a single language. The following table depicts the SLIR track. Note that an E-->E track is not offered.

Topic Set     Document Set
C         --> C
J         --> J
K         --> K

4. Language Resources

We do not provide multilingual dictionaries or segmentation tools for Chinese, Japanese, and Korean; participants have to construct the tools and dictionaries they need. However, we will prepare a list of CJK language resources and post it on the NTCIR official web site, so that participants can download or purchase resources from the providers.

5. Test Collection

The test collection used in the CLIR task is composed of a document set and a topic set. A brief description of each follows.

Document Set

The documents used in CLIR are news articles collected from news agencies in different countries, as summarized below.

Country    Collection (Period): Language                        # Documents
Japan      Mainichi Newspaper (1998-1999): Japanese                 236,664
Japan      Mainichi Daily News (1998-1999): English                  12,723
Korea      Korea Economic Daily (1994): Korean                       66,146
Taiwan     CIRB010 (1998-1999): Chinese                             132,173
Taiwan     CIRB011 (1998-1999): Chinese                             132,173
Taiwan     United Daily News (udn) (1998-1999): Chinese             249,508
Taiwan     Taiwan News (tns) (1998-1999): English                     7,489
Taiwan     Chinatimes English News (ctg) (1998-1999): English         2,715


Participants have to sign separate contracts to use these materials, and each contract has its own requirements. We ask participants to bear with the complicated copyright situations in the different countries.


The period of permitted use of the Mainichi Newspaper and Mainichi Daily News collections (the document collections from Japan) is from 2001-09-01 to 2003-09-30. Active participants who submit results and are affiliated with organizations outside Japan will be able to extend the period up to 2008-09-30. After the permitted period ends, participants must either delete all of the document data or purchase the data from Mainichi Newspapers Co. and obtain permission from the company for research use.

Participants have to sign a contract with Dr. Kuang-hua Chen to use the United Daily News collection (one of the document collections from Taiwan). Note that udn.com (the company behind United Daily News) reserves the right to reject a contract, although it will normally approve them. Participants also have to sign a contract with Dr. Kuang-hua Chen to use CIRB010, CIRB011, Taiwan News, and Chinatimes English News.

Participants have to sign a contract with Prof. Sung Hyon Myaeng and will then be granted the right to use the Korea Economic Daily collection for two years, with possible extensions.

All contract forms are available at http://research.nii.ac.jp/ntcir/permission/perm-en.html. After signing the needed contracts, participants have to send them to NII; see the URL above for details.

Each news article is formatted consistently using a common tag set. Sample documents are shown in the Appendix.

The tag set is as follows.

Mandatory tags

<DOC>       </DOC>        Encloses each document
<DOCNO>     </DOCNO>      Document identifier
<LANG>      </LANG>       Language code: CH, EN, JA, KR
<HEADLINE>  </HEADLINE>   Title of the news article
<DATE>      </DATE>       Issue date
<TEXT>      </TEXT>       Text of the news article

Optional tags

<P>         </P>          Paragraph marker
<SECTION>   </SECTION>    Section identifier in the original newspaper
<AE>        </AE>         Whether the article contains figures
<WORDS>     </WORDS>      Number of 2-byte words (for Mainichi Newspaper)
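
For illustration only, a minimal parsing sketch in Python follows. It assumes (hypothetically) that a collection file holds concatenated <DOC>...</DOC> records in plain text; the function and file names are our own, and the appropriate file encoding will depend on the collection.

import re

# Pattern for whole records. Documents are assumed (hypothetically) to be
# concatenated <DOC>...</DOC> records in a single text file.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
FIELDS = ("DOCNO", "LANG", "HEADLINE", "DATE", "TEXT")

def parse_documents(path, encoding="utf-8"):
    """Yield one {field: text} dict per <DOC> record in the file."""
    with open(path, encoding=encoding) as f:
        data = f.read()
    for doc in DOC_RE.finditer(data):
        body = doc.group(1)
        record = {}
        for tag in FIELDS:
            m = re.search(rf"<{tag}>(.*?)</{tag}>", body, re.DOTALL)
            record[tag] = m.group(1).strip() if m else None
        yield record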

Topic Set

The following shows a sample topic.

<TOPIC>
<NUM>013</NUM>
<SLANG>CH</SLANG>
<TLANG>EN</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>
To retrieve the labor dispute between the two parties of the US National Basketball Association at the end of 1998 and the agreement that they reached.
</DESC>
<NARR>
The content of the related documents should include the causes of the NBA labor dispute, the relations between the players and the management, the main controversial issues for both sides, the compromises reached after negotiation, the content of the new agreement, and so on. A document will be regarded as irrelevant if it only touches upon the influence of closing the courts on individual games of the season.
</NARR>
<CONC>
NBA (National Basketball Association), union, team, league, labor dispute, league and union, negotiation, to sign an agreement, salary, lockout, Stern, Bird Regulation.
</CONC>
</TOPIC>

The tags used in topics are as follows.

<TOPIC>   </TOPIC>   Encloses each topic
<NUM>     </NUM>     Topic identifier
<SLANG>   </SLANG>   Source language code: CH, EN, JA, KR
<TLANG>   </TLANG>   Target language code: CH, EN, JA, KR
<TITLE>   </TITLE>   A concise representation of the information request, composed of a noun or noun phrase
<DESC>    </DESC>    A brief description of the information need, composed of one or two sentences
<NARR>    </NARR>    A detailed description of the topic: further interpretation of the request and its proper nouns, lists of relevant or irrelevant items, specific requirements or limitations on relevant documents, and so on
<CONC>    </CONC>    Keywords relevant to the whole topic
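
As a sketch only, topic fields can be extracted in the same way as document fields, and the text used to build a query can then be selected by run type (see Types of Runs below). The helper names here are our own, not part of the task specification.

import re

TOPIC_RE = re.compile(r"<TOPIC>(.*?)</TOPIC>", re.DOTALL)

# Run-type letters (see Types of Runs below) mapped to topic fields.
FIELD_FOR_LETTER = {"T": "TITLE", "D": "DESC", "N": "NARR", "C": "CONC"}

def parse_topics(path, encoding="utf-8"):
    """Yield a {tag: text} dict for each <TOPIC> record in the file."""
    with open(path, encoding=encoding) as f:
        data = f.read()
    for m in TOPIC_RE.finditer(data):
        body = m.group(1)
        topic = {}
        for tag in ("NUM", "SLANG", "TLANG", "TITLE", "DESC", "NARR", "CONC"):
            found = re.search(rf"<{tag}>(.*?)</{tag}>", body, re.DOTALL)
            topic[tag] = found.group(1).strip() if found else None
        yield topic

def query_text(topic, run_type):
    """Concatenate the topic fields named by a run type such as 'TD'."""
    parts = (topic[FIELD_FOR_LETTER[c]] for c in run_type)
    return " ".join(p for p in parts if p)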

6. Types of Runs

Basically, we allow runs using any combination of the topic fields, and we use 'T' (TITLE), 'D' (DESC), 'N' (NARR), 'C' (CONC), and their combinations to name the run types. That is, participants can submit T, D, N, C, TD, TN, TC, DN, DC, NC, TDN, TDC, TNC, DNC, and TDNC runs. Each participant can submit up to 3 runs for each language pair, regardless of run type; a language pair is a combination of topic language and document language(s). The D run type is mandatory: each participant has to submit at least one D run per language pair. Each run has to be labeled with a RunID, an identifier with the following format.

Group's ID-Topic Language-Document Language-Run Type-pp

Here 'pp' is a two-digit number representing the priority of the run, which will be used as a parameter for pooling. Participants have to decide the priority of each submitted run within each language pair; "01" denotes the highest priority. For example, suppose a participating group, LIPS, submits 3 runs for the C-->CJ track: a D run, a DN run, and a TD run. The RunIDs are then LIPS-C-CJ-D-01, LIPS-C-CJ-DN-02, and LIPS-C-CJ-TD-03, respectively. Similarly, if the group submits three D runs using different ranking techniques for the C-->CJ track, the RunIDs have to be LIPS-C-CJ-D-01, LIPS-C-CJ-D-02, and LIPS-C-CJ-D-03.
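
As a sanity check before submission, a RunID can be assembled and validated mechanically. A minimal sketch follows; the regular expression encodes our reading of the format above and is not an official checker.

import re

# Our reading of the RunID grammar: group ID, topic language, document
# language(s), run type (a combination of T/D/N/C), two-digit priority.
RUNID_RE = re.compile(r"^[A-Za-z0-9]+-[CEJK]-[CEJK]{1,3}-[TDNC]{1,4}-\d{2}$")

def make_run_id(group, topic_lang, doc_langs, run_type, priority):
    """Build a RunID such as 'LIPS-C-CJ-DN-02'."""
    run_id = f"{group}-{topic_lang}-{doc_langs}-{run_type}-{priority:02d}"
    if not RUNID_RE.match(run_id):
        raise ValueError(f"malformed RunID: {run_id}")
    return run_id

print(make_run_id("LIPS", "C", "CJ", "DN", 2))  # LIPS-C-CJ-DN-02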

7. Evaluation

Relevance judgments will be made in four grades: Highly Relevant, Relevant, Partially Relevant, and Irrelevant. Evaluation will be done using:

1) trec_eval, run at two different thresholds of the relevance levels; and

2) newly proposed metrics for multigrade relevance judgments, weighted R-precision and weighted average precision, which basically give credit to systems that retrieve more relevant documents at higher ranks.
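
To make the two-threshold trec_eval evaluation concrete, here is a hedged sketch that binarizes the four grades into relevance judgment (qrels) files at a strict threshold (Highly Relevant and Relevant only) and a lenient one (adding Partially Relevant). The threshold names, grade scores, and file layout are our assumptions for illustration, not the official tooling.

# Binarize four-grade judgments for trec_eval at two thresholds.
# Grade scores, threshold names, and the qrels layout ("topic 0 docno rel")
# are illustrative assumptions, not the official NTCIR tooling.
GRADES = {"Highly Relevant": 3, "Relevant": 2,
          "Partially Relevant": 1, "Irrelevant": 0}
THRESHOLDS = {"strict": 2, "lenient": 1}  # minimum grade counted as relevant

def write_qrels(judgments, threshold_name, out_path):
    """judgments: iterable of (topic_id, docno, grade_label) triples."""
    cutoff = THRESHOLDS[threshold_name]
    with open(out_path, "w") as out:
        for topic_id, docno, grade in judgments:
            rel = 1 if GRADES[grade] >= cutoff else 0
            out.write(f"{topic_id} 0 {docno} {rel}\n")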

8. Schedule


2001-09-30 Application Due
2001-10-01 Deliver Dry Run Data
2001-10-30 Distribution of Dry Run Search Topics
2001-11-12 Submit Search Results of Dry Run
2001-12-07 Deliver Evaluation Results of Dry Run
2002-01-08 Distribution of Formal Run Search Topics
2002-02-04 Submit Search Results of Formal Run
2002-07-01 Deliver Evaluation Results of Formal Run
2002-08-20 Paper Due
2002-10-08 to 2002-10-10 NTCIR Workshop 3


Appendix


1. Chinese Document

2. Japanese Document

3. Korean Document

