Call For Participation
Cross-Lingual Information Retrieval Task in NTCIR Workshop 4
(October. 31 2003)
Apr. 17 2003 - The numbers of docs of Hankookilbo and Korea Times are modified
(see 5.1(a))
Jun. 05 2003 - A new document set, Hong Kong Standard, is added (see 5.1(a)).
Jul. 26 2003 - A new Korean document set, Chosunilbo, is added (see 5.1(a)).
Oct. 01 2003 - New document sets, Yomiuri and CIRB011, are added for evaluation
(see 5.1(a))
Oct. 01 2003 - Formal run topics are available. (clir password protected download area)
Oct. 10 2003 - FAQ is available. Please confirm you have all the document collections you
need!
Oct. 15 2003 - Submission Guideline is available.
Oct. 28 2003 - Wrong specification in the table on the document languages for Small-MLIR was corrected. (see 2. MLIR)
Oct. 29 2003 - Abnormal document lists were added: [CIRB011] [CIRB020] [Yomiuri] [xinhua] [all]
Oct. 29 2003 - Numbers of documents in CIRB011, CIRB020, Yomiuri, and xinhua
were corrected after discarding the abnormal documents. (see 5.1 (a))
Oct. 29 2003 - Submission Deadline is extended to: Saturday, Nov.8, 2003.
Oct. 29 2003 - Announcement on "HK Standard duplication" was posted.
Oct. 30 2003 - Number of documents "total of English 1998-99"
was corrected. (see 5.1 (a))
Oct. 30 2003 - Policy on the Treatment of the Abnormal Document Records was posted.
Oct. 31 2003 - Submission Guideline is updated. How to submit was added.
[1.Intro][2.Subtasks][MLIR][BLIR][PLIR][SLIR][3.Interests][4.Resources][5.Test Collection][Doc][Topic]
[6.Type of Runs][7.Evaluation][8.Schedule][9.Organizers][10.Contact][Appendix]
The cross-lingual information retrieval (CLIR) task of NTCIR Workshop 4 includes four subtasks
(MLIR, BLIR, PLIR, and SLIR) for promoting research on East Asian languages (Chinese, Japanese, and
Korean).
Note:
a) See also the official web site (http://research.nii.ac.jp/ntcir/) for further information.
b) Online registration is available at the web site (http://research.nii.ac.jp/ntcir/ntcir4-ws/howto-en.html).
The CLIR task provides four subtasks. Participants may take part in any one, two, three, or all four of them.
Multilingual CLIR (MLIR)
The topic set and document set of the MLIR subtask involve more than two
languages. In NTCIR Workshop 4, participants may submit runs for two types
of multilingual document collection.
For the topic set, participants may use any of the four languages (C, J, K, and E). The following depicts the MLIR subtask.
Topic set | | Document set |
C | -> | CJKE |
J | -> | CJKE |
K | -> | CJKE |
E | -> | CJKE |
C | -> | |
J | -> | |
K | -> | |
E | -> |
Bilingual CLIR (BLIR)
The topic set and document set of the BLIR subtask are in two different languages.
For example, for a K-->J run (from Korean topics to Japanese documents),
the topics need to be translated into Japanese (or the documents
into Korean).
Note:
For BLIR at NTCIR Workshop 4, participants are not allowed to
submit runs using topics written in English, except when
trying the pivot language approach (see below).
The following depicts the BLIR subtask.
Topic set | | Document set |
C | -> | J |
C | -> | K |
C | -> | E |
J | -> | C |
J | -> | K |
J | -> | E |
K | -> | C |
K | -> | J |
K | -> | E |
Pivot Bilingual CLIR (PLIR)
This subtask is a new challenge at NTCIR-4. The pivot language approach (or
trans-lingual approach) is a special form of BLIR that consists of
two steps. For example, a C -> J search is done as C -> E followed by
E -> J (i.e., C -> E -> J); in this case, English serves as an
intermediate, or pivot, language.
Note:
The participants submitting runs for this subtask are allowed to also submit
BLIR runs using English topics (i.e., E -> C or J or K) in order to
analyze performance of the approach.
Topic set | | Document set |
C | -> | J |
C | -> | K |
C | -> | E |
J | -> | C |
J | -> | K |
J | -> | E |
K | -> | C |
K | -> | J |
K | -> | E |
E | -> | C |
E | -> | J |
E | -> | K |
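The two-step idea behind the pivot language approach can be illustrated with a minimal Python sketch of dictionary-based term translation through English. The lexicons `ZH_EN` and `EN_JA` and both function names are hypothetical toy examples, not resources or tools provided by the task; a real system would more likely use full bilingual term lists or machine translation.

```python
from typing import Dict, List

# Toy lexicons, invented for illustration only (assumption).
ZH_EN: Dict[str, List[str]] = {"罷工": ["strike"]}
EN_JA: Dict[str, List[str]] = {"strike": ["ストライキ", "打撃"]}


def translate_terms(terms: List[str], lexicon: Dict[str, List[str]]) -> List[str]:
    """Replace each term by all of its lexicon translations, without duplicates."""
    translated: List[str] = []
    for term in terms:
        # Terms missing from the lexicon are silently dropped in this sketch.
        for candidate in lexicon.get(term, []):
            if candidate not in translated:
                translated.append(candidate)
    return translated


def pivot_translate(
    terms: List[str],
    source_to_pivot: Dict[str, List[str]],
    pivot_to_target: Dict[str, List[str]],
) -> List[str]:
    """Translate source terms into the target language via a pivot language."""
    pivot_terms = translate_terms(terms, source_to_pivot)   # e.g. C -> E
    return translate_terms(pivot_terms, pivot_to_target)    # e.g. E -> J


japanese_terms = pivot_translate(["罷工"], ZH_EN, EN_JA)
```

In this toy case the English pivot term is ambiguous, so the Japanese side receives both a labor-related candidate and an unrelated one, which shows how translation ambiguity can accumulate over the two steps.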
Single Language IR (SLIR)
The topic set and document set of SLIR subtask consist of a single language.
The following depicts the SLIR subtask.
Topic set | | Document set |
C | -> | C |
J | -> | J |
K | -> | K |
E | -> | E |
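For reference, the single-document-language pairs in the BLIR, PLIR, and SLIR tables can be enumerated with a short Python sketch. The names `TOPIC_LANGS`, `DOC_LANGS`, and `language_pairs` are illustrative assumptions; MLIR is left out because its document sets are multilingual collections rather than single languages.

```python
from typing import Dict, List, Tuple

# Topic languages per subtask, read off the tables above.
TOPIC_LANGS: Dict[str, str] = {"BLIR": "CJK", "PLIR": "CJKE", "SLIR": "CJKE"}
DOC_LANGS = "CJKE"


def language_pairs(subtask: str) -> List[Tuple[str, str]]:
    """Return (topic language, document language) pairs for a subtask."""
    topics = TOPIC_LANGS[subtask]
    if subtask == "SLIR":
        # Single-language retrieval: topic and documents share one language.
        return [(lang, lang) for lang in topics]
    # Bilingual subtasks: topic and document languages must differ.
    return [(t, d) for t in topics for d in DOC_LANGS if t != d]


blir_pairs = language_pairs("BLIR")
plir_pairs = language_pairs("PLIR")
```

The only difference between the BLIR and PLIR lists is that PLIR also admits English topics, matching the note that English-topic runs are allowed only alongside pivot runs.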
In NTCIR Workshop 4, participants are encouraged to explore the following
special issues.
Pivot Language Approach
Since the approach seems to be realistic, the participants are strongly
recommended to try this approach for BLIR using English as a pivot language.
More MLIR!
MLIR should be studied more intensively for developing a search engine
working in a real situation on the Internet. The submission of MLIR runs
is strongly recommended.
Term Disambiguation in More Realistic Situations
Needless to say, an important issue for enhancing CLIR performance is term
disambiguation. In order to promote development of term disambiguation
techniques, the "title-only run" becomes mandatory in NTCIR-4.
Sharing Knowledge on Language Model
Recently, the so-called "language model" has often been used for CLIR. The model
may give us a good opportunity to enhance the theoretical or practical aspects
of CLIR.
Links to some resources are available from the list of language resources.
Note:
The task organizers would like to continue making efforts for providing
useful resources such as bilingual term lists, parallel corpora, or translation
probability tables. If you have a language resource that can be shared
among all participants, please tell us about it.
The test collection used in the CLIR task is composed of a document set and a topic set. The following gives a brief description of each set. Note that a new document set may be added later.
(a)Document collection for evaluation
Language | Collection | No. of Docs | Note |
NTCIR-4 CLIR |
Chinese 1998-99 | CIRB020 (United Daily News) + | 249,203 | Used in NTCIR-3 (modified on 2003-10-29) |
Chinese 1998-99 | CIRB011 (China Times, China Times Express, Commercial Times, China Daily News, Central and Daily News ) ***, + | 132,172 | Used in NTCIR-3 (modified on 2003-10-29) |
Chinese 1998-99 | total | 381,375 | (modified on 2003-10-29) |
Japanese 1998-99 | Mainichi | 220,078 | Used in NTCIR-3 |
Japanese 1998-99 | Yomiuri ***, + | 373,558 | New (modified on 2003-10-29) |
Japanese 1998-99 | total | 593,636 | (modified on 2003-10-29) |
Korean 1998-99 | Hankookilbo | 149,921 | New |
Korean 1998-99 | Chosunilbo ** | 104,517 | New |
Korean 1998-99 | total | 254,438 | |
English 1998-99 | EIRB010: Taiwan News | 7,489 | Used in NTCIR-3 |
English 1998-99 | China Times English News (Taiwan) | 2,715 | Used in NTCIR-3 |
English 1998-99 | Mainichi Daily News (Japan) | 12,723 | Used in NTCIR-3 |
English 1998-99 | Korea Times | 19,599 | New |
English 1998-99 | Xinhua (AQUAINT) + | 208,167 | New (modified on 2003-10-29) |
English 1998-99 | Hong Kong Standard * | 96,683 | New (modified on 2003-10-29) |
English 1998-99 | total | 347,376 | (modified on 2003-10-29) (modified on 2003-10-30) |
* Hong Kong Standard was added on Jun. 05 2003.
** Chosunilbo was added on Jul. 26 2003.
*** Yomiuri and CIRB011 were added on Oct. 1 2003.
+ : the number of documents is the count after discarding the abnormal records.
The lists of abnormal document IDs are available
here [CIRB011] [CIRB020] [Yomiuri] [xinhua] [all]
(b)Document collection for training
Language | Collection | No. of Docs |
NTCIR-3 CLIR |
Chinese 1998-99 | CIRB020 (United Daily News) | 249,508 |
Chinese 1998-99 | CIRB011 (China Times, China Times Express, Commercial Times, China Daily News, Central and Daily News ) | 132,173 |
Japanese 1998-99 | Mainichi | 220,078 |
Korean 1994 | Korea Economic Daily (1994) * | 66,146 |
English 1998-99 | EIRB010: Taiwan News | 7,489 |
English 1998-99 | China Times English News (Taiwan) | 2,715 |
English 1998-99 | Mainichi Daily News (Japan) | 12,723 |
* This document set is not used for evaluation in NTCIR Workshop 4.
The contract form is available at http://research.nii.ac.jp/ntcir/permission/perm-en.html. After signing the necessary contracts, participants must send them to NII. See the URL above for details.
Every news article follows the same format, marked up with a common set of tags. Sample documents are shown in the Appendix.
Mandatory tags | ||
<DOC> | </DOC> | The tag for each document |
<DOCNO> | </DOCNO> | Document identifier |
<LANG> | </LANG> | Language code: CH, EN, JA, KR |
<HEADLINE> | </HEADLINE> | Title of this news article |
<DATE> | </DATE> | Issue date |
<TEXT> | </TEXT> | Text of news article |
Optional tags | ||
<P> | </P> | Paragraph marker |
<SECTION> | </SECTION> | Section identifier in original newspapers |
<AE> | </AE> | Whether the article contains figures or not |
<WORDS> | </WORDS> | Number of words in 2 bytes (for Mainichi Newspaper) |
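As an illustration of the tag set above, the following Python sketch extracts the mandatory fields from one document record. The record in `SAMPLE_RECORD` is invented for this example (its DOCNO and date format are assumptions, not taken from the collections), and regular expressions are used on the assumption that the records are SGML-like text that may not parse as strict XML.

```python
import re
from typing import Dict, List

MANDATORY_TAGS = ("DOCNO", "LANG", "HEADLINE", "DATE", "TEXT")

# Hypothetical record; real records are shown in the Appendix.
SAMPLE_RECORD = """<DOC>
<DOCNO>sample-0001</DOCNO>
<LANG>EN</LANG>
<HEADLINE>Sample headline</HEADLINE>
<DATE>1998-01-01</DATE>
<TEXT>
<P>First paragraph.</P>
<P>Second paragraph.</P>
</TEXT>
</DOC>"""


def parse_document(record: str) -> Dict[str, str]:
    """Return the mandatory fields of one <DOC> record as a dictionary."""
    fields: Dict[str, str] = {}
    for tag in MANDATORY_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", record, re.DOTALL)
        if match is None:
            raise ValueError(f"missing mandatory tag <{tag}>")
        fields[tag] = match.group(1).strip()
    return fields


def paragraphs(text_field: str) -> List[str]:
    """Split a <TEXT> field on the optional <P> markers, if any are present."""
    found = re.findall(r"<P>(.*?)</P>", text_field, re.DOTALL)
    return [p.strip() for p in found] if found else [text_field.strip()]


doc = parse_document(SAMPLE_RECORD)
doc_paragraphs = paragraphs(doc["TEXT"])
```

Because `<P>` is optional, `paragraphs` falls back to treating the whole text as a single paragraph when no markers are found.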
Each topic has four fields; 'T' (TITLE), 'D' (DESC), 'N' (NARR), 'C' (CONC). The following shows a sample topic.
<TOPIC>
<NUM>013</NUM>
<SLANG>CH</SLANG>
<TLANG>EN</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>To retrieve the labor dispute between the two parties of the US National Basketball Association at the end of 1998 and the agreement that they reached. </DESC>
<NARR>
<REL>The content of the related documents should include the causes of NBA labor dispute, the relations between the players and the management, main controversial issues of both sides, compromises after negotiation and content of the new agreement, etc. The document will be regarded as irrelevant if it only touched upon the influences of closing the court on each game of the season.</REL>
</NARR>
<CONC>NBA (National Basketball Association), union, team, league, labor dispute, league and union, negotiation, to sign an agreement, salary, lockout, Stern, Bird Regulation.</CONC>
</TOPIC>
The tags used in a topic are as follows.
<TOPIC> | </TOPIC> | The tag for each topic |
<NUM> | </NUM> | Topic identifier |
<SLANG> | </SLANG> | Source language code: CH, EN, JA, KR |
<TLANG> | </TLANG> | Target language code: CH, EN, JA, KR |
<TITLE> | </TITLE> | A concise representation of the information request, consisting of a noun or noun phrase. |
<DESC> | </DESC> | A brief description of the information need, consisting of one or two sentences. |
<NARR> | </NARR> | A much longer description of the topic. The <NARR> may have three parts: (1) <BACK>...</BACK>: background information about the topic. (2) <REL>...</REL>: further interpretation of the request and proper nouns, the list of relevant or irrelevant items, the specific requirements or limitations of relevant documents, and so on. (3) <TERM>...</TERM>: definitions or explanations of proper nouns, scientific terms, and so on. |
<CONC> | </CONC> | Keywords relevant to the whole topic. |
Note that the three subfields <BACK>, <REL>, and <TERM> are newly added to the <NARR> field in NTCIR-4.
*Topics for evaluation are available from the CLIR download site. (2003/10/01)
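The topic format described above can be read with a few lines of Python. The sketch below is only one possible approach: the function names are hypothetical, the topic in `SAMPLE_TOPIC` is an abridged version of the sample shown earlier, and regular expressions are used on the assumption that topic files are SGML-like text rather than strict XML.

```python
import re
from typing import Dict, Optional

TOP_LEVEL_TAGS = ("NUM", "SLANG", "TLANG", "TITLE", "DESC", "NARR", "CONC")
NARR_SUBFIELDS = ("BACK", "REL", "TERM")

# Abridged from the sample topic above.
SAMPLE_TOPIC = """<TOPIC>
<NUM>013</NUM>
<SLANG>CH</SLANG>
<TLANG>EN</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>To retrieve the labor dispute between the two parties of the US
National Basketball Association at the end of 1998 and the agreement
that they reached. </DESC>
<NARR>
<REL>The content of the related documents should include the causes of
NBA labor dispute.</REL>
</NARR>
<CONC>NBA (National Basketball Association), union, team, league</CONC>
</TOPIC>"""


def extract(tag: str, text: str) -> Optional[str]:
    """Return the whitespace-normalized content of the first <tag> element."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return " ".join(match.group(1).split()) if match else None


def parse_topic(topic_text: str) -> Dict[str, str]:
    """Collect the top-level fields and any <NARR> subfields of one topic."""
    topic: Dict[str, str] = {}
    for tag in TOP_LEVEL_TAGS:
        value = extract(tag, topic_text)
        if value is not None:
            topic[tag] = value
    # <BACK>, <REL>, and <TERM> are optional parts nested inside <NARR>.
    for tag in NARR_SUBFIELDS:
        value = extract(tag, topic.get("NARR", ""))
        if value is not None:
            topic[tag] = value
    return topic


topic = parse_topic(SAMPLE_TOPIC)
```

Subfields that do not occur in a topic, such as `<BACK>` and `<TERM>` in this sample, are simply absent from the returned dictionary.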
Mandatory Runs: T-run and D-run
Each participant must submit two types of run for each combination of topic
language and document language(s): a T-run, which uses only the <TITLE> field,
and a D-run, which uses only the <DESC> field.
The purpose of asking participants to submit these mandatory runs is to make research findings clear by comparing systems or methods under a unified condition.
Recommended Runs: DN-run
In addition, the task organizers strongly recommend a DN-run, which
is a run using the <DESC> and <NARR> fields.
Optional Runs
Any other combinations of fields may be submitted as optional runs
according to each participant's research interests, e.g., TDN-run, DC-run,
TDNC-run, and so on.
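The run-type names map directly onto topic fields, so the query text for any run type can be assembled mechanically. The following Python sketch assumes a topic already parsed into a dictionary keyed by tag name; the function name and the sample values are hypothetical.

```python
import re
from typing import Dict

# One letter per topic field, as used in run-type names such as "DN" or "TDNC".
FIELD_BY_LETTER = {"T": "TITLE", "D": "DESC", "N": "NARR", "C": "CONC"}


def build_query_text(topic: Dict[str, str], run_type: str) -> str:
    """Concatenate the topic fields selected by a run type such as 'T' or 'DN'."""
    parts = []
    for letter in run_type:
        field = FIELD_BY_LETTER[letter]  # raises KeyError for an unknown letter
        # Drop nested tags such as <REL> so only the text remains.
        text = re.sub(r"<[^>]+>", " ", topic.get(field, ""))
        parts.append(" ".join(text.split()))
    return " ".join(part for part in parts if part)


example_topic = {
    "TITLE": "NBA labor dispute",
    "DESC": "To retrieve the labor dispute.",
    "NARR": "<REL>Causes of the dispute.</REL>",
    "CONC": "NBA, union",
}
t_query = build_query_text(example_topic, "T")
dn_query = build_query_text(example_topic, "DN")
```

How the resulting text is tokenized, translated, and weighted is left to each system; the sketch covers only field selection.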
Number of Runs
Each participant can submit up to 5 runs in total for each language pair
regardless of the type of run. Of these 5 runs, at most two may be T-runs
and at most two may be D-runs. The language pair means the combination of
topic language and document language(s).
For example,
Language combination -> Topic: C and Docs: CJE (C->CJE)
Submission -> two T-runs, a D-run, a DN-run and a TDNC run (5 runs in
total).
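A planned submission for one language pair can be checked against these limits before sending it. The Python sketch below encodes the rules stated above (at most 5 runs, at least one and at most two T-runs, at least one and at most two D-runs); the function name and the message wording are assumptions.

```python
from collections import Counter
from typing import List

MAX_RUNS_PER_PAIR = 5
MAX_RUNS_PER_MANDATORY_TYPE = 2
MANDATORY_TYPES = ("T", "D")


def check_submission(run_types: List[str]) -> List[str]:
    """Return a list of rule violations for the runs of one language pair."""
    problems: List[str] = []
    counts = Counter(run_types)
    if len(run_types) > MAX_RUNS_PER_PAIR:
        problems.append(f"{len(run_types)} runs submitted, limit is {MAX_RUNS_PER_PAIR}")
    for run_type in MANDATORY_TYPES:
        if counts[run_type] == 0:
            problems.append(f"mandatory {run_type}-run is missing")
        elif counts[run_type] > MAX_RUNS_PER_MANDATORY_TYPE:
            problems.append(
                f"{counts[run_type]} {run_type}-runs submitted, "
                f"limit is {MAX_RUNS_PER_MANDATORY_TYPE}"
            )
    return problems


# The example submission for C->CJE given above.
example_problems = check_submission(["T", "T", "D", "DN", "TDNC"])
```

An empty list means the submission satisfies all three rules.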
Identification and Priority of Runs
Each run has to be associated with a RunID, which identifies the run. As the examples below show, a RunID joins the group ID, the topic language, the document language(s), the run type, and 'pp' with hyphens.
The 'pp' is two digits representing the priority of the run, which will be used as a parameter for pooling. Participants have to decide the priority of each submitted run on the basis of each language pair; "01" means the highest priority. For example, suppose a participating group, LIPS, submits 3 runs for C-->CJE: the first is a T-run, the second a D-run, and the third a DN-run. The RunIDs are then LIPS-C-CJE-T-01, LIPS-C-CJE-D-02, and LIPS-C-CJE-DN-03, respectively. Or, if the group uses different ranking techniques in two T-runs for C-->CJE, the RunIDs have to be LIPS-C-CJE-T-01, LIPS-C-CJE-T-02, and LIPS-C-CJE-D-03.
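RunIDs of the shape seen in the examples above (such as LIPS-C-CJE-T-01) can be generated and checked programmatically. The Python sketch below infers the layout from those examples; the pattern, in particular the characters allowed in a group ID, is an assumption and should be checked against the Submission Guideline.

```python
import re
from typing import NamedTuple


class RunID(NamedTuple):
    group: str
    topic_lang: str
    doc_langs: str
    run_type: str
    priority: int


# Layout inferred from the examples: Group-Topic-Docs-Type-pp (assumption).
RUN_ID_PATTERN = re.compile(r"^([A-Za-z0-9]+)-([CJKE])-([CJKE]+)-([TDNC]+)-(\d{2})$")


def format_run_id(run: RunID) -> str:
    """Build the RunID string, zero-padding the priority to two digits."""
    return f"{run.group}-{run.topic_lang}-{run.doc_langs}-{run.run_type}-{run.priority:02d}"


def parse_run_id(text: str) -> RunID:
    """Split a RunID string into its parts, or raise ValueError if malformed."""
    match = RUN_ID_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a valid RunID: {text!r}")
    group, topic_lang, doc_langs, run_type, priority = match.groups()
    return RunID(group, topic_lang, doc_langs, run_type, int(priority))


dn_run = format_run_id(RunID("LIPS", "C", "CJE", "DN", 3))
first_run = parse_run_id("LIPS-C-CJE-T-01")
```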
Relevance judgments will be made on four grades: Highly Relevant, Relevant, Partially Relevant, and Irrelevant. Evaluation will be done using trec_eval, run at two different thresholds of relevance level. In addition, newly proposed metrics for multi-grade relevance judgments, weighted R-precision and weighted average precision, which basically reward systems that retrieve more relevant documents at higher ranks, may be employed.
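To make the idea of two relevance thresholds concrete, the Python sketch below converts four-grade judgments into binary relevance at a chosen threshold and computes non-interpolated average precision for one topic. It does not call trec_eval; the numeric grade codes, the two thresholds used in the example, and all names are assumptions for illustration, and the weighted metrics are not covered because they are not defined in this document.

```python
from typing import Dict, List, Set

# Numeric codes for the four grades (the codes themselves are an assumption).
HIGHLY_RELEVANT, RELEVANT, PARTIALLY_RELEVANT, IRRELEVANT = 3, 2, 1, 0


def relevant_at(judgments: Dict[str, int], min_grade: int) -> Set[str]:
    """Documents whose grade reaches the threshold count as relevant."""
    return {doc for doc, grade in judgments.items() if grade >= min_grade}


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision of one ranked list."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


judgments = {"d1": HIGHLY_RELEVANT, "d2": IRRELEVANT, "d3": PARTIALLY_RELEVANT, "d4": RELEVANT}
ranking = ["d1", "d2", "d3", "d4"]

# Assumed thresholds: a strict one and a relaxed one.
strict_ap = average_precision(ranking, relevant_at(judgments, RELEVANT))
relaxed_ap = average_precision(ranking, relevant_at(judgments, PARTIALLY_RELEVANT))
```

The same ranked list generally scores differently under the two thresholds, which is why results are reported at both.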
2003-03-20 | Application Due |
2003-03-30 | Data (Document sets) Release |
2003-10-01 | Distribution of Search Topics |
2003-11-08 | Submission of Search Results* (extended deadline) |
2004-02-20 | Delivery of Evaluation Results |
2004-03-19 | Paper Due (for Working Notes) |
2004-05 | NTCIR Workshop 4 (Conference) |
2004-09-01 | Paper Due (for Formal Proceedings) |
*Please see the Submission Guideline.
Hsin-Hsi Chen, Taiwan
Kuang-hua Chen, Taiwan (co-chair)
Koji Eguchi, Japan
Noriko Kando, Japan
Kazuaki Kishida, Japan (co-chair)
Kazuko Kuriyama, Japan
Suk-Hoon Lee, Korea (co-chair)
Sung Hyon Myaeng, Korea
(in alphabetical order of family names)
If you have a question, please contact the task organizers.
E-mail: | ntcadm-clir |
1. Sample of Chinese Document Record
2. Sample of Japanese Document Record
3. Sample of Korean Document Record