The 3rd NTCIR Workshop is over. For the current NTCIR Workshop (the latest in the series of NTCIR Workshops), please visit the official NTCIR web site.
In order to promote research on cross-language information retrieval (CLIR), the program committee of the NTCIR-3 workshop is happy to announce a CLIR task. The CLIR task has three tracks: 1) Multilingual CLIR, 2) Bilingual CLIR, and 3) Single Language IR (non-English IR). Participants from all over the world are welcome, and both automatic and interactive systems may be entered. Participants are asked to report their system specifications and technical details. Please visit the official web site (http://research.nii.ac.jp/ntcir/) for further information. Online registration is available at http://research.nii.ac.jp/ntcir/workshop/application-en.html.
Four languages are involved in the CLIR task: Chinese, English, Japanese, and Korean. Please note that other languages may be added later if the preparation can be completed on schedule.
Dr. Hsin-Hsi Chen, Taiwan (Co-chair)
Dr. Kuang-hua Chen, Taiwan (Co-chair)
Dr. Koji Eguchi, Japan
Dr. Noriko Kando, Japan
Dr. Hyeon Kim, Korea
Dr. Kazuaki Kishida, Japan
Dr. Kazuko Kuriyama, Japan
Dr. Suk-Hoon Lee, Korea
Dr. Sung Hyon Myaeng, Korea
(in alphabetical order of family names)
The CLIR task provides three tracks. Participants may take part in any one, any two, or all three of them.
Multilingual CLIR (MLIR)
The topic set and document set of the MLIR track involve more than two languages; the challenge is that participants have to resolve complicated multi-language issues. Since the publication dates of the Korean documents are not parallel to those of the Chinese, English, and Japanese documents (see the Document Sets section below), the document set used in MLIR consists of Chinese, English, and Japanese only. The topics, however, may be in Chinese, English, Japanese, or Korean: topics in the other languages will be translated into Korean, so participants can also carry out a K-->CEJ run. The following depicts the MLIR track.
Topic Set --> Document Set
C --> C, J, E
E --> C, J, E
J --> C, J, E
K --> C, J, E
C --> J, E
E --> J, E
J --> J, E
K --> J, E
C --> C, J
E --> C, J
J --> C, J
K --> C, J
C --> C, E
E --> C, E
J --> C, E
K --> C, E
Bilingual CLIR (BLIR)
The topic set and document set of the BLIR track involve two languages. This track is less complex than MLIR, but participants still have to resolve cross-language issues. Please note that the topics in K-->J and K-->C are translated from the Japanese and Chinese topics; conversely, the topics in C-->K, E-->K, and J-->K are translated from the Korean topics. The following depicts the BLIR track.
Topic Set --> Document Set
C --> J
E --> J
K --> J
C --> K
E --> K
J --> K
J --> C
K --> C
E --> C
Single Language IR (SLIR)
The topic set and document set of the SLIR track are in a single language. We do not allow an E-->E track. The following depicts the SLIR track.
Topic Set --> Document Set
C --> C
J --> J
K --> K
We do not provide multilingual dictionaries or segmentation tools for Chinese, Japanese, and Korean; participants have to construct or obtain the tools and dictionaries they need. However, we will prepare a list of CJK language resources and post it on the NTCIR official website, so that participants can download resources or purchase them from the providers.
5. Test Collection
The test collection used in the CLIR task is composed of a document set and a topic set. A brief description of each follows.
The documents used in CLIR are news articles collected from news agencies in different countries, as shown below.
Country | Collection: Language | Number of Documents
Japan | Mainichi Newspaper (1998-1999): Japanese | 236,664
Japan | Mainichi Daily News (1998-1999): English | 12,723
Korea | Korea Economic Daily (1994): Korean | 66,146
Taiwan | CIRB010 (1998-1999): Chinese | 132,173
Taiwan | CIRB011 (1998-1999): Chinese | 132,173
Taiwan | United Daily News (udn) (1998-1999): Chinese | 249,508
Taiwan | Taiwan News (tns) (1998-1999): English | 7,489
Taiwan | Chinatimes English News (ctg) (1998-1999): English | 2,715
Participants have to sign separate contracts to use these materials, and each contract has its own requirements. We hope participants will understand the complicated copyright situations in the different countries.
The period of permitted use of the Mainichi Newspaper and Mainichi Daily News collections (the document collections from Japan) is from 2001-09-01 to 2003-09-30. Active participants who submit results and who are affiliated with organizations outside Japan will be able to extend the period up to 2008-09-30. After the permitted period ends, participants will have to delete all the document data, or purchase the data from Mainichi Newspaper Co. and obtain the company's permission for research use.
Participants have to sign a contract with Dr. Kuang-hua Chen for the use of United Daily News (a document collection from Taiwan). Note that udn.com (the company of United Daily News) reserves the right to reject a contract, although it will approve contracts under normal circumstances. Participants also have to sign a contract with Dr. Kuang-hua Chen for the use of CIRB010, CIRB011, Taiwan News, and Chinatimes English News.
Participants have to sign a contract with Prof. Sung Hyon Myaeng and will be granted the right to use the Korea Economic Daily for two years, with possible extensions.
All contract forms are available at http://research.nii.ac.jp/ntcir/permission/perm-en.html. After signing the needed contracts, participants have to send them to NII; see the URL above for details.
Each news article is formatted consistently using a set of tags. Sample documents are shown in the Appendix. The tag set is as follows.
Mandatory tags
<DOC> ... </DOC> | The tag enclosing each document
<DOCNO> ... </DOCNO> | Document identifier
<LANG> ... </LANG> | Language code: CH, EN, JA, KR
<HEADLINE> ... </HEADLINE> | Title of the news article
<DATE> ... </DATE> | Issue date
<TEXT> ... </TEXT> | Text of the news article
Optional tags
<P> ... </P> | Paragraph marker
<SECTION> ... </SECTION> | Section identifier in the original newspaper
<AE> ... </AE> | Indicates whether the article contains figures
<WORDS> ... </WORDS> | Number of 2-byte words (for Mainichi Newspaper)
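Since the collection format above is SGML-like rather than strict XML, records can be read with lightweight pattern matching. The sketch below splits a collection file into <DOC> records and checks the mandatory tags; the sample record (DOCNO, date, and body text) is invented for illustration.

```python
import re

# Mandatory tags from the table above.
MANDATORY = ["DOCNO", "LANG", "HEADLINE", "DATE", "TEXT"]

def split_documents(collection_text):
    """Split a collection file into the contents of its <DOC>...</DOC> records."""
    return re.findall(r"<DOC>(.*?)</DOC>", collection_text, re.DOTALL)

def parse_document(doc_text):
    """Extract the mandatory fields of one record into a dict;
    raise ValueError if a mandatory tag is missing."""
    fields = {}
    for tag in MANDATORY:
        m = re.search(r"<{0}>(.*?)</{0}>".format(tag), doc_text, re.DOTALL)
        if m is None:
            raise ValueError("missing mandatory tag <%s>" % tag)
        fields[tag] = m.group(1).strip()
    return fields

# A made-up record for illustration only.
sample = """<DOC>
<DOCNO>sample-0001</DOCNO>
<LANG>JA</LANG>
<HEADLINE>(headline text)</HEADLINE>
<DATE>1998-01-01</DATE>
<TEXT><P>(body text)</P></TEXT>
</DOC>"""

docs = [parse_document(d) for d in split_documents(sample)]
```

Optional tags such as <P> or <SECTION> can be handled the same way, treating a missing tag as absent rather than as an error.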
The following shows a sample topic.
<TOPIC>
<NUM>013</NUM>
<SLANG>CH</SLANG>
<TLANG>EN</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>To retrieve the labor dispute between the two parties of the US National Basketball Association at the end of 1998 and the agreement that they reached.</DESC>
<NARR>The content of the related documents should include the causes of the NBA labor dispute, the relations between the players and the management, the main controversial issues of both sides, the compromises after negotiation, and the content of the new agreement, etc. A document will be regarded as irrelevant if it only touches upon the influence of closing the courts on each game of the season.</NARR>
<CONC>NBA (National Basketball Association), union, team, league, labor dispute, league and union, negotiation, to sign an agreement, salary, lockout, Stern, Bird Regulation.</CONC>
</TOPIC>
The tags used in topics are as follows.
<TOPIC> ... </TOPIC> | The tag enclosing each topic
<NUM> ... </NUM> | Topic identifier
<SLANG> ... </SLANG> | Source language code: CH, EN, JA, KR
<TLANG> ... </TLANG> | Target language code: CH, EN, JA, KR
<TITLE> ... </TITLE> | A concise representation of the information request, composed of nouns or noun phrases
<DESC> ... </DESC> | A brief description of the information need, composed of one or two sentences
<NARR> ... </NARR> | A detailed narrative of the topic: further interpretation of the request and its proper nouns, lists of relevant or irrelevant items, specific requirements or limitations on relevant documents, and so on
<CONC> ... </CONC> | Keywords relevant to the whole topic
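The topic format can be read with the same kind of lightweight extraction as the documents. A minimal sketch, using the first few fields of the sample topic shown above:

```python
import re

TOPIC_TAGS = ["NUM", "SLANG", "TLANG", "TITLE", "DESC", "NARR", "CONC"]

def parse_topic(topic_text):
    """Extract each tagged field of a <TOPIC> record into a dict;
    fields absent from the record come back as None."""
    fields = {}
    for tag in TOPIC_TAGS:
        m = re.search(r"<{0}>(.*?)</{0}>".format(tag), topic_text, re.DOTALL)
        fields[tag] = m.group(1).strip() if m else None
    return fields

# The first few fields of the sample topic above.
sample = ("<TOPIC> <NUM>013</NUM> <SLANG>CH</SLANG> <TLANG>EN</TLANG> "
          "<TITLE>NBA labor dispute</TITLE> </TOPIC>")
topic = parse_topic(sample)
```

Which of the extracted fields a system may actually use is governed by the run-type rules described below.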
Basically, we allow runs using any combination of the topic fields, and we use 'T' (TITLE), 'D' (DESC), 'N' (NARR), 'C' (CONC), and any combination of these symbols to name the run types. That is, a participant can submit T, D, N, C, TD, TN, TC, DN, DC, NC, TDN, TDC, TNC, DNC, and TDNC runs. Each participant can submit up to 3 runs for each language pair regardless of run type, where a language pair means the combination of topic language and document language(s). Among these run types, the D run is mandatory: each participant has to submit at least one D run for each language pair entered. Each run has to be associated with a RunID, an identifier for the run. The RunID format rule is as follows.
Group's ID-Topic Language-Document Language-Run Type-pp
The 'pp' is two digits representing the priority of the run, which will be used as a parameter for pooling. Participants have to decide the priority of each submitted run within each language pair; "01" means the highest priority. For example, suppose a participating group, LIPS, submits 3 runs for the C-->CJ track: a D run, a DN run, and a TD run. The RunIDs are LIPS-C-CJ-D-01, LIPS-C-CJ-DN-02, and LIPS-C-CJ-TD-03, respectively. If instead the group submits three D runs using different ranking techniques for the C-->CJ track, the RunIDs have to be LIPS-C-CJ-D-01, LIPS-C-CJ-D-02, and LIPS-C-CJ-D-03.
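The naming rules above can be made mechanical. This sketch enumerates the 15 run types (every non-empty combination of the four field symbols, in T, D, N, C order, as in the list above) and composes RunIDs in the stated format:

```python
from itertools import combinations

FIELDS = "TDNC"  # T = TITLE, D = DESC, N = NARR, C = CONC

def run_types():
    """All 15 run types: every non-empty combination of the four
    field symbols, keeping the fields in T, D, N, C order."""
    return ["".join(c)
            for r in range(1, len(FIELDS) + 1)
            for c in combinations(FIELDS, r)]

def run_id(group, topic_lang, doc_langs, run_type, priority):
    """Compose a RunID as Group-TopicLang-DocLang-RunType-pp,
    with the priority zero-padded to two digits."""
    if run_type not in run_types():
        raise ValueError("unknown run type: %s" % run_type)
    return "{0}-{1}-{2}-{3}-{4:02d}".format(
        group, topic_lang, doc_langs, run_type, priority)
```

For instance, `run_id("LIPS", "C", "CJ", "D", 1)` yields "LIPS-C-CJ-D-01", matching the example above.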
Relevance judgments will be made in four grades: Highly Relevant, Relevant, Partially Relevant, and Irrelevant. Evaluation will be done using:
1) trec_eval, run at two different thresholds of the relevance levels;
2) newly proposed metrics for multigrade relevance judgments, weighted R-precision and weighted average precision, which give credit to systems that retrieve more relevant documents at higher ranks.
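Since trec_eval expects binary judgments, running it "at two different thresholds" amounts to collapsing the four grades to binary in two ways. A minimal sketch of that mapping; the threshold names here ("strict", "lenient") are assumptions for illustration, not the official task definition:

```python
# The four relevance grades, from lowest to highest.
GRADES = ["Irrelevant", "Partially Relevant", "Relevant", "Highly Relevant"]

def binarize(grade, threshold):
    """Collapse a multigrade judgment to the binary relevance that
    trec_eval expects: relevant iff the grade meets the threshold."""
    return int(GRADES.index(grade) >= GRADES.index(threshold))

# Two plausible thresholds (names assumed): a strict one counting only
# Relevant and Highly Relevant, and a lenient one that also counts
# Partially Relevant.
strict = [binarize(g, "Relevant") for g in GRADES]
lenient = [binarize(g, "Partially Relevant") for g in GRADES]
```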
2001-09-30 | Application Due
2001-10-01 | Delivery of Dry Run Data
2001-10-30 | Distribution of Dry Run Search Topics
2001-11-12 | Submission of Dry Run Search Results
2001-12-07 | Delivery of Dry Run Evaluation Results
2002-01-08 | Distribution of Formal Run Search Topics
2002-02-04 | Submission of Formal Run Search Results
2002-07-01 | Delivery of Formal Run Evaluation Results
2002-08-20 | Paper Due
2002-10-08 to 2002-10-10 | NTCIR Workshop 3