The NTCIR-6 CLIR test collection can be used for experiments in cross-lingual information retrieval between Chinese (traditional), Japanese, Korean and English (CJKE), such as single language IR (SLIR), bilingual IR (BLIR) and multilingual IR (MLIR).
*It should be noted that the document sets provided for MLIR cover only CJK; no English document set is included.
The documents are full-text news articles in the CJK languages (there are no
English documents), published in Asia from 2000 to 2001. The test
collection also includes 50 search topics in CJKE, and relevance judgment
information.
The NTCIR-6 CLIR task consisted of STAGE1 and STAGE2. In STAGE2, the old test
collections of NTCIR-3, -4 and -5 were re-used without change. Therefore,
"NTCIR-6 Test Collection" refers to the test collection created
in STAGE1.
Collection | Task | Genre | Filename | Doc lang. | Year | # of docs | Size | Topic lang. | # of topics | Relevance grades |
NTCIR-6 CLIR | IR | Newspaper articles | CIRB040r (A) | CH | 2000-01 | 901,446 | 581.7 MB | CJKE | 50 | 4 |
NTCIR-6 CLIR | IR | Newspaper articles | Mainichi Newspaper (B) | JA | 2000-01 | 199,681 | 118.8 MB | CJKE | 50 | 4 |
NTCIR-6 CLIR | IR | Newspaper articles | Yomiuri Newspaper (B) | JA | 2000-01 | 658,719 | 343.3 MB | CJKE | 50 | 4 |
NTCIR-6 CLIR | IR | Newspaper articles | Hankookilbo (A) | KR | 2000-01 | 85,250 | 52.1 MB | CJKE | 50 | 4 |
NTCIR-6 CLIR | IR | Newspaper articles | Chosunilbo (A) | KR | 2000-01 | 135,124 | 88.7 MB | CJKE | 50 | 4 |
(A) | CIRB040rAHankookilboAChosunilbo: available from NII for research purpose | |
(B) | Mainichi Newspaper,Yomiuri Newspaper: For the non-participants, available
for research purpose from other party with fee@@ EYomiuri Newspaper Japanese Article Data is available for research purpose use from Nihon Database Kaihatsu Co. Ltd (currently information is available in Japanese only). and the document records in the Data shall be converted into the NTCIR standard record format by the script yomi2ntcir.pl. @@EREADME for Yomiuri Newspaper Japanese Article Data : http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/READMEforYomiuri98+99.txt @@ETo obtai script yomi2ntcir.pl. Fhhttp://research.nii.ac.jp/ntcir/permission/ntcir-4/script/yomi2ntc.pl @@EREADMEyscript yomi2ntcir.pl. zFhttp://research.nii.ac.jp/ntcir/permission/ntcir-4/script/READMEforYomiuriScript.txt EMainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research purpose use from Nichigai Associates and Mainichi Newspaper Co. and the document records in the CD-ROMs shall be converted into the NTCIR standard record format by the script mai2.pl.(currently information is available in Japanese only) @@ETo obtain script mai2ntc-r.plFhttp://research.nii.ac.jp/ntcir/permission/ntcir-4/script/mai2ntc-r.pl_txt @@EREADMEymai2ntc-r.plzhttp://research.nii.ac.jp/ntcir/permission/ntcir-4/script/READMEforMainichiScript-r.txt |
Document Sets
The document sets included in the NTCIR-6 CLIR test collection are as follows.
Doc language | Files | No. of docs (2000) | No. of docs (2001) | No. of docs (Total) |
Chinese | CIRB040r (revised) (581.7 MB) | | | |
Chinese | United Daily News (udn) | 244,038 | 222,526 | 466,564 |
Chinese | United Express (ude) | 40,445 | 51,851 | 92,296 |
Chinese | Ming Hseng News (mhn) | 84,437 | 85,302 | 169,739 |
Chinese | Economic Daily News (edn) | 79,380 | 93,467 | 172,847 |
Chinese | Total | 448,300 | 453,146 | 901,446 |
Japanese | Mainichi Newspaper 2000-2001 (118.8 MB) | 99,207 | 100,474 | 199,681 |
Japanese | Yomiuri Newspaper 2000-2001 (343.3 MB) | 306,709 | 352,010 | 658,719 |
Japanese | Total | 405,916 | 452,484 | 858,400 |
Korean | Hankookilbo 2000-2001 (52.1 MB) | 40,306 | 44,944 | 85,250 |
Korean | Chosunilbo 2000-2001 (88.7 MB) | 67,711 | 67,413 | 135,124 |
Korean | Total | 108,017 | 112,357 | 220,374 |
(2) Tags used in document records
Each news article follows a consistent format, marked up with a set of tags.
Mandatory tags | ||
<DOC> | </DOC> | The tag for each document |
<DOCNO> | </DOCNO> | Document identifier |
<LANG> | </LANG> | Language code: CH, JA, KR |
<HEADLINE> | </HEADLINE> | Title of this news article |
<DATE> | </DATE> | Issue date |
<TEXT> | </TEXT> | Text of news article |
Optional tags | ||
<P> | </P> | Paragraph marker |
<SECTION> | </SECTION> | Section identifier in original newspapers |
<AE> | </AE> | Whether the article contains figures |
<WORDS> | </WORDS> | Number of words in 2 bytes (for Mainichi Newspaper) |
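As a rough illustration of the record layout in the tag table above, the following Python sketch pulls the mandatory fields out of a document record with regular expressions. The function name `parse_docs`, the sample record and its DOCNO value are invented for this example; real files may differ in character encoding and whitespace, so this is a sketch under those assumptions, not an official parser.

```python
import re

# Mandatory single-valued fields listed in the tag table.
FIELDS = ("DOCNO", "LANG", "HEADLINE", "DATE", "TEXT")


def parse_docs(data):
    """Yield one dict per <DOC>...</DOC> record found in `data`."""
    for match in re.finditer(r"<DOC>(.*?)</DOC>", data, flags=re.DOTALL):
        body = match.group(1)
        record = {}
        for tag in FIELDS:
            m = re.search(rf"<{tag}>(.*?)</{tag}>", body, flags=re.DOTALL)
            record[tag] = m.group(1).strip() if m else None
        # <P> is optional: collect paragraphs when the markers are present.
        text = record["TEXT"] or ""
        record["PARAGRAPHS"] = [
            p.strip() for p in re.findall(r"<P>(.*?)</P>", text, flags=re.DOTALL)
        ]
        yield record


# Hypothetical record, made up for illustration only.
SAMPLE = """<DOC>
<DOCNO>example_doc_0001</DOCNO>
<LANG>JA</LANG>
<HEADLINE>Sample headline</HEADLINE>
<DATE>2000-01-01</DATE>
<TEXT>
<P>First paragraph.</P>
<P>Second paragraph.</P>
</TEXT>
</DOC>"""

docs = list(parse_docs(SAMPLE))
print(docs[0]["DOCNO"], docs[0]["LANG"], len(docs[0]["PARAGRAPHS"]))
```

Regular expressions are used here instead of an XML parser because the records are tagged text rather than guaranteed well-formed XML.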
Search topics
(1)List of topic files
The NTCIR-6 CLIR test collection includes a set of topics in Chinese,
Japanese, Korean and English. The file names are as follows:
Language | File Name |
(1) Chinese | NTCIR6CLIRTopicCH.txt |
(2) Japanese | NTCIR6CLIRTopicJA.txt |
(3) Korean | NTCIR6CLIRTopicKR.txt |
(4) English | NTCIR6CLIRTopicEN.txt |
(2) A sample of topics
<TOPIC>
<NUM>013</NUM>
<SLANG>CH</SLANG>
<TLANG>EN</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>To retrieve the labor dispute between the two parties of the US National Basketball Association at the end of 1998 and the agreement that they reached. </DESC>
<NARR>
<REL>The content of the related documents should include the causes of NBA labor dispute, the relations between the players and the management, main controversial issues of both sides, compromises after negotiation and content of the new agreement, etc. The document will be regarded as irrelevant if it only touched upon the influences of closing the court on each game of the season.</REL>
</NARR>
<CONC>NBA (National Basketball Association), union, team, league, labor dispute, league and union, negotiation, to sign an agreement, salary, lockout, Stern, Bird Regulation.</CONC>
</TOPIC>
(3) Tags used in topics
The tags used in topics are as follows.
<TOPIC> | </TOPIC> | The tag for each topic |
<NUM> | </NUM> | Topic identifier |
<ONUM> | </ONUM> | Original topic ID (in NTCIR-6 STAGE1, some old topics from NTCIR-3 to -5 were re-used.) |
<SLANG> | </SLANG> | Source language code: CH, EN, JA, KR |
<TLANG> | </TLANG> | Target language code: CH, EN, JA, KR |
<TITLE> | </TITLE> | A concise representation of the information request, consisting of a noun or noun phrase. |
<DESC> | </DESC> | A brief description of the information need, consisting of one or two sentences. |
<NARR> | </NARR> | A much longer description of the topic. The <NARR> may have three parts: (1) <BACK>...</BACK>: background information about the topic; (2) <REL>...</REL>: further interpretation of the request and proper nouns, a list of relevant or irrelevant items, specific requirements or limitations for relevant documents, and so on; (3) <TERM>...</TERM>: definitions or explanations of proper nouns, scientific terms and so on. |
<CONC> | </CONC> | Keywords relevant to the whole topic. |
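To show how the topic tags fit together, here is a minimal Python sketch that reads topic blocks into dictionaries, including the optional sub-parts of <NARR>. The helper names `_field` and `parse_topics` are hypothetical, and the sample is a shortened version of the sample topic shown earlier on this page; actual topic files may need a specific character encoding when opened.

```python
import re


def _field(tag, text):
    """Return the stripped content of <tag>...</tag> in `text`, or None."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return m.group(1).strip() if m else None


def parse_topics(data):
    """Return a list of dicts, one per <TOPIC>...</TOPIC> block."""
    topics = []
    for block in re.findall(r"<TOPIC>(.*?)</TOPIC>", data, flags=re.DOTALL):
        topic = {
            tag: _field(tag, block)
            for tag in ("NUM", "ONUM", "SLANG", "TLANG", "TITLE", "DESC", "CONC")
        }
        narr = _field("NARR", block) or ""
        # <NARR> may contain any of the three sub-parts; missing ones are None.
        topic["NARR"] = {tag: _field(tag, narr) for tag in ("BACK", "REL", "TERM")}
        topics.append(topic)
    return topics


# Shortened version of the sample topic, for illustration only.
SAMPLE = """<TOPIC>
<NUM>013</NUM>
<SLANG>CH</SLANG>
<TLANG>EN</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>To retrieve the labor dispute between the two parties of the US National Basketball Association.</DESC>
<NARR><REL>The content of the related documents should include the causes of NBA labor dispute.</REL></NARR>
<CONC>NBA (National Basketball Association), union, lockout</CONC>
</TOPIC>"""

topic = parse_topics(SAMPLE)[0]
print(topic["NUM"], topic["SLANG"], topic["TLANG"], topic["TITLE"])
```

A title-only query run would use `topic["TITLE"]`, while a description run would use `topic["DESC"]`.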
Relevance Judgments
(1) List of judgment files
Type of run | Topics | Docs | Relevance Judgment Files |
Single Language IR (SLIR) | Chinese (C) | Chinese (C) | CLIR6FormalRunRJ-C-Rigid.txt, CLIR6FormalRunRJ-C-Relax.txt |
Single Language IR (SLIR) | Japanese (J) | Japanese (J) | CLIR6FormalRunRJ-J-Rigid.txt, CLIR6FormalRunRJ-J-Relax.txt |
Single Language IR (SLIR) | Korean (K) | Korean (K) | CLIR6FormalRunRJ-K-Rigid.txt, CLIR6FormalRunRJ-K-Relax.txt |
Bilingual IR (BLIR) | C or K | J | CLIR6FormalRunRJ-J-Rigid.txt, CLIR6FormalRunRJ-J-Relax.txt |
Bilingual IR (BLIR) | K or J | C | CLIR6FormalRunRJ-C-Rigid.txt, CLIR6FormalRunRJ-C-Relax.txt |
Bilingual IR (BLIR) | C or J | K | CLIR6FormalRunRJ-K-Rigid.txt, CLIR6FormalRunRJ-K-Relax.txt |
Multilingual IR (MLIR) | C or J or K or E | C and J and K | CLIR6FormalRunRJ-CJK-Rigid.txt, CLIR6FormalRunRJ-CJK-Relax.txt |
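In the list above, the judgment file depends only on the language of the document set, not on the topic language. The short Python sketch below captures that naming pattern; the function name `judgment_file` and its parameters are hypothetical and introduced only for illustration.

```python
# Document-set codes that appear in the judgment file names listed above.
DOC_CODES = {"C", "J", "K", "CJK"}


def judgment_file(doc_langs, relaxed=False):
    """Build the judgment file name for a target document set.

    doc_langs: "C", "J", "K", or "CJK" (the MLIR document set).
    relaxed:   False selects the Rigid file, True the Relax file.
    """
    if doc_langs not in DOC_CODES:
        raise ValueError(f"unknown document set: {doc_langs!r}")
    kind = "Relax" if relaxed else "Rigid"
    return f"CLIR6FormalRunRJ-{doc_langs}-{kind}.txt"


# A K->J bilingual run and a J->J single-language run share the same files.
print(judgment_file("J"))
print(judgment_file("CJK", relaxed=True))
```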
(2) Two kinds of relevance judgment file
In this test collection, four categories of relevance are used for the
judgment: "Highly Relevant," "Relevant," "Partially Relevant," and
"Irrelevant." However, the trec_eval scoring program we use adopts binary
relevance, so a threshold has to be chosen to map the four categories onto
relevant and irrelevant. For this reason, we provide two kinds of relevance judgment file:
(a) "Rigid" relevance - "Highly Relevant" and "Relevant" are regarded as relevant.
(b) "Relaxed" relevance - "Highly Relevant", "Relevant" and "Partially Relevant" are
regarded as relevant.
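The two thresholds can be sketched in Python as follows. The grade labels used as dictionary keys are hypothetical; this page does not say how the four grades are encoded before they are reduced to binary values, so the code only illustrates the mapping described in (a) and (b).

```python
# Hypothetical labels for the four relevance categories.
GRADES = ("highly_relevant", "relevant", "partially_relevant", "irrelevant")

# Categories counted as relevant under each threshold.
RIGID = {"highly_relevant", "relevant"}
RELAXED = RIGID | {"partially_relevant"}


def to_binary(grade, relaxed=False):
    """Map a four-level grade to 1 (relevant) or 0 (not relevant)."""
    if grade not in GRADES:
        raise ValueError(f"unknown grade: {grade!r}")
    accepted = RELAXED if relaxed else RIGID
    return 1 if grade in accepted else 0


for grade in GRADES:
    print(grade, to_binary(grade), to_binary(grade, relaxed=True))
```

Only "Partially Relevant" changes value between the two settings, which is why scores computed with the Relaxed files are never lower in relevant-document count than those computed with the Rigid files.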
(3) Format of Relevance Judgement File
The format of a relevance judgment file is as follows:
(topic-ID) (dummy) (document-ID) (relevance judgment "0" or "1") (comment)
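A judgment file in this layout could be read with a sketch like the one below. It assumes that the fields are separated by whitespace and that the trailing comment is optional; the function name `parse_judgments` and the sample lines (topic and document IDs) are invented for illustration.

```python
from collections import defaultdict


def parse_judgments(lines):
    """Map each topic ID to the set of document IDs judged relevant ("1")."""
    relevant = defaultdict(set)
    for line in lines:
        # Split into at most 5 parts so a comment containing spaces stays whole.
        parts = line.split(None, 4)
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        topic_id, _dummy, doc_id, judgment = parts[:4]
        docs = relevant[topic_id]  # creates an entry even if nothing is relevant
        if judgment == "1":
            docs.add(doc_id)
    return dict(relevant)


# Hypothetical lines in the five-field layout.
SAMPLE = [
    "001 0 doc_a 1 some comment",
    "001 0 doc_b 0",
    "002 0 doc_c 1",
    "003 0 doc_d 0",
    "",
]

judgments = parse_judgments(SAMPLE)
print(sorted(judgments["001"]), len(judgments["003"]))
```

With a real file, the same function can be called on an open file object, for example `parse_judgments(open(path, encoding=...))`, with the encoding chosen to match the file.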
The procedures for obtaining the NTCIR-6 CLIR test collection are as follows. The test collection and data available from NII are free of charge.
Task data (without document data)
Document data
Reference
NTCIR Project Office (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN
PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat
Releases of new test collections and correction information will be announced through the NTCIR mailing list.
The test collection was constructed and used for NTCIR. It may be
used for research purposes only.
The document collections included in the test collection were provided
to NII for use in NTCIR, either free of charge or for a fee. The providers of
the document data kindly understood the importance of the test collection
to research on information access technologies and granted the
use of the data for research purposes. Please remember that the document
data in the NTCIR test collection is copyrighted and has commercial value
as data. To maintain a good and reliable relationship with the data
producers and providers, it is important that we researchers behave as
reliable partners, use the data only for research purposes under the
user agreement, and take care not to violate any rights in the data.