The NTCIR-6 CLQA test collection can be used for experiments in cross-lingual question answering between Chinese (traditional), Japanese, and English (CJE), namely the C-C, C-E, E-C, E-E, E-J, J-E, and J-J subtasks.*
* "X-Y subtask" indicates that questions are given in language X and answers are extracted from documents written in language Y.
The documents are the full text of news articles in the CJE languages, published in Asian areas in 1998 and 1999. The test collection also includes 200 Japanese questions (for the J-J and J-E subtasks), 150 Chinese questions (for the C-C and C-E subtasks), and 350 English questions (200 for the E-J and E-E subtasks, and 150 for the E-C and E-E subtasks), together with the correct answers to the questions and the IDs of the documents that support those answers.
Collection: NTCIR-6 CLQA
Task: QA
Genre: news articles

Documents:
File name | Lang. | Year | # of docs | Size
CIRB020 (A) | Traditional Chinese | 1998-1999 | 249,203 | 320 MB
Mainichi (B) | Japanese | 1998-1999 | 220,078 | 282 MB
EIRB010 (C) | English | 1998-1999 | 10,204 | 24.5 MB
Mainichi Daily (A) | English | 1998-1999 | 12,723 | 33.3 MB
Korea Times (A) | English | 1998-1999 | 19,599 | 55.8 MB
Hong Kong Standard (C) | English | 1998-1999 | 96,683 | 252 MB

Task data:
Topic/Question Lang. | # of questions | Relevance judgments
CJE | J-E/J-J/E-J: 200; C-E/C-C/E-C/E-E: 150 | 3 grades *
* The three grades are Right, Unsupported, and Wrong.
(A) -- CIRB020, Mainichi Daily (English), Korea Times, and Hong Kong Standard can be delivered from NII to non-participants for research-purpose use.
(B) -- For non-participants, the Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research-purpose use from Nichigai Associates and the Mainichi Shimbun. The document records on the CD-ROMs must be converted into the NTCIR standard record format with the script mai2ntc-r.pl (the related information is currently available in Japanese only). Script: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/mai2ntc-r.pl_txt ; README for mai2ntc-r.pl: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/READMEforMainichiScript-r.txt
(C) -- EIRB010 is usable by participants only. Hong Kong Standard is not available at present.
The datasets used in the NTCIR-6 CLQA test collection are as follows.
A.1 Chinese Dataset (traditional)
CIRB020: United Daily News, Economic Daily News, Min Sheng Daily, United Evening News, Star News (Copyright: UDN.COM) 1998-1999
A.2 Japanese Dataset
Mainichi Newspaper Article Data (Copyright: Mainichi Newspaper) 1998 - 1999
A.3 English Dataset
EIRB010: Taiwan News (Copyright: Taiwan News); China Times English News (Copyright: China Times Inc.) 1998-1999
Mainichi Daily News (Copyright: Mainichi Newspaper) 1998-1999
Korea Times (Copyright: Hankooki.com Co., Distribution rights: Korean Institute of Science and Technology Information) 1998-1999
Hong Kong Standard (Copyright: the Sing Tao Group, Distribution rights: Wisers Information Ltd.) 1998-1999
The following is a brief description of the tag set used in the document data.
Mandatory tags:
<DOC> | </DOC> | The tag enclosing each document |
<DOCNO> | </DOCNO> | Document identifier |
<LANG> | </LANG> | Language code: ZH, EN, JA, KR |
<HEADLINE> | </HEADLINE> | Title of the news article |
<DATE> | </DATE> | Issue date |
<TEXT> | </TEXT> | Text of the news article |

Optional tags:
<P> | </P> | Paragraph marker |
<SECTION> | </SECTION> | Section identifier in the original newspaper |
<AE> | </AE> | Whether the article contains figures |
<WORDS> | </WORDS> | Number of words, counted in 2-byte characters |
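
As an illustration only (not an official NTCIR tool), the following minimal Python sketch reads a document file in this tag format and extracts the mandatory fields; the function name, file name, and encoding in the usage comment are our own assumptions.

import re

# Regular expressions for the <DOC> container and the mandatory field tags.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.S)
FIELD_RE = {name: re.compile(rf"<{name}>(.*?)</{name}>", re.S)
            for name in ("DOCNO", "LANG", "HEADLINE", "DATE", "TEXT")}

def parse_ntcir_documents(path, encoding):
    """Yield one dict per <DOC> element, keyed by the mandatory tag names."""
    with open(path, encoding=encoding, errors="replace") as f:
        data = f.read()
    for block in DOC_RE.finditer(data):
        body = block.group(1)
        record = {}
        for name, rx in FIELD_RE.items():
            m = rx.search(body)
            record[name] = m.group(1).strip() if m else None
        yield record

# Example (hypothetical file name and encoding):
# for doc in parse_ntcir_documents("mainichi1998.ntc", encoding="euc-jp"):
#     print(doc["DOCNO"], doc["HEADLINE"])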
Questions
The format of testing questions is:
[QID]: "[Question]"
[QID] has the form [QuestionSetID]-[Lang]-[QuestionNo]-[SubQuestionNo], where [QuestionSetID] is "CLQA2" and [Lang] is one of JA (Japanese), ZH (Chinese), and EN (English).
[QuestionNo] consists of "S" or "T" followed by four numeric characters ("S" is for sample questions and "T" for test questions), and [SubQuestionNo] consists of two numeric characters. An example question is:
CLQA2-EN-T3003-00: "Who was the UN secretary-general in 1999?"
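
For illustration, a minimal Python sketch that splits such a question line into its QID components, assuming exactly the line format described above; the regular expression and helper name are our own.

import re

# [QuestionSetID]-[Lang]-[QuestionNo]-[SubQuestionNo]: "[Question]"
QID_RE = re.compile(
    r'^(?P<set>CLQA2)-(?P<lang>JA|ZH|EN)-(?P<qno>[ST]\d{4})-(?P<subqno>\d{2}):\s*"(?P<question>.*)"\s*$'
)

def parse_question_line(line):
    """Return the QID parts and the question text of one question-file line."""
    m = QID_RE.match(line.strip())
    if m is None:
        raise ValueError("not a CLQA question line: %r" % line)
    return m.groupdict()

# parse_question_line('CLQA2-EN-T3003-00: "Who was the UN secretary-general in 1999?"')
# -> {'set': 'CLQA2', 'lang': 'EN', 'qno': 'T3003', 'subqno': '00',
#     'question': 'Who was the UN secretary-general in 1999?'}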
Eight question files are released for the CLQA formal run, in which the answers are restricted to named entities. The Chinese question files are in Big5 encoding, the Japanese question files in EUC-JP encoding, and the English question files in ASCII encoding. The names of the question files and their associations with the CLQA subtasks are shown below.
Subtask | Question set | Language | #Q | Remark
E-J | CLQA2-EN-T1200-ASCII.q | English | 200 | Same 200 English questions
E-E | CLQA2-EN-T0200-ASCII.q | English | 200 |
J-E | CLQA2-JA-T0200-EUC-JP.q | Japanese | 200 | Same 200 Japanese questions
J-J | CLQA2-JA-T1200-EUC-JP.q | Japanese | 200 |
E-C | CLQA2-EN-T3200-ASCII.q | English | 150 | Same 150 English questions
E-E | CLQA2-EN-T2200-ASCII.q | English | 150 |
C-E | CLQA2-ZH-T2200-BIG5.q | Chinese | 150 | Same 150 Chinese questions
C-C | CLQA2-ZH-T3200-BIG5.q | Chinese | 150 |
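
A minimal reading sketch, assuming the file names and encodings listed in the table above; the subtask-to-file mapping and the helper name are our own convenience choices, not part of the released data.

# Encodings follow the table above: ASCII for English, EUC-JP for Japanese,
# Big5 for Chinese. The two E-E files (CLQA2-EN-T0200-ASCII.q and
# CLQA2-EN-T2200-ASCII.q) are omitted because the subtask label is not unique.
QUESTION_FILES = {
    "E-J": ("CLQA2-EN-T1200-ASCII.q", "ascii"),
    "J-E": ("CLQA2-JA-T0200-EUC-JP.q", "euc-jp"),
    "J-J": ("CLQA2-JA-T1200-EUC-JP.q", "euc-jp"),
    "E-C": ("CLQA2-EN-T3200-ASCII.q", "ascii"),
    "C-E": ("CLQA2-ZH-T2200-BIG5.q", "big5"),
    "C-C": ("CLQA2-ZH-T3200-BIG5.q", "big5"),
}

def read_question_lines(subtask):
    """Return the non-empty lines of a question file, decoded as listed above."""
    filename, encoding = QUESTION_FILES[subtask]
    with open(filename, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]

# The returned lines have the [QID]: "[Question]" form shown in the Questions
# section and can be split with a helper like parse_question_line above.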
Gold Standard
The following is a description of the tag set used in the gold standard files.
<QASET> | </QASET> | The tag for the whole QA set |
<VERSION> | </VERSION> | The version of this QA set |
<QA> | </QA> | The tag for a QA cluster. A QA cluster contains a set of question sentences that express the same question in different languages; all of the correct answers found in the test collections (in any language) are also collected in the QA cluster |
<QUESTION> | </QUESTION> | Question part in a QA cluster |
<Q> | </Q> | The tag for a question sentence in a QA cluster, with the following attributes: LANG (values EN, JA, and ZH) denotes the language in which the question is written, and QID gives the question ID referred to in the CLQA subtasks |
<Q_TYPE> | </Q_TYPE> | The question type of the question |
<ANSWER> | </ANSWER> | Answer part in a QA cluster |
<A> | </A> | The tag for a correct answer found in the test collections, which has the following attributes: the attribute LANG (with values EN, JA, and ZH) denotes the language in which the answer string is written, and the attribute DOCNO gives the document ID of a document where this answer appears. |
An example of a QA cluster is as follows.
<QA>
<QUESTION>
<Q LANG="EN" QID="CLQA2-EN-T3003-00">Who was the UN secretary-general in 1999?</Q>
<Q LANG="ZH" QID="CLQA2-ZH-T3003-00">κγγγNό ι·₯NH</Q>
<Q_TYPE>PERSON</Q_TYPE>
</QUESTION>
<ANSWER>
<A LANG="EN" DOCNO="HK-199908270280045">Kofi Annan</A>
<A LANG="ZH" DOCNO="udn_xxx_19991230_0727">ΐμ</A>
<A LANG="ZH" DOCNO="udn_xxx_19990107_0191">ΐμ</A>
<A LANG="ZH" DOCNO="udn_xxx_19990720_0238">ΐμ</A>
<A LANG="ZH" DOCNO="udn_xxx_19991115_0168">ΐμ</A>
<A LANG="ZH" DOCNO="udn_xxx_19991118_0056">ΐμ</A>
<A LANG="ZH" DOCNO="udn_xxx_19990411_0202">ΐμ</A>
<A LANG="ZH" DOCNO="udn_xxx_19990830_0190">ΐμ</A>
</ANSWER>
</QA>
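
For illustration, a minimal Python sketch that loads a gold standard file with xml.etree.ElementTree and collects the questions, question type, and answers of each QA cluster; it assumes the file is well-formed UTF-8 XML with a single <QASET> root as described above, and the function name and returned structure are our own.

import xml.etree.ElementTree as ET

def load_gold_standard(path):
    """Return one dict per <QA> cluster in a gold standard file."""
    root = ET.parse(path).getroot()          # <QASET>
    clusters = []
    for qa in root.iter("QA"):
        # Question sentences in all languages, keyed by their QID attribute.
        questions = {q.get("QID"): {"lang": q.get("LANG"),
                                    "text": (q.text or "").strip()}
                     for q in qa.iter("Q")}
        qtype = qa.findtext(".//Q_TYPE")
        # Correct answers found in the document collections (any language).
        answers = [{"lang": a.get("LANG"), "docno": a.get("DOCNO"),
                    "text": (a.text or "").strip()}
                   for a in qa.iter("A")]
        clusters.append({"questions": questions,
                         "q_type": qtype.strip() if qtype else None,
                         "answers": answers})
    return clusters

# Example (one of the gold standard files listed below):
# clusters = load_gold_standard("CLQA2-ZH-T3200-v1.2-UTF-8.xml")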
The gold standard files for the CLQA subtasks are:
CLQA2-EJ-T0200-070131-UTF-8.xml for the J-E/E-E subtasks
CLQA2-EJ-T1200-070131-UTF-8.xml for the E-J/J-J subtasks
CLQA2-EN-T2200-v1.2-UTF-8.xml for the E-E subtask
CLQA2-EN-T3200-v1.2-UTF-8.xml for the E-C subtask
CLQA2-ZH-T3200-v1.2-UTF-8.xml for the C-C subtask
(There is no gold standard for the C-E subtask.)
The following are the procedures for obtaining this CLQA test collection. The test collection and the data available from NII are free of charge.
Task Data (without document data)
Document Data
Documents to submit
Reference
NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN
PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat
Releases of new test collections and correction information will be announced through the NTCIR mailing list.
The test collection has been constructed and used for NTCIR. It is usable only for research purposes.
The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee. The providers of the document data understood the importance of test collections for research on information access technologies and kindly granted the use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. For our continued good relationship with the data producers and providers, it is important that we researchers behave as reliable partners, use the data only for research purposes under the user agreement, and handle it carefully so as not to violate any of their rights.