NTCIR Project
NTCIR-6 CLQA
Research Purpose Use of Test Collection



NTCIR-6 CLQA (Cross Language Q&A data Test Collection)


Test Collection

The NTCIR-6 CLQA test collection can be used for experiments in cross-lingual information retrieval between Chinese (traditional), Japanese, and English (CJE), such as the J-E, E-J, C-E, and E-C subtasks.

* "X-Y subtask" indicates that questions are given in language X and answers are extracted from documents written in language Y.

The documents are full-text news articles in the CJE languages, published in Asia from 1998 to 1999. The test collection also includes 200 Japanese questions (for the J-J and J-E subtasks), 150 Chinese questions (for the C-C and C-E subtasks), and 350 English questions (200 for the E-J and E-E subtasks, and 150 for the E-C and E-E subtasks), together with the correct answers to the questions and the IDs of the documents that support those answers.

Collection: NTCIR-6 CLQA
Task: QA
Genre: News articles

Documents

File name                Lang.                 Year        # of docs   Size
CIRB020 (A)              Traditional Chinese   1998-1999   249,203     320 MB
Mainichi (B)             Japanese              1998-1999   220,078     282 MB
EIRB010 (C)              English               1998-1999   10,204      24.5 MB
Mainichi Daily (A)       English               1998-1999   12,723      33.3 MB
Korea Times (A)          English               1998-1999   19,599      55.8 MB
Hong Kong Standard (C)   English               1998-1999   96,683      252 MB

Task data

Topic/Relevance Lang.: CJE
# of questions: J-E/J-J/E-J: 200, C-E/C-C/E-C/E-E: 150
Relevance judgment: 3 grades *

* Right, Unsupported, Wrong

(A) -- CIRB020, Mainichi Daily (English), Korea Times, and Hong Kong Standard can be delivered by NII to non-participants for research use.
(B) -- For non-participants, the Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research use from Nichigai Associates and Mainichi Shimbun. The document records on the CD-ROMs must be converted into the NTCIR standard record format with the script mai2.pl (information is currently available in Japanese only).
To obtain the script mai2ntc-r.pl: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/mai2ntc-r.pl_txt
README (mai2ntc-r.pl): http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/READMEforMainichiScript-r.txt
(C) -- EIRB010 is available to participants only.
Hong Kong Standard is not currently available.


Documents, Topics and Questions
@@

The datasets used in the NTCIR-6 CLQA test collection are as follows.

A.1 Chinese Dataset (traditional) 

CIRB020: United Daily News, Economic Daily News, Min Sheng Daily, United Evening News, Star News (Copyright: UDN.COM) 1998-1999 

A.2 Japanese Dataset 

Mainichi Newspaper Article Data (Copyright: Mainichi Newspaper) 1998 - 1999 

A.3 English Dataset 

EIRB010: Taiwan News (Copyright: Taiwan News); China Times English News (Copyright: China Times Inc.) 1998-1999

Mainichi Daily News (Copyright: Mainichi Newspaper) 1998-1999

Korea Times (Copyright: Hankooki.com Co., Distribution rights: Korean Institute of Science and Technology Information) 1998-1999

Hong Kong Standard (Copyright: the Sing Tao Group, Distribution rights: Wisers Information Ltd.) 1998-1999

 

The following is a brief description of the document tag set.

Mandatory tags

<DOC> </DOC> The tag for each document
<DOCNO> </DOCNO> Document identifier
<LANG> </LANG> Language code: ZH, EN, JA, KR
<HEADLINE> </HEADLINE> Title of this news article
<DATE> </DATE> Issue date
<TEXT> </TEXT> Text of news article

Optional tags

<P> </P> Paragraph marker
<SECTION> </SECTION> Section identifier in original newspapers
<AE> </AE> Contain figures or not
<WORDS> </WORDS> Number of words in 2 bytes
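
The tag set above can be read with a few regular expressions. The following is a minimal sketch, assuming each record is delimited by <DOC> and </DOC> and that no tag is nested inside itself. The function names and the sample record (including its DOCNO and date format) are hypothetical and are not taken from the collection.

```python
import re

MANDATORY_TAGS = ("DOCNO", "LANG", "HEADLINE", "DATE", "TEXT")
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def extract(tag, body):
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", body, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_documents(raw):
    """Parse every <DOC> record in `raw` into a dict keyed by tag name."""
    docs = []
    for doc_match in DOC_RE.finditer(raw):
        body = doc_match.group(1)
        doc = {tag: extract(tag, body) for tag in MANDATORY_TAGS}
        missing = [tag for tag, value in doc.items() if value is None]
        if missing:
            raise ValueError(f"record lacks mandatory tags: {missing}")
        # <P> is optional; fall back to the whole text as one paragraph.
        paragraphs = re.findall(r"<P>(.*?)</P>", doc["TEXT"], re.DOTALL)
        doc["PARAGRAPHS"] = [p.strip() for p in paragraphs] or [doc["TEXT"]]
        docs.append(doc)
    return docs


# Hypothetical record for illustration only (not taken from the collection).
SAMPLE = """<DOC>
<DOCNO>EXAMPLE-0001</DOCNO>
<LANG>EN</LANG>
<HEADLINE>Example headline</HEADLINE>
<DATE>1999-01-01</DATE>
<TEXT>
<P>First paragraph.</P>
<P>Second paragraph.</P>
</TEXT>
</DOC>"""

docs = parse_documents(SAMPLE)
print(docs[0]["DOCNO"], docs[0]["LANG"], docs[0]["PARAGRAPHS"])
```

Regular expressions are used here rather than an XML parser because records in this style are not guaranteed to be well-formed XML (for example, they may contain unescaped ampersands).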




 

Questions

The format of the test questions is:

[QID]: "[Question]" 

[QID] has the form [QuestionSetID]-[Lang]-[QuestionNo]-[SubQuestionNo], where [QuestionSetID] is "CLQA2" and [Lang] is one of JA (Japanese), ZH (Chinese), and EN (English).

[QuestionNo] consists of "S" or "T" followed by four digits ("S" for sample questions and "T" for test questions), and [SubQuestionNo] consists of two digits. An example question is:

CLQA2-EN-T3003-00: "Who was the UN secretary-general in 1999?" 
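
The question format above maps directly onto a regular expression. The following is a minimal sketch, assuming one question per line; the names QuestionId, parse_qid, and parse_question_line are hypothetical helpers introduced for illustration.

```python
import re
from typing import NamedTuple, Tuple


class QuestionId(NamedTuple):
    set_id: str      # always "CLQA2" for this collection
    lang: str        # "JA", "ZH", or "EN"
    kind: str        # "S" (sample) or "T" (test)
    number: str      # four digits
    sub_number: str  # two digits


QID_RE = re.compile(r"^(CLQA2)-(JA|ZH|EN)-([ST])(\d{4})-(\d{2})$")
LINE_RE = re.compile(r'^(\S+):\s*"(.*)"\s*$')


def parse_qid(qid: str) -> QuestionId:
    """Split a question ID into its components, or raise ValueError."""
    match = QID_RE.match(qid)
    if match is None:
        raise ValueError(f"not a valid CLQA2 question ID: {qid!r}")
    return QuestionId(*match.groups())


def parse_question_line(line: str) -> Tuple[QuestionId, str]:
    """Parse one line of the form [QID]: "[Question]"."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid question line: {line!r}")
    return parse_qid(match.group(1)), match.group(2)


qid, text = parse_question_line(
    'CLQA2-EN-T3003-00: "Who was the UN secretary-general in 1999?"'
)
print(qid.lang, qid.kind, qid.number, qid.sub_number, text)
```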

Eight question files are released for the CLQA formal run, in which answers are restricted to named entities. Chinese question files are in BIG5 encoding, Japanese question files are in EUC-JP encoding, and English question files are in ASCII encoding. The names of the question files and their associations with the CLQA subtasks are shown below.

Subtask   Question Set               Language   #Q    Remark
E-J       CLQA2-EN-T1200-ASCII.q     English    200   Same 200 English questions
E-E       CLQA2-EN-T0200-ASCII.q     English    200   Same 200 English questions
J-E       CLQA2-JA-T0200-EUC-JP.q    Japanese   200   Same 200 Japanese questions
J-J       CLQA2-JA-T1200-EUC-JP.q    Japanese   200   Same 200 Japanese questions
E-C       CLQA2-EN-T3200-ASCII.q     English    150   Same 150 English questions
E-E       CLQA2-EN-T2200-ASCII.q     English    150   Same 150 English questions
C-E       CLQA2-ZH-T2200-BIG5.q      Chinese    150   Same 150 Chinese questions
C-C       CLQA2-ZH-T3200-BIG5.q      Chinese    150   Same 150 Chinese questions
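
Because each question file name ends with its encoding, a reader can pick the codec from the name itself. The following is a minimal sketch, assuming one question per line in the [QID]: "[Question]" format; encoding_from_filename and read_questions are hypothetical helper names.

```python
from pathlib import Path


def encoding_from_filename(name: str) -> str:
    """Return the encoding suffix of a question file name.

    For example, "CLQA2-JA-T0200-EUC-JP.q" yields "EUC-JP".
    """
    stem = Path(name).stem           # drop the ".q" extension
    parts = stem.split("-", 3)       # set ID, language, question set, encoding
    if len(parts) != 4:
        raise ValueError(f"unexpected question file name: {name!r}")
    return parts[3]


def read_questions(path):
    """Read a question file into a dict mapping QID to question text."""
    path = Path(path)
    encoding = encoding_from_filename(path.name)
    questions = {}
    # Python accepts "BIG5", "EUC-JP", and "ASCII" as codec names.
    with open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            qid, _, text = line.partition(":")
            questions[qid.strip()] = text.strip().strip('"')
    return questions


print(encoding_from_filename("CLQA2-ZH-T2200-BIG5.q"))
```

Decoding at read time means the rest of a pipeline can work with ordinary Unicode strings regardless of the source language.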

 

Gold Standard

The following is a description of the gold standard tag set.

<QASET> </QASET> The tag for the whole QA set
<VERSION> </VERSION> The version of this QA set
<QA> </QA> The tag for a QA cluster: a QA cluster contains a set of question sentences which are the same question but written in different languages; moreover, all of the correct answers found in the test collections (in any language) are also collected in a QA cluster
<QUESTION> </QUESTION> Question part in a QA cluster
<Q> </Q> The tag for a question sentence in a QA cluster, which has the following attributes: the attribute LANG (with values EN, JA, and ZH) denotes the language in which the question is written, and the attribute QID gives the question ID referred to in the CLQA subtasks
<Q_TYPE> </Q_TYPE> The question type of a question
<ANSWER> </ANSWER> Answer part in a QA cluster
<A> </A> The tag for a correct answer found in the test collections, which has the following attributes: the attribute LANG (with values EN, JA, and ZH) denotes the language in which the answer string is written, and the attribute DOCNO gives the document ID of a document where this answer appears.

 

An example of a QA cluster is as follows.

<QA>
<QUESTION>
<Q LANG="EN" QID="CLQA2-EN-T3003-00">Who was the UN secretary-general in 1999?</Q>
<Q LANG="ZH" QID="CLQA2-ZH-T3003-00">一九九九年時聯合國秘書長是誰?</Q>
<Q_TYPE>PERSON</Q_TYPE>
</QUESTION>
<ANSWER>
<A LANG="EN" DOCNO="HK-199908270280045">Kofi Annan</A>
<A LANG="ZH" DOCNO="udn_xxx_19991230_0727">安南</A>
<A LANG="ZH" DOCNO="udn_xxx_19990107_0191">安南</A>
<A LANG="ZH" DOCNO="udn_xxx_19990720_0238">安南</A>
<A LANG="ZH" DOCNO="udn_xxx_19991115_0168">安南</A>
<A LANG="ZH" DOCNO="udn_xxx_19991118_0056">安南</A>
<A LANG="ZH" DOCNO="udn_xxx_19990411_0202">安南</A>
<A LANG="ZH" DOCNO="udn_xxx_19990830_0190">安南</A>
</ANSWER>
</QA>
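
A QA cluster of this shape can be loaded with Python's standard XML parser. The following is a minimal sketch, assuming that <QA> elements sit directly inside <QASET> next to <VERSION>; that nesting, the version value "example", and the helper names load_gold and answers_in are assumptions made for illustration. The sample is an abridged copy of the cluster above.

```python
import xml.etree.ElementTree as ET

# Abridged sample; the <QASET> wrapper and <VERSION> value are assumed.
SAMPLE = """<QASET>
<VERSION>example</VERSION>
<QA>
<QUESTION>
<Q LANG="EN" QID="CLQA2-EN-T3003-00">Who was the UN secretary-general in 1999?</Q>
<Q LANG="ZH" QID="CLQA2-ZH-T3003-00">一九九九年時聯合國秘書長是誰?</Q>
<Q_TYPE>PERSON</Q_TYPE>
</QUESTION>
<ANSWER>
<A LANG="EN" DOCNO="HK-199908270280045">Kofi Annan</A>
<A LANG="ZH" DOCNO="udn_xxx_19991230_0727">安南</A>
</ANSWER>
</QA>
</QASET>"""


def load_gold(xml_text):
    """Map every QID to its question text, type, and the cluster's answers."""
    root = ET.fromstring(xml_text)
    gold = {}
    for qa in root.iter("QA"):
        q_type = qa.findtext("QUESTION/Q_TYPE")
        answers = [
            {
                "lang": a.get("LANG"),
                "docno": a.get("DOCNO"),
                "text": (a.text or "").strip(),
            }
            for a in qa.findall("ANSWER/A")
        ]
        # Every question in a cluster shares the same answer list.
        for q in qa.findall("QUESTION/Q"):
            gold[q.get("QID")] = {
                "lang": q.get("LANG"),
                "question": (q.text or "").strip(),
                "type": q_type,
                "answers": answers,
            }
    return gold


def answers_in(gold, qid, lang):
    """Return the answers for `qid` written in the target language `lang`."""
    return [a for a in gold[qid]["answers"] if a["lang"] == lang]


gold = load_gold(SAMPLE)
# For an E-C style lookup: English question, Chinese answers.
print(answers_in(gold, "CLQA2-EN-T3003-00", "ZH"))
```

For a gold standard file on disk, ET.parse(path).getroot() could be used in place of ET.fromstring, since the file names indicate UTF-8 encoding.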

 

The gold standard files for CLQA subtasks are: 

CLQA2-EJ-T0200-070131-UTF-8.xml for J-E/E-E subtasks

CLQA2-EJ-T1200-070131-UTF-8.xml for E-J/J-J subtasks

CLQA2-EN-T2200-v1.2-UTF-8.xml for E-E subtask

CLQA2-EN-T3200-v1.2-UTF-8.xml for E-C subtask

CLQA2-ZH-T3200-v1.2-UTF-8.xml for C-C subtask 

(There is no gold standard for the C-E subtask.)

The following are the procedures for obtaining this CLQA test collection. The test collection and data available from NII are free of charge.


Task Data (without document data)


Document Data

Documents to submit

Reference


Address

NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN
PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat

Mailing List

Releases of new test collections and correction information will be announced through the NTCIR mailing list.

Notice

The test collection was constructed and used for NTCIR. It may be used for research purposes only.
The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee. The providers of the document data kindly understood the importance of test collections for research on information access technologies and granted use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good and trusting relationship with the data producers and providers, we researchers must behave as reliable partners: use the data only for research purposes under the user agreement, and handle it carefully so as not to violate any rights in it.