The NTCIR-6 CLQA test collection can be used for experiments in cross-lingual information retrieval between Chinese (traditional), Japanese, and English (CJE) through its X-Y subtasks*.

* "X-Y subtask" indicates that questions are given in language X and answers are extracted from documents written in language Y

The documents are the full text of news articles in the CJE languages, published in Asia between 1998 and 1999. The test collection also includes 200 Japanese questions (for the J-J and J-E subtasks), 150 Chinese questions (for the C-C and C-E subtasks), and 350 English questions (200 for the E-J and E-E subtasks, and 150 for the E-C and E-E subtasks), together with the correct answers to the questions and the IDs of the documents that support those answers.

Collection	Task	Genre	File name	Lang.	Year	# of docs	Size	Topic lang.	# of topics	Relevance judgment
NTCIR-6 CLQA	QA	News articles	CIRB020 (A)	Traditional Chinese	1998-1999	249,203	320 MB	CJE	J-E/J-J/E-J: 200, C-E/C-C/E-C/E-E: 150	3 grades *
			Mainichi (B)	Japanese		220,078	282 MB
			EIRB010 (C)	English		10,204	24.5 MB
			Mainichi Daily (A)	English		12,723	33.3 MB
			Korea Times (A)	English		19,599	55.8 MB
			Hong Kong Standard (C)	English		96,683	252 MB

CIRB020: United Daily News, Economic Daily News, Min Sheng Daily, United Evening News, Star News (Copyright: UDN.COM) 1998-1999

EIRB010: Taiwan News (Copyright: Taiwan News); China Times English News (Copyright: China Times Inc.) 1998-1999

Korea Times (Copyright: Hankooki.com Co., Distribution rights: Korean Institute of Science and Technology Information) 1998-1999

Hong Kong Standard (Copyright: the Sing Tao Group, Distribution rights: Wisers Information Ltd.) 1998-1999

Questions

The format of the test questions is:

[QID]: "[Question]"

[QID] has the form [QuestionSetID]-[Lang]-[QuestionNo]-[SubQuestionNo], where [QuestionSetID] is "CLQA2" and [Lang] is one of JA (Japanese), ZH (Chinese), and EN (English).

[QuestionNo] consists of "S" or "T" followed by four digits ("S" for sample questions and "T" for test questions), and [SubQuestionNo] consists of two digits. An example question is:

CLQA2-EN-T3003-00: "Who was the UN secretary-general in 1999?"
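
As an illustration of the format described above, the following Python sketch splits a question line into its parts. The names `parse_question_line`, `Question`, `QID_RE`, and `LINE_RE` are hypothetical helpers introduced here, not part of the NTCIR distribution, and the regular expressions encode only what the description above states.

```python
import re
from typing import NamedTuple

# [QuestionSetID]-[Lang]-[QuestionNo]-[SubQuestionNo], as described above.
QID_RE = re.compile(r"^(CLQA2)-(JA|ZH|EN)-([ST])(\d{4})-(\d{2})$")
# A question line has the form  [QID]: "[Question]"
LINE_RE = re.compile(r'^(\S+):\s*"(.*)"\s*$')


class Question(NamedTuple):
    qid: str
    lang: str             # JA, ZH, or EN
    is_sample: bool       # True for "S" questions, False for "T" questions
    question_no: str      # e.g. "T3003"
    sub_question_no: str  # e.g. "00"
    text: str


def parse_question_line(line: str) -> Question:
    """Parse one line of a question file into its components."""
    line_match = LINE_RE.match(line.strip())
    if line_match is None:
        raise ValueError(f"not a question line: {line!r}")
    qid, text = line_match.groups()
    qid_match = QID_RE.match(qid)
    if qid_match is None:
        raise ValueError(f"malformed QID: {qid!r}")
    _, lang, kind, number, sub = qid_match.groups()
    return Question(qid, lang, kind == "S", kind + number, sub, text)


q = parse_question_line('CLQA2-EN-T3003-00: "Who was the UN secretary-general in 1999?"')
print(q.lang, q.question_no, q.sub_question_no)
```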

Eight question files are provided for the CLQA formal run, in which answers are restricted to named entities. Chinese question files are in BIG5 encoding, Japanese question files are in EUC-JP encoding, and English question files are in ASCII encoding. The names of the question files and their associations with the CLQA subtasks are as follows.

Subtasks	Question Set	Language	#Q	Remark
E-J	CLQA2-EN-T1200-ASCII.q	English	200	Same 200 English questions
E-E	CLQA2-EN-T0200-ASCII.q	English	200	Same 200 English questions
J-E	CLQA2-JA-T0200-EUC-JP.q	Japanese	200	Same 200 Japanese questions
J-J	CLQA2-JA-T1200-EUC-JP.q	Japanese	200	Same 200 Japanese questions
E-C	CLQA2-EN-T3200-ASCII.q	English	150	Same 150 English questions
E-E	CLQA2-EN-T2200-ASCII.q	English	150	Same 150 English questions
C-E	CLQA2-ZH-T2200-BIG5.q	Chinese	150	Same 150 Chinese questions
C-C	CLQA2-ZH-T3200-BIG5.q	Chinese	150	Same 150 Chinese questions
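
Because the three languages use different encodings, a reader has to pick the right codec per file. The sketch below derives the Python codec from the encoding label at the end of the file name. `CODECS`, `codec_for`, and `read_questions` are hypothetical names, and the assumption that each non-empty line holds one question is not stated in the description above.

```python
from pathlib import Path
from typing import List

# Encoding labels used in the question file names, mapped to Python codec names.
CODECS = {"ASCII": "ascii", "EUC-JP": "euc_jp", "BIG5": "big5"}


def codec_for(filename: str) -> str:
    """Return the Python codec implied by a question file name."""
    stem = Path(filename).stem  # e.g. "CLQA2-JA-T0200-EUC-JP"
    for label, codec in CODECS.items():
        if stem.endswith("-" + label):
            return codec
    raise ValueError(f"no known encoding label in {filename!r}")


def read_questions(path: str) -> List[str]:
    """Read a question file, assuming one question per non-empty line."""
    with open(path, encoding=codec_for(path)) as f:
        return [line.rstrip("\n") for line in f if line.strip()]


print(codec_for("CLQA2-ZH-T3200-BIG5.q"))
```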

Gold Standard

The tag set is described below.

<QASET>	</QASET>	The tag for the whole QA set
<VERSION>	</VERSION>	The version of this QA set
<QA>	</QA>	The tag for a QA cluster: a QA cluster contains a set of question sentences which are the same question but written in different languages; moreover, all of the correct answers found in the test collections (in any language) are also collected in a QA cluster
<QUESTION>	</QUESTION>	Question part in a QA cluster
<Q>	</Q>	The tag for a question sentence in a QA cluster, which has the following attributes: the attribute LANG (with values EN, JA, and ZH) denotes the language in which the question is written, and the attribute QID gives the question ID referred to in the CLQA subtasks
<Q_TYPE>	</Q_TYPE>	The question type of a question
<ANSWER>	</ANSWER>	Answer part in a QA cluster
<A>	</A>	The tag for a correct answer found in the test collections, which has the following attributes: the attribute LANG (with values EN, JA, and ZH) denotes the language in which the answer string is written, and the attribute DOCNO gives the document ID of a document where this answer appears.

An example of a QA cluster is as follows.

<QA>

<QUESTION>

<Q LANG="EN" QID="CLQA2-EN-T3003-00">Who was the UN secretary-general in 1999?</Q>

<Q LANG="ZH" QID="CLQA2-ZH-T3003-00">一九九九年時聯合國秘書長是誰？</Q>

<Q_TYPE>PERSON</Q_TYPE>

</QUESTION>

<ANSWER>

<A LANG="EN" DOCNO="HK-199908270280045">Kofi Annan</A>

</ANSWER>

</QA>
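
A gold standard file in this format can be read with a standard XML parser. The following Python sketch assumes the nesting implied by the tag table (a QUESTION and an ANSWER element inside each QA, all wrapped in QASET) and that the file is well-formed XML. `load_answers` and `SAMPLE` are hypothetical names, and the VERSION value in the sample is a placeholder.

```python
import xml.etree.ElementTree as ET

# Sample QA set built from the example cluster; the VERSION value is a placeholder.
SAMPLE = """<QASET>
<VERSION>sample</VERSION>
<QA>
<QUESTION>
<Q LANG="EN" QID="CLQA2-EN-T3003-00">Who was the UN secretary-general in 1999?</Q>
<Q LANG="ZH" QID="CLQA2-ZH-T3003-00">一九九九年時聯合國秘書長是誰？</Q>
<Q_TYPE>PERSON</Q_TYPE>
</QUESTION>
<ANSWER>
<A LANG="EN" DOCNO="HK-199908270280045">Kofi Annan</A>
</ANSWER>
</QA>
</QASET>"""


def load_answers(xml_text):
    """Map every QID to the (LANG, DOCNO, answer string) tuples of its QA cluster."""
    root = ET.fromstring(xml_text)
    answers = {}
    for qa in root.iter("QA"):
        found = [
            (a.get("LANG"), a.get("DOCNO"), (a.text or "").strip())
            for a in qa.iter("A")
        ]
        # All questions in a cluster share the same set of correct answers.
        for q in qa.iter("Q"):
            answers[q.get("QID")] = found
    return answers


for qid, found in load_answers(SAMPLE).items():
    print(qid, found)
```

Keying the result by QID means that a system run for any subtask can look up its answers directly, regardless of the question language.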

The gold standard files for CLQA subtasks are:

CLQA2-EJ-T0200-070131-UTF-8.xml for J-E/E-E subtasks

CLQA2-EJ-T1200-070131-UTF-8.xml for E-J/J-J subtasks

CLQA2-EN-T2200-v1.2-UTF-8.xml for E-E subtask

CLQA2-EN-T3200-v1.2-UTF-8.xml for E-C subtask

CLQA2-ZH-T3200-v1.2-UTF-8.xml for C-C subtask

(There is no gold standard file for the C-E subtask.)

The following are the procedures for obtaining the CLQA test collection. The test collection and data available from NII are free of charge.

The release of new test collections and correction information will be announced through the NTCIR mailing list.

The test collection was constructed and used for NTCIR. It may be used for research purposes only.
The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee. The providers of the document data understand the importance of test collections for research on information access technologies and have granted use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good and reliable relationship with the data producers and providers, we researchers must act as reliable partners: use the data only for research purposes under the user agreement, and take care not to violate any rights in the data.

Mandatory tags
<DOC>	</DOC>	The tag for each document
<DOCNO>	</DOCNO>	Document identifier
<LANG>	</LANG>	Language code: ZH, EN, JA, KR
<HEADLINE>	</HEADLINE>	Title of this news article
<DATE>	</DATE>	Issue date
<TEXT>	</TEXT>	Text of news article
Optional tags
<P>	</P>	Paragraph marker
<SECTION>	</SECTION>	Section identifier in original newspapers
<AE>	</AE>	Whether the article contains figures
<WORDS>	</WORDS>	Number of words in 2 bytes
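
To show how the mandatory tags might be read from a document record, here is a minimal Python sketch. It uses regular expressions rather than an XML parser, on the assumption (not confirmed above) that records may not be strictly well-formed XML. `parse_doc`, `split_docs`, and `SAMPLE_DOC` are hypothetical names, and the sample record is invented for illustration: only its DOCNO is taken from the gold standard example, while the headline, date format, and text are placeholders.

```python
import re

MANDATORY = ("DOCNO", "LANG", "HEADLINE", "DATE", "TEXT")

# Hypothetical record: headline, date format, and text are placeholders.
SAMPLE_DOC = """<DOC>
<DOCNO>HK-199908270280045</DOCNO>
<LANG>EN</LANG>
<HEADLINE>Placeholder headline</HEADLINE>
<DATE>1999-08-27</DATE>
<TEXT>
<P>Placeholder paragraph.</P>
</TEXT>
</DOC>"""


def split_docs(data: str):
    """Split a file's contents into individual <DOC>...</DOC> records."""
    return re.findall(r"<DOC>.*?</DOC>", data, re.DOTALL)


def parse_doc(record: str) -> dict:
    """Extract the mandatory fields of one record as a dict of strings."""
    fields = {}
    for tag in MANDATORY:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", record, re.DOTALL)
        if m is None:
            raise ValueError(f"missing mandatory tag <{tag}>")
        # Drop the optional paragraph markers and surrounding whitespace.
        fields[tag] = re.sub(r"</?P>", "", m.group(1)).strip()
    return fields


for record in split_docs(SAMPLE_DOC):
    print(parse_doc(record)["DOCNO"])
```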

NTCIR Project
NTCIR-6 CLQA
Research Purpose Use of Test Collection

NTCIR-6 CLQA (Cross Language Q&A data Test Collection)

(A)	CIRB020, Mainichi Daily (English), Korea Times, and Hong Kong Standard can be delivered by NII to non-participants for research purposes.
(B)	For non-participants, the Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research purposes from Nichigai Associates and Mainichi Shimbun. The document records on the CD-ROMs must be converted into the NTCIR standard record format with the script mai2.pl (information is currently available in Japanese only). To obtain the script mai2ntc-r.pl: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/mai2ntc-r.pl_txt README for mai2ntc-r.pl: http://research.nii.ac.jp/ntcir/permission/ntcir-4/script/READMEforMainichiScript-r.txt
(C)	EIRB010 is available to participants only. Hong Kong Standard is not currently available.
