NTCIR Project
NTCIR-5 CLQA
Research Purpose Use of Test Collection



NTCIR-5 CLQA (Cross-Language Question Answering) Test Collection


Test Collection

In the NTCIR-5 CLQA Task, the following subtasks were conducted.

1. Japanese to English (J-E) subtask
Find answers to Japanese questions in English documents.

2. Chinese to English (C-E) subtask
Find answers to Chinese questions in English documents.

3. English to Japanese (E-J) subtask
Find answers to English questions in Japanese documents.

4. English to Chinese (E-C) subtask
Find answers to English questions in Chinese documents.

5. Chinese to Chinese (C-C) subtask
Find answers to Chinese questions in Chinese documents.


Participants developed their systems using sample data (200-300 question-answer pairs). In the formal run, 200 questions were provided for each subtask.

Collection

Task: NTCIR-5 CLQA

Subtask  Genre          Data           Doc. Lang.           Year       # of docs  Size      Question Lang.  # of questions  Stages
J-E      News articles  Daily Yomiuri  English              2000-2001  17,741     22.9 MB   Japanese        200             Top 1
E-J      News articles  Yomiuri        Japanese             2000-2001  658,719    343.3 MB  English         200             Top 1
C-E      News articles  Daily Yomiuri  English              2000-2001  17,741     22.9 MB   Chinese         200             Top 1
E-C      News articles  CIRB040r       Traditional Chinese  2000-2001  901,446    581.7 MB  English         200             Top 1
C-C      News articles  CIRB040r       Traditional Chinese  2000-2001  901,446    581.7 MB  Chinese         200             Top 1


-- For NTCIR Workshop participants, data is available from NII.
-- For non-participants, data is available from third parties (newspaper companies, LDC, etc.).


Documents, Topics and Questions
@@


(1) Corpora

a) Japanese news articles published in Japan in the years 2000-2001. The corpus contains document records extracted from the Yomiuri Newspaper Japanese Article Data. Researchers other than the NTCIR-5 CLQA participants need to purchase a research-purpose license from Nihon Database Kaihatsu Co. Ltd. (detailed information is available at http://www.ndk.co.jp/jisyo_kiji/newspaper-yomiuri.html; currently the information is available in Japanese only). The document format of the Data shall be converted into the NTCIR tag set.

b) English news articles published in Japan in the years 2000-2001. The corpus contains document records extracted from The Daily Yomiuri Newspaper Article Data. Researchers other than the NTCIR-5 CLQA participants need a research-purpose license from Nihon Database Kaihatsu Co. Ltd. (detailed information is available at http://www.ndk.co.jp/jisyo_kiji/newspaper-yomiuri.html; currently the information is available in Japanese only). The document format of the Data shall be converted into the NTCIR tag set.

c) CIRB040r is available for research purposes from NII.


(2) Tag set  

Mandatory tags

<DOC>

</DOC>

The tag for each document

<DOCNO>

</DOCNO>

Document identifier

<LANG>

</LANG>

Language code: ZH, EN, or JA

<HEADLINE>

</HEADLINE>

Title of this news article

<DATE>

</DATE>

Issue date

<TEXT>

</TEXT>

Text of news article

Optional tags

<P>

</P>

Paragraph marker

<SECTION>

</SECTION>

Section identifier in original newspapers

<AE>

</AE>

Whether the document contains figures

<WORDS>

</WORDS>

Number of words, counted in 2-byte characters (for Mainichi Newspaper)
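To illustrate the tag set, here is a minimal sample record; the document number, date, and text are invented for illustration and do not come from the actual collection:

```
<DOC>
<DOCNO>ENY-20001101XXXX0001</DOCNO>
<LANG>EN</LANG>
<HEADLINE>Sample headline</HEADLINE>
<DATE>2000-11-01</DATE>
<TEXT>
<P>First paragraph of the article body.</P>
</TEXT>
</DOC>
```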





The task data consists of 5 question set files and their answers.

Questions
For formal runs, there are 200 test questions for each subtask. Answers to these questions are restricted to named entities (please refer to the section "Answer Types"). The format of test questions is:

[QID]: "[Question]"

[QID] has the form [QuestionSetID]-[Lang]-[QuestionNo]-[SubQuestionNo], where [QuestionSetID] is "CLQA1" and [Lang] is one of JA, ZH, and EN. (You will find that the language code for Chinese here is "ZH" rather than the "CH" used in the document set. We would like to use ISO 639 from now on; however, for historical reasons, the language code for documents is still "CH" this time. "ZH" will be used for documents when we build a new document set.) [QuestionNo] consists of "S" or "T" followed by four digits ("S" is for sample questions and "T" for test questions), and [SubQuestionNo] consists of two digits. An example question is:

CLQA1-EN-S0001-00: "When Queen Victoria died?"
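As a sketch, the QID structure described above can be validated with a short script; `parse_qid` is a hypothetical helper written for this page, not part of any official CLQA tool:

```python
import re

# Split a question ID of the form
# [QuestionSetID]-[Lang]-[QuestionNo]-[SubQuestionNo] into its components.
QID_PATTERN = re.compile(
    r"^(?P<set>CLQA1)-(?P<lang>JA|ZH|EN)-(?P<qno>[ST]\d{4})-(?P<subqno>\d{2})$"
)

def parse_qid(qid):
    """Return the four QID components as a dict, or None if malformed."""
    m = QID_PATTERN.match(qid)
    return m.groupdict() if m else None

parts = parse_qid("CLQA1-EN-S0001-00")
print(parts["lang"], parts["qno"])  # EN S0001
```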

We release 5 question files for the CLQA1 formal run. Chinese question files are in BIG5 encoding, Japanese question files are in EUC-JP encoding, and English question files are in ASCII encoding. The names of the question files and their associations with the CLQA1 subtasks are as follows.


CLQA1-JA-T0200-EUC-JP.q is for J-E subtask.
CLQA1-ZH-T0200-BIG5.q is for C-E subtask.
CLQA1-EN-T0200-ASCII.q is for E-J subtask.
CLQA1-EN-T1200-ASCII.q is for E-C subtask.
CLQA1-ZH-T1200-BIG5.q is for C-C subtask.


For the purpose of constructing QA systems, sample questions were prepared:
- 300 sample questions for J-E subtask, E-J subtask, and C-E subtask, and
- 200 sample questions for C-C subtask and E-C subtask.

Answer Types

The types of answers to the testing questions in NTCIR-5 CLQA are restricted to named entity types. They are:

IREX named entity types:
ORGANIZATION
PERSON
LOCATION
ARTIFACT (product name, book title, pact, law, ...)
DATE
TIME
MONEY
PERCENT
Other named entity type:
NUMEX (numerical expressions other than MONEY and PERCENT)

For the full definition of IREX named entity types, please refer to its web pages:
http://nlp.cs.nyu.edu/irex/index-e.html
http://nlp.cs.nyu.edu/irex/NE/df990214.txt (definition, Japanese only)

Answer Format

Please use the following output format when submitting system responses. Different answers (system responses) in the same language to the same question are written together on the same line. The format is:

[QID], [Lang](, "[Answer]", [DOCNO], [Reserved], [Reserved])*

where [QID] is the same as in the question file format above. It must be unique in the file, and the [QID]s must appear in the same order as in the corresponding question file. It is, however, allowed that some [QID]s do not appear in the file. [Lang] is one of JA, ZH, and EN.
[Answer] is the answer to the question. It is written in CSV format, which uses double quotes so that a string may contain commas (,) and newlines. If a double quote occurs in the answer string, write "" (the double quote twice) to denote its occurrence.
[DOCNO] is the identifier of the article (or one of the articles) used in the process of deriving the answer. The value of the <DOCNO> tag in the documents is used as the identifier. [Reserved] is a field for future use. Examples of answer output are:

CLQA1-EN-T0001-00, EN, "1901", ENY-20001101CYM0398, ,
CLQA1-EN-T0001-00, JA, "１９０１年", JAY-20001101CYM0398, , , "一九〇一年", JAY-20001101CYM0398, ,

If there is no answer response to a question, the line terminates after the [Lang] field, such as:

CLQA1-EN-T0001-00, EN
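The quoting rules above can be sketched in a few lines; `format_answer_line` is a hypothetical helper for producing one output line, not part of the official format checker:

```python
def format_answer_line(qid, lang, answers):
    """Build one line of the answer file.

    `answers` is a list of (answer_string, docno) pairs; an empty list
    yields the no-answer form "[QID], [Lang]".
    """
    fields = [qid, lang]
    for answer, docno in answers:
        # Double any embedded double quote, then wrap the answer in quotes;
        # the two trailing empty strings are the [Reserved] fields.
        fields += ['"%s"' % answer.replace('"', '""'), docno, "", ""]
    return ", ".join(fields)

line = format_answer_line("CLQA1-EN-T0001-00", "EN",
                          [("1901", "ENY-20001101CYM0398")])
```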

To ensure the correctness of the answer format, we will release a program that participants can use to check their answer files before submitting results.

The following are the procedures for obtaining this CLQA test collection. The test collection and data available from NII are free of charge.


Task@data ( without document data )


Document data


Address


NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN

PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat

Mailing List

The release of new test collections and correction information will be announced through the NTCIR mailing list.

Notice

The test collections have been constructed and used for NTCIR. They are usable for research purposes only.
The document collections included in the test collections were provided to NII for use in NTCIR, free of charge or for a fee. The providers of the document data kindly understood the importance of test collections for research on information access technologies and granted the use of their data for research purposes. Please remember that the document data in the NTCIR test collections is copyrighted and has commercial value. For our continued good and reliable relationship with the data producers/providers, it is important that we researchers behave as reliable partners: use the data only for research purposes under the user agreement, and handle it carefully so as not to violate any of their rights.