NTCIR Project
NTCIR-8 ACLIA
(Advanced Cross-Lingual Information Retrieval and Question Answering)
Research Purpose Use of Test Collection


NTCIR-8 ACLIA (Advanced Cross-Lingual Information Retrieval and Question Answering)

The NTCIR-8 ACLIA test collection can be used for experiments on complex question answering and information retrieval across Chinese (Simplified (CS) and Traditional (CT)), Japanese (JA), and English (EN), such as:

CCLQA (Complex Cross-Lingual Question Answering including factoid questions)
- Cross-Lingual Question Answering (EN-JA/EN-CS/EN-CT subtask)*
- Monolingual Question Answering (JA-JA, CS-CS, and CT-CT subtask)
IR4QA (Information Retrieval for Question Answering)
- Cross-Lingual Information Retrieval (EN-JA/EN-CS/EN-CT subtask)*
- Monolingual Information Retrieval (JA-JA, CS-CS, and CT-CT subtask)

*An "X-Y subtask" accepts a question in language X and retrieves or extracts answers in language Y.
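
The "X-Y" naming convention can be handled mechanically. The following is a minimal sketch, assuming subtask names are always two language codes joined by a single hyphen; the function name `parse_subtask` and the `LANGUAGES` table are illustrative and not part of any NTCIR tool.

```python
# Language codes as used on this page.
LANGUAGES = {
    "CS": "Simplified Chinese",
    "CT": "Traditional Chinese",
    "JA": "Japanese",
    "EN": "English",
}

def parse_subtask(name):
    """Split an "X-Y" subtask name into (question_lang, answer_lang, is_cross_lingual)."""
    question_lang, answer_lang = name.split("-")
    for code in (question_lang, answer_lang):
        if code not in LANGUAGES:
            raise ValueError(f"unknown language code: {code}")
    # A subtask is cross-lingual when the question and answer languages differ.
    return question_lang, answer_lang, question_lang != answer_lang

print(parse_subtask("EN-JA"))
print(parse_subtask("CT-CT"))
```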

The documents are full-text news articles in Chinese and Japanese (no English documents are included) that were published in Asia from 2002 to 2005 (the formal-run data). The test collection includes 100 Japanese topics (for JA-JA), 100 Simplified Chinese topics (for CS-CS), 100 Traditional Chinese topics (for CT-CT), 300 English topics (100 each for EN-JA, EN-CS, and EN-CT), weighted answer nuggets, and the IDs of answer-bearing documents. The IR4QA task shares its topic set with CCLQA. The test collection also includes relevance judgments for IR4QA evaluation.

Collection	Genre	Name	Language	Year	Volume	Question language	Number of questions	Relevance judgment
NTCIR-8 ACLIA	News articles	Xinhua (B)	Simplified Chinese	2002-2005	308,845	CJE	100* for each language pair	Binary pyramid nugget matching (QA); three-level relevance judgment (IR)
		UDN (A)	Traditional Chinese	2002-2005	1,663,517
		Mainichi (B)	Japanese	2002-2005	377,941

Xinhua (B)	For non-participants, LDC2007T38 (Chinese Gigaword Third Edition), which includes the Xinhua news articles, is available from the Linguistic Data Consortium (LDC). The document records must be converted with this Perl script.
UDN (A)	NII delivers the data to non-participants for research purposes.
Mainichi (B)	For non-participants, the Mainichi Newspaper Japanese Article Data (Full-Text Article Database) CD-ROMs are available for research purposes from Nichigai Associates and Mainichi Shimbun (information is currently available in Japanese only). The document records on the CD-ROMs must be converted into the NTCIR standard document format with the script mai2ntc-r-utf.pl. Script: http://research.nii.ac.jp/ntcir/tools/mai2ntc-r-utf.pl_txt README for mai2ntc-r-utf.pl: http://research.nii.ac.jp/ntcir/tools/READMEforMainichiScript-r-utf.txt

*A few IR4QA topics for which only a very small number of relevant documents were returned were removed from the formal run. See the overview paper for details.

The document sets included in the NTCIR-8 ACLIA test collection are as follows.

A.1 Chinese Data Set (Simplified)

Xinhua (Copyright: Xinhua News Agency) 2002-2005 (formal run)
Xinhua (Copyright: Xinhua News Agency) 1998-2001 (training)
Lianhe Zaobao (Copyright: Singapore Press Holdings Limited) 1998-2001 (training)

A.2 Chinese Data Set (Traditional)

UDN (Copyright: United Daily News) 2002-2005 (formal run)
CIRB011: China Times, Commercial Times, China Times Express, Central Daily News, China Daily News (Copyright: UDN.COM news agency) 1998-1999 (training)
CIRB020: United Daily News, Economic Daily News, Min Sheng Daily, United Evening News, Star News (Copyright: UDN.COM) 1998-1999 (training)
CIRB040: United Daily News, United Express, Ming Hseng News, Economic Daily News (Copyright: UDN.COM news agency) 2000-2001 (training)

A.3 Japanese Data Set

Mainichi Newspaper (Copyright: Mainichi Newspapers Co. Ltd.) 2002-2005 (formal run)
Mainichi Newspaper (Copyright: Mainichi Newspapers Co. Ltd.) 1998-2001 (training)

Tags

Tags are based on the TREC format.

Mandatory tags
<DOC>	</DOC>	The tag for each document
<DOCNO>	</DOCNO>	Document identifier
<LANG>	</LANG>	Language code: CS, CT, EN, JA
<HEADLINE>	</HEADLINE>	Title of this news article
<DATE>	</DATE>	Issue date
<TEXT>	</TEXT>	Text of news article
Optional tags
<P>	</P>	Paragraph marker
<SECTION>	</SECTION>	Section identifier in original newspapers
<AE>	</AE>	Whether the article contains figures
<WORDS>	</WORDS>	Number of words in 2 bytes
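
As an illustration of the tag set above, the following minimal sketch extracts the mandatory fields from TREC-style document records. It assumes each field appears as a plain `<TAG>...</TAG>` pair inside a `<DOC>` record; the sample record, its DOCNO, and its date format are made up for the example and are not taken from the collection. Regular expressions are used instead of an XML parser because TREC-style files are not guaranteed to be well-formed XML.

```python
import re

MANDATORY_TAGS = ("DOCNO", "LANG", "HEADLINE", "DATE", "TEXT")

def parse_documents(raw):
    """Yield one dict per <DOC>...</DOC> record, keyed by mandatory tag name."""
    for match in re.finditer(r"<DOC>(.*?)</DOC>", raw, flags=re.DOTALL):
        body = match.group(1)
        record = {}
        for tag in MANDATORY_TAGS:
            field = re.search(rf"<{tag}>(.*?)</{tag}>", body, flags=re.DOTALL)
            if field is None:
                raise ValueError(f"missing mandatory tag <{tag}>")
            record[tag] = field.group(1).strip()
        # Drop the optional paragraph markers from the article text.
        record["TEXT"] = re.sub(r"</?P>", "", record["TEXT"]).strip()
        yield record

# Hypothetical record for illustration only.
SAMPLE = """
<DOC>
<DOCNO>JA-EXAMPLE-0001</DOCNO>
<LANG>JA</LANG>
<HEADLINE>Example headline</HEADLINE>
<DATE>2002-01-01</DATE>
<TEXT>
<P>First paragraph.</P>
<P>Second paragraph.</P>
</TEXT>
</DOC>
"""

docs = list(parse_documents(SAMPLE))
print(docs[0]["DOCNO"], docs[0]["LANG"])
print(docs[0]["TEXT"])
```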

Format

Question Format

Example Question

<TOPIC_SET>

  <METADATA>
    <DESCRIPTION>NTCIR-8 ACLIA questions</DESCRIPTION>
    <VERSION>v1.0</VERSION>
    <LANGUAGE TARGET="JA" />
    <CORPUS>Mainichi Newspaper (2002-2005)</CORPUS>
  </METADATA>
  
  <TOPIC ID="ACLIA2-JA-T1">
    <QUESTION LANG="EN"><![CDATA[What is Fatah?]]></QUESTION>
    <QUESTION LANG="JA"><![CDATA[ファタハとはどんな組織ですか？]]></QUESTION>
    <NARRATIVE LANG="EN"><![CDATA[The analyst is especially interested in major characteristics of the organization called Fatah.]]></NARRATIVE>
    <NARRATIVE LANG="JA"><![CDATA[ファタハの一般的な情報と活動内容についての回答を求めています。]]></NARRATIVE>
  </TOPIC>
  
</TOPIC_SET>

For more details about the question format, visit this page.
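
A topic file in this format can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened copy of the example question; the function name `load_topics` and the returned dictionary layout are illustrative choices, not part of the official tools.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example question shown above.
TOPIC_XML = """<TOPIC_SET>
  <METADATA>
    <LANGUAGE TARGET="JA" />
  </METADATA>
  <TOPIC ID="ACLIA2-JA-T1">
    <QUESTION LANG="EN"><![CDATA[What is Fatah?]]></QUESTION>
    <QUESTION LANG="JA"><![CDATA[ファタハとはどんな組織ですか？]]></QUESTION>
    <NARRATIVE LANG="EN"><![CDATA[The analyst is especially interested in major characteristics of the organization called Fatah.]]></NARRATIVE>
  </TOPIC>
</TOPIC_SET>"""

def load_topics(xml_text, lang):
    """Return {topic ID: {"question": ..., "narrative": ...}} for one language."""
    root = ET.fromstring(xml_text)
    topics = {}
    for topic in root.iter("TOPIC"):
        question = topic.find(f"QUESTION[@LANG='{lang}']")
        narrative = topic.find(f"NARRATIVE[@LANG='{lang}']")
        topics[topic.get("ID")] = {
            # A field is None when the topic has no entry in the requested language.
            "question": question.text if question is not None else None,
            "narrative": narrative.text if narrative is not None else None,
        }
    return topics

topics_en = load_topics(TOPIC_XML, "EN")
topics_ja = load_topics(TOPIC_XML, "JA")
print(topics_en["ACLIA2-JA-T1"]["question"])
print(topics_ja["ACLIA2-JA-T1"]["question"])
```

Selecting the English question corresponds to a cross-lingual run (for example EN-JA), and selecting the Japanese question to the monolingual JA-JA run.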

Gold Standard

Example Gold Standard

<TOPIC_SET>
  <METADATA>
    <DESCRIPTION>NTCIR-8 ACLIA Training questions and answers</DESCRIPTION>
    <VERSION>v1.0</VERSION>
    <LANGUAGE TARGET="JA" />
    <CORPUS>Mainichi Newspaper (2002-2005)</CORPUS>
  </METADATA>
  
  <TOPIC ID="ACLIA2-JA-T1" TITLE="ファタハ">
    <QUESTION LANG="EN"><![CDATA[What is Fatah?]]></QUESTION>
    <QUESTION LANG="JA"><![CDATA[ファタハとはどんな組織ですか？]]></QUESTION>
    <ANSWERTYPE>DEFINITION</ANSWERTYPE>
    <NARRATIVE LANG="EN"><![CDATA[The analyst is especially interested in major characteristics of the organization called Fatah.]]></NARRATIVE>
    <NARRATIVE LANG="JA"><![CDATA[ファタハの一般的な情報と活動内容についての回答を求めています。]]></NARRATIVE>
    <ANSWER>
      <TEXT ID="1" DOCNO="JA-010101032"><![CDATA[パレスチナ解放機構（ＰＬＯ）の主流派ファタハ]]></TEXT>
      <TEXT ID="2" DOCNO="JA-011218020"><![CDATA[ファタハが反イスラエル抵抗闘争の主体となっている]]></TEXT>
      <TEXT ID="3" DOCNO="JA-211221040"><![CDATA[アラファト議長の最大支持基盤であるファタハは１３日、]]></TEXT>
      <NUGGET ID="1" VITAL="10" NONVITAL="0" SCORE="1.0"><![CDATA[パレスチナ解放機構（ＰＬＯ）主流派]]></NUGGET>
      <NUGGET ID="2" VITAL="3" NONVITAL="7" SCORE="0.3"><![CDATA[反イスラエル抵抗闘争の主体となっている]]></NUGGET>
      <NUGGET ID="3" VITAL="9" NONVITAL="1" SCORE="0.9"><![CDATA[アラファト議長の最大支持基盤]]></NUGGET>
    </ANSWER>
  </TOPIC>
  
</TOPIC_SET>

For more details about the question format, visit this page.
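
The nugget weights in a gold standard file can be read the same way. The sketch below parses the NUGGET elements from a shortened copy of the example and checks one observation: in this example, SCORE equals VITAL / (VITAL + NONVITAL). Whether that relation holds for every nugget in the collection is an assumption, and the helper names `load_nuggets` and `vital_fraction` are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example gold standard shown above.
GOLD_XML = """<TOPIC_SET>
  <TOPIC ID="ACLIA2-JA-T1" TITLE="ファタハ">
    <ANSWERTYPE>DEFINITION</ANSWERTYPE>
    <ANSWER>
      <NUGGET ID="1" VITAL="10" NONVITAL="0" SCORE="1.0"><![CDATA[パレスチナ解放機構（ＰＬＯ）主流派]]></NUGGET>
      <NUGGET ID="2" VITAL="3" NONVITAL="7" SCORE="0.3"><![CDATA[反イスラエル抵抗闘争の主体となっている]]></NUGGET>
      <NUGGET ID="3" VITAL="9" NONVITAL="1" SCORE="0.9"><![CDATA[アラファト議長の最大支持基盤]]></NUGGET>
    </ANSWER>
  </TOPIC>
</TOPIC_SET>"""

def load_nuggets(xml_text):
    """Return {topic ID: [nugget dict, ...]} with numeric vote counts and scores."""
    root = ET.fromstring(xml_text)
    result = {}
    for topic in root.iter("TOPIC"):
        result[topic.get("ID")] = [
            {
                "id": nugget.get("ID"),
                "vital": int(nugget.get("VITAL")),
                "nonvital": int(nugget.get("NONVITAL")),
                "score": float(nugget.get("SCORE")),
                "text": nugget.text,
            }
            for nugget in topic.iter("NUGGET")
        ]
    return result

def vital_fraction(nugget):
    """Share of 'vital' votes among all votes cast for a nugget."""
    total = nugget["vital"] + nugget["nonvital"]
    return nugget["vital"] / total if total else 0.0

nuggets = load_nuggets(GOLD_XML)["ACLIA2-JA-T1"]
for nugget in nuggets:
    print(nugget["id"], nugget["score"], vital_fraction(nugget))
```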

The following is the procedure to obtain the test collection. The test collection and data are available from NII free of charge.

Task data (without document data)

NTCIR-8 ACLIA task data can be downloaded from NII/IDR at:
http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html

Document data

Please read "The terms of use" carefully before you apply for the data.

1. Fill out the "Application form of the test collection" and send it by e-mail to ntc-secretariat as an attachment.

   *In section 1 of the form (the name of the test collection), enter "NTCIR-8 ACLIA document data".
2. NII will review your application and notify you of the result within a few days.

   <If your application is accepted>
   Send us the "User Agreement (memorandum on Permission to Use Test Collection)" by postal mail, prepared as follows.

- Download the form and print two copies (double-sided).
- Fill out and sign both copies. (Signatures are required on both.)
- Send them by postal mail or courier to the address below.
3. After we receive the agreement forms, we will distribute the data.

After NII counter-signs the forms, one copy of the User Agreement Form will be returned to you and one will be kept by NII.
Because it documents your permission to use the data, please keep it in a safe place for the duration of the term of use.

Documents to submit

Application Form [txt]
User Agreement Form (sent by email)

Reference

The terms of use [PDF]
Overview of the NTCIR-8 ACLIA: Advanced Cross-Lingual Information Access

Overview of the NTCIR-8 ACLIA IR4QA Task

NTCIR-8 ACLIA Task website

Tools

Address

NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo
102-8430, JAPAN
PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat

Notice

The test collection was constructed and used for the NTCIR project. It is usable only for research purposes.
The document collection included in the test collection was made available to NII for use in the NTCIR project free of charge or for a fee. The providers of the document data understand the importance of such test collections in research on information access technologies and have kindly given their permission to use the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good relationship with the data producers and providers, we researchers must be reliable partners and use the data only for research purposes under the user agreement, and we must use the data carefully so as not to violate copyright.

Updated on : 2015-07-23
ntc-admin
