The NTCIR-8 ACLIA test collection can be used for experiments in Complex Question Answering and Information Retrieval across Chinese (Simplified (CS) and Traditional (CT)), Japanese (JA), and English (EN).
Document Data
Collection | Genre | Name | Language | Year | Volume
NTCIR-8 ACLIA | News articles | Xinhua (B) | Simplified Chinese | 2002-2005 | 308,845
NTCIR-8 ACLIA | News articles | UDN (A) | Traditional Chinese | 2002-2005 | 1,663,517
NTCIR-8 ACLIA | News articles | Mainichi (B) | Japanese | 2002-2005 | 377,941
Task Data
Question language | Number of questions | Relevance judgment
CJE | 100* for each language pair | Binary pyramid nugget matching (QA); Three-level relevance judgment (IR)
Xinhua (B) | For non-participants, LDC2007T38: Chinese Gigaword Third Edition, which includes the Xinhua news articles, is available from the Linguistic Data Consortium (LDC). The document records must be converted with the Perl conversion script.
UDN (A) | NII delivers the data to non-participants for research use.
Mainichi (B) | For non-participants, the Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research use from Nichigai Associates and Mainichi Shimbun (information is currently available in Japanese only). The document records on the CD-ROMs must be converted into the NTCIR standard document format with the script mai2ntc-r-utf.pl. Script mai2ntc-r-utf.pl: http://research.nii.ac.jp/ntcir/tools/mai2ntc-r-utf.pl_txt README for mai2ntc-r-utf.pl: http://research.nii.ac.jp/ntcir/tools/READMEforMainichiScript-r-utf.txt
*A few IR4QA topics for which only a very small number of relevant documents were returned have been removed from the formal run. See the overview paper for details.
The document sets included in the NTCIR-8 ACLIA test collection are as follows.
A.1 Chinese Data Set (Simplified)
A.2 Chinese Data Set (Traditional)
A.3 Japanese Data Set
Tags
Tags are based on the TREC format.
Mandatory tags
Opening tag | Closing tag | Description
<DOC> | </DOC> | The tag for each document
<DOCNO> | </DOCNO> | Document identifier
<LANG> | </LANG> | Language code: CS, CT, EN, JA
<HEADLINE> | </HEADLINE> | Title of the news article
<DATE> | </DATE> | Issue date
<TEXT> | </TEXT> | Text of the news article
Optional tags
Opening tag | Closing tag | Description
<P> | </P> | Paragraph marker
<SECTION> | </SECTION> | Section identifier in the original newspaper
<AE> | </AE> | Whether the article contains figures
<WORDS> | </WORDS> | Number of words in 2 bytes
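The tag layout above can be read with a few regular expressions. The following Python sketch is illustrative only: the sample record, its DOCNO and DATE values, and the function name parse_docs are assumptions made for this example and do not come from the collection. Regular expressions are used instead of an XML parser on the assumption that a TREC-style file may hold several top-level <DOC> records.

```python
import re

# Hypothetical sample record; every field value here is invented for illustration.
SAMPLE = """<DOC>
<DOCNO>JA-000000000</DOCNO>
<LANG>JA</LANG>
<HEADLINE>Sample headline</HEADLINE>
<DATE>2002-01-01</DATE>
<TEXT>
<P>First paragraph.</P>
<P>Second paragraph.</P>
</TEXT>
</DOC>
"""

# Mandatory tags listed in the table above (besides <DOC> itself).
MANDATORY = ("DOCNO", "LANG", "HEADLINE", "DATE", "TEXT")


def parse_docs(raw):
    """Yield one dict per <DOC> record, keyed by mandatory tag name."""
    for body in re.findall(r"<DOC>(.*?)</DOC>", raw, flags=re.DOTALL):
        record = {}
        for tag in MANDATORY:
            match = re.search(rf"<{tag}>(.*?)</{tag}>", body, flags=re.DOTALL)
            record[tag] = match.group(1).strip() if match else None
        # Split the text into paragraphs when the optional <P> tag is present;
        # otherwise keep the whole text as a single paragraph.
        text = record["TEXT"] or ""
        paragraphs = re.findall(r"<P>(.*?)</P>", text, flags=re.DOTALL)
        record["PARAGRAPHS"] = [p.strip() for p in paragraphs] or [text]
        yield record


docs = list(parse_docs(SAMPLE))
for doc in docs:
    print(doc["DOCNO"], doc["LANG"], len(doc["PARAGRAPHS"]))
```

A missing mandatory tag is stored as None rather than raising an error, so a caller can decide how strictly to validate converted files.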
Example Question
<TOPIC_SET>
<METADATA>
<DESCRIPTION>NTCIR-8 ACLIA questions</DESCRIPTION>
<VERSION>v1.0</VERSION>
<LANGUAGE TARGET="JA" />
<CORPUS>Mainichi Newspaper (2002-2005)</CORPUS>
</METADATA>
<TOPIC ID="ACLIA2-JA-T1">
<QUESTION LANG="EN"><![CDATA[What is Fatah?]]></QUESTION>
<QUESTION LANG="JA"><![CDATA[ファタハとはどんな組織ですか?]]></QUESTION>
<NARRATIVE LANG="EN"><![CDATA[The analyst is especially interested in major characteristics of the organization called Fatah.]]></NARRATIVE>
<NARRATIVE LANG="JA"><![CDATA[ファタハの一般的な情報と活動内容についての回答を求めています。]]></NARRATIVE>
</TOPIC>
</TOPIC_SET>
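Because the example question file is well-formed XML, it can be loaded with Python's standard xml.etree.ElementTree module. The sketch below reuses the topic shown above in shortened form; the function name load_topics and the shape of the returned dictionary are assumptions made for this example, not part of the collection's tooling.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example topic file shown above.
TOPICS_XML = """<TOPIC_SET>
<METADATA>
<DESCRIPTION>NTCIR-8 ACLIA questions</DESCRIPTION>
<VERSION>v1.0</VERSION>
<LANGUAGE TARGET="JA" />
</METADATA>
<TOPIC ID="ACLIA2-JA-T1">
<QUESTION LANG="EN"><![CDATA[What is Fatah?]]></QUESTION>
<QUESTION LANG="JA"><![CDATA[ファタハとはどんな組織ですか?]]></QUESTION>
<NARRATIVE LANG="EN"><![CDATA[The analyst is especially interested in major characteristics of the organization called Fatah.]]></NARRATIVE>
</TOPIC>
</TOPIC_SET>"""


def load_topics(xml_text):
    """Return {topic_id: {"QUESTION": {lang: text}, "NARRATIVE": {lang: text}}}."""
    root = ET.fromstring(xml_text)
    topics = {}
    for topic in root.iter("TOPIC"):
        entry = {"QUESTION": {}, "NARRATIVE": {}}
        for field in ("QUESTION", "NARRATIVE"):
            for elem in topic.findall(field):
                # CDATA sections are exposed as ordinary element text.
                entry[field][elem.get("LANG")] = (elem.text or "").strip()
        topics[topic.get("ID")] = entry
    return topics


topics = load_topics(TOPICS_XML)
target_lang = ET.fromstring(TOPICS_XML).find("METADATA/LANGUAGE").get("TARGET")
print(target_lang, topics["ACLIA2-JA-T1"]["QUESTION"]["EN"])
```

Keying questions and narratives by their LANG attribute makes it straightforward to pick the source-language question for a cross-lingual run and the target-language one for a monolingual run.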
Gold Standard
Example Gold Standard
<TOPIC_SET>
<METADATA>
<DESCRIPTION>NTCIR-8 ACLIA Training questions and answers</DESCRIPTION>
<VERSION>v1.0</VERSION>
<LANGUAGE TARGET="JA" />
<CORPUS>Mainichi Newspaper (2002-2005)</CORPUS>
</METADATA>
<TOPIC ID="ACLIA2-JA-T1" TITLE="ファタハ">
<QUESTION LANG="EN"><![CDATA[What is Fatah?]]></QUESTION>
<QUESTION LANG="JA"><![CDATA[ファタハとはどんな組織ですか?]]></QUESTION>
<ANSWERTYPE>DEFINITION</ANSWERTYPE>
<NARRATIVE LANG="EN"><![CDATA[The analyst is especially interested in major characteristics of the organization called Fatah.]]></NARRATIVE>
<NARRATIVE LANG="JA"><![CDATA[ファタハの一般的な情報と活動内容についての回答を求めています。]]></NARRATIVE>
<ANSWER>
<TEXT ID="1" DOCNO="JA-010101032"><![CDATA[パレスチナ解放機構(PLO)の主流派ファタハ]]></TEXT>
<TEXT ID="2" DOCNO="JA-011218020"><![CDATA[ファタハが反イスラエル抵抗闘争の主体となっている]]></TEXT>
<TEXT ID="3" DOCNO="JA-211221040"><![CDATA[アラファト議長の最大支持基盤であるファタハは13日、]]></TEXT>
<NUGGET ID="1" VITAL="10" NONVITAL="0" SCORE="1.0"><![CDATA[パレスチナ解放機構(PLO)主流派]]></NUGGET>
<NUGGET ID="2" VITAL="3" NONVITAL="7" SCORE="0.3"><![CDATA[反イスラエル抵抗闘争の主体となっている]]></NUGGET>
<NUGGET ID="3" VITAL="9" NONVITAL="1" SCORE="0.9"><![CDATA[アラファト議長の最大支持基盤]]></NUGGET>
</ANSWER>
</TOPIC>
</TOPIC_SET>
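In the example gold standard above, each NUGGET carries VITAL and NONVITAL counts and a SCORE, and in all three nuggets SCORE equals VITAL / (VITAL + NONVITAL). The Python sketch below reads the nuggets and computes a simple score-weighted recall for a set of matched nugget IDs. This is only an illustration: the weighted_recall formula, the function names, and the matched set are assumptions made here, and the official evaluation measures are defined in the overview paper.

```python
import xml.etree.ElementTree as ET

# Nugget portion of the example gold standard shown above.
GOLD_XML = """<TOPIC_SET>
<TOPIC ID="ACLIA2-JA-T1" TITLE="ファタハ">
<ANSWER>
<NUGGET ID="1" VITAL="10" NONVITAL="0" SCORE="1.0"><![CDATA[パレスチナ解放機構(PLO)主流派]]></NUGGET>
<NUGGET ID="2" VITAL="3" NONVITAL="7" SCORE="0.3"><![CDATA[反イスラエル抵抗闘争の主体となっている]]></NUGGET>
<NUGGET ID="3" VITAL="9" NONVITAL="1" SCORE="0.9"><![CDATA[アラファト議長の最大支持基盤]]></NUGGET>
</ANSWER>
</TOPIC>
</TOPIC_SET>"""


def load_nuggets(xml_text, topic_id):
    """Return {nugget_id: {"vital", "nonvital", "score", "text"}} for one topic."""
    root = ET.fromstring(xml_text)
    nuggets = {}
    for topic in root.iter("TOPIC"):
        if topic.get("ID") != topic_id:
            continue
        for nugget in topic.iter("NUGGET"):
            nuggets[nugget.get("ID")] = {
                "vital": int(nugget.get("VITAL")),
                "nonvital": int(nugget.get("NONVITAL")),
                "score": float(nugget.get("SCORE")),
                "text": (nugget.text or "").strip(),
            }
    return nuggets


def weighted_recall(nuggets, matched_ids):
    """Illustrative measure (an assumption): matched score mass / total score mass."""
    total = sum(n["score"] for n in nuggets.values())
    hit = sum(nuggets[i]["score"] for i in matched_ids if i in nuggets)
    return hit / total if total else 0.0


nuggets = load_nuggets(GOLD_XML, "ACLIA2-JA-T1")

# Check the relationship observed in the example: SCORE = VITAL / (VITAL + NONVITAL).
for n in nuggets.values():
    assert abs(n["score"] - n["vital"] / (n["vital"] + n["nonvital"])) < 1e-9

# Hypothetical system answer that matched nuggets 1 and 3 but missed nugget 2.
print(round(weighted_recall(nuggets, {"1", "3"}), 3))
```

Nugget IDs are kept as strings, exactly as they appear in the ID attribute, so matched IDs must be passed in the same form.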
The following is the procedure to obtain the test collection. The test collection and data are available from NII free of charge.
Task data (without document data)
Document data
Please read "The terms of use" carefully before you apply for the data.
- Application Form [txt]
- User Agreement Form (sent by email)
Reference
Address
NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo
102-8430, JAPAN
PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat
Notice
The test collection was constructed and used for the NTCIR project. It is usable only for research purposes.
The document collection included in the test collection was made available
to NII for use in the NTCIR project free of charge or for a fee. The providers
of the document data understand the importance of such test collections
in research on information access technologies and have kindly given their
permission to use the data for research purposes. Please remember that
the document data in the NTCIR test collection is copyrighted and has commercial
value as data. To maintain a good relationship with the data producers/providers,
we researchers must be reliable partners and use the data only for research
purposes under the user agreement, and we must use the data carefully so
as not to violate copyright.