The NTCIR-7 ACLIA test collection can be used for experiments in complex question answering and information retrieval across Chinese (Simplified (CS) and Traditional (CT)), Japanese (JA), and English (EN).
| Collection | Task | Genre | Name | Language | Year | Volume | Size | Question language | Number of questions | Relevance judgment |
|---|---|---|---|---|---|---|---|---|---|---|
| NTCIR-7 ACLIA | QA | News articles | Xinhua (A) | Simplified Chinese | 1998-2001 | 295,875 | 511 MB | CJE | EN-JA: 100, JA-JA: 100, EN-CS: 100, CS-CS: 100, EN-CT: 100, CT-CT: 100 | Binary decision (system response conceptually containing the nugget or not) |
| | | | Lianhe Zaobao (B) | Simplified Chinese | 1998-2001 | 249,287 | 411 MB | | | |
| | | | CIRB20 (B) | Traditional Chinese | 1998-1999 | 249,508 | 320 MB | | | |
| | | | CIRB40 (B) | Traditional Chinese | 2000-2001 | 901,446 | 582 MB | | | |
| | | | Mainichi (C) | Japanese | 1998-2001 | 419,759 | 544 MB | | | |
| | IR | News articles | Xinhua (A) | Simplified Chinese | 1998-2001 | 295,875 | 511 MB | CJE | EN-JA: 98*, JA-JA: 98*, EN-CS: 97*, CS-CS: 97*, EN-CT: 95*, CT-CT: 95* | 3-level relevance grading |
| | | | Lianhe Zaobao (B) | Simplified Chinese | 1998-2001 | 249,287 | 411 MB | | | |
| | | | CIRB20 (B) | Traditional Chinese | 1998-1999 | 249,508 | 320 MB | | | |
| | | | CIRB40 (B) | Traditional Chinese | 2000-2001 | 901,446 | 582 MB | | | |
| | | | Mainichi (C) | Japanese | 1998-2001 | 419,759 | 544 MB | | | |
The columns Genre through Size describe the document data; the question and relevance judgment columns describe the task data.
*Topics for which no relevant document was returned have been removed.
(A) | For non-participants, Chinese Gigaword, which includes the Xinhua news articles, is available from the Linguistic Data Consortium (LDC). The document records must be converted with this Perl script. |
(B) | NII delivers this data to non-participants for research use. |
(C) | For non-participants, the Mainichi Newspaper Japanese Article Data Full-Text Article Database CD-ROMs are available for research use from Nichigai Associates and Mainichi Shimbun. The document records on the CD-ROMs must be converted into the NTCIR standard record format with the script mai2.pl (information is currently available in Japanese only). To obtain the script mai2ntc-r.pl: https://research.nii.ac.jp/ntcir/permission/ntcir-4/script/mai2ntc-r.pl_txt README (mai2ntc-r.pl): https://research.nii.ac.jp/ntcir/permission/ntcir-4/script/READMEforMainichiScript-r.txt |
The document sets included in the NTCIR-7 ACLIA test collection are as follows.
A.1 Chinese Data Set (Simplified)
Xinhua (Copyright: Xinhua News Agency) 1998-2001
Lianhe Zaobao (Copyright: Singapore Press Holdings Limited) 1998-2001
A.2 Chinese Data Set (Traditional)
CIRB020: United Daily News, Economic Daily News, Min Sheng Daily, United Evening News, Star News (Copyright: UDN.COM) 1998-1999
CIRB40: United Daily News, United Express, Ming Hseng News, Economic Daily News (Copyright: UDN.COM news agency) 2000-2001
A.3 Japanese Data Set
Mainichi Newspaper (Copyright: Mainichi Newspapers Co. Ltd.) 1998-2001
Tags
Tags are based on the TREC format.
Mandatory tags
| Opening tag | Closing tag | Description |
|---|---|---|
| <DOC> | </DOC> | The tag for each document |
| <DOCNO> | </DOCNO> | Document identifier |
| <LANG> | </LANG> | Language code: ZH, EN, JA, KR |
| <HEADLINE> | </HEADLINE> | Title of this news article |
| <DATE> | </DATE> | Issue date |
| <TEXT> | </TEXT> | Text of news article |

Optional tags
| Opening tag | Closing tag | Description |
|---|---|---|
| <P> | </P> | Paragraph marker |
| <SECTION> | </SECTION> | Section identifier in original newspapers |
| <AE> | </AE> | Whether the article contains figures |
| <WORDS> | </WORDS> | Number of words in 2 bytes |
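As an illustration of how the tags above might be read, the following is a minimal Python sketch that splits a file into `<DOC>` records and extracts the mandatory fields. The sample record, its field values, the date format, and the helper names `field` and `parse_docs` are all hypothetical and are not taken from the collection. Regular expressions are used here on the assumption that, as is common for TREC-style files, the records may not form well-formed XML (for example, there may be no single root element).

```python
import re

# Hypothetical sample record; all field values are invented for illustration.
SAMPLE = """<DOC>
<DOCNO>JA-000000001</DOCNO>
<LANG>JA</LANG>
<HEADLINE>Sample headline</HEADLINE>
<DATE>1998-01-01</DATE>
<TEXT>
<P>First paragraph.</P>
<P>Second paragraph.</P>
</TEXT>
</DOC>
"""

# Mandatory tags listed in the table above (besides <DOC> itself).
MANDATORY = ("DOCNO", "LANG", "HEADLINE", "DATE", "TEXT")

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def field(body, tag):
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", body, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_docs(raw):
    """Yield one dict per <DOC> record, keyed by the mandatory tag names."""
    for match in DOC_RE.finditer(raw):
        body = match.group(1)
        doc = {tag: field(body, tag) for tag in MANDATORY}
        text = doc["TEXT"] or ""
        # <P> is optional; fall back to the whole <TEXT> body when it is absent.
        paragraphs = [p.strip() for p in re.findall(r"<P>(.*?)</P>", text, re.DOTALL)]
        doc["PARAGRAPHS"] = paragraphs or ([text] if text else [])
        yield doc


docs = list(parse_docs(SAMPLE))
for doc in docs:
    print(doc["DOCNO"], doc["LANG"], doc["HEADLINE"], len(doc["PARAGRAPHS"]))
```

Real files should be opened with the encoding documented for each document set; the sketch does not assume one.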
Example Question
<TOPIC_SET>
  <METADATA>
    <DESCRIPTION>NTCIR-7 ACLIA Training questions</DESCRIPTION>
    <VERSION>v20071116</VERSION>
    <LANGUAGE TARGET="JA" />
    <CORPUS>Mainichi Newspaper (1998-2001)</CORPUS>
  </METADATA>
  <TOPIC ID="ACLIA1-JA-T1">
    <QUESTION LANG="EN"><![CDATA[What is Fatah?]]></QUESTION>
    <QUESTION LANG="JA"><![CDATA[ファタハとはどんな組織ですか?]]></QUESTION>
    <NARRATIVE LANG="EN"><![CDATA[The analyst is especially interested in major characteristics of the organization called Fatah.]]></NARRATIVE>
    <NARRATIVE LANG="JA"><![CDATA[ファタハの一般的な情報と活動内容についての回答を求めています。]]></NARRATIVE>
  </TOPIC>
</TOPIC_SET>
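A question file in the format of the example above can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is one possible reading, not an official loader: the function name `load_topics` and the shape of the returned dictionaries are assumptions, and the embedded XML is an abbreviated copy of the example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example question file shown above.
TOPIC_XML = """<TOPIC_SET>
  <METADATA>
    <DESCRIPTION>NTCIR-7 ACLIA Training questions</DESCRIPTION>
    <VERSION>v20071116</VERSION>
    <LANGUAGE TARGET="JA" />
    <CORPUS>Mainichi Newspaper (1998-2001)</CORPUS>
  </METADATA>
  <TOPIC ID="ACLIA1-JA-T1">
    <QUESTION LANG="EN"><![CDATA[What is Fatah?]]></QUESTION>
    <QUESTION LANG="JA"><![CDATA[ファタハとはどんな組織ですか?]]></QUESTION>
    <NARRATIVE LANG="EN"><![CDATA[The analyst is especially interested in major characteristics of the organization called Fatah.]]></NARRATIVE>
  </TOPIC>
</TOPIC_SET>"""


def load_topics(xml_text):
    """Return a list of topics with questions and narratives keyed by LANG.

    Hypothetical helper: the output layout is a choice made for this sketch.
    """
    root = ET.fromstring(xml_text)
    # The target (document) language is recorded once in the metadata block.
    target = root.find("METADATA/LANGUAGE").get("TARGET")
    topics = []
    for topic in root.iter("TOPIC"):
        topics.append({
            "id": topic.get("ID"),
            "target": target,
            # CDATA sections are returned as ordinary element text.
            "questions": {q.get("LANG"): q.text for q in topic.findall("QUESTION")},
            "narratives": {n.get("LANG"): n.text for n in topic.findall("NARRATIVE")},
        })
    return topics


topics = load_topics(TOPIC_XML)
for t in topics:
    print(t["id"], t["target"], t["questions"]["EN"])
```

Keying questions by `LANG` makes it straightforward to pick the source-language question for a cross-lingual run (for example EN-JA) or the target-language question for a monolingual one (JA-JA).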
Gold Standard
Example Gold Standard
<TOPIC_SET>
  <METADATA>
    <DESCRIPTION>NTCIR-7 ACLIA Training questions and answers</DESCRIPTION>
    <VERSION>v20071116</VERSION>
    <LANGUAGE TARGET="JA" />
    <CORPUS>Mainichi Newspaper (1998-2001)</CORPUS>
  </METADATA>
  <TOPIC ID="ACLIA1-JA-T1" TITLE="ファタハ">
    <QUESTION LANG="EN"><![CDATA[What is Fatah?]]></QUESTION>
    <QUESTION LANG="JA"><![CDATA[ファタハとはどんな組織ですか?]]></QUESTION>
    <ANSWERTYPE>DEFINITION</ANSWERTYPE>
    <NARRATIVE LANG="EN"><![CDATA[The analyst is especially interested in major characteristics of the organization called Fatah.]]></NARRATIVE>
    <NARRATIVE LANG="JA"><![CDATA[ファタハの一般的な情報と活動内容についての回答を求めています。]]></NARRATIVE>
    <ANSWER>
      <TEXT ID="1" DOCNO="JA-010101032"><![CDATA[パレスチナ解放機構(PLO)の主流派ファタハ]]></TEXT>
      <TEXT ID="2" DOCNO="JA-011218020"><![CDATA[ファタハが反イスラエル抵抗闘争の主体となっている]]></TEXT>
      <TEXT ID="3" DOCNO="JA-211221040"><![CDATA[アラファト議長の最大支持基盤であるファタハは13日、]]></TEXT>
      <NUGGET ID="1" VITAL="10" NONVITAL="0" SCORE="1.0"><![CDATA[パレスチナ解放機構(PLO)主流派]]></NUGGET>
      <NUGGET ID="2" VITAL="3" NONVITAL="7" SCORE="0.3"><![CDATA[反イスラエル抵抗闘争の主体となっている]]></NUGGET>
      <NUGGET ID="3" VITAL="9" NONVITAL="1" SCORE="0.9"><![CDATA[アラファト議長の最大支持基盤]]></NUGGET>
    </ANSWER>
  </TOPIC>
</TOPIC_SET>
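In the gold standard example above, each nugget's SCORE happens to equal VITAL / (VITAL + NONVITAL): 10/10, 3/10, and 9/10. This page does not state that rule, so it should be treated as an observation about the example rather than a specification. The Python sketch below reads the nuggets and checks that relationship; the function name `load_nuggets` and the `consistent` field are hypothetical, and the embedded XML is an abbreviated copy of the example.

```python
import math
import xml.etree.ElementTree as ET

# Abbreviated copy of the gold standard example shown above (nuggets only).
GOLD_XML = """<TOPIC_SET>
  <TOPIC ID="ACLIA1-JA-T1" TITLE="ファタハ">
    <ANSWERTYPE>DEFINITION</ANSWERTYPE>
    <ANSWER>
      <NUGGET ID="1" VITAL="10" NONVITAL="0" SCORE="1.0"><![CDATA[パレスチナ解放機構(PLO)主流派]]></NUGGET>
      <NUGGET ID="2" VITAL="3" NONVITAL="7" SCORE="0.3"><![CDATA[反イスラエル抵抗闘争の主体となっている]]></NUGGET>
      <NUGGET ID="3" VITAL="9" NONVITAL="1" SCORE="0.9"><![CDATA[アラファト議長の最大支持基盤]]></NUGGET>
    </ANSWER>
  </TOPIC>
</TOPIC_SET>"""


def load_nuggets(xml_text):
    """Return one dict per nugget, with the observed vital-vote share."""
    root = ET.fromstring(xml_text)
    rows = []
    for topic in root.iter("TOPIC"):
        for nugget in topic.iter("NUGGET"):
            vital = int(nugget.get("VITAL"))
            nonvital = int(nugget.get("NONVITAL"))
            score = float(nugget.get("SCORE"))
            votes = vital + nonvital
            # Assumption drawn from the example only: score = vital / total votes.
            share = vital / votes if votes else 0.0
            rows.append({
                "topic": topic.get("ID"),
                "nugget": nugget.get("ID"),
                "text": nugget.text,
                "score": score,
                "consistent": math.isclose(score, share),
            })
    return rows


nuggets = load_nuggets(GOLD_XML)
for n in nuggets:
    print(n["topic"], n["nugget"], n["score"], n["consistent"])
```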
The procedures for obtaining the NTCIR-7 ACLIA test collection are as follows. The test collection and data available from NII are free of charge.
Task Data (without document data)
Document Data
Please read "The terms of use" carefully before you apply for the data.
- Application Form [txt]
- User agreement form (sent by email)
NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN
PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: idr-ntcir
Notice
The test collection was constructed and used for NTCIR. It may be used for research purposes only.
The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee. The providers of the document data kindly understood the importance of the test collection for research on information access technologies and granted use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a reliable and good relationship with the data producers and providers, we researchers must behave as reliable partners, use the data only for research purposes under the user agreement, and take care not to violate any of their rights.