NTCIR Project
Research Purpose Use of Test Collection




The NTCIR-9 INTENT (INTENT-1) Test Collections are the following:

(a) NTCIR-9 INTENT Chinese Subtopic Mining Test Collection
(b) NTCIR-9 INTENT Japanese Subtopic Mining Test Collection
(c) NTCIR-9 INTENT Chinese Document Ranking Test Collection
(d) NTCIR-9 INTENT Japanese Document Ranking Test Collection

The NTCIR-9 1CLICK Test Collection is described on a separate page.

For evaluating Document Ranking, web document corpora (SogouT for Chinese and ClueWeb09-JA for Japanese) need to be obtained separately (not from NII).
For Subtopic Mining, the document collections are not required.

SogouT: This document collection is available from the Tsinghua-Sohu Joint Laboratory on Search Technology. The collection contains about 130M Chinese pages together with the corresponding link graph. The size is roughly 5TB uncompressed. The data was crawled and released in November 2008.

Further information regarding this collection can be found on the page: http://www.sogou.com/labs/dl/t.html
You can also contact chenjing directly to obtain the data set.

ClueWeb09-JA: This document collection is available from the Language Technologies Institute at Carnegie Mellon University. The ClueWeb09-JA collection is composed of all the 67M Japanese pages in the ClueWeb09 collection. We thank Prof. Jamie Callan and his team for providing the ClueWeb09-JA collection, which dramatically reduces the cost to participants. The data was crawled during January and February 2009.

Further information regarding the collections can be found on the page: http://boston.lti.cs.cmu.edu/Data/clueweb09/

Each Subtopic Mining test collection comprises the following:
(1) 100 topics (queries);
(2) Intents for each topic, obtained by manually clustering the subtopic strings submitted by the Subtopic Mining participants;
(3) An intent probability for each intent, estimated through assessor voting;
(4) Pooled subtopics that correspond to each intent, where each subtopic belongs to exactly one intent.
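
The four components above can be pictured as a small in-memory data model. The following is only a sketch: the class names (Topic, Intent), the field names, the example query, and the vote counts are all hypothetical, and the on-disk format of the distributed files is not described on this page. It also assumes that an intent probability is a vote count normalized over the intents of a topic, which is one plausible reading of "estimated through assessor voting", not a confirmed procedure.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    intent_id: str
    description: str
    votes: int  # hypothetical: number of assessor votes for this intent
    subtopics: list[str] = field(default_factory=list)  # pooled subtopic strings


@dataclass
class Topic:
    topic_id: str
    query: str
    intents: list[Intent] = field(default_factory=list)

    def intent_probabilities(self) -> dict[str, float]:
        """Normalize vote counts so the probabilities of a topic sum to 1.

        Assumption: probability = votes for the intent / total votes.
        """
        total = sum(intent.votes for intent in self.intents)
        if total == 0:
            raise ValueError(f"topic {self.topic_id} has no votes")
        return {i.intent_id: i.votes / total for i in self.intents}

    def subtopic_to_intent(self) -> dict[str, str]:
        """Map each pooled subtopic to its single intent.

        Raises if a subtopic appears under more than one intent, since
        each subtopic must belong to exactly one intent.
        """
        mapping: dict[str, str] = {}
        for intent in self.intents:
            for subtopic in intent.subtopics:
                if subtopic in mapping:
                    raise ValueError(f"subtopic {subtopic!r} has two intents")
                mapping[subtopic] = intent.intent_id
        return mapping


# Hypothetical example topic; not taken from the actual collection.
example_topic = Topic(
    topic_id="0001",
    query="jaguar",
    intents=[
        Intent("i1", "the car maker", votes=6, subtopics=["jaguar cars", "jaguar dealer"]),
        Intent("i2", "the animal", votes=3, subtopics=["jaguar habitat"]),
        Intent("i3", "the operating system", votes=1, subtopics=["mac os x jaguar"]),
    ],
)
```

Calling example_topic.intent_probabilities() yields one probability per intent, and example_topic.subtopic_to_intent() fails loudly if the "exactly one intent" constraint is violated, which can be a useful sanity check when loading the data.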

Each Document Ranking test collection comprises the following:
(1) 100 topics (same as Subtopic Mining);
(2) Intents for each topic (same as Subtopic Mining);
(3) An intent probability for each intent (same as Subtopic Mining);
(4) Pooled and judged documents with graded relevance, from L0 (judged nonrelevant) to L4 (highly relevant).

For computing evaluation metrics such as intent recall and D-measures, the NTCIREVAL toolkit can be used.
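
To give a rough idea of what these metrics measure, the following sketch computes intent recall (I-rec), D-nDCG, and their combination D#-nDCG for one topic. It is not NTCIREVAL and does not reproduce its input formats or options. The function names, the data layout (a dict from intent ID to per-document relevance levels), and the example values are hypothetical. The sketch also assumes that the gain value of a document equals its relevance level (L1 = 1, ..., L4 = 4) and that gamma = 0.5; the toolkit's actual settings may differ.

```python
import math

# rel[intent_id][doc_id] = relevance level 0..4 (L0..L4) for that intent
Rel = dict[str, dict[str, int]]


def intent_recall(ranked: list[str], rel: Rel, cutoff: int) -> float:
    """Fraction of the topic's intents covered by the top `cutoff` documents."""
    top = ranked[:cutoff]
    covered = {i for i, judged in rel.items() if any(judged.get(d, 0) > 0 for d in top)}
    return len(covered) / len(rel)


def global_gain(doc: str, probs: dict[str, float], rel: Rel) -> float:
    """Sum of per-intent gains weighted by intent probability.

    Assumption: the gain for an intent equals the relevance level itself.
    """
    return sum(p * rel[i].get(doc, 0) for i, p in probs.items())


def d_ndcg(ranked: list[str], probs: dict[str, float], rel: Rel, cutoff: int) -> float:
    """nDCG computed over global gains, with a log2(rank + 1) discount."""
    gains = [global_gain(d, probs, rel) for d in ranked[:cutoff]]
    judged_docs = {d for judged in rel.values() for d in judged}
    # The ideal list sorts all judged documents by global gain.
    ideal = sorted((global_gain(d, probs, rel) for d in judged_docs), reverse=True)[:cutoff]
    dcg = sum(g / math.log2(r + 1) for r, g in enumerate(gains, start=1))
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def d_sharp_ndcg(ranked, probs, rel, cutoff, gamma=0.5):
    """Linear combination of intent recall and D-nDCG."""
    return gamma * intent_recall(ranked, rel, cutoff) + (1 - gamma) * d_ndcg(
        ranked, probs, rel, cutoff
    )


# Hypothetical topic with two intents and three judged documents.
probs = {"i1": 0.7, "i2": 0.3}
rel = {"i1": {"d1": 4, "d2": 1}, "i2": {"d3": 2}}
run = ["d1", "d3", "d2"]
score = d_sharp_ndcg(run, probs, rel, cutoff=3)
```

In this example the run covers both intents within the top two documents, so intent recall at cutoff 2 is 1, while D-nDCG at cutoff 3 is slightly below 1 because d3 is ranked above d2 although d2 has the higher global gain.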

More details, including the test collection statistics, can be found at the INTENT-1 homepage:

and the NTCIR-9 INTENT Task Overview:


The test collection and data are available from NII free of charge.


Contact us : ntc-secretariat


The test collection was constructed and used for the NTCIR project. It is usable only for research purposes.
The document collection included in the test collection was made available to NII for use in the NTCIR project, either free of charge or for a fee. The providers of the document data understand the importance of such test collections for research on information access technologies and have kindly given their permission to use the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good relationship with the data producers and providers, we researchers must be reliable partners: we must use the data only for research purposes under the user agreement, and we must handle it carefully so as not to violate copyright.