NTCIR Project
NTCIR-11 IMine
Research Purpose Use of Test Collection

The NTCIR-11 INTENT (IMine) Test Collections comprise:

(a) NTCIR-11 IMine Chinese Subtopic Mining Test Collection
(b) NTCIR-11 IMine Japanese Subtopic Mining Test Collection
(c) NTCIR-11 IMine English Subtopic Mining Test Collection
(d) NTCIR-11 IMine Chinese Document Ranking Test Collection
(e) NTCIR-11 IMine English Document Ranking Test Collection
(f) NTCIR-11 IMine Japanese Search Task Mining Test Collection

Subtopic Mining Subtask: given a query, return a ranked list of "subtopic strings."
Document Ranking Subtask: given a query, return a diversified list of web pages.
Search Task Mining Subtask: given a query, return a ranked list of "task strings" that help to achieve the goal behind the query.

To evaluate Document Ranking, the document collections must be obtained separately:

Chinese: SogouT collection (see http://www.sogou.com/labs/dl/t-e.html)
English: ClueWeb12-B13 collection (see http://lemurproject.org/clueweb12/)

These document collections are not distributed by NII.
For the Subtopic Mining and Search Task Mining subtasks, the document collections are not required.

SogouT
(Version: 2012)

--This document collection is available from the Tsinghua-Sohu Joint Laboratory on Search Technology. The collection contains about 130 million Chinese Web pages together with the corresponding link graph. The size is roughly 5 TB uncompressed. The data was crawled and released in 2012.

Further information regarding this collection can be found on the page: http://www.sogou.com/labs/dl/t-e.html.
You can also contact chenjing directly to obtain the data set.

ClueWeb12-B13

--This document collection is available from the Language Technologies Institute at Carnegie Mellon University. The ClueWeb12-B13 collection is a subset of about 52 million English pages from the ClueWeb12 collection. We thank Prof. Jamie Callan and his team for providing the ClueWeb12-B13 collection, which dramatically reduces the cost to participants. The data was crawled between February and May 2012.

Further information regarding the collections can be found on the page: http://lemurproject.org/clueweb12/

The Subtopic Mining Test Collection comprises the following data:

(1) 50 topics (queries)
(2) Second-level hierarchical intents for each topic, obtained by manually clustering the subtopic strings submitted by the Subtopic Mining participants
(3) An intent probability for each intent, estimated through assessor voting
(4) Pooled subtopics that correspond to each intent

The Document Ranking Test Collection comprises the following data:

(1) 50 topics (the same as in the Subtopic Mining subtask)
(2) Pooled and judged documents with graded relevance, from L0 (judged nonrelevant) to L4 (highly relevant).
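
As a rough illustration of how graded judgments such as L0 to L4 can be used, the following Python sketch computes a plain nDCG score for one ranked list. It is not the official IMine evaluation procedure or file format: the linear gain mapping, the treatment of unjudged documents as L0, the function names, and the in-memory dictionary of judgments are all assumptions made for this example. Refer to the README and the overview paper for the measures actually used in the task.

```python
import math

# Assumed gain values: a simple linear mapping from relevance level to gain.
GAIN = {"L0": 0, "L1": 1, "L2": 2, "L3": 3, "L4": 4}


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """nDCG@k for one topic.

    ranked_doc_ids: list of document IDs in system rank order.
    judgments: dict mapping document ID to a level string "L0".."L4".
    Unjudged documents are treated as L0 (an assumption of this sketch).
    """
    gains = [GAIN[judgments.get(doc_id, "L0")] for doc_id in ranked_doc_ids[:k]]
    ideal = sorted((GAIN[level] for level in judgments.values()), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical judgments for a single topic.
    judgments = {"d1": "L4", "d2": "L0", "d3": "L2"}
    print(ndcg_at_k(["d1", "d3", "d2"], judgments))
    print(ndcg_at_k(["d2", "d3", "d1"], judgments))
```

A ranking that places documents in decreasing order of relevance level scores 1.0; any other ordering of the same documents scores lower.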

The Search Task Mining Test Collection comprises the following data:

(1) 50 topics (queries)
(2) Gold standard task strings with their importance for each topic
(3) Pooled participant task strings, with information on how they match the gold standard task strings

For more details, please refer to the README file and
the NTCIR-11 IMine Task overview paper available at the
NTCIR-11 online proceedings:
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings11/NTCIR/toc_ntcir.html

The test collection and data are available from NII free of charge.

NTCIR-11 IMine Task data are downloadable from NII/IDR at:
http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html

Reference

The terms of use [PDF]
Overview of the NTCIR-11 IMine Task [PDF]

NTCIR-11 IMine website
http://www.thuir.org/IMine/

Tools
http://research.nii.ac.jp/ntcir/tools/tools-en.html

Contact us : ntc-secretariat

Notice

The test collection was constructed and used for the NTCIR project. It is usable only for research purposes.
The document collection included in the test collection was made available to NII for use in the NTCIR project free of charge or for a fee. The providers of the document data understand the importance of such test collections in research on information access technologies and have kindly given their permission to use the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good relationship with the data producers and providers, we researchers must be reliable partners and use the data only for research purposes under the user agreement, and we must use the data carefully so as not to violate copyright.

Updated on : 2015-10-01