NTCIR Project
NTCIR-5 WEB(Web Retrieval Test Collection)
Research Purpose Use of Test Collection


NTCIR-5 WEB (IR Test Collection)

Test Collection
The NTCIR-5 WEB test collection consists of "Document Data", a collection of text data processed from documents crawled mainly from "Web servers of Japan", and "Task Data", a collection of search topics and relevance judgments for the documents.

The "Task Data" consists of 400 mandatory topics and 841 optional topics for 'Navigational Retrieval (Navi 2)'. The "Document Data", named 'NW1000G-04', consists of approximately 100 million web documents totaling approximately 1400GB.

Collection: NTCIR-5 WEB
Task: IR
Documents:
  Genre: Web (html/text)
  Filename: NW1000G-04
  Lang.: multiple*1
  Year: crawled in 2004-2005
  # of docs: approx. 100M
  Size: approx. 1400GB
Task data:
  Topic lang.: J
  # of topics: 400+841(opt.)
  Relevance judge, # grades: 2

* All data will be delivered by NII.
*1 Mostly Japanese or English (some in other languages)

Documents, Topics and Questions


NW1000G-04 consists of several versions of Web page data and several attached lists. The Web pages were crawled mainly from "Web sites of Japan" from January 2004 to January 2005. The Web page data contains four versions: raw data (raw), data with Japanese characters converted to EUC (euc), data with unnecessary tags and others removed (cook), and data processed with a morphological analyzer (mecab). There are four kinds of lists: a list of crawled sites (sitelist), a list of documents (doclist), a list of links (linklist) and a list of anchor texts (anclist).

The task data was created in the Navigational Retrieval Subtask 2 (Navi-2) of the NTCIR-5 WEB Task. It consists of "topics" and "qrels". The topics are divided into two parts, "mandatory topics" and "optional topics", and the qrels are divided into the corresponding parts. The 400 mandatory topics were used for the system evaluation of the formal run at Navi-2. The 841 optional topics were not used for the system evaluation; they were used together with the 400 mandatory topics only for more detailed analysis and for enhancing the test collection. Submitting run results for the optional topics was not required, and some teams did not submit them. Participants were instructed to process the optional topics under the same system conditions as the mandatory topics.
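As a rough illustration of how the mandatory/optional split might be handled in an evaluation script, the sketch below scores a run on the mandatory topics only and, separately, on all topics. The topic IDs, document IDs, in-memory layout and the use of mean reciprocal rank are all hypothetical choices made for this example; they are not the actual file formats of the task data, nor necessarily the measures used at Navi-2.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranking: List[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    run: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
    topic_ids: List[str],
) -> float:
    """Average the reciprocal rank over the given topics.

    A topic with no submitted ranking contributes a score of 0.0.
    """
    scores = [
        reciprocal_rank(run.get(topic, []), qrels.get(topic, set()))
        for topic in topic_ids
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: IDs and structure are assumptions for illustration only.
mandatory_topics = ["M001", "M002"]
optional_topics = ["O001"]
qrels = {"M001": {"docA"}, "M002": {"docC"}, "O001": {"docE"}}

# This imaginary team submitted results for the mandatory topics only.
run = {"M001": ["docA", "docB"], "M002": ["docB", "docC"]}

# Score restricted to the mandatory topics.
formal_score = mean_reciprocal_rank(run, qrels, mandatory_topics)

# Score over mandatory and optional topics together, for further analysis.
all_score = mean_reciprocal_rank(run, qrels, mandatory_topics + optional_topics)

print(formal_score, all_score)
```

Keeping the two topic lists separate means a team that skipped the optional topics is not penalized in the score computed over the mandatory topics.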

The NTCIR-5 WEB Test Collection (Documents, Task data) is available from NII/IDR. To obtain the data, please refer to the NII/IDR site.


The terms of use [PDF]
README(document data)[txt] 
Task Overview of NTCIR 5 WEB
Overview of the NTCIR-5 WEB Navigational Retrieval Subtask 2 (Navi-2) 
Overview of the NTCIR-5 WEB Query Term Expansion Subtask         

The test collection has been constructed and used for NTCIR. It may be used for research purposes only. The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee. The providers of the document data kindly understand the importance of the test collection to research on information access technologies and have granted use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good and reliable relationship with the data producers and providers, we researchers must act as reliable partners: use the data only for research purposes under the user agreement, and take care not to violate any rights in the data.