NTCIR Project
NTCIR-4 WEB (Web Retrieval Test Collection)
Research Purpose Use of Test Collection


NTCIR-4 WEB (IR Test Collection)

Test Collection
The NTCIR-4 WEB test collection consists of "Document Data", a collection of tagged text data from crawled documents gathered mainly from the .jp domain of the Web, and "Task Data", a collection of search topics and relevance judgments for the documents.

There are two "Task Data" sets: 'Informational Retrieval (Info 1)' and 'Navigational Retrieval (Navi 1)'. The "Document Data" is 'NW100G-01', the same as that of NTCIR-3 WEB, which consists of approximately 100GB of web documents.

The "Document Data" is delivered on DVD-Rs or on a hard disk drive, and the "Task Data" is delivered electronically over the Internet.

Collection          Task  Documents                                                                         Task data
                          Genre            Filename     Lang.       Year             # of docs         Size   Topic lang.  # of topics  Relevance judgment
NTCIR-4 WEB Info 1  IR    Web (html/text)  NW100G-01*2  multiple*1  crawled in 2001  about 11,000,000  100GB  J*           80           4 grades
NTCIR-4 WEB Navi 1  IR    Web (html/text)  NW100G-01*2  multiple*1  crawled in 2001  about 11,000,000  100GB  J*           300          3 grades

* English translation is available
*1 Mostly Japanese or English (some in other languages)
*2 NW100G-01 is common for NTCIR-3 WEB and NTCIR-4 WEB.

*The entire collection is provided by NII.

Documents, Topics and Questions

  

About Document Data:
The document data used for NTCIR-4 WEB is "NW100G-01", which was also used for NTCIR-3 WEB. "NW100G-01" consists of approximately 11 million Web documents, approximately 100GB in size, and their metadata. Its content is outlined below.

Crawling condition:
    Target sites: HTTP servers in the .jp domain
    Target ports: All ports
    Target pages: Text files such as HTML and plaintext
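
The crawling condition above can be read as a simple filter on a URL and its content type. The following is a minimal sketch of such a filter, not the crawler actually used to build NW100G-01. The function name, the set of accepted media types (only text/html and text/plain), and the example URLs are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Assumption: "text files such as HTML and plaintext" is narrowed here to
# these two media types; the real crawl may have accepted others.
TEXT_MEDIA_TYPES = {"text/html", "text/plain"}


def matches_crawl_condition(url: str, content_type: str) -> bool:
    """Return True if a page would satisfy the stated crawling condition."""
    parts = urlsplit(url)

    # Target sites: HTTP servers only (urlsplit lowercases the scheme).
    if parts.scheme != "http":
        return False

    # Target sites: hosts in the .jp domain (ignore a trailing root dot).
    host = (parts.hostname or "").rstrip(".")
    if not host.endswith(".jp"):
        return False

    # Target ports: all ports, so the port is deliberately not checked.

    # Target pages: text files; drop parameters such as "; charset=EUC-JP".
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in TEXT_MEDIA_TYPES
```

For example, under these assumptions http://www.example.co.jp:8080/index.html served as text/html passes the filter, while an https URL, a .com host, or an image/png response does not.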

NW100G-01 contains the following files. Refer to the readme.data file delivered with the document data for details.

List files
    aliaslist : list of aliased sites
    doclist : list of web documents whose page data are delivered
    duplist : list of duplicate pages
    sitelist : list of crawled sites
    targetlist : list of documents that are search targets
    linklist : list of links from pages in doclist to pages in targetlist

Document files
    raw: Original document data as they were crawled and their metadata
    euc: Document data with Japanese characters converted to EUC code and their metadata
    cooked: Document data with unnecessary tags and elements removed and their metadata


About Search Topics:

- Informational Retrieval (Info 1) -
Topics were created assuming a web search situation where a searcher looks for documents that contain topical information matching his/her information need. Efforts were made to avoid large discrepancies between the search topics and the aim and background of the search. Care was also taken to avoid time-dependent topics. After a selection process and several rounds of pooling by the organizers, comprehensive relevance judgments were made for 35 of the 153 topics delivered to participants of this project. Relevance judgments focusing on higher-ranked results were then made for 80 topics (including the 35 initially assessed topics).

- Navigational Retrieval (Navi 1) -
Topics were created assuming a web search situation where the searcher looks for representative web pages of a known item. Eleven topic creators each wrote down items they naturally search for in their daily activities. The organizers selected 300 topics, considering the balance of types and the appropriateness of the topics.


About Relevance judgment:

- Informational Retrieval (Info 1) -
Four relevance levels (highly relevant, relevant, partially relevant and irrelevant) were used to judge the relevance of the contents of the web documents. Two cases were evaluated: (i) the searcher looks for relevant documents comprehensively, and (ii) the searcher looks for only a few relevant documents.

- Navigational Retrieval (Navi 1) -
Three relevance levels (relevant, partially relevant and irrelevant) were used to judge the web documents in terms of their representativeness.
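
The graded levels of Info 1 and Navi 1 can be represented as ordered values and collapsed to a binary relevant/not-relevant set at a chosen threshold. The following is a minimal sketch under stated assumptions: the numeric values, class and function names, and document IDs are hypothetical and only encode the ordering of the levels. The encoding actually used in the task data files is not described here.

```python
from enum import IntEnum
from typing import Mapping, Set


class InfoGrade(IntEnum):
    """Four levels used for Info 1 (numeric values are assumed)."""
    IRRELEVANT = 0
    PARTIALLY_RELEVANT = 1
    RELEVANT = 2
    HIGHLY_RELEVANT = 3


class NaviGrade(IntEnum):
    """Three levels used for Navi 1 (numeric values are assumed)."""
    IRRELEVANT = 0
    PARTIALLY_RELEVANT = 1
    RELEVANT = 2


def relevant_docs(judgments: Mapping[str, IntEnum], threshold: IntEnum) -> Set[str]:
    """Return the IDs of documents judged at or above the threshold level."""
    return {doc_id for doc_id, grade in judgments.items() if grade >= threshold}


# Hypothetical judgments for a single Info 1 topic.
example = {
    "doc-a": InfoGrade.HIGHLY_RELEVANT,
    "doc-b": InfoGrade.PARTIALLY_RELEVANT,
    "doc-c": InfoGrade.IRRELEVANT,
}
strict = relevant_docs(example, InfoGrade.RELEVANT)             # doc-a only
lenient = relevant_docs(example, InfoGrade.PARTIALLY_RELEVANT)  # doc-a and doc-b
```

Choosing a higher or lower threshold makes the binary view stricter or more lenient without changing the underlying graded judgments.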

For more details of NTCIR-4 Web Informational Retrieval (Info 1) please refer to -> [PDF]
For more details of NTCIR-4 Web Navigational Retrieval (Navi 1) please refer to -> [PDF]


The NTCIR-4 WEB test collection (document data and task data) is available from NII/IDR. Please visit NII/IDR.

    Reference 

The terms of use [PDF]
README(document data)[txt]
Notice
The test collection was constructed and used for NTCIR. It may be used only for research purposes.
The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee.
The providers of the document data kindly recognized the importance of test collections for research on information access technologies and granted use of the data for research purposes.
Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a reliable and good relationship with the data producers and providers, we researchers must behave as reliable partners: use the data only for research purposes under the user agreement, and handle it carefully so as not to violate any rights in it.