The NTCIR-5 WEB test collection consists of "Document Data", a collection of text data processed from
crawled documents hosted mainly on web servers in Japan, and
"Task Data", a collection of search topics and
relevance judgments for the documents.
"Task Data" consists of 400 mandatory topics and 841 optional topics for 'Navigational Retrieval (Navi 2)'. "Document
Data", named 'NW1000G-04', consists of approximately 100 million web documents totaling approximately 1400GB.
Collection | Task | Documents: Genre | Documents: Filename | Documents: Lang. | Documents: Year | Documents: # of docs | Documents: Size | Task data: Topic Lang. | Task data: # of topics | Task data: Relevance judge grades
NTCIR-5WEB | IR | Web (html/text) | NW1000G-04 | multiple*1 | crawled in 2004-2005 | approx. 100M | approx. 1400GB | J | 400+841(opt.) | 2
* All data will be delivered by NII.
*1 Mostly Japanese or English (some in other languages)
The NTCIR-5WEB Test Collection (Documents, Task data) is available from NII/IDR. To obtain the data, please refer to the NII/IDR site.
Reference
The terms of use [PDF]
README (document data) [txt]
Task Overview of NTCIR 5 WEB
Overview of the NTCIR-5 WEB Navigational Retrieval Subtask 2 (Navi-2)
Overview of the NTCIR-5 WEB Query Term Expansion Subtask
Notice
The test collection was constructed for and used in NTCIR. It may be used for research purposes only. The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee. The providers of the document data kindly recognized the importance of the test collection for research on information access technologies and granted use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a reliable and good relationship with the data producers and providers, we researchers must behave as reliable partners, use the data only for research purposes under the user agreement, and take care not to violate any of their rights.