NTCIR-5 WEB (IR Test Collection)
- Distribution of the NTCIR-5 WEB Document Data is currently unavailable. We will announce on the ntcir mailing list once it becomes available again.

The NTCIR-5 WEB test collection consists of "Document Data", a collection of text data processed from crawled documents gathered mainly from "Web servers of Japan", and "Task Data", a collection of search topics and relevance judgments for the documents.
The "Task Data" consists of 400 mandatory topics and 841 optional topics for 'Navigational Retrieval (Navi 2)'. The "Document Data", named 'NW1000G-04', consists of approximately 100 million web documents totalling approximately 1400GB.
| Collection | Task | Genre | Filename | Document lang. | Year | # of docs | Size | Topic lang. | # of topics | Relevance grades |
|---|---|---|---|---|---|---|---|---|---|---|
| NTCIR-5WEB | IR | Web (html/text) | NW1000G-04 | multiple*1 | crawled in 2004-2005 | approx. 100M | approx. 1400GB | J | 400+841(opt.) | 2 |
* All data will be delivered by NII.
*1 Mostly Japanese or English (some in other languages)

- NW1000G-04 consists of several versions of Web page data and several attached lists. The Web pages were crawled mainly from "Web sites of Japan" from January 2004 to January 2005. The Web page data contains four versions: raw data (raw), data with Japanese characters converted to EUC (euc), data with unnecessary tags and others removed (cook), and data processed with a morphological analyzer (mecab). There are four kinds of lists: a list of crawled sites (sitelist), a list of documents (doclist), a list of links (linklist) and a list of anchor texts (anclist).
- The task data was created in the Navigational Retrieval Subtask (Navi-2) of the
NTCIR-5 WEB Task. The task data consists of "topics" and "qrels".
The topics consist of two parts, "mandatory topics" and "optional
topics", and the qrels consist of the corresponding parts. The 400 mandatory
topics were used for the system evaluation of the formal run at Navi-2.
The 841 optional topics were not used for the system evaluation. They were
used together with the 400 mandatory topics only for more detailed analysis
and for enhancing the test collection. Submission of run results for the
optional topics was optional, and some teams did not submit them. Participants
were instructed to process these topics under the same system conditions as
the mandatory topics.
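
The split between mandatory and optional topics can be illustrated with a minimal sketch. It assumes per-topic scores are already available as a dictionary; the topic IDs, the score values, and the choice to count a missing topic as zero are all hypothetical, and this is not the official Navi-2 evaluation tool or qrels format.

```python
from statistics import mean


def formal_run_score(per_topic_scores, mandatory_ids):
    """Average per-topic scores over the mandatory topics only.

    Assumption: a mandatory topic with no score counts as 0.0.
    """
    return mean(per_topic_scores.get(t, 0.0) for t in sorted(mandatory_ids))


def analysis_score(per_topic_scores):
    """Average over every scored topic (mandatory and optional),
    as one might do for further analysis rather than formal evaluation."""
    return mean(per_topic_scores.values())


# Hypothetical topic IDs and scores, for illustration only.
per_topic = {"M001": 1.0, "M002": 0.5, "O001": 0.0}
mandatory = {"M001", "M002"}

formal = formal_run_score(per_topic, mandatory)
overall = analysis_score(per_topic)
```

Here the optional topic `O001` affects only `overall`, which mirrors how the optional topics were kept out of the formal system evaluation.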

The procedure for obtaining the test collection is as follows. The test collection and data available from NII are free of charge.
- The application form for the test collection must be filled out and sent by e-mail to ntc-secretariat
- Depending on the type of data set, either a user agreement (memorandum)
or a formal application is required. Please refer to the list below for the
required documents.
- User Agreement (memorandum on Permission to Use Test Collection)
- The user agreement form for each test collection that you would like to obtain must be filled out and
sent by postal mail or courier to the address below.
- Please download the form and print two double-sided copies.
- Both copies must be signed.
- After NII has countersigned, one copy of the form will be sent to you
and one copy will be kept by NII.
- Formal Application
- You can apply for several datasets with one application. One copy of the
formal application must be downloaded, filled out and sent by postal mail or courier to the address below.
- After review by NII, permission to use the data will be sent to
the applicant.
- Documents to submit
- Application Form [txt]
- User agreement form for NW1000G-04 documents [PDF]
- Formal Application for the Task Data (topics and relevance judgements) [PDF]
Reference
- The terms of use [PDF]
- README (document data) [txt]
- Task Overview of NTCIR 5 WEB
- Overview of the NTCIR-5 WEB Navigational Retrieval Subtask 2 (Navi-2)
- Overview of the NTCIR-5 WEB Query Term Expansion Subtask
Address
NTCIR Project Office (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN
PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat
Mailing List
Releases of new test collections and correction information will be announced through the ntcir mailing list.
Notice
The test collection has been constructed and used for NTCIR. It may be
used for research purposes only. The document collections included
in the test collection were provided to NII for use in NTCIR, either free of
charge or for a fee. The providers of the document data kindly understand the
importance of test collections in research on information access
technologies and have therefore granted use of the data for research purposes.
Please remember that the document data in the NTCIR test collection is
copyrighted and has commercial value as data. To maintain a good and trusting
relationship with the data producers and providers, it is important that we
researchers behave as reliable partners, use the data only for research
purposes under the user agreement, and take care not to violate any rights
in the data.