NTCIR-4 WEB (IR Test Collection)
- Distribution of NTCIR-4 WEB Document Data is currently unavailable. An announcement
will be made through the ntcir mailing list once it becomes available again.

The NTCIR-4 WEB test collection consists of "Document Data", a collection of tagged text data of
documents crawled mainly from the web in the .jp domain, and "Task Data", a collection of
search topics and relevance judgments of the documents.
There are two types of data set for the "Task Data": 'Informational Retrieval (Info 1)' and
'Navigational Retrieval (Navi 1)'. The "Document Data" is 'NW100G-01', the same as that of
NTCIR-3 WEB, which is composed of approximately 100GB of web documents.
"Document Data" is delivered on DVD-Rs or on a hard disk drive, and "Task Data" is
delivered electronically through the Internet.
| Collection | Task | Genre | Filename | Doc. lang. | Year | # of docs | Size | Topic lang. | # of topics | Relevance judgment |
|---|---|---|---|---|---|---|---|---|---|---|
| NTCIR-4 WEB Info 1 | IR | Web (html/text) | NW100G-01 *2 | multiple *1 | crawled in 2001 | about 11,000,000 | 100GB | J * | 80 | 4 grades |
| NTCIR-4 WEB Navi 1 | IR | Web (html/text) | NW100G-01 *2 | multiple *1 | crawled in 2001 | about 11,000,000 | 100GB | J * | 300 | 3 grades |
* English translation is available.
*1 Mostly Japanese or English (some in other languages).
*2 NW100G-01 is common to NTCIR-3 WEB and NTCIR-4 WEB.
* The entire collection is provided by NII.

- About Document Data:
The document data used for NTCIR-4 WEB is "NW100G-01", which was also used for NTCIR-3 WEB.
"NW100G-01" consists of approximately 11 million web documents, approximately 100GB in size,
and their metadata. Its content is outlined below.
Crawling conditions:
Target sites: HTTP servers in the .jp domain
Target ports: All ports
Target pages: Text files such as HTML and plaintext
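The crawling conditions above can be read as a simple filter on a URL and its content type. The following is only a minimal sketch of that reading, not the crawler actually used for NW100G-01. The function name `is_crawl_target`, the `content_type` parameter, the restriction to the plain `http` scheme, and the acceptance of any `text/*` type are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def is_crawl_target(url: str, content_type: str) -> bool:
    """Return True if a URL/content-type pair matches the stated crawling conditions."""
    parts = urlsplit(url)
    # Assumption: "HTTP servers" is read as the plain http scheme only.
    if parts.scheme != "http":
        return False
    # .hostname is lowercased and excludes the port, so every port is accepted.
    host = parts.hostname
    if host is None or not host.endswith(".jp"):
        return False
    # Assumption: "text files such as HTML and plaintext" is read as any text/* type.
    return content_type.lower().startswith("text/")
```

For example, `is_crawl_target("http://www.example.jp:8080/index.html", "text/html")` passes all three checks, while a `.com` host or an `image/png` page is rejected.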
NW100G-01 contains the following files. Refer to the readme.data file delivered with the
document data for details.
List files
aliaslist: list of aliased sites
doclist: list of web documents whose page data are delivered
duplist: list of duplicate pages
sitelist: list of crawled sites
targetlist: list of documents to be search targets
linklist: list of links from pages in doclist to pages in targetlist
Document files
raw: Original document data as they were crawled and their metadata
euc: Document data with Japanese characters converted to EUC code and their metadata
cooked: Document data with unnecessary tags and elements removed and their metadata
- About Search Topics:
- Informational Retrieval (Info 1) -
Topics were created assuming a web search situation in which a searcher looks for
documents that contain topical information matching his/her information need.
Efforts were made to avoid large discrepancies between the search topics and the
aim and background of the search. Care was also taken to avoid time-dependent topics.
After the selection process and several rounds of pooling by the organizers,
comprehensive relevance judgments were conducted for 35 of the 153 topics that were
delivered to participants of this project. Relevance judgments focusing on
higher-ranked results were then conducted for 80 topics (including the 35 topics
assessed initially).
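The text does not describe the pooling procedure in detail. As a rough illustration of what repeated pooling can look like, the following sketch implements generic depth-k pooling for a single topic: the top results of every submitted run are merged into a pool, and a later round at a greater depth only adds documents that have not yet been judged. The run names, document IDs, and depths are hypothetical, and this is not presented as the organizers' actual method.

```python
def build_pool(runs: dict[str, list[str]], depth: int) -> set[str]:
    """Union of the top-`depth` documents of each run for one topic."""
    pool: set[str] = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


def next_round(runs: dict[str, list[str]], depth: int, judged: set[str]) -> set[str]:
    """Documents to assess in a deeper pooling round, skipping those already judged."""
    return build_pool(runs, depth) - judged


# Hypothetical ranked results of two runs for one topic.
runs = {
    "runA": ["d1", "d2", "d3"],
    "runB": ["d2", "d4", "d5"],
}
first = build_pool(runs, depth=2)
second = next_round(runs, depth=3, judged=first)
print(sorted(first))   # ['d1', 'd2', 'd4']
print(sorted(second))  # ['d3', 'd5']
```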
- Navigational Retrieval (Navi 1) -
Topics were created assuming a web search situation in which the searcher looks for
representative web pages of a known item. Eleven topic creators each wrote down search
items that arose naturally in their daily activities. The organizers selected 300 topics,
considering the balance of types and the appropriateness of the topics.
- About Relevance Judgment:
- Informational Retrieval (Info 1) -
Four relevance levels (highly relevant, relevant, partially relevant and irrelevant)
were used for the relevance judgment of the contents of the web documents. Two cases
were evaluated: (i) the searcher looks for relevant documents comprehensively, and
(ii) the searcher looks for a few relevant documents.
- Navigational Retrieval (Navi 1) -
Three relevance levels (relevant, partially relevant and irrelevant) were used for the
relevance judgment of the web documents in terms of their representativeness.
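To show how graded judgments such as the four Info 1 levels might be used in scoring a ranked result list, here is a minimal sketch that computes precision at k under a chosen relevance threshold, and discounted cumulative gain (DCG). The numeric gain values, the treatment of unjudged documents as irrelevant, and the choice of these two measures are assumptions for illustration; the text above does not state which measures or gain values were used in NTCIR-4 WEB.

```python
import math

# Hypothetical gains for the four Info 1 levels; the source gives no numeric values.
GAIN = {
    "highly relevant": 3,
    "relevant": 2,
    "partially relevant": 1,
    "irrelevant": 0,
}


def grade(doc_id: str, judgments: dict[str, str]) -> int:
    """Gain of a document; unjudged documents are assumed to be irrelevant."""
    return GAIN[judgments.get(doc_id, "irrelevant")]


def precision_at_k(ranking: list[str], judgments: dict[str, str], k: int,
                   min_level: str = "relevant") -> float:
    """Fraction of the top k documents judged at `min_level` or better."""
    threshold = GAIN[min_level]
    hits = sum(1 for d in ranking[:k] if grade(d, judgments) >= threshold)
    return hits / k


def dcg(ranking: list[str], judgments: dict[str, str], k: int) -> float:
    """Discounted cumulative gain over the top k documents (log2 discount)."""
    return sum(grade(d, judgments) / math.log2(rank + 1)
               for rank, d in enumerate(ranking[:k], start=1))


# Hypothetical judgments and ranking for one topic.
judgments = {"d1": "highly relevant", "d2": "partially relevant", "d3": "relevant"}
ranking = ["d1", "d4", "d2", "d3"]
print(precision_at_k(ranking, judgments, k=4))                                  # 0.5
print(precision_at_k(ranking, judgments, k=4, min_level="partially relevant"))  # 0.75
```

Lowering `min_level` corresponds to a more lenient notion of relevance, so the same ranking scores higher under the lenient threshold than under the strict one.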
For more details of NTCIR-4 Web Informational Retrieval (Info 1), please refer to -> [PDF]
For more details of NTCIR-4 Web Navigational Retrieval (Navi 1), please refer to -> [PDF]

The test collection has been constructed and used for NTCIR. It may be used for research purposes only.
The document collections included in the test collection were provided
to NII for use in NTCIR, either free of charge or for a fee. The providers of
the document data kindly understood the importance of the test collection
in research on information access technologies and granted the
use of the data for research purposes. Please remember that the document
data in the NTCIR test collection is copyrighted and has commercial value
as data. To maintain a reliable and good relationship with the data
producers/providers, it is important that we researchers behave as
reliable partners, use the data only for research purposes under the
user agreement, and take care not to violate any rights in the data.
The procedures to obtain the test collection are as follows. The test collection and data available from NII are free of charge.
- The application form for the test collection must be filled out and sent by e-mail to ntc-secretariat
- Depending on the type of data set, either a user agreement
(memorandum) or a formal application is required. Please refer to the
list below for the required documents.
- User Agreement (memorandum on Permission to Use Test Collection)
  - The user agreement form for each test collection that you would like to obtain must be filled out and
  sent by postal mail or courier to the address below.
  - Please download the form and make two copies, printed double-sided.
  - Signatures are needed on both agreement forms.
  - After both copies are countersigned by NII, one copy will be sent to you
  and one copy will be kept by NII.
- Formal Application
  - You can apply for several data sets with one application. One copy of the
  formal application must be downloaded, filled out and sent by postal mail or courier to the address below.
  - After review by NII, the permission to use the data will be sent to
  the applicant.
- Documents to submit
  - Application Form [txt]
  - User agreement form [PDF] * not necessary if you are already using NW100G-01.
  - Formal Application [PDF] * For Topics and Relevance judgments
- Reference: The terms of use [PDF]
- Reference: README (document data) [txt]
Address
NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN
PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat
Mailing List
The release of new test collections and correction information will
be announced through the ntcir mailing list.