NTCIR-4 WEB (IR Test Collection)
- The NTCIR-4 WEB Test Collection (Task Data, Document Data) is available from
NII/IDR. Please visit NII/IDR.
The NTCIR-4 WEB test collection consists of "Document Data", a collection of tagged text data of
documents crawled mainly from the web in the .jp domain, and
"Task Data", a collection of search topics and the
relevance judgments of the documents.
There are two data sets for the "Task Data":
'Informational Retrieval (Info 1)' and
'Navigational Retrieval (Navi 1)'. The "Document
Data" is 'NW100G-01', the same as that of NTCIR-3 WEB, which is
composed of approximately 100GB of web documents.
"Document Data" is delivered on DVD-Rs or on a hard disk drive, and "Task Data" is
delivered electronically over the Internet.
Collection | Task | Genre | Filename | Lang. | Year | # of docs | Size | Topic lang. | # of topics | Relevance judgment
NTCIR-4 WEB Info 1 | IR | Web (html/text) | NW100G-01*2 | multiple*1 | crawled in 2001 | about 11,000,000 | 100GB | J* | 80 | 4 grades
NTCIR-4 WEB Navi 1 | IR | Web (html/text) | NW100G-01*2 | multiple*1 | crawled in 2001 | about 11,000,000 | 100GB | J* | 300 | 3 grades
* English translation is available
*1 Mostly Japanese or English (some in other languages)
*2 NW100G-01 is shared by NTCIR-3 WEB and NTCIR-4 WEB.
*The entire collection is provided by NII.
- About Document Data:
The document data used for NTCIR-4 WEB
is "NW100G-01", which was also used for NTCIR-3 WEB. "NW100G-01" consists of
approximately 11 million Web documents, approximately 100GB in size, and their
metadata. Its content is outlined below.
Crawling condition:
Target sites: HTTP servers in the .jp domain
Target ports: All ports
Target pages: Text files such as HTML and plaintext
NW100G-01 contains the following files. Refer to the
readme.data file delivered with the document data for
details.
List files
aliaslist : list of aliased sites
doclist : list of web documents for which page data is delivered
duplist : list of duplicate pages
sitelist : list of crawled sites
targetlist : list of documents that are search targets
linklist : list of links from pages in doclist to pages in targetlist
Document files
raw : Original document data as crawled, with their metadata
euc : Document data with Japanese characters converted to EUC code, with their metadata
cooked : Document data with unnecessary tags and elements removed, with their metadata
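The euc files are described as document data with Japanese characters converted to EUC code. The following is a minimal sketch of decoding such bytes in Python, assuming the encoding is EUC-JP (Python's built-in euc_jp codec) and that undecodable bytes should be replaced rather than raise an error. The helper name decode_euc and the sample bytes are hypothetical; the actual file layout is defined in readme.data and is not reproduced here.

```python
def decode_euc(data: bytes) -> str:
    """Decode bytes assumed to be EUC-JP (an assumption, see lead-in).

    Bytes that cannot be decoded are replaced with U+FFFD instead of
    raising, since crawled pages may contain stray bytes.
    """
    return data.decode("euc_jp", errors="replace")


# Hypothetical sample: a short Japanese string encoded as EUC-JP.
sample = "情報検索".encode("euc_jp")
print(decode_euc(sample))
```

Replacing rather than failing on bad bytes is a design choice for bulk processing; strict decoding may be preferable when checking data integrity.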
- About Search Topics:
- Informational Retrieval
(Info 1) -
Topics were created assuming a web search situation where a searcher looks
for documents that contain topical information matching his/her information need.
Efforts were made to avoid large discrepancies between the search
topics and the aim and background of the search. Care was also taken
to avoid time-dependent topics. After the selection process and several
rounds of pooling by the organizers, comprehensive relevance judgments
were conducted for 35 of the 153 topics that were delivered to participants
of this project. Relevance judgments focusing on higher-ranked results
were then conducted for 80 topics (including the 35 initially assessed topics).
- Navigational
Retrieval (Navi 1) -
Topics were created assuming a web search
situation where the searcher looks for representative web pages of a known
item. Eleven topic creators each wrote down search items that arise naturally in their
daily activities. The organizers selected 300 topics, considering the
balance of types and the appropriateness of the topics.
- About Relevance judgment:
- Informational Retrieval
(Info 1) -
Four relevance levels (highly relevant, relevant,
partially relevant and irrelevant) were used for the relevance judgment of
the contents of the web documents. Two cases were evaluated: (i) the searcher looks for
relevant documents comprehensively, and (ii) the searcher looks for only a few relevant
documents.
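To illustrate how four relevance levels can feed an evaluation, here is a small sketch that collapses the grades to binary at a chosen threshold and computes precision at k. It is not the task's official measure: the numeric grade values, the function name precision_at_k, the document IDs, and the treatment of unjudged documents as irrelevant are all assumptions made for this example.

```python
# Numeric values for the four levels are an assumption for illustration.
GRADES = {
    "highly relevant": 3,
    "relevant": 2,
    "partially relevant": 1,
    "irrelevant": 0,
}


def precision_at_k(ranking, judgments, k, min_grade):
    """Fraction of the top-k documents judged at min_grade or above.

    Documents without a judgment are treated as irrelevant (assumption).
    """
    hits = sum(
        1
        for doc in ranking[:k]
        if GRADES[judgments.get(doc, "irrelevant")] >= min_grade
    )
    return hits / k


# Hypothetical judgments and system ranking.
judgments = {
    "d1": "highly relevant",
    "d2": "partially relevant",
    "d3": "irrelevant",
    "d4": "relevant",
}
ranking = ["d2", "d1", "d5", "d4"]

strict = precision_at_k(ranking, judgments, 4, GRADES["relevant"])
lenient = precision_at_k(ranking, judgments, 4, GRADES["partially relevant"])
print(strict, lenient)
```

A strict threshold counts only "relevant" and above, while a lenient one also accepts "partially relevant" documents, so the same ranking can score differently depending on where the line is drawn.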
- Navigational Retrieval (Navi 1) -
Three relevance levels (relevant, partially relevant and
irrelevant) were used for the relevance judgment of the web documents in terms
of their representativeness.
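For known-item search, a commonly used measure is the reciprocal rank of the first acceptable page. The sketch below applies it to the three levels above. The source does not state which measures the task used, so the choice of reciprocal rank, the function name reciprocal_rank, and the page IDs are assumptions for illustration only.

```python
def reciprocal_rank(ranking, judgments, accepted):
    """Return 1/rank of the first page whose judgment is in `accepted`.

    Returns 0.0 when no accepted page appears in the ranking.
    """
    for rank, page in enumerate(ranking, start=1):
        if judgments.get(page) in accepted:
            return 1.0 / rank
    return 0.0


# Hypothetical judgments for one topic and one system ranking.
navi_judgments = {"p1": "partially relevant", "p2": "relevant"}
navi_ranking = ["p9", "p1", "p2"]

rr_strict = reciprocal_rank(navi_ranking, navi_judgments, {"relevant"})
rr_lenient = reciprocal_rank(
    navi_ranking, navi_judgments, {"relevant", "partially relevant"}
)
print(rr_strict, rr_lenient)
```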
For more details of NTCIR-4 Web Informational Retrieval (Info 1),
please refer to -> [PDF]
For more details of NTCIR-4 Web Navigational Retrieval (Navi 1),
please refer to -> [PDF]
Reference
The terms of use [PDF]
README(document data)[txt]
Notice
The test collection has been constructed and used for NTCIR. It may be
used for research purposes only.
The document collections included in the test collection were provided
to NII for use in NTCIR, either free of charge or for a fee.
The providers of the document data kindly understand the importance of the test collection for research on information access technologies and granted the use of the data for research purposes.
Please remember that the document data in the NTCIR test collection is
copyrighted and has commercial value as data. To maintain a reliable
and good relationship with the data producers/providers, we
researchers must behave as reliable partners, use the data only for
research purposes under the user agreement, and take care not to
violate any rights in the data.