NTCIR Project
NTCIR-4 WEB (Web Retrieval Test Collection)
Research Purpose Use of Test Collection


NTCIR-4 WEB (IR Test Collection)

The NTCIR-4 WEB Test Collection (Task Data, Document Data) is available from NII/IDR. Please visit NII/IDR.

The NTCIR-4 WEB test collection consists of "Document Data", a collection of tagged text data of documents crawled mainly from the web in the .jp domain, and "Task Data", a collection of search topics and relevance judgments for the documents.

There are two data sets for the "Task Data": 'Informational Retrieval (Info 1)' and 'Navigational Retrieval (Navi 1)'. The "Document Data" is 'NW100G-01', the same as that of NTCIR-3 WEB, which is composed of web documents totaling approximately 100GB in size.

"Document Data" is delivered on DVD-Rs or on a hard disk drive, and "Task Data" is delivered electronically over the Internet.

Collection	Task	Document genre	Document filename	Document lang.	Year	# of docs	Size	Topic lang.	# of topics	Relevance judgment
NTCIR-4 WEB Info 1	IR	Web (html/text)	NW100G-01*2	multiple*1	crawled in 2001	approx. 11,000,000	100GB	J*	80	4 grades
NTCIR-4 WEB Navi 1	IR	Web (html/text)	NW100G-01*2	multiple*1	crawled in 2001	approx. 11,000,000	100GB	J*	300	3 grades

* English translation is available.
*1 Mostly Japanese or English (some in other languages).
*2 NW100G-01 is common to NTCIR-3 WEB and NTCIR-4 WEB.

* The entire collection is provided by NII.

About Document Data:
The document data used for NTCIR-4 WEB is "NW100G-01", which was also used for NTCIR-3 WEB. NW100G-01 consists of approximately 11 million web documents, approximately 100GB in size, and their metadata. Its content is outlined below.

Crawling conditions:
　　　　Target sites: HTTP servers in the .jp domain
　　　　Target ports: All ports
　　　　Target pages: Text files such as HTML and plain text
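
The three conditions above can be read as a filter on fetched pages. The following is a minimal sketch of such a filter, not the crawler actually used to build NW100G-01. The function name, the MIME types chosen for "text files", and the reading of "http server" as the `http` URL scheme are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Assumption: "text files such as HTML and plain text" is approximated
# by these two MIME types. The real crawler may have accepted others.
TEXT_TYPES = {"text/html", "text/plain"}


def matches_crawl_condition(url: str, content_type: str) -> bool:
    """Return True if a fetched page fits the three stated conditions."""
    parts = urlsplit(url)
    # Target sites: HTTP servers (assumed to mean the http scheme) ...
    if parts.scheme != "http":
        return False
    # ... whose host name is in the .jp domain.
    host = parts.hostname or ""
    if not host.endswith(".jp"):
        return False
    # Target ports: all ports, so the port is deliberately not checked.
    # Target pages: drop parameters such as "; charset=EUC-JP" first.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in TEXT_TYPES
```

For example, `matches_crawl_condition("http://www.example.jp:8080/index.html", "text/html; charset=EUC-JP")` passes the filter, while the same page on a `.com` host or an image served from a `.jp` host does not.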

NW100G-01 contains the following files. Refer to the readme.data file delivered with the document data for details.

List files
　　　　aliaslist : list of aliased sites
　　　　doclist : list of web documents whose page data is delivered
　　　　duplist : list of duplicate pages
　　　　sitelist : list of crawled sites
　　　　targetlist : list of documents that are search targets
　　　　linklist : list of links from pages in doclist to pages in targetlist
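
The definition of linklist relates it to the two document lists. The sketch below illustrates only that relationship on in-memory data. It does not parse the delivered files, whose actual formats are described in readme.data, and the function name and document identifiers are hypothetical.

```python
def build_linklist(links, doclist, targetlist):
    """Keep (source, destination) pairs whose source page is in doclist
    and whose destination page is in targetlist."""
    doc_ids = set(doclist)
    target_ids = set(targetlist)
    return [(src, dst) for src, dst in links
            if src in doc_ids and dst in target_ids]


# Hypothetical identifiers, for illustration only.
links = [("d1", "d2"), ("d1", "x9"), ("x9", "d2"), ("d3", "d1")]
doclist = ["d1", "d2", "d3"]
targetlist = ["d1", "d2"]

print(build_linklist(links, doclist, targetlist))
# [('d1', 'd2'), ('d3', 'd1')]
```

The link to `x9` is dropped because its destination is not a search target, and the link from `x9` is dropped because its source is not among the delivered pages.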

Document files
　　　　raw: Original document data as they were crawled and their metadata
　　　　euc: Document data with Japanese characters converted to EUC code and their metadata
　　　　cooked: Document data with unnecessary tags and elements removed and their metadata

About Search Topics:

- Informational Retrieval (Info 1) -
Topics were created assuming a web search situation in which a searcher looks for documents containing topical information that meets his/her information need. Efforts were made to avoid large discrepancies between each search topic and the purpose and background of the search. Care was also taken to avoid time-dependent topics. Of the 153 topics delivered to participants of this project, 35 received comprehensive relevance judgments after the organizers' selection process and several rounds of pooling. Relevance judgments focusing on higher-ranked results were then conducted for 80 topics (including the 35 initially assessed topics).

- Navigational Retrieval (Navi 1) -
Topics were created assuming a web search situation in which the searcher looks for representative web pages of a known item. Eleven topic creators each wrote down search items that arose naturally in their daily activities. The organizers then selected 300 topics, considering the balance of types and the appropriateness of the topics.

About Relevance judgment:

- Informational Retrieval (Info 1) -
Four relevance levels (highly relevant, relevant, partially relevant and irrelevant) were used to judge the relevance of the contents of the web documents. Two cases were evaluated: (i) the searcher searches for relevant documents comprehensively, and (ii) the searcher searches for only a few relevant documents.
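
As an illustration of how four-level judgments might be scored for the two cases, the sketch below collapses the grades into a binary relevant set at a chosen threshold, then computes average precision (which rewards finding all relevant documents) and precision at k (which looks only at the top of the ranking). The choice of these two measures, the numeric values assigned to the grades, and all names are assumptions for illustration and are not taken from the official NTCIR-4 WEB evaluation.

```python
# Assumed numeric ordering of the four relevance levels.
GRADES = {
    "highly relevant": 3,
    "relevant": 2,
    "partially relevant": 1,
    "irrelevant": 0,
}


def binarize(judgments, min_grade):
    """Return the set of documents judged at or above min_grade."""
    return {doc for doc, label in judgments.items()
            if GRADES[label] >= min_grade}


def precision_at_k(ranking, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant
    document appears, divided by the number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical judgments and system ranking for one topic.
judgments = {"a": "highly relevant", "b": "partially relevant",
             "c": "irrelevant", "d": "relevant"}
ranking = ["a", "c", "b", "d"]

strict = binarize(judgments, min_grade=2)    # {"a", "d"}
print(average_precision(ranking, strict))    # 0.75
print(precision_at_k(ranking, strict, k=2))  # 0.5
```

Lowering `min_grade` to 1 also counts partially relevant documents, which shows how the same ranking can score differently depending on how the grades are collapsed.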

- Navigational Retrieval (Navi 1) -
Three relevance levels (relevant, partially relevant and irrelevant) were used to judge the web documents in terms of their representativeness.
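
Because a navigational search is usually satisfied by the first representative page found, one common way to score such a ranking is the reciprocal rank of the first acceptable result. The sketch below is an assumption-laden illustration and not the measure prescribed for Navi 1: the numeric grade values, the treatment of unjudged pages as irrelevant, and the names are all hypothetical.

```python
# Assumed numeric ordering of the three relevance levels.
NAVI_GRADES = {"relevant": 2, "partially relevant": 1, "irrelevant": 0}


def reciprocal_rank(ranking, judgments, min_grade=2):
    """Return 1/rank of the first page judged at or above min_grade,
    or 0.0 if no such page is retrieved.

    Pages without a judgment are treated as irrelevant (an assumption).
    """
    for rank, doc in enumerate(ranking, start=1):
        label = judgments.get(doc, "irrelevant")
        if NAVI_GRADES[label] >= min_grade:
            return 1.0 / rank
    return 0.0


# Hypothetical judgments and system ranking for one topic.
navi_judgments = {"p2": "partially relevant", "p3": "relevant"}
navi_ranking = ["p1", "p2", "p3"]

print(reciprocal_rank(navi_ranking, navi_judgments, min_grade=1))  # 0.5
```

With the default `min_grade=2`, only the page at rank 3 counts, so the score drops to one third.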

For more details of NTCIR-4 WEB Informational Retrieval (Info 1), please refer to -> [PDF]
For more details of NTCIR-4 WEB Navigational Retrieval (Navi 1), please refer to -> [PDF]


Reference

The terms of use [PDF]
README(document data)[txt]

Notice

The test collection has been constructed and used for NTCIR. It may be used for research purposes only.
The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee.
The providers of the document data kindly understood the importance of the test collection to research on information access technologies and granted use of the data for research purposes.
Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a reliable and good relationship with the data producers/providers, we researchers must behave as reliable partners, use the data only for research purposes under the user agreement, and take care not to violate any rights in the data.

Updated on : 2015-07-22
ntc-admin
