
Call For Participation:
WEB Task at the 5th NTCIR Workshop
(NTCIR-5 WEB)

Revised: 2005-01-13
Created: 2004-07-25



Table of Contents

Organization
Contact Information
Registration
Document Data
Open Laboratory
Task Overview
Navigational Retrieval Task 2
Query Term Expansion Task (pilot subtask)
Type of Participation
Schedule

Organization

Contact Information

Registration

Please read 'Application Home' carefully, and then complete the NTCIR-5 WEB registration form. Please also visit 'User Agreement Forms' and follow the instructions there.

Document Data

The NTCIR-5 WEB will use 'NW1000G-04' as the document data. It was crawled mainly from the *.jp domain in 2004 and amounts to about 1 terabyte of page data in total. A 300-gigabyte subset, 'NW300G-04', is also under consideration. The contents and formats of the data will be almost the same as those of 'NW100G-01', which was used in the NTCIR-3/4 WEB. The organizers will prepare four versions of the document data, as follows:

(1) RAW:
Web pages as they were crawled,
(2) EUC:
Web pages with Japanese character code in RAW converted to EUC,
(3) COOKED:
Plain text data extracted from EUC by removing HTML tags and useless elements, and
(4) SEGMENTED:
Segmented text data processed from COOKED by a morphological analyzer.

RAW, EUC and COOKED will be delivered to all participants on hard disk drives, as in the NTCIR-4 WEB. SEGMENTED, a version newly prepared for the NTCIR-5 WEB, is expected to be so large that special arrangements will be needed depending on the participants' demands. Its delivery will therefore be somewhat later.
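The relation between the RAW, EUC and COOKED versions can be illustrated with a minimal sketch. It is not the organizers' actual tooling, which is not described here. It assumes that the source encoding of each page is already known (the parameter source_encoding is hypothetical), and that "useless elements" means at least script and style blocks. The function and class names are invented for illustration. SEGMENTED is omitted because it requires a morphological analyzer.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores script/style content (assumed 'useless elements')."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def raw_to_euc(raw_bytes, source_encoding):
    """RAW -> EUC: re-encode a crawled page as EUC-JP.

    source_encoding is assumed to be known, e.g. from an HTTP header.
    """
    text = raw_bytes.decode(source_encoding, errors="replace")
    return text.encode("euc_jp", errors="replace")


def euc_to_cooked(euc_bytes):
    """EUC -> COOKED: drop HTML tags and keep one text fragment per line."""
    parser = TextExtractor()
    parser.feed(euc_bytes.decode("euc_jp", errors="replace"))
    parser.close()
    return "\n".join(parser.parts)
```

For example, a Shift_JIS page would be passed through raw_to_euc(page_bytes, "shift_jis") and the result through euc_to_cooked. Real crawled pages often declare their encoding incorrectly or not at all, so a production pipeline would need encoding detection, which this sketch leaves out.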

Open Laboratory

The computer resources of the 'Open Laboratory' at the National Institute of Informatics will be available to participants, within the limits of the existing resources. The organizers will announce how to apply for the Open Laboratory.

Task Overview

Since the 3rd NTCIR Workshop, the WEB Task has sought to advance research on information access systems for large-scale Web documents, which are structured by tags and hyperlinks, from various viewpoints of actual Web use. However, because of the organizers' circumstances, the WEB Task at the 5th NTCIR Workshop (NTCIR-5 WEB) focuses on a single main subtask, "Navigational Retrieval Task 2", and adds only one newly proposed pilot subtask, "Query Term Expansion Task". Other subtasks conducted in the WEB Tasks at the 3rd and 4th NTCIR Workshops (NTCIR-3/4 WEB) may be taken up again in future NTCIR Workshops.

The current task description of each subtask is given below. Details will be announced on the NTCIR-WEB home pages. We welcome active contributions from workshop participants, as well as requests and advice from researchers in related areas, to help carry out the NTCIR-5 WEB and to construct more usable test collections of Web documents.

Navigational Retrieval Task 2

The Navigational Retrieval Task is one of the subtasks newly proposed at the NTCIR-4 WEB. 'Navigational retrieval' refers to searches that guide a user's information-seeking process. The NTCIR-4/5 WEB focuses on known item search, one kind of navigational retrieval.

Known item search aims to find the representative Web pages of a given item, rather than one specific Web page. A representative Web page may be a site top page, an entry page to a series of related pages, or a single fully informative page. Two user situations are assumed: (i) the user wants the typical pages of a known object (e.g., a person, shop, or facility) and searches using the name of the object; and (ii) the user knows the requested object but does not remember its name, and so searches using attributes of the object or information related to it. In both cases, the number of relevant documents tends to be just one or a few. Consequently, the subtask can be regarded as including, but not being restricted to, home page finding and named page finding in the TREC Web Track.

Ordinary information retrieval systems often use only the text content of documents, whereas exploiting anchors, link structures, logical document units, etc. is considered effective for Web retrieval. The results of Navigational Retrieval Task 1 suggest that this tendency is especially pronounced in known item search. Therefore, the organizers encourage participation with systems that apply original methods suited to this subtask.

The following is an outline of the task description. For details, please refer to the overview of Navigational Retrieval Task 1 in the NTCIR-4 Working Notes.

Topics (questions):
The organizers will create topics covering both Type (i) and Type (ii). The requested object is assumed to be a product/service, shop, facility, organization, person, event, information source, document, etc.
Results Submission:
Participating groups are requested to submit, for each topic, a ranked list of the identification numbers of up to 100 retrieved documents. In evaluating the results, the organizers may use only part of these 100 documents.
The results list must be in a file format suitable for 'trec_eval', as in the Target Retrieval Task at the NTCIR-3 WEB (samples). The number of runs each participating group may submit has not yet been decided. Groups are also asked to specify the priority of each run.
Relevance Assessment:
A document pool will be made for each topic by gathering a certain number of top-ranked documents from all the search results submitted by the participants. Human assessors will judge whether each document in the pool is a representative page of the requested object. They will also judge any other document that, based on hyperlinks, URLs, etc., may be a representative page.
Evaluation:
The organizers will apply MRR ('Mean Reciprocal Rank'), DCG ('Discounted Cumulative Gain') and other evaluation measures suitable for navigational retrieval. Duplicated documents and closely linked documents will be taken into account.
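As a rough illustration of the submission format and of the two measures named above, the following sketch writes result lines in the six-column layout that trec_eval reads (topic, the literal Q0, document ID, rank, score, run tag) and computes MRR and DCG. It is not the organizers' evaluation code. The function names, the placeholder scores and the document IDs are hypothetical, and the DCG variant shown (gain divided by log2(rank + 1)) is an assumption, since the exact definition to be used is not stated here.

```python
import math


def format_run(topic_id, ranked_doc_ids, run_tag, max_docs=100):
    """Return trec_eval-style lines: topic Q0 doc_id rank score run_tag."""
    lines = []
    for rank, doc_id in enumerate(ranked_doc_ids[:max_docs], start=1):
        # Placeholder score that strictly decreases with rank; a real system
        # would write its own retrieval score here.
        score = 1.0 / rank
        lines.append(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines


def reciprocal_rank(ranked_doc_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, qrels):
    """runs: topic -> ranked doc IDs; qrels: topic -> set of relevant doc IDs."""
    total = sum(reciprocal_rank(docs, qrels.get(topic, set()))
                for topic, docs in runs.items())
    return total / len(runs)


def dcg(ranked_doc_ids, gains, cutoff=10):
    """DCG at a cutoff, assuming the gain / log2(rank + 1) discount."""
    return sum(gains.get(doc_id, 0) / math.log2(rank + 1)
               for rank, doc_id in enumerate(ranked_doc_ids[:cutoff], start=1))
```

With one or a few relevant documents per topic, as expected in this subtask, the reciprocal rank of a topic is 1.0 when a representative page is ranked first, 0.5 when it is ranked second, and so on. This sketch ignores the treatment of duplicated and closely linked documents mentioned above.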

Query Term Expansion Task (pilot subtask)

Query Term Expansion Task is a newly proposed pilot subtask. Its detailed task definition will be fixed based on discussions among the organizers and the participants. For more information, please visit the subtask's web page.

Type of Participation

Each group may participate in either or both of the two subtasks described above.

Schedule

DATE            ACTION
2004-08-01      Call for Participation (preliminary)
2004-09-20      Registration Due
                * Registrations after this date will be accepted as long as possible.
2004-10-01      Document Data Release
                * Provided in a few divisions as they are prepared. The first one will be of about 300GB.
2004-12-01      Dry-Run Topics Release
2005-01-01      Dry-Run Results Submission
2005-03-01      Dry-Run Evaluation Results Release
2005-04-15      Formal-Run Topics Release
2005-05-15      Formal-Run Results Submission
2005-08-01      Formal-Run Evaluation Results Release
2005-10-01      Submission Due of Camera-ready Manuscript for the Working Notes
                * Working Notes will be delivered at the Workshop Meeting.
2005-12-06--09  Workshop Meeting
2006-02-        Submission Due of Camera-ready Manuscript for the Proceedings
                * The Proceedings will be published broadly.
ntcadm-web@nii.ac.jp
