
Call For Participation:
WEB Task at the 4th NTCIR Workshop (NTCIR-4 WEB)
(CLOSED. The latest information is available here.)

Updated: 2004-08-26
Created: 2003-02-20

Table of Contents

Organization
Contact Information
Registration
Task Overview
A. Informational Retrieval Task 2
B. Navigational Retrieval Task 1
C. Geographic Information Task 1
D. Topical Classification Task 1
Document Data
Schedule
Notes


Organization

Contact Information


Registration

Please read 'How to Participate' carefully, and then complete the NTCIR-4 WEB registration form.

Task Overview

The WEB Task at the 4th NTCIR Workshop (NTCIR-4 WEB) aims to advance research on information access systems for large-scale Web documents, which are structured by tags and hyperlinks, building on the experience of the Web Retrieval Task at the 3rd NTCIR Workshop (NTCIR-3 WEB; see the CFP, the overview, and the publications). The organizers investigated actual use of the Web from various viewpoints, and designed the following sub-tasks to evaluate the fundamental techniques required.

A. Informational Retrieval Task 2
B. Navigational Retrieval Task 1
C. Geographic Information Task 1
D. Topical Classification Task 1
Provisional descriptions of each sub-task are provided below. The details will be announced on the NTCIR-WEB web site. The organizers may cancel any sub-task that does not have a sufficient number of participants as of March 20, 2003. We welcome active contributions from workshop participants, as well as requests and advice from researchers in related areas, in carrying out NTCIR-4 WEB and constructing more usable test collections of Web documents.

A. Informational Retrieval Task 2 (K. Eguchi, K. Oyama)

The Informational Retrieval Task is similar to traditional ad hoc search or subject search over scientific documents, newspapers, etc., where a system searches a static document set using a given topic (question). This task combines the Survey Retrieval Task and the Target Retrieval Task of NTCIR-3 WEB, and evaluates effectiveness taking into account hyperlinks among pages and non-redundancy of page contents. (One of the most important differences between the Survey Retrieval Task and the Target Retrieval Task is the user model assumed. The former assumes a user who attempts to find documents relevant to his/her information needs comprehensively; the latter emphasizes the precision of the ranked search results; see the overview.) The organizers encourage systems that provide such functions. The following is the provisional task description of the Informational Retrieval Task.

Topics (questions):
The topic format is basically inherited from NTCIR-3 WEB (samples), except for the concept list field <CONC> and the given relevant documents field <RDOC>.
Results Submission:
The participating groups are requested to submit, for each topic, a ranked list of the identification numbers of 1,000 retrieved documents. In evaluating the results, the organizers may use only part of the 1,000 documents.
Run results from both 'automatic' and 'interactive' systems are accepted. Any search system involving manual intervention during the search process is deemed 'interactive'; all others are 'automatic'.
Participating groups using automatic systems must submit at least two runs: one using only the <TITLE> topic field (i.e., up to three terms that simulate query terms) and one using only the <DESC> field (i.e., the most fundamental description of the user's information needs, in a single sentence). They may optionally submit runs using other topic fields. Each group may submit at most four runs for this sub-task, and is asked to specify the priority of each run.
The participating groups are requested to report which topic fields were used in their automatic or interactive systems.
The results list must be in a file format suitable for 'trec_eval', the same as in the Survey Retrieval Task at NTCIR-3 WEB (samples).
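As a rough illustration of a results list in the layout that 'trec_eval' conventionally reads (topic ID, the literal Q0, document ID, rank, score, and run tag, separated by whitespace), a minimal Python sketch follows. The function name format_run, the run tag, the topic ID, and the document IDs are hypothetical, and the official NTCIR-3 WEB samples remain the authority on the exact format.

```python
def format_run(topic_id, scored_docs, run_tag, limit=1000):
    """Return result lines for one topic, best score first.

    scored_docs is a list of (doc_id, score) pairs; limit caps the number
    of documents per topic (1,000 in this sub-task).
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:limit]
    return [
        f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}"
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]


# Hypothetical topic ID, document IDs, and scores, for illustration only.
for line in format_run("0001", [("NW000000002", 1.5), ("NW000000001", 2.25)], "demo-run"):
    print(line)
```

For the Navigational Retrieval Task the same sketch would apply with limit=100.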
Relevance Assessment:
The pool of top-ranked search results submitted by the participants forms the set of candidate relevant documents. Human assessors will judge the relevance of each document in the pool on a multi-grade scale: highly relevant, fairly relevant, partially relevant, or irrelevant. In addition, they will choose the five (or three) best documents, ordered by their relevance to the topic statement. When judging a document, the assessors can browse the page and those of its out-linked pages that are included in the pool, following the 'one-click-distance document model' of NTCIR-3 WEB (the overview).
Evaluation:
The organizers will adopt two user models for evaluation: (i) a model in which the user attempts to find documents relevant to his/her information needs comprehensively, as in the Survey Retrieval Task at NTCIR-3 WEB, and (ii) a model in which the user requires just one answer or only a few answers, so that the precision of the ranked search results is emphasized, as in the Target Retrieval Task at NTCIR-3 WEB (the overview). The topic set is the same under both models.
The organizers will apply three types of evaluation measures: (i) those based on precision and/or recall, computed with 'trec_eval', (ii) DCG ('discounted cumulative gain'), and (iii) WRR ('weighted reciprocal rank'; see the overview). The organizers will also take redundancy into account when groups of related documents (i.e., those whose main parts can be regarded as the same, or those significantly connected by hyperlinks) appear in a run results list.
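To make measure (ii) concrete, the following Python sketch computes a discounted cumulative gain over a ranked list of multi-grade judgments. It assumes the commonly used formulation in which the gain at rank i is divided by log_b(i) once i reaches the base b. The gain assigned to each grade and the base b = 2 are hypothetical choices, since this call does not fix them, and neither WRR nor the redundancy handling is modeled here.

```python
import math

# Hypothetical gain per relevance grade; the task description does not fix these.
GAINS = {"highly": 3.0, "fairly": 2.0, "partially": 1.0, "irrelevant": 0.0}


def dcg(grades, base=2.0):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    total = 0.0
    for rank, grade in enumerate(grades, start=1):
        # No discount before the rank reaches the log base; log_b(rank) afterwards.
        discount = 1.0 if rank < base else math.log(rank, base)
        total += GAINS[grade] / discount
    return total


# Judged grades of one run's top three documents (made-up data).
print(dcg(["highly", "irrelevant", "fairly"]))
```

A larger base discounts lower ranks more gently, which corresponds to a more patient user.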

B. Navigational Retrieval Task 1 (K. Eguchi, K. Oyama)

The Navigational Retrieval Task is one of the tasks newly proposed at NTCIR-4 WEB. 'Navigational retrieval' refers to searches that guide a user's information seeking process. NTCIR-4 WEB focuses on known-item search, a type of navigational retrieval that aims to find a page the user already knows of, rather than pages about a given topic as in the Informational Retrieval Task. The organizers assume the following two types of situations: (i) the user wants the typical pages of a known object (e.g., a person, shop, or facility) and searches using the name of the object, and (ii) the user knows the object but does not remember its name, and so searches using attribute information or related information about the object. In these cases, the relevant documents tend to consist of just one document or a few documents. The organizers will accept participation with systems developed for the Informational Retrieval Task, but encourage proposals of original systems suited to the Navigational Retrieval Task. (Systems suited to the Informational Retrieval Task often use document contents. In contrast, systems that use title, heading, and anchor tags, or systems based on link analysis, might be more effective for the Navigational Retrieval Task.) The following is the provisional task description of this sub-task. The details will be announced when they are fixed.

Topics (questions):
The organizers will create the topics assuming either only Type (i) above, or both Type (i) and Type (ii). (If both types are adopted, the type will be specified in each topic.) At present, the organizers assume that the requested object is a particular person, shop, or facility. The name and the kind of the object will also be specified in the topic.
Results Submission:
The participating groups are requested to submit, for each topic, a ranked list of the identification numbers of up to 100 retrieved documents. In evaluating the results, the organizers may use only part of the 100 documents.
The results list must be in a file format suitable for 'trec_eval', the same as in the Target Retrieval Task at NTCIR-3 WEB (samples). Each participating group may submit at most four runs for this sub-task, and is asked to specify the priority of each run.
Relevance Assessment:
The pool of top-ranked search results submitted by the participants forms the set of candidate relevant documents. Human assessors will judge whether or not each candidate document can be deemed a typical document of the target object.
Evaluation:
The organizers will apply MRR ('mean reciprocal rank') and success rates focused on the top-ranked documents.
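The two measures can be sketched in Python as follows. The sketch assumes that the reciprocal rank of a topic is 1 divided by the rank of the first relevant document (0 if none is retrieved), and it reads 'success rate' as the fraction of topics with at least one relevant document in the top k; that reading, the function names, and the sample data are assumptions made for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, judged):
    """Average reciprocal rank over all topics in runs."""
    if not runs:
        return 0.0
    total = sum(reciprocal_rank(docs, judged.get(topic, set())) for topic, docs in runs.items())
    return total / len(runs)


def success_rate(runs, judged, k):
    """Fraction of topics with at least one relevant document in the top k."""
    if not runs:
        return 0.0
    hits = sum(
        1
        for topic, docs in runs.items()
        if any(doc_id in judged.get(topic, set()) for doc_id in docs[:k])
    )
    return hits / len(runs)


# Made-up topics, rankings, and judgments.
runs = {"T1": ["d3", "d1", "d2"], "T2": ["d9", "d8"]}
judged = {"T1": {"d1"}, "T2": {"d7"}}
print(mean_reciprocal_rank(runs, judged), success_rate(runs, judged, 2))
```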

C. Geographic Information Task 1 (T. Sagara, M. Arikawa)

The Geographic Information Task is one of the tasks newly proposed at NTCIR-4 WEB. Geographic information is close to our daily lives and is a practical way to access Web information. Research and development in this area has been increasing recently; however, comparative evaluation of such techniques has not been carried out so far.

NTCIR-4 WEB focuses on technology for extracting geographic information from Web documents relevant to a given viewpoint. The organizers will provide a document set to be processed (the 'target data set'; several tens of thousands of pages) that includes geographic descriptions, roughly gathered from the Web document collection by a baseline search system. The participants are expected to use the target data set to carry out the process described below. The following is the provisional task description of this sub-task. The details will be announced when they are fixed.

Topics (Questions) and Results Submission:
Data set to be processed: the target data set (mentioned above).
Available resources: a toponym dictionary, and a table of street address, latitude and longitude.
Given information: general concept terms such as 'restaurant' or 'school'.
Expected results: geographic objects relevant to a given concept, each of which should be reported according to the format of (document-ID, name of the geographic object, geographic description, latitude and longitude).
E.g., Question: Extract the locations of 'universities'.
NW000003291,"The University of Tokyo", "7-3-1, Hongo, Bunkyo-ward", 139.768616,35.708927
NW000004193,"Tokyo Metropolitan College of Allied Medical Sciences", "03-3819-1211", 139.77603, 35.74796
NW000091353,"Tokyo Medical and Dental University", "1 minute walk from Ochanomizu Station on Marunouchi Line", 139.76750, 35.69838
Notes: The geographic descriptions include telephone numbers, postal codes, descriptive directions, etc. The descriptions need not be located within the target data set; they can also be on pages that are linked from, or link to, the target data set. Moreover, systems that pinpoint the location by combining multiple geographic descriptions are also encouraged in this sub-task. The participants are expected to obtain not only the geographic description of a relevant object, but also its latitude and longitude.
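As an illustration of the result format, the sample rows above can be read with Python's standard csv module. This is only a sketch: the field names are hypothetical, and the coordinate order is an inference from the samples, which appear to list longitude first (a value near 139 cannot be a latitude), rather than a rule stated in this call.

```python
import csv
import io


def parse_results(text):
    """Parse result rows into dictionaries with hypothetical field names."""
    records = []
    # skipinitialspace lets a quoted field follow ", " as in the sample rows.
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        doc_id, name, description, first, second = row
        records.append({
            "doc_id": doc_id,
            "name": name,
            "description": description,
            "lon": float(first),   # assumed order: longitude, then latitude
            "lat": float(second),
        })
    return records


sample = '''NW000003291,"The University of Tokyo", "7-3-1, Hongo, Bunkyo-ward", 139.768616,35.708927
NW000004193,"Tokyo Metropolitan College of Allied Medical Sciences", "03-3819-1211", 139.77603, 35.74796
'''
for record in parse_results(sample):
    print(record["doc_id"], record["description"], record["lon"], record["lat"])
```

Quoting the name and description fields keeps commas inside a street address from being read as column separators.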
Evaluation:
The organizers will evaluate the effectiveness of the submitted results on the basis of precision, recall, and the distance between the reported location and the exact location of a relevant object. How relevant objects will be specified is under consideration.
Precision is defined as precision = p/n, where n is the number of results submitted by a participating group for each question, and p is the number of those results judged to be relevant geographic objects.
Recall is defined as recall = r/s, where r is the number of relevant geographic objects reported by a participating group, and s is the number of all geographic objects judged to be relevant.
The distance from the exact location of a relevant object is measured as follows: (a) a human assessor looks up one of the objects reported by a participating group on a map and specifies its exact location, and (b) the organizers compute the Euclidean distance between the exact location v and the reported location w as distance = |v - w|.
The participants can use an address matching system that obtains the latitude and longitude from a street address. The system will be provided by CSIS at the University of Tokyo. A table of postal codes and street addresses can be downloaded from the web site of the Postal Services Agency, Japan.
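The three evaluation quantities defined above can be sketched in Python as follows. The function names and the sample numbers are hypothetical. The distance is computed directly on coordinate pairs, as the Euclidean definition suggests; the unit of the result (for example, degrees) is not specified in this call.

```python
import math


def precision(p, n):
    """p relevant objects among n submitted results; 0.0 if nothing was submitted."""
    return p / n if n else 0.0


def recall(r, s):
    """r relevant objects reported out of s relevant objects overall; 0.0 if s is 0."""
    return r / s if s else 0.0


def distance(v, w):
    """Euclidean distance |v - w| between exact location v and reported location w."""
    return math.hypot(v[0] - w[0], v[1] - w[1])


# Made-up counts and coordinates for one question.
print(precision(3, 4), recall(3, 6))
print(distance((139.7686, 35.7089), (139.7700, 35.7100)))
```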

D. Topical Classification Task 1 (K. Eguchi)

The Topical Classification Task attempts to evaluate techniques that support a user's browsing process by presenting classified output when the user submits very short, ambiguous queries, as in the 'Search Results Classification Task' at NTCIR-3 WEB. The Search Results Classification Task was proposed as a pilot study and adopted as one of the 'Optional Tasks' at NTCIR-3 WEB; unfortunately, however, no classification results were submitted. NTCIR-4 WEB will therefore conduct this sub-task as a pilot task again. The organizers will provide a document set (the 'target data set') retrieved by a baseline search system, and the participants are expected to use the target data set to carry out the Topical Classification Task. The following is the provisional task description of this sub-task; the details are still under consideration. The organizers hope to reflect the participants' opinions in conducting this sub-task.

Topics (Questions) and Results Submission:
The participants are expected to classify the documents in the target data set into labeled groups, and then submit the classification results. The target data set is composed of documents retrieved using short, ambiguous query terms. The participants may use documents outside the target data set to improve classification effectiveness.
The file format of the classification results will be announced later. Each participating group may submit at most four runs for this sub-task, and is asked to specify the priority of each run.
For example, when 'Hidetoshi Nakata', a famous Japanese soccer player, is used as the query, the results might be classified into 'sites', 'schedules', 'magazines or TV programs', 'photographs', and 'supporters' diaries'. We do not set a limit on the number of classes. Hierarchical classification is also acceptable. The class labels can be topical terms that represent the class, typical page titles, or machine-generated identification codes, e.g., 'cluster A' and 'cluster B'.
Evaluation:
In evaluating the Topical Classification Task, the organizers are considering the following aspects for comparative evaluation.
  • the relevance of each class to the documents in it (accuracy of classification)
  • whether the classifications are easily understood or not
  • the number of classes
  • the number of documents included in each class
  • the number of classes that include the relevant documents and their distribution
  • the number of clicks that are spent to reach the relevant documents

Document Data

NTCIR-4 WEB will use 'NW100G-01' (samples/the overview), which is about 100 gigabytes in size, as the document data, the same as at NTCIR-3 WEB. The organizers will deliver NW100G-01 to the NTCIR-4 WEB participants, but the delivery method has not yet been decided. The computer resources in the 'Open Laboratory' at the National Institute of Informatics (NII) are available, within the limits of the existing resources, only to participants who request to use them. The organizers will announce how to apply for use of the Open Laboratory.

Data Contents and Format:
Document Data Contents
  • List of gathered sites
  • List of aliased sites
  • List of duplicated pages
  • Metadata of each page (fetched URL, time, HTTP headers, etc.)
  • Page data (original data)
Document Data Processing
  • Raw page data
  • Page data with the Japanese character codes pre-converted to EUC
  • Page data with the Japanese character codes pre-converted to EUC and all HTML tags eliminated
Elimination of Unnecessary Documents
  • Pages are eliminated only if they obstruct the building of the document collection, as follows:
    - looped paths
    - dynamically generated pages (more than 10 pages)
    - huge data that are obviously not text data
Document Data Format
  • 1 file per site
  • samples
Open Lab. Environments:
Computer resources
  • Shared file server that provides the document data, etc.
  • Computers for work and auxiliary storage
Host computers
  • Sun Blade 100, Linux PC
  • The participants can use specific versions or other kinds of OS at their own responsibility.
Auxiliary storage
  • 500 gigabytes per team
Network environment
  • A dedicated segment protected by a firewall
Remote access
  • Access to the individual work computers is controlled by the firewall.
  • We will set up the remote access conditions based on the participants' requirements.
  • Remote access from the work computers to the outside is also controlled by the firewall.
Brought-in machines
  • We accept machines brought in by participants as far as the space, the power supply, the administrative conditions, and other circumstances allow.

Schedule

DATE            ACTION
2003-02-20      Call for Participation (tentative)
                * Details will be added step by step.
2003-03-20      Registration Due
                * The organizers may cancel any sub-task that does not have a sufficient number of participants as of this date.
                * Registrations after this date will be accepted where possible.
2003-03-30      Document Data Release

DATE            Task A                        Tasks B and D
2003-06-01      -                             Dry-Run Topics (Questions) Release
2003-07-01      -                             Dry-Run Results Submission
2003-09-01      Topics (Questions) Release    Dry-Run Evaluation Results Release
2003-10-01      Results Submission            -
2003-10-16      -                             Formal-Run Topics (Questions) Release
2003-11-16      -                             Formal-Run Results Submission
2004-02-20      Evaluation Results Release    Formal-Run Evaluation Results Release
(The detailed schedule of Task C will be announced when it is fixed.)

DATE            ACTION
2004-03-19      Submission Due of Camera-ready Manuscript for the Working Notes
                * The Working Notes will be delivered at the Workshop Meeting.
2004-05 (late)  Workshop Meeting (at NII, Tokyo)
-               Submission Due of Camera-ready Manuscript for the Proceedings
                * The Proceedings will be published broadly.
ntcweb-org@nii.ac.jp
