[Japanese][NTCIR][workshop][register][agreement] Tasks [clir][patent][qac][tsc][web]
Call For Participation:
Web Retrieval Task at the 3rd NTCIR Workshop
(CLOSED)
Last Updated: 2003-01-08
Organization of Web Retrieval Task, NTCIR Workshop 3
-
Executive Committee
- Koji Eguchi (National Institute of Informatics), Co-chair
- Keizo Oyama (National Institute of Informatics), Co-chair
- Emi Ishida (National Institute of Informatics)
- Noriko Kando (National Institute of Informatics)
- Kazuko Kuriyama (National Institute of Informatics)
-
Advisory Committee
- Kazuhiro Kazama (NTT Corporation)
- Hiroyuki Kawano (Kyoto University)
- Kazuaki Kishida (Surugadai University/National Institute of Informatics)
- Kazuo Sumita (Toshiba Corporation)
- Akihiko Takano (National Institute of Informatics)
- Toshikazu Fukushima (NEC Corporation)
- Kunio Matsui (Fujitsu Laboratories)
- Hayato Yamana (Waseda University)
Contact Information
-
ML for the Executive Committee: ntcacm-web@nii.ac.jp
-
ML for open discussions (in English): ntc-web@nii.ac.jp
-
ML for open discussions (in Japanese): ntc-webj@nii.ac.jp
-
How to subscribe to the MLs: http://research.nii.ac.jp/ntcir/ml-en.html
Schedule (2001-10-10 updated)
-
2001-9-30: Application due
-
2001-10-31: Application due extended
-
2001-11: Deliver sample document data
-
2001-12: Start open lab.
-
2002-1: Dry run
-
2002-2: Deliver evaluation results of dry run / Round table discussion
-
2002-3: Deliver formal topics
-
2 weeks later: Submit search results
-
2002-7: Relevance judgments / Deliver the evaluation results
-
2002-8: Round table discussion
-
2002-8-15: Paper due
-
2002-10-08/2002-10-10: NTCIR3 workshop meeting
Overview of Web Retrieval Task
The objective of the Web Retrieval Task in NTCIR-3 is 'to research the
retrieval of Web documents that have a structure with tags and links'.
The task design and evaluation methods reflect the characteristics of
Web retrieval. Meanwhile, the topic format and some evaluation measures
are inherited from past NTCIR workshops so that results can be compared
with theirs.
We have prepared two document collections, gathered mainly from the
'.jp' domain: one is over 100 GBytes, reflecting reality, and the other
is a selected 10 GBytes, to make the task easier for participants to
handle. Because the data are too large to handle easily and some
restrictions apply to delivery of the original data, participants will
be allowed to use the original document collections only inside the
National Institute of Informatics (NII). Participants will use computer
resources in the open laboratory located at NII to process the data, e.g.
to index the original document data, and will then take out the resulting
data and run experiments on them in their own laboratories.
The Web retrieval task is composed of the following subtasks for the
two document collections: 100 GBytes and 10 GBytes.
Subtasks in Web Retrieval Task
-
A. Survey Retrieval (both recall and precision are evenly weighted for
evaluation)
-
A1. Topic Retrieval
-
A2. Similarity Retrieval
-
B. Target Retrieval (precision-oriented)
-
C. Optional Tasks
-
C1. Search Results Classification
-
C2. Speech-Driven Retrieval
-
C3. etc.
A. Survey Retrieval
Survey retrieval is similar to the traditional ad-hoc retrieval for scientific
documents or newspapers, where the system performs searches using newly
provided topics for a static document set.
-
The Topics:
-
Both automatic and interactive systems are welcome. Any IR system that
involves manual intervention during the search process is "interactive".
All others are "automatic".
-
In the case of A1, the topics are described in almost the same format as
in past NTCIR workshops. As mandatory runs, an automatic system must submit
the results of a search using only <DESC> and of a search using only <TITLE>.
<DESC> provides a basic description of the user request, and <TITLE>
is composed of 1-3 words that represent the essence of the user request.
As a non-mandatory run, an automatic system is allowed to use any fields of
the topics. Participants should report which topic fields are used by
their automatic or interactive systems.
-
In the case of A2, an automatic system submits the results of a search
using <TITLE> and <RDOC>, which identifies three relevant documents.
An automatic system is also allowed to use only <RDOC> or a part of it.
-
Results Submission and Evaluation:
-
Runs will be submitted as the ranked top 1000 documents retrieved for
each topic. The pool composed of the top-ranked search results submitted
by each participant is treated as the set of relevant document
candidates. Human assessors judge the relevance of each document in the
pool. They assign multi-grade relevance: highly relevant, relevant,
partially relevant, or irrelevant, which were introduced in past NTCIRs,
or top relevant, which is newly proposed in this task. The evaluation
will be performed using trec_eval and 'weighted mean average precision',
which takes into account both the relevance grade and the ranking. The
page is the basic unit of runs and relevance judgments; however, when
assessors judge the relevance of a page, they may refer to pages within
one click of it, but only if those pages are included in the relevant
document candidates pool. Participants cannot submit more than four runs
for each sub-task and must specify the priority of each run.
-
Evidential Passages:
-
'Evidential passages', i.e. the parts of each relevant document that provide
evidence for the relevance judgment, can be submitted because Web pages vary
widely in length. While the page is the basic unit for evaluation, evidential
passages can be used for complementary evaluation. The submission of evidential
passages is not mandatory; if they are not submitted, we treat the whole page
as the evidence.
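The pooling step described under Results Submission and Evaluation can be sketched as follows. This is only an illustration: the pool depth, the run names and the in-memory layout of a run (a dict from topic number to a ranked list of document IDs) are assumptions, since the text only says that the pool is built from the "top-ranked" results of each participant.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, Dict[str, List[str]]], depth: int) -> Dict[str, Set[str]]:
    """Union the top-`depth` documents of every run, separately for each topic.

    `runs` maps a (hypothetical) run name to {topic number: ranked document IDs}.
    `depth` is an assumed parameter; the source does not state the pool depth.
    """
    pool: Dict[str, Set[str]] = {}
    for ranked_by_topic in runs.values():
        for topic_id, ranked_docs in ranked_by_topic.items():
            pool.setdefault(topic_id, set()).update(ranked_docs[:depth])
    return pool


# Two toy runs for topic 002; document IDs are placeholders.
runs = {
    "runA": {"002": ["d1", "d2", "d3"]},
    "runB": {"002": ["d2", "d4", "d5"]},
}
pool = build_pool(runs, depth=2)
print(sorted(pool["002"]))  # ['d1', 'd2', 'd4']
```

Each document in the resulting pool is judged once, no matter how many runs retrieved it.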
B. Target Retrieval
Target retrieval attempts to evaluate the effectiveness of retrieval
in cases where the user requires just one answer or at most a few (e.g.,
a fact-type retrieval, or retrieval of a site's top page), where precision
should be emphasized.
-
The Topics:
-
Both automatic systems, which perform language processing on the topic to
formulate the query, and interactive systems, in which the user specifies
the query after reviewing the topic, are acceptable.
-
As mandatory runs, an automatic system must submit the results of a search
using only <DESC> and of a search using only <TITLE>. <DESC> provides a basic
description of the user request, and <TITLE> is composed of 1-3 words
that represent the essence of the user request. As a non-mandatory run,
an automatic system is allowed to use any fields of the topics. Participants
should report which topic fields are used by their automatic or interactive
systems.
-
Results Submission and Evaluation:
-
The runs will be submitted as the ranked top 10 documents retrieved for
each topic, optionally with evidential passages attached. Several
evaluation measures will be applied, for example:
-
B1. TREC Q&A Track-like method: the reciprocal rank of the first
relevant document in the ranking
-
B2. Utility: scoring +1 for each relevant document and -1 for each
irrelevant one, or scoring relevant documents according to their relevance
grades, e.g. +3 for highly relevant, +2 for relevant, and +1 for partially
relevant.
-
B3. Reliability: scoring +1 for each relevant document, -1 for each
document that is irrelevant and incorrect, and 0 for each document that is
simply irrelevant
Participants cannot submit more than four runs for each sub-task and must
specify the priority of each run.
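Measures B1 and B2 can be illustrated with a short sketch. The grade labels, the choice to count 'partially relevant' as relevant for B1 and for binary utility, and the score of 0 for irrelevant documents in the graded variant are all assumptions made for illustration; the text does not fix them.

```python
from typing import List

# Gains for the graded variant of B2, taken from the example values in the text.
# The label strings themselves are hypothetical.
GRADED_GAIN = {"highly_relevant": 3, "relevant": 2, "partially_relevant": 1}


def is_relevant(grade: str) -> bool:
    """Assumption: every grade other than 'irrelevant' counts as relevant."""
    return grade != "irrelevant"


def reciprocal_rank(judged: List[str]) -> float:
    """B1: 1/rank of the first relevant document, or 0.0 if there is none."""
    for rank, grade in enumerate(judged, start=1):
        if is_relevant(grade):
            return 1.0 / rank
    return 0.0


def utility_binary(judged: List[str]) -> int:
    """B2, binary variant: +1 per relevant document, -1 per irrelevant one."""
    return sum(1 if is_relevant(grade) else -1 for grade in judged)


def utility_graded(judged: List[str]) -> int:
    """B2, graded variant: sum of gains; irrelevant documents score 0 (assumed)."""
    return sum(GRADED_GAIN.get(grade, 0) for grade in judged)


# Judged grades of one submitted ranking, in rank order (toy data).
judged = ["irrelevant", "partially_relevant", "highly_relevant"]
print(reciprocal_rank(judged))  # 0.5
print(utility_binary(judged))   # 1
print(utility_graded(judged))   # 4
```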
C. Optional Tasks
Participants can freely submit proposals relating to their own research
interests, using the document sets used in sub-tasks A and B. The results
are to be presented as a paper/poster at the NTCIR-3 workshop meeting.
If a proposal involves several participants, it can be adopted as a sub-task
and investigated in detail. 'C1. Search Results Classification' and 'C2.
Speech-Driven Retrieval' are examples of optional tasks.
-
C1. Search Results Classification:
-
This sub-task tries to evaluate high-precision searching and techniques
for supporting user navigation in cases where the user submits very short
queries.
-
The participant performs a search using only the lead term in <TITLE>
of the topic, classifies the search results into labeled groups, and
then submits the resulting 200 documents. The classification may be
performed on more than the top 200 documents retrieved.
-
For example, when the query is 'Hidetoshi Nakata', a famous Japanese
soccer player, the results might be classified into 'sites', 'schedules',
'magazines/TV programs', 'photographs' and 'supporters' diaries'.
We do not limit the number of classes. Hierarchical classifications
are also acceptable. Class labels can be machine-like identification
codes, e.g. 'cluster A' and 'cluster B', or typical page titles.
-
For evaluation, we will take measurements such as the following examples,
describe the features of the systems, and compare them through discussions
on the ML or at round tables.
-
whether the classification is easy to understand
-
the number of classes
-
the number of documents included in each class
-
the relevance between each class and the documents in it
-
the number of classes that include relevant documents, and their distribution
-
whether the needed information can be found
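The countable criteria in the list above (number of classes, documents per class, and classes that contain relevant documents) can be computed mechanically. The following is a minimal sketch under assumed inputs: a dict from class label to document IDs and a set of relevant document IDs. The other criteria, such as understandability, need human judgment and are not covered.

```python
from typing import Dict, List, Set


def class_statistics(classes: Dict[str, List[str]], relevant: Set[str]) -> dict:
    """Summarize a classified result list.

    `classes` maps a class label to the document IDs placed in that class;
    `relevant` is the set of documents judged relevant for the topic.
    Both structures are illustrative, not a prescribed submission format.
    """
    class_sizes = {label: len(docs) for label, docs in classes.items()}
    relevant_per_class = {
        label: sum(1 for doc in docs if doc in relevant)
        for label, docs in classes.items()
    }
    return {
        "num_classes": len(classes),
        "class_sizes": class_sizes,
        "relevant_per_class": relevant_per_class,
        "classes_with_relevant": sum(1 for n in relevant_per_class.values() if n > 0),
    }


# Toy classification with placeholder document IDs.
classes = {"sites": ["d1", "d2"], "schedules": ["d3"], "photographs": ["d4", "d5"]}
stats = class_statistics(classes, relevant={"d3", "d4"})
print(stats["num_classes"], stats["classes_with_relevant"])  # 3 2
```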
-
C2. Speech-Driven Retrieval:
-
This sub-task was proposed by Dr. Atsushi Fujii, University of Library and
Information Science, and Dr. Katunobu Itou, National Institute of Advanced
Industrial Science and Technology. Details are available at
"Page of Speech-Driven Retrieval Task" (Japanese only).
-
C3. etc.:
-
Other examples of optional tasks include, but are not limited to, mirror
site detection, data compression, comparable page alignment, and pattern
discovery.
The Topics
To design the topic format, we surveyed actual Web retrieval practice and
information needs using questionnaires at several universities. The topic
format is basically inherited from past NTCIRs. The usable fields and
mandatory fields vary according to the sub-task.
Here are two examples of topics.
<TOPIC>
<NUM>002</NUM>
<TITLE>中田英寿,試合,今後</TITLE>
<DESC>中田英寿の今後の試合予定を知りたい.</DESC>
<NARR>適合文献は,中田英寿の今後の試合予定を示しているもの.
チケット予約とは連動していなくてよい.ファンの個人的なHPなど
でも具体的な日程,場所,時間がわかるのであれば,正解とする.
今後は,ページが作成された時点からみて「今後」.試合の印象記
などは正解ではない.</NARR>
<CONC>中田英寿,サッカー,試合日程,試合,スケジュール</CONC>
<RDOC>ntcweb003983762345,ntcweb000123453874634,ntcweb00023432934</RDOC>
<USER>大学3年,女性</USER>
</TOPIC>
<TOPIC>
<NUM>034</NUM>
<TITLE>エルニーニョ,世界,影響</TITLE>
<DESC>「エルニーニョ」現象とその世界の気象への影響(海水温,
気圧,降雨量などへの影響を含む)について説明している文書を
探したい.</DESC>
<NARR>適合文献は,「エルニーニョ」の影響についての情報を提供する
もの.海と陸上の大気との相互作用は,エルニーニョ現象に関連する
ものならば,関心がある.「エルニーニョ」は,世界の気候に影響を
及ぼすので,特に南太平洋で重要である.</NARR>
<CONC>エルニーニョ,気象,海水温,気圧,降雨量,大気,南太平洋
</CONC>
<RDOC>ntcweb000003425444,ntcweb000232333923,ntcweb000234338778</RDOC>
<USER>中学2年,男性</USER>
</TOPIC>
-
<DESC> (DESCRIPTION) represents the most fundamental description of
the user's information needs. We consider the basic form of a topic
to be "(1) of (2)", e.g. "'the play schedule' of 'Nakata'" or
"'recipes' of 'healthy cookies'".
-
Meanwhile, <TITLE> specifies 1-3 terms representing the most fundamental
subjects, not all of the aspects of the user's information
needs.
-
<NARR> (NARRATIVE) gives details of the background, retrieval purpose,
relevance judgment criteria, term definitions and so on.
-
<CONC> (CONCEPTS) gives the synonyms, related terms or broader terms
that are defined by the topic creator.
-
<RDOC> (RELEVANT DOCUMENTS) gives the identification numbers of three
relevant documents.
-
<USER> (USER ATTRIBUTES) gives the attributes of the topic creator,
e.g. the social position and the gender.
We will select the topics with attention to balance across genres, retrieval
purposes and so on.
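The topic format described above can be read with a few regular expressions. The sketch below is not an official NTCIR tool: it assumes that each field appears at most once per topic, that tags are never nested, and that line breaks inside a field can simply be removed, which suits the Japanese text in the examples. The function names are hypothetical.

```python
import re
from typing import Dict, List

TOPIC_RE = re.compile(r"<TOPIC>(.*?)</TOPIC>", re.DOTALL)
FIELD_RE = re.compile(r"<(NUM|TITLE|DESC|NARR|CONC|RDOC|USER)>(.*?)</\1>", re.DOTALL)


def parse_topics(text: str) -> List[Dict[str, str]]:
    """Return one dict per <TOPIC>, keyed by field tag."""
    topics = []
    for body in TOPIC_RE.findall(text):
        fields = {}
        for tag, value in FIELD_RE.findall(body):
            # Join wrapped lines without inserting spaces (assumption for Japanese text).
            fields[tag] = "".join(line.strip() for line in value.splitlines())
        topics.append(fields)
    return topics


def split_terms(value: str) -> List[str]:
    """Split a comma-separated field; both ASCII and full-width commas occur."""
    return [term.strip() for term in re.split(r"[,,]", value) if term.strip()]


# A shortened copy of the first example topic.
sample = """<TOPIC>
<NUM>002</NUM>
<TITLE>中田英寿,試合,今後</TITLE>
<DESC>中田英寿の今後の試合予定を知りたい.</DESC>
<RDOC>ntcweb003983762345,ntcweb000123453874634,ntcweb00023432934</RDOC>
</TOPIC>"""

topics = parse_topics(sample)
print(topics[0]["NUM"])                       # 002
print(split_terms(topics[0]["TITLE"]))        # ['中田英寿', '試合', '今後']
print(len(split_terms(topics[0]["RDOC"])))    # 3
```

A mandatory TITLE-only run would use only `split_terms(topic["TITLE"])` as query terms, and a DESC-only run only `topic["DESC"]`.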
Document Set
The Definition of Document Set and its Distribution
The document sets of a test collection should be explicitly specified.
Among several possible methods, we adopted the following one, because
the web retrieval task is our first challenge and there are many unknown
factors.
-
Extract a part of the gathered web document set, then define the set of the
URLs of the extracted documents as the retrieval document set.
-
Provide the document data used for retrieval processing.
As this method is the same as that of conventional test collections, many
well-known techniques can be utilized for identifying relevant document
sets and for system evaluation. It is also important that the produced
test collection remains effective for a long time.
Document Set for the Workshop
-
Document Collection:
-
Web pages gathered by a Web robot
-
Small Collection
-
the size of the collection: 10GB, the number of documents: 1-2M
-
Large Collection
-
the size of the collection: 100GB, the number of documents: 10-20M
-
We extract the collection so that forward links can be utilized.
-
An open question is how to preserve the usefulness of backward links.
-
Gathering Domain:
-
Sites: HTTP servers, mainly in the '.jp' domain
-
Ports: all
-
Pages on other sites are gathered only if they are linked from
already gathered pages on the above-mentioned sites.
-
File format: HTML, PlainText
-
Document Data and Document Set:
-
The retrieval document set consists of documents whose document data are
provided and documents which are just linked from the provided document
data.
-
Distribution:
-
Document data are only available in the open laboratory at NII.
-
The retrieval document set is given as a list of document identification
numbers with corresponding URLs.
-
Participants are allowed to take out only the processed data that are needed
for making indices.
-
Open Lab. Environments:
-
As there are many uncertain factors, the detailed configuration will be
decided taking the participants' requirements into account. The following
is a tentative plan.
-
Computer resources
-
Shared file server that provides document data etc.
-
Work computers and auxiliary storage
-
Host computers
-
Sun Blade, Linux, Windows 2000
-
We issue user accounts without root permissions.
-
Auxiliary storage
-
For large collection: 500GBytes/team
-
For small collection: 100GBytes/team
-
Data backup facilities
-
DVD-R, magnetic tape equipment, and so on
-
We might need to arrange schedules if the number of participants is
large.
-
Software
-
We will prepare basic software based on the participants' requirements.
-
Network environments
-
An exclusive segment that is protected by a firewall
-
We do not permit work computers to access each other.
-
Remote access
-
Access to individual work computers is controlled by the firewall and
tcp_wrapper software.
-
We will set up the remote access conditions based on the participants'
requirements.
-
Remote access from work computers to the outside is also controlled
by the firewall.
-
Working space
-
We will prepare working desks sufficient for two teams at the same time.
-
Bring-in machines
-
We accept machines brought in by participants as far as the space, the
power supply, the management conditions, and other circumstances allow.
-
Data Contents and Format:
-
Document Data Contents
-
List of gathered sites
-
List of aliased sites
-
List of duplicated pages
-
Metadata of each page (fetched URL, time, HTTP headers, etc.)
-
Page data (original data)
-
Document Data Processing
-
Page data is provided with Japanese character code preconverted to EUC.
-
Page data without code conversion is also available.
-
No other data preprocessing will be performed.
-
Elimination of Unnecessary Documents
-
Pages are eliminated only if they obstruct the building of the document
collection.
-
looped path
-
dynamically generated pages (gathered up to 10 pages)
-
Document Data Format
-
1 file per site
-
Document data format:
<NW:DOC>
<NW:META>
<NW:URL>URL of the page</NW:URL>
<NW:DATE>gathered date</NW:DATE>
...
</NW:META>
<NW:DATA>
<NW:DSIZE>data bytes</NW:DSIZE>
page contents
</NW:DATA>
</NW:DOC>
Retrieval document set list
ntcweb003983762345 http://www.nii.ac.jp/index.html
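A reader for the two formats above might look like the following sketch. It rests on several assumptions that the format description does not confirm: that <NW:DSIZE> counts the bytes of the page contents, that exactly one newline separates the <NW:DSIZE> line from the contents, and that each metadata field sits on a single line. The function names and the date value in the sample are hypothetical.

```python
import re
from typing import Dict, Iterator, Tuple

META_FIELD_RE = re.compile(rb"<NW:(\w+)>(.*?)</NW:\1>")
DSIZE_RE = re.compile(rb"<NW:DSIZE>(\d+)</NW:DSIZE>\n")


def iter_documents(raw: bytes) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (metadata, page bytes) for each <NW:DOC> record in a site file."""
    pos = 0
    while True:
        start = raw.find(b"<NW:DOC>", pos)
        if start < 0:
            return
        meta_end = raw.find(b"</NW:META>", start)
        size_match = DSIZE_RE.search(raw, meta_end) if meta_end >= 0 else None
        if size_match is None:
            return  # truncated or unexpected record; stop rather than guess
        meta = {
            m.group(1).decode("ascii"): m.group(2).decode("ascii", "replace")
            for m in META_FIELD_RE.finditer(raw[start:meta_end])
        }
        size = int(size_match.group(1))
        body_start = size_match.end()
        # Read by length so that page contents containing tag-like text are safe.
        yield meta, raw[body_start:body_start + size]
        pos = body_start + size


def parse_id_list(text: str) -> Dict[str, str]:
    """Map document identification numbers to URLs, one pair per line."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            mapping[parts[0]] = parts[1].strip()
    return mapping


page = b"<html>hello</html>"
sample = (
    b"<NW:DOC>\n<NW:META>\n"
    b"<NW:URL>http://www.nii.ac.jp/index.html</NW:URL>\n"
    b"<NW:DATE>2001-10-01</NW:DATE>\n"  # illustrative value; real date format unknown
    b"</NW:META>\n<NW:DATA>\n"
    b"<NW:DSIZE>" + str(len(page)).encode("ascii") + b"</NW:DSIZE>\n"
    + page + b"\n</NW:DATA>\n</NW:DOC>\n"
)
docs = list(iter_documents(sample))
print(docs[0][0]["URL"])   # http://www.nii.ac.jp/index.html
print(docs[0][1] == page)  # True

ids = parse_id_list("ntcweb003983762345 http://www.nii.ac.jp/index.html\n")
print(ids["ntcweb003983762345"])  # http://www.nii.ac.jp/index.html
```

Reading the contents by length rather than searching for the closing tag is a design choice: page contents are arbitrary HTML and could themselves contain text that looks like a closing tag.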
Sample Data
Relevance Judgments
Human assessors judge the relevance of each element of the relevant
document candidates pool, which is composed of the top-ranked search results
submitted by each participant. Several different assessors judge the relevance
for each topic. Documents retrieved by 'search masters', who are well
versed in web searching, or by other searchers will be added to the relevant
document candidates pool to enhance the coverage of relevant documents.
Notes
NII is constructing a test collection for web retrieval on the basis of
the ideas described above. Many issues are still under discussion at the
time of writing, so the data or methods adopted in the actual workshop may
differ from those described in this article.
Finally, we look forward to active contributions from the workshop
participants and to requests or advice from researchers in related areas,
both to carry out the Web retrieval task in NTCIR-3 and to construct more
usable test collections of Web documents.