[Session Notes] Introduction to NTCIR (Basically for newcomers)
[Meeting Program]
Date: June 15, 2010
Time: 11:00-12:00
Speaker: Noriko Kando
Summary:
In this talk, Dr. Noriko Kando gives newcomers an introduction to NTCIR. Her talk covers NTCIR's history, its main concerns, the evaluation workshop, the process of an NTCIR workshop, the tasks at past NTCIRs, and an overview of NTCIR-8. NTCIR, a research infrastructure for evaluating information access, is the abbreviation for NII Test Collection for Information Retrieval. The NTCIR project started in late 1997, and a workshop is held once every 18 months. The project organizes a series of evaluation workshops designed to enhance research in information-access technologies by providing an infrastructure for large-scale evaluations. The talk also includes a brief introduction to information retrieval and information access. The evaluation workshop is not a competition; rather, it provides participants with a set of data usable for experiments and with unified procedures for evaluation. The process of an NTCIR workshop includes the call for task proposals, selection of tasks, discussion of task design, evaluation methodologies and metrics, registration of task participants, submission of experimental results on test data, release of the right answers, paper submission, and the meeting. The speaker also introduces the kinds of activities a participant can take part in at NTCIR. Finally, she calls for NTCIR-9 task proposals.
by Jian Zhang
[Session Notes] EVIA 2010
Co-Chairs: Tetsuya Sakai, Mark Sanderson & William Webber
[Meeting Program] [Online Proceedings]
Invited talk
Date: June 15, 2010
Time: 13:00-13:50
Speaker: John Nave
Title: Microsoft's Bing and User Behavior Evaluation
Summary:
Microsoft's Bing takes a different approach to web search. The speaker explains why they call Bing a "decision engine" and what unique benefits this approach provides to users. The talk describes the vision and design principles of Bing and how these translate into what they build. The vision for Bing is global in nature, and the speaker presents surveys of searchers in Japan, covering the areas where searchers in Japan appear similar to searchers worldwide and touching on areas where their behaviors are unique. During the presentation the speaker gives many examples, and there is an active discussion between the speaker and the audience during the Q&A session.
Time: 13:50-14:10
Speaker: Sukomal Pal
Title: Estimating Pool-depth on Per Query Basis
Summary:
They demonstrate a simple and pragmatic approach for creating smaller pools for the evaluation of ad hoc retrieval systems. Instead of using an a priori fixed depth, pooling based on a variable pool-depth is adopted. The pool for each topic is incrementally built and judged interactively. When no new relevant document is found for a reasonably long run of pool-depths, pooling can be stopped for that topic. Based on the available effort and the required performance level, the proposed approach can be adjusted for optimality. Experiments on TREC-7, TREC-8 and NTCIR-5 data show its efficacy in substantially reducing pool size without seriously compromising the reliability of evaluation. Unlike other low-cost evaluation proposals, their method is not based on statistical sampling, nor does it look for a few good topics. Within the traditional framework of the Cranfield paradigm, it offers an interactive pooling approach based on a variable pool-depth per query. The approach greatly reduces assessment effort for most queries, where the pool saturates quickly. For the queries where the rate of finding new relevant documents is quite high, better estimates of recall can be ensured by going deeper into the pool (k > 100). A potential criticism is that k is determined dynamically per query, which places more responsibility on the assessors. To build a large test collection based on the Cranfield methodology, their simple approach can be cost-effective yet reliable. Their results conform to the findings of the NTCIR CLIA and IR4QA tasks that popular documents (documents retrieved by many systems at high ranks) are more likely to be relevant. However, compared to other low-cost evaluation methodologies, they have not checked how reusable their method is, i.e. how accurately a collection built with their proposal can evaluate a 'new system'. This study will certainly be one direction of their future work.
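As a concrete illustration of the pooling procedure summarized above, the following Python sketch shows one way per-topic, variable-depth pooling with an early-stopping rule could be implemented. The function names, the patience parameter, and the maximum depth are illustrative assumptions, not the authors' actual code.

def pool_query(runs, judge, patience=10, max_depth=200):
    """Incrementally deepen the pool for one topic; stop once no new
    relevant document has been found for `patience` consecutive depths."""
    judged, relevant = set(), set()
    depths_without_new_rel = 0
    for depth in range(1, max_depth + 1):
        found_new_rel = False
        for run in runs:                      # each run is a ranked list of doc IDs
            if depth <= len(run):
                doc = run[depth - 1]
                if doc not in judged:
                    judged.add(doc)           # one assessment per new document
                    if judge(doc):            # judge(doc) returns True if relevant
                        relevant.add(doc)
                        found_new_rel = True
        if found_new_rel:
            depths_without_new_rel = 0
        else:
            depths_without_new_rel += 1
            if depths_without_new_rel >= patience:
                break                         # the pool has saturated for this topic
    return judged, relevant

The available assessment effort can be traded against reliability by tuning the stopping patience and the maximum depth per topic.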
Time: 14:10-14:40
Speaker: William Webber
Title: The Effect of Pooling and Evaluation Depth on Metric Stability
Summary:
The profusion of information retrieval effectiveness metrics has inspired
the development of meta-evaluative criteria for choosing between them.
One such criterion is discriminative power; that is, the proportion of
system pairs whose difference in effectiveness is found statistically significant.
Studies of discriminative power frequently find normalized discounted cumulative
gain (nDCG) to be the most discriminative metric, but there has been no
satisfactory explanation of which feature makes it so discriminative. They
examine the discriminative power of nDCG and several other metrics under
different evaluation and pooling depths, and with different forms of score
normalization. They find that evaluation depth is more important to metric
behaviour and discriminative power than metric type; that evaluating beyond
pooling depth does not seem to lead to a misleading system reinforcement
effect; and that nDCG does seem to have a genuine, albeit slight, edge
in discriminative power under a range of conditions.
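To make the notion of discriminative power concrete, here is a minimal Python sketch that computes the proportion of system pairs whose per-topic score differences are statistically significant. A paired t-test at alpha = 0.05 stands in for whatever significance test a particular study uses; the function and parameter names are assumptions for illustration.

from itertools import combinations
from scipy.stats import ttest_rel

def discriminative_power(scores_by_system, alpha=0.05):
    """scores_by_system: dict mapping system name -> list of per-topic scores,
    in the same topic order for every system. Returns the proportion of
    system pairs whose score difference is judged statistically significant."""
    pairs = list(combinations(scores_by_system, 2))
    significant = 0
    for a, b in pairs:
        _, p_value = ttest_rel(scores_by_system[a], scores_by_system[b])
        if p_value < alpha:
            significant += 1
    return significant / len(pairs) if pairs else 0.0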
Time: 14:50-15:05
Speaker: Nicola Ferro
Title: 10 Years of CLEF Data in DIRECT: Where We Are and Where We Can Go
Summary:
They discuss the evolution of large-scale evaluation campaigns and the
corresponding evaluation infrastructures needed to carry them out. They
present the next challenges for these initiatives and show how digital
library systems can play a relevant role in supporting the research conducted
in these forums by acting as virtual research environments. They discuss some compelling issues that large-scale evaluation campaigns should take into consideration when it comes to managing, describing, and providing access to the scientific data produced during their course. They then present the DIRECT system, which they have been developing in CLEF since 2005 in order to start addressing some of those issues. Finally, they discuss some ongoing activities aimed at using DIRECT not only as a campaign management tool but also as a dissemination source for the scientific data produced during the last ten years of CLEF campaigns. Moreover, they outline some possible future directions that they will pursue to favor an active involvement of users with the managed data. Much work is still to come, for example conducting user studies to assess the actual utilization of the system and gathering suggestions for possible future directions.
Time: 15:10-15:35
Speaker: Tetsuya Sakai
Title: Ranking Retrieval Systems without Relevance Assessments - Revisited
Summary:
They re-examine the problem of ranking retrieval systems without relevance
assessments in the context of collaborative evaluation forums such as TREC
and NTCIR. The problem was first tackled by Soboroff, Nicholas and Cahan
in 2001, using data from TRECs 3-8. Their long-term goal is to semi-automate
repeated evaluation of search engines; their short-term goal is to provide
NTCIR participants with a "system ranking forecast" prior to conducting manual relevance assessments, thereby reducing researchers' idle time and
accelerating research. Their extensive experiments using graded-relevance
test collections from TREC and NTCIR compare several existing methods for
ranking systems without relevance assessments. They show (a) that the simplest method of forming "pseudo-qrels" based on how many systems returned each pooled document performs as well as any other existing method; and (b) that the NTCIR system rankings tend to be easier to predict than the TREC robust track system rankings, and moreover, the NTCIR pseudo-qrels yield
fewer false alarms than the TREC pseudo-qrels do in statistical significance
testing. These differences between TREC and NTCIR may be because TREC sorts
pooled documents by document IDs before relevance assessments, while NTCIR
sorts them primarily by the number of systems that returned the document.
However, they show that, even for the TREC robust data, documents returned
by many systems are indeed more likely to be relevant than those returned
by fewer systems. Their experimental results challenge a few previous studies.
Lack of reproducibility and lack of "real" progress are growing concerns in the IR community and elsewhere. While sharing data and programs among researchers is certainly important for improving this situation, equally important are (1) describing the algorithms and experiments clearly, and (2) evaluating with diverse data sets and multiple evaluation metrics. They believe that the present study has a strength over similar studies in these respects. There is still a long way to go towards semi-automatic evaluation of Web search engines, where "majority vote" is not really an option: utilizing clickthrough data, for example, is a more realistic approach for such a purpose. On the bright side, however, based on the insights from the present study, the NTCIR-8 ACLIA-2 IR4QA task has actually adopted their proposed framework of providing "system ranking forecasts" to participants right after the run submission deadline. The actual usefulness of such
forecasts will be investigated in future work. The relationship between
the accuracy of forecasts and the maturity of evaluation workshops should
also be investigated.
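The following Python sketch illustrates the simplest pseudo-qrel idea described above: a pooled document is treated as pseudo-relevant when enough runs return it near the top, and systems are then ranked by an ordinary metric (here, average precision for a single topic) computed against those pseudo-judgments. The pool depth, the vote threshold, and all names are illustrative assumptions, not values or code from the paper.

from collections import Counter

def pseudo_qrels(runs, pool_depth=100, min_votes=3):
    """Treat a pooled document as pseudo-relevant if at least `min_votes`
    of the submitted runs return it within their top `pool_depth`."""
    votes = Counter()
    for run in runs:
        for doc in set(run[:pool_depth]):
            votes[doc] += 1
    return {doc for doc, v in votes.items() if v >= min_votes}

def average_precision(run, pseudo_relevant, cutoff=1000):
    """Average precision of one run against the pseudo-relevant set (one topic;
    averaging over topics would give a MAP-style forecast)."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(run[:cutoff], start=1):
        if doc in pseudo_relevant:
            hits += 1
            total += hits / rank
    return total / len(pseudo_relevant) if pseudo_relevant else 0.0

def forecast_ranking(runs_by_system):
    """Forecast a system ranking by sorting runs by AP against the pseudo-qrels."""
    rel = pseudo_qrels(list(runs_by_system.values()))
    return sorted(runs_by_system,
                  key=lambda s: average_precision(runs_by_system[s], rel),
                  reverse=True)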
Time: 15:35-16:05
Speaker: Ian Soboroff
Title: Test Collection Diagnosis and Treatment
Summary:
Test collections are a mainstay of information retrieval research. Since
the 1990s, large reusable test collections have been developed in the context
of community evaluations such as TREC, NTCIR, CLEF, and INEX. Recently,
advances in pooling practice as well as crowd sourcing technologies have
placed test collection building back into the hands of the small research
group or company. In all of these cases, practitioners should be aware
of, and concerned about, the quality of test collections. They survey work in test collection quality measures, reference case studies to illustrate their use, and provide guidelines on assessing the quality of test collections
in practice. Creating test collections is more of an art than a science
at present. Over many years and through the creation of many test collections,
a small body of techniques and analysis tools have been developed to help
diagnose when a test collection may have problems. These techniques are
compiled in various articles and in the tacit knowledge of test collection
builders. They bring together this information and create a guide to the
prominent results in this area. They plan to make portable code available
for computing the tests described here, using an open-source code repository,
further aiding practitioners in the field. The issue on test collection
diagnosis will remain open. There is a long way to go. Future-proofing
test collections is a hard problem, making all the harder by new large-scale
collections that discourage computationally intensive but possibly revolutionary
results. New collections for novel tasks and media domains are also at
risk of poor reliability due to immature participating systems. Crowd sourcing
and other methods for compiling inexpensive relevance judgments force us
to consider the quality of that data. A strong suite of tools develop is
needed to build to measure the quality of test collections and improve
the reliability of modern, Canfield-style experiments. Much of what is
known about building test collections comes from evaluation forums where
the system outputs from multiple research groups and many systems are combined
using pooling or some close equivalent. However, small research groups
as well as companies need purpose built test collections to measure search
quality, and in those contexts the diversity and richness of an evaluation
forum is impossible to achieve. Future work includes studying this area and understanding how these practices can be translated to those communities.
Time: 16:20-16:50
Speaker: Tetsuya Sakai
Title: Simple Evaluation Metrics for Diversified Search Results
Summary:
Traditional information retrieval research has mostly focused on satisfying
clearly specified information needs. However, in reality, queries are often
ambiguous and/or underspecified. In light of this, evaluating search result
diversity is beginning to receive attention. They propose simple evaluation
metrics for diversified Web search results. Their presumptions are that
one or more interpretations are possible for each given query, and that
graded relevance assessments are available for intent-document pairs as
opposed to query-document pairs. Their goals are (a) to retrieve documents
that cover as many intents as possible; and (b) to rank documents that
are highly relevant to more popular intents higher than those that are
marginally relevant to less popular intents. Unlike the Intent-Aware (IA)
metrics, their metrics successfully avoid ignoring minor intents. Unlike
α-nDCG, their metrics can accommodate (i) which intents are more likely
than others for a given query; and (ii) graded relevance within each intent.
Furthermore, unlike these existing metrics, their metrics do not require
approximation, and they range between 0 and 1. Experiments with the binary-relevance
Diversity Task data from the TREC 2009 Web Track suggest that their metrics
correlate well with existing metrics but can be more intuitive. Hence,
they argue that their metrics are suitable for diversity evaluation given
either the intent likelihood information or per-intent graded relevance,
or preferably both.
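One way to realize a metric in the spirit described above is to collapse per-intent graded relevance into a single "global gain" per document, weighted by the estimated intent probabilities, and then compute nDCG over those combined gains so the score stays within [0, 1]. The Python sketch below follows that idea; the gain combination, log2 discount, cutoff, and names are assumptions for illustration rather than the authors' exact metric definitions.

import math

def global_gain(doc, intent_probs, per_intent_gain):
    """Combined gain of a document: sum over intents of P(intent) * graded gain."""
    return sum(p * per_intent_gain.get((intent, doc), 0)
               for intent, p in intent_probs.items())

def diversity_ndcg(ranking, all_docs, intent_probs, per_intent_gain, k=10):
    """nDCG over intent-weighted global gains, so the score lies in [0, 1]."""
    gains = [global_gain(d, intent_probs, per_intent_gain) for d in ranking[:k]]
    ideal = sorted((global_gain(d, intent_probs, per_intent_gain) for d in all_docs),
                   reverse=True)[:k]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

Because minor intents keep a nonzero weight in the combined gain, they are never ignored outright, while more popular intents still contribute more to the score.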
Time: 16:45-17:15
Speaker: Ruihua Song
Title: Constructing a Test Collection with Multi-Intent Queries
Summary:
Users often issue vague queries; when their intents cannot be predicted
precisely, a natural solution is to diversify the search results, hoping
that some of the results correspond to the intent; this is usually called "result diversification". Only a few studies have been conducted to systematically
evaluate approaches on result diversity. Some questions still remain unanswered:
1) As all intents cannot exhaustively be listed in an evaluation, how does
an incomplete intent set influence evaluation results? 2) Intents are not
equally popular; so how can the probability of each intent be estimated?
They address these questions in building up a test collection for multi-intent
queries. The labeling tool that they have developed allows assessors to
add new intents while performing relevance assessments. Thus, it is possible
to investigate the influence of an incomplete intent set through experiments.
Moreover, they propose two simple methods to estimate the probabilities
of the underlying intents. Experimental results indicate that the evaluation results differ when these probabilities are taken into consideration.
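As a hypothetical illustration of how intent probabilities might be estimated, the Python sketch below simply normalizes, per query, the number of relevance labels associated with each intent. This is an assumed method for illustration only and is not necessarily either of the two estimators proposed in the paper.

from collections import Counter

def estimate_intent_probs(intent_labels):
    """intent_labels: one intent ID per observed (document, intent) relevance
    label for the query. Returns a normalized probability for each intent."""
    counts = Counter(intent_labels)
    total = sum(counts.values())
    return {intent: n / total for intent, n in counts.items()}

# Example: six labels for a "map" intent, three for "lyrics", one for "movie".
print(estimate_intent_probs(["map"] * 6 + ["lyrics"] * 3 + ["movie"]))
# -> {'map': 0.6, 'lyrics': 0.3, 'movie': 0.1}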
Time: 17:15-17:45
Speaker: Katrin Lamm
Title: The Influence of Expectation and System Performance on User Satisfaction with Retrieval Systems
Summary:
Correlations between information retrieval system performance and user
satisfaction are an important research topic. The expectation of users
is a factor in most models of customer satisfaction in marketing research;
however, it has not been used in experiments with information retrieval
systems so far. In an interdisciplinary effort between information retrieval
and psychology they developed an experimental design which uses the so-called
confirmation/disconfirmation paradigm (C/D-paradigm) as a theoretical framework.
This paradigm predicts that the satisfaction of users is strongly governed
by their expectations towards a product or a system. They report a study
with 89 participants in which two levels of both system performance and
user expectation were tested. The results show that user expectation has
an effect on the satisfaction as predicted by the C/D-paradigm. In addition,
they confirmed previous studies which hint that system performance correlates
with user satisfaction. The experiment also revealed that users significantly
relax their relevance criteria and compensate for low system performance.
Time: 17:45-18:15
Speaker: Keun Chan Park
Title: A Game-based Evaluation Method for Subjective Tasks Involving Text Content Analysis
Summary:
Standard test collections have remarkably supported the growth of research
fields by allowing direct comparisons among algorithms. However, test collections only exist in popular research areas. Moreover, constructing a test collection requires a huge amount of time, cost, and human labor. On that account, many research fields, including newly emerging areas, evaluate their results with manually constructed test sets. However, such test sets are unreliable because they often use a small number of raters, and they are even more unreliable when the task is subjective. They define a subjective task as a task where the judgment may differ between individuals due to various aspects such as preference and interest, while still preserving a sense of commonality. They address the problem of evaluating subjective tasks using a computer game: playing the game performs the subjective task as a side effect, and utilizing the piles of game results leads to an objective evaluation. Their result significantly outperforms the baseline in terms of efficiency and shows that evaluating through their approach is nearly the same as evaluating with a gold standard.
by Jian Zhang
Last updated: July 08, 2010