EVIA Abstract

Microsoft's Bing and User Behavior Evaluation [Pdf] [Table of Content]

      John Nave

Microsoftfs Bing takes a different approach to web search. Why do we call Bing a gdecision engineh and what are the unique benefits this approach provides to users? This talk will describe the vision and design principles of Bing and how this translates into what we build. Behind the code there is a great deal of customer research and a few fundamental insights that motivate our designs. Evaluation of user behaviors suggest that there are many areas were they could be getting much more benefit from a web search engine. The vision for Bing is global in nature, but what about searchers in Japan? We will cover the areas where searchers in here appear similar to searchers worldwide, and also touch on areas where their behaviors are unique.

Estimating Pool-depth on Per Query Basis [Pdf] [Slides] [Table of Content]

      Sukomal Pal, Mandar Mitra, Samaresh Maiti

This paper demonstrates a simple and pragmatic approach for the creation of smaller pools for evaluation of ad hoc retrieval systems. Instead of using a pool depth fixed a priori, variable-depth pooling is adopted. The pool for each topic is incrementally built and judged interactively. When no new relevant document is found for a reasonably long run of pool depths, pooling can be stopped for the topic. Based on available effort and required performance level, the proposed approach can be adjusted for optimality. Experiments on TREC-7, TREC-8 and NTCIR-5 data show its efficacy in substantially reducing pool size without seriously compromising reliability of evaluation.
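The stopping rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the `patience` parameter (how many consecutive depths without a new relevant document count as "a reasonably long run"), and the `max_depth` cap are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def variable_depth_pool(
    runs: List[List[str]],
    is_relevant: Callable[[str], bool],
    patience: int = 3,      # hypothetical: depths tolerated without a new relevant doc
    max_depth: int = 100,   # hypothetical: hard cap on pool depth
) -> Tuple[int, Dict[str, bool]]:
    """Build the pool for one topic depth by depth, judging as we go.

    Stops once `patience` consecutive depths add no new relevant document,
    or when `max_depth` is reached. Returns (final depth, judgments).
    """
    judged: Dict[str, bool] = {}
    dry_depths = 0
    depth = 0
    while depth < max_depth and dry_depths < patience:
        found_new_relevant = False
        for run in runs:
            if depth < len(run):
                doc = run[depth]
                if doc not in judged:
                    # The assessor judges each newly pooled document once.
                    judged[doc] = is_relevant(doc)
                    found_new_relevant = found_new_relevant or judged[doc]
        depth += 1
        dry_depths = 0 if found_new_relevant else dry_depths + 1
    return depth, judged


# Toy example: two runs, only document "a" is relevant.
toy_runs = [["a", "b", "c", "d", "e"], ["a", "x", "y", "z", "w"]]
final_depth, judgments = variable_depth_pool(
    toy_runs, lambda d: d == "a", patience=2
)
```

In this toy case pooling stops at depth 3 after judging five documents, rather than judging all nine documents a fixed depth of 5 would require. A larger `patience` trades more assessor effort for a lower risk of missing relevant documents.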

The Effect of Pooling and Evaluation Depth on Metric Stability [Pdf] [Slides] [Table of Content]

      William Webber, Alistair Moffat, Justin Zobel

The profusion of information retrieval effectiveness metrics has inspired the development of meta-evaluative criteria for choosing between them. One such criterion is discriminative power; that is, the proportion of system pairs whose difference in effectiveness is found statistically significant. Studies of discriminative power frequently find normalized discounted cumulative gain (nDCG) to be the most discriminative metric, but there has been no satisfactory explanation of which feature makes it so discriminative. In this paper, we examine the discriminative power of nDCG and several other metrics under different evaluation and pooling depths, and with different forms of score normalization. We find that evaluation depth is more important to metric behaviour and discriminative power than metric type; that evaluating beyond pooling depth does not seem to lead to a misleading system reinforcement effect; and that nDCG does seem to have a genuine, albeit slight, edge in discriminative power under a range of conditions.
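To make the notion of evaluation depth concrete, the following is a minimal sketch of nDCG computed at a cutoff `k`. It uses one common formulation (gain divided by log2 of rank plus one), which is not necessarily the exact variant studied in the paper. As a simplifying assumption, the ideal ranking is derived from the gains of the ranked list itself rather than from the full set of judged documents for the topic.

```python
import math
from typing import List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the top k ranks (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains: List[float], k: int) -> float:
    """nDCG at evaluation depth k.

    Assumption: the ideal ranking is the same gains sorted in decreasing
    order; a real evaluation would use all judged documents for the topic.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Graded gains of one ranked list, top rank first.
ranked_gains = [3.0, 2.0, 0.0, 1.0]
shallow = ndcg_at_k(ranked_gains, 2)  # evaluation depth 2
deep = ndcg_at_k(ranked_gains, 4)     # evaluation depth 4
```

Here the list looks perfect at depth 2 but slightly imperfect at depth 4, a small illustration of how the chosen evaluation depth changes what a metric can distinguish.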

10 Years of CLEF Data in DIRECT: Where We Are and Where We Can Go [Pdf] [Slides] [Table of Content]

      Maristella Agosti, Giorgio Maria Di Nunzio, Marco Dussin, Nicola Ferro

This paper discusses the evolution of large-scale evaluation campaigns and the corresponding evaluation infrastructures needed to carry them out. We present the next challenges for these initiatives and show how digital library systems can play a relevant role in supporting the research conducted in these fora by acting as virtual research environments.

Ranking Retrieval Systems without Relevance Assessments - Revisited [Pdf] [Table of Content]

      Tetsuya Sakai, Chin-Yew Lin

We re-examine the problem of ranking retrieval systems without relevance assessments in the context of collaborative evaluation forums such as TREC and NTCIR. The problem was first tackled by Soboroff, Nicholas and Cahan in 2001, using data from TRECs 3-8 [16]. Our long-term goal is to semi-automate repeated evaluation of search engines; our short-term goal is to provide NTCIR participants with a "system ranking forecast" prior to conducting manual relevance assessments, thereby reducing researchers' idle time and accelerating research. Our extensive experiments using graded-relevance test collections from TREC and NTCIR compare several existing methods for ranking systems without relevance assessments. We show that (a) the simplest method of forming "pseudo-qrels" based on how many systems returned each pooled document performs as well as any other existing method; and that (b) the NTCIR system rankings tend to be easier to predict than the TREC robust track system rankings, and moreover, the NTCIR pseudo-qrels yield fewer false alarms than the TREC pseudo-qrels do in statistical significance testing. These differences between TREC and NTCIR may be because TREC sorts pooled documents by document IDs before relevance assessments, while NTCIR sorts them primarily by the number of systems that returned the document. However, we show that, even for the TREC robust data, documents returned by many systems are indeed more likely to be relevant than those returned by fewer systems.
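The idea of forming pseudo-qrels from system agreement can be sketched as follows. This is an illustration under stated assumptions, not the procedure from the paper: the `min_systems` threshold, the use of precision at `k` as the scoring metric, and all function names are hypothetical choices.

```python
from collections import Counter
from typing import Dict, List, Set


def pseudo_qrels(runs: Dict[str, List[str]], pool_depth: int, min_systems: int) -> Set[str]:
    """Treat a pooled document as pseudo-relevant if at least
    `min_systems` runs returned it within the top `pool_depth` ranks."""
    votes: Counter = Counter()
    for ranking in runs.values():
        # Use a set so each run votes at most once per document.
        votes.update(set(ranking[:pool_depth]))
    return {doc for doc, n in votes.items() if n >= min_systems}


def forecast_ranking(runs: Dict[str, List[str]], qrels: Set[str], k: int) -> List[str]:
    """Rank systems by precision at k against the pseudo-qrels
    (ties broken alphabetically by system name)."""
    scores = {
        name: sum(doc in qrels for doc in ranking[:k]) / k
        for name, ranking in runs.items()
    }
    return sorted(scores, key=lambda name: (-scores[name], name))


# Toy runs for a single topic.
toy = {
    "A": ["d1", "d2", "d3"],
    "B": ["d1", "d3", "d4"],
    "C": ["d5", "d1", "d6"],
}
qrels = pseudo_qrels(toy, pool_depth=3, min_systems=2)
forecast = forecast_ranking(toy, qrels, k=2)
```

With these toy runs, documents `d1` and `d3` become pseudo-relevant and system `B` is forecast to rank first. No human judgments are involved, which is what allows a forecast before manual assessment begins.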

Test Collection Diagnosis and Treatment [Pdf] [Table of Content]

      Ian Soboroff

Test collections are a mainstay of information retrieval research. Since the 1990s, large reusable test collections have been developed in the context of community evaluations such as TREC, NTCIR, CLEF, and INEX. Recently, advances in pooling practice as well as crowdsourcing technologies have placed test collection building back into the hands of the small research group or company. In all of these cases, practitioners should be aware of, and concerned about, the quality of test collections. This paper surveys work in test collection quality measures, references case studies to illustrate their use, and provides guidelines on assessing the quality of test collections in practice.

Simple Evaluation Metrics for Diversified Search Results [Pdf] [Table of Content]

      Tetsuya Sakai, Nick Craswell, Ruihua Song, Stephen Robertson, Zhicheng Dou, Chin-Yew Lin

Traditional information retrieval research has mostly focussed on satisfying clearly specified information needs. However, in reality, queries are often ambiguous and/or underspecified. In light of this, evaluating search result diversity is beginning to receive attention. We propose simple evaluation metrics for diversified Web search results. Our presumptions are that one or more interpretations (or intents) are possible for each given query, and that graded relevance assessments are available for intent-document pairs (as opposed to query-document pairs). Our goals are (a) to retrieve documents that cover as many intents as possible; and (b) to rank documents that are highly relevant to more popular intents higher than those that are marginally relevant to less popular intents. Unlike the Intent-Aware (IA) metrics proposed by Agrawal et al., our metrics successfully avoid ignoring minor intents. Unlike α-nDCG proposed by Clarke et al., our metrics can accommodate (i) which intents are more likely than others for a given query; and (ii) graded relevance within each intent. Furthermore, unlike these existing metrics, our metrics do not require approximation, and they range between 0 and 1. Experiments with the binary-relevance Diversity Task data from the TREC 2009 Web Track suggest that our metrics correlate well with existing metrics but can be more intuitive. Hence, we argue that our metrics are suitable for diversity evaluation given either the intent likelihood information or per-intent graded relevance, or preferably both.

Constructing a Test Collection with Multi-Intent Queries [Pdf] [Table of Content]

      Ruihua Song, Dongjie Qi, Hua Liu, Tetsuya Sakai, Jian-Yun Nie, Hsiao-Wuen Hon, Yong Yu

Users often issue vague queries; when we cannot predict their intents precisely, a natural solution is to diversify the search results, hoping that some of the results correspond to the intent. This is usually called "result diversification". Only a few studies have systematically evaluated approaches to result diversity. Some questions still remain unanswered: 1) As we cannot exhaustively list all intents in an evaluation, how does an incomplete intent set influence evaluation results? 2) Intents are not equally popular; so how can we estimate the probability of each intent? In this paper, we address these questions in building up a test collection for multi-intent queries. The labeling tool that we have developed allows assessors to add new intents while performing relevance assessments. Thus, we can investigate the influence of an incomplete intent set through experiments. Moreover, we propose two simple methods to estimate the probabilities of the underlying intents. Experimental results indicate that the evaluation results are different if we take the probabilities into consideration.

The Influence of Expectation and System Performance on User Satisfaction with Retrieval Systems [Pdf] [Slides] [Table of Content]

      Katrin Lamm, Thomas Mandl, Christa Womser-Hacker, Werner Greve

Correlations between information retrieval system performance and user satisfaction are an important research topic. The expectation of users is a factor in most models of customer satisfaction in marketing research; however, it has not been used in experiments with information retrieval systems so far. In an interdisciplinary effort between information retrieval and psychology we developed an experimental design which uses the so-called confirmation/disconfirmation paradigm (C/D-paradigm) as a theoretical framework. This paradigm predicts that the satisfaction of users is strongly governed by their expectations towards a product or a system. We report a study with 89 participants in which two levels of both system performance and user expectation were tested. The results show that user expectation has an effect on the satisfaction as predicted by the C/D-paradigm. In addition, we confirmed previous studies which hint that system performance correlates with user satisfaction. The experiment also revealed that users significantly relax their relevance criteria and compensate for low system performance.

A Game-based Evaluation Method for Subjective Tasks Involving Text Content Analysis [Pdf] [Table of Content]

      Keun Chan Park, Jihee Ryu, Kyung-min Kim, Sung Hyon Myaeng

Standard test collections have remarkably supported the growth of research fields by allowing direct comparisons among algorithms. However, test collections exist only in popular research areas. Moreover, constructing a test collection requires a huge amount of time, cost and human labor. Consequently, many research fields, including newly emerging areas, evaluate their results using manually constructed test sets. However, such test sets are unreliable because they often use a small number of raters, and they are even more unreliable when the task is subjective. We define a subjective task as a task where judgments may differ among individuals owing to factors such as preference and interest, while still preserving a sense of commonality. We address the problem of evaluating subjective tasks using a computer game. Playing the game performs the subjective task as a side effect, and aggregating the accumulated game results leads to an objective evaluation. Our results outperform the baseline significantly in terms of efficiency and show that evaluating through our approach is nearly the same as evaluating with a gold standard.