[Session Notes] Introduction to NTCIR (Basically for newcomers)
[Meeting Program]
Date: June 15, 2010
Time: 11:00-12:00
Speaker: Noriko Kando
Summary:
In this talk, Dr. Noriko Kando gives newcomers an introduction to NTCIR. Her talk covers NTCIR's history, its main concerns, the evaluation workshop, the process of an NTCIR workshop, the tasks at past NTCIRs, and an overview of NTCIR-8. NTCIR, a research infrastructure for evaluating information access, is the abbreviation for NII Test Collection for Information Retrieval. The NTCIR project started in late 1997, and a workshop is held once every 18 months. The project organizes a series of evaluation workshops designed to enhance research in information-access technologies by providing an infrastructure for large-scale evaluations. The talk also includes a brief introduction to information retrieval and information access. The evaluation workshop is not a competition; rather, it provides participants with a set of data usable for experiments and with unified procedures for evaluation. The process of an NTCIR workshop includes the call for task proposals, selection of tasks, discussion of task design, evaluation methodologies and metrics, registration of task participants, submission of experimental results on test data, release of the right answers, paper submission, and the meeting. The speaker also introduces the kinds of activities a participant can take part in at NTCIR. Finally, she calls for NTCIR-9 task proposals.
by Jian Zhang
[Session Notes] EVIA 2010
Co-Chairs: Tetsuya Sakai, Mark Sanderson & William Webber
[Meeting Program] [Online Proceedings]
Invited talk
Date: June 15, 2010
Time: 13:00-13:50
Speaker: John Nave
Title: Microsoft's Bing and User Behavior Evaluation
Summary:
Microsoft's Bing takes a different approach to web search. The speaker explains why they call Bing a "decision engine" and what unique benefits this approach provides to users. The talk describes the vision and design principles of Bing and how these translate into what they build. The vision for Bing is global in nature, and the speaker presents surveys of searchers in Japan, covering the areas where searchers in Japan appear similar to searchers worldwide and touching on areas where their behaviors are unique. During the presentation the speaker gives many examples, and there is an active discussion between the speaker and the audience during the Q&A session.
Time: 13:50-14:10
Speaker: Sukomal Pal
Title: Estimating Pool-depth on Per Query Basis
Summary:
They demonstrate a simple and pragmatic approach for creating smaller pools for the evaluation of ad hoc retrieval systems. Instead of using an a priori fixed depth, pooling based on a variable pool-depth is adopted. The pool for each topic is incrementally built and judged interactively. When no new relevant document is found for a reasonably long run of pool-depths, pooling can be stopped for that topic. Based on the available effort and the required performance level, the proposed approach can be adjusted for optimality. Experiments on TREC-7, TREC-8 and NTCIR-5 data show its efficacy in substantially reducing pool size without seriously compromising the reliability of evaluation. Unlike other low-cost evaluation proposals, their method is not based on statistical sampling, nor does it look for a few good topics. Within the traditional framework of the Cranfield paradigm, it offers an interactive pooling approach based on a variable pool-depth per query. The approach greatly reduces assessment effort for most queries, where the pool saturates quickly. For the queries where the rate of finding new relevant documents is quite high, better estimates of recall can be ensured by going deeper into the pool (k > 100). A potential criticism is that k is determined dynamically per query, which places more responsibility on the assessors. To build a large test collection based on the Cranfield methodology, their simple approach can be cost-effective yet reliable. Their results conform to the findings of the NTCIR CLIA and IR4QA tasks that popular documents (documents retrieved by many systems at high ranks) are more likely to be relevant. However, compared to other low-cost evaluation methodologies, they have not checked how reusable their method is, i.e. how accurately a collection built with their proposal can evaluate a 'new system'. This study will certainly be one direction of their future work.
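As a concrete illustration of the pooling procedure summarized above, the following Python sketch shows one way per-topic, variable-depth pooling with an early-stopping rule could be implemented. The function names, the patience parameter, and the maximum depth are illustrative assumptions, not the authors' actual code.

def pool_query(runs, judge, patience=10, max_depth=200):
    """Incrementally deepen the pool for one topic; stop once no new
    relevant document has been found for `patience` consecutive depths."""
    judged, relevant = set(), set()
    depths_without_new_rel = 0
    for depth in range(1, max_depth + 1):
        found_new_rel = False
        for run in runs:                      # each run is a ranked list of doc IDs
            if depth <= len(run):
                doc = run[depth - 1]
                if doc not in judged:
                    judged.add(doc)           # one assessment per new document
                    if judge(doc):            # judge(doc) returns True if relevant
                        relevant.add(doc)
                        found_new_rel = True
        if found_new_rel:
            depths_without_new_rel = 0
        else:
            depths_without_new_rel += 1
            if depths_without_new_rel >= patience:
                break                         # the pool has saturated for this topic
    return judged, relevant

The available assessment effort can be traded against reliability by tuning the stopping patience and the maximum depth per topic.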
Time: 14:10-14:40
Speaker: William Webber
Title: The Effect of Pooling and Evaluation Depth on Metric Stability
Summary:
The profusion of information retrieval effectiveness metrics has inspired
the development of meta-evaluative criteria for choosing between them.
One such criterion is discriminative power; that is, the proportion of
system pairs whose difference in effectiveness is found statistically significant.
Studies of discriminative power frequently find normalized discounted cumulative
gain (nDCG) to be the most discriminative metric, but there has been no
satisfactory explanation of which feature makes it so discriminative. They
examine the discriminative power of nDCG and several other metrics under
different evaluation and pooling depths, and with different forms of score
normalization. They find that evaluation depth is more important to metric
behaviour and discriminative power than metric type; that evaluating beyond
pooling depth does not seem to lead to a misleading system reinforcement
effect; and that nDCG does seem to have a genuine, albeit slight, edge
in discriminative power under a range of conditions.
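To make the notion of discriminative power concrete, here is a minimal Python sketch that computes the proportion of system pairs whose per-topic score differences are statistically significant. A paired t-test at alpha = 0.05 stands in for whatever significance test a particular study uses; the function and parameter names are assumptions for illustration.

from itertools import combinations
from scipy.stats import ttest_rel

def discriminative_power(scores_by_system, alpha=0.05):
    """scores_by_system: dict mapping system name -> list of per-topic scores,
    in the same topic order for every system. Returns the proportion of
    system pairs whose score difference is judged statistically significant."""
    pairs = list(combinations(scores_by_system, 2))
    significant = 0
    for a, b in pairs:
        _, p_value = ttest_rel(scores_by_system[a], scores_by_system[b])
        if p_value < alpha:
            significant += 1
    return significant / len(pairs) if pairs else 0.0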
Time: 14:50-15:05
Speaker: Nicola Ferro
Title: 10 Years of CLEF Data in DIRECT: Where We Are and Where We Can Go
Summary:
They discuss the evolution of large-scale evaluation campaigns and the
corresponding evaluation infrastructures needed to carry them out. They
present the next challenges for these initiatives and show how digital
library systems can play a relevant role in supporting the research conducted
in these forums by acting as virtual research environments. They discuss some compelling issues that large-scale evaluation campaigns should take into consideration when it comes to managing, describing, and providing access to the scientific data produced during their course. They then present the DIRECT system, which they have been developing in CLEF since 2005 in order to start addressing some of those issues. Finally, they discuss some ongoing activities aimed at using DIRECT not only as a campaign management tool but also as a dissemination source for the scientific data produced during the last ten years of CLEF campaigns. Moreover, they outline some possible future directions that they will pursue to favor an active involvement of users with the managed data. Much work is still to come, for example conducting user studies to assess the actual utilization of the system and gathering suggestions for possible future directions.
Time: 15:10-15:35
Speaker: Tetsuya Sakai
Title: Ranking Retrieval Systems without Relevance Assessments - Revisited
Summary:
They re-examine the problem of ranking retrieval systems without relevance
assessments in the context of collaborative evaluation forums such as TREC
and NTCIR. The problem was first tackled by Soboroff, Nicholas and Cahan
in 2001, using data from TRECs 3-8. Their long-term goal is to semi-automate
repeated evaluation of search engines; their short-term goal is to provide
NTCIR participants with a "system ranking forecast" prior to conducting manual relevance assessments, thereby reducing researchers' idle time and
accelerating research. Their extensive experiments using graded-relevance
test collections from TREC and NTCIR compare several existing methods for
ranking systems without relevance assessments. They show (a) that the simplest method of forming "pseudo-qrels" based on how many systems returned each pooled document performs as well as any other existing method; and (b) that the NTCIR system rankings tend to be easier to predict than the TREC robust track system rankings, and moreover, the NTCIR pseudo-qrels yield
fewer false alarms than the TREC pseudo-qrels do in statistical significance
testing. These differences between TREC and NTCIR may be because TREC sorts
pooled documents by document IDs before relevance assessments, while NTCIR
sorts them primarily by the number of systems that returned the document.
However, they show that, even for the TREC robust data, documents returned
by many systems are indeed more likely to be relevant than those returned
by fewer systems. Their experimental results challenge a few previous studies.
Lack of reproducibility and lack of "real" progress are growing concerns in the IR community and elsewhere. While sharing data and programs among researchers is certainly important for improving this situation, equally important are (1) describing the algorithms and experiments clearly, and (2) evaluating with diverse data sets and multiple evaluation metrics. They believe that the present study has a strength over similar studies in these respects. There is still a long way to go towards semi-automatic evaluation of Web search engines, where "majority vote" is not really an option: utilizing clickthrough data, for example, is a more realistic approach for such a purpose. On the bright side, however, based on the insights from the present study, the NTCIR-8 ACLIA-2 IR4QA task has actually adopted their proposed framework of providing "system ranking forecasts" to participants right after the run submission deadline. The actual usefulness of such
forecasts will be investigated in future work. The relationship between
the accuracy of forecasts and the maturity of evaluation workshops should
also be investigated.
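The following Python sketch illustrates the simplest pseudo-qrel idea described above: a pooled document is treated as pseudo-relevant when enough runs return it near the top, and systems are then ranked by an ordinary metric (here, average precision for a single topic) computed against those pseudo-judgments. The pool depth, the vote threshold, and all names are illustrative assumptions, not values or code from the paper.

from collections import Counter

def pseudo_qrels(runs, pool_depth=100, min_votes=3):
    """Treat a pooled document as pseudo-relevant if at least `min_votes`
    of the submitted runs return it within their top `pool_depth`."""
    votes = Counter()
    for run in runs:
        for doc in set(run[:pool_depth]):
            votes[doc] += 1
    return {doc for doc, v in votes.items() if v >= min_votes}

def average_precision(run, pseudo_relevant, cutoff=1000):
    """Average precision of one run against the pseudo-relevant set (one topic;
    averaging over topics would give a MAP-style forecast)."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(run[:cutoff], start=1):
        if doc in pseudo_relevant:
            hits += 1
            total += hits / rank
    return total / len(pseudo_relevant) if pseudo_relevant else 0.0

def forecast_ranking(runs_by_system):
    """Forecast a system ranking by sorting runs by AP against the pseudo-qrels."""
    rel = pseudo_qrels(list(runs_by_system.values()))
    return sorted(runs_by_system,
                  key=lambda s: average_precision(runs_by_system[s], rel),
                  reverse=True)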
Time: 15:35-16:05
Speaker: Ian Soboroff
Title: Test Collection Diagnosis and Treatment
Summary:
Test collections are a mainstay of information retrieval research. Since
the 1990s, large reusable test collections have been developed in the context
of community evaluations such as TREC, NTCIR, CLEF, and INEX. Recently,
advances in pooling practice as well as crowd sourcing technologies have
placed test collection building back into the hands of the small research
group or company. In all of these cases, practitioners should be aware
of, and concerned about, the quality of test collections. They survey work in test collection quality measures, reference case studies to illustrate their use, and provide guidelines on assessing the quality of test collections
in practice. Creating test collections is more of an art than a science
at present. Over many years and through the creation of many test collections,
a small body of techniques and analysis tools have been developed to help
diagnose when a test collection may have problems. These techniques are
compiled in various articles and in the tacit knowledge of test collection
builders. They bring together this information and create a guide to the
prominent results in this area. They plan to make portable code available
for computing the tests described here, using an open-source code repository,
further aiding practitioners in the field. The issue on test collection
diagnosis will remain open. There is a long way to go. Future-proofing
test collections is a hard problem, making all the harder by new large-scale
collections that discourage computationally intensive but possibly revolutionary
results. New collections for novel tasks and media domains are also at
risk of poor reliability due to immature participating systems. Crowd sourcing
and other methods for compiling inexpensive relevance judgments force us
to consider the quality of that data. A strong suite of tools develop is
needed to build to measure the quality of test collections and improve
the reliability of modern, Canfield-style experiments. Much of what is
known about building test collections comes from evaluation forums where
the system outputs from multiple research groups and many systems are combined
using pooling or some close equivalent. However, small research groups
as well as companies need purpose built test collections to measure search
quality, and in those contexts the diversity and richness of an evaluation
forum is impossible to achieve. Future work includes studying this area and understanding how these practices can be translated to those communities.
Time: 16:20-16:50
Speaker: Tetsuya Sakai
Title: Simple Evaluation Metrics for Diversified Search Results
Summary:
Traditional information retrieval research has mostly focused on satisfying
clearly specified information needs. However, in reality, queries are often
ambiguous and/or underspecified. In light of this, evaluating search result
diversity is beginning to receive attention. They propose simple evaluation
metrics for diversified Web search results. Their presumptions are that
one or more interpretations are possible for each given query, and that
graded relevance assessments are available for intent-document pairs as
opposed to query-document pairs. Their goals are (a) to retrieve documents
that cover as many intents as possible; and (b) to rank documents that
are highly relevant to more popular intents higher than those that are
marginally relevant to less popular intents. Unlike the Intent-Aware (IA)
metrics, their metrics successfully avoid ignoring minor intents. Unlike
α-nDCG, their metrics can accommodate (i) which intents are more likely
than others for a given query; and (ii) graded relevance within each intent.
Furthermore, unlike these existing metrics, their metrics do not require
approximation, and they range between 0 and 1. Experiments with the binary-relevance
Diversity Task data from the TREC 2009 Web Track suggest that their metrics
correlate well with existing metrics but can be more intuitive. Hence,
they argue that their metrics are suitable for diversity evaluation given
either the intent likelihood information or per-intent graded relevance,
or preferably both.
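One way to realize a metric in the spirit described above is to collapse per-intent graded relevance into a single "global gain" per document, weighted by the estimated intent probabilities, and then compute nDCG over those combined gains so the score stays within [0, 1]. The Python sketch below follows that idea; the gain combination, log2 discount, cutoff, and names are assumptions for illustration rather than the authors' exact metric definitions.

import math

def global_gain(doc, intent_probs, per_intent_gain):
    """Combined gain of a document: sum over intents of P(intent) * graded gain."""
    return sum(p * per_intent_gain.get((intent, doc), 0)
               for intent, p in intent_probs.items())

def diversity_ndcg(ranking, all_docs, intent_probs, per_intent_gain, k=10):
    """nDCG over intent-weighted global gains, so the score lies in [0, 1]."""
    gains = [global_gain(d, intent_probs, per_intent_gain) for d in ranking[:k]]
    ideal = sorted((global_gain(d, intent_probs, per_intent_gain) for d in all_docs),
                   reverse=True)[:k]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

Because minor intents keep a nonzero weight in the combined gain, they are never ignored outright, while more popular intents still contribute more to the score.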
Time: 16:45-17:15
Speaker: Ruihua Song
Title: Constructing a Test Collection with Multi-Intent Queries
Summary:
Users often issue vague queries; when their intents cannot be predicted
precisely, a natural solution is to diversify the search results, hoping
that some of the results correspond to the intent; this is usually called "result diversification". Only a few studies have been conducted to systematically
evaluate approaches on result diversity. Some questions still remain unanswered:
1) As all intents cannot exhaustively be listed in an evaluation, how does
an incomplete intent set influence evaluation results? 2) Intents are not
equally popular; so how can the probability of each intent be estimated?
They address these questions in building up a test collection for multi-intent
queries. The labeling tool that they have developed allows assessors to
add new intents while performing relevance assessments. Thus, it is possible
to investigate the influence of an incomplete intent set through experiments.
Moreover, they propose two simple methods to estimate the probabilities
of the underlying intents. Experimental results indicate that the evaluation results differ when these probabilities are taken into consideration.
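As a hypothetical illustration of how intent probabilities might be estimated, the Python sketch below simply normalizes, per query, the number of relevance labels associated with each intent. This is an assumed method for illustration only and is not necessarily either of the two estimators proposed in the paper.

from collections import Counter

def estimate_intent_probs(intent_labels):
    """intent_labels: one intent ID per observed (document, intent) relevance
    label for the query. Returns a normalized probability for each intent."""
    counts = Counter(intent_labels)
    total = sum(counts.values())
    return {intent: n / total for intent, n in counts.items()}

# Example: six labels for a "map" intent, three for "lyrics", one for "movie".
print(estimate_intent_probs(["map"] * 6 + ["lyrics"] * 3 + ["movie"]))
# -> {'map': 0.6, 'lyrics': 0.3, 'movie': 0.1}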
Time: 17:15-17:45
Speaker: Katrin Lamm
Title: The Influence of Expectation and System Performance on User Satisfaction with Retrieval Systems
Summary:
Correlations between information retrieval system performance and user
satisfaction are an important research topic. The expectation of users
is a factor in most models of customer satisfaction in marketing research;
however, it has not been used in experiments with information retrieval
systems so far. In an interdisciplinary effort between information retrieval
and psychology they developed an experimental design which uses the so-called
confirmation/disconfirmation paradigm (C/D-paradigm) as a theoretical framework.
This paradigm predicts that the satisfaction of users is strongly governed
by their expectations towards a product or a system. They report a study
with 89 participants in which two levels of both system performance and
user expectation were tested. The results show that user expectation has
an effect on the satisfaction as predicted by the C/D-paradigm. In addition,
they confirmed previous studies which hint that system performance correlates
with user satisfaction. The experiment also revealed that users significantly
relax their relevance criteria and compensate for low system performance.
Time: 17:45-18:15
Speaker: Keun Chan Park
Title: A Game-based Evaluation Method for Subjective Tasks Involving Text Content Analysis
Summary:
Standard test collections have remarkably supported the growth of research
fields by allowing direct comparisons among algorithms. However, test collections only exist in popular research areas. Moreover, constructing a test collection requires a huge amount of time, cost, and human labor. On that account, many research fields, including newly emerging areas, evaluate their results with manually constructed test sets. However, such test sets are unreliable because they often use a small number of raters, and they are even more unreliable when the task is subjective. They define a subjective task as a task where the judgment may differ between individuals due to various aspects such as preference and interest, while still preserving a sense of commonality. They address the problem of evaluating subjective tasks using a computer game: playing the game performs the subjective task as a side effect, and utilizing the piles of game results leads to an objective evaluation. Their result significantly outperforms the baseline in terms of efficiency and shows that evaluating through their approach is nearly the same as evaluating with a gold standard.
by Jian Zhang
Last updated: July 08, 2010