EVIA2013 Abstracts
-
Preface
William Webber and Ruihua Song
-
The Unreusability of Diversified Search Test Collections
Tetsuya Sakai
Traditional "ad hoc" test collections, typically built based on depth-100 pools, are often used a posteriori by non-contributors, i.e., research groups that did not contribute to the pools. The Leave One Out (LOO) test is useful for testing whether such test collections are actually reusable: that is, whether the non-contributors can be evaluated fairly relative to the contributors' official performances. In contrast, at the recent web search result diversification tasks of TREC and NTCIR, diversity test collections have been built using shallow pools: the pool depths lie between 20 and 40. Thus it is unlikely that these diversity test collections are reusable: in fact, the organisers of these diversity tasks never claimed that they are. Nevertheless, these collections are also used a posteriori by non-contributors. In light of this, Anonymous et al. demonstrated by means of LOO tests that the NTCIR-9 INTENT-1 Chinese diversity test collection is not reusable, and also showed that condensed-list evaluation metrics generally provide better estimates of the non-contributors' true performances than raw evaluation metrics. This paper generalises and strengthens their findings through LOO tests with the latest TREC 2012 diversity test collection.
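To make the condensed-list idea concrete, here is a minimal sketch; it is not the paper's code. It assumes binary relevance and uses average precision as the metric, with an invented run and invented judgments. A condensed list simply drops unjudged documents from a ranking before scoring, so a non-contributor is not penalised for retrieving documents that were never pooled.

```python
# Illustrative sketch of condensed-list evaluation (assumed, not the
# author's implementation). Unjudged documents are removed from the
# ranking before scoring.

def average_precision(ranking, qrels):
    """Binary average precision over a ranked list of doc ids."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if qrels.get(doc, 0) > 0:
            hits += 1
            score += hits / i
    n_rel = sum(1 for r in qrels.values() if r > 0)
    return score / n_rel if n_rel else 0.0

def condensed(ranking, qrels):
    """Keep only documents that have a judgment of any kind."""
    return [doc for doc in ranking if doc in qrels]

qrels = {"d1": 1, "d2": 0, "d4": 1}   # d3, d5 were never judged
run = ["d3", "d1", "d5", "d2", "d4"]  # a hypothetical non-contributor run
print(average_precision(run, qrels))                     # raw AP: 0.45
print(average_precision(condensed(run, qrels), qrels))   # condensed AP: ~0.83
```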
-
A Subtopic Taxonomy-Aware Framework for Diversity Evaluation
Fei Chen, Yiqun Liu, Min Zhang, Shaoping Ma and Lei Chen
To evaluate search result diversification, which is supposed to satisfy the different needs behind a single query, a number of evaluation frameworks have been proposed and adopted by benchmarks such as TREC and NTCIR. These frameworks usually do not consider subtopic taxonomy information. Much previous work on document ranking has shown that different kinds of information needs require different ranking strategies to be satisfied. It is thus necessary to incorporate the subtopic taxonomy into the evaluation framework for search result diversification. In this paper, we propose a novel framework, the Subtopic Taxonomy-Aware (STA) framework, to redefine the existing measures. Measures in this new framework take subtopic taxonomy information into consideration for diversity evaluation. Furthermore, finding the optimal diversified results for many measures has been proven to be an NP-hard problem; we therefore also propose a pruning algorithm that reduces this problem to a tractable search. Experiments based on both the TREC and NTCIR test collections show the effectiveness of our proposed framework.
-
Nugget-Based Computation of Graded Relevance
Charles L. A. Clarke
We propose a simple method for assigning graded relevance values to documents judged during the course of a retrieval experiment. In making this proposal, we aim to avoid the potential for ambiguity and greater cognitive load associated with standard graded relevance judgments. Under our proposal, we first decompose a retrieval topic into a number of informational nuggets. For each document, a binary judgment is made with respect to each nugget. The ratio of relevant nuggets to total nuggets becomes the graded relevance value assigned to that document. To provide support for this idea, we turn to test collections created for the TREC Web Track. Along with the usual graded relevance judgments required by traditional effectiveness measures, these test collections include topic decompositions created for the purpose of evaluating novelty and diversity. By exploiting these test collections for our own purposes, we demonstrate a clear relationship between our proposed method and traditional graded relevance. In addition to supporting our proposal, our experiments suggest that informational nuggets can provide a unified approach to relevance assessment, supporting both traditional effectiveness measures and newer measures of novelty and diversity.
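As a concrete illustration of the grading rule stated above (the ratio of relevant nuggets to total nuggets), here is a minimal sketch; the topic, nugget counts, and judgments are all invented for the example.

```python
# Illustrative sketch of nugget-based graded relevance: one binary
# judgment per nugget, and the document's grade is the fraction of
# nuggets judged relevant. Data below is hypothetical.

def nugget_grade(judgments):
    """judgments: list of 0/1 flags, one per nugget of the topic."""
    return sum(judgments) / len(judgments) if judgments else 0.0

# A hypothetical topic decomposed into four informational nuggets.
doc_judgments = {
    "docA": [1, 1, 0, 1],   # covers 3 of 4 nuggets -> grade 0.75
    "docB": [0, 1, 0, 0],   # covers 1 of 4 nuggets -> grade 0.25
    "docC": [0, 0, 0, 0],   # covers no nugget      -> grade 0.0
}
grades = {doc: nugget_grade(j) for doc, j in doc_judgments.items()}
print(grades)
```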
-
User Perception of Search Task Differentiation
Sargol Sadeghi, Mark Sanderson and Falk Scholer
This paper examines a new approach to making the evaluation of personal search systems more feasible. Comparability and diverse coverage of personal search tasks are two main issues in evaluating these systems. To address these issues, the proposed approach relies on identifying the differences between search tasks. An experiment was conducted to measure user perceptions of such differences across pairs of typical search tasks, grouped by an underlying feature. A range of features was found to influence user perceptions of task differences. This new knowledge can be used to identify similar and different tasks, which in turn facilitates comparability and diverse coverage of varied personal tasks for system evaluation.
-
Extrinsic Evaluation of Patent MT: Review and Commentary
Douglas W. Oard and Noriko Kando
There has been a long history of work on applying Machine Translation (MT) to support cross-language information access for patent collections. Much of this work has leveraged fairly traditional information retrieval evaluation designs as a basis for extrinsic (i.e., task-based) evaluation, but other evaluation designs are also possible. This survey reviews the work to date on extrinsic evaluation of patent MT in cross-language information access applications, identifies gaps in the literature, and formulates some open research questions.
-
Creation of a New Evaluation Benchmark for Information Retrieval Targeting Patient Information Needs
Lorraine Goeuriot, Liadh Kelly, Gareth J. F. Jones, Guido Zuccon, Hanna Suominen, Allan Hanbury, Henning Müller and Johannes Leveling
Health information is one of the most important subjects that internet users search for online. It is therefore critical to bring users relevant and valuable information. Several medical information retrieval evaluation campaigns have been organized, providing benchmarks that have been widely used to improve medical IR. However, most of these benchmarks focus on specialized information needs, targeting physicians and other medical professionals. In this paper we describe a new IR evaluation collection and a new way to assess patients' information needs: realistic short queries are accompanied by discharge summaries describing the context in which the patient was diagnosed with a given disorder and wrote the query. The collection is part of a three-task CLEF lab called CLEF eHealth and will be used for the first time this year as a benchmark.
-
Biomedical Test Collection with Multiple Query Representation
Borim Ryu and Jinwook Choi
The objective of this study is to validate pseudo gold standards built using multiple queries for a biomedical document collection. An aspect query is a quasi-similar variant of the original query text, such as a synonym; aspect queries were used to build a set of relevance judgments. Four aspect queries per original query were created manually by fifteen college students. By collecting the top-ranked documents from the result sets retrieved for these various queries, an aspect-query-based pseudo gold standard is generated. In order to demonstrate its feasibility, we calculated rank correlations between rankings based on human judgments and those based on the pseudo judgments. Experimental results showed high correlations of up to 0.863 (p<0.01). No difference among the query worker groups was observed, and biomedical background knowledge, which might seem to be a prerequisite for creating query sentences, turned out not to be necessary. According to our experimental study, the method using multiple aspect queries is a viable way to build relevance judgments without human experts.
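A minimal sketch of the pooling step described above, with invented data: the top-k documents retrieved for each aspect query are unioned into a pseudo-relevant set, and Kendall's tau then compares system scores under human versus pseudo judgments. The k value, the system runs, and both judgment sets are illustrative assumptions, not the study's data.

```python
# Illustrative sketch (not the authors' code) of building an
# aspect-query-based pseudo gold standard and checking its agreement
# with human judgments. All data below is invented.
from scipy.stats import kendalltau

def pseudo_qrels(aspect_runs, k=10):
    """Pseudo-relevant set: union of the top-k docs per aspect query."""
    pool = set()
    for run in aspect_runs:
        pool.update(run[:k])
    return pool

def p_at_k(run, relevant, k):
    return sum(1 for d in run[:k] if d in relevant) / k

aspect_runs = [["d1", "d2"], ["d2", "d3"], ["d1", "d4"], ["d2", "d5"]]
human_rel = {"d1", "d2", "d3"}
pseudo_rel = pseudo_qrels(aspect_runs, k=2)

systems = {
    "sysA": ["d1", "d2", "d9"],
    "sysB": ["d9", "d8", "d3"],
    "sysC": ["d4", "d5", "d1"],
}
human_scores = [p_at_k(r, human_rel, 3) for r in systems.values()]
pseudo_scores = [p_at_k(r, pseudo_rel, 3) for r in systems.values()]
tau, p_value = kendalltau(human_scores, pseudo_scores)
print(tau)  # agreement between the two system orderings
```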
-
Evaluating Flowchart Recognition for Patent Retrieval
Mihai Lupu, Florina Piroi and Allan Hanbury
A set of measures for assessing the effectiveness of flowchart recognition methods in the context of patent-related use cases is presented. Two perspectives on the task are envisaged: a traditional use case of re-using bitmap flowcharts, and a search-related use case. A graph topology-based measure is analyzed for the first, and a particular version of precision/recall for the second. We find that the graph-based measure has a higher discriminating power, but comes at higher computational cost than the search-based measures. The evaluation of runs in the absence of ground truth is also investigated, and is found to provide comparable results if runs from the same group are not allowed to unbalance the synthetically generated truth sets.
-
Evaluating Contextual Suggestion
Adriel Dean-Hall, Charles L. A. Clarke, Jaap Kamps and Paul Thomas
As its primary evaluation measure, the TREC 2012 Contextual Suggestion Track used precision@5. Unfortunately, this measure is not ideally suited to the task, which differs from the settings in which precision@5 and similar measures can more readily be used. Track participants returned travel suggestions that included brief descriptions, and the availability of these descriptions allows users to quickly skip suggestions that are not of interest to them. A user's reaction to a suggestion may be negative ('dislike') as well as positive ('like') or neutral, and too many disliked suggestions may cause the user to abandon the results. Neither of these factors is handled appropriately by traditional evaluation methodologies for information retrieval and recommendation. Building on the time-biased gain framework of Smucker and Clarke, which recognizes time as a critical element in user modeling for evaluation, we propose a new evaluation measure that directly accommodates these factors.
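To make the time-biased gain idea concrete, here is a hedged sketch of a measure in that spirit; it is not the measure proposed in the paper. The per-item reading time, the decay half-life, and the per-dislike abandonment probability are all invented parameters.

```python
# Sketch of a time-biased-gain-style measure (assumed, for
# illustration only): gain from liked suggestions is discounted by
# elapsed time, and each dislike raises the chance the user quits.
import math

def tbg_like(judgments, t_item=10.0, half_life=120.0, p_quit_dislike=0.3):
    """judgments: sequence of 'like' / 'neutral' / 'dislike'."""
    t, p_continue, total = 0.0, 1.0, 0.0
    for j in judgments:
        t += t_item                                  # time to read this item
        decay = math.exp(-t * math.log(2) / half_life)
        if j == "like":
            total += p_continue * decay              # gain only if still reading
        elif j == "dislike":
            p_continue *= 1.0 - p_quit_dislike       # dislikes drive abandonment
    return total

print(tbg_like(["like", "dislike", "like", "neutral", "like"]))
```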
-
Evaluation Metrics for Nuclear Forensics Search
Fredric Gey, Charles Wang, Chloe Reynolds, Electra Sutton and Ray Larson
Nuclear forensics search is an emerging subfield of scientific search: nuclear forensics plays an important technical role in international security. Nuclear forensic search is grounded in the science of nuclear isotope decay and the rigor of nuclear engineering. However, two aspects are far from settled. Firstly, what matching formulae should be used to match unknown (e.g. smuggled) nuclear samples against libraries of analyzed nuclear samples of known origin? Secondly, what is the appropriate evaluation measure for assessing the effectiveness of search? Using a database of spent nuclear fuel samples, we formulated a search experiment to try to identify the particular nuclear reactor from which an unknown sample might have come. This paper describes the experiment and also compares alternative evaluation metrics (precision at 1, 5 and 10, and mean reciprocal rank) used to judge search success.
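The metrics compared in the paper are standard rank-based measures; the sketch below shows how they would be computed for this kind of task, with invented sample-to-reactor data.

```python
# Sketch of the evaluation metrics named above, applied to a made-up
# version of the task: each query is an unknown sample, and the single
# relevant item is its true source reactor.

def precision_at_k(ranking, relevant, k):
    return sum(1 for r in ranking[:k] if r in relevant) / k

def reciprocal_rank(ranking, relevant):
    for i, r in enumerate(ranking, start=1):
        if r in relevant:
            return 1.0 / i
    return 0.0

# Two hypothetical unknown samples: (ranked candidate reactors, true origin).
samples = {
    "sample1": (["R7", "R2", "R9", "R4", "R1"], {"R2"}),
    "sample2": (["R3", "R8", "R1", "R5", "R2"], {"R1"}),
}
p1 = sum(precision_at_k(r, rel, 1) for r, rel in samples.values()) / len(samples)
mrr = sum(reciprocal_rank(r, rel) for r, rel in samples.values()) / len(samples)
print(p1)   # mean precision@1 over the samples
print(mrr)  # mean reciprocal rank: (1/2 + 1/3) / 2
```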