Proceedings of the Seventh International Workshop on Evaluating Information Access (EVIA 2016), a Satellite Workshop of the NTCIR-12 Conference, June 7, 2016, Tokyo, Japan
Abstracts
-
An Easter Egg Hunting Approach to Test Collection Building in Dynamic Domains
Seyyed Hadi Hashemi, Charles L. A. Clarke, Adriel Dean-Hall, Jaap Kamps and Julia Kiseleva
Test collections for online evaluation remain crucial for
information retrieval research and industrial practice, yet
the classical Sparck Jones and Van Rijsbergen approach to
test collection building, based on pooling runs over a large
collection, is expensive and is being pushed beyond its
limits by the ever-increasing size and dynamic nature of the
collections. We experiment with a novel approach to reusable
test collection building in which we inject judged pages
into an existing corpus and have systems retrieve pages from
the extended corpus, with the aim of creating a reusable
test collection. Metaphorically, we hide Easter eggs for the
systems to retrieve. Our experiments exploit the unique
setup of the TREC Contextual Suggestion Track, which allowed
submissions both from a fixed corpus (ClueWeb12) and from
the open web. We conduct an extensive analysis of the
reusability of the test collection based on ClueWeb12 and
find it too low for reliable online testing. We then detail
the expansion of the corpus with judged pages from the open
web, analyse the reusability of the resulting expanded test
collection, and observe a dramatic increase in reusability.
Our approach offers novel and cost-effective ways to build
new test collections and to refresh and update existing
ones, opening up new ways of effectively maintaining online
test collections for dynamic domains such as the web.
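As a rough illustration of the kind of reusability analysis
involved, the Python sketch below checks what fraction of a
held-out system's top-ranked pages are covered by the judged
pool; the function and run names are hypothetical, and the
paper's actual analysis is far more detailed.

    def judged_coverage(ranked_lists: dict[str, list[str]],
                        judged_ids: set[str],
                        depth: int = 10) -> dict[str, float]:
        """Fraction of each system's top-`depth` results that are judged.

        A simple proxy for reusability: if a held-out system mostly
        retrieves unjudged pages, relevance-based scores for it are
        unreliable. (Illustrative only, not the paper's full analysis.)
        """
        return {
            system: sum(1 for doc in results[:depth] if doc in judged_ids) / depth
            for system, results in ranked_lists.items()
        }

    # Toy runs retrieving from the (hypothetically) expanded corpus.
    runs = {
        "sysA": ["d1", "d7", "d3", "d9", "d2", "d5", "d8", "d4", "d6", "d0"],
        "sysB": ["x1", "d1", "x2", "d2", "x3", "x4", "x5", "x6", "x7", "x8"],
    }
    print(judged_coverage(runs, judged_ids={"d1", "d2", "d3", "d4", "d5"}))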
-
On Estimating Variances for Topic Set Size Design
Tetsuya Sakai and Lifeng Shang
Topic set size design is a suite of statistical techniques for determining
the appropriate number of topics when constructing a new
test collection. One vital input required for these techniques is an
estimate of the population variance of a given evaluation measure,
which in turn requires a topic-by-run score matrix. Hence, to build
a new test collection, a pilot data set is a prerequisite. Recently,
we ran an IR task at NTCIR-12 where the number of topics was
actually determined using topic set size design with an initial pilot
data set based on only five similar runs; a test collection was then
constructed accordingly by pooling 44 runs from 16 participating
teams for 100 topics. In this study, we treat the new test collection
and its associated runs as a more reliable pilot data set to investigate
how many teams and topics are actually necessary in the pilot
data for obtaining accurate variance estimates.
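As a sketch of the kind of variance estimate involved, the
snippet below computes the residual variance of a two-way
ANOVA without replication from a topic-by-run score matrix;
this is one common estimator in topic set size design, and
the exact estimators examined in the paper may differ.

    import numpy as np

    def residual_variance(scores: np.ndarray) -> float:
        """Estimate the score variance from a topic-by-run matrix.

        scores: (n_topics, n_runs) matrix of evaluation-measure scores.
        Returns the residual variance of a two-way ANOVA without
        replication, one common population-variance estimate used in
        topic set size design (an assumption for this illustration).
        """
        n_topics, n_runs = scores.shape
        topic_means = scores.mean(axis=1, keepdims=True)  # per-topic means
        run_means = scores.mean(axis=0, keepdims=True)    # per-run means
        grand_mean = scores.mean()
        residuals = scores - topic_means - run_means + grand_mean
        return float((residuals ** 2).sum() / ((n_topics - 1) * (n_runs - 1)))

    # Example: a pilot matrix of 100 topics by 44 runs with random scores.
    rng = np.random.default_rng(0)
    pilot = rng.uniform(0.0, 1.0, size=(100, 44))
    print(residual_variance(pilot))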
-
A Laboratory-Based Method for the Evaluation of Personalised Search
Camilla Sanvitto, Debasis Ganguly, Gareth J. F. Jones and Gabriella Pasi
Comparative evaluation of Information Retrieval Systems
(IRSs) using publicly available test collections has become
an established practice in Information Retrieval (IR). By
means of the popular Cranfield evaluation paradigm, IR test
collections enable researchers to compare new methods to
existing approaches. An important area of IR research to
which this strategy has not been applied to date is
Personalised Information Retrieval (PIR), which has
generally relied on user-based evaluations. This paper
describes a method that enables the creation of publicly
available extended test collections to allow repeatable
laboratory-based evaluation of personalised search.
-
Promoting Repeatability Through Open Runs
Ellen Voorhees, Shahzad Rajput and Ian Soboroff
The 2015 Text REtrieval Conference (TREC) introduced the
concept of 'Open Runs' in response to the increasing focus on
repeatability of information retrieval experiments. An Open
Run is a TREC submission backed by a software repository
such that the software in the repository reproduces the system
that created that exact run. The ID of the repository
was captured during the process of submitting the run and
published as part of the metadata describing the run in the
TREC proceedings. Submitting a run as an Open Run was
optional: either a repository ID was provided at submission
time or it was not, and further processing of the run was
identical in either case. Unfortunately, this initial offering
was not successful. While a healthy 79 runs were submitted
as Open Runs, we could not in fact reproduce any of them.
This paper explores possible reasons for the difficulties and
makes suggestions for how to address the deficiencies so as
to strengthen the Open Run program for TREC 2016.
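As a small illustration of what "reproduces that exact run"
means in practice, the sketch below checks that a run file
regenerated from a repository is byte-identical to the one
that was submitted; the file names are hypothetical, and
TREC's verification of course also involves rebuilding and
rerunning the system from the repository first.

    import hashlib
    from pathlib import Path

    def file_digest(path: Path) -> str:
        """SHA-256 digest of a file's raw bytes."""
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def same_run(submitted: Path, regenerated: Path) -> bool:
        """True if the regenerated run file is byte-identical to the
        submitted one. A gentler check could instead compare the parsed
        (topic, rank, docid) tuples to tolerate formatting differences."""
        return file_digest(submitted) == file_digest(regenerated)

    # Hypothetical file names for a submitted run and its re-creation.
    print(same_run(Path("submitted.run"), Path("regenerated.run")))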
-
Evaluating Search Among Secrets
Douglas W. Oard, Katie Shilton and Jimmy Lin
Today's search engines are designed with a single fundamental goal: to help us find the things we want to see. Paradoxically, the very fact that they do this well means that there are many collections that we are not allowed to search. Citizens are not allowed to search some government records because there may be intermixed information that needs to be protected. Scholars are not yet allowed to see much of the growing backlog of unprocessed archival collections for similar reasons. These limitations, and many more, are direct consequences of the fact that today's search engines are not designed to protect sensitive information. We need to change that by creating a new class of search algorithms designed to effectively search among secrets by balancing the user's interest in finding relevant content with the provider's interest in protecting sensitive content. This paper describes some first thoughts on evaluation for that task.
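To make the balancing act concrete, the toy scoring sketch
below rewards retrieving relevant documents while penalising
any sensitive documents exposed in the same result list; the
gain and penalty weights are purely illustrative assumptions,
not the measures proposed in the paper.

    def protection_aware_score(ranking: list[str],
                               relevant: set[str],
                               sensitive: set[str],
                               depth: int = 10,
                               penalty: float = 2.0) -> float:
        """Toy utility for evaluating search among secrets.

        Each relevant document in the top-`depth` results earns one
        point of gain; each sensitive document leaked into the same
        window costs `penalty` points. The trade-off weight is an
        assumption made only for illustration.
        """
        top = ranking[:depth]
        gain = sum(1.0 for doc in top if doc in relevant)
        leak = sum(1.0 for doc in top if doc in sensitive)
        return gain - penalty * leak

    # A toy ranking that leaks one sensitive document ("s1").
    print(protection_aware_score(["d1", "d2", "s1", "d3"],
                                 relevant={"d1", "d3"}, sensitive={"s1"}))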
-
Automatic Sentence Ordering Assessment Based on Similarity
Liana Ermakova
Sentence ordering is one of the core tasks in text
generation, since it is crucial for readability.
Nevertheless, there is no common approach to evaluating
sentence ordering. State-of-the-art methods are based on
comparison with a human-provided order; however, in many
cases this is impossible, or too time- and
resource-consuming. We therefore propose three fully
automatic approaches to sentence order assessment, in which
the similarity between adjacent sentences is used as a
measure of text coherence. We show that the methods based
on word and noun similarities have very high agreement with
human-provided judgments. We also propose an automatic
evaluation framework for analysing sentence ordering metrics
that requires only a text collection.
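As a minimal sketch of the general idea, the snippet below
scores a text by the average similarity of adjacent
sentences, here using plain bag-of-words cosine similarity;
the paper's word- and noun-based similarities are assumed to
be more refined than this.

    import math
    import re
    from collections import Counter

    def sentence_vector(sentence: str) -> Counter:
        """Bag-of-words term-frequency vector (lowercased tokens)."""
        return Counter(re.findall(r"\w+", sentence.lower()))

    def cosine(a: Counter, b: Counter) -> float:
        """Cosine similarity between two sparse term-frequency vectors."""
        dot = sum(a[t] * b[t] for t in set(a) & set(b))
        norm = math.sqrt(sum(v * v for v in a.values())) \
             * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def adjacency_coherence(sentences: list[str]) -> float:
        """Mean similarity of adjacent sentence pairs as a coherence score."""
        sims = [cosine(sentence_vector(s1), sentence_vector(s2))
                for s1, s2 in zip(sentences, sentences[1:])]
        return sum(sims) / len(sims) if sims else 0.0

    text = ["The cat sat on the mat.", "The mat was red.",
            "Red is a warm colour."]
    print(adjacency_coherence(text))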
-
Two-layered Summaries for Mobile Search: Does the Evaluation Measure Reflect User Preferences?
Makoto P. Kato, Virgil Pavlu, Tetsuya Sakai, Takehiro Yamamoto and Hajime Morita
This paper addresses two-layered summarization for mobile
search and proposes an evaluation framework for such
summaries. A single summary is not always satisfactory for
the full variety of users with different intents, and mobile
devices impose hard constraints on the summary format. In a
two-layered summary, the first layer contains generally
useful information, while the second layer contains
information of interest to different types of users. Since
users with different interests can follow their own reading
paths, skipping parts of the second layer, they can find
their desired information more efficiently than if all
layers were presented as a single block of text. Our
proposed evaluation metric, M-measure, takes into account
all the possible reading paths through a two-layered summary
and is defined as the expected utility of these paths. Our
user study compared M-measure with pairwise user preferences
on two-layered summaries, and found that M-measure agrees
with the user preferences on more than 70% of the summary
pairs.
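As a toy sketch of the expected-utility idea behind
M-measure, the snippet below enumerates the reading paths
through a tiny two-layered summary and averages their
position-discounted gains; the unit names, gain values, and
the simple decay are all assumptions made for illustration,
not the metric's actual definition.

    from itertools import product

    # A toy two-layered summary: each first-layer unit has a gain and,
    # optionally, a second-layer expansion with its own gain.
    first_layer = [("overview", 1.0), ("access", 0.5), ("fees", 0.5)]
    second_layer = {"access": ("directions", 1.0), "fees": ("discounts", 0.8)}

    def path_utility(expanded: tuple[bool, ...], decay: float = 0.9) -> float:
        """Utility of one reading path: gains discounted by position.

        expanded[i] says whether the reader opens the second layer under
        the i-th first-layer unit. The positional decay is an assumption;
        the real M-measure models reading effort differently.
        """
        utility, position = 0.0, 0
        for (unit, gain), opened in zip(first_layer, expanded):
            utility += (decay ** position) * gain
            position += 1
            if opened and unit in second_layer:
                _, sub_gain = second_layer[unit]
                utility += (decay ** position) * sub_gain
                position += 1
        return utility

    # Expected utility over all reading paths, assumed equally likely.
    paths = list(product([False, True], repeat=len(first_layer)))
    print(round(sum(path_utility(p) for p in paths) / len(paths), 3))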