Proceedings of the Seventh International Workshop on Evaluating Information Access (EVIA 2016), a Satellite Workshop of the NTCIR-12 Conference, June 7, 2016, Tokyo, Japan
Abstracts
-
An Easter Egg Hunting Approach to Test Collection Building in Dynamic Domains
Seyyed Hadi Hashemi, Charles L. A. Clarke, Adriel Dean-Hall, Jaap Kamps and Julia Kiseleva
Test collections for online evaluation remain crucial for
information retrieval research and industrial practice, yet
the classical Sparck Jones and Van Rijsbergen approach to
test collection building, based on pooling runs over a large
collection, is expensive and is being pushed beyond its
limits by the ever-increasing size and dynamic nature of the
collections. We experiment with a novel approach to reusable
test collection building in which we inject judged pages
into an existing corpus and have systems retrieve pages from
the extended corpus, with the aim of creating a reusable
test collection. Metaphorically, we hide Easter eggs for the
systems to retrieve. Our experiments exploit the unique
setup of the TREC Contextual Suggestion Track, which allowed
submissions both from a fixed corpus (ClueWeb12) and from
the open web. We conduct an extensive analysis of the
reusability of the test collection based on ClueWeb12 and
find it too low for reliable online testing. We then detail
the expansion of the corpus with judged pages from the open
web, analyse the reusability of the resulting expanded test
collection, and observe a dramatic increase in reusability.
Our approach offers novel and cost-effective ways to build
new test collections and to refresh and update existing
ones, opening up new ways of effectively maintaining online
test collections for dynamic domains such as the web.
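As a rough illustration of the kind of reusability analysis
involved, the Python sketch below checks what fraction of a
held-out system's top-ranked pages are covered by the judged
pool; the function and run names are hypothetical, and the
paper's actual analysis is far more detailed.

    def judged_coverage(ranked_lists: dict[str, list[str]],
                        judged_ids: set[str],
                        depth: int = 10) -> dict[str, float]:
        """Fraction of each system's top-`depth` results that are judged.

        A simple proxy for reusability: if a held-out system mostly
        retrieves unjudged pages, relevance-based scores for it are
        unreliable. (Illustrative only, not the paper's full analysis.)
        """
        return {
            system: sum(1 for doc in results[:depth] if doc in judged_ids) / depth
            for system, results in ranked_lists.items()
        }

    # Toy runs retrieving from the (hypothetically) expanded corpus.
    runs = {
        "sysA": ["d1", "d7", "d3", "d9", "d2", "d5", "d8", "d4", "d6", "d0"],
        "sysB": ["x1", "d1", "x2", "d2", "x3", "x4", "x5", "x6", "x7", "x8"],
    }
    print(judged_coverage(runs, judged_ids={"d1", "d2", "d3", "d4", "d5"}))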
-
On Estimating Variances for Topic Set Size Design
Tetsuya Sakai and Lifeng Shang
Topic set size design is a suite of statistical techniques for determining
the appropriate number of topics when constructing a new
test collection. One vital input required for these techniques is an
estimate of the population variance of a given evaluation measure,
which in turn requires a topic-by-run score matrix. Hence, to build
a new test collection, a pilot data set is a prerequisite. Recently,
we ran an IR task at NTCIR-12 where the number of topics was
actually determined using topic set size design with an initial pilot
data set based on only five similar runs; a test collection was then
constructed accordingly by pooling 44 runs from 16 participating
teams for 100 topics. In this study, we treat the new test collection
and its associated runs as a more reliable pilot data set to investigate
how many teams and topics are actually necessary in the pilot
data for obtaining accurate variance estimates.
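As a sketch of the kind of variance estimate involved, the
snippet below computes the residual variance of a two-way
ANOVA without replication from a topic-by-run score matrix;
this is one common estimator in topic set size design, and
the exact estimators examined in the paper may differ.

    import numpy as np

    def residual_variance(scores: np.ndarray) -> float:
        """Estimate the score variance from a topic-by-run matrix.

        scores: (n_topics, n_runs) matrix of evaluation-measure scores.
        Returns the residual variance of a two-way ANOVA without
        replication, one common population-variance estimate used in
        topic set size design (an assumption for this illustration).
        """
        n_topics, n_runs = scores.shape
        topic_means = scores.mean(axis=1, keepdims=True)  # per-topic means
        run_means = scores.mean(axis=0, keepdims=True)    # per-run means
        grand_mean = scores.mean()
        residuals = scores - topic_means - run_means + grand_mean
        return float((residuals ** 2).sum() / ((n_topics - 1) * (n_runs - 1)))

    # Example: a pilot matrix of 100 topics by 44 runs with random scores.
    rng = np.random.default_rng(0)
    pilot = rng.uniform(0.0, 1.0, size=(100, 44))
    print(residual_variance(pilot))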
-
A Laboratory-Based Method for the Evaluation of Personalised Search
Camilla Sanvitto, Debasis Ganguly, Gareth J. F. Jones and Gabriella Pasi
Comparative evaluation of Information Retrieval Systems
(IRSs) using publicly available test collections has become
an established practice in Information Retrieval (IR). By
means of the popular Cranfield evaluation paradigm, IR test
collections enable researchers to compare new methods to
existing approaches. An important area of IR research to
which this strategy has not been applied to date is
Personalised Information Retrieval (PIR), which has
generally relied on user-based evaluations. This paper
describes a method that enables the creation of publicly
available extended test collections to allow repeatable
laboratory-based evaluation of personalised search.
-
Promoting Repeatability Through Open Runs
Ellen Voorhees, Shahzad Rajput and Ian Soboroff
The 2015 Text REtrieval Conference (TREC) introduced the
concept of 'Open Runs' in response to the increasing focus on
repeatability of information retrieval experiments. An Open
Run is a TREC submission backed by a software repository
such that the software in the repository reproduces the system
that created that exact run. The ID of the repository
was captured during the process of submitting the run and
published as part of the metadata describing the run in the
TREC proceedings. Submitting a run as an Open Run was
optional: either a repository ID was provided at submission
time or it was not, and further processing of the run was
identical in either case. Unfortunately, this initial offering
was not successful. While a healthy 79 runs were submitted
as Open Runs, we could not in fact reproduce any of them.
This paper explores possible reasons for the difficulties and
makes suggestions for how to address the deficiencies so as
to strengthen the Open Run program for TREC 2016.
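As a small illustration of what "reproduces that exact run"
means in practice, the sketch below checks that a run file
regenerated from a repository is byte-identical to the one
that was submitted; the file names are hypothetical, and
TREC's verification of course also involves rebuilding and
rerunning the system from the repository first.

    import hashlib
    from pathlib import Path

    def file_digest(path: Path) -> str:
        """SHA-256 digest of a file's raw bytes."""
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def same_run(submitted: Path, regenerated: Path) -> bool:
        """True if the regenerated run file is byte-identical to the
        submitted one. A gentler check could instead compare the parsed
        (topic, rank, docid) tuples to tolerate formatting differences."""
        return file_digest(submitted) == file_digest(regenerated)

    # Hypothetical file names for a submitted run and its re-creation.
    print(same_run(Path("submitted.run"), Path("regenerated.run")))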
-
Evaluating Search Among Secrets
Douglas W. Oard, Katie Shilton and Jimmy Lin
Today's search engines are designed with a single fundamental goal: to help us find the things we want to see. Paradoxically, the very fact that they do this well means that there are many collections that we are not allowed to search. Citizens are not allowed to search some government records because there may be intermixed information that needs to be protected. Scholars are not yet allowed to see much of the growing backlog of unprocessed archival collections for similar reasons. These limitations, and many more, are direct consequences of the fact that today's search engines are not designed to protect sensitive information. We need to change that by creating a new class of search algorithms designed to effectively search among secrets by balancing the user's interest in finding relevant content with the provider's interest in protecting sensitive content. This paper describes some first thoughts on evaluation for that task.
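To make the balancing act concrete, the toy scoring sketch
below rewards retrieving relevant documents while penalising
any sensitive documents exposed in the same result list; the
gain and penalty weights are purely illustrative assumptions,
not the measures proposed in the paper.

    def protection_aware_score(ranking: list[str],
                               relevant: set[str],
                               sensitive: set[str],
                               depth: int = 10,
                               penalty: float = 2.0) -> float:
        """Toy utility for evaluating search among secrets.

        Each relevant document in the top-`depth` results earns one
        point of gain; each sensitive document leaked into the same
        window costs `penalty` points. The trade-off weight is an
        assumption made only for illustration.
        """
        top = ranking[:depth]
        gain = sum(1.0 for doc in top if doc in relevant)
        leak = sum(1.0 for doc in top if doc in sensitive)
        return gain - penalty * leak

    # A toy ranking that leaks one sensitive document ("s1").
    print(protection_aware_score(["d1", "d2", "s1", "d3"],
                                 relevant={"d1", "d3"}, sensitive={"s1"}))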
-
Automatic Sentence Ordering Assessment Based on Similarity
Liana Ermakova
Sentence ordering is one of the core tasks in text
generation, since it is crucial for readability.
Nevertheless, there is no common approach to evaluating
sentence ordering. State-of-the-art methods are based on
comparison with a human-provided order; however, in many
cases this is impossible, or too time- and
resource-consuming. We therefore propose three fully
automatic approaches to sentence order assessment, in which
the similarity between adjacent sentences is used as a
measure of text coherence. We show that the methods based
on word and noun similarities have very high agreement with
human-provided judgments. We also propose an automatic
evaluation framework for analysing sentence ordering metrics
that requires only a text collection.
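As a minimal sketch of the general idea, the snippet below
scores a text by the average similarity of adjacent
sentences, here using plain bag-of-words cosine similarity;
the paper's word- and noun-based similarities are assumed to
be more refined than this.

    import math
    import re
    from collections import Counter

    def sentence_vector(sentence: str) -> Counter:
        """Bag-of-words term-frequency vector (lowercased tokens)."""
        return Counter(re.findall(r"\w+", sentence.lower()))

    def cosine(a: Counter, b: Counter) -> float:
        """Cosine similarity between two sparse term-frequency vectors."""
        dot = sum(a[t] * b[t] for t in set(a) & set(b))
        norm = math.sqrt(sum(v * v for v in a.values())) \
             * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def adjacency_coherence(sentences: list[str]) -> float:
        """Mean similarity of adjacent sentence pairs as a coherence score."""
        sims = [cosine(sentence_vector(s1), sentence_vector(s2))
                for s1, s2 in zip(sentences, sentences[1:])]
        return sum(sims) / len(sims) if sims else 0.0

    text = ["The cat sat on the mat.", "The mat was red.",
            "Red is a warm colour."]
    print(adjacency_coherence(text))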
-
Two-layered Summaries for Mobile Search: Does the Evaluation Measure Reflect User Preferences?
Makoto P. Kato, Virgil Pavlu, Tetsuya Sakai, Takehiro Yamamoto and Hajime Morita
This paper addresses two-layered summarization for mobile
search and proposes an evaluation framework for such
summaries. A single summary is not always satisfactory for
the full variety of users with different intents, and mobile
devices impose hard constraints on the summary format. In a
two-layered summary, the first layer contains generally
useful information, while the second layer contains
information of interest to different types of users. Since
users with different interests can follow their own reading
paths, skipping parts of the second layer, they can find
their desired information more efficiently than if all
layers were presented as a single block of text. Our
proposed evaluation metric, M-measure, takes into account
all the possible reading paths through a two-layered summary
and is defined as the expected utility of these paths. Our
user study compared M-measure with pairwise user preferences
on two-layered summaries, and found that M-measure agrees
with the user preferences on more than 70% of the summary
pairs.
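As a toy sketch of the expected-utility idea behind
M-measure, the snippet below enumerates the reading paths
through a tiny two-layered summary and averages their
position-discounted gains; the unit names, gain values, and
the simple decay are all assumptions made for illustration,
not the metric's actual definition.

    from itertools import product

    # A toy two-layered summary: each first-layer unit has a gain and,
    # optionally, a second-layer expansion with its own gain.
    first_layer = [("overview", 1.0), ("access", 0.5), ("fees", 0.5)]
    second_layer = {"access": ("directions", 1.0), "fees": ("discounts", 0.8)}

    def path_utility(expanded: tuple[bool, ...], decay: float = 0.9) -> float:
        """Utility of one reading path: gains discounted by position.

        expanded[i] says whether the reader opens the second layer under
        the i-th first-layer unit. The positional decay is an assumption;
        the real M-measure models reading effort differently.
        """
        utility, position = 0.0, 0
        for (unit, gain), opened in zip(first_layer, expanded):
            utility += (decay ** position) * gain
            position += 1
            if opened and unit in second_layer:
                _, sub_gain = second_layer[unit]
                utility += (decay ** position) * sub_gain
                position += 1
        return utility

    # Expected utility over all reading paths, assumed equally likely.
    paths = list(product([False, True], repeat=len(first_layer)))
    print(round(sum(path_utility(p) for p in paths) / len(paths), 3))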