NTCIR (NII Testbeds and Community for Information access Research)

EVIA 2014 Abstracts

  • Preface
    Stefano Mizzaro and Ruihua Song
    [Pdf] [Table of Content]

  • Topic Set Size Design with Variance Estimates from Two-Way ANOVA
    Tetsuya Sakai
    [Pdf] [Table of Content]

    Recently, Anonymous proposed two methods for determining the topic set size n for a new test collection based on variance estimates from past data: the first method determines the minimum n that ensures high statistical power, while the second determines the minimum n that ensures tight confidence intervals. These methods are based on statistical techniques described by Nagata. While one prior study by Anonymous used variance estimates based on one-way ANOVA, another used the 95% percentile method proposed by Webber, Moffat and Zobel. This paper reruns the experiments reported by Anonymous using variance estimates based on two-way ANOVA, which turn out to be slightly larger than their one-way ANOVA counterparts and substantially larger than the percentile-based ones. If researchers choose to "err on the side of over-sampling", as recommended by Ellis, the variance estimation method based on two-way ANOVA and the results reported in this paper are probably the ones they should adopt. We also establish empirical relationships between the two topic set size design methods, and discuss the balance between n and the pool depth "pd" using both methods.
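
    As a rough, hypothetical illustration of the confidence-interval-based design idea (a generic sketch, not the paper's exact formulae or variance estimates), the following Python fragment searches for the smallest n at which a t-based interval for the mean per-topic score difference between two systems stays within a target width; the inputs var_d and delta are made-up values.

        # Illustrative sketch only: a generic sample-size search for a target
        # confidence-interval width, not the paper's exact procedure.
        from scipy.stats import t

        def min_topics_for_ci_width(var_d, delta, alpha=0.05, n_max=10000):
            """Smallest topic set size n such that the 100(1-alpha)% CI for the
            mean per-topic score difference between two systems (difference
            variance var_d, estimated from past data) is no wider than delta."""
            for n in range(2, n_max + 1):
                half_width = t.ppf(1 - alpha / 2, n - 1) * (var_d / n) ** 0.5
                if 2 * half_width <= delta:
                    return n
            return None

        # Hypothetical inputs: difference variance 0.04, target CI width 0.1.
        print(min_topics_for_ci_width(var_d=0.04, delta=0.1))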

  • Magnitudes of Relevance: Relevance Judgements, Magnitude Estimation, and Crowdsourcing
    Falk Scholer, Eddy Maddalena, Stefano Mizzaro and Andrew Turpin
    [Pdf] [Table of Content]

    Magnitude Estimation is a psychophysical scaling technique in which the intensity of a stimulus is rated by the assignment of a number. We report on a preliminary investigation into using magnitude estimation for gathering document relevance judgements, as commonly used in test collection-based evaluation of information retrieval systems. Unlike classical binary or ordinal relevance scales, magnitude estimation leads to a ratio scale of measurement, which is more suitable for statistical analysis and potentially allows a more precise measurement of relevance. Through a crowdsourcing experiment, we show that magnitude estimation relevance judgements are consistent with ordinal relevance judgements; we study the difference between using a bounded and an unbounded scale; we show that magnitude estimation can be a useful tool for understanding perceived relevance when an ordinal scale is used; and we investigate document presentation order effects.
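
    As a minimal, hypothetical sketch of the kind of consistency check described above (the scores, labels and the choice of Kendall's tau are illustrative assumptions, not the paper's protocol), one can ask whether magnitude estimation scores order documents the same way as ordinal labels:

        # Illustrative sketch, not the paper's analysis pipeline: compare
        # hypothetical magnitude estimation scores with ordinal relevance
        # labels for the same documents using a rank correlation.
        from scipy.stats import kendalltau

        # Hypothetical judgements for six documents from one assessor.
        magnitude_scores = [5, 12, 80, 3, 150, 40]   # unbounded, ratio-scale numbers
        ordinal_labels   = [0, 1, 2, 0, 2, 1]        # 0 = non-, 1 = partially, 2 = highly relevant

        tau, p_value = kendalltau(magnitude_scores, ordinal_labels)
        print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")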

  • Axiometrics: Axioms of Information Retrieval Effectiveness Metrics
    Eddy Maddalena and Stefano Mizzaro
    [Pdf] [Table of Content]

    There are dozens of information retrieval effectiveness metrics (most likely more than one hundred), and the number keeps growing, yet a common, general, and formal understanding of their properties is still missing. In this paper we aim to improve and extend the recently published work by Busin and Mizzaro [6]. That paper proposes an axiomatic approach to Information Retrieval (IR) effectiveness metrics; in more detail: (i) it defines a framework based on the notions of measure, measurement, and similarity; (ii) it provides a general definition of IR effectiveness metric; and (iii) it proposes a set of axioms that every effectiveness metric should satisfy. Here we build on that work and, more specifically, we design a different and improved set of axioms, we provide definitions of some common metrics, and we derive some theorems from the axioms.

  • Computing Confidence Intervals for Common IR Measures
    Ian Soboroff
    [Pdf] [Table of Content]

    Confidence intervals quantify the uncertainty in an average and offer a robust alternative to hypothesis testing. We measure the performance of standard and bootstrapped confidence intervals on a number of common IR measures using several TREC and NTCIR collections. The performance of an interval is its empirical coverage of the estimated statistic. We find that both standard and bootstrapped intervals give excellent coverage for all measures except in situations of abysmal retrieval performance. We recommend using standard confidence intervals when statistical software is handy, and bootstrap percentile intervals as an equivalent alternative when no statistical libraries are available.
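
    As a concrete sketch of the two kinds of interval discussed above, computed on made-up per-topic scores (the data and the 95% level are assumptions for illustration, not the paper's experimental setup):

        # Illustrative sketch with hypothetical per-topic scores: a standard
        # t-based interval and a bootstrap percentile interval for the mean
        # of an IR measure such as average precision.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        ap_scores = rng.beta(2, 5, size=50)   # hypothetical per-topic average precision

        mean = ap_scores.mean()
        sem = stats.sem(ap_scores)

        # Standard 95% confidence interval based on the t distribution
        # (degrees of freedom passed positionally).
        t_lo, t_hi = stats.t.interval(0.95, len(ap_scores) - 1, loc=mean, scale=sem)

        # Bootstrap percentile interval: resample topics with replacement.
        boot_means = [rng.choice(ap_scores, size=len(ap_scores), replace=True).mean()
                      for _ in range(10000)]
        b_lo, b_hi = np.percentile(boot_means, [2.5, 97.5])

        print(f"standard:  [{t_lo:.3f}, {t_hi:.3f}]")
        print(f"bootstrap: [{b_lo:.3f}, {b_hi:.3f}]")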

  • Assessing Contextual Suggestion
    Adriel Dean-Hall and Charles L. A. Clarke
    [Pdf] [Table of Content]

    Assessment for the TREC Contextual Suggestion Track is unusual in that it depends on the personal preferences of assessors. During the initial phase of the track, assessors rate points of interest in a source city (Philadelphia for TREC 2013) in terms of their own interests. These ratings are distributed to participating groups, who are given about a month to generate point-of-interest suggestions for fifty other target cities around the United States, with personalized suggestions generated for each assessor on the basis of their ratings for the source city. These suggestions are then returned for rating, with each assessor rating their personalized suggestions for one or two of the target cities. Effectiveness scores (e.g., precision at rank 5) are then computed from these ratings. Unlike traditional TREC tasks, such as ad hoc retrieval, it is not possible to measure assessor agreement, since each assessor rates each point of interest in terms of their own personal preferences. Instead, we measure assessor consistency, which we define as an assessor's tendency to rank systems in the same order as other assessors. While consistency can be quite high for some assessors, and appears reasonable for most, we have been unable to identify predictors of assessor consistency, including past consistency.
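
    As a minimal sketch of one way the consistency notion defined above could be computed (the scores and the use of Kendall's tau against a leave-one-out consensus are illustrative assumptions, not the track's official procedure):

        # Illustrative sketch with hypothetical per-assessor system scores:
        # score each assessor's consistency as the rank correlation between
        # their system ordering and the ordering induced by the other assessors.
        import numpy as np
        from scipy.stats import kendalltau

        # Rows = assessors, columns = systems (e.g. precision at rank 5 of each
        # system's personalized suggestions for that assessor).
        scores = np.array([
            [0.60, 0.40, 0.20, 0.50],
            [0.55, 0.45, 0.25, 0.40],
            [0.30, 0.50, 0.60, 0.20],
        ])

        for i in range(scores.shape[0]):
            others_mean = np.delete(scores, i, axis=0).mean(axis=0)  # leave-one-out consensus
            tau, _ = kendalltau(scores[i], others_mean)
            print(f"assessor {i}: consistency (tau) = {tau:.2f}")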