[Session Notes] Panel Discussion: TREC is 20 years old, Where Now for Evaluation
Date: December 6, 2011
Session Chair: William Webber (The University of Maryland, USA)
Ian Soboroff (TREC/TAC)
Gareth Jones (MediaEval)
Andrew Trotman and Shlomo Geva (INEX)
Hideo Joho (NTCIR)
Mark Sanderson (RMIT University, Melbourne, Australia)
On December 6, 2011, the Fourth Sesquiannual EVIA workshop included a lively
discussion about the future of evaluation campaigns. The underlying question
for the panel discussion: "TREC is 20 years old, where now for evaluation
campaigns?" might sound a bit provocative - particularly since Ian
Soboroff of TREC/TAC was one of the panellists. The other panellists were
Andrew Trotman of INEX, Gareth Jones of MediaEval, Hideo Joho of NTCIR and
Mark Sanderson of RMIT University (Australia).
A vibrant discussion started after Mark Sanderson presented some concerns about the way evaluation campaigns are run. His criticism rested on four claims: (a) evaluation campaigns run in long cycles, which may seem impractical; (b) the test collections, which are often tedious to construct, are not extensively used in IR research, while the use of private data is often considered "cool" at major IR conferences such as SIGIR; (c) the text IR community has been unable to measure its year-over-year progress, unlike other communities such as speech recognition; and (d) there is little incentive for researchers to publish at evaluation campaigns, because such publications, despite the hard work involved in participating, are generally not considered equivalent to a top conference paper. As Mark Sanderson put it, "(Researchers) don't get promoted for unranked publications…" This culminated in a more general question regarding the value of evaluation campaigns and how they should be conducted.
Ian Soboroff was the first to respond to Mark Sanderson's criticism: "I disagree on all points," he replied. Soboroff then explained that the value of these campaigns lies primarily in the community of people solving the same problem. He argued that at an evaluation workshop you can have much more specific and technical discussions than at most major IR conferences. He continued, "At SIGIR people often do not want to say what they will do next before they actually publish it." At the same time, he emphasized the importance of growing a community of people around a certain problem. Fred Gey of the University of California, Berkeley (in the audience) added that
"in evaluation campaigns people can try their strange ideas and they
do not get penalised. The value is in allowing people to experiment - an
essential component for the advancement of research. Unfortunately, we
have moved away slightly from this approach." Gareth Jones also agreed
on the community-building aspect and explained that evaluations, such as
CLEF, were very important triggers for new research groups that appeared
even in countries where there was not much IR research in the past.
Andrew Trotman was more critical about the
progress of evaluation campaigns than Ian Soboroff. Trotman claimed there has
been little progress over the last 10 to 20 years. He said, "We have to move away from something that we do now to make
people interact with data." He argued that one of the problems
is that PhD students and their supervisors often believe that progress
can be achieved by using the same methods on different datasets or for
slightly different problems.
Trotman suggested that a way forward might be for people to submit and
make available their programs (perhaps even as services/live systems).
Shlomo Geva of INEX (listed as a panellist, but sitting in the audience)
was even more critical, asking, "Why are we still submitting runs
instead of search engines?" Andrew Trotman then continued to explain
that we have to fundamentally change the way we run evaluations: we need
to move away from one-off tasks to something more continuous. We need to
have live systems where findings can be replicated, but we might sacrifice
exact reproducibility - the ability of an experiment to be reproduced with
exactly the same output. This feature would be difficult to maintain with
live systems where the underlying data is changing.
Ian Soboroff acknowledged that there are
some specific evaluation tasks for which live systems might be more
appropriate. Hideo Joho agreed with Trotman that interactive systems are often
difficult to evaluate in the current evaluation campaigns. However, he added
that the barriers to entering evaluation campaigns should be lowered. He
indicated that people do not want to participate in tasks that are too
difficult. He believed it is necessary
to attract a community of people and then grow the task with the community.
Jones saw one of the current bottlenecks in IR research in the difficulty of
developing test collections and of getting access to large amounts of data. He also speculated
that there are possibly great ideas circulating that cannot be pursued
because of insufficient access to data, namely data owned by large companies
in the IR field such as Google, Yahoo, and Microsoft. Andrew Trotman replied that this is one of the main reasons why we need
the online system. The session chair, William Webber of the University of Maryland,
then asked if this means that we need to build our own Google; Andrew Trotman
answered that we just need a live system to be able to experiment with.
The discussion then shifted to the question of whether "large" data
are really needed to achieve new discoveries. Shlomo Geva argued that
they are not necessarily needed, as many of the currently used methods and functions in
IR systems, such as BM25, have not been extensively researched on large datasets.
So, where do we go from here? Gareth Jones
thinks researchers need the same level of access to data as the private
companies while Hideo Joho emphasizes that researchers should build test
collections and openly release them. Andrew Trotman goes even further by calling
for live systems to be built. Finally, Ian Soboroff believes that we should not
fundamentally change our evaluation approach. He agrees that building a live system
is great, but essentially we should keep doing what we have been doing until now.
by Petr Knoth, The Open University, UK