NTCIR-9 Session Notes

DAY-1 December 6 Tuesday: Breakout Sessions & EVIA 2011

[Session Notes] Panel Discussion: TREC is 20 years old, Where Now for Evaluation Campaigns?

Date: December 6, 2011
Time: 16:45-17:30
Session Chair: William Webber (The University of Maryland, USA)
Speakers:
Ian Soboroff (TREC/TAC)
Gareth Jones (MediaEval)
Andrew Trotman and Shlomo Geva (INEX)
Hideo Joho (NTCIR)
Mark Sanderson (RMIT University, Melbourne, Australia)

Summary:

On December 6, 2011, the Fourth Sesquiannual EVIA workshop included a lively discussion about the future of evaluation campaigns. The underlying question for the panel discussion, "TREC is 20 years old, where now for evaluation campaigns?", might sound a bit provocative, particularly since Ian Soboroff of TREC/TAC was one of the panellists. The other panellists were Andrew Trotman of INEX, Gareth Jones of MediaEval, Hideo Joho of NTCIR and Mark Sanderson of RMIT University (Australia).

A vibrant discussion started after Mark Sanderson presented some concerns about the way evaluation campaigns are run. The criticism was based on the claims that (a) evaluation campaigns run in long cycles, which may seem impractical, (b) the test data collections, which are often tedious to construct, are not extensively used in IR research, while the use of private data is often considered "cool" at major IR conferences such as SIGIR, (c) the text IR community has been unable to measure its year-over-year progress, unlike other communities such as speech recognition, and finally (d) there is little incentive for researchers to publish at evaluation campaigns, because such publications, despite the hard work involved in conducting the research needed to participate, are generally not considered equivalent to a top conference paper. As Mark Sanderson said, "(Researchers) don't get promoted for unranked publications…" This culminated in a more general question regarding the value of evaluation campaigns and how they should be conducted.

Ian Soboroff was the first to respond to Mark Sanderson's criticism: "I disagree on all points," he replied. Soboroff then explained that the value of these campaigns is primarily established in the community of people solving the same problem. He argued that at an evaluation workshop you can have much more specific and technical discussions than at most major IR conferences. He continued, "At SIGIR, people often do not want to say what they will do next before they actually publish it." At the same time, he emphasized the importance of growing a community of people around a certain problem. Fred Gey of the University of California, Berkeley (in the audience) added that "in evaluation campaigns people can try their strange ideas and they do not get penalised. The value is in allowing people to experiment - an essential component for the advancement of research. Unfortunately, we have moved away slightly from this approach." Gareth Jones also agreed on the community-building aspect and explained that evaluations such as CLEF were very important triggers for new research groups, which appeared even in countries where there had not been much IR research in the past.

Andrew Trotman was more critical about the progress of evaluation campaigns than Ian Soboroff. Trotman claimed there has been little progress over the last 10 to 20 years. He said, "We have to move away from something that we do now to make people interact with data." He noted that one of the problems is that PhD students and their supervisors often believe that progress can be achieved by using the same methods on different datasets or for slightly different problems.

Andrew Trotman suggested that a way forward might be for people to submit and make available their programs (perhaps even as services or live systems). Shlomo Geva of INEX (listed as a panellist, but sitting in the audience) was even more critical, asking, "Why are we still submitting runs instead of search engines?" Andrew Trotman then went on to explain that we have to fundamentally change the way we run evaluations: we need to move away from one-off tasks to something more continuous. We need to have live systems where findings can be replicated, but we might sacrifice exact reproducibility, that is, the ability of an experiment to be reproduced with exactly the same output. This property would be difficult to maintain with live systems where the underlying data is changing.

Ian Soboroff acknowledged that there are some specific evaluation tasks for which live systems might be more appropriate. Hideo Joho agreed with Andrew Trotman that interactive systems are often difficult to evaluate in the current evaluation campaigns. However, he also added that the barriers to entering evaluation campaigns should be lowered. He indicated that people do not want to participate in tasks that are too difficult. He believed it is necessary to attract a community of people and then grow the task with the community.

Gareth Jones saw one of the current bottlenecks in IR research in the difficulty of developing test collections and of getting access to large amounts of data. He also speculated that there are possibly great ideas in circulation that cannot be pursued because of insufficient access to data, namely data owned by large companies in the IR field such as Google, Yahoo, and Microsoft. Andrew Trotman replied that this is one of the main reasons why we need the online system. The session chair, William Webber of the University of Maryland, then asked if this means that we need to build our own Google; Andrew Trotman answered that we just need a live system to be able to experiment with.

The discussion then shifted to the question of whether "large" data are really needed to achieve new discoveries. Shlomo Geva believed that they are not necessarily needed, as many of the currently used methods and functions in IR systems, such as BM25, have not been extensively researched on large collections.

So, where do we go from here? Gareth Jones thinks researchers need the same level of access to data as the private companies, while Hideo Joho emphasizes that researchers should build test collections and release them openly. Andrew Trotman goes even further by calling for live systems to be built. Finally, Ian Soboroff believes that we should not fundamentally change our evaluation approach. He agrees that building a live system is great, but essentially we should keep doing what we have been doing until now.

by Petr Knoth, The Open University, UK

Last Modified:2011.12.12