EVIA2008 Abstract List

In this paper we present an attempt to build a test collection for Mongolian IR, along with preliminary tests on two key issues in Mongolian information retrieval: using a stoplist and using word stemming. Our preliminary tests show that these basic operations can bring slight improvements in retrieval effectiveness for Mongolian, but the gain is still limited and many problems remain.

The ACLIA IR4QA Task at NTCIR-7 is an ad hoc document retrieval task involving three document languages. Although IR4QA used pooling for collecting relevance assessments, it was unique in that the pooled documents were sorted before presenting them to the assessors, based on the assumption that "popular" documents are more likely to be relevant than others. We show that this assumption is indeed valid for the IR4QA test collections.
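
The sorting step described above can be illustrated with a minimal sketch. It assumes that a document's "popularity" is the number of submitted runs that retrieved it within the pool depth, with ties broken by document ID; the abstract does not give the exact sort key, and the function name, run names and document IDs are hypothetical.

```python
from collections import Counter


def sorted_pool(runs, depth):
    """Pool the top `depth` documents of every run and sort by popularity.

    `runs` maps a run name to its ranked list of document IDs.
    Popularity is ASSUMED here to be the number of runs that retrieved
    the document within the pool depth; ties are broken by document ID.
    """
    counts = Counter()
    for ranked_docs in runs.values():
        # Use a set so each run votes at most once per document.
        counts.update(set(ranked_docs[:depth]))
    return sorted(counts, key=lambda doc: (-counts[doc], doc))


# Hypothetical runs for a single topic.
example_runs = {
    "runA": ["d1", "d2", "d3"],
    "runB": ["d2", "d4", "d1"],
    "runC": ["d2", "d5", "d6"],
}
pool_order = sorted_pool(example_runs, depth=2)
```

With these toy runs, the document retrieved by all three runs is presented to the assessor first.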

A quality analysis is provided for each of the three components of a simple Chinese Question Answering system: passage retrieval, entity extraction and candidate selection. Ranked from least effective to most effective, the components are: answer selection, retrieval and extraction. In cross-lingual QA, deficiencies in question translation not only lead to retrieval loss, but also have adverse effects at answer selection.

We follow the opinion that Question Answering (QA) performance can be improved by combining different systems. Thus, we planned an evaluation designed to promote specialization and further collaboration between QA systems. This multi-stream QA requires the development of modules able to select the proper stream according to the question and the candidate answers provided. We describe here the evaluation framework we have developed, with special focus on the evaluation measures and the study of their behavior in a comparative evaluation.

This paper examines the robustness of the evaluation measures which were used at INEX 2007 to rank XML retrieval systems in the focused adhoc task. We study the behaviour of the measures when the completeness assumption of the Cranfield evaluation methodology (i.e. the assumption that all relevant information items within a test collection have been identified and included in the judgment pool) is violated. We also study how the measures behave when evaluation is based on progressively smaller sets of queries. We show that the official measure used for the Focused Task of the INEX 2007 adhoc track (Interpolated Precision at 1% recall or iP[0.01]) is less stable under both types of variations, while MAiP, which is similar to the MAP measure used in traditional document retrieval, is the most stable measure among the INEX 2007 focused task evaluation measures. Our experiments are in line with their precedents in the document retrieval domain, and our findings are also in agreement with earlier findings.

Although Average Precision (AP) has been the most widely-used retrieval effectiveness metric since the advent of the Text Retrieval Conference (TREC), the general belief among researchers is that it lacks a user model. In light of this, Robertson recently pointed out that AP can be interpreted as a special case of Normalised Cumulative Precision (NCP), computed as an expectation of precision over a population of users who eventually stop at different ranks in a list of retrieved documents. He regards AP as a crude version of NCP, in that the probability distribution of the user's stopping behaviour is uniform across all relevant documents. In this paper, we generalise NCP further and demonstrate that AP and its graded-relevance version Q-measure are in fact reasonable metrics despite the above uniform probability assumption. From a probabilistic perspective, these metrics emphasise long-tail users who tend to dig deep into the ranked list, and thereby achieve high reliability. We also demonstrate that one of our new metrics, called NCU_{gu, β=1}, maintains high correlation with AP and shows the highest discriminative power, i.e., the proportion of statistically significantly different system pairs given a confidence level, by utilising graded relevance in a novel way. Our experimental results are consistent across NTCIR and TREC.
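
The interpretation of AP as an expectation of precision under a uniform stopping distribution can be checked numerically. The following is a minimal sketch with binary relevance only: the function names and the toy ranking are hypothetical, and it does not implement the generalised metrics proposed in the paper. Relevant documents that are never retrieved are assumed to contribute zero precision.

```python
def precision_at(rels, k):
    """Precision over the top k ranks; `rels` holds 0/1 relevance flags."""
    return sum(rels[:k]) / k


def average_precision(rels, num_relevant):
    """Textbook AP: precision summed at relevant ranks, divided by R."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant


def expected_precision_uniform_stop(rels, num_relevant):
    """Expected precision when each of the R relevant documents is the
    user's stopping point with equal probability 1/R.  Unretrieved
    relevant documents are ASSUMED to contribute precision 0."""
    p_stop = 1.0 / num_relevant
    return sum(
        p_stop * precision_at(rels, rank)
        for rank, rel in enumerate(rels, start=1)
        if rel
    )


# Hypothetical ranking: relevant at ranks 1 and 3, with R = 3 in total.
toy_rels = [1, 0, 1, 0, 0]
ap = average_precision(toy_rels, num_relevant=3)
expected = expected_precision_uniform_stop(toy_rels, num_relevant=3)
```

Both quantities evaluate to (1 + 2/3) / 3 for this ranking, which is the sense in which AP is a special case with a uniform stopping distribution.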

Yuka Egusa, Masao Takaku, Hitoshi Terai, Hitomi Saito, Noriko Kando and Makiko Miwa

Recently, Scholer and Turpin [Proc. SIGIR 2008] proposed the use of techniques from the field of psychophysics to determine a relevance threshold for a user. Using this threshold, they observed, one could match the relevance criteria of users to those of judges used to develop a test collection, hence selected users should have a better search experience with systems judged superior on that collection. In this paper we show that, when the level of relevance of documents is measured using a categorical scale such as TREC relevance levels, rather than a numerical or physical scale, then the psychophysical techniques for determining thresholds cannot be meaningfully applied in some cases. We demonstrate that the choice of mapping from the categorical scale to a numerical scale has a marked effect on the thresholds derived. Instead, we propose a simpler methodology for matching users to judges. Using the average split agreement approach, only 12 of our 40 student users can be considered aligned with the relevance criteria of TREC judges on three TREC topics.

This paper presents a proposal for relaxed relevance for patent mining. The essential argument is that assigning a complete International Patent Classification (IPC) code to a document is a difficult task and that, because the IPC code has several levels of hierarchy, relaxed relevance judgments at higher levels may yield better performance for the same classification algorithms.
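
One way to picture relaxed relevance is to compare a predicted and a reference IPC code only down to a chosen level of the hierarchy (section, class, subclass, main group, subgroup). The sketch below is an illustration under assumptions, not the paper's definition: it assumes codes written in the form "G06F 17/30", and the function names are hypothetical.

```python
def truncate_ipc(code, level):
    """Cut an IPC code such as 'G06F 17/30' down to the given level.

    ASSUMES the subclass and the group are separated by whitespace and
    the main group and subgroup by '/'.
    """
    code = " ".join(code.split())  # normalise internal whitespace
    subclass, _, group = code.partition(" ")
    if level == "section":
        return subclass[:1]  # e.g. 'G'
    if level == "class":
        return subclass[:3]  # e.g. 'G06'
    if level == "subclass":
        return subclass[:4]  # e.g. 'G06F'
    if level == "main_group":
        return subclass + " " + group.split("/")[0]  # e.g. 'G06F 17'
    if level == "subgroup":
        return code  # full code, e.g. 'G06F 17/30'
    raise ValueError("unknown IPC level: " + level)


def relaxed_match(predicted, reference, level):
    """True if the two codes agree down to `level` (a relaxed judgment)."""
    return truncate_ipc(predicted, level) == truncate_ipc(reference, level)


# Wrong at the subgroup level, but correct from the main group upwards.
strict = relaxed_match("G06F 17/30", "G06F 17/27", "subgroup")
relaxed = relaxed_match("G06F 17/30", "G06F 17/27", "main_group")
```

Under this scheme a prediction that misses the exact subgroup can still count as relevant at a higher level, which is the effect the proposal relies on.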

This paper proposes a methodology for the construction of a patent test collection for the task of prior art search. Key to the justification of the methodology is an analysis of the nature and structure of patent documents and the patenting process. These factors enable a corpus of patent documents to be reverse engineered in order to arrive at high quality, realistic, relevance assessments. The paper first outlines the case for such a prior art search test collection along with the characteristics of patent documents, before describing the proposed method. Further research and development will be directed towards the application of this methodology to create a suite of prior art search topics for the evaluation of patent retrieval systems. We also include a preliminary analysis of its application on European patents.

This short presentation introduces CHORUS, a European coordination action project, together with its research roadmap and recommendations for multimedia information access research projects in the near future. One of the central points is that some of the challenges of multimedia access projects motivate backing off from one single model of evaluation: benchmarking of system components such as search algorithms, interface usability, knowledge representation and other aspects of the system as an artefact should be understood separately from validation of usefulness and acceptability of the system as a tool.

Emerging personal lifelog (PL) collections contain permanent digital records of information associated with individuals’ daily lives. This can include materials such as emails received and sent, web content and other documents with which they have interacted, photographs, videos and music experienced passively or created, logs of phone calls and text messages, and also personal and contextual data such as location (e.g. via GPS sensors), persons and objects present (e.g. via Bluetooth) and physiological state (e.g. via biometric sensors). PLs can be collected by individuals over very extended periods, potentially running to many years. Such archives have many potential applications, including helping individuals recover partially forgotten information, sharing experiences with friends or family, telling the story of one’s life, clinical applications for the memory impaired, and fundamental psychological investigations of memory. The Centre for Digital Video Processing (CDVP) at Dublin City University is currently engaged in the collection and exploration of applications of large PLs. We are collecting rich archives of daily life including textual and visual materials, and contextual data. An important part of this work is to consider how the effectiveness of our ideas can be measured in terms of metrics and experimental design. While these studies have considerable similarity with traditional evaluation activities in areas such as information retrieval and summarization, the characteristics of PLs mean that new challenges and questions emerge. We are currently exploring the issues through a series of pilot studies and questionnaires. Our initial results indicate that there are many research questions to be explored and that the relationships between personal memory, context and content for these tasks are complex and fascinating.
