"In this paper, we give a brief introduction of the HTRDP Chinese Information Retrieval Evaluation, which is sponsored by HTRDP (High Technology Research and Development Program of China, namely 863 Program). The web data collection, query design, evaluation metrics, and the evaluation procedures will presents in detail. Like TREC, NTCIR and CLEF, its purpose is to provide the infrastructure necessary for large-scale evaluation of Chinese information retrieval methodologies and to help advance the state of the art in information retrieval technology. We conclude our paper with results analysis and future works."
"In this paper we present the Vietnamese specialities in word boundary, morphology, part of speech that must be addressed in information retrieval relative tasks. Our experiments have shown how different types of Vietnamese index terms: ÒtiếngÓ, words, compound words, combination of word and compound word contribute to Vietnamese text processing and retrieval. We also introduce our Vietnamese test collection on which experimentations have been done and report the method used to construct this test collection."
"The Indian subcontinent can be regarded as another Europe, due to its lingual diversity. Geographically, the Indian subcontinent consists of six countries, namely Pakistan, Bangladesh, Nepal, Sri Lanka, Bhutan and India. The total population in this part of the world is about 1,300 million and about 25 official languages are used by this population. Among the major languages of this region, Hindi and Bengali rank among the top ten most-spoken languages of the world. Over the past few years (2000--2007), a large volume of Indian language (IL) electronic documents have come into existence at a growth rate of 700.0 \%. The need for developing IR systems to deal with this growing repository is, therefore, unquestionable. Considering this need, the Government of India has recently formed a national consortium of academic and research organizations, that has been entrusted with the task of developing a Cross Lingual Information Access (CLIA) system for Indian language content. This paper will outline the issues that will need to be addressed, and the activities of the newly formed consortium."
"This paper discusses some challenging issues that are found in the evaluation of web search engines by using Thai queries. The discussions are based on our experience in evaluating and comparing the search performance of 7 search engines on Thai queries. The issues addressed in this paper will help in improving further evaluations of search engines for Thai."
"Pooling is the most common technique used to build modern test collections. Evidence is mounting that pooling may not yield reusable test collections for very large document sets. This paper describes the approach taken in the TREC 2006 Terabyte Track: an initial shallow pool was judged to gather relevance information, which was then used to draw a random sample of further documents to judge. The sample judgments rank systems somewhat differently than the pool. Some analysis and plans for further research are discussed."
"Large-scale information retrieval evaluation efforts such as TREC and NTCIR have tended to adhere to binary-relevance evaluation metrics, even when graded relevance data were available. However, the NTCIR-6 Crosslingual Task has finally started adopting graded-relevance metrics, though only as additional metrics. This paper compares three existing graded-relevance metrics that were mentioned in the Call for Participation of the NTCIR-6 Crosslingual Task in terms of the ability to control how severely ``late arrival'' of relevant documents should be penalised. We argue and demonstrate that Q-measure is more flexible than normalised Discounted Cumulative Gain and generalised Average Precision. We then suggest a brief guideline for conducting a reliable information retrieval evaluation with graded relevance."
"In some classification problems, such as the patent classification based on the F-term which is one subtask in the NTCIT-6, there are general or specific relations between the class labels. It is desirable that the relations among the labels should be taken into account in the evaluation measures for those problems. For example, if a system assigns an incorrect label to one instance and the assigned label has close relation with the true label, then the system may deserve some credit, rather than having no credit at all for any incorrect label assignment. In this paper we propose some new evaluation measures based on the relations among the label, which can be considered as the label relation sensitive version of the important evaluation measure such as averaged precision and F-measure. We also present the results by applying the new evaluation measures to all the submitted runs for the NTCIR- 6 F-term patent classification."
"We describe WiQA 2006, a pilot task aimed at studying question answering using Wikipedia. Going beyond traditional factoid questions, the task considered at WiQA 2006 was to return---given an source article from Wikipedia---to identify snippets from other Wikipedia articles, possibly in languages different from the language of the source article, that add new and important information to the source article, and that do so without repetition. A total of 7 teams took part, submitting 20 runs. Our main findings are two-fold: (i)~while challenging, the tasks considered at WiQA are do-able as participants achieved impressive scores as measured in terms of yield, mean reciprocal rank, and precision, (ii)~on the bilingual task, substantially higher scores were achieved than on the monolingual tasks."
"This paper examines the current way of keeping the data produced during an evaluation campaign of Information Retrieval Systems (IRS) and highlights some shortenings of it. In particular, the Cranfield methodology has been designed for creating comparable experiments and evaluating the performances of IRS rather than modeling and managing the scientific data produced during an evaluation campaign. The data produced during an evaluation campaign of IRSs are valuable scientific data, and as a consequence, their lineage should be tracked since it allows us to judge the quality and applicability of information for a given use; those data should be enriched progressively adding further analyses and interpretations on them; it should be possibile to cite them and their further elaboration, since this is an effective way for explicitly mentioning and making references to useful information, for improving the cooperation among researchers and to facilitate the transfer of scientific and innovative results from research groups to the industrial sector."
"This is the third year of the evaluation of geographic information retrieval (GeoCLEF) within the Cross-Language Evaluation Forum (CLEF). GeoCLEF 2006 presented topics and documents in four languages (English, German, Portuguese and Spanish). After two years of evaluation we are beginning to understand the challenges to both Geographic Information Retrieval from text and of evaluation of the results of geographic information retrieval. This poster enumerates some of these challenges to evaluation and comments on the limitations encountered in the first two evaluations."
"This paper presents a simple approach to utilize past test collections as a material for user experiments. We have built a Web-based user interface for NTCIR-5 WEB run results, and conducted a user experiment with 29 subjects to investigate whether performance evaluation metrics of information retrieval systems used in test collections such as TREC and NTCIR comparable to user performance. In this experiment, we selected three types of systems from among systems that participated in NTCIR-5 WEB, and then selected three topics with roughly the same values from among several search topics. The results of the experiment showed no significant differences among these systems and topics in the time for search. While, in general, the user experiment itself have been successfully conducted and shown similar trends with prior study, the approach seems to have some limitations mainly on interactivity and cached page display."
"Good test collections, coupled with good evaluation metrics, are very useful for evaluating Information Access systems efficiently. But useful to whom? The in vitro (or Cranfield) evaluation paradigm has been criticised, mainly because of the absence of the user. On the other hand, user-in-the-loop evaluations are expensive, unrepeatable and often inconclusive. In light of this, we propose a new task for NTCIR that aims to directly measure the correlation between user satisfaction and evaluation metric values. To this end, we plan to reuse NTCIR-5 and NTCIR-6 Japanese monolingual newspaper test collections from the crosslingual task. Our final goal is to design new evaluation metrics that accurately approximate user satisfaction scores."