Tutorial
NTCIR-16 Conference Tutorial
Date: June 14th (Tue), 2022
(Time: 10:00 - 12:00 (JST), 1:00 - 3:00 (UTC), Jun 13, 21:00 - 23:00 (EDT))
Title: Evaluating Evaluation Measures, Evaluating Information Access Systems, Designing and Constructing Test Collections, and Evaluating Again
Speaker: Tetsuya Sakai (Waseda University)
I plan to cover the following topics in this tutorial:
1. Why is (offline) evaluation important?
2. On a few evaluation measures used at NTCIR
3. How should we choose the evaluation measures?
4. How should we design and build a test collection?
5. How should we ensure the quality of the gold data?
6. How should we report the results?
7. Quantifying reproducibility and progress
8. Summary
Tetsuya Sakai is a professor at the Department of Computer Science and Engineering, Waseda University, Japan. He is also a General Research Advisor of Naver Corporation, Korea (2021-), and a visiting professor at the National Institute of Informatics, Japan (2015-). He joined Toshiba in 1993 and obtained a Ph.D. from Waseda University in 2000. From 2000 to 2001, he was a visiting researcher at the Computer Laboratory, University of Cambridge, supervised by the late Karen Spärck Jones. In 2007, he joined NewsWatch, Inc. as the director of its Natural Language Processing Lab. In 2009, he joined Microsoft Research Asia. He joined the Waseda faculty in 2013, where he served as Associate Dean (IT Strategies Division) from 2015 to 2017 and as Department Head from 2017 to 2019. He is an ACM Distinguished Member and a Senior Associate Editor of ACM TOIS.
Keynote 1
NTCIR-16 Conference Keynote 1
Date: June 15th (Wed), 2022
(Time: 10:00 - 11:00 (JST), 1:00 - 2:00 (UTC), Jun 14, 21:00 - 22:00 (EDT))
Title: Information Retrieval Evaluation as Search Simulation
Speaker: ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA)
Abstract.
Due to the empirical nature of the Information Retrieval (IR) task, experimental evaluation of IR methods and systems is essential. Historically, evaluation initiatives such as TREC, CLEF, and NTCIR have made significant impacts on IR research and resulted in many test collections that can be reused by researchers to study a wide range of IR tasks. However, despite its great success, the traditional Cranfield evaluation methodology using a test collection has significant limitations, especially for evaluating an interactive IR system, and how to evaluate interactive IR systems with reproducible experiments remains an open challenge. In this talk, I will discuss how we can address this challenge by framing the problem of IR evaluation more generally as search simulation, i.e., having an IR system interact with simulated users and measuring the performance of the system based on its interaction with those simulated users. I will first present a general formal framework for evaluating IR systems based on search session simulation, discussing how the framework can not only cover the traditional Cranfield evaluation method as a special case but also reveal potential limitations of the traditional IR evaluation measures. I will then review recent research progress in developing formal models for user simulation and in evaluating user simulators. Finally, I will discuss how we may leverage current IR test collections to support simulation-based evaluation by developing and deploying user simulators based on those existing collections. I will conclude the talk with a brief discussion of important future research directions in simulation-based IR evaluation.
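As a purely illustrative aside (not taken from the talk, and not Prof. Zhai's actual framework): the sketch below shows the basic shape of simulation-based evaluation. A simulated user scans a ranked list, collects gain from relevant documents, and abandons the session with some probability after each result; the persistence parameter, document identifiers, and function names are invented for this example. A deterministic simulator that always reads exactly k results would recover a Cranfield-style measure such as precision@k, illustrating the special-case relationship mentioned in the abstract.

```python
import random

def simulate_session(ranking, relevance, rng, persistence=0.8):
    """One simulated search session: scan down the ranking, collect gain
    from relevant documents, and abandon with probability 1 - persistence
    after each result."""
    gain = 0.0
    for doc in ranking:
        gain += relevance.get(doc, 0)   # user benefits from relevant results
        if rng.random() > persistence:  # user gives up and ends the session
            break
    return gain

def expected_gain(ranking, relevance, n_sessions=10_000, persistence=0.8):
    """Monte Carlo estimate of the utility this ranking offers the simulated user."""
    rng = random.Random(42)
    return sum(simulate_session(ranking, relevance, rng, persistence)
               for _ in range(n_sessions)) / n_sessions

relevance = {"d2": 1, "d5": 1, "d9": 1}       # toy relevance judgments
system_a = ["d2", "d5", "d1", "d9", "d3"]     # relevant documents ranked high
system_b = ["d1", "d3", "d2", "d5", "d9"]     # relevant documents ranked low

print(expected_gain(system_a, relevance))     # higher: gain arrives before the user quits
print(expected_gain(system_b, relevance))     # lower: the user often stops too early
```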
Biography.
ChengXiang Zhai is a Donald Biggar Willett Professor in Engineering in the Department of Computer Science at the University of Illinois at Urbana-Champaign, where he also holds joint appointments at the Carl R. Woese Institute for Genomic Biology, the Department of Statistics, and the School of Information Sciences. He received a Ph.D. in Computer Science from Nanjing University in 1990 and a Ph.D. in Language and Information Technologies from Carnegie Mellon University in 2002. He worked at Clairvoyance Corp. as a Research Scientist and then a Senior Research Scientist from 1997 to 2000. His research interests are in the general area of intelligent information systems, including intelligent information retrieval, data mining, natural language processing, machine learning, and their applications in domains such as biomedical informatics and intelligent education systems. He has published over 300 papers in these areas and holds 6 patents. He offers two Massive Open Online Courses (MOOCs) on Coursera, covering Text Retrieval and Search Engines and Text Mining and Analytics, respectively, and was a key contributor to the Lemur text retrieval and mining toolkit. He has served as an Associate Editor for major journals in multiple areas, including information retrieval (ACM TOIS, IPM), data mining (ACM TKDD), intelligent systems (ACM TIST), and medical informatics (BMC MIDM), as Program Co-Chair of NAACL HLT'07, SIGIR'09, and WWW'15, and as Conference Co-Chair of CIKM'16, WSDM'18, and IEEE BigData'20. He is an ACM Fellow and a member of the ACM SIGIR Academy. He has received multiple awards, including the ACM SIGIR Gerard Salton Award, the ACM SIGIR Test of Time Paper Award (three times), the 2004 Presidential Early Career Award for Scientists and Engineers (PECASE), an Alfred P. Sloan Research Fellowship, an IBM Faculty Award, an HP Innovation Research Award, a Microsoft Beyond Search Research Award, the UIUC Rose Award for Teaching Excellence, and the UIUC Campus Award for Excellence in Graduate Student Mentoring. He has graduated 38 PhD students and over 50 MS students.
Keynote 2
NTCIR-16 Conference Keynote 2
Date: June 15th (Wed), 2022
(Time: 20:00 - 21:00 (JST), 11:00 - 12:00 (UTC), 7:00 - 8:00 (EDT))
Title: Cranfield is Dead; Long Live Cranfield
Speaker: Ellen Voorhees (NIST, USA)
Abstract.
Evaluating search system effectiveness is a foundational hallmark of
information retrieval research. Doing so requires infrastructure
appropriate for the task at hand, which has frequently entailed using
the Cranfield paradigm: test collections and associated evaluation
measures. Observers have declared Cranfield moribund multiple times in
its 60-year history, though each time test collection construction
techniques and evaluation measure definitions have evolved to restore
Cranfield as a useful tool. Now Cranfield's effectiveness is once more
in question since corpus sizes have grown to the point that finding a
few relevant documents is easy enough to saturate high-precision
measures while deeper measures are unstable because too few of the
relevant documents have been identified. In this talk I'll review how
Cranfield evolved in the past and examine its prospects for the future.
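As a hypothetical illustration of this point (not drawn from the talk; the run and judgment sets below are invented), the sketch contrasts a top-heavy measure with a deeper one as the judgment pool grows: precision@10 stays saturated at 1.0, while average precision for the very same run shifts once additional relevant documents are identified.

```python
def precision_at_k(ranking, qrels, k=10):
    """Fraction of the top-k documents that are judged relevant."""
    return sum(doc in qrels for doc in ranking[:k]) / k

def average_precision(ranking, qrels):
    """Average precision over the judged relevant set; unjudged documents are
    treated as non-relevant, so the score depends on judgment completeness."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in qrels:
            hits += 1
            score += hits / rank
    return score / len(qrels) if qrels else 0.0

ranking = [f"doc{i}" for i in range(1, 101)]     # one system's ranked run
shallow = {f"doc{i}" for i in range(1, 11)}      # shallow judgment pool
deeper = shallow | {"doc50", "doc75", "doc99"}   # more relevant docs found later

print(precision_at_k(ranking, shallow), precision_at_k(ranking, deeper))  # 1.0 1.0 (saturated)
print(average_precision(ranking, shallow))  # 1.0   (looks perfect)
print(average_precision(ranking, deeper))   # ~0.81 (same run, new judgments)
```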
Biography.
Ellen Voorhees is a Fellow at the US National Institute of Standards and Technology (NIST). For most of her tenure at NIST she managed the Text
REtrieval Conference (TREC), a project that develops the
infrastructure required for large-scale evaluation of search engines and
other information access technology. Currently she is examining how best
to bring the benefits of large-scale community evaluations to bear on
the problems of trustworthy AI. Voorhees' general research focuses on
developing and validating appropriate evaluation schemes to measure
system effectiveness for diverse user tasks.
Voorhees is a fellow of the ACM, a member of the ACM SIGIR Academy, and
has been elected as a fellow of the Washington Academy of Sciences. She
has published numerous articles on information retrieval techniques and
evaluation methodologies and serves on the review boards of several
journals and conferences.
Keynote 3
NTCIR-16 Conference Keynote 3
Date: June 17th (Fri), 2022
(Time: 17:00 - 18:00 (JST), 8:00 - 9:00 (UTC), 4:00 - 5:00 (EDT))
Title: The Impact of Query Variability and Relevance Measurement Scales on Information Retrieval Evaluation
Speaker: Falk Scholer (RMIT University, Australia)
Abstract.
Information retrieval makes extensive use of test collections for the measurement of search system effectiveness. Broadly speaking, this evaluation framework includes three components: search queries; a collection of documents to search over; and relevance judgements. In this talk, we'll consider two aspects of this process: queries, and relevance scales.
Test collections typically use a single query to represent a more complex search topic or information need. However, different people may generate a wide range of query variants when instantiating information needs. We'll consider the implications of this for the evaluation of search systems, and the potential benefits and costs of incorporating variant queries into a test collection framework.
Relevance judgements are used to indicate whether the documents returned by a retrieval system are appropriate responses for the query. They can be made using a variety of different scales, including ordinal (binary or graded) and techniques such as magnitude estimation. We'll examine a number of different approaches, and explore their benefits and drawbacks for judging relevance for retrieval evaluation.
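As a hypothetical illustration of how the choice of relevance scale can change an evaluation outcome (not taken from the talk; the documents, grades, and rankings below are invented), the sketch scores two rankings with binary versus graded judgments using DCG: the binary scale cannot distinguish them, while the graded scale can.

```python
from math import log2

def dcg(ranking, grades, k=5):
    """Discounted cumulative gain over the top-k results."""
    return sum(grades.get(doc, 0) / log2(rank + 1)
               for rank, doc in enumerate(ranking[:k], start=1))

graded = {"d1": 3, "d2": 1, "d3": 2}                     # 0-3 graded judgments
binary = {doc: 1 for doc, g in graded.items() if g > 0}  # collapsed to relevant / not

run_x = ["d1", "d2", "d3", "d4", "d5"]   # puts the highly relevant document first
run_y = ["d2", "d3", "d1", "d4", "d5"]   # puts marginally relevant documents first

print(dcg(run_x, binary), dcg(run_y, binary))   # identical: binary judgments tie the runs
print(dcg(run_x, graded), dcg(run_y, graded))   # graded judgments separate them
```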
Biography.
Falk Scholer is a Professor in the Data Science discipline of the School of Computing Technologies at RMIT University in Melbourne, Australia. His research is in the area of information access and retrieval, focusing on understanding how systems such as search engines can assist users in resolving their information needs, and how their effectiveness can be measured. He also works on issues of fairness, accountability, transparency, and ethics of systems and algorithms as part of the ARC Centre of Excellence for Automated Decision-Making and Society, and on misinformation, fake news, and fact-checking with the RMIT FactLab research hub. Falk is the Deputy Director of the RMIT Centre for Information Discovery and Data Analytics (CIDDA), which brings together experts across academic disciplines, schools, and colleges, including computing technologies, science, maths and statistics, engineering, and business. He also teaches a range of courses, including web development and programming, data science, HCI, and databases, and is the Program Manager for the postgraduate Master of Data Science. Falk also has a keen interest in research ethics and integrity, and chairs the STEM College Human Ethics Advisory Network (CHEAN).
NTCIR-16 Invited Talk 1
Date: June 17th (Fri), 2022
(Time: 18:05 - 18:15 (JST), 9:05 - 9:15 (UTC), 5:05 - 5:15 (EDT))
Title: What is happening in CLEF 2022
Speaker: Nicola Ferro (University of Padua)
Abstract:
TBA
Biography:
Nicola Ferro (http://www.dei.unipd.it/~ferro/) is a full professor of computer science
at the University of Padua, Italy. His research interests include
information retrieval, its experimental evaluation, multilingual
information access, and digital libraries, and he has published more than 350
papers on these topics. He is a co-organizer of the Covid-19 MLIA @ Eval
initiative and the chair of the CLEF evaluation initiative,
which involves more than 200 research groups world-wide in large-scale
IR evaluation activities. He was the coordinator of the EU FP7 Network
of Excellence PROMISE on information retrieval evaluation. He is
associate editor of ACM TOIS and was general chair of ECIR 2016, and
short papers program co-chair of ECIR 2020.
NTCIR-16 Invited Talk 2
Date: June 17th (Fri), 2022
(Time: 18:15 - 18:25 (JST), 9:15 - 9:25 (UTC), 5:15 - 5:25 (EDT))
Title: TREC's Neural CLIR Track
Speaker: Douglas W. Oard (University of Maryland)
Abstract:
Test collections are a product of their time, with older collections containing relevance judgments only for the documents that could be found a decade or more ago when those test collections were first made. Neural Information Retrieval (IR) techniques have recently changed the playing field, ranking and re-ranking documents better than traditional IR techniques. Neural methods have also substantially improved the quality of translation technology, which is of particular importance for Cross-Language IR (CLIR). Accurately measuring the effect of these improvements might thus require a new generation of CLIR test collections. The goal of the Neural CLIR (NeuCLIR) track at TREC 2022 is to begin the process of creating such collections. In this brief talk, I’ll answer the question “What’s new in NeuCLIR?”
Biography:
Douglas Oard is a Professor in the College of Information Studies and the Institute for Advanced Computer Studies (UMIACS) at the University of Maryland (USA) and a Visiting Professor at the National Institute of Informatics (Japan). He has rich experience with the design and evaluation of systems for
CLIR, and is one of the track coordinators for TREC’s new Neural CLIR track.
Last modified: 2022-06-16