Proceedings of the Tenth International Workshop on Evaluating Information Access (EVIA 2023),
a Satellite Workshop of the NTCIR-17 Conference
December 12-15, 2023
National Institute of Informatics, Tokyo, Japan

Abstracts


    [Preface]


  • Qingyao Ai and Douglas W. Oard


    [Keynote]


  • Ian Soboroff
    The astounding emergence of ChatGPT and other AI systems that generate content, and their apparently incredible performance, are an inspiration to the research community. The performance of these LLMs is so impressive it is widely supposed that we can use them to measure their own effectiveness! We have long had evaluation methods for generated content, including question answering, summarization, and translation; in this talk I dust them off and present both a historical view and a discussion of how we might approach those methods today. tl;dr, we have a lot of work to do.


    [Panel]


  • Douglas W. Oard, Akiko Aizawa, Inho Kang, Yiqun Liu and Paul Thomas
    The goal of this panel is to inspire thinking about the evaluation of generative Large Language Models (LLMs) at NTCIR. This will be a discussion-focused panel, with panelists initially setting the stage with responses to a few questions, followed by a wide-ranging discussion among the panel and with the audience. We will consider the panel a success if some future NTCIR tasks are influenced by our discussion.


    [EVIA]


  • Tetsuya Sakai
    NTCIR-17 introduced the FairWeb-1 task, which evaluated web page rankings in terms of both relevance and group fairness. The present study shows how its evaluation framework can be extended to the evaluation of multi-turn, textual conversational search systems. By using the full test topic set of FairWeb-1 to harvest actual user-system conversations from New Bing and Google Bard, we demonstrate how a series of system turns can be evaluated using our evaluation framework, which we call GFRC (Group Fairness and Relevance of Conversations). In addition, based on observations from our pilot experiment, we briefly discuss a few open questions in the human-in-the-loop evaluation of conversational search in general.
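    To give a concrete feel for what scoring a multi-turn conversation on both relevance and group fairness can look like, here is a minimal, hedged sketch. It is not the paper's GFRC measure: the per-turn combination of a relevance grade with a distributional fairness term, the use of total variation distance, and the weight alpha are all illustrative assumptions.

```python
# Deliberately simplified sketch, NOT the paper's GFRC measure: combine a
# per-turn relevance grade with a group-fairness term (one minus the total
# variation distance between the groups exposed in the turn and a target
# distribution), then average over turns. Data layout, distance choice, and
# the weight alpha are assumptions made for illustration only.

def fairness(achieved, target):
    """1 minus total variation distance between two group distributions."""
    groups = set(achieved) | set(target)
    tvd = 0.5 * sum(abs(achieved.get(g, 0.0) - target.get(g, 0.0)) for g in groups)
    return 1.0 - tvd

def conversation_score(turns, target, alpha=0.5):
    """Average over turns of alpha * relevance + (1 - alpha) * fairness."""
    per_turn = [alpha * t["relevance"] + (1 - alpha) * fairness(t["groups"], target)
                for t in turns]
    return sum(per_turn) / len(per_turn)

turns = [
    {"relevance": 0.8, "groups": {"A": 0.7, "B": 0.3}},  # first system turn
    {"relevance": 0.6, "groups": {"A": 0.5, "B": 0.5}},  # second system turn
]
print(conversation_score(turns, target={"A": 0.5, "B": 0.5}))
```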
  • Nuo Chen, Jiqun Liu, Tetsuya Sakai and Xiao-Ming Wu
    In recent years, the influence of cognitive effects and biases on users' thinking, behavior, and decision-making has garnered increasing attention in the field of interactive information retrieval. The decoy effect, one of the main empirically confirmed cognitive biases, refers to the shift in preference between two choices when a third option (the decoy), inferior to one of the initial choices, is introduced. However, it is not clear how the decoy effect influences user interactions with, and evaluations of, Search Engine Result Pages (SERPs). To bridge this gap, our study seeks to understand how the decoy effect at the document level influences users' interaction behaviors on SERPs, such as clicks, dwell time, and usefulness perceptions. We conducted experiments on two publicly available user behavior datasets, and the findings reveal that, compared to cases where no decoy is present, a document accompanied by a decoy can attract a higher click probability and a higher usefulness score.
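    As a purely illustrative companion to the abstract above (not the authors' analysis code), the sketch below compares the click rate and mean usefulness score of documents with and without an associated decoy, given hypothetical per-document interaction records; the field names clicked, usefulness, and has_decoy are assumptions.

```python
# Illustrative sketch only, not the authors' analysis code: compare the click
# rate and mean usefulness of documents that do or do not have an associated
# decoy. The record fields ("clicked", "usefulness", "has_decoy") are
# hypothetical stand-ins for whatever the behavior datasets actually provide.
from statistics import mean

def summarize(records):
    with_decoy = [r for r in records if r["has_decoy"]]
    without_decoy = [r for r in records if not r["has_decoy"]]
    for label, group in (("decoy present", with_decoy), ("no decoy", without_decoy)):
        click_rate = mean(1.0 if r["clicked"] else 0.0 for r in group)
        usefulness = mean(r["usefulness"] for r in group)
        print(f"{label}: P(click)={click_rate:.2f}, mean usefulness={usefulness:.2f}")

summarize([
    {"clicked": True,  "usefulness": 4, "has_decoy": True},
    {"clicked": True,  "usefulness": 3, "has_decoy": True},
    {"clicked": False, "usefulness": 2, "has_decoy": False},
    {"clicked": True,  "usefulness": 2, "has_decoy": False},
])
```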
  • Uyen Lai, Gurjit Randhawa and Paul Sheridan
    Heaps’ law is an empirical relation in text analysis that predicts vocabulary growth as a function of corpus size. While this law has been validated in diverse human-authored text corpora, its applicability to text generated by large language models remains unexplored. This study addresses this gap, focusing on the emulation of corpora using the suite of GPT-Neo large language models. To conduct our investigation, we emulated corpora of PubMed abstracts using three different parameter sizes of the GPT-Neo model. Our emulation strategy involved using the initial five words of each PubMed abstract as a prompt and instructing the model to expand the content up to the original abstract’s length. Our findings indicate that the generated corpora adhere to Heaps’ law. Interestingly, as the GPT-Neo model size grows, its generated vocabulary increasingly adheres to Heaps’ law as observed in human-authored text. To further improve the richness and authenticity of GPT-Neo outputs, future iterations could emphasize enhancing model size or refining the model architecture to curtail vocabulary repetition.
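    For readers unfamiliar with the relation, Heaps’ law states that vocabulary size grows roughly as V(n) ≈ K·n^β in the number of tokens n. The sketch below is a generic illustration, not the paper's exact fitting procedure: it tracks vocabulary growth over a token stream and estimates K and β with a log-log least-squares fit; the toy token stream stands in for the PubMed and GPT-Neo corpora described above.

```python
# Generic illustration, not the paper's exact procedure: fit Heaps' law
# V(n) = K * n**beta to vocabulary growth via a log-log least-squares fit.
import numpy as np

def vocabulary_growth(tokens):
    """Return arrays of corpus size n and vocabulary size V(n) as tokens accumulate."""
    seen = set()
    sizes, vocab = [], []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        sizes.append(i)
        vocab.append(len(seen))
    return np.array(sizes), np.array(vocab)

def fit_heaps(sizes, vocab):
    """Estimate K and beta from log V = log K + beta * log n."""
    beta, log_k = np.polyfit(np.log(sizes), np.log(vocab), 1)
    return np.exp(log_k), beta

# Toy token stream; real experiments would use the PubMed abstracts and the
# GPT-Neo generated corpora described in the abstract above.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
n, v = vocabulary_growth(tokens)
K, beta = fit_heaps(n, v)
print(f"Heaps' law fit: V(n) ~ {K:.2f} * n^{beta:.2f}")
```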
  • Fernando Diaz
    Across a variety of ranking tasks, researchers use reciprocal rank to measure the effectiveness for users interested in exactly one relevant item. Despite its widespread use, evidence suggests that reciprocal rank is brittle when discriminating between systems. This brittleness, in turn, is compounded in modern evaluation settings where current, high-precision systems may be difficult to distinguish. We study the scenario where there is more than one relevant item and address the lack of sensitivity of reciprocal rank by introducing and connecting it to the concept of best-case retrieval, an evaluation method focusing on assessing the quality of a ranking for the most satisfied possible user across possible recall requirements. This perspective allows us to generalize reciprocal rank and define a new preference-based evaluation we call lexicographic precision or lexiprecision. By mathematical construction, we ensure that lexiprecision preserves differences detected by reciprocal rank, while empirically improving sensitivity and robustness across a broad set of retrieval and recommendation tasks.
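    As a rough illustration of the ideas in the abstract above, the sketch below computes reciprocal rank and a simplified lexicographic ("best-case") preference between two rankings; the paper's formal definition of lexiprecision may differ in detail, so treat this as an assumption-laden approximation rather than the proposed measure.

```python
# Hedged sketch: reciprocal rank plus a simplified lexicographic preference
# between two rankings over the same relevant set. This approximates the idea
# described in the abstract; it is not the paper's formal lexiprecision.
def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def lexicographic_preference(ranking_a, ranking_b, relevant):
    """Compare the sorted ranks of relevant items lexicographically.
    Returns 1 if ranking_a is preferred, -1 if ranking_b is, 0 if tied."""
    ranks_a = sorted(i for i, d in enumerate(ranking_a, 1) if d in relevant)
    ranks_b = sorted(i for i, d in enumerate(ranking_b, 1) if d in relevant)
    for ra, rb in zip(ranks_a, ranks_b):
        if ra != rb:
            return 1 if ra < rb else -1
    if len(ranks_a) != len(ranks_b):  # one run retrieved more relevant items
        return 1 if len(ranks_a) > len(ranks_b) else -1
    return 0

relevant = {"d2", "d5"}
run_a = ["d1", "d2", "d3", "d4", "d5"]
run_b = ["d1", "d2", "d5", "d3", "d4"]
print(reciprocal_rank(run_a, relevant), reciprocal_rank(run_b, relevant))  # both 0.5
print(lexicographic_preference(run_a, run_b, relevant))  # -1: run_b is preferred
```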