Proceedings of the Eleventh International Workshop on Evaluating Information Access (EVIA 2025),
a Satellite Workshop of the NTCIR-18 Conference
June 10, 2025
National Institute of Informatics, Tokyo, Japan
-
Charles Clarke and Laura Dietz
[Pdf]
The use of large language models (LLMs) for relevance assessment in
information retrieval has gained significant attention, with recent
studies suggesting that LLM-based judgments provide evaluations
comparable to human judgments. Notably, based on TREC 2024 data,
Upadhyay et al. (2024) make a bold claim that LLM-based relevance
assessments, such as those generated by the Umbrela system, can fully
replace traditional human relevance assessments in TREC-style
evaluations. This paper critically examines this claim, highlighting
practical and theoretical limitations that undermine the validity of
this conclusion. First, we question whether the evidence provided by
Upadhyay et al. genuinely supports their claim, particularly when the
test collection is intended to serve as a benchmark for future
research innovations. Second, we submit a system deliberately crafted
to exploit automatic evaluation metrics, demonstrating that it can
achieve artificially inflated scores without truly improving retrieval
quality. Third, we simulate the consequences of circularity by
analyzing Kendall's tau correlations under the hypothetical scenario
in which all systems adopt Umbrela as a final-stage re-ranker,
illustrating how reliance on LLM-based assessments can distort system
rankings. Theoretical challenges – including the inherent narcissism
of LLMs, the risk of overfitting to LLM-based metrics, and the
potential degradation of future LLM performance – must be addressed
before LLM-based relevance assessments can be considered a viable
replacement for human judgments.
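The ranking-correlation analysis described above can be illustrated
with a minimal sketch: given per-system effectiveness scores computed
once under human judgments and once under LLM-based judgments,
Kendall's tau quantifies how well the two induced system orderings
agree. The system names and scores in the sketch are hypothetical
placeholders, not figures from the paper.

# Illustrative sketch (not the authors' code): comparing system
# rankings induced by two sets of relevance judgments with Kendall's tau.
from scipy.stats import kendalltau

# Hypothetical mean nDCG per system under human qrels and under LLM qrels.
human_scores = {"sysA": 0.61, "sysB": 0.58, "sysC": 0.55, "sysD": 0.49}
llm_scores   = {"sysA": 0.64, "sysB": 0.52, "sysC": 0.57, "sysD": 0.50}

systems = sorted(human_scores)
tau, p_value = kendalltau(
    [human_scores[s] for s in systems],
    [llm_scores[s] for s in systems],
)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")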
-
Ying-Chu Yu, Sieh-Chuen Huang and Hsuan-Lei Shao
[Pdf]
This study investigates the legal reasoning abilities of
Large Language Models (LLMs) in Taiwan’s Status Law (family
and inheritance law) and evaluates the effects of
Chain-of-Thought (CoT) prompting on answer quality. Six
essay questions from past judicial and graduate law exams
were decomposed into 68 sub-questions targeting issue
spotting, statutory application, legal reasoning, and
property calculation. Four LLMs (ChatGPT-4o, Gemini,
Copilot, and Grok3) were evaluated using a two-stage
framework: decomposed sub-question accuracy (Stage 1) and
full-length essay response performance with and without CoT
prompting (Stage 2), with human scoring conducted by a law
professor and a student.
Results show that CoT prompting consistently improves legal
reasoning quality across models, notably enhancing issue
coverage, statutory citation accuracy, and reasoning
structure. Gemini achieved the most significant accuracy
gains (from 83.2% to 94.5%, p < 0.05) and was selected for
detailed qualitative analysis. Beyond model-specific
findings, this study contributes to retrieval evaluation
research by addressing statistical consistency challenges
in human scoring, proposing a diagnostic evaluation method
adaptable for multilingual and multimedia legal corpora,
and suggesting extensions for evaluating enterprise-level
legal information systems. These findings underscore the
value of structured prompting strategies in supporting more
interpretable, transferable, and scalable legal AI
evaluation frameworks.
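As a minimal illustration of how a paired accuracy gain on decomposed
sub-questions might be tested for significance (the paper's exact
statistical procedure is not reproduced here), the sketch below
applies an exact McNemar-style binomial test to the discordant pairs
of with/without-CoT outcomes. The 0/1 vectors are fabricated
placeholders, not the study's data.

# Illustrative sketch (not the authors' analysis pipeline): testing
# whether a CoT-vs-no-CoT accuracy gain on paired sub-questions is
# significant, via an exact binomial test on the discordant pairs.
from scipy.stats import binomtest

without_cot = [1, 0, 1, 1, 0, 1, 0, 1]  # 1 = sub-question answered correctly
with_cot    = [1, 1, 1, 1, 0, 1, 1, 1]

only_cot   = sum(1 for a, b in zip(without_cot, with_cot) if a == 0 and b == 1)
only_plain = sum(1 for a, b in zip(without_cot, with_cot) if a == 1 and b == 0)

result = binomtest(only_cot, only_cot + only_plain, p=0.5)
print(f"discordant pairs: {only_cot + only_plain}, p = {result.pvalue:.3f}")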
-
Tetsuya Sakai, Sijie Tao and Young-In Song
[Pdf]
The Conversational Search (CS) Subtask of the NTCIR-18 FairWeb-2 Task
used Sakai's GFRC (Group Fairness and Relevance for Conversations)
measure for evaluating the participating systems. As the Relevance and
Group Fairness components were not directly integrated in GFRC and the
measure lacked a clear user model, the present pilot study discusses
an alternative called GFRC2. By directly transferring the general idea
of the GFR (Group Fairness and Relevance) framework for web search to
the task of evaluating generated conversations, we formulate GFRC2 as
a form of expected user experience for a population of users who go
through the words within the conversation. This also lets us visualise
the Relevance and Group Fairness component scores for each cluster of
users who are assumed to abandon the conversation at a particular
relevant nugget. We demonstrate the steps of computing GFRC2 using
real runs from the FairWeb-2 CS Subtask.
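The exact definition of GFRC2 is given in the paper; the sketch below
only illustrates, with assumed placeholder numbers, the general shape
of an expected-user-experience measure in which each user cluster
abandons the conversation at a particular relevant nugget and
contributes its Relevance and Group Fairness component scores weighted
by its abandonment probability. The equal weighting of the two
components is an assumption made for illustration only.

# Illustrative sketch (not the GFRC2 formula from the paper): an
# expectation over user clusters, where each cluster is assumed to
# abandon the conversation at a particular relevant nugget.
# All probabilities and component scores below are placeholders.

# (abandonment_probability, relevance_score, group_fairness_score),
# one tuple per relevant nugget at which a user cluster stops reading
user_clusters = [
    (0.5, 0.40, 0.30),  # users who stop at the first relevant nugget
    (0.3, 0.65, 0.55),  # users who read up to the second relevant nugget
    (0.2, 0.80, 0.70),  # users who read up to the third relevant nugget
]

expected_experience = sum(
    prob * 0.5 * (relevance + fairness)  # equal weighting is an assumption
    for prob, relevance, fairness in user_clusters
)
print(f"expected user experience = {expected_experience:.3f}")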