Proceedings of the Eleventh International Workshop on Evaluating Information Access (EVIA 2025),
a Satellite Workshop of the NTCIR-18 Conference
June 10, 2025
National Institute of Informatics, Tokyo, Japan
-
Charles Clarke and Laura Dietz
[Pdf]
The use of large language models (LLMs) for relevance assessment in
information retrieval has gained significant attention, with recent
studies suggesting that LLM-based judgments provide evaluations
comparable to human judgments. Notably, based on TREC 2024 data,
Upadhyay et al. (2024) make a bold claim that LLM-based relevance
assessments, such as those generated by the Umbrela system, can fully
replace traditional human relevance assessments in TREC-style
evaluations. This paper critically examines this claim, highlighting
practical and theoretical limitations that undermine the validity of
this conclusion. First, we question whether the evidence provided by
Upadhyay et al. genuinely supports their claim, particularly when the
test collection is intended to serve as a benchmark for future
research innovations. Second, we submit a system deliberately crafted
to exploit automatic evaluation metrics, demonstrating that it can
achieve artificially inflated scores without truly improving retrieval
quality. Third, we simulate the consequences of circularity by
analyzing Kendall's tau correlations under the hypothetical scenario
in which all systems adopt Umbrela as a final-stage re-ranker,
illustrating how reliance on LLM-based assessments can distort system
rankings. Theoretical challenges – including the inherent narcissism
of LLMs, the risk of overfitting to LLM-based metrics, and the
potential degradation of future LLM performance – must be addressed
before LLM-based relevance assessments can be considered a viable
replacement for human judgments.
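The ranking-correlation analysis described above can be illustrated
with a minimal sketch: given per-system effectiveness scores computed
once under human judgments and once under LLM-based judgments,
Kendall's tau quantifies how well the two induced system orderings
agree. The system names and scores in the sketch are hypothetical
placeholders, not figures from the paper.

# Illustrative sketch (not the authors' code): comparing system
# rankings induced by two sets of relevance judgments with Kendall's tau.
from scipy.stats import kendalltau

# Hypothetical mean nDCG per system under human qrels and under LLM qrels.
human_scores = {"sysA": 0.61, "sysB": 0.58, "sysC": 0.55, "sysD": 0.49}
llm_scores   = {"sysA": 0.64, "sysB": 0.52, "sysC": 0.57, "sysD": 0.50}

systems = sorted(human_scores)
tau, p_value = kendalltau(
    [human_scores[s] for s in systems],
    [llm_scores[s] for s in systems],
)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")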
-
Ying-Chu Yu, Sieh-Chuen Huang and Hsuan-Lei Shao
[Pdf]
This study investigates the legal reasoning abilities of
Large Language Models (LLMs) in Taiwan’s Status Law (family
and inheritance law) and evaluates the effects of
Chain-of-Thought (CoT) prompting on answer quality. Six
essay questions from past judicial and graduate law exams
were decomposed into 68 sub-questions targeting issue
spotting, statutory application, legal reasoning, and
property calculation. Four LLMs (ChatGPT-4o, Gemini,
Copilot, and Grok3) were evaluated using a two-stage
framework: decomposed sub-question accuracy (Stage 1) and
full-length essay response performance with and without CoT
prompting (Stage 2), with human scoring conducted by a law
professor and a student.
Results show that CoT prompting consistently improves legal
reasoning quality across models, notably enhancing issue
coverage, statutory citation accuracy, and reasoning
structure. Gemini achieved the most significant accuracy
gains (from 83.2% to 94.5%, p < 0.05) and was selected for
detailed qualitative analysis. Beyond model-specific
findings, this study contributes to retrieval evaluation
research by addressing statistical consistency challenges
in human scoring, proposing a diagnostic evaluation method
adaptable for multilingual and multimedia legal corpora,
and suggesting extensions for evaluating enterprise-level
legal information systems. These findings underscore the
value of structured prompting strategies in supporting more
interpretable, transferable, and scalable legal AI
evaluation frameworks.
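As a minimal illustration of how a paired accuracy gain on decomposed
sub-questions might be tested for significance (the paper's exact
statistical procedure is not reproduced here), the sketch below
applies an exact McNemar-style binomial test to the discordant pairs
of with/without-CoT outcomes. The 0/1 vectors are fabricated
placeholders, not the study's data.

# Illustrative sketch (not the authors' analysis pipeline): testing
# whether a CoT-vs-no-CoT accuracy gain on paired sub-questions is
# significant, via an exact binomial test on the discordant pairs.
from scipy.stats import binomtest

without_cot = [1, 0, 1, 1, 0, 1, 0, 1]  # 1 = sub-question answered correctly
with_cot    = [1, 1, 1, 1, 0, 1, 1, 1]

only_cot   = sum(1 for a, b in zip(without_cot, with_cot) if a == 0 and b == 1)
only_plain = sum(1 for a, b in zip(without_cot, with_cot) if a == 1 and b == 0)

result = binomtest(only_cot, only_cot + only_plain, p=0.5)
print(f"discordant pairs: {only_cot + only_plain}, p = {result.pvalue:.3f}")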
-
Tetsuya Sakai, Sijie Tao and Young-In Song
[Pdf]
The Conversational Search (CS) Subtask of the NTCIR-18 FairWeb-2 Task
used Sakai's GFRC (Group Fairness and Relevance for Conversations)
measure for evaluating the participating systems. As the Relevance and
Group Fairness components were not directly integrated in GFRC and the
measure lacked a clear user model, the present pilot study discusses
an alternative called GFRC2. By directly transferring the general idea
of the GFR (Group Fairness and Relevance) framework for web search to
the task of evaluating generated conversations, we formulate GFRC2 as
a form of expected user experience for a population of users who go
through the words within the conversation. This also lets us visualise
the Relevance and Group Fairness component scores for each cluster of
users who are assumed to abandon the conversation at a particular
relevant nugget. We demonstrate the steps of computing GFRC2 using
real runs from the FairWeb-2 CS Subtask.
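The exact definition of GFRC2 is given in the paper; the sketch below
only illustrates, with assumed placeholder numbers, the general shape
of an expected-user-experience measure in which each user cluster
abandons the conversation at a particular relevant nugget and
contributes its Relevance and Group Fairness component scores weighted
by its abandonment probability. The equal weighting of the two
components is an assumption made for illustration only.

# Illustrative sketch (not the GFRC2 formula from the paper): an
# expectation over user clusters, where each cluster is assumed to
# abandon the conversation at a particular relevant nugget.
# All probabilities and component scores below are placeholders.

# (abandonment_probability, relevance_score, group_fairness_score),
# one tuple per relevant nugget at which a user cluster stops reading
user_clusters = [
    (0.5, 0.40, 0.30),  # users who stop at the first relevant nugget
    (0.3, 0.65, 0.55),  # users who read up to the second relevant nugget
    (0.2, 0.80, 0.70),  # users who read up to the third relevant nugget
]

expected_experience = sum(
    prob * 0.5 * (relevance + fairness)  # equal weighting is an assumption
    for prob, relevance, fairness in user_clusters
)
print(f"expected user experience = {expected_experience:.3f}")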