The 18th NTCIR Conference
Evaluation of Information Access Technologies
June 10-13, 2025
National Institute of Informatics, Tokyo, Japan
-
Makoto P. Kato, Noriko Kando, Charles L. A. Clarke and
Yiqun Liu
[Pdf]
[Table of Content]
-
Chung-Chi Chen, Qingyao Ai and Shoko Wakamiya
[Pdf]
[Table of Content]
The NTCIR project, organized by the National Institute of
Informatics (NII) in Japan, has been a key platform for
information retrieval (IR) and natural language processing
(NLP) research since 1997. NTCIR-18, running from January
2024 to June 2025, features seven core tasks and three
pilot tasks covering LLM evaluation, advanced IR,
domain-specific NLP, and personal data management. A total
of 113 teams worldwide participated, registering 178 times
across tasks. This paper provides an overview of NTCIR-18,
highlighting its objectives, methodologies, and key
findings, along with future directions.
-
Maarten de Rijke
[Pdf]
[Table of Content]
-
Douglas W. Oard
[Pdf]
[Table of Content]
-
Mark Sanderson
[Pdf]
[Table of Content]
Core Tasks
-
Junjie Chen, Haitao Li, Zhumin Chu, Yiqun Liu and Qingyao Ai
[Pdf]
[Table of Content]
In this paper, we provide an overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) task. As large language models (LLMs) grow popular in both academia and industry, how to effectively evaluate their capabilities becomes an increasingly critical yet challenging issue. Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations in task format (most benchmarks are multiple-choice questions) and evaluation criteria (dominated by reference-based metrics). To advance innovation in automatic evaluation, we propose the AEOLLM task, which focuses on generative tasks and encourages reference-free methods. In addition, we set up diverse subtasks such as dialogue generation, text expansion, summary generation, and non-factoid question answering to comprehensively test different methods. This year, we received 48 runs from 4 teams in total. This paper describes the background of the task, the dataset, the evaluation measures, and the evaluation results.
-
Xiao Fu, Navdeep Singh Bedi, Noriko Kando, Fabio Crestani
and Aldo Lipani
[Pdf]
[Table of Content]
We propose an efficient evaluation pipeline for
Retrieval-Augmented Generation (RAG) systems tailored for
low-resource settings. Our method uses ensemble similarity
measures combined with a logistic regression classifier to
assess answer quality from multiple system outputs using
only the available queries and replies. Experiments across
diverse tasks demonstrate competitive accuracy and a
reasonable correlation with ground truth rankings,
establishing our approach as a reliable metric.
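A minimal sketch of such a pipeline is shown below; the specific features are illustrative assumptions on our part, not the authors' exact ensemble. A few inexpensive similarity signals between a query and a system reply are combined by a logistic regression into a quality judgment.

```python
# Illustrative similarity features (our assumptions, not the authors' exact
# ensemble) combined by a logistic regression into a quality judgment.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

def similarity_features(queries, replies):
    vec = TfidfVectorizer().fit(queries + replies)
    feats = []
    for q, r in zip(queries, replies):
        tfidf_cos = cosine_similarity(vec.transform([q]), vec.transform([r]))[0, 0]
        q_tok, r_tok = set(q.lower().split()), set(r.lower().split())
        overlap = len(q_tok & r_tok) / max(len(q_tok), 1)  # query-term coverage
        length = min(len(r_tok), 50) / 50.0                # crude length signal
        feats.append([tfidf_cos, overlap, length])
    return np.array(feats)

queries = ["what is the capital of france", "define retrieval augmented generation"]
replies = ["the capital of france is paris", "bananas are yellow"]
labels = [1, 0]  # 1 = acceptable reply, 0 = poor reply
clf = LogisticRegression().fit(similarity_features(queries, replies), labels)
```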
-
Yumi Kim, Meen Chul Kim and Jongwook Lee
[Pdf]
[Table of Content]
In this study, we aim to propose automated evaluation
methods of LLMs that approximate human judgment by
exploring and comparing two distinct approaches: (1)
LLM-based scoring, which utilizes GPT models with prompt
engineering, and (2) feature-based machine learning, using
transformer-based metrics such as BERTScore, semantic
similarity, and keyword coverage. As part of this research,
we participated in the NTCIR-18 Automatic Evaluation of
LLMs (AEOLLM) task.
We submitted the results of the test data set and the
reserved data set to NTCIR-18 and analyzed the results
obtained. The results show that GPT-4o Mini (with the
updated prompt) achieved the highest performance, while the
feature-based approach performed competitively, surpassing
GPT-3.5 Turbo and showing a small gap with GPT-4o Mini.
LLM-based methods offered scalability but lacked
explainability, whereas feature-based approaches provided
better interpretability but required extensive tuning,
highlighting the trade-offs between the two strategies.
We expect the findings of our work to provide insights into human judgment and the automated evaluation of LLMs.
-
Chia-Hui Lin, Cen-Chieh Chen, Tao-Hsing Chang and Fu-Yuan
Hsu
[Pdf]
[Table of Content]
In recent years, large language models (LLMs) have been
widely applied to various natural language processing (NLP)
tasks, demonstrating exceptional performance. To evaluate
the output quality of these LLMs, numerous studies utilize
one LLM as an evaluator to assess the quality of outputs
from other LLMs, showing promising results on public
benchmarks. However, the performance of LLMs as evaluators
on many unpublished benchmarks still needs improvement. To
achieve better evaluation performance, some studies have
attempted to fine-tune evaluators based on large amounts of
data, incurring significant manual costs and posing
substantial limitations in practical applications.
Therefore, this paper leverages data augmentation to
increase the volume of training data and employs the odds
ratio preference optimization (ORPO) algorithm for
reinforcement learning to optimize the evaluator. This
study uses the dataset provided by NTCIR-18’s Automatic
Evaluation of LLMs (AEOLLM) task for training and testing.
The proposed method achieves an accuracy of 0.7658 on the
summary generation subtask of AEOLLM, the highest among all
compared models. Additionally, it yields the second-highest
performance in both Kendall’s tau and Spearman correlation
coefficient on the summary generation and text expansion
subtasks among all compared models.
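For readers unfamiliar with ORPO, the following is a hedged sketch of preference-based fine-tuning of an evaluator using Hugging Face TRL; the base model, the single toy preference pair, and the hyperparameters are placeholders, and the argument names follow recent TRL releases rather than the paper's exact setup.

```python
# Hedged sketch of ORPO fine-tuning of an LLM evaluator with Hugging Face
# TRL; model, data, and hyperparameters are placeholders, not the paper's
# setup. Argument names follow recent TRL releases (e.g., processing_class).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

name = "Qwen/Qwen2-0.5B-Instruct"  # stand-in base model
model = AutoModelForCausalLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

# ORPO learns from preference pairs: the rating the evaluator should give
# (chosen) versus a wrong one (rejected) for the same evaluation prompt.
train = Dataset.from_list([{
    "prompt": "Rate the following summary from 1 to 5.\nSummary: ...",
    "chosen": "4",
    "rejected": "1",
}])

args = ORPOConfig(output_dir="orpo-evaluator", beta=0.1,
                  per_device_train_batch_size=1)
ORPOTrainer(model=model, args=args, train_dataset=train,
            processing_class=tokenizer).train()
```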
-
Lang Mei, Chong Chen and Jiaxin Mao
[Pdf]
[Table of Content]
As large language models (LLMs) gain widespread attention
in both academia and industry, it becomes increasingly
critical and challenging to effectively evaluate their
capabilities. Existing evaluation methods can be broadly
categorized into two types: manual evaluation and automatic
evaluation. Manual evaluation, while comprehensive, is
often costly and resource-intensive. Conversely, automatic
evaluation offers greater scalability but is constrained by
the limitations of its evaluation criteria (dominated by
reference-based answers). To address these challenges, NTCIR-18 (https://research.nii.ac.jp/ntcir/ntcir-18/tasks.html#AEOLLM) introduced the AEOLLM (Automatic Evaluation of LLMs) task, aiming to encourage reference-free evaluation methods that can overcome the limitations of existing approaches.
In this paper, to enhance the evaluation performance of the
AEOLLM task, we propose three key methods to improve the
reference-free evaluation: 1) Multi-model Collaboration:
Leveraging multiple LLMs to approximate human ratings
across various subtasks; 2) Prompt Auto-optimization:
Utilizing LLMs to iteratively refine the initial task
prompts based on evaluation feedback from training samples;
and 3) In-context Learning (ICL) Optimization: Based on the
multi-task evaluation feedback, we train a specialized
in-context example retrieval model, combined with a
semantic relevance retrieval model, to jointly identify the
most effective in-context learning examples.
Experiments conducted on the final dataset demonstrate that
our approach achieves superior performance on the AEOLLM
task.
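As an illustration of the multi-model collaboration step, the sketch below averages the numeric ratings returned by several LLM judges; ask_judge and the model names are hypothetical placeholders around whichever chat API serves each model.

```python
# Hypothetical sketch of multi-model collaboration: average the ratings
# returned by several LLM judges to approximate a human rating.
from statistics import mean

JUDGES = ["judge-model-a", "judge-model-b", "judge-model-c"]  # placeholders

def ask_judge(model: str, prompt: str) -> float:
    raise NotImplementedError("call the model's API and parse a numeric rating")

def collaborative_score(task_prompt: str, answer: str) -> float:
    prompt = f"{task_prompt}\n\nAnswer to evaluate:\n{answer}\nRate from 1 to 5:"
    return mean(ask_judge(m, prompt) for m in JUDGES)
```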
-
Sijie Tao, Tetsuya Sakai, Junjie Wang, Hanpei Fang, Yuxiang
Zhang, Haitao Li, Yiteng Tu, Nuo Chen and Maria Maistro
[Pdf]
[Table of Content]
This paper provides an overview of the NTCIR-18 FairWeb-2
Task. Our task considers not only document relevance but
also group fairness. We designed two subtasks: the Web
Search Subtask, and the Conversational Search Subtask. We
designed three types of search topics for this task: researchers (R), movies (M), and YouTube content (Y). For each topic type, attribute sets are defined for considering group fairness. For the Web Search Subtask, we received 23 runs from five teams, including six runs from the organisers' team. For the Conversational Search Subtask, we received four runs from two teams, including one run from the organisers' team. In this paper, we describe the task, the test collection construction, and the official evaluation results of the submitted runs.
-
Clara Rus, Jasmin Kareem, Chen Xu, Yuanna Liu, Zhirui Deng
and Maria Heuss
[Pdf]
[Table of Content]
Balancing utility and fairness in the search results is an
important and challenging problem for the IR community. The
FairWeb-2 Task of NTCIR-18 aims to tackle this using three
main search topics: movies, researchers and YouTube videos.
This paper presents the approach employed by the AMS42 team
as part of the FairWeb-2 Task of NTCIR-18. The AMS42 team
submitted five runs. First, we focus on retrieving documents relevant to the given queries. Next, we employ two fairness approaches: one makes use of estimated sensitive attribute values to balance relevance and fairness in the retrieved results, and the other relies on the model's semantic understanding of sensitive attribute values derived from the document content. Finally, we discuss the challenges identified while working on the FairWeb-2 Task.
-
Atsuya Ishikawa, Sijie Tao and Tetsuya Sakai
[Pdf]
[Table of Content]
This report presents the participation of the RSLFW team in the NTCIR-18 FairWeb-2 task. We implemented several retrieval methods to generate five runs using BM25, ColBERT, and the PM-2 algorithm. In addition to the submitted runs, the results are analyzed through comparison with the official baseline and FairWeb-1 reproduction (revived) runs.
-
Narendra Kumar, Arjun Mukherjee, Sukomal Pal and Thomas
Mandl
[Pdf]
[Table of Content]
As information retrieval systems become increasingly sophisticated, ensuring fairness and algorithmic neutrality in search results has emerged as a critical challenge. Traditional ranking algorithms often prioritize relevance, which can unintentionally amplify the visibility of majority groups while limiting representation for minority perspectives. This imbalance can lead to biased search results that reinforce existing disparities. To address this issue, fairness-aware retrieval methods aim to ensure equitable representation by balancing relevance with exposure fairness while maintaining algorithmic neutrality. In this study, we investigate the impact of query modifications on group fairness in ranked search results. Specifically, we examine how expanding queries to encompass a broader range of relevant content influences fairness between different groups while considering their protected attributes. Our findings contribute to ongoing efforts to design information retrieval systems that provide more inclusive and bias-free access to information.
-
Amogh Raina and Tetsuya Sakai
[Pdf]
[Table of Content]
This paper describes our participation in the Conversational Search Subtask of the FairWeb-2 Task at NTCIR-18. Our system, COPWA, was designed to balance conversational relevance and group fairness while retrieving entities from researcher, movie, and YouTube content topics. We detail our approach, evaluation results, and analysis of our system's performance using the GFRC (Group Fairness and Relevance of Conversations) framework.
-
Huixue Su, Haitao Li, Yiteng Tu, Qingyao Ai and Yiqun Liu
[Pdf]
[Table of Content]
The fairness of search systems remains a critical challenge
in information retrieval. Building upon our previous work
in FairWeb-1, this paper presents the THUIR team’s approach in the NTCIR-18 FairWeb-2 Task. Specifically, we developed
a simple yet effective retrieval pipeline that integrates
multiple neural rerankers with results aggregated via
Reciprocal Rank Fusion to generate balanced search rankings
across various entity types. Additionally, we submitted a
revived run that combines a PM2-based result
diversification algorithm with dense retrieval scores. Our
experimental results yield competitive performance on
multiple evaluation metrics, demonstrating that
enhancements in retrieval relevance inherently promote
balanced group fairness. With the right combination of
techniques, it is possible to achieve a synergistic
reinforcement between relevance and fairness.
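Reciprocal Rank Fusion itself is compact enough to sketch: score(d) = sum over rankers of 1/(k + rank(d)). The snippet below shows the standard RRF formula (with the customary k = 60 from Cormack et al., 2009), not the team's full pipeline; the rerankers feeding it are abstracted as ranked lists of document ids.

```python
# Reciprocal Rank Fusion over the ranked lists produced by individual
# rerankers; k = 60 follows common practice (Cormack et al., 2009).
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # higher for consistently top docs
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d1", "d4"]])
```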
-
Chung-Chi Chen, Chin-Yi Lin, Cheng-Chih Chiu, Hen-Hsen
Huang, Alaa Alhamzeh, Yu-Lieh Huang, Hiroya Takamura and
Hsin-Hsi Chen
[Pdf]
[Table of Content]
This paper provides an overview of the FinArg-2 shared
tasks in NTCIR-18. Building upon the fundamental argument
identification tasks in FinArg-1, this iteration focuses on
temporal inference. Forward-looking statements frequently
appear in financial documents, and we aim to capture the
duration of a premise's impact on a company's operations,
the temporal reference associated with an argument, and the
validity period of a claim. Similar to FinArg-1, we utilize
earnings conference calls, professional research reports,
and social media data for analysis. A total of 20 teams
registered for FinArg-2, with 7 active teams submitting
their results.
-
Adhitia Erfina and Phuong Le-Hong
[Pdf]
[Table of Content]
FinArg-2 is part of the NTCIR Financial Argument shared task series, which aims to improve argument understanding in financial analysis. FinArg-2 introduces "Temporal Inference of Financial Arguments", focusing on the assessment of temporal information, a distinctive phenomenon in financial opinions. FTRI participates in FinArg-2 on the Earnings Conference Calls (ECC) subtask, where models must identify the temporal reference associated with an argument. In the initial stage, we conducted experiments on variations of transformer models using several configurations at the preprocessing and training stages. BERT-Base-Uncased, BERT-Large-Uncased, and RoBERTa-Base-Uncased showed slightly superior performance compared to the other models, so we fine-tuned only those models as our baselines. For our first run, FTRI_ECC_1, we used a transformer-encoder approach with BERT-Large, resulting in 71.43% Micro F1 and 68.58% Macro F1. For our second run, FTRI_ECC_2, we applied an attention mask over the Claim, Premise, and (Year + Quarter) fields with BERT-Base, resulting in 69.05% Micro F1 and 65.76% Macro F1. For our third run, FTRI_ECC_3, we used TF-IDF (Claim + Premise) features with one-hot encoding (Year + Quarter) and BERT-Base, resulting in 77.38% Micro F1 and 75.07% Macro F1, the best result in the ECC subtask. The evaluation results show that our three runs rank in the top 4 among all participants based on Micro and Macro F1.
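A simplified sketch of the FTRI_ECC_3 feature construction follows: TF-IDF over the claim and premise text plus one-hot year/quarter indicators. A linear classifier stands in here for the paper's BERT-Base model, and the data is toy data.

```python
# TF-IDF (Claim + Premise) features fused with one-hot (Year + Quarter)
# indicators; a linear classifier stands in for the paper's BERT-Base.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

texts = ["claim text. premise text.", "another claim. another premise."]  # toy
year_quarter = [["2024", "Q1"], ["2023", "Q4"]]
labels = [0, 1]  # toy temporal-reference classes

X = hstack([TfidfVectorizer().fit_transform(texts),
            OneHotEncoder(handle_unknown="ignore").fit_transform(year_quarter)])
clf = LogisticRegression().fit(X, labels)
```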
-
Xuan-Yu You, Di Jie Liew, Wen-Chao Yeh and Yung-Chun Chang
[Pdf]
[Table of Content]
The TMUNLPG1 team participated in the FinArg-2 Task of
NTCIR-18, focusing on the Detection of Argument Temporal
References and Assessment of the Claim's Validity Period in
the finance domain using Earnings Conference Call and Social
Media datasets. The team ranked 6th and 2nd in these
subtasks, respectively. This paper presents the team's
methodologies, results, and conclusions. For Earnings
Conference Call (ECC) Argument Temporal References, we
utilized a combination of feature engineering, ensemble
strategy, and data augmentation to achieve a Micro F1 score
of 0.6905. In Social Media Assessment of the Claim's
Validity Period, we developed an enhanced approach
combining domain-specific transformer architectures with
statistical feature engineering. By integrating FinBERT
with Log-Likelihood Ratio (LLR) and Pointwise Mutual
Information (PMI) features, we achieved a Micro F1 score of
0.742 on the unified dataset and demonstrated robust
performance on the test set. The methodology incorporates
weighted pooling strategies and adaptive learning rate
optimization to improve temporal validity prediction
accuracy. Our results highlight the effectiveness of
combining domain-specific language models with traditional
statistical approaches in financial text analysis,
contributing to advancements in temporal natural language
processing for the financial domain.
-
Bor-Jen Chen, Wen-Hsin Hsiao, Jun-Yu Wu, Cheng-Yun Wu and
Min-Yuh Day
[Pdf]
[Table of Content]
The increasing availability of financial texts from
earnings conference calls (ECCs) and social media has
created a need for advanced natural language processing
(NLP) techniques to extract meaningful insights. This study
develops a classification framework that integrates
fine-tuning and prompt-based learning to improve financial
argument classification. We apply this framework to two
tasks from the NTCIR-18 FinArg-2 competition: detecting
temporal references in ECCs and assessing the validity
period of claims in social media. Encoder-based models are
fine-tuned for structured classification, while
decoder-based models leverage both fine-tuning and
prompt-based learning. Data augmentation techniques enhance
model generalization, and performance is evaluated using
Micro-F1 and Macro-F1 scores. The primary contribution of
this research is demonstrating how fine-tuning and
prompt-based learning can complement each other in
financial NLP. By optimizing classification strategies,
this study provides insights for improving argument
analysis in financial applications, benefiting researchers,
practitioners, and FinTech developers.
-
Takahiro Kawamoto and Xin Kang
[Pdf]
[Table of Content]
This paper presents our participation in FinArg-2, which succeeds the FinArg-1 task. While FinArg-1 focused on sentiment analysis and argument classification, FinArg-2 extends this to temporal inference. We experiment with a method that classifies text into two types, "Premise" and "Claim." Based on these premises and claims, we developed a method for accurately classifying the temporal relationships between sentences. To classify sentences, we trained a classification model on labeled data and compared traditional machine learning approaches with models that use large-scale language models. Among the models tested, DeBERTa and Llama achieved the highest classification accuracy, demonstrating that models based on large-scale language models produce superior results.
-
Tong-Ru Wu and Jheng-Long Wu
[Pdf]
[Table of Content]
Large Language Models (LLMs) have shown promising
capabilities for zero-shot text classification, yet they
often do not outperform fine-tuned traditional models like
BERT when trained on sufficient labeled data. However,
acquiring large-scale human-labeled datasets can be
challenging, particularly in specialized domains. To
address this gap, we propose Repeat-Error-Correction
Learning, a framework that iteratively identifies and
rewrites misclassified samples to augment the training set.
First, we train a base BERT model using available
text–label pairs. Next, the trained model infers labels on
the same dataset, and we collect the misclassified samples.
An LLM, such as GPT-4o-mini, then rewrites these erroneous
texts while preserving their original labels. The rewritten
texts are reintroduced into the training set, and the model
is fine-tuned on this expanded corpus. By iteratively
refining the training data through error correction and
text rewriting, the proposed method aims to achieve robust
classification performance despite limited initial
annotations. Our results indicate that fine-tuning the base
model by adding rewritten misclassified text achieved the
highest validation set Micro-F1 score (77.33%). These
findings contribute to a deeper understanding of a
cost-friendly and efficient way to generate data for
augmenting text classification models.
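The loop can be sketched as follows; fine_tune_bert and rewrite_with_llm are hypothetical helpers standing in for the paper's BERT fine-tuning and GPT-4o-mini rewriting steps.

```python
# Sketch of the Repeat-Error-Correction loop described above.
# fine_tune_bert and rewrite_with_llm are hypothetical stand-in helpers.
def fine_tune_bert(texts, labels):
    raise NotImplementedError("fine-tune a BERT classifier on (texts, labels)")

def rewrite_with_llm(text: str) -> str:
    raise NotImplementedError("ask the LLM to rewrite the text, keeping its meaning")

def repeat_error_correction(train_texts, train_labels, rounds: int = 3):
    texts, labels = list(train_texts), list(train_labels)
    model = fine_tune_bert(texts, labels)
    for _ in range(rounds):
        preds = model.predict(texts)
        wrong = [i for i, p in enumerate(preds) if p != labels[i]]
        if not wrong:
            break
        # Rewrite misclassified samples, keep their original labels,
        # and fine-tune again on the expanded corpus.
        texts += [rewrite_with_llm(texts[i]) for i in wrong]
        labels += [labels[i] for i in wrong]
        model = fine_tune_bert(texts, labels)
    return model
```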
-
Pan Hongrui and Wu Jheng-Long
[Pdf]
[Table of Content]
Social media claims often have shifting validity that influences downstream tasks like misinformation detection, financial predictions, and domain-specific decisions. This study proposes a novel approach that merges original text with automatically generated template text to highlight temporal cues. By integrating this enriched data into the training process, the model more effectively gauges how long a claim remains reliable, even when its relevance rapidly evolves. This strategy addresses the challenge of ephemeral statements whose validity fluctuates as new information emerges. Experimental results underscore the method's effectiveness, achieving a macro-F1 score of 78.10%. These findings highlight the importance of systematically assessing claim longevity, providing a pathway to more robust content analysis and better-informed decisions in ever-changing online environments.
-
Min-Chin Ho and Jheng-Long Wu
[Pdf]
[Table of Content]
The SCU-1 team participated in the "Detection of Argument
Temporal References in Earnings Conference Calls" subtask
of the NTCIR-18 FinArg-2 task. This study reports our
approach to solving the problem and discusses the official
results. We analyze the impact of step-by-step reasoning,
model collaboration, and prompt design on the
classification performance of large language models (LLMs).
Through a series of experiments, we found that providing
detailed explanations and incorporating previous model
predictions significantly improved classification accuracy.
Additionally, we compared different LLM discussion
mechanisms and prompt design strategies, revealing that
allowing models to reference each other and reason based on
prior outputs effectively enhances decision-making quality.
Run 3, which included complete reasoning steps and prior
model outputs, achieved the best performance, highlighting
the advantages of cross-model reference and optimized
prompt design. These findings offer new directions for
improving LLM-based classification tasks.
-
Sai Saketh Nandam, Charan Srinivas Kumar Reddy Dasari and
Anand Kumar Madasamy
[Pdf]
[Table of Content]
The SCaLAR IT team participated in the Detection of Argument Temporal References subtask of the NTCIR-18 FinArg-2 Task. This paper presents our approach to the classification of financial arguments based on temporal references. We explored multiple architectures combining a BERT-based model with knowledge-based and temporal feature extraction techniques. To improve performance, we integrated BERT with TF-IDF-based temporal features extracted using STANZA and BERT embeddings to enhance temporal reference detection. Our first model, BERTForSequenceClassifier, achieves a Micro F1 score of 70.24% and a Macro F1 score of 67.85%, outperforming most approaches of other teams. However, incorporating additional temporal features improved the Macro F1 score, indicating better performance across all classes. We analyze the effectiveness of different feature representations in our research.
-
Hugo Dutra, Leonardo Martinho, Gabriel Assis, Jonnathan
Carvalho and Aline Paes
[Pdf]
[Table of Content]
This paper presents AIDAVANCE's approach to Subtask 2
(Detection of Argument Temporal References) of the NTCIR-18
FinArg-2 Task. We explored different classification
strategies, including direct multi-class classification, a
hierarchical cascade approach that first identifies the
presence of a temporal reference before further
categorization, and an LLM-based argument rewriting method.
Our best model, a fine-tuned mDeBERTa using the multi-class
approach, ranked fourth overall, achieving a Micro-F1 score
of 0.6905 and a Macro-F1 score of 0.6711. Our findings
reinforce that fine-tuning smaller encoder models remains
an effective strategy for specialized classification tasks,
even outperforming state-of-the-art LLMs.
-
Liting Zhou, Cathal Gurrin, Hsin-Hung Chen, Hideo Joho,
Chenyang Lyu, Longyue Wang, Graham Healy, Ly Duyen Tran,
Quang-Linh Tran, Hoang Bao Le, Duc-Tien Dang-Nguyen and
Tianbo Ji
[Pdf]
[Table of Content]
NTCIR-18 marked the sixth iteration of the Lifelog task,
which aims to advance research on multimodal lifelog
organization, search, and access. This task builds on
methodologies successfully deployed in previous NTCIR
conferences. In this paper, we detail the test collection,
outline the specific tasks, provide an overview of
submissions, and present findings from the NTCIR-18
Lifelog-6 task. We conclude with recommendations for future
developments in lifelog research.
-
Luca Rossetto
[Pdf]
[Table of Content]
This paper discusses vitrivr's participation in the Lifelog
Semantic Access subtask of the 6th edition of the NTCIR
Lifelog task.
It is based on the system that participated in the 2024
Lifelog Search Challenge and only replaces the interactive
query interface with an LLM-based query transformation
method.
All results are generated in one pass without any further
re-processing or refinement.
-
Quang-Linh Tran, Binh Nguyen, Gareth Jones and Cathal Gurrin
[Pdf]
[Table of Content]
We present the participation of the MemoriEase lifelog retrieval system in the NTCIR-18 Lifelog-6 Task. The current MemoriEase system is an automatic, enhanced version of the MemoriEase system from the Lifelog Search Challenge 2024 (LSC). We report our methods for the two core sub-tasks in the NTCIR-18 Lifelog-6 task: Lifelog Semantic Access (LSAT) and Lifelog Question Answering Task (LQAT). We enhance the main architecture of the MemoriEase system by utilizing the BLIP2 and CLIP embedding models to extract visual embeddings, and we compare the two models. In addition, we use pseudo-relevance feedback for ad-hoc queries. For the LQAT sub-task, we use our retrieval model as the retriever and GPT-4o as a reader to generate answers to questions. Results of the LSAT sub-task show that our system found 369 of the 1,995 relevant images. The performance on known-item search queries is higher than on ad-hoc queries, with 28.22% R@5 compared to 5.98%. In the LQAT sub-task, the LLM generated 8 correct answers out of 24 questions. Although the performance is not high, it reveals the advantages and drawbacks of the MemoriEase retrieval system and the QA model.
-
Jiahan Chen, Da Li and Keping Bi
[Pdf]
[Table of Content]
In recent years, sharing lifelogs recorded through wearable devices, such as sports watches and GoPros, has gained significant popularity. Lifelogs involve various types of information, including images, videos, and GPS data, revealing users' lifestyles, dietary patterns, and physical activities. The Lifelog Semantic Access Task (LSAT) in the NTCIR-18 Lifelog-6 Challenge focuses on retrieving relevant images from large-scale user lifelogs based on textual queries describing an action or event. It serves users' need to find images of a scenario in the historical moments of their lifelogs. We propose a multi-stage pipeline for this task of searching images with texts, addressing various challenges in lifelog retrieval. Our pipeline includes: filtering blurred images, rewriting queries to make intents clearer, extending the candidate set based on events to include images with temporal connections, and reranking results using a multimodal large language model (MLLM) with stronger relevance judgment capabilities. The evaluation results of our submissions show the effectiveness of each stage and of the entire pipeline.
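The blur-filtering stage could, for example, be implemented with the common variance-of-Laplacian heuristic (an assumption on our part; the abstract does not name its blur detector):

```python
# Variance-of-Laplacian blur filter (our assumption for stage 1): sharp
# images have high-variance second derivatives, blurry ones do not.
import cv2

def is_blurry(image_path: str, threshold: float = 100.0) -> bool:
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

image_paths = ["lifelog/0001.jpg", "lifelog/0002.jpg"]  # toy paths
candidates = [p for p in image_paths if not is_blurry(p)]  # keep sharp images
```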
-
Thang-Long Nguyen-Ho, Allie Tran, Minh-Triet Tran, Cathal
Gurrin and Graham Healy
[Pdf]
[Table of Content]
This paper presents our work in the Lifelog Semantic Access
Task (LSAT) at NTCIR-18, focusing on automatic searching
methods for finding distinct life moments. Our experiments
explore and compare different retrieval strategies,
including keyword matching-based search combined with
embedding extraction, vector embedding-based semantic
search using a multimodal model, and hybrid methods that
take advantage of both approaches. Our proposed method
improved retrieval accuracy by directing the model's
attention to key query terms while prioritizing semantic
relevance and the presence of requested entities in the
retrieved moments. Experimental results demonstrated that the best-performing method relies on embeddings incorporating extended descriptions and highlighted keywords. Conversely, the hybrid methods in our experiments were less effective, likely due to limitations in the keyword-matching search algorithm. Our findings underscore the value of richer descriptive entities within queries for enhancing the retrieval of life moments, ensuring a focus on core semantic and visual elements.
-
Eiji Aramaki, Shoko Wakamiya, Shuntaro Yada, Shohei Hisada,
Tomohiro Nishiyama, Lenard Paulo Tamayo, Jingnan Xiao,
Axalia Levenchaud, Pierre Zweigenbaum, Christoph Otto,
Jerycho Pasniczek, Philippe Thomas, Nathan Pohl, Wiebke
Duettmann, Lisa Raithel and Roland Roller
[Pdf]
[Table of Content]
This paper presents an overview of the Medical Natural
Language Processing for AI Chat (MedNLP-CHAT) task,
conducted as part of the shared task at NTCIR-18.
Recently, medical chatbot services have emerged as a
promising solution to address the shortage of medical and
healthcare professionals. However, the potential risks
associated with these chatbots remain insufficiently
understood.
Given this context, we designed the MedNLP-CHAT task to
evaluate medical chatbots from multiple risk perspectives,
including medical, legal, and ethical aspects. In this
shared task, participants were required to analyze a given
medical question along with the corresponding chatbot
response and determine whether the response posed a
potential medical, legal, or ethical risk (binary
classification).
Nine teams participated in this task, applying different approaches and yielding valuable insights.
-
Hsuan-Lei Shao, Chih-Chuan Fan, Wei-Hsin Wang and Wan-Chen
Shen
[Pdf]
[Table of Content]
The NTCIR-18 MedNLP-CHAT RISK task evaluates the potential
medical, ethical, and legal risks posed by
chatbot-generated responses to patient inquiries. This
study investigates a sentence-level risk classification
approach to identify specific sentences within chatbot
responses that contribute to risk assessment rather than
treating entire responses as monolithic risk units. Our
methodology involved automatic sentence segmentation,
contextual risk annotation, and threshold-based
classification, leveraging traditional natural language
processing (NLP) models instead of large language models
(LLMs) to ensure interpretability and stability.
Despite the conceptual validity of our approach, our system
did not perform competitively, particularly in ethical and
legal risk classification. A key limitation was using a
single model for all risk types, which failed to capture
the nuanced distinctions between medical, ethical, and
legal risk factors. Additionally, dataset constraints and
class imbalance (fewer than 30 positive samples per risk
category) limited model generalization. While
sentence-level annotation improved granularity, it
introduced challenges in handling cross-sentence risk
dependencies, where risks emerge from multi-sentence
interactions rather than isolated statements.
Our findings highlight the need for more advanced risk
classification frameworks, incorporating sequence-aware
models, domain-specific fine-tuning, and context-sensitive
risk evaluation. We also discuss the cultural relativity of
risk perception, emphasizing that risk assessments should
account for jurisdictional differences in medical, legal,
and ethical norms. Future research should explore hybrid
NLP architectures, data augmentation techniques, and
adaptive risk modeling to enhance chatbot safety and
reliability in medical AI applications.
-
Ayantika Das and Anupam Mondal
[Pdf]
[Table of Content]
Risk prediction in medical, ethical, and legal contexts is crucial for ensuring safety and informed
decision-making. This study explores machine learning
approaches for the MedNLP-CHAT task, utilizing
English-translated datasets from Japanese and German
subtasks. The textual data underwent preprocessing,
including tokenization, n-gram extraction, and
lemmatization, before being modeled using Logistic
Regression, Nu-SVC (nu=0.1) [2], Gradient Boosting, and XGB
Regressor. Objective risks were framed as a binary
classification task, while subjective labels were predicted
via regression, ensuring alignment with human-annotated
distributions. Performance was evaluated using accuracy,
precision, recall, F1-score, and Earth Mover’s Distance
(EMD). The findings indicate the model’s strengths and
weaknesses, emphasizing the need to enhance how class
imbalances and potential overfitting are addressed. This
work increases AI-driven risk assessment with applications
in regulatory compliance, healthcare, and ethical AI
development.
-
Lenard Paulo V. Tamayo, Sa'Idah Zahrotul Jannah, Mohamad
Alnajjar, Axalia Levenchaud, Shaowen Peng, Shoko Wakamiya
and Eiji Aramaki
[Pdf]
[Table of Content]
Chatbots are widely used in the healthcare sector, making
their accuracy and reliability essential. Beyond providing
factually correct information, chatbots must also consider
the human aspect of their responses. Large language models
(LLMs) can be utilized to evaluate chatbot responses,
employing prompting strategies such as chain-of-thought and
few-shot prompting to enhance reasoning and optimize output
quality. This study evaluates a chatbot’s answers to
medical questions using both objective and subjective
assessments. Different prompting techniques were applied:
objective evaluation used baseline, chain-of-thought (COT),
and chain-of-thought with few-shot (COTF) prompting, while
subjective evaluation used baseline and baseline with
few-shot (Baseline-f) prompting. The results revealed that
COTF prompting with both models improved the performance of
objective evaluation, while few-shot prompting enhanced
subjective evaluation.
-
Michael Van Supranes, Martin Augustine Borlongan, Joseph
Ryan Lansangan, Genelyn Ma. Sarte, Shaowen Peng, Shoko
Wakamiya and Eiji Aramaki
[Pdf]
[Table of Content]
This paper presents our submission to the MedNLP-CHAT Task
at NTCIR-18, which focuses on detecting medical, ethical,
and legal risks in chatbot-generated responses. We propose
a two-step prompt-based classification framework using the
Gemini-1.5-flash model. The method first generates support
statements to guide reasoning, which are then integrated
into a few-shot prompt for final classification. We
evaluated our approach on the English versions of the
Japanese and German subtasks, submitting two systems per
subtask that varied in example selection strategy and label
distribution. Our systems achieved strong performance in
detecting medical risks—particularly in the German
subtask—while ethical and legal risks were more
challenging. To better understand the design factors
influencing performance, we conducted ablation studies
across 24 prompt variants. Logistic regression and CHAID
analyses revealed that accuracy depends on complex
interactions between subtask language, example similarity,
actual label, and selection method. Higher similarity
improves classification of risk-present cases but harms
performance on risk-absent cases, indicating a trade-off
between recall and false positives. The k-nearest method was more effective under high similarity, while k-spread offered balanced results across classes. Although the
two-step prompting strategy did not show a statistically
significant advantage overall, the best-performing
configuration used five support statements, with
diminishing gains beyond that. Our findings suggest that
optimized prompt design, particularly with controlled
support and example selection, can improve risk detection
without requiring large-scale training or high
computational resources.
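The two example-selection strategies can be sketched as follows; the embedding model is an assumption, and "k-spread" is interpreted here as sampling evenly across the similarity-sorted pool.

```python
# Sketch of k-nearest vs. k-spread in-context example selection; the
# sentence-transformers encoder is our assumption, not the paper's.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def select_examples(query: str, pool: list[str], k: int, mode: str = "nearest"):
    emb = encoder.encode([query] + pool, normalize_embeddings=True)
    order = np.argsort(-(emb[1:] @ emb[0]))  # pool indices, most similar first
    if mode == "nearest":
        idx = order[:k]
    else:  # "spread": pick examples evenly over the whole sorted pool
        idx = order[np.linspace(0, len(pool) - 1, k).astype(int)]
    return [pool[i] for i in idx]
```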
-
Aoi Ohara, Nanami Murata, Ami Yuge and Rei Noguchi
[Pdf]
[Table of Content]
We developed model systems for detecting medical, legal,
and ethical risks in medical chatbot answers by using BERT
and ChatGPT language models. The ChatGPT model system,
which refers to external medical knowledge, performed best
in detecting medical risk, while the BERT model system
performed well in detecting legal and ethical risks. The
hybrid model system reduces missed risks by combining the
best of the BERT and ChatGPT model systems and has the best
recall values for all risk determination models. This study
demonstrates the usefulness of utilizing external medical
knowledge and the effectiveness of the hybrid approach.
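The recall-oriented behaviour of the hybrid system follows from an OR-style combination rule, sketched below under the assumption of boolean per-risk predictions: a risk is missed only if both models miss it.

```python
# OR-combination of the two systems' per-risk decisions (assuming boolean
# predictions): flagging when either model flags trades some precision
# for the best recall, as reported in the abstract.
def hybrid_risk(bert_flag: bool, gpt_flag: bool) -> bool:
    return bert_flag or gpt_flag
```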
-
Pei-Ying Yang, Tzu-Cheng Peng, Wen-Chao Yeh, Chien Chin
Chen and Yung-Chun Chang
[Pdf]
[Table of Content]
The TMUNLPG2 team participated in the Japanese subtask of
the NTCIR-18 Medical Natural Language Processing for AI
Chat (MedNLP-CHAT) Task. This paper presents our
methodological approach and analyzes the official results.
For the Japanese subtask, we implemented two distinct
methodologies addressing the objective and subjective
components. In the objective task, we fine-tuned a
pre-trained language model enhanced with focal loss,
comprehensive feature engineering, and strategic data
augmentation techniques to optimize performance. For the
subjective task, we developed specialized feature
engineering methods to extract implicit semantic
relationships within question-answer pairs, subsequently
leveraging these features to train a robust deep learning
architecture. Our approach yielded significant results,
with TMUNLPG2 achieving the highest average F1-score among
seven participating teams in the objective task and
securing second place in the subjective task. These
outcomes demonstrate the efficacy of our methodological
framework and highlight its potential applications in
advancing medical natural language processing systems.
-
Hiroki Tanioka
[Pdf]
[Table of Content]
Artificial intelligence (AI) is rapidly transforming many
fields, and healthcare is no exception. The current state
of AI in healthcare is characterized by a shift toward
addressing ethical concerns and developing a robust
framework for AI integration. Generative AI, a subset of AI
that includes Large Language Models (LLMs), has emerged as
a game changer with the potential to revolutionize medical
consultations. Therefore, the AITOK team participated in the Japanese and German subtasks of NTCIR-18 MedNLP-CHAT using three approaches: statistical knowledge only, GPT-3.5 Turbo, and GPT-4o. This report describes the problem-solving
approach using generative AI for medical, legal, and
ethical issues in medical consultation and its formal
results.
-
Jun-Yu Wu, Cheng-Yun Wu, Bor-Jen Chen, Wen-Hsin Hsiao and
Min-Yuh Day
[Pdf]
[Table of Content]
The IMNTPU team presents a multilingual evaluation of
Agentic AI for chatbot risk classification in the NTCIR-18
MedNLP-CHAT task. Our framework integrates fine-tuned small
models, optimized few-shot prompting with GPT-4o, and
multi-agent aggregation via majority and trust-weighted
voting. Results show that Agentic AI enhances decision
consistency, especially in subjective tasks like ethical
risk, but yields limited gains in structured domains such
as medical and legal assessment. Language-specific outcomes
reveal that annotation quality and linguistic complexity
jointly affect model performance, with Japanese systems
showing the most stability. Confidence analysis highlights
a decoupling between model certainty and accuracy,
underscoring the need for adaptive trust and calibration
strategies. Building on these insights, we propose a
Trust-Guided Agentic AI architecture featuring
self-consistency filtering, dynamic trust updating, and
Chain-of-Thought prompting to further improve reliability
in safety-critical AI systems.
-
Guanqi Cheng, Chang Qu and Ali Braytee
[Pdf]
[Table of Content]
Our team, UTSolve, participated in the Medical Natural
Language Processing for AI Chat (MedNLP-CHAT) task (https://sociocom.naist.jp/mednlp-chat/) at NTCIR-18. The task involved classifying various medical
texts into medical, ethical, and legal risks. In this
report, we utilized BioBERT, a pre-trained biomedical
language model that was trained on a large amount of
biological text data to predict the risk level of medical
texts. We also evaluated the medical and clinical language
models MedBERT and ClinicalBERT. Based on prediction
performance, BioBERT achieved the best classification
results, with a weighted F1 score of 0.7812 for medical
risk, 0.8629 for ethical risk, and 0.7288 for legal risk.
-
Yuta Nakamura, Koji Fujimoto, Jonas Kluckert, Michael
Krauthammer, Jun Kanzawa, Akira Katayama, Tomohiro Kikuchi,
Ryo Kurokawa, Wataru Gonoi, Yuki Tashiro, Shouhei Hanaoka,
Shuntaro Yada and Eiji Aramaki
[Pdf]
[Table of Content]
Radiology reports play a vital role in clinical workflows,
serving as a primary means for radiologists to communicate
imaging findings to physicians. However, the increasing
number of imaging studies has made it challenging to
produce and interpret comprehensive reports in a timely
manner. Natural language processing (NLP) has shown
potential to alleviate this burden, yet most existing
studies are limited to English, while clinical reports are
often written in local languages. To address this gap, we
have developed and released Japanese medical text datasets
through a series of shared tasks. Our recent efforts,
including NTCIR-16 Real-MedNLP and NTCIR-17 RR-TNM, focused
on automating lung cancer staging from radiology reports
using the TNM classification system. This task is
clinically significant, yet challenging due to the implicit
nature of staging information and the complexity of TNM
criteria.
In this paper, we introduce the NTCIR-18 RadNLP 2024 shared
task, which extends the previous task with finer-grained
classification, a larger and bilingual corpus, and new
sentence-level subtasks. We present the dataset,
participating systems, and evaluation results, aiming to
provide practical insights into building NLP systems for
cancer staging support.
-
Yoshifumi Okura and Yuki Kataoka
[Pdf]
[Table of Content]
This study aims to develop and evaluate a system that
automatically extracts the TNM classification of lung
cancer (T: primary tumor, N: lymph node metastasis, M:
distant metastasis) from radiological diagnosis reports. In
the initial experiments, inference was performed using
`gemini-2.0-flash-thinking-exp-1219`. By incorporating
explicit TNM classification criteria and unit
specifications—features absent in conventional methods—and
introducing error analysis and prompt improvements through
meta-prompting, an overall accuracy improvement of
approximately 15% was achieved after prompt modification.
In the final evaluation, using the `o1 2024-12-01-preview`
model, we achieved approximately 70% joint accuracy (fine),
76% T accuracy, 93% N accuracy, and 95% M accuracy. This
paper provides a detailed account of the experimental
procedures and the improvement process at each stage.
-
Junya Sato, Kosuke Kita, Daiki Nishigaki, Miyuki Tomiyama
and Masatoshi Hori
[Pdf]
[Table of Content]
In this paper, we describe our proposed systems for the Japanese main task and subtask of the Natural Language Processing for Radiology 2024 shared task. We employed
Generative Pre-trained Transformer models and applied a
few-shot prompting approach to tackle the classification
task for lung cancer TNM staging from free-text radiology
reports. Our method first performs zero-shot prompting
using training data and then refines the final predictions
by incorporating examples of incorrect predictions into the
prompt. We demonstrate that this approach outperforms
several BERT-based models and other open-source large
language models. On the test data, our method achieved a
Joint Accuracy (fine) of 0.732 for the main task and an
overall micro F2.0 of 0.688 for the subtask, ranking 3rd
in both categories.
-
Tsz-Yeung Lau and Shih-Hung Wu
[Pdf]
[Table of Content]
This study investigates the application of Large Language
Models (LLMs) for automated lung cancer staging based on
radiology reports, as part of the CYUT team’s participation
in the NTCIR-18 RadNLP Main Task.
Through data analysis, we observed a moderate correlation
among the T, N, and M staging classes. Experimental results
indicated that jointly prompting LLMs to predict all three
classes simultaneously yields improved performance.
Additionally, standardizing measurement units to
millimeters, rather than centimeters, proved to be a more
effective strategy. Based on these findings, we refined our
prompting methodology and applied it to both LLMs and
reasoning-augmented models, including OpenAI’s O-series and
DeepSeek-R1. These reasoning models, enhanced through
post-training with Chain-of-Thought (CoT) reasoning,
demonstrated superior staging accuracy.
As LLMs are generative models, their outputs may vary
across different runs, introducing inconsistency in
predictions. To mitigate this variability, we adopted an
ensemble learning strategy aimed at consolidating divergent
LLM outputs into a more stable and reliable lung cancer
staging system. Experimental results demonstrate that
ensemble methods consistently outperform individual models,
enhancing both the robustness and reliability of staging
from radiology reports.
Our approach achieved second place in the NTCIR-18 RadNLP
Main Task (English), underscoring the effectiveness of
LLM-based ensemble techniques for TNM classification. The
implementation is available on GitHub: anson70242/NTCIR-18-RadNLP-CYUT.
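A minimal sketch of the consolidation step follows; predict_tnm is a hypothetical wrapper around one LLM call, and sampling each stochastic model several times before a majority vote is our illustrative reading of the ensemble strategy.

```python
# Majority-vote consolidation of stochastic LLM outputs; predict_tnm is a
# hypothetical stand-in returning a (T, N, M) triple such as ("T2a", "N0", "M0").
from collections import Counter

def predict_tnm(model: str, report: str) -> tuple[str, str, str]:
    raise NotImplementedError("call the LLM and parse its T/N/M answer")

def ensemble_tnm(models: list[str], report: str, samples: int = 3):
    votes = [predict_tnm(m, report) for m in models for _ in range(samples)]
    return Counter(votes).most_common(1)[0][0]  # most frequent (T, N, M) triple
```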
-
Ryutaro Mori, Koichi Okuda, Shota Hosokawa, Taisei Komoda,
Tsudou Watanabe and Yasuyuki Takahashi
[Pdf]
[Table of Content]
We participated in the NTCIR-18 RadNLP2024 shared task [1]
and investigated the automation of TNM classification using
large language models (LLMs), specifically GPT-4o-mini,
GPT-4o, and o1-mini. Our approach integrates cosine
similarity-based retrieval using embedding vectors and
few-shot learning to enhance classification accuracy. As a
result of the experiment, o1-mini achieved the highest
classification accuracy. However, the accuracy on the test
data declined by approximately 30% compared to the
validation data. In particular, the low classification
accuracy of the T factor highlighted challenges in
interpreting tumor size and extent of infiltration. In this
paper, we analyze these results and report our approach to
this task along with official results.
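The retrieval step can be sketched as follows, assuming precomputed report embeddings; the prompt wording is illustrative, not the paper's.

```python
# Few-shot example retrieval by embedding cosine similarity, then prompt
# assembly (a sketch under our assumptions: precomputed vectors, toy prompt).
import numpy as np

def top_k_neighbors(query_vec, train_vecs, k: int = 3):
    sims = train_vecs @ query_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query_vec))
    return np.argsort(-sims)[:k]  # indices of the most similar reports

def build_prompt(report, train_reports, train_labels, idx):
    shots = "\n\n".join(
        f"Report: {train_reports[i]}\nTNM: {train_labels[i]}" for i in idx)
    return f"{shots}\n\nReport: {report}\nTNM:"
```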
-
Daiki Shirafuji and Takafumi Niwa
[Pdf]
[Table of Content]
Recent advances in language models (LMs) have significantly
improved the handling of complex medical narratives
compared to classical methods. However, one major obstacle
to the practical usage of these LMs in the medical domain
is that the models lack training on medical knowledge. In
particular, standard tokenizers trained on open-domain
corpora fail to accurately capture domain-specific
terminologies, abbreviations, and writing styles in
radiology reports or clinical notes. To address this issue,
we propose a two-step domain-transfer method that updates
both the tokenizer vocabulary and the LM representations.
First, we replace low-frequency tokens in the original
general-domain vocabulary with high-frequency bi- and
tri-grams extracted from medical text, ensuring that
domain-relevant tokens are learned. Second, we continually
pre-train the LM on the medical corpus using masked language modeling to more closely align the model with domain-specific language. We
evaluated the effectiveness of this approach in the RadNLP
2024 shared task on lung cancer staging from radiology
reports, covering both English and Japanese. Experimental
results indicate that our method improves performance on
this specialized task, suggesting that customizing
tokenizers and re-training language models can
substantially mitigate the domain gap. In future work, we will address standardizing radiology report formats to facilitate more robust and accurate automated analysis.
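A hedged sketch of the two steps with Hugging Face tooling is shown below; note that it adds in-domain n-gram tokens and resizes the embedding matrix, a simpler standard route than the token replacement described in the paper.

```python
# Hedged sketch of the two-step domain transfer. Unlike the paper, which
# *replaces* low-frequency tokens, this route *adds* in-domain n-grams.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

medical_ngrams = ["ground glass opacity", "mediastinal lymphadenopathy"]  # toy list
tokenizer.add_tokens(medical_ngrams)           # step 1: extend the vocabulary
model.resize_token_embeddings(len(tokenizer))

# step 2: continue pretraining with masked language modeling on the
# in-domain corpus (dataset preparation omitted in this sketch)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)
```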
-
Soma Onishi, Daisaku Shibata, Masanori Tsujikawa, Ryo
Ishii, Junya Tominaga and Hideki Ota
[Pdf]
[Table of Content]
We propose a novel method for automatically inferring TNM
stages from radiology reports. The proposed method includes
a two-stage reasoning process. In Stage 1, kNN few-shot
learning with the Chain of Thought is used for initial
inference, followed by a self-review to evaluate the
reasoning process. In Stage 2, if the inference results
after the self-review are inconsistent, a second review is
conducted from an alternative perspective. The proposed
method achieved superior results in the NTCIR-18 RadNLP
2024 Main Task (Japanese), outperforming other teams by
approximately 7.4 points, thereby winning the competition.
The proposed method is designed as an extension of prompt
engineering. It requires no complex training, which makes
it applicable to various large language models.
-
Chirag Bhawnani, Dhananjaya Bedkani Linganaik, Sanjeeth J.
Veigas and Vishnu Kumar Jakhoria
[Pdf]
[Table of Content]
The management of lung cancer heavily relies on precise
staging, which is traditionally derived from comprehensive
radiology reports generated through imaging techniques like
CT and MRI. However, these reports often lack explicit
staging details, posing challenges for healthcare
professionals who must manually extract relevant
information.
To address this issue, we propose an automated solution as
part of our submission to the RadNLP (Natural Language
Processing for Radiology) shared task at the NTCIR-18
international conference. Our approach utilizes tailored
Natural Language Processing (NLP) techniques to enhance the
processing of radiology reports. In this paper, we describe
our methodology for the RadNLP subtask,
which involves document segmentation to identify eight key
classes within radiology reports, and the primary task,
which focuses on the automated TNM staging of lung cancer.
For the subtask, we employed an ensemble of three
fine-tuned, hyperparameter-optimized BERT-based medical
language models, which yielded an overall micro F2 score of
0.9433, securing the top rank in the competition. For the
main task, we developed individual pipelines for T, N, and
M staging, consisting of BERT-based models and LLMs in a
multistage processing framework, resulting in a joint
accuracy of 0.5679 and an overall 4th place finish in the
competition. Our solution not only streamlines the
extraction of critical information but also aims to improve
the accuracy and efficiency of cancer staging, ultimately
supporting clinical decision-making and contributing to
better patient outcomes.
-
Aoi Kondo, Tan You Quan Bernon, Tsubasa Oka, Hiroaki Koga
and Mikio Oda
[Pdf]
[Table of Content]
The NITKC team participated in the RadNLP Shared task of
TNM classification from lung cancer radiology reports
written in English, using an LLM-based approach. LLM
accuracy varies depending on training methods and the
number of parameters. We aimed to solve this task using
open-source LLMs with fewer parameters than closed-source,
proprietary LLMs and made improvements accordingly.
Open-source LLMs have less prior knowledge than
closed-source LLMs, putting them at a disadvantage for TNM
classification. To address this, we used Graph-RAG to
improve accuracy and address issues by representing domain
knowledge for unfamiliar tasks as a graph and incorporating
it as knowledge into the LLM. This method uses a graph
database to represent domain knowledge for TNM
classification in a graph structure. It dynamically
incorporates the graph information into LLM prompts,
compensating for the knowledge gaps in open-source LLMs and
enabling more accurate inference. Additionally, to enhance
performance, we trained BioBERT and MedBERT on a dataset
labeled with lung cancer progression stages and utilized
these inference results concurrently. As a result, we
achieved a joint accuracy of 0.2963 in the TNM
classification task. This demonstrates that our approach
effectively mitigates the limitations of open-source LLMs
in TNM classification.
-
Marina Higashi, Rintaro Ito, Keita Kato, Ryota Asai, Shingo
Iwano and Shinji Naganawa
[Pdf]
[Table of Content]
Lung cancer is the most common cause of cancer death in
Japan. The TNM classification is essential for lung cancer
diagnosis and treatment planning, and CT imaging plays a
crucial role in its evaluation. However, the number of
thoracic radiologists is limited in Japan. The development
of a system to automatically extract TNM classification
from radiology reports would be beneficial to radiologists
and other clinicians. Large language models (LLMs) have
recently shown remarkable progress in natural language
processing, opening new possibilities for medical
applications. The NURad team participated in the NTCIR-18
Natural Language Processing for Radiology (RadNLP) task.
This paper describes our approach to the problem and
discusses the official results.
We explored different prompts, LLM models (Llama 3, OpenAI o1 pro, Google Gemini 2.0, Google NotebookLM), and data
types (Japanese and English). We also investigated
fine-tuning with clinical data. The final model, utilizing
a short prompt and trained on both Japanese and English
datasets using Google NotebookLM, did not incorporate
clinical data.
Our final model with Google NotebookLM achieved a TNM
(fine) score of 0.93 on the validation dataset. However,
the score decreased to 0.54 on the test dataset. This
decline was more pronounced for the T classification
compared to the N and M classifications.
This study demonstrates the potential of LLMs for automated
TNM classification from radiology reports, but also
highlights challenges in generalization to unseen data,
particularly for T classification. Further research is
needed to improve the robustness and accuracy of LLM-based
TNM classification systems.
-
Keisuke Hidaka
[Pdf]
[Table of Content]
Here, we report our approach to the NTCIR-18 RadNLP2024
Shared Task (Japanese Track, Main Task). In this study, we
developed a system to determine the TNM classification of lung cancer from Japanese radiology reports. Specifically,
we provided Google DeepMind’s Gemini 2.0 Flash Experimental
(gemini-2.0-flash-exp) with a prompt that combines
Chain-of-Thought (CoT) and Many-Shot In-Context Learning
(ICL), enabling automatic prediction of the T, N, and M
factors for each case. Besides accuracy, interpretability
is crucial in the medical domain; thus, having the model
output the rationale for its TNM classification ensures a
degree of transparency. Moreover, by including numerous
examples of CoT-based reasoning—written by a radiologist
with 5 years of dedicated experience in diagnostic
radiology—to explain how the TNM classification is derived,
we achieved improved inference accuracy.
Furthermore, to address privacy concerns and the need for
local inference without network connectivity in clinical
settings, we performed Supervised Fine-Tuning (SFT) using
Gemma2-9b-it, a comparatively lightweight open-source
model. By providing the model with CoT-based reasoning
steps leading to TNM classification as training data, we
observed improved inference accuracy.
These findings demonstrate that additional data and prompt
strategies to support large language model (LLM)-based
inference can be highly effective in automating TNM
classification while also indicating the feasibility of
realizing interpretability in LLM-based medical
applications.
-
Wuraola Oyewusi, Eliana Vasquez Osorio, Gareth Price and
Goran Nenadic
[Pdf]
[Table of Content]
The RadNLP 2024 (Natural Language Processing for Radiology)
shared task at the international conference NTCIR-18
(English track) focuses on document classification for lung
cancer staging, aiming to automatically determine the stage
(i.e., the degree of progression) of lung cancer from
radiology reports. Our approach involved data
preprocessing, stratified data augmentation, and
fine-tuning RadBERT—a transformer model pre-trained on
radiology-specific text. We employed back-translation for
data augmentation and 5-fold cross-validation to improve
model robustness and address class imbalance.
The results demonstrated that data augmentation
significantly improved validation performance, with T
accuracy increasing from 39.39% to 94.05% during K-fold
validation and reaching 100% on the task validation set.
However, a substantial performance gap was observed on the
task test set, with joint accuracy dropping from 96.3% on
the task validation set to 12.35%. This highlights
challenges in model generalization due to limited dataset
diversity and domain-specific language variability.
This report details our methodology, results, and discusses
the challenges encountered, highlighting the need for
further research to improve the robustness and
generalizability of automated lung cancer staging from
limited radiology reports.
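Back-translation itself is straightforward to sketch; the German pivot and the MarianMT checkpoints below are our assumptions, as the report's exact setup is not restated here.

```python
# Typical back-translation augmenter: translate English reports to German
# and back to obtain paraphrases for data augmentation.
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

tok_fwd, mt_fwd = load("Helsinki-NLP/opus-mt-en-de")  # English -> German
tok_bwd, mt_bwd = load("Helsinki-NLP/opus-mt-de-en")  # German -> English

def translate(texts, tok, mt):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    return tok.batch_decode(mt.generate(**batch), skip_special_tokens=True)

def back_translate(texts):
    return translate(translate(texts, tok_fwd, mt_fwd), tok_bwd, mt_bwd)

augmented = back_translate(["A 3 cm nodule is seen in the right upper lobe."])
```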
-
Wen-Chao Yeh, Yan-Chun Hsing, Tzu-Yi Li, Nitisalapa
Timsatid, Shih-Chuan Chang, Shih-Hsin Hsiao, Chu-Chun Wang,
Pak-Yue Chan, Wen-Lian Hsu and Yung-Chun Chang
[Pdf]
[Table of Content]
The TMUNLPG3 team participated in the Lung Cancer Staging
main task and Multi-label Sentence Classification subtask
of the NTCIR-18 RadNLP Task. This paper illustrates our
approach to address the challenges and discusses the
official results. We achieved the highest score among all
participants in the English track on the Lung Cancer TNM
Staging main task by adopting an LLM with few-shot prompt
engineering. Our solution also performed excellently in the
Multi-label Sentence Classification subtask.
-
Manuel-Carlos Díaz-Galiano, Lucas Molino-Piñar, Álvaro
Herrera Arjonilla and Maite Martín-Valdivia
[Pdf]
[Table of Content]
This paper presents our participation in the NTCIR-18
RadNLP 2024 English main task and subtask. We describe our
proposed solution to address the problem and discuss the
official results. Our approach is based on large language
models, with additional experiments involving data
augmentation, retrieval-augmented generation, and prompting
for the main task. Additionally, for the subtask, we
employed a ModernBERT model with pre-training and
hyperparameter optimization. Our best-performing submission
in the main task scores 0.5309 in overall joint accuracy
(fine) evaluation, and our best-performing submission in
the subtask scores 0.8189 in overall micro F2.0 evaluation.
Results from additional runs also show that
data augmentation could further improve model performance
beyond our best submission.
-
Yuki Tashiro, Yuta Nakamura and Eiji Aramaki
[Pdf]
[Table of Content]
This paper describes our approach to the RadNLP 2024
Main Task as participants of NTCIR-18. The RadNLP 2024 Main
Task is to classify the stage of lung cancer from radiology
reports. Our approach utilizes GPT-4o for inference,
employing prompt engineering techniques. We achieved an
accuracy of 0.5648 on the Japanese test data, demonstrating
the robustness of closed-source models.
-
Takashi Nishibayashi, Mitsuhisa Ota and Masahiro Kazama
[Pdf]
[Table of Content]
The Ubie team participated in the RadNLP core task on lung
cancer staging classification based on Japanese radiology
reports at NTCIR-18. This paper reports our approach and
analyzes the official results. We investigated the impact
of prompt engineering on TNM classification using large
language models (LLMs). We compared multiple proprietary
models available as of January 2025 (Gemini 1.5 Pro, Gemini
Exp. 1206, and o1) using various prompt configurations,
including zero-shot, few-shot, chain-of-thought (CoT), and
self-feedbacked instruction.
The results demonstrate significant performance
improvements driven by model evolution in this medical text
classification task. Analysis of prompt variations revealed
differential impacts based on model capabilities. For
Gemini models tested, explicitly prompting reasoning steps
(CoT) led to the most substantial performance gains. In
contrast, the o1 model, a reasoning model performing
internal CoT and self-evaluation, showed limited benefit
from explicit reasoning prompts, suggesting that strategies
effective for non-reasoning models are less critical for
advanced reasoning models. This finding, consistent with
general guidance on prompting reasoning models, is also
observed in our medical text classification experiments.
The effectiveness of self-feedbacked instruction varied,
showing no improvement for Gemini 1.5 Pro, possibly due to
inadequate feedback generation and its dependence on
factors like few-shot example selection.
While prompt engineering offered limited gains for the
reasoning model evaluated, it provided substantial
performance benefits for non-reasoning models, highlighting
its value for optimizing models without inherent advanced
reasoning capabilities.
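The prompt-configuration sweep described above can be
organized roughly as below; this is a sketch under stated
assumptions, with illustrative prompt wordings and a
placeholder in place of each provider's API client.

    # Sketch of sweeping prompt configurations across models.
    # Prompt texts are illustrative; call_llm stands in for provider APIs.
    from itertools import product

    MODELS = ["gemini-1.5-pro", "gemini-exp-1206", "o1"]  # compared in the paper
    PROMPTS = {
        "zero_shot": "Report: {report}\nGive the TNM classification.",
        "cot": ("Report: {report}\nReason step by step about the T, N, and M "
                "factors, then give the TNM classification."),
        # few-shot and self-feedbacked variants omitted for brevity
    }

    def accuracy(model, prompt_name, cases, call_llm):
        # call_llm(model, prompt) -> str is a hypothetical provider wrapper
        template = PROMPTS[prompt_name]
        preds = [call_llm(model, template.format(report=c["report"]))
                 for c in cases]
        return sum(c["gold"] in p for c, p in zip(cases, preds)) / len(cases)

    # for model, name in product(MODELS, PROMPTS):
    #     print(model, name, accuracy(model, name, dev_cases, call_llm))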
-
Aman Sinha and Ioana Buhnila
[Pdf]
[Table of Content]
We present our results on the main task and subtask of the
NTCIR-18 RadNLP 2024 shared task on the English language.
We tested to what extent Large Language Models (LLMs) and
Pretrained Language Models (PLMs) can identify and classify
tumor types and subtypes. Our results for the main task
showed that LLMs have difficulties in understanding
different subtypes of tumors. For the tumor sentence
segment classification subtask, we obtained a competitive
overall score of 0.83 on the micro F2.0 metric with
pretrained language models. Our results showed that in
low-data settings, clinical PLMs are a better choice than
general-purpose and domain-specific LLMs. Providing
additional information, such as definitions in the case of
clinical staging classification, can help LLMs achieve
better scores on fine-grained classification.
-
Tomoki Terada and Rei Noguchi
[Pdf]
[Table of Content]
We developed highly interpretable classification models of
lung cancer stage using Bag-of-Words representations that
consist of predefined key terms based on domain knowledge.
These models had high medical validity and provided new
clinical insights. This study demonstrates the
effectiveness of domain knowledge in improving model
accuracy and the usefulness of model interpretability in
the medical field.
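A minimal sketch of such an interpretable model is given
below, with a hypothetical key-term list and toy data in
place of the paper's domain-knowledge vocabulary and
reports.

    # Interpretable Bag-of-Words classifier over predefined key terms.
    # KEY_TERMS and the training data are hypothetical stand-ins.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    KEY_TERMS = ["invasion", "pleura", "atelectasis", "mediastinum",
                 "lymph node", "effusion", "metastasis"]

    train_reports = ["mass with invasion of the pleura",
                     "small nodule, no lymph node enlargement"]
    train_stages = ["T3", "T1"]

    vectorizer = CountVectorizer(vocabulary=KEY_TERMS, ngram_range=(1, 2))
    X = vectorizer.transform(train_reports)  # counts of the key terms only
    clf = LogisticRegression(max_iter=1000).fit(X, train_stages)

    # Every weight maps to one key term, so decisions can be read off
    # and checked for medical validity.
    for term, w in zip(vectorizer.get_feature_names_out(), clf.coef_[0]):
        print(f"{term}: {w:+.3f}")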
-
Yosuke Yamagishi, Ryosuke Tomiyama and Yui Ueda
[Pdf]
[Table of Content]
Automated extraction of TNM staging information from
radiology reports is a challenging task that requires
understanding complex clinical language and applying
detailed staging criteria. In this paper, we present our
approach to the NTCIR-18 RadNLP 2024 shared task on
automated lung cancer staging from Japanese radiology
reports. We developed a hybrid system that combines large
language models (LLMs) with rule-based processing in a
two-stage pipeline: first extracting structured information
from reports using GPT-4o models, then applying
classification rules to determine the appropriate TNM
stages. Our approach employed different strategies for each
classification component: a rule-based method for the
complex T classification and a more flexible LLM-based
approach for N and M classifications. Evaluation results
showed strong performance on the validation dataset (joint
accuracy of 0.8148) but revealed a significant drop in T
classification performance on the test dataset (from 0.8704
to 0.4769), while N and M classifications maintained high
accuracy levels. This performance disparity highlights the
trade-offs between rule-based precision and LLM flexibility
in clinical NLP systems. Our findings suggest that
balancing these approaches and leveraging larger
development datasets could improve the robustness of
automated cancer staging systems for real-world clinical
applications.
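To illustrate the rule-based half of such a pipeline, here
is a simplified, size-only sketch of mapping LLM-extracted
fields to a T category; the thresholds follow the TNM
8th-edition size cutoffs, while the actual rules also cover
invasion, atelectasis, and separate nodules.

    # Simplified second-stage rules: structured fields from the LLM are
    # mapped to a T category. Size-only; real rules handle far more cases.
    def t_category(size_cm, invades_chest_wall=False):
        if invades_chest_wall:
            return "T3"            # simplified; some invasions are T4
        if size_cm is None:
            return "TX"
        if size_cm <= 1: return "T1a"
        if size_cm <= 2: return "T1b"
        if size_cm <= 3: return "T1c"
        if size_cm <= 4: return "T2a"
        if size_cm <= 5: return "T2b"
        if size_cm <= 7: return "T3"
        return "T4"

    extracted = {"size_cm": 4.5, "invades_chest_wall": False}  # from stage 1
    print(t_category(**extracted))  # -> T2b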
-
Takahito Nakajima
[Pdf]
[Table of Content]
Lung cancer TNM classification from narrative radiology
reports presents challenges due to expression variability
and complex relationships between findings. This study
develops an automated TNM classification system utilizing
large language models (LLMs) with supervised fine-tuning
(SFT) and specialized prompting (SP) approaches. We
evaluated our system on the NTCIR-18 RadNLP 2024 Task
dataset, achieving 72.69% (Japanese) and 55.56% (English)
fine-grained accuracy, ranking 5th among 15 teams. Our
system demonstrated particularly high performance in
N-factor classification (>93.98% accuracy) and in the
subtask of textual analysis (ranking 1st in Japanese and
3rd in English tracks). Error analysis revealed challenges
in interpreting complex expressions and implicit
information. This system shows potential for clinical
workflow optimization, standardization of TNM
classification, and educational support, with implications
for improving cancer staging practices.
Return to Top
Pilot Tasks
-
Hideo Joho, Atsushi Keyaki, Yuuki Tachioka and Shuhei
Yamamoto
[Pdf]
[Table of Content]
This paper provides an overview of the NTCIR-18 Transfer-2
task that aims to bring together researchers from
Information Retrieval, Machine Learning, and Natural
Language Processing to develop a suite of technology for
transferring resources generated for one purpose to another
in the context of dense retrieval. Two subtasks were run
for this round: the Retrieval Augmented Generation (RAG)
subtask and the Dense Multimodal Retrieval (DMR) subtask.
This paper presents the dataset developed and evaluation
results of participant runs. Note that this paper includes
material from our earlier work published in [emtcir04],
revised for the current work.
-
Yuuki Tachioka and Yasunori Terao
[Pdf]
[Table of Content]
The ditlab team participated in the RAG and DMR tasks of
the NTCIR-18 Transfer-2 task. For the RAG task, we proposed
a late fusion method for answer generation that uses
multiple contexts retrieved by the dense passage retriever.
Unlike sequential approaches that input contexts
sequentially into large language models (LLM), our method
processes contexts in parallel and employs majority voting
to determine the final answer. We also fine-tuned the LLM
using a LoRA-based method to better handle quiz-style
questions, achieving gains of over 10 points in accuracy
over the baseline. For the DMR task, we
introduce a modality-aware sensor encoder that processes
numerical and textual sensor features separately, and
enhance geolocation features by converting
latitude/longitude data into address strings via k-nearest
neighbor matching. Although our baseline performance was
degraded relative to the official baseline due to a
mismatch between the training and evaluation data, our
approach improved image-to-sensor retrieval performance
over our baseline.
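To make the late-fusion step concrete, here is a minimal
sketch: each retrieved context is answered independently
and a majority vote picks the final answer. The generator
argument stands in for the LoRA-fine-tuned LLM.

    # Late fusion over retrieved contexts via majority voting.
    from collections import Counter

    def late_fusion_answer(question, contexts, generate):
        # generate(question, context) -> answer; placeholder for the LLM
        answers = [generate(question, c) for c in contexts]  # parallelizable
        return Counter(answers).most_common(1)[0][0]         # majority vote

    # Toy usage with a stub generator:
    ctxs = ["...passage 1...", "...passage 2...", "...passage 3..."]
    print(late_fusion_answer("Who wrote X?", ctxs, lambda q, c: "Author A"))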
-
Riku Mizuguchi, Takeshi Yamazaki and Shuhei Yamamoto
[Pdf]
[Table of Content]
This paper presents the participation of the YMX2L research
team in the NTCIR-18 Transfer-2 Dense Multimodal Retrieval
(DMR) task. Our approach focuses on the integration of
visual and sensor data, leveraging data augmentation
techniques and object detection to enhance retrieval
performance. The experimental results demonstrate the
effectiveness of our proposed methods and highlight key
features that contribute to addressing the challenges of
multimodal dense retrieval.
Return to Top
-
Key-Sun Choi and You-Sang Cho
[Pdf]
[Table of Content]
The Hidden-Rad task, introduced as a pilot challenge at
NTCIR-18, aims to improve the interpretability of AI
systems in radiology-related diagnostic reasoning by
encouraging
models to explicitly explain the rationale behind clinical
interpretations. Traditional radiology reports often focus
on final diagnoses while omitting the underlying causal
reasoning. To address this, Hidden-Rad defines two
subtasks: Task 1 targets diagnostic explanation generation
using radiology reports, with optional use of X-ray images;
Task 2 evaluates the interpretation of diagnostic reasoning
from structured clinical questionnaires. The task is built
on an enriched subset of the MIMIC-CXR dataset and includes
formal evaluation criteria provided via a public
repository. In total, three teams submitted 40 runs for
Task 1, while two teams submitted 16 runs for Task 2. The
top-performing systems achieved 69% and 78.84% on the two
subtasks, respectively, demonstrating the potential for
integrating causal reasoning into clinical report
generation. The findings highlight future directions for
explainable medical AI through the use of domain-specific
knowledge graphs and customized language models.
-
Youngseob Won, Younggyun Hahm, Chanhyuk Yoon and Seong Tae
Kim
[Pdf]
[Table of Content]
The Teddysum team participated in the HIDDEN-RAD task at
NTCIR-18, which focuses on extracting and reconstructing
causal explanations in radiology report generation. Our
approach integrates Chain-of-Thought (CoT) prompting,
Retrieval-Augmented Generation (RAG) leveraging RadGraph,
and a Tree-of-Thought (ToT)-inspired evaluation mechanism
to enhance causal reasoning. For Task 1, we employ
KG-LLaVA, a visual language model, to convert chest X-ray
images into textual descriptions before integrating them
into our reasoning pipeline. For Task 2, our text-based
framework directly applies structured prompting and
retrieval-based reasoning. Our method secured 1st place in
Task 2, demonstrating the effectiveness of structured
causal inference in radiology report generation. We discuss
the advantages, limitations, and future directions for
improving AI-driven causal explanation models in medical
applications.
-
Mercy Ranjit, Rahul Kumar, Shaury Srivastav, Anirban Porya
and Tanuja Ganu
[Pdf]
[Table of Content]
This paper presents the participation of the Microsoft
Research RADPHI3 team in the Hidden-RAD Challenge: Hidden
Causality Inclusion in Radiology Reports. The task aims to
recover hidden causality from radiology reports, optionally
accompanied by their corresponding frontal chest X-rays
(CXRs). We fine-tune small language models, specifically
Rad-Phi-3.5 Vision-CXR, to recover causality analysis in
both language-only and multi-modal settings, given
radiology reports and radiology images as inputs. We also
include baselines of various models in the general domain,
including models specifically tuned for reasoning tasks
such as GPT-4o, LLaMA 3.3, Phi4, DeepSeek, OpenAI o1,
OpenAI o1-mini, and OpenAI o3-mini. Through these
experiments, we evaluated the effectiveness of
general-domain, reasoning-specialized, and fine-tuned
domain-specific small language models in generating causal
explanations given radiology reports and images optionally
as inputs.
-
Ju-Min Cho, Ho-Jin Yi, Myung-Kyu Kim, Se-Jin Jeong and
Seung-Hoon Na
[Pdf]
[Table of Content]
The nash team participated in the NTCIR-18 Hidden-RAD Task,
focusing on generating causality-based diagnostic
inferences from radiology reports.
In Subtask 1, we applied a cost-efficient API-driven
inference pipeline to recover hidden causalities within
MIMIC-CXR reports. Our pipeline integrates few-shot
in-context learning, retrieval-enhanced prompting, and
strict candidate selection using an evaluation checklist.
By leveraging retrieved similar cases to enrich the prompt
dynamically, this approach achieved the highest ranking
(1st place) in the official evaluation.
In Subtask 2, we explored structured diagnostic reasoning
using PRISMA-Guided Causal Explanation, applying
prompt-based systematic reasoning to enhance
interpretability. Our method, leveraging structured PRISMA
flow with large language models, secured 2nd place in the
official evaluation. Additionally, we investigated an
alternative approach that combined fine-tuning and
domain-specific prompting to improve model adaptability.
While this method was not included in the final ranking, it
demonstrated potential in enhancing domain-specific model
interpretability.
These findings contribute to the advancement of explainable
AI (XAI) in radiology, bridging the gap between automated
diagnosis and human expert decision-making.
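The checklist-driven candidate selection in Subtask 1 can
be sketched as follows; the checklist items and the judge
are hypothetical placeholders, not the team's actual
criteria.

    # Strict candidate selection with an evaluation checklist (sketch).
    CHECKLIST = [
        "Does the explanation state a cause for each key finding?",
        "Is every causal claim grounded in the report text?",
        "Is the reasoning free of contradictions?",
    ]

    def select_candidate(candidates, judge):
        # judge(question, candidate) -> bool; placeholder for an LLM call
        def score(c):
            return sum(judge(q, c) for q in CHECKLIST)
        return max(candidates, key=score)

    # Toy usage with a stub judge:
    print(select_candidate(["candidate A", "candidate B"],
                           lambda q, c: "A" in c))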
Return to Top
-
Tokinori Suzuki, Douglas W. Oard, Shashank Bhardwaj, Emi
Ishita and Yoichi Tomiura
[Pdf]
[Table of Content]
This paper describes the NTCIR-18 SUSHI Pilot Task. The
task included two subtasks: folder search and archival
reference detection. Details are presented for each
subtask on the design of the test collection, the system
runs submitted by participating teams, and the evaluation
results for those submitted runs.
-
Haruki Fujimaki and Makoto P. Kato
[Pdf]
[Table of Content]
This paper describes the KASYS team's participation in the
NTCIR-18 SUSHI Task by presenting a multi-level metadata
aggregation and retrieval approach for Subtask A, which
focuses on retrieving undigitized historical materials with
sparse item-level metadata. Our system leverages the
hierarchical organization of the data (comprising Box,
Folder, and Item levels) by aggregating metadata from
lower to higher levels and applying two search strategies
("Merge" and "Each"). We evaluate traditional BM25
alongside dense retrieval models (E5 and ColBERT) without
fine-tuning, and employ hyperparameter optimization with
Optuna to determine the optimal weight for each level.
Although our multi-level score aggregation strategy was
designed to exploit the hierarchical structure of the data,
it did not yield a significant performance improvement over
a simpler BM25 baseline. Future work will explore improved
preprocessing of noisy metadata, hybrid retrieval methods
combining BM25 with dense re-ranking, and model fine-tuning
to further enhance performance in searching undigitized
archival collections.
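The per-level weight tuning can be sketched with Optuna as
follows; the score tables and the precision@1 objective are
toy stand-ins for the paper's BM25/dense scores and
evaluation measure.

    # Tuning Box/Folder/Item fusion weights with Optuna (toy sketch).
    import optuna

    LEVELS = ["box", "folder", "item"]
    # scores[level][query][folder]: toy retrieval scores per metadata level
    scores = {"box":    [[0.2, 0.9], [0.4, 0.1]],
              "folder": [[0.7, 0.3], [0.2, 0.8]],
              "item":   [[0.1, 0.2], [0.6, 0.5]]}
    gold = [0, 1]  # relevant folder per query (toy)

    def objective(trial):
        w = {lvl: trial.suggest_float(f"w_{lvl}", 0.0, 1.0) for lvl in LEVELS}
        hits = 0
        for q, g in enumerate(gold):
            fused = [sum(w[lvl] * scores[lvl][q][f] for lvl in LEVELS)
                     for f in range(2)]
            hits += int(max(range(2), key=fused.__getitem__) == g)
        return hits / len(gold)  # precision@1 as a toy objective

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    print(study.best_params)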
-
Douglas W. Oard, Shashank Bhardwaj and Emi Ishita
[Pdf]
[Table of Content]
The University of Maryland participated in both subtasks of
the SUSHI Pilot Task. This paper describes the design of
the systems used for each task, and it presents some
preliminary analysis of the available results. The
generation of data that has been shared with other
participating teams is also described.
-
Tokinori Suzuki and Yoichi Tomiura
[Pdf]
[Table of Content]
Kyushu University's team (QshuNLP) participated in both
subtasks of the NTCIR-18 SUSHI pilot task. In this paper,
we describe our approaches and systems, and analyze the
results.
Return to Top
-
Yasutomo Kimura, Sato Eisaku, Kazuma Kadowaki and Hokuto
Ototake
[Pdf]
[Table of Content]
This paper provides an overview of the NTCIR-18 U4 shared
task, which focuses on unifying, understanding, and
utilizing unstructured data in financial reports. This task
aims to improve methods for extracting and analyzing
information, particularly from tables, within annual
securities reports. These reports are crucial for
understanding a company's financial performance, yet their
complex and varied table structures present significant
challenges for automated processing. To address these
issues, the task comprises two subtasks, Table Retrieval
and Table Question Answering, designed to evaluate and
advance system capabilities for handling real-world
financial documents. The dataset, drawn from TOPIX100
companies, encompasses diverse table formats and content,
serving as a rigorous test bed for participants.
Performance is assessed via a leaderboard that evaluates
JSON-formatted system outputs, promoting transparent and
reproducible results. The NTCIR-18 U4 task saw 10 active
teams participate, making a total of 210 submissions.
-
Koji Tanaka, Daiki Shirafuji and Tatsuhiko Saito
[Pdf]
[Table of Content]
Recently, Large Language Models (LLMs) have been gaining
increasing attention in the domain of Table Question
Answering (TQA), particularly for extracting data from
tables in documents. However, directly entering entire
tables as long text into LLMs often leads to incorrect
answers because most LLMs cannot inherently capture complex
table structures. In this paper, we propose a cell
extraction method for TQA without manual identification,
even for complex table headers. Our approach estimates
table headers by computing similarities between a given
question and individual cells via a hybrid retrieval
mechanism that integrates a language model and TF-IDF. We
then select as the answer the cells at the intersection of
the most relevant row and column. Furthermore, the language
model is trained using contrastive learning on a small
dataset of question-header pairs to enhance performance. We
evaluated our approach on the TQA dataset from the shared
task "Unifying, Understanding, and Utilizing Unstructured
Data in Financial Reports" (U4) held in the NTCIR-18
conference, which our team (WhiteME) participated in. The
experimental results show that our pipeline achieves an
accuracy of 74.6%, outperforming existing LLMs such as
GPT-4o mini (63.9%). In summary, we found that focusing on
the header relationships through our hybrid retrieval
strategy effectively addresses structural uncertainties in
complex tables.
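The hybrid header-matching idea can be sketched as below:
row and column headers are scored against the question by
mixing dense-embedding and TF-IDF similarities, and the
answer is the cell at the intersection of the best row and
column. The encoder checkpoint and mixing weight are
illustrative assumptions.

    # Hybrid (dense + TF-IDF) header matching; answer = intersection cell.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint

    def hybrid_best(question, headers, alpha=0.5):
        tfidf = TfidfVectorizer().fit(headers + [question])
        sparse = cosine_similarity(tfidf.transform([question]),
                                   tfidf.transform(headers))[0]
        dense = cosine_similarity(encoder.encode([question]),
                                  encoder.encode(headers))[0]
        return int((alpha * dense + (1 - alpha) * sparse).argmax())

    def answer_cell(question, row_headers, col_headers, cells):
        r = hybrid_best(question, row_headers)
        c = hybrid_best(question, col_headers)
        return cells[r][c]  # cell at the most relevant row/column intersection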
-
Long Si, Yin Zhang, Xiaotian Wang and Takehito Utsuro
[Pdf]
[Table of Content]
The goal of this paper is to develop a system for
participating in the information extraction task from
tables in securities reports (the NTCIR-18 U4 Task). The
NTCIR-18 U4 Task consists of two distinct tasks: (1)
retrieving the table that contains the relevant data, and
(2) extracting the desired data from the table to address
the question. For the first task, we will utilize a
pre-trained model that has demonstrated strong performance
in table retrieval, and we will fine-tune the model to
enhance its effectiveness for this specific task. For the
second task, we will employ the latest Large Language
Models (LLMs), which have shown excellent results across a
variety of Natural Language Processing tasks. This approach
is expected to achieve state-of-the-art performance,
surpassing existing pre-trained BERT-based models.
-
Yukihiro Seito
[Pdf]
[Table of Content]
This paper presents the methods and results of Team SMM for
the U4 task at NTCIR-18.
In the Table Retrieval subtask, we designed retrieval
methods that use a cell-level multi-vector retriever and a
single-vector retriever to enhance retrieval accuracy.
The retriever first narrows down candidate tables to the
top 10 based on retrieval score. Then, a
cross-encoder-based reranker classifies these candidates
into three categories: positive, negative, and hard
negative. Finally, the table with the highest probability
of being positive is selected as the final retrieved result.
For the Table Question Answering subtask, we employ a
T5-based model for answer generation to produce multiple
candidate answers and introduce a Cell ID Estimator that
identifies which cells in the table were used as the basis
for generating each candidate answer by leveraging cell,
row, and column embeddings. The estimator then selects the
final answer based on the highest supporting cell score.
The test set is divided into public and private splits,
inspired by Kaggle's evaluation methodology. The public
split is used for leaderboard updates, while the private
split ensures robustness by preventing models from
overfitting to leaderboard data. Final evaluations include
both splits to provide a more reliable assessment of model
performance.
In the formal run, our method achieved an accuracy of
97.70% (public) and 97.55% (private) for Table Retrieval
(ID 62), and for Table Question Answering, 86.34% and
86.57% on cell ID and value prediction, respectively, on
the public split, with corresponding accuracies of 82.76%
and 81.94% on the private split.
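The selection step of the Table Retrieval pipeline might
look as follows; the reranker checkpoint name is
hypothetical, and the class index assignment is an
assumption for illustration.

    # Pick the candidate table with the highest "positive" probability.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("my-u4-reranker")        # hypothetical
    model = AutoModelForSequenceClassification.from_pretrained(
        "my-u4-reranker", num_labels=3)  # positive / negative / hard negative
    POSITIVE = 0  # assumed index of the positive class

    def pick_table(question, candidate_tables):
        inputs = tok([question] * len(candidate_tables), candidate_tables,
                     padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)[:, POSITIVE]
        return candidate_tables[int(probs.argmax())]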
-
So Takasago and Tomoyoshi Akiba
[Pdf]
[Table of Content]
In this paper, we propose a three-stage method for the U4
TableQA task. The method first analyzes and segments the
target table into header and data cell sections using a
machine learning classifier. Then, it generates natural
language descriptions for each data cell using sentence
templates based on the table structure. Finally, it
retrieves relevant sentences matching the input question
from the generated sentence set to form the TableQA result.
This approach is also extended to the Table Retrieval task.
Evaluation experiments showed that the Table Retrieval task
achieved an accuracy of 0.3569, whereas for the TableQA
task, the accuracy of cell_id prediction was 0.7797 and
that of value prediction was 0.7168.
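The verbalize-then-retrieve idea can be sketched as below;
the sentence template and toy table are illustrative
assumptions in place of the paper's structure-aware
templates.

    # Verbalize data cells from headers, then retrieve the best sentence.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def verbalize(table):
        sents, refs = [], []
        for i, row in enumerate(table["rows"]):
            for j, col in enumerate(table["cols"]):
                sents.append(f"The {row} in {col} is {table['cells'][i][j]}.")
                refs.append((i, j))
        return sents, refs

    def answer(question, table):
        sents, refs = verbalize(table)
        tfidf = TfidfVectorizer().fit(sents + [question])
        sims = cosine_similarity(tfidf.transform([question]),
                                 tfidf.transform(sents))[0]
        i, j = refs[int(sims.argmax())]
        return (i, j), table["cells"][i][j]  # cell_id and value

    table = {"rows": ["Net sales", "Operating income"],
             "cols": ["FY2022", "FY2023"],
             "cells": [["1,200", "1,350"], ["80", "95"]]}
    print(answer("What was the operating income in FY2023?", table))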
-
Yuki Fujita, Ryota Mizushima, Hokuto Ototake and Kenji
Yoshimura
[Pdf]
[Table of Content]
This paper describes the proposed methods and results of
the FUSINT team in the U4 task. For the Table Retrieval
task, we propose a method for retrieving specific tables in
Securities Reports based on a given question. Our approach
involves filtering using cosine similarity and reranking,
followed by a binary classification model. We achieved
approximately 90% accuracy, but challenges remain in
preprocessing and generalizing the section prediction
model. Future work should explore methods that can handle a
wider variety of question formats. For the Table QA task,
we propose a method for identifying table cells in
Securities Reports, focusing on standardizing table
structures and resolving inconsistencies in cell values.
One advantage of our approach is its ability to visualize
the reasoning process. While challenges remain in handling
hierarchical tables due to matrix segmentation, our method
successfully identified cell positions with a high accuracy
of approximately 92%.
-
Xin Fan, Kazuya Uesato, Yuma Hayashi and Tsuyoshi Morioka
[Pdf]
[Table of Content]
The AIREV team participated in the NTCIR-18 U4 shared task,
which comprises two subtasks, Table Retrieval (TR) and
Table Question Answering (TQA), designed to evaluate and
advance system capabilities for handling real-world
financial documents. This paper reports our approach to
solving the two subtasks and discusses the experimental
results. Our proposed approaches are primarily based on
fine-tuning pre-trained LLMs on specific downstream tasks
involving several key components: converting tabular data
to natural language representations, well-designed prompts,
BERT-based re-ranking, and LLM-based retrieval. Our
proposed approaches placed second on the leaderboard for
both the TR and TQA subtasks, demonstrating the
effectiveness of our proposed method.
-
Hayato Aida, Kosuke Takahashi and Takahiro Omi
[Pdf]
[Table of Content]
This paper reports the methods, results and analysis of
STMK24 for the NTCIR-U4 Table QA (TQA) task. STMK24
approaches TQA as a Visual Document Understanding task, and
tables are transformed into three different modalities:
image, text, and layout of the content. To comprehend the
structures of the tables in a simple manner, our model is
trained to infer the cell IDs of the tables, and the cell
values are automatically extracted through rule-based
conversion. We investigated the impact of each modality on
Table QA performance and confirmed that the model achieves
high cell ID inference accuracy when utilizing all
modalities.
-
Hiroyuki Higa, Maeyama Yuuki and Kazuhiro Takeuchi
[Pdf]
[Table of Content]
Financial reports, such as securities reports, contain
various figures and tables that play a crucial role in
conveying structured information. In this study, we focus
on the analysis of tables by integrating both textual and
tabular data. We present a method that leverages natural
language processing (NLP) techniques to assess the
correctness of extracted information.
Return to Top