The 18th NTCIR Conference
Evaluation of Information Access Technologies
June 10-13, 2025
National Institute of Informatics, Tokyo, Japan

    [Preface]


  • Makoto P. Kato, Noriko Kando, Charles L. A. Clarke and Yiqun Liu
    [Pdf] [Table of Content]
  • Return to Top


    [Overview]


  • Chung-Chi Chen, Qingyao Ai and Shoko Wakamiya
    [Pdf] [Table of Content]
    The NTCIR project, organized by the National Institute of Informatics (NII) in Japan, has been a key platform for information retrieval (IR) and natural language processing (NLP) research since 1997. NTCIR-18, running from January 2024 to June 2025, features seven core tasks and three pilot tasks covering LLM evaluation, advanced IR, domain-specific NLP, and personal data management. A total of 113 teams worldwide participated, registering 178 times across tasks. This paper provides an overview of NTCIR-18, highlighting its objectives, methodologies, and key findings, along with future directions.
  • Return to Top


    [Keynote]


  • Maarten de Rijke
    [Pdf] [Table of Content]
  • Douglas W. Oard
    [Pdf] [Table of Content]
  • Return to Top


    [Panel]


  • Mark Sanderson
    [Pdf] [Table of Content]
  • Return to Top



    Core Tasks


    [AEOLLM]


  • Junjie Chen, Haitao Li, Zhumin Chu, Yiqun Liu and Qingyao Ai
    [Pdf] [Table of Content]
    In this paper, we provide an overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) task. As large language models (LLMs) grow popular in both academia and industry, how to effectively evaluate the capacity of LLMs becomes an increasingly critical but still challenging issue. Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations in task format (most benchmarks consist of multiple-choice questions) and evaluation criteria (dominated by reference-based metrics). To advance innovation in automatic evaluation, we propose the AEOLLM task, which focuses on generative tasks and encourages reference-free methods. In addition, we set up diverse subtasks such as dialogue generation, text expansion, summary generation and non-factoid question answering to comprehensively test different methods. This year, we received 48 runs from 4 teams in total. This paper describes the background of the task, the dataset, the evaluation measures and the evaluation results.
  • Xiao Fu, Navdeep Singh Bedi, Noriko Kando, Fabio Crestani and Aldo Lipani
    [Pdf] [Table of Content]
    We propose an efficient evaluation pipeline for Retrieval-Augmented Generation (RAG) systems tailored for low-resource settings. Our method uses ensemble similarity measures combined with a logistic regression classifier to assess answer quality from multiple system outputs using only the available queries and replies. Experiments across diverse tasks demonstrate competitive accuracy and a reasonable correlation with ground truth rankings, establishing our approach as a reliable metric.
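    A minimal sketch of the kind of pipeline described above, assuming hand-picked lexical similarity features and scikit-learn; the feature set and toy data are illustrative, not the authors' implementation:

      # Hypothetical features: TF-IDF cosine, token Jaccard, and a length signal,
      # fed to a logistic regression that scores (query, reply) pairs.
      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics.pairwise import cosine_similarity

      def features(query, reply, vec):
          cos = float(cosine_similarity(vec.transform([query]), vec.transform([reply]))[0, 0])
          q, r = set(query.lower().split()), set(reply.lower().split())
          jaccard = len(q & r) / max(len(q | r), 1)
          return [cos, jaccard, min(len(reply.split()), 128) / 128]

      queries = ["what does the warranty cover?", "when was the company founded?"]
      replies = ["The warranty covers parts and labour for two years.", "I like turtles."]
      labels = [1, 0]                                   # toy quality labels
      vec = TfidfVectorizer().fit(queries + replies)
      X = np.array([features(q, r, vec) for q, r in zip(queries, replies)])
      clf = LogisticRegression().fit(X, labels)
      print(clf.predict_proba(X)[:, 1])                 # estimated answer quality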
  • Yumi Kim, Meen Chul Kim and Jongwook Lee
    [Pdf] [Table of Content]
    In this study, we aim to propose automated evaluation methods of LLMs that approximate human judgment by exploring and comparing two distinct approaches: (1) LLM-based scoring, which utilizes GPT models with prompt engineering, and (2) feature-based machine learning, using transformer-based metrics such as BERTScore, semantic similarity, and keyword coverage. As part of this research, we participated in the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) task. We submitted the results of the test data set and the reserved data set to NTCIR-18 and analyzed the results obtained. The results show that GPT-4o Mini (with the updated prompt) achieved the highest performance, while the feature-based approach performed competitively, surpassing GPT-3.5 Turbo and showing a small gap with GPT-4o Mini. LLM-based methods offered scalability but lacked explainability, whereas feature-based approaches provided better interpretability but required extensive tuning, highlighting the trade-offs between the two strategies. Throughout the analysis, we expect that the findings of our work will provide insights into the understanding of human judgment and the automated evaluation of LLMs.
  • Chia-Hui Lin, Cen-Chieh Chen, Tao-Hsing Chang and Fu-Yuan Hsu
    [Pdf] [Table of Content]
    In recent years, large language models (LLMs) have been widely applied to various natural language processing (NLP) tasks, demonstrating exceptional performance. To evaluate the output quality of these LLMs, numerous studies utilize one LLM as an evaluator to assess the quality of outputs from other LLMs, showing promising results on public benchmarks. However, the performance of LLMs as evaluators on many unpublished benchmarks still needs improvement. To achieve better evaluation performance, some studies have attempted to fine-tune evaluators based on large amounts of data, incurring significant manual costs and posing substantial limitations in practical applications. Therefore, this paper leverages data augmentation to increase the volume of training data and employs the odds ratio preference optimization (ORPO) algorithm for reinforcement learning to optimize the evaluator. This study uses the dataset provided by NTCIR-18’s Automatic Evaluation of LLMs (AEOLLM) task for training and testing. The proposed method achieves an accuracy of 0.7658 on the summary generation subtask of AEOLLM, the highest among all compared models. Additionally, it yields the second-highest performance in both Kendall’s tau and Spearman correlation coefficient on the summary generation and text expansion subtasks among all compared models.
  • Lang Mei, Chong Chen and Jiaxin Mao
    [Pdf] [Table of Content]
    As large language models (LLMs) gain widespread attention in both academia and industry, it becomes increasingly critical and challenging to effectively evaluate their capabilities. Existing evaluation methods can be broadly categorized into two types: manual evaluation and automatic evaluation. Manual evaluation, while comprehensive, is often costly and resource-intensive. Conversely, automatic evaluation offers greater scalability but is constrained by the limitations of its evaluation criteria (dominated by reference-based answers). To address these challenges, NTCIR-18 (https://research.nii.ac.jp/ntcir/ntcir-18/tasks.html#AEOLLM) introduced the AEOLLM (Automatic Evaluation of LLMs) task, aiming to encourage reference-free evaluation methods that can overcome the limitations of existing approaches. In this paper, to enhance the evaluation performance of the AEOLLM task, we propose three key methods to improve the reference-free evaluation: 1) Multi-model Collaboration: Leveraging multiple LLMs to approximate human ratings across various subtasks; 2) Prompt Auto-optimization: Utilizing LLMs to iteratively refine the initial task prompts based on evaluation feedback from training samples; and 3) In-context Learning (ICL) Optimization: Based on the multi-task evaluation feedback, we train a specialized in-context example retrieval model, combined with a semantic relevance retrieval model, to jointly identify the most effective in-context learning examples. Experiments conducted on the final dataset demonstrate that our approach achieves superior performance on the AEOLLM task.
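    As one concrete illustration of the prompt auto-optimization idea above, a hedged sketch of an iterative refinement loop; llm() and evaluate() are placeholders, not the authors' code:

      def llm(instruction):
          """Placeholder for a chat-completion call that returns a revised prompt."""
          raise NotImplementedError

      def optimize_prompt(prompt, train_samples, evaluate, rounds=5):
          # evaluate(prompt, samples) -> agreement with human ratings in [0, 1]
          best, best_score = prompt, evaluate(prompt, train_samples)
          for _ in range(rounds):
              candidate = llm(
                  f"The evaluation prompt below scored {best_score:.3f} against human "
                  f"ratings. Rewrite it to agree with humans more often.\n\n{best}"
              )
              score = evaluate(candidate, train_samples)
              if score > best_score:                    # keep the better prompt
                  best, best_score = candidate, score
          return best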
  • Return to Top


    [FairWeb-2]


  • Sijie Tao, Tetsuya Sakai, Junjie Wang, Hanpei Fang, Yuxiang Zhang, Haitao Li, Yiteng Tu, Nuo Chen and Maria Maistro
    [Pdf] [Table of Content]
    This paper provides an overview of the NTCIR-18 FairWeb-2 Task. Our task considers not only document relevance but also group fairness. We designed two subtasks: the Web Search Subtask and the Conversational Search Subtask. We designed three types of search topics for this task: researchers (R), movies (M), and YouTube content (Y). For each topic type, attribute sets are defined for considering group fairness. For the Web Search Subtask, we received 23 runs from five teams, including six runs from the organisers' team. For the Conversational Search Subtask, we received four runs from two teams, including one run from the organisers' team. In this paper, we describe the task, the test collection construction and the official evaluation results of the submitted runs.
  • Clara Rus, Jasmin Kareem, Chen Xu, Yuanna Liu, Zhirui Deng and Maria Heuss
    [Pdf] [Table of Content]
    Balancing utility and fairness in the search results is an important and challenging problem for the IR community. The FairWeb-2 Task of NTCIR-18 aims to tackle this using three main search topics: movies, researchers and YouTube videos. This paper presents the approach employed by the AMS42 team as part of the FairWeb-2 Task of NTCIR-18. The AMS42 team submitted 5 runs. First, we focus on retrieving documents which are relevant to the given queries. Next, we employ two fairness approaches: one makes use of estimated sensitive attribute values to balance relevance and fairness in the retrieved results, while the other relies on the model's semantic understanding of sensitive attribute values derived from the document content. Finally, we discuss the challenges identified while working on the FairWeb-2 Task.
  • Atsuya Ishikawa, Sijie Tao and Tetsuya Sakai
    [Pdf] [Table of Content]
    This report presents the participation of the RSLFW team in the NTCIR-18 FairWeb-2 task. We implemented several different retrieval methods to generate five runs using BM25, ColBERT, and the PM-2 algorithm. In addition to the submitted runs, the results are analyzed through comparison with the official baseline and FairWeb-1 reproduction (revived) runs.
  • Narendra Kumar, Arjun Mukherjee, Sukomal Pal and Thomas Mandl
    [Pdf] [Table of Content]
    As information retrieval systems become increasingly sophisticated, ensuring fairness and algorithmic neutrality in search results has emerged as a critical challenge. Traditional ranking algorithms often prioritize relevance, which can unintentionally amplify the visibility of majority groups while limiting representation for minority perspectives. This imbalance can lead to biased search results that reinforce existing disparities. To address this issue, fairness-aware retrieval methods aim to ensure equitable representation by balancing relevance with exposure fairness while maintaining algorithmic neutrality. In this study, we investigate the impact of query modifications on group fairness in ranked search results. Specifically, we examine how expanding queries to encompass a broader range of relevant content influences fairness between different groups while considering their protected attributes. Our findings contribute to ongoing efforts to design information retrieval systems that provide more inclusive and bias-free access to information.
  • Amogh Raina and Tetsuya Sakai
    [Pdf] [Table of Content]
    This paper describes our participation in the Conversational Search Subtask of the FairWeb-2 Task at NTCIR-18. Our system, COPWA, was designed to balance conversational relevance and group fairness while retrieving entities from researcher, movie, and YouTube content topics. We detail our approach, evaluation results, and analysis of our system’s performance using the GFRC (Group Fairness and Relevance of Conversations) framework.
  • Huixue Su, Haitao Li, Yiteng Tu, Qingyao Ai and Yiqun Liu
    [Pdf] [Table of Content]
    The fairness of search systems remains a critical challenge in information retrieval. Building upon our previous work in FairWeb‑1, this paper presents the THUIR team’s approach in the NTCIR‑18 FairWeb‑2 Task. Specifically, we developed a simple yet effective retrieval pipeline that integrates multiple neural rerankers with results aggregated via Reciprocal Rank Fusion to generate balanced search rankings across various entity types. Additionally, we submitted a revived run that combines a PM2-based result diversification algorithm with dense retrieval scores. Our experimental results yield competitive performance on multiple evaluation metrics, demonstrating that enhancements in retrieval relevance inherently promote balanced group fairness. With the right combination of techniques, it is possible to achieve a synergistic reinforcement between relevance and fairness.
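    Reciprocal Rank Fusion itself is a standard aggregation step; a small illustrative sketch (document IDs are made up):

      from collections import defaultdict

      def rrf(rankings, k=60):
          # Each ranking is a list of document IDs from one neural reranker.
          scores = defaultdict(float)
          for ranking in rankings:
              for rank, doc_id in enumerate(ranking, start=1):
                  scores[doc_id] += 1.0 / (k + rank)
          return sorted(scores, key=scores.get, reverse=True)

      run_a = ["d3", "d1", "d2"]
      run_b = ["d1", "d3", "d4"]
      print(rrf([run_a, run_b]))   # fused ranking across rerankers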
  • Return to Top


    [FinArg-2]


  • Chung-Chi Chen, Chin-Yi Lin, Cheng-Chih Chiu, Hen-Hsen Huang, Alaa Alhamzeh, Yu-Lieh Huang, Hiroya Takamura and Hsin-Hsi Chen
    [Pdf] [Table of Content]
    This paper provides an overview of the FinArg-2 shared tasks in NTCIR-18. Building upon the fundamental argument identification tasks in FinArg-1, this iteration focuses on temporal inference. Forward-looking statements frequently appear in financial documents, and we aim to capture the duration of a premise's impact on a company's operations, the temporal reference associated with an argument, and the validity period of a claim. Similar to FinArg-1, we utilize earnings conference calls, professional research reports, and social media data for analysis. A total of 20 teams registered for FinArg-2, with 7 active teams submitting their results.
  • Adhitia Erfina and Phuong Le-Hong
    [Pdf] [Table of Content]
    FinArg-2 is part of the NTCIR Financial Argument shared task series, which aims to improve argument understanding in financial analysis. FinArg-2 introduces "Temporal Inference of Financial Arguments", focusing on the assessment of temporal information, a distinct phenomenon in financial opinions. FTRI participates in FinArg-2 on the Earnings Conference Calls (ECC) subtask, where models must identify the temporal reference associated with an argument. At the initial stage, we conducted experiments on variations of transformer models using several configurations at the preprocessing and training stages. BERT-Base-Uncased, BERT-Large-Uncased, and RoBERTa-Base-Uncased showed slightly superior performance compared to the other models, so we fine-tuned only those models as our baselines. For our first output, FTRI_ECC_1, we used a transformer encoder approach with BERT-Large, resulting in 71.43% Micro F1 and 68.58% Macro F1. For our second output, FTRI_ECC_2, we used an attention mask over Claim, Premise, and (Year + Quarter) with BERT-Base, resulting in 69.05% Micro F1 and 65.76% Macro F1. For our third output, FTRI_ECC_3, we used TF-IDF (Claim + Premise) + one-hot encoding (Year + Quarter) with BERT-Base, resulting in 77.38% Micro F1 and 75.07% Macro F1, the best result in this ECC subtask. The evaluation results show that our three output models rank in the top 4 among participants based on Micro and Macro F1.
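    A hedged sketch of the feature construction behind the third configuration described above (TF-IDF over claim+premise text plus one-hot year/quarter); the toy data and the linear classifier standing in for BERT-Base are illustrative:

      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.preprocessing import OneHotEncoder

      texts = ["claim: revenue will grow. premise: strong handset demand.",
               "claim: margins fell. premise: input costs rose last year."]
      meta = [["2023", "Q1"], ["2023", "Q2"]]            # (year, quarter) per argument
      labels = [0, 1]                                    # temporal-reference classes (toy)

      tfidf = TfidfVectorizer().fit(texts)
      onehot = OneHotEncoder(handle_unknown="ignore").fit(meta)
      X = np.hstack([tfidf.transform(texts).toarray(),
                     onehot.transform(meta).toarray()])  # text + categorical features
      clf = LogisticRegression().fit(X, labels)
      print(clf.predict(X))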
  • Xuan-Yu You, Di Jie Liew, Wen-Chao Yeh and Yung-Chun Chang
    [Pdf] [Table of Content]
    The TMUNLPG1 team participated in the FinArg-2 Task of NTCIR-18, focusing on the Detection of Argument Temporal References and Assessment of the Claim's Validity Period in the finance domain using Earning Conference Call and Social Media datasets. The team ranked 6th and 2nd in these subtasks, respectively. This paper presents the team's methodologies, results, and conclusions. For Earnings Conference Call (ECC) Argument Temporal References, we utilized a combination of feature engineering, ensemble strategy, and data augmentation to achieve a Micro F1 score of 0.6905. In Social Media Assessment of the Claim's Validity Period, we developed an enhanced approach combining domain-specific transformer architectures with statistical feature engineering. By integrating FinBERT with Log-Likelihood Ratio (LLR) and Pointwise Mutual Information (PMI) features, we achieved a Micro F1 score of 0.742 on the unified dataset and demonstrated robust performance on the test set. The methodology incorporates weighted pooling strategies and adaptive learning rate optimization to improve temporal validity prediction accuracy. Our results highlight the effectiveness of combining domain-specific language models with traditional statistical approaches in financial text analysis, contributing to advancements in temporal natural language processing for the financial domain.
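    As a small illustration of one of the statistical features mentioned above, a document-level pointwise mutual information (PMI) score between a token and a class (toy data, not the TMUNLPG1 corpus):

      import math

      docs = [("profits should rise next quarter", "short"),
              ("our ten-year growth strategy", "long"),
              ("guidance for the coming quarter", "short")]

      def pmi(token, label, docs):
          n = len(docs)
          t = sum(token in text.split() for text, _ in docs)           # docs with token
          c = sum(lab == label for _, lab in docs)                     # docs with class
          tc = sum(token in text.split() and lab == label for text, lab in docs)
          if t == 0 or c == 0 or tc == 0:
              return float("-inf")
          return math.log((tc / n) / ((t / n) * (c / n)))

      print(pmi("quarter", "short", docs))   # positive: "quarter" is associated with "short"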
  • Bor-Jen Chen, Wen-Hsin Hsiao, Jun-Yu Wu, Cheng-Yun Wu and Min-Yuh Day
    [Pdf] [Table of Content]
    The increasing availability of financial texts from earnings conference calls (ECCs) and social media has created a need for advanced natural language processing (NLP) techniques to extract meaningful insights. This study develops a classification framework that integrates fine-tuning and prompt-based learning to improve financial argument classification. We apply this framework to two tasks from the NTCIR-18 FinArg-2 competition: detecting temporal references in ECCs and assessing the validity period of claims in social media. Encoder-based models are fine-tuned for structured classification, while decoder-based models leverage both fine-tuning and prompt-based learning. Data augmentation techniques enhance model generalization, and performance is evaluated using Micro-F1 and Macro-F1 scores. The primary contribution of this research is demonstrating how fine-tuning and prompt-based learning can complement each other in financial NLP. By optimizing classification strategies, this study provides insights for improving argument analysis in financial applications, benefiting researchers, practitioners, and FinTech developers.
  • Takahiro Kawamoto and Xin Kang
    [Pdf] [Table of Content]
    This paper presents our participation in FinArg-2, which succeeds the FinArg-1 task. While FinArg-1 focused on sentiment analysis and argument classification, FinArg-2 extends this to temporal inference. We experiment with a method of classifying text into two types: "Premise" and "Claim." Based on these premises and claims, we have developed a method suitable for accurately classifying the temporal relationships between sentences. To classify sentences, we trained a classification model on labeled data and compared traditional machine learning approaches with models that use large-scale language models. Among the models tested, DeBERTa and Llama achieved the highest classification accuracy, demonstrating that models based on large-scale language models showed superior results.
  • Tong-Ru Wu and Jheng-Long Wu
    [Pdf] [Table of Content]
    Large Language Models (LLMs) have shown promising capabilities for zero-shot text classification, yet they often do not outperform fine-tuned traditional models like BERT when trained on sufficient labeled data. However, acquiring large-scale human-labeled datasets can be challenging, particularly in specialized domains. To address this gap, we propose Repeat-Error-Correction Learning, a framework that iteratively identifies and rewrites misclassified samples to augment the training set. First, we train a base BERT model using available text–label pairs. Next, the trained model infers labels on the same dataset, and we collect the misclassified samples. An LLM, such as GPT-4o-mini, then rewrites these erroneous texts while preserving their original labels. The rewritten texts are reintroduced into the training set, and the model is fine-tuned on this expanded corpus. By iteratively refining the training data through error correction and text rewriting, the proposed method aims to achieve robust classification performance despite limited initial annotations. Our results indicate that fine-tuning the base model by adding rewritten misclassified text achieved the highest validation set Micro-F1 score (77.33%). These findings contribute to a deeper understanding of a cost-friendly and efficient way to generate data for augmenting text classification models.
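    A schematic sketch of the Repeat-Error-Correction loop described above; fit(), predict() and the LLM rewrite call are placeholders for the BERT fine-tuning and GPT-4o-mini rewriting steps:

      def rewrite_with_llm(text, label):
          """Placeholder for an LLM call that paraphrases `text` while preserving `label`."""
          raise NotImplementedError

      def repeat_error_correction(train, fit, predict, rounds=3):
          data = list(train)                              # (text, label) pairs
          model = fit(data)
          for _ in range(rounds):
              wrong = [(t, y) for t, y in data if predict(model, t) != y]
              if not wrong:
                  break
              data += [(rewrite_with_llm(t, y), y) for t, y in wrong]
              model = fit(data)                           # retrain on the expanded corpus
          return model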
  • Pan Hongrui and Wu Jheng-Long
    [Pdf] [Table of Content]
    Social media claims often have shifting validity that influences downstream tasks like misinformation detection, financial predictions, and domain-specific decisions. This study proposes a novel approach that merges original text with automatically generated template text to highlight temporal cues. By integrating this enriched data into the training process, the model more effectively gauges how long a claim remains reliable, even when its relevance rapidly evolves. This strategy addresses the challenge of ephemeral statements whose validity fluctuates as new information emerges. Experimental results underscore the method’s effectiveness, achieving a macro-F1 score of 78.10%. These findings highlight the importance of systematically assessing claim longevity, providing a pathway to more robust content analysis and better-informed decisions in ever-changing online environments.
  • Min-Chin Ho and Jheng-Long Wu
    [Pdf] [Table of Content]
    The SCU-1 team participated in the "Detection of Argument Temporal References in Earnings Conference Calls" subtask of the NTCIR-18 FinArg-2 task. This study reports our approach to solving the problem and discusses the official results. We analyze the impact of step-by-step reasoning, model collaboration, and prompt design on the classification performance of large language models (LLMs). Through a series of experiments, we found that providing detailed explanations and incorporating previous model predictions significantly improved classification accuracy. Additionally, we compared different LLM discussion mechanisms and prompt design strategies, revealing that allowing models to reference each other and reason based on prior outputs effectively enhances decision-making quality. Run 3, which included complete reasoning steps and prior model outputs, achieved the best performance, highlighting the advantages of cross-model reference and optimized prompt design. These findings offer new directions for improving LLM-based classification tasks.
  • Sai Saketh Nandam, Charan Srinivas Kumar Reddy Dasari and Anand Kumar Madasamy
    [Pdf] [Table of Content]
    The SCaLAR IT team participated in the Detection of Argument Temporal References subtask of the NTCIR-18 FinArg-2 Task. This paper presents our approach to solving the classification of financial arguments based on temporal references. We explored multiple architectures combining a BERT-based model with knowledge-based and temporal feature extraction techniques. To improve performance, we integrated BERT with TF-IDF-based temporal features, extracted using STANZA and BERT embeddings, to enhance temporal reference detection. Our first model, BERTForSequenceClassifier, achieves a Micro F1 score of 70.24% and a Macro F1 score of 67.85%, outperforming most approaches of other teams. However, incorporating additional temporal features improved the Macro F1 score, indicating better performance across all classes. We analyze the effectiveness of different feature representations in our research.
  • Hugo Dutra, Leonardo Martinho, Gabriel Assis, Jonnathan Carvalho and Aline Paes
    [Pdf] [Table of Content]
    This paper presents AIDAVANCE's approach to Subtask 2 (Detection of Argument Temporal References) of the NTCIR-18 FinArg-2 Task. We explored different classification strategies, including direct multi-class classification, a hierarchical cascade approach that first identifies the presence of a temporal reference before further categorization, and an LLM-based argument rewriting method. Our best model, a fine-tuned mDeBERTa using the multi-class approach, ranked fourth overall, achieving a Micro-F1 score of 0.6905 and a Macro-F1 score of 0.6711. Our findings reinforce that fine-tuning smaller encoder models remains an effective strategy for specialized classification tasks, even outperforming state-of-the-art LLMs.
  • Return to Top


    [Lifelog-6]


  • Liting Zhou, Cathal Gurrin, Hsin-Hung Chen, Hideo Joho, Chenyang Lyu, Longyue Wang, Graham Healy, Ly Duyen Tran, Quang-Linh Tran, Hoang Bao Le, Duc-Tien Dang-Nguyen and Tianbo Ji
    [Pdf] [Table of Content]
    NTCIR-18 marked the sixth iteration of the Lifelog task, which aims to advance research on multimodal lifelog organization, search, and access. This task builds on methodologies successfully deployed in previous NTCIR conferences. In this paper, we detail the test collection, outline the specific tasks, provide an overview of submissions, and present findings from the NTCIR-18 Lifelog-6 task. We conclude with recommendations for future developments in lifelog research.
  • Luca Rossetto
    [Pdf] [Table of Content]
    This paper discusses vitrivr's participation in the Lifelog Semantic Access subtask of the 6th edition of the NTCIR Lifelog. It is based on the system that participated in the 2024 Lifelog Search Challenge and only replaces the interactive query interface with an LLM-based query transformation method. All results are generated in one pass without any further re-processing or refinement.
  • Quang-Linh Tran, Binh Nguyen, Gareth Jones and Cathal Gurrin
    [Pdf] [Table of Content]
    We present the participation of the MemoriEase lifelog retrieval system in the NTCIR-18 Lifelog 6 Task. The current MemoriEase system is an automatic and enhanced version of the MemoriEase system at the Lifelog Search Challenge 2024 (LSC). We report our methods for the two core sub-tasks in the NTCIR-18 Lifelog 6 task, Lifelog Semantic Access (LSAT) and Lifelog Question Answer (LQAT). We enhance the main architecture of the MemoriEase system utilizing the BLIP2 and CLIP embedding models to extract visual embeddings and perform a comparison between the two models. In addition, we also use pseudo-relevance feedback for ad-hoc queries. For the LQAT sub-task, we use our retrieval model as the retriever and GPT-4o as a reader to generate answers to questions. Results of the LSAT sub-task show that our system found 369 of the 1,995 relevant images. The performance on known-item search queries is higher than on ad-hoc queries, with 28.22% R@5 compared to 5.98% R@5. In the LQAT sub-task, the LLM generates correct answers for 8 of 24 questions. Although the performance is not high, it shows the advantages and drawbacks of the MemoriEase retrieval system and the QA model.
  • Jiahan Chen, Da Li and Keping Bi
    [Pdf] [Table of Content]
    In recent years, sharing lifelogs recorded through wearable devices such as sports watches and GoPros has gained significant popularity. Lifelogs involve various types of information, including images, videos, and GPS data, revealing users' lifestyles, dietary patterns, and physical activities. The Lifelog Semantic Access Task (LSAT) in the NTCIR-18 Lifelog-6 Challenge focuses on retrieving relevant images from a large collection of users' lifelogs based on textual queries describing an action or event. It serves users' need to find images about a scenario in the historical moments of their lifelogs. We propose a multi-stage pipeline for this task of searching images with texts, addressing various challenges in lifelog retrieval. Our pipeline includes: filtering blurred images, rewriting queries to make intents clearer, extending the candidate set based on events to include images with temporal connections, and reranking results using a multimodal large language model (MLLM) with stronger relevance judgment capabilities. The evaluation results of our submissions have shown the effectiveness of each stage and the entire pipeline.
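    The first stage of such a pipeline can be as simple as a sharpness filter; a hedged sketch using the variance of the Laplacian (the threshold and file names are arbitrary, and this is not the authors' code):

      import cv2

      def is_blurred(path, threshold=100.0):
          # Low Laplacian variance indicates few sharp edges, i.e. a blurred image.
          gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
          return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

      candidates = [p for p in ["lifelog_0001.jpg", "lifelog_0002.jpg"]
                    if not is_blurred(p)]
      # Query rewriting, event-based expansion and MLLM reranking would follow.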
  • Thang-Long Nguyen-Ho, Allie Tran, Minh-Triet Tran, Cathal Gurrin and Graham Healy
    [Pdf] [Table of Content]
    This paper presents our work in the Lifelog Semantic Access Task (LSAT) at NTCIR-18, focusing on automatic searching methods for finding distinct life moments. Our experiments explore and compare different retrieval strategies, including keyword matching-based search combined with embedding extraction, vector embedding-based semantic search using a multimodal model, and hybrid methods that take advantage of both approaches. Our proposed method improved retrieval accuracy by directing the model's attention to key query terms while prioritizing semantic relevance and the presence of requested entities in the retrieved moments. Experimental results demonstrated that the best-performing method relies on embeddings incorporating extended descriptions and highlighted keywords. Conversely, the hybrid methods in our experiments have less effective results, likely due to limitations in the keyword-matching search algorithm. This work's findings underscore the value of richer descriptive entities within queries for enhancing the retrieval of life moments, ensuring a focus on core semantic and visual elements.
  • Return to Top


    [MedNLP-CHAT]


  • Eiji Aramaki, Shoko Wakamiya, Shuntaro Yada, Shohei Hisada, Tomohiro Nishiyama, Lenard Paulo Tamayo, Jingnan Xiao, Axalia Levenchaud, Pierre Zweigenbaum, Christoph Otto, Jerycho Pasniczek, Philippe Thomas, Nathan Pohl, Wiebke Duettmann, Lisa Raithel and Roland Roller
    [Pdf] [Table of Content]
    This paper presents an overview of the Medical Natural Language Processing for AI Chat (MedNLP-CHAT) task, conducted as part of the shared task at NTCIR-18. Recently, medical chatbot services have emerged as a promising solution to address the shortage of medical and healthcare professionals. However, the potential risks associated with these chatbots remain insufficiently understood. Given this context, we designed the MedNLP-CHAT task to evaluate medical chatbots from multiple risk perspectives, including medical, legal, and ethical aspects. In this shared task, participants were required to analyze a given medical question along with the corresponding chatbot response and determine whether the response posed a potential medical, legal, or ethical risk (binary classification). Nine teams participated in this task applying different approaches, yielding valuable insights.
  • Hsuan-Lei Shao, Chih-Chuan Fan, Wei-Hsin Wang and Wan-Chen Shen
    [Pdf] [Table of Content]
    The NTCIR-18 MedNLP-CHAT RISK task evaluates the potential medical, ethical, and legal risks posed by chatbot-generated responses to patient inquiries. This study investigates a sentence-level risk classification approach to identify specific sentences within chatbot responses that contribute to risk assessment rather than treating entire responses as monolithic risk units. Our methodology involved automatic sentence segmentation, contextual risk annotation, and threshold-based classification, leveraging traditional natural language processing (NLP) models instead of large language models (LLMs) to ensure interpretability and stability. Despite the conceptual validity of our approach, our system did not perform competitively, particularly in ethical and legal risk classification. A key limitation was using a single model for all risk types, which failed to capture the nuanced distinctions between medical, ethical, and legal risk factors. Additionally, dataset constraints and class imbalance (fewer than 30 positive samples per risk category) limited model generalization. While sentence-level annotation improved granularity, it introduced challenges in handling cross-sentence risk dependencies, where risks emerge from multi-sentence interactions rather than isolated statements. Our findings highlight the need for more advanced risk classification frameworks, incorporating sequence-aware models, domain-specific fine-tuning, and context-sensitive risk evaluation. We also discuss the cultural relativity of risk perception, emphasizing that risk assessments should account for jurisdictional differences in medical, legal, and ethical norms. Future research should explore hybrid NLP architectures, data augmentation techniques, and adaptive risk modeling to enhance chatbot safety and reliability in medical AI applications.
  • Ayantika Das and Anupam Mondal
    [Pdf] [Table of Content]
    Risk prediction in medical, ethical, and legal contexts is crucial for ensuring safety and informed decision-making. This study explores machine learning approaches for the MedNLP-CHAT task, utilizing English-translated datasets from the Japanese and German subtasks. The textual data underwent preprocessing, including tokenization, n-gram extraction, and lemmatization, before being modeled using Logistic Regression, Nu-SVC (nu=0.1) [2], Gradient Boosting, and XGB Regressor. Objective risks were framed as a binary classification task, while subjective labels were predicted via regression, ensuring alignment with human-annotated distributions. Performance was evaluated using accuracy, precision, recall, F1-score, and Earth Mover’s Distance (EMD). The findings indicate the model’s strengths and weaknesses, emphasizing the need to enhance how class imbalances and potential overfitting are addressed. This work advances AI-driven risk assessment, with applications in regulatory compliance, healthcare, and ethical AI development.
  • Lenard Paulo V. Tamayo, Sa'Idah Zahrotul Jannah, Mohamad Alnajjar, Axalia Levenchaud, Shaowen Peng, Shoko Wakamiya and Eiji Aramaki
    [Pdf] [Table of Content]
    Chatbots are widely used in the healthcare sector, making their accuracy and reliability essential. Beyond providing factually correct information, chatbots must also consider the human aspect of their responses. Large language models (LLMs) can be utilized to evaluate chatbot responses, employing prompting strategies such as chain-of-thought and few-shot prompting to enhance reasoning and optimize output quality. This study evaluates a chatbot’s answers to medical questions using both objective and subjective assessments. Different prompting techniques were applied: objective evaluation used baseline, chain-of-thought (COT), and chain-of-thought with few-shot (COTF) prompting, while subjective evaluation used baseline and baseline with few-shot (Baseline-f) prompting. The results revealed that COTF prompting with both models improved the performance of objective evaluation, while few-shot prompting enhanced subjective evaluation.
  • Michael Van Supranes, Martin Augustine Borlongan, Joseph Ryan Lansangan, Genelyn Ma. Sarte, Shaowen Peng, Shoko Wakamiya and Eiji Aramaki
    [Pdf] [Table of Content]
    This paper presents our submission to the MedNLP-CHAT Task at NTCIR-18, which focuses on detecting medical, ethical, and legal risks in chatbot-generated responses. We propose a two-step prompt-based classification framework using the Gemini-1.5-flash model. The method first generates support statements to guide reasoning, which are then integrated into a few-shot prompt for final classification. We evaluated our approach on the English versions of the Japanese and German subtasks, submitting two systems per subtask that varied in example selection strategy and label distribution. Our systems achieved strong performance in detecting medical risks—particularly in the German subtask—while ethical and legal risks were more challenging. To better understand the design factors influencing performance, we conducted ablation studies across 24 prompt variants. Logistic regression and CHAID analyses revealed that accuracy depends on complex interactions between subtask language, example similarity, actual label, and selection method. Higher similarity improves classification of risk-present cases but harms performance on risk-absent cases, indicating a trade-off between recall and false positives. The k-nearest method was more effective under high similarity, while k-spread offered balanced results across classes. Although the two-step prompting strategy did not show a statistically significant advantage overall, the best-performing configuration used five support statements, with diminishing gains beyond that. Our findings suggest that optimized prompt design, particularly with controlled support and example selection, can improve risk detection without requiring large-scale training or high computational resources.
  • Aoi Ohara, Nanami Murata, Ami Yuge and Rei Noguchi
    [Pdf] [Table of Content]
    We developed model systems for detecting medical, legal, and ethical risks in medical chatbot answers by using BERT and ChatGPT language models. The ChatGPT model system, which refers to external medical knowledge, performed best in detecting medical risk, while the BERT model system performed well in detecting legal and ethical risks. The hybrid model system reduces missed risks by combining the best of the BERT and ChatGPT model systems and has the best recall values for all risk determination models. This study demonstrates the usefulness of utilizing external medical knowledge and the effectiveness of the hybrid approach.
  • Pei-Ying Yang, Tzu-Cheng Peng, Wen-Chao Yeh, Chien Chin Chen and Yung-Chun Chang
    [Pdf] [Table of Content]
    The TMUNLPG2 team participated in the Japanese subtask of the NTCIR-18 Medical Natural Language Processing for AI Chat (MedNLP-CHAT) Task. This paper presents our methodological approach and analyzes the official results. For the Japanese subtask, we implemented two distinct methodologies addressing the objective and subjective components. In the objective task, we fine-tuned a pre-trained language model enhanced with focal loss, comprehensive feature engineering, and strategic data augmentation techniques to optimize performance. For the subjective task, we developed specialized feature engineering methods to extract implicit semantic relationships within question-answer pairs, subsequently leveraging these features to train a robust deep learning architecture. Our approach yielded significant results, with TMUNLPG2 achieving the highest average F1-score among seven participating teams in the objective task and securing second place in the subjective task. These outcomes demonstrate the efficacy of our methodological framework and highlight its potential applications in advancing medical natural language processing systems.
  • Hiroki Tanioka
    [Pdf] [Table of Content]
    Artificial intelligence (AI) is rapidly transforming many fields, and healthcare is no exception. The current state of AI in healthcare is characterized by a shift toward addressing ethical concerns and developing a robust framework for AI integration. Generative AI, a subset of AI that includes Large Language Models (LLMs), has emerged as a game changer with the potential to revolutionize medical consultations. Therefore, the AITOK team participated in the Japanese and German subtasks of the NTCIR-18 MedNLP-CHAT task, using statistical knowledge only, GPT-3.5 Turbo, and GPT-4o. This report describes the problem-solving approach using generative AI for medical, legal, and ethical issues in medical consultation and its formal results.
  • Jun-Yu Wu, Cheng-Yun Wu, Bor-Jen Chen, Wen-Hsin Hsiao and Min-Yuh Day
    [Pdf] [Table of Content]
    The IMNTPU team presents a multilingual evaluation of Agentic AI for chatbot risk classification in the NTCIR-18 MedNLP-CHAT task. Our framework integrates fine-tuned small models, optimized few-shot prompting with GPT-4o, and multi-agent aggregation via majority and trust-weighted voting. Results show that Agentic AI enhances decision consistency, especially in subjective tasks like ethical risk, but yields limited gains in structured domains such as medical and legal assessment. Language-specific outcomes reveal that annotation quality and linguistic complexity jointly affect model performance, with Japanese systems showing the most stability. Confidence analysis highlights a decoupling between model certainty and accuracy, underscoring the need for adaptive trust and calibration strategies. Building on these insights, we propose a Trust-Guided Agentic AI architecture featuring self-consistency filtering, dynamic trust updating, and Chain-of-Thought prompting to further improve reliability in safety-critical AI systems.
  • Guanqi Cheng, Chang Qu and Ali Braytee
    [Pdf] [Table of Content]
    Our team, UTSolve, participated in the Medical Natural Language Processing for AI Chat (MedNLP-CHAT) task (https://sociocom.naist.jp/mednlp-chat/) at NTCIR-18. The task involved classifying various medical texts into medical, ethical, and legal risks. In this report, we utilized BioBERT, a pre-trained biomedical language model that was trained on a large amount of biological text data to predict the risk level of medical texts. We also evaluated the medical and clinical language models MedBERT and ClinicalBERT. Based on prediction performance, BioBERT achieved the best classification results, with a weighted F1 score of 0.7812 for medical risk, 0.8629 for ethical risk, and 0.7288 for legal risk.
  • Return to Top


    [RadNLP]


  • Yuta Nakamura, Koji Fujimoto, Jonas Kluckert, Michael Krauthammer, Jun Kanzawa, Akira Katayama, Tomohiro Kikuchi, Ryo Kurokawa, Wataru Gonoi, Yuki Tashiro, Shouhei Hanaoka, Shuntaro Yada and Eiji Aramaki
    [Pdf] [Table of Content]
    Radiology reports play a vital role in clinical workflows, serving as a primary means for radiologists to communicate imaging findings to physicians. However, the increasing number of imaging studies has made it challenging to produce and interpret comprehensive reports in a timely manner. Natural language processing (NLP) has shown potential to alleviate this burden, yet most existing studies are limited to English, while clinical reports are often written in local languages. To address this gap, we have developed and released Japanese medical text datasets through a series of shared tasks. Our recent efforts, including NTCIR-16 Real-MedNLP and NTCIR-17 RR-TNM, focused on automating lung cancer staging from radiology reports using the TNM classification system. This task is clinically significant, yet challenging due to the implicit nature of staging information and the complexity of TNM criteria. In this paper, we introduce the NTCIR-18 RadNLP 2024 shared task, which extends the previous task with finer-grained classification, a larger and bilingual corpus, and new sentence-level subtasks. We present the dataset, participating systems, and evaluation results, aiming to provide practical insights into building NLP systems for cancer staging support.
  • Yoshifumi Okura and Yuki Kataoka
    [Pdf] [Table of Content]
    This study aims to develop and evaluate a system that automatically extracts the TNM classification of lung cancer (T: primary tumor, N: lymph node metastasis, M: distant metastasis) from radiological diagnosis reports. In the initial experiments, inference was performed using `gemini-2.0-flash-thinking-exp-1219`. By incorporating explicit TNM classification criteria and unit specifications—features absent in conventional methods—and introducing error analysis and prompt improvements through meta-prompting, an overall accuracy improvement of approximately 15% was achieved after prompt modification. In the final evaluation, using the `o1 2024-12-01-preview` model, we achieved approximately 70% joint accuracy (fine), 76% T accuracy, 93% N accuracy, and 95% M accuracy. This paper provides a detailed account of the experimental procedures and the improvement process at each stage.
  • Junya Sato, Kosuke Kita, Daiki Nishigaki, Miyuki Tomiyama and Masatoshi Hori
    [Pdf] [Table of Content]
    In this paper, we describe our proposed systems for the Japanese main task and subtask of the Natural Language Processing for Radiology 2024 shared task. We employed Generative Pre-trained Transformer models and applied a few-shot prompting approach to tackle the classification task for lung cancer TNM staging from free-text radiology reports. Our method first performs zero-shot prompting using training data and then refines the final predictions by incorporating examples of incorrect predictions into the prompt. We demonstrate that this approach outperforms several BERT-based models and other open-source large language models. On the test data, our method achieved a Joint Accuracy (fine) of 0.732 for the main task and an overall micro F2.0 of 0.688 for the subtask, ranking 3rd in both categories.
  • Tsz-Yeung Lau and Shih-Hung Wu
    [Pdf] [Table of Content]
    This study investigates the application of Large Language Models (LLMs) for automated lung cancer staging based on radiology reports, as part of the CYUT team’s participation in the NTCIR-18 RadNLP Main Task. Through data analysis, we observed a moderate correlation among the T, N, and M staging classes. Experimental results indicated that jointly prompting LLMs to predict all three classes simultaneously yields improved performance. Additionally, standardizing measurement units to millimeters, rather than centimeters, proved to be a more effective strategy. Based on these findings, we refined our prompting methodology and applied it to both LLMs and reasoning-augmented models, including OpenAI’s O-series and DeepSeek-R1. These reasoning models, enhanced through post-training with Chain-of-Thought (CoT) reasoning, demonstrated superior staging accuracy. As LLMs are generative models, their outputs may vary across different runs, introducing inconsistency in predictions. To mitigate this variability, we adopted an ensemble learning strategy aimed at consolidating divergent LLM outputs into a more stable and reliable lung cancer staging system. Experimental results demonstrate that ensemble methods consistently outperform individual models, enhancing both the robustness and reliability of staging from radiology reports. Our approach achieved second place in the NTCIR-18 RadNLP Main Task (English), underscoring the effectiveness of LLM-based ensemble techniques for TNM classification. The implementation is available on GitHub at anson70242/NTCIR-18-RadNLP-CYUT.
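    The ensemble step described above can be illustrated with a simple majority vote over repeated LLM runs (the TNM labels below are made up):

      from collections import Counter

      def majority(labels):
          return Counter(labels).most_common(1)[0][0]

      t_runs = ["T2a", "T2a", "T1c"]     # T predictions from three LLM calls
      n_runs = ["N0", "N0", "N1"]
      m_runs = ["M0", "M0", "M0"]
      print(majority(t_runs), majority(n_runs), majority(m_runs))   # -> T2a N0 M0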
  • Ryutaro Mori, Koichi Okuda, Shota Hosokawa, Taisei Komoda, Tsudou Watanabe and Yasuyuki Takahashi
    [Pdf] [Table of Content]
    We participated in the NTCIR-18 RadNLP2024 shared task [1] and investigated the automation of TNM classification using large language models (LLMs), specifically GPT-4o-mini, GPT-4o, and o1-mini. Our approach integrates cosine similarity-based retrieval using embedding vectors and few-shot learning to enhance classification accuracy. As a result of the experiment, o1-mini achieved the highest classification accuracy. However, the accuracy on the test data declined by approximately 30% compared to the validation data. In particular, the low classification accuracy of the T factor highlighted challenges in interpreting tumor size and extent of infiltration. In this paper, we analyze these results and report our approach to this task along with official results.
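    A minimal sketch of the retrieval step described above: pick the k training reports whose embeddings are most cosine-similar to the test report and use them as few-shot examples (embed() is a placeholder for whatever embedding model is used):

      import numpy as np

      def embed(text):
          """Placeholder for an embedding model call returning a 1-D vector."""
          raise NotImplementedError

      def top_k_examples(report, train_reports, k=3):
          q = embed(report)
          def cos(v):
              return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
          sims = [cos(embed(t)) for t in train_reports]
          return sorted(range(len(train_reports)), key=lambda i: sims[i], reverse=True)[:k]
      # The selected reports and their gold TNM labels are inserted into the prompt.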
  • Daiki Shirafuji and Takafumi Niwa
    [Pdf] [Table of Content]
    Recent advances in language models (LMs) have significantly improved the handling of complex medical narratives compared to classical methods. However, one major obstacle to the practical usage of these LMs in the medical domain is that the models lack training on medical knowledge. In particular, standard tokenizers trained on open-domain corpora fail to accurately capture domain-specific terminologies, abbreviations, and writing styles in radiology reports or clinical notes. To address this issue, we propose a two-step domain-transfer method that updates both the tokenizer vocabulary and the LM representations. First, we replace low-frequency tokens in the original general-domain vocabulary with high-frequency bi- and tri-grams extracted from medical text, ensuring that domain-relevant tokens are learned. Second, we continually pre-train the LM on the medical corpus using masked language modeling to align the model parameters more closely with domain-specific language. We evaluated the effectiveness of this approach in the RadNLP 2024 shared task on lung cancer staging from radiology reports, covering both English and Japanese. Experimental results indicate that our method improves performance on this specialized task, suggesting that customizing tokenizers and re-training language models can substantially mitigate the domain gap. In future work, we will address standardizing radiology report formats to facilitate more robust and accurate automated analysis.
  • Soma Onishi, Daisaku Shibata, Masanori Tsujikawa, Ryo Ishii, Junya Tominaga and Hideki Ota
    [Pdf] [Table of Content]
    We propose a novel method for automatically inferring TNM stages from radiology reports. The proposed method includes a two-stage reasoning process. In Stage 1, kNN few-shot learning with the Chain of Thought is used for initial inference, followed by a self-review to evaluate the reasoning process. In Stage 2, if the inference results after the self-review are inconsistent, a second review is conducted from an alternative perspective. The proposed method achieved superior results in the NTCIR-18 RadNLP 2024 Main Task (Japanese), outperforming other teams by approximately 7.4 points, thereby winning the competition. The proposed method is designed as an extension of prompt engineering. It requires no complex training, which makes it applicable to various large language models.
  • Chirag Bhawnani, Dhananjaya Bedkani Linganaik, Sanjeeth J. Veigas and Vishnu Kumar Jakhoria
    [Pdf] [Table of Content]
    The management of lung cancer heavily relies on precise staging, which is traditionally derived from comprehensive radiology reports generated through imaging techniques like CT and MRI. However, these reports often lack explicit staging details, posing challenges for healthcare professionals who must manually extract relevant information. To address this issue, we propose an automated solution as part of our submission to the RadNLP (Natural Language Processing for Radiology) shared task at the NTCIR-18 international conference. Our approach utilizes tailored Natural Language Processing (NLP) techniques to enhance the processing of radiology reports. In this paper, we describe our methodology for the RadNLP subtask, which involves document segmentation to identify eight key classes within radiology reports, and the primary task, which focuses on the automated TNM staging of lung cancer. For the subtask, we employed an ensemble of three fine-tuned, hyperparameter-optimized BERT-based medical language models, which yielded an overall micro F2 score of 0.9433, securing the top rank in the competition. For the main task, we developed individual pipelines for T, N, and M staging, consisting of BERT-based models and LLMs in a multistage processing framework, resulting in a joint accuracy of 0.5679 and an overall 4th place finish in the competition. Our solution not only streamlines the extraction of critical information but also aims to improve the accuracy and efficiency of cancer staging, ultimately supporting clinical decision-making and contributing to better patient outcomes.
  • Aoi Kondo, Tan You Quan Bernon, Tsubasa Oka, Hiroaki Koga and Mikio Oda
    [Pdf] [Table of Content]
    The NITKC team participated in the RadNLP Shared task of TNM classification from lung cancer radiology reports written in English, using an LLM-based approach. LLM accuracy varies depending on training methods and the number of parameters. We aimed to solve this task using open-source LLMs with fewer parameters than closed-source, proprietary LLMs and made improvements accordingly. Open-source LLMs have less prior knowledge than closed-source LLMs, putting them at a disadvantage for TNM classification. To address this, we used Graph-RAG to improve accuracy and address issues by representing domain knowledge for unfamiliar tasks as a graph and incorporating it as knowledge into the LLM. This method uses a graph database to represent domain knowledge for TNM classification in a graph structure. It dynamically incorporates the graph information into LLM prompts, compensating for the knowledge gaps in open-source LLMs and enabling more accurate inference. Additionally, to enhance performance, we trained BioBERT and MedBERT on a dataset labeled with lung cancer progression stages and utilized these inference results concurrently. As a result, we achieved a joint accuracy of 0.2963 in the TNM classification task. This demonstrates that our approach effectively mitigates the limitations of open-source LLMs in TNM classification.
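    A toy sketch of the Graph-RAG idea described above: TNM criteria stored as a graph, with the neighbourhood of entities found in a report injected into the prompt (the criteria shown are abridged and illustrative, not the team's actual graph database):

      import networkx as nx

      g = nx.Graph()
      g.add_edge("tumour > 5 cm and <= 7 cm", "T3")          # abridged criteria
      g.add_edge("ipsilateral hilar lymph nodes", "N1")
      g.add_edge("contralateral lung nodule", "M1a")

      def graph_context(entities):
          lines = []
          for e in entities:
              if e in g:
                  lines += [f"{e} -> {nbr}" for nbr in g.neighbors(e)]
          return "\n".join(lines)                            # prepended to the LLM prompt

      print(graph_context(["tumour > 5 cm and <= 7 cm", "contralateral lung nodule"]))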
  • Marina Higashi, Rintaro Ito, Keita Kato, Ryota Asai, Shingo Iwano and Shinji Naganawa
    [Pdf] [Table of Content]
    Lung cancer is the most common cause of cancer death in Japan. The TNM classification is essential for lung cancer diagnosis and treatment planning, and CT imaging plays a crucial role in its evaluation. However, the number of thoracic radiologists is limited in Japan. The development of a system to automatically extract TNM classification from radiology reports would be beneficial to radiologists and other clinicians. Large language models (LLMs) have recently shown remarkable progress in natural language processing, opening new possibilities for medical applications. The NURad team participated in the NTCIR-18 Natural Language Processing for Radiology (RadNLP) task. This paper describes our approach to the problem and discusses the official results. We explored different prompts, LLM models (Llama3, OpenAI o1 Pro, Google Gemini 2.0, Google Notebook LM), and data types (Japanese and English). We also investigated fine-tuning with clinical data. The final model, utilizing a short prompt and trained on both Japanese and English datasets using Google Notebook LM, did not incorporate clinical data. Our final model with Google Notebook LM achieved a TNM (fine) score of 0.93 on the validation dataset. However, the score decreased to 0.54 on the test dataset. This decline was more pronounced for the T classification compared to the N and M classifications. This study demonstrates the potential of LLMs for automated TNM classification from radiology reports, but also highlights challenges in generalization to unseen data, particularly for T classification. Further research is needed to improve the robustness and accuracy of LLM-based TNM classification systems.
  • Keisuke Hidaka
    [Pdf] [Table of Content]
    Here, we report our approach to the NTCIR-18 RadNLP2024 Shared Task (Japanese Track, Main Task). In this study, we developed a system to determine the TNM classification of lung cancer from Japanese radiology reports. Specifically, we provided Google DeepMind’s Gemini 2.0 Flash Experimental (gemini-2.0-flash-exp) with a prompt that combines Chain-of-Thought (CoT) and Many-Shot In-Context Learning (ICL), enabling automatic prediction of the T, N, and M factors for each case. Besides accuracy, interpretability is crucial in the medical domain; thus, having the model output the rationale for its TNM classification ensures a degree of transparency. Moreover, by including numerous examples of CoT-based reasoning—written by a radiologist with 5 years of dedicated experience in diagnostic radiology—to explain how the TNM classification is derived, we achieved improved inference accuracy. Furthermore, to address privacy concerns and the need for local inference without network connectivity in clinical settings, we performed Supervised Fine-Tuning (SFT) using Gemma2-9b-it, a comparatively lightweight open-source model. By providing the model with CoT-based reasoning steps leading to TNM classification as training data, we observed improved inference accuracy. These findings demonstrate that additional data and prompt strategies to support large language model (LLM)-based inference can be highly effective in automating TNM classification while also indicating the feasibility of realizing interpretability in LLM-based medical applications.
  • Wuraola Oyewusi, Eliana Vasquez Osorio, Gareth Price and Goran Nenadic
    [Pdf] [Table of Content]
    The RadNLP 2024 (Natural Language Processing for Radiology) shared task at the international conference NTCIR-18 (English track) focuses on document classification for lung cancer staging, aiming to automatically determine the stage (i.e., the degree of progression) of lung cancer from radiology reports. Our approach involved data preprocessing, stratified data augmentation, and fine-tuning RadBERT—a transformer model pre-trained on radiology-specific text. We employed back-translation for data augmentation and 5-fold cross-validation to improve model robustness and address class imbalance. The results demonstrated that data augmentation significantly improved validation performance, with T accuracy increasing from 39.39% to 94.05% during K-fold validation and reaching 100% on the task validation set. However, a substantial performance gap was observed on the task test set, with joint accuracy dropping from 96.3% on the task validation set to 12.35%. This highlights challenges in model generalization due to limited dataset diversity and domain-specific language variability. This report details our methodology, results, and discusses the challenges encountered, highlighting the need for further research to improve the robustness and generalizability of automated lung cancer staging from limited radiology reports.
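    A hedged sketch of back-translation augmentation (English -> French -> English) with MarianMT; the pivot language and checkpoints are illustrative choices, not necessarily those used by the authors:

      from transformers import MarianMTModel, MarianTokenizer

      def load(name):
          return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

      def translate(texts, tok, model):
          batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
          return tok.batch_decode(model.generate(**batch), skip_special_tokens=True)

      en_fr, fr_en = load("Helsinki-NLP/opus-mt-en-fr"), load("Helsinki-NLP/opus-mt-fr-en")
      reports = ["A 3.2 cm nodule in the right upper lobe invades the visceral pleura."]
      augmented = translate(translate(reports, *en_fr), *fr_en)   # paraphrased copies
      print(augmented)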
  • Wen-Chao Yeh, Yan-Chun Hsing, Tzu-Yi Li, Nitisalapa Timsatid, Shih-Chuan Chang, Shih-Hsin Hsiao, Chu-Chun Wang, Pak-Yue Chan, Wen-Lian Hsu and Yung-Chun Chang
    [Pdf] [Table of Content]
    The TMUNLPG3 team participated in the Lung Cancer Staging main task and the Multi-label Sentence Classification subtask of the NTCIR-18 RadNLP Task. This paper illustrates our approach to addressing the challenges and discusses the official results. By adopting LLMs and few-shot prompt engineering, we achieved the highest score among all participants in the English track of the Lung Cancer TNM Staging main task. Our solution also performed excellently in the Multi-label Sentence Classification subtask.
  • Manuel-Carlos Díaz-Galiano, Lucas Molino-Piñar, Álvaro Herrera Arjonilla and Maite Martín-Valdivia
    [Pdf] [Table of Content]
    This paper presents our participation in the NTCIR-18 RadNLP 2024 English main task and subtask. We describe our proposed solution to address the problem and discuss the official results. Our approach is based on large language models, with additional experiments involving data augmentation, retrieval-augmented generation, and prompting for the main task. Additionally, for the subtask, we employed a ModernBERT model with pre-training and hyperparameter optimization. Our best-performing submission in the main task scores 0.5309 in overall joint accuracy (fine), and our best-performing submission in the subtask scores 0.8189 in overall micro F2.0. Results from additional runs also show that data augmentation could further improve model performance beyond our best submission.
  • Yuki Tashiro, Yuta Nakamura and Eiji Aramaki
    [Pdf] [Table of Content]
    This paper describes our approach to the RadNLP 2024 Main Task as participants of NTCIR-18. The RadNLP 2024 Main Task is to classify the stage of lung cancer from radiology reports. Our approach utilizes GPT-4o for inference, employing prompt engineering techniques. We achieved an accuracy of 0.5648 on the Japanese test data, demonstrating the robustness of closed-source models.
  • Takashi Nishibayashi, Mitsuhisa Ota and Masahiro Kazama
    [Pdf] [Table of Content]
    The Ubie team participated in the RadNLP core task on lung cancer staging classification based on Japanese radiology reports at NTCIR-18. This paper reports our approach and analyzes the official results. We investigated the impact of prompt engineering on TNM classification using large language models (LLMs). We compared multiple proprietary models available as of January 2025 (Gemini 1.5 Pro, Gemini Exp. 1206, and o1) using various prompt configurations, including zero-shot, few-shot, chain-of-thought (CoT), and self-feedbacked instruction. The results demonstrate significant performance improvements driven by model evolution in this medical text classification task. Analysis of prompt variations revealed differential impacts based on model capabilities. For Gemini models tested, explicitly prompting reasoning steps (CoT) led to the most substantial performance gains. In contrast, the o1 model, a reasoning model performing internal CoT and self-evaluation, showed limited benefit from explicit reasoning prompts, suggesting that strategies effective for non-reasoning models are less critical for advanced reasoning models. This finding, consistent with general guidance on prompting reasoning models, is also observed in our medical text classification experiments. The effectiveness of self-feedbacked instruction varied, showing no improvement for Gemini 1.5 Pro, possibly due to inadequate feedback generation and its dependence on factors like few-shot example selection. While prompt engineering offered limited gains for the reasoning model evaluated, it provided substantial performance benefits for non-reasoning models, highlighting its value for optimizing models without inherent advanced reasoning capabilities.
  • Aman Sinha and Ioana Buhnila
    [Pdf] [Table of Content]
    We present our results on the main task and subtask of the NTCIR-18 RadNLP 2024 shared task for the English language. We tested to what extent Large Language Models (LLMs) and Pretrained Language Models (PLMs) can identify and classify tumor types and subtypes. Our results for the main task showed that LLMs have difficulties in understanding different subtypes of tumors. For the tumor sentence segment classification subtask, we obtained a competitive overall score of 0.83 on the micro F2.0 metric with pretrained language models. Our results showed that in a low-data setting, clinical PLMs are a better choice than general and domain-specific LLMs. Providing additional information, such as definitions, for clinical staging classification can help LLMs achieve better scores on fine-grained classification.
  • Tomoki Terada and Rei Noguchi
    [Pdf] [Table of Content]
    We developed highly interpretable classification models of lung cancer stage using Bag-of-Words representations that consist of predefined key terms based on domain knowledge. These models had high medical validity and provided new clinical insights. This study demonstrates the effectiveness of domain knowledge in improving model accuracy and the usefulness of model interpretability in the medical field.
  • Yosuke Yamagishi, Ryosuke Tomiyama and Yui Ueda
    [Pdf] [Table of Content]
    Automated extraction of TNM staging information from radiology reports is a challenging task that requires understanding complex clinical language and applying detailed staging criteria. In this paper, we present our approach to the NTCIR-18 RadNLP 2024 shared task on automated lung cancer staging from Japanese radiology reports. We developed a hybrid system that combines large language models (LLMs) with rule-based processing in a two-stage pipeline: first extracting structured information from reports using GPT-4o models, then applying classification rules to determine the appropriate TNM stages. Our approach employed different strategies for each classification component: a rule-based method for the complex T classification and a more flexible LLM-based approach for N and M classifications. Evaluation results showed strong performance on the validation dataset (joint accuracy of 0.8148) but revealed a significant drop in T classification performance on the test dataset (from 0.8704 to 0.4769), while N and M classifications maintained high accuracy levels. This performance disparity highlights the trade-offs between rule-based precision and LLM flexibility in clinical NLP systems. Our findings suggest that balancing these approaches and leveraging larger development datasets could improve the robustness of automated cancer staging systems for real-world clinical applications.
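    A simplified sketch (not the authors' rule set) of what the rule-based second stage for the T classification can look like, using only the 8th-edition size thresholds and omitting invasion and satellite-nodule criteria; the field names are hypothetical placeholders for the structured output of the LLM stage:

        def t_category(extracted: dict) -> str:
            # `extracted` holds structured fields produced by the LLM stage,
            # e.g. {"tumor_size_cm": 4.1, "no_primary_tumor": False}.
            if extracted.get("no_primary_tumor"):
                return "T0"
            size_cm = extracted.get("tumor_size_cm")
            if size_cm is None:
                return "TX"          # primary tumor size not assessable
            if size_cm <= 3.0:
                return "T1"
            if size_cm <= 5.0:
                return "T2"
            if size_cm <= 7.0:
                return "T3"
            return "T4"

        print(t_category({"tumor_size_cm": 4.1}))   # -> T2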
  • Takahito Nakajima
    [Pdf] [Table of Content]
    Lung cancer TNM classification from narrative radiology reports presents challenges due to expression variability and complex relationships between findings. This study develops an automated TNM classification system utilizing large language models (LLMs) with supervised fine-tuning (SFT) and specialized prompting (SP) approaches. We evaluated our system on the NTCIR-18 RadNLP 2024 Task dataset, achieving 72.69% (Japanese) and 55.56% (English) fine-grained accuracy, ranking 5th among 15 teams. Our system demonstrated particularly high performance in N-factor classification (>93.98% accuracy) and in the subtask of textual analysis (ranking 1st in Japanese and 3rd in English tracks). Error analysis revealed challenges in interpreting complex expressions and implicit information. This system shows potential for clinical workflow optimization, standardization of TNM classification, and educational support, with implications for improving cancer staging practices.
  • Return to Top



    Pilot Tasks


    [Transfer-2]


  • Hideo Joho, Atsushi Keyaki, Yuuki Tachioka and Shuhei Yamamoto
    [Pdf] [Table of Content]
    This paper provides an overview of the NTCIR-18 Transfer-2 task that aims to bring together researchers from Information Retrieval, Machine Learning, and Natural Language Processing to develop a suite of technology for transferring resources generated for one purpose to another in the context of dense retrieval. Two subtasks were run for this round: the Retrieval Augmented Generation (RAG) subtask and the Dense Multimodal Retrieval (DMR) subtask. This paper presents the dataset developed and the evaluation results of participant runs. Note that this paper includes material from our earlier work published in [emtcir04], revised for the current work.
  • Yuuki Tachioka and Yasunori Terao
    [Pdf] [Table of Content]
    The ditlab team participated in the RAG and DMR subtasks of the NTCIR-18 Transfer-2 task. For the RAG subtask, we proposed a late fusion method for answer generation that uses multiple contexts retrieved by the dense passage retriever. Unlike sequential approaches that feed contexts into large language models (LLMs) one after another, our method processes contexts in parallel and employs majority voting to determine the final answer. We also fine-tuned the LLM using a LoRA-based method to better handle quiz-style questions, achieving gains of more than 10 points in accuracy over the baseline. For the DMR subtask, we introduce a modality-aware sensor encoder that processes numerical and textual sensor features separately, and enhance geolocation features by converting latitude/longitude data into address strings via k-nearest neighbor matching. Although our baseline performance is lower than the official baseline due to a mismatch between the training and evaluation data, our approach improved image-to-sensor retrieval performance over our own baseline.
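    A minimal sketch (assumed, not the team's implementation) of the late-fusion idea: one answer is generated per retrieved context and a majority vote picks the final output; `generate_answer` stands in for the LLM call and is not a specific API:

        from collections import Counter

        def late_fusion_answer(question, contexts, generate_answer):
            # One independent answer per retrieved context (parallel, not sequential).
            answers = [generate_answer(question, ctx) for ctx in contexts]
            # Majority vote over the normalized answer strings.
            winner, _ = Counter(a.strip() for a in answers).most_common(1)[0]
            return winner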
  • Riku Mizuguchi, Takeshi Yamazaki and Shuhei Yamamoto
    [Pdf] [Table of Content]
    This paper presents the participation of the YMX2L research team in the NTCIR-18 Transfer-2 Dense Multimodal Retrieval (DMR) task. Our approach focuses on the integration of visual and sensor data, leveraging data augmentation techniques and object detection to enhance retrieval performance. The experimental results demonstrate the effectiveness of our proposed methods and highlight key features that contribute to addressing the challenges of multimodal dense retrieval.
  • Return to Top


    [HIDDEN-RAD]


  • Key-Sun Choi and You-Sang Cho
    [Pdf] [Table of Content]
    The Hidden-Rad task, introduced as a pilot challenge at NTCIR-18, aims to improve the interpretability of AI systems in radiology-related diagnostic reasoning by encouraging models to explicitly explain the rationale behind clinical interpretations. Traditional radiology reports often focus on final diagnoses while omitting the underlying causal reasoning. To address this, Hidden-Rad defines two subtasks: Task 1 targets diagnostic explanation generation using radiology reports, with optional use of X-ray images; Task 2 evaluates the interpretation of diagnostic reasoning from structured clinical questionnaires. The task is built on an enriched subset of the MIMIC-CXR dataset and includes formal evaluation criteria provided via a public repository. In total, three teams submitted 40 runs for Task 1, while two teams submitted 16 runs for Task 2. The top-performing systems achieved 69% and 78.84% for each subtask, respectively, demonstrating the potential for integrating causal reasoning into clinical report generation. The findings highlight future directions for explainable medical AI through the use of domain-specific knowledge graphs and customized language models.
  • Youngseob Won, Younggyun Hahm, Chanhyuk Yoon and Seong Tae Kim
    [Pdf] [Table of Content]
    The Teddysum team participated in the HIDDEN-RAD task at NTCIR-18, which focuses on extracting and reconstructing causal explanations in radiology report generation. Our approach integrates Chain-of-Thought (CoT) prompting, Retrieval-Augmented Generation (RAG) leveraging RadGraph, and a Tree-of-Thought (ToT)-inspired evaluation mechanism to enhance causal reasoning. For Task 1, we employ KG-LLaVA, a visual language model, to convert chest X-ray images into textual descriptions before integrating them into our reasoning pipeline. For Task 2, our text-based framework directly applies structured prompting and retrieval-based reasoning. Our method secured 1st place in Task 2, demonstrating the effectiveness of structured causal inference in radiology report generation. We discuss the advantages, limitations, and future directions for improving AI-driven causal explanation models in medical applications.
  • Mercy Ranjit, Rahul Kumar, Shaury Srivastav, Anirban Porya and Tanuja Ganu
    [Pdf] [Table of Content]
    This paper presents the participation of the Microsoft Research RADPHI3 team in the Hidden-RAD Challenge: Hidden Causality Inclusion in Radiology Reports. The task aims to recover hidden causality from radiology reports, optionally accompanied by their corresponding frontal chest X-rays (CXRs). We fine-tune small language models, specifically Rad-Phi-3.5 Vision-CXR, to recover causality analysis in both language-only and multi-modal settings, given radiology reports and radiology images as inputs. We also include baselines of various models in the general domain, including models specifically tuned for reasoning tasks, such as GPT-4o, LLaMA 3.3, Phi-4, DeepSeek, OpenAI o1, OpenAI o1-mini, and OpenAI o3-mini. Through these experiments, we evaluated the effectiveness of general-domain, reasoning-specialized, and fine-tuned domain-specific small language models in generating causal explanations given radiology reports and, optionally, images as inputs.
  • Ju-Min Cho, Ho-Jin Yi, Myung-Kyu Kim, Se-Jin Jeong and Seung-Hoon Na
    [Pdf] [Table of Content]
    The nash team participated in the NTCIR-18 Hidden-RAD Task, focusing on generating causality-based diagnostic inferences from radiology reports. In Subtask 1, we applied a cost-efficient API-driven inference pipeline to recover hidden causalities within MIMIC-CXR reports. Our pipeline integrates few-shot in-context learning, retrieval-enhanced prompting, and strict candidate selection using an evaluation checklist. By leveraging retrieved similar cases to enrich the prompt dynamically, this approach achieved the highest ranking (1st place) in the official evaluation. In Subtask 2, we explored structured diagnostic reasoning using PRISMA-Guided Causal Explanation, applying prompt-based systematic reasoning to enhance interpretability. Our method, leveraging structured PRISMA flow with large language models, secured 2nd place in the official evaluation. Additionally, we investigated an alternative approach that combined fine-tuning and domain-specific prompting to improve model adaptability. While this method was not included in the final ranking, it demonstrated potential in enhancing domain-specific model interpretability. These findings contribute to the advancement of explainable AI (XAI) in radiology, bridging the gap between automated diagnosis and human expert decision-making.
  • Return to Top


    [SUSHI]


  • Tokinori Suzuki, Douglas W. Oard, Shashank Bhardwaj, Emi Ishita and Yoichi Tomiura
    [Pdf] [Table of Content]
    This paper describes the NTCIR-18 SUSHI Pilot Task. The task included two subtasks: folder search and archival reference detection. Details are presented for each subtask on the design of the test collection, the system runs submitted by participating teams, and the evaluation results for those submitted runs.
  • Haruki Fujimaki and Makoto P. Kato
    [Pdf] [Table of Content]
    This paper describes the KASYS team's participation in the NTCIR-18 SUSHI Task by presenting a multi-level metadata aggregation and retrieval approach for Subtask A, which focuses on retrieving undigitized historical materials with sparse item-level metadata. Our system leverages the hierarchical organization of the data, comprising Box, Folder, and Item levels, by aggregating metadata from lower to higher levels and applying two search strategies ("Merge" and "Each"). We evaluate traditional BM25 alongside dense retrieval models (E5 and ColBERT) without fine-tuning, and hyperparameter optimization using Optuna is employed to determine the optimal weight for each level. Although our multi-level score aggregation strategy was designed to exploit the hierarchical structure of the data, it did not yield a significant performance improvement over a simpler BM25 baseline. Future work will explore improved preprocessing of noisy metadata, hybrid retrieval methods combining BM25 with dense re-ranking, and model fine-tuning to further enhance performance in searching undigitized archival collections.
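    An illustrative sketch of the multi-level score aggregation described above; the weights below are hypothetical placeholders, whereas the paper tunes them with Optuna rather than fixing them by hand:

        # Hypothetical weights for each metadata level (Box / Folder / Item).
        LEVEL_WEIGHTS = {"box": 0.2, "folder": 0.5, "item": 0.3}

        def aggregate_scores(per_level_scores):
            # per_level_scores: {level: {folder_id: retrieval score}} computed by
            # BM25 or a dense model against metadata aggregated at that level.
            combined = {}
            for level, scores in per_level_scores.items():
                weight = LEVEL_WEIGHTS[level]
                for folder_id, score in scores.items():
                    combined[folder_id] = combined.get(folder_id, 0.0) + weight * score
            # Rank folders by their combined multi-level score.
            return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)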
  • Douglas W. Oard, Shashank Bhardwaj and Emi Ishita
    [Pdf] [Table of Content]
    The University of Maryland participated in both subtasks of the SUSHI Pilot Task. This paper describes the design of the systems used for each task, and it presents some preliminary analysis of the available results. The generation of data that has been shared with other participating teams is also described.
  • Tokinori Suzuki and Yoichi Tomiura
    [Pdf] [Table of Content]
    Kyushu University's team (QshuNLP) participated in both subtasks of the NTCIR-18 SUSHI pilot task. In this paper, we describe our approaches and systems, and analyze the results.
  • Return to Top


    [U4]


  • Yasutomo Kimura, Sato Eisaku, Kazuma Kadowaki and Hokuto Ototake
    [Pdf] [Table of Content]
    This paper provides an overview of the NTCIR-18 U4 shared task, which focuses on unifying, understanding, and utilizing unstructured data in financial reports. This task aims to improve methods for extracting and analyzing information, particularly from tables, within annual securities reports. These reports are crucial for understanding a company's financial performance, yet their complex and varied table structures present significant challenges for automated processing. To address these issues, the task comprises two subtasks, Table Retrieval and Table Question Answering, designed to evaluate and advance system capabilities for handling real-world financial documents. The dataset, drawn from TOPIX100 companies, encompasses diverse table formats and content, serving as a rigorous test bed for participants. Performance is assessed via a leaderboard that evaluates JSON-formatted system outputs, promoting transparent and reproducible results. The NTCIR-18 U4 task saw 10 active teams participate, making a total of 210 submissions.
  • Koji Tanaka, Daiki Shirafuji and Tatsuhiko Saito
    [Pdf] [Table of Content]
    Recently, Large Language Models (LLMs) have been gaining increased attention in the domain of Table Question Answering (TQA), particularly for extracting data from tables in documents. However, directly entering entire tables as long text into LLMs often leads to incorrect answers because most LLMs cannot inherently capture complex table structures. In this paper, we propose a cell extraction method for TQA without manual identification, even for complex table headers. Our approach estimates table headers by computing similarities between a given question and individual cells via a hybrid retrieval mechanism that integrates a language model and TF-IDF. We then select as the answer the cell at the intersection of the most relevant row and column. Furthermore, the language model is trained using contrastive learning on a small dataset of question-header pairs to enhance performance. We evaluated our approach on the TQA dataset from the shared task "Unifying, Understanding, and Utilizing Unstructured Data in Financial Reports" (U4) held at the NTCIR-18 conference, which our team (WhiteME) participated in. The experimental results show that our pipeline achieves an accuracy of 74.6%, outperforming existing LLMs such as GPT-4o mini (63.9%). In summary, we found that focusing on the header relationships through our hybrid retrieval strategy effectively addresses structural uncertainties in complex tables.
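    A minimal sketch of the row/column-intersection idea under simplifying assumptions (one row-header column and one column-header row); `embed` stands in for any sentence-embedding model and is not a specific library call:

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def hybrid_scores(question, headers, embed, alpha=0.5):
            # Blend TF-IDF (lexical) and embedding (semantic) similarity per header.
            tfidf = TfidfVectorizer().fit(headers + [question])
            lexical = cosine_similarity(tfidf.transform([question]), tfidf.transform(headers))[0]
            dense = cosine_similarity([embed(question)], [embed(h) for h in headers])[0]
            return alpha * dense + (1 - alpha) * lexical

        def answer_cell(question, table, embed):
            # table: {"row_headers": [...], "col_headers": [...], "cells": [[...], ...]}
            i = int(np.argmax(hybrid_scores(question, table["row_headers"], embed)))
            j = int(np.argmax(hybrid_scores(question, table["col_headers"], embed)))
            return table["cells"][i][j]   # cell at the intersection of the best row and column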
  • Long Si, Yin Zhang, Xiaotian Wang and Takehito Utsuro
    [Pdf] [Table of Content]
    The goal of this paper is to develop a system for participating in the information extraction task on tables in securities reports (the NTCIR-18 U4 Task). The NTCIR-18 U4 Task consists of two distinct subtasks: (1) retrieving the table that contains the relevant data, and (2) extracting the desired data from the table to address the question. For the first subtask, we utilize a pre-trained model that has demonstrated strong performance in table retrieval and fine-tune it to enhance its effectiveness for this specific task. For the second subtask, we employ the latest Large Language Models (LLMs), which have shown excellent results across a variety of Natural Language Processing tasks. This approach is expected to achieve state-of-the-art performance, surpassing existing pre-trained BERT-based models.
  • Yukihiro Seito
    [Pdf] [Table of Content]
    This paper presents the methods and results of Team SMM for the U4 task at NTCIR-18. In the Table Retrieval subtask, we designed methods for table retrieval using a cell-level multi-vector retriever and a single-vector retriever to enhance retrieval accuracy. The retriever first narrows down candidate tables to the top 10 based on retrieval score. Then, a cross-encoder-based reranker classifies these candidates into three categories: positive, negative, and hard negative. Finally, the table with the highest probability of being positive is selected as the final retrieved result. For the Table Question Answering subtask, we employ a T5-based model for answer generation to produce multiple candidate answers and introduce a Cell ID Estimator that identifies which cells in the table were used as the basis for generating each candidate answer by leveraging cell, row, and column embeddings. The estimator then selects the final answer based on the highest supporting cell score. The test set is divided into public and private splits, inspired by Kaggle's evaluation methodology. The public split is used for leaderboard updates, while the private split ensures robustness by preventing models from overfitting to leaderboard data. Final evaluations include both splits to provide a more reliable assessment of model performance. In the formal run, our method achieved an accuracy of 97.70% (public) and 97.55% (private) for Table Retrieval (ID 62), and for Table Question Answering, 86.34% and 86.57% on cell ID and value prediction, respectively, on the public split, with corresponding accuracies of 82.76% and 81.94% on the private split.
  • So Takasago and Tomoyoshi Akiba
    [Pdf] [Table of Content]
    In this paper, we propose a three-stage method for the U4 TableQA task. The method first analyzes and segments the target table into header and data cell sections using a machine learning classifier. Then, it generates natural language descriptions for each data cell using sentence templates based on the table structure. Finally, it retrieves relevant sentences matching the input question from the generated sentence set to form the TableQA result. This approach is also extended to the Table Retrieval task. Evaluation experiments showed that our method achieved an accuracy of 0.3569 on the Table Retrieval task, whereas for the TableQA task, the accuracy of cell_id prediction was 0.7797 and that of value prediction was 0.7168.
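    A minimal sketch of the template step under the simplifying assumption of one header row and one header column (the actual templates depend on the detected table structure, and the wording below is a hypothetical example):

        def verbalize_cells(row_headers, col_headers, cells):
            # Turn every data cell into a sentence that can be matched against
            # the question; keep the (row, column) indices alongside each sentence.
            sentences = []
            for i, row in enumerate(row_headers):
                for j, col in enumerate(col_headers):
                    sentences.append((f"The {col} of {row} is {cells[i][j]}.", (i, j)))
            return sentences

    The generated sentence most similar to the question then points back to the answer cell via its stored indices.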
  • Yuki Fujita, Ryota Mizushima, Hokuto Ototake and Kenji Yoshimura
    [Pdf] [Table of Content]
    This paper describes the proposed methods and results of the FUSINT team in the U4 task. For the Table Retrieval task, we propose a method for retrieving specific tables in Securities Reports based on a given question. Our approach involves filtering using cosine similarity and reranking, followed by a binary classification model. We achieved approximately 90% accuracy, but challenges remain in preprocessing and generalizing the section prediction model. Future work should explore methods that can handle a wider variety of question formats. For the Table QA task, we propose a method for identifying table cells in Securities Reports, focusing on standardizing table structures and resolving inconsistencies in cell values. One advantage of our approach is its ability to visualize the reasoning process. While challenges remain in handling hierarchical tables due to matrix segmentation, our method successfully identified cell positions with a high accuracy of approximately 92%.
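    An illustrative sketch of the filter-and-rerank pattern described above, collapsing the reranking and binary-classification stages into a single scoring function for brevity; `embed` and `positive_probability` stand in for the team's models and are assumptions, not real APIs:

        import numpy as np

        def retrieve_table(question, tables, embed, positive_probability, top_k=20):
            # Stage 1: coarse filter by cosine similarity between question and tables.
            q = np.asarray(embed(question))
            sims = []
            for t in tables:
                v = np.asarray(embed(t))
                sims.append(float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)))
            candidates = np.argsort(sims)[::-1][:top_k]
            # Stage 2: pick the candidate the classifier scores as most likely positive.
            best = max(candidates, key=lambda i: positive_probability(question, tables[i]))
            return tables[int(best)]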
  • Xin Fan, Kazuya Uesato, Yuma Hayashi and Tsuyoshi Morioka
    [Pdf] [Table of Content]
    The AIREV team participated in the NTCIR-18 U4 shared task, which comprises two subtasks, Table Retrieval (TR) and Table Question Answering (TQA), designed to evaluate and advance system capabilities for handling real-world financial documents. This paper reports our approach to the two subtasks and discusses the experimental results. Our proposed approaches are primarily based on fine-tuning pre-trained LLMs on the specific downstream tasks and involve several key components: converting tabular data to natural language representations, well-designed prompts, BERT-based re-ranking, and LLM-based retrieval. Our approaches placed second on the leaderboard for both the TR and TQA subtasks, demonstrating the effectiveness of our proposed method.
  • Hayato Aida, Kosuke Takahashi and Takahiro Omi
    [Pdf] [Table of Content]
    This paper reports the methods, results, and analysis of STMK24 for the NTCIR-18 U4 Table QA (TQA) task. STMK24 approaches TQA as a Visual Document Understanding task, transforming tables into three different modalities: image, text, and content layout. To comprehend table structures in a simple manner, our model is trained to infer the cell IDs of the tables, and the cell values are then extracted automatically through rule-based conversion. We investigated the impact of each modality on Table QA performance and confirmed that the model achieves high cell ID inference accuracy when utilizing all modalities.
  • Hiroyuki Higa, Maeyama Yuuki and Kazuhiro Takeuchi
    [Pdf] [Table of Content]
    Financial reports, such as securities reports, contain various figures and tables that play a crucial role in conveying structured information. In this study, we focus on the analysis of tables by integrating both textual and tabular data. We present a method that leverages natural language processing (NLP) techniques to assess the correctness of extracted information.
  • Return to Top