NTCIR-19

Task Overview and Call for Task Participation

The 19th NTCIR (2025 - 2026)

Evaluation of Information Access Technologies

Conference
December 8-10, 2026
NII, Tokyo, Japan

Evaluation Tasks


The nineteenth NTCIR (NTCIR-19) Program Committee has selected the following six Core Tasks and eight Pilot Tasks.
For details and latest information, please see below and visit each task’s homepage.

CORE TASKS

  • Automatic Evaluation of LLMs 2 ("AEOLLM-2")

    "AEOLLM 2 focuses on the automated evaluation of long-form deep research reports generated by LLMs. Participants will be asked to develop evaluation methods that automatically score the quality of the generated reports."

    Abstract:
    Building on the success of the NTCIR-18 core task AEOLLM, we propose AEOLLM-2 for NTCIR-19 to further investigate automatic evaluation methods for Large Language Models (LLMs), particularly in long-form text generation scenarios. In AEOLLM-2, we introduce a new subtask: Deep Research Evaluation. This subtask focuses on the automated evaluation of long-form deep research reports generated by LLMs. Participants will be asked to develop evaluation methods that automatically score the quality of the generated reports. The performance of each method will be measured by comparing its scores against human-annotated ground-truth labels. We believe that AEOLLM-2 will drive research in robust, scalable, and interpretable long-text evaluation techniques.

    Website: https://huggingface.co/spaces/THUIR/AEOLLM
    NTCIR-19 Kickoff Slide
    NTCIR-18 Conference Closing Slide

    Contact:

  • Argument Quality Assessment of Financial Forward-Looking Statements ("FinArg-3")

    "FinArg-3 aims to evaluate the quality of financial forward-looking arguments, with a focus on both linguistic and predictive reasoning aspects."

    Abstract:
    FinArg-3 is the third and final task in the FinArg shared task series, designed to evaluate the quality of financial forward-looking arguments across multiple data sources, including earnings calls, analyst reports, and social media posts. Building upon previous FinNum and FinArg tasks, FinArg-3 introduces a comprehensive framework for assessing arguments from both linguistic and predictive reasoning perspectives. The task consists of three subtasks: (1) multi-dimensional quality assessment of arguments in earnings calls based on specificity, strength, persuasiveness, and objectivity; (2) classification of analyst-provided scenarios based on their eventual realization; and (3) pairwise comparison of social media posts to identify more accurate forward-looking opinions. In addition to standard offline evaluation, the social media subtask incorporates a novel real-time evaluation setting where participants assess daily streaming pairs over a five-day period. FinArg-3 supports multilingual data (English and Traditional Chinese) and aims to foster research in financial NLP, argument mining, and LLM evaluation. By bridging subjective linguistic quality with real-world forecasting skill, FinArg-3 encourages the development of robust, trustworthy models for real-world financial applications.

    Website: https://sites.google.com/nlg.csie.ntu.edu.tw/finarg3/
    NTCIR-19 Kickoff Slide
    NTCIR-18 Conference Closing Slide

    Contact:

  • Hidden Causal Reasoning in Radiology Report Generation ("HIDDEN-RAD2")

    "Hidden-RAD2 is a task on generating and evaluating causal explanations in radiology reports by linking clinical findings with diagnostic impressions."

    Abstract:
    Hidden-RAD2 addresses the challenge of generating and evaluating causal explanations in radiology reports, focusing on how clinical findings logically support diagnostic impressions. Traditional radiology reports often present final diagnoses without explicitly documenting the reasoning behind them, making it difficult for both clinicians and AI models to capture diagnostic logic. Hidden-RAD2 provides structured input data, including radiologist-annotated impressions, anatomical locations, thoracic spine levels, and checklists of possible abnormalities, enabling participants to reconstruct the hidden causal reasoning that links observations to conclusions. The task consists of two settings: one based on radiology reports with optional imaging data, and another using only crowdsourced questionnaire responses without licensing restrictions. System outputs are explanatory text sections that explicitly connect findings with diagnoses. Evaluation combines automatic similarity metrics, domain-specific embeddings, large language model (LLM)-based rubric scoring, and expert qualitative assessment, with a strong emphasis on logical coherence and clinical validity. By encouraging systems to move beyond surface-level summarization toward structured diagnostic reasoning, Hidden-RAD2 aims to advance the development of explainable medical AI and enhance clinical trust in automated radiology report generation.

    Website: https://sites.google.com/view/hidden-rad2/
    NTCIR-18 Conference Closing Slide

    Contact:

  • Advancing Lifelog Analytics and Retrieval at NTCIR19 ("Lifelog-7")

    "Lifelog-7 aims to advance research in lifelog analytics and retrieval. It consists of multiple sub-tasks focusing on semantic access, knowledge mining, and question answering from lifelog and CASTLE data."

    Abstract:
    NTCIR-Lifelog-7 is the latest edition of the NTCIR lifelogging task series. Its goal is to support research on how to search, organize and make sense of rich lifelog data captured from daily activities. The task provides participants with a large, heterogeneous dataset that combines images, sensor readings and contextual information, including material from the CASTLE collection. Building on earlier lifelog tasks, Lifelog-7 offers several subtasks, each targeting a different challenge such as semantic access to lifelog content, personal knowledge discovery and question answering. These subtasks are designed to reflect realistic use cases, for example helping users to retrieve memories or understand patterns in their daily life. By benchmarking systems under a shared evaluation framework, Lifelog-7 aims to encourage new ideas and techniques for lifelog analytics and retrieval, and to contribute to the development of practical tools for personal memory support and context-aware information access.

    Website: https://lifelog-ntcir-project-bd9eae.gitlab.io/
    NTCIR-19 Kickoff Slide
    NTCIR-18 Conference Closing Slide

    Contact:

  • Medical Natural Language Processing for Emergency Call ("MedNLP-CALL")

    "MedNLP-CALL aims to classify the triage level of patients based on information from the emergency telephone calls and generate the corresponding medical notes."

    Abstract:
    Emergency call triage is designed to sort and prioritize calls so that the most serious, life-threatening cases receive immediate attention, while others are directed to services better suited to their condition. This helps medical staff and facilities allocate resources efficiently, prevent overcrowding in emergency departments, and ensure ambulances are dispatched in an orderly manner. However, the accuracy and safety in triage decision-making remain a challenge. Therefore, our shared task aims to build models that can automatically classify the triage level of patients based on information from telephone calls and generate medical notes, making it easier for healthcare providers to understand patient conditions. Participants will be provided with a dataset of dispatcher-caller dialogues, annotated with triage labels and medical notes, and will be tasked with predicting the correct triage level and generating corresponding notes.

    Website: https://sociocom.naist.jp/mednlp-call/
    NTCIR-19 Kickoff Slide
    NTCIR-18 Conference Closing Slide

    Contact:

  • Tip-of-the-Tongue ("ToT")

    "ToT retrieval involves re-finding an item for which the user cannot recall a reliable identifier."

    Abstract:
    The NTCIR 2026 Tip-of-the-Tongue (ToT) task extends previous ToT tracks and addresses the problem of known-item retrieval when searchers cannot reliably recall good identifiers for previously encountered items. ToT queries are often verbose and include both semantic memories (factual information about the target item) and episodic memories (contextual information about previous encounters), along with complex linguistic phenomena such as uncertainty markers, exclusion criteria, relative comparisons, and potentially inaccurate information due to memory limitations. The task will evaluate retrieval systems using ToT queries in English and East Asian languages against Wikipedia corpora, with participants returning ranked lists evaluated using standard metrics including normalized discounted cumulative gain (NDCG), reciprocal rank, and recall. Beyond core ToT challenges such as verbose query processing and inaccurate information handling, the task introduces novel East Asian-specific challenges, including cultural context interpretation, cross-lingual adaptation, and robustness to lower-resourced content. These additions advance our understanding of how cultural and linguistic factors influence information retrieval system effectiveness while providing valuable resources for the broader IR and NLP communities.

    Website: https://ntcir-tot.github.io
    NTCIR-19 Kickoff Slide

    Contact:

PILOT TASKS

  • Instruction Generation for Agentic Search ("AgenticInstruction")

    "The AgenticInstruction task seeks to identify effective and robust instructions for guiding Large Language Models (LLMs) in search agent roles."

    Abstract:
    The AgenticInstruction task seeks to identify effective and robust instructions for guiding Large Language Models (LLMs) in search agent roles. Participants are invited to submit instructions and generation methods covering four key search stages: query formulation, document selection (clicks), relevance judgment, and query reformulation. AgenticInstruction-1 centers on a high-recall Ad Hoc retrieval scenario in both Japanese and English. Submitted instructions will be evaluated in an iterative search setting, with performance measured at the session level. The task aims to establish best practices for instruction design in agentic search applications.

    Website: https://geniie-lab.github.io/ntcir/
    NTCIR-19 Kickoff Slide

    Contact:

  • Composed Access to Multimodal E-commerce Objects ("CAMEO")

    "CAMEO is a pilot task exploring composed image retrieval and review-based question answering for multimodal e-commerce product search, leveraging a Vietnamese e-commerce dataset with English translations for broader accessibility."

    Abstract:
    We propose CAMEO, a pilot task that explores two complementary challenges in multimodal product search: (1) Composed Image Retrieval, where a user provides a reference product image along with a textual modification to retrieve a visually altered variant; and (2) Review-Based Question Answering, where a user poses a natural-language question about a product, and the system retrieves or generates answers from customer reviews or product attributes. Both subtasks leverage the ViEcomRec dataset, which contains images, metadata, and over 369,000 Vietnamese-language product reviews from a real e-commerce platform. This task targets researchers in information retrieval, question answering, and multimodal understanding, and is designed to be evaluated asynchronously using automatic metrics.

    Website: https://sites.google.com/view/ntcir-cameo26/
    NTCIR-19 Kickoff Slide
    NTCIR-18 Conference Closing Slide

    Contact:

  • Data Analytics for aGRicultural Information ("DAGRI")

    "DAGRI is a task for extracting information and answering questions from figures and tables contained in agricultural documents."

    Abstract:
    DAGRI (Data Analytics for aGRicultural Information) aims to convert agricultural documents into structured, machine-readable formats. It also seeks to create a question-answering system for agricultural knowledge transfer. By doing so, the project promotes the digitization of region-specific agricultural expertise and contributes to building a sustainable, community-based model for knowledge sharing.

    Website: https://sites.google.com/view/dagri/
    NTCIR-19 Kickoff Slide
    NTCIR-18 Conference Closing Slide

    Contact:

  • Fact-based Event-centric Human-value Understanding ("FEHU")

    "Given a factual news article and associated human value categories, the task consists of two subtasks: (1) Human Value Recognition, which classifies human values expressed across the article, and (2) Human Value-aware Text Generation, which rewrites the article while preserving its original human value expressions."

    Abstract:
    The Fact-based Event-centric Human-value Understanding (FEHU) task aims to evaluate language models’ ability to identify and preserve human values in factual news articles. It includes two core subtasks. Subtask 1, Human Value Classification, involves multi-label classification of human values expressed across entire articles. Subtask 2, Value-preserving Text Generation, requires generating rewritten or summarized versions of articles while ensuring that the original human value expressions are faithfully preserved. All values are annotated using a structured taxonomy, including labels such as “equality,” “justice,” and “freedom of thought.” FEHU offers a unique benchmark for evaluating language models’ value alignment and ethical sensitivity in both classification and generation tasks.

    Website: https://sites.google.com/view/ntcir19fehu
    NTCIR-19 Kickoff Slide
    NTCIR-18 Conference Closing Slide

    Contact:

  • Pre-trained Model Retrieval ("ModelRetrieval")

    "ModelRetrieval defines a pilot benchmark for retrieving pre-trained models across NLP and image style transfer by predicting task-specific performance, enabling fast, cost-effective model selection."

    Abstract:
    Advances in AI have produced vast repositories of pre-trained models, yet choosing the right model for a new task still requires expensive trial-and-error. ModelRetrieval proposes a pilot evaluation that standardizes this problem as information retrieval: given a task description or style exemplar, systems rank candidate models by expected downstream performance without fine-tuning. The benchmark contains two subtasks. (A) Language Model Retrieval focuses on BERT variants for document classification; participants must predict the post-fine-tuning accuracy ranking using only the task’s train/validation splits and unlabeled test texts, and submissions are scored by nDCG@k. (B) Image Style Transfer Model Retrieval asks systems to rank style-transfer models by their ability to reproduce the style of a query image, evaluated by MRR and nDCG@k. We will release model pools, datasets, and ground-truth rankings derived from organizer-performed fine-tuning/evaluation, together with scripts and baselines. The task complements TREC’s Million LLMs track by covering both NLP and vision models. By providing shared data and metrics, ModelRetrieval lowers selection costs and accelerates practical deployment.
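    The organizers' scoring scripts are not described beyond the metric names, so as a rough illustration only, here is a minimal sketch of the two ranking metrics named above (nDCG@k and MRR), assuming graded relevance gains for nDCG and binary relevance for MRR; all function names and values are illustrative, not from the task:

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k gains, log2(rank+1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k):
    """nDCG@k: DCG of the submitted ranking divided by the ideal DCG."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

def mrr(ranked_relevance_lists):
    """Mean reciprocal rank over queries; each list holds binary relevance flags."""
    total = 0.0
    for rels in ranked_relevance_lists:
        total += next((1.0 / (i + 1) for i, r in enumerate(rels) if r), 0.0)
    return total / len(ranked_relevance_lists)
```

    Note that the official evaluation may differ in details such as tie handling or gain scaling; this sketch only fixes the standard definitions of the two measures.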

    Website: https://pre-trained-model-retrieval.github.io/
    NTCIR-19 Kickoff Slide

    Contact:

  • RAG Responses Confident and Correct? ("R2C2")

    "For a given single-answer question about movies, generate and rank passages (passage ranking subtask), and/or return an answer with a confidence score and supporting nuggets extracted from the passages (answering with confidence subtask)!"

    Abstract:
    INPUT: a single-answer question related to movies.
    Passage Ranking: generate a ranked list of passages.
    Answering with Confidence: return an answer with a confidence score and nuggets extracted from the passages.

    Website: http://sakailab.com/r2c2/
    NTCIR-19 Kickoff Slide
    NTCIR-18 Conference Closing Slide

    Contact:

  • Multinational, Multilingual, Multi-Industry Regulatory Compliance Checking ("RegCom")

    "RegCom is a pilot shared task that aims to evaluate multilingual and multimodal systems for automatically checking ESG report compliance with SASB standards across diverse countries and industries."

    Abstract:
    RegCom is a pilot shared task that focuses on the automatic assessment of regulatory compliance in Environmental, Social, and Governance (ESG) reports across multiple languages, countries, and industries. With ESG disclosures becoming increasingly important for transparency, investment decision-making, and regulatory oversight, the task addresses the need for scalable systems that can verify alignment with standardized frameworks such as the Sustainability Accounting Standards Board (SASB) guidelines. RegCom introduces two subtasks: (1) Full-Report Compliance Matching, which requires identifying relevant sections of a lengthy ESG report that correspond to specific SASB metrics and verifying their conformity, and (2) Single-Page Metric Verification, which evaluates whether a given page includes accurate disclosures for a specific metric. The task supports six languages—English, French, Japanese, Korean, Chinese, and Thai—collected from six corresponding countries and spanning three industries per country. Participants are expected to develop multilingual, multimodal systems capable of handling varied document structures, semantics, and reporting styles. By addressing real-world challenges in ESG compliance checking, RegCom contributes to advancing research at the intersection of natural language processing, information retrieval, and sustainable finance.

    Website: https://sites.google.com/view/ntcir19regcom/
    NTCIR-19 Kickoff Slide
    NTCIR-18 Conference Closing Slide

    Contact:

  • Cross-modal Claim Verification in Scientific Papers ("SciClaimEval")

    "Given a claim and corresponding evidence (which may be presented in figures or tables), determine whether the claim is supported or refuted by the evidence."

    Abstract:
    This shared task focuses on cross-modal scientific claim verification, aiming to assess whether textual claims in scientific papers are adequately supported by evidence from diverse modalities. We introduce a new benchmark dataset, constructed by extracting claims and their corresponding evidence from scientific articles across multiple domains, including biomedical, NLP, and ML/AI. We assume that the evidence in this task includes both textual and non-textual elements such as figures or tables. To create a realistic and challenging task, we manually perturb the supporting evidence to generate unsupported claims. Participants will be tasked with predicting whether claims are supported or unsupported based on the associated evidence provided, with performance evaluated using standard metrics such as precision, recall, and F1-score.
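    For the binary supported/unsupported setting described above, the named metrics reduce to their standard definitions. As a minimal sketch (the label strings and function name are illustrative assumptions, not from the task):

```python
def precision_recall_f1(y_true, y_pred, positive="supported"):
    """Standard binary precision, recall, and F1 over paired label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

    The organizers may also report macro-averaged variants over both classes; the sketch above treats "supported" as the positive class only.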

    Website: https://sciclaimeval.github.io/
    NTCIR-19 Kickoff Slide
    NTCIR-18 Conference Closing Slide

    Contact:

Last Modified: 2025-10-15
