The 16th NTCIR Conference
Evaluation of Information Access Technologies
June 14-17, 2022
National Institute of Informatics, Tokyo, Japan

    [Preface]


  • Noriko Kando, Charles L. A. Clarke, Makoto P. Kato and Yiqun Liu
    [Pdf] [Table of Content]


    [Overview]


  • Takehiro Yamamoto and Zhicheng Dou
    [Pdf] [Table of Content]
    This is an overview of NTCIR-16, the sixteenth sesquiannual research project for evaluating information access technologies. NTCIR-16 involved evaluation tasks related to information retrieval, natural language processing, question answering, and other areas; ten tasks were organized in total. This paper outlines NTCIR-16, including its organization, schedule, scope, and task designs, and presents brief statistics on the NTCIR-16 participants. Readers should refer to the individual task overview papers for detailed descriptions and findings.


    [Keynote]


  • ChengXiang Zhai
    [Pdf] [Table of Content]
    Due to the empirical nature of the Information Retrieval (IR) task, experimental evaluation of IR methods and systems is essential. Historically, evaluation initiatives such as TREC, CLEF, and NTCIR have made significant impacts on IR research and resulted in many test collections that can be reused by researchers to study a wide range of IR tasks in the future. However, despite its great success, the traditional Cranfield evaluation methodology using a test collection has significant limitations, especially for evaluating an interactive IR system, and it remains an open challenge how to evaluate interactive IR systems using reproducible experiments. In this talk, I will discuss how we can address this challenge by framing the problem of IR evaluation more generally as search simulation, i.e., having an IR system interact with simulated users and measuring the performance of the system based on its interaction with the simulated users. I will first present a general formal framework for evaluating IR systems based on search session simulation, discussing how the framework can not only cover the traditional Cranfield evaluation method as a special case but also reveal potential limitations of the traditional IR evaluation measures. I will then review the recent research progress in developing formal models for user simulation and evaluating user simulators. Finally, I will discuss how we may leverage the current IR test collections to support simulation-based evaluation by developing and deploying user simulators based on those existing collections. I will conclude the talk with a brief discussion of important future research directions in simulation-based IR evaluation.
  • Ellen Voorhees
    [Pdf] [Table of Content]
    Evaluating search system effectiveness is a foundational hallmark of information retrieval research. Doing so requires infrastructure appropriate for the task at hand, which has frequently entailed using the Cranfield paradigm: test collections and associated evaluation measures. Observers have declared Cranfield moribund multiple times in its 60-year history, though each time test collection construction techniques and evaluation measure definitions have evolved to restore Cranfield as a useful tool. Now Cranfield's effectiveness is once more in question since corpus sizes have grown to the point that finding a few relevant documents is easy enough to saturate high-precision measures, while deeper measures are unstable because too few of the relevant documents have been identified. In this talk I'll review how Cranfield evolved in the past and examine its prospects for the future.
  • Falk Scholer
    [Pdf] [Table of Content]
    Information retrieval makes extensive use of test collections for the measurement of search system effectiveness. Broadly speaking, this evaluation framework includes three components: search queries; a collection of documents to search over; and relevance judgements. In this talk, we'll consider two aspects of this process: queries, and relevance scales. Test collections typically use a single query to represent a more complex search topic or information need. However, different people may generate a wide range of query variants when instantiating information needs. We'll consider the implications of this for the evaluation of search systems, and the potential benefits and costs of incorporating variant queries into a test collection framework. Relevance judgements are used to indicate whether the documents returned by a retrieval system are appropriate responses for the query. They can be made using a variety of different scales, including ordinal scales (binary or graded) and techniques such as magnitude estimation. We'll examine a number of different approaches, and explore their benefits and drawbacks for judging relevance for retrieval evaluation.


    [Tutorial]


  • Tetsuya Sakai
    [Pdf] [Table of Content]
    I plan to cover the following topics in this tutorial:
    1. Why is (offline) evaluation important?
    2. On a few evaluation measures used at NTCIR
    3. How should we choose the evaluation measures?
    4. How should we design and build a test collection?
    5. How should we ensure the quality of the gold data?
    6. How should we report the results?
    7. Quantifying reproducibility and progress
    8. Summary



    Core Tasks


    [Data Search 2]


  • Makoto P. Kato, Hiroaki Ohshima, Ying-Hsang Liu, Hsin-Liang Chen and Yu Nakano
    [Pdf] [Table of Content]
    NTCIR-16 Data Search 2 is the second round of the Data Search task at NTCIR. The first round (NTCIR-15 Data Search) focused on the retrieval of statistical data collections. This round again addressed the problem of ad hoc data retrieval (the IR subtask) and planned additional subtasks, including a question answering (QA) subtask and a user interface (UI) subtask. This paper introduces the task definition, test collection, and evaluation methodology of the NTCIR-16 Data Search 2 subtasks. The IR subtask attracted seven research groups, from which we received 25 English runs and 23 Japanese runs. The evaluation results of these runs are presented and discussed in this paper.
  • Moeri Okuda, Ryota Mibayashi, Takafumi Kawahara, Naoaki Matsumoto, Kenji Tanaka, Takehiro Yamamoto and Hiroaki Ohshima
    [Pdf] [Table of Content]
    We describe a framework using a BERT-based query modification technique for the NTCIR-16 Data Search 2 IR subtask. Our framework follows a three-step procedure: (1) query modification, (2) item filtering by BM25, and (3) item re-ranking by BERT. The experimental results showed that the variant using query modification did not outperform the baseline method without query modification.
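    The filter-then-rerank pattern in steps (2) and (3) above is a common pipeline. Below is a minimal sketch, assuming the rank_bm25 package and a generic Hugging Face cross-encoder; the corpus, query, and checkpoint are illustrative stand-ins, not the team's actual components.

      # Sketch: BM25 candidate filtering followed by neural re-ranking.
      from rank_bm25 import BM25Okapi
      from sentence_transformers import CrossEncoder

      corpus = [
          "Population statistics of Tokyo by ward and year",
          "Monthly household expenditure survey results",
          "Annual report on agricultural production",
      ]
      query = "population of Tokyo"

      # Step (2): BM25 keeps only the top-k candidates.
      bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
      scores = bm25.get_scores(query.lower().split())
      top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:2]

      # Step (3): a cross-encoder re-ranks the surviving candidates.
      reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
      pairs = [(query, corpus[i]) for i in top_k]
      for i, score in sorted(zip(top_k, reranker.predict(pairs)),
                             key=lambda x: x[1], reverse=True):
          print(f"{score:.3f}  {corpus[i]}")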
  • Lya Hulliyyatus Suadaa, Lutfi Rahmatuti Maghfiroh, Muhammad Luqman and Isfan Nur Fauzi
    [Pdf] [Table of Content]
    In this paper, we present the system and results of the STIS team for the Information Retrieval (English) subtask of the NTCIR-16 Data Search task. Each item in the data collection consists of metadata paired with a set of data files. We used only the title, description, and tags from the metadata as input documents for retrieving a ranking of query-related data files. Our approach uses a pre-trained model to predict representative words for each document and then calculates the similarity between the query and those representative words as a ranking score.
  • Lin Li, Xinyu Chen and Sijie Long
    [Pdf] [Table of Content]
    The WUT21 team participated in the IR subtask of the NTCIR-16 Data Search 2 task. This paper reports our approach and discusses the official results. Our approach chooses a simple base model for the IR subtask and uses a document-based storage method that facilitates retrieval of specified fields, on which we formulate a retrieval strategy. Elasticsearch is a distributed full-text search and analysis engine based on Lucene, offering high performance, high scalability, and real-time operation. Built on Elasticsearch, our strategy uses the engine's built-in retrieval algorithms to retrieve topics and compute text similarity, and selects the best-performing algorithm for matching topic texts according to the official evaluation measure, nDCG@10. The final results show that a basic text-similarity algorithm already contributes strongly to performance on this retrieval task.
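    As a rough illustration of this strategy, the sketch below issues a BM25-scored multi-field query through the official Elasticsearch Python client; the index name, field layout, and query text are hypothetical.

      # Sketch: field-aware dataset retrieval on Elasticsearch
      # (BM25 is Elasticsearch's default similarity).
      from elasticsearch import Elasticsearch

      es = Elasticsearch("http://localhost:9200")
      resp = es.search(
          index="datasearch-metadata",  # hypothetical index name
          query={
              "multi_match": {
                  "query": "average income by prefecture",
                  "fields": ["title^2", "description", "tags"],  # boost titles
              }
          },
          size=10,  # the official measure was nDCG@10
      )
      for hit in resp["hits"]["hits"]:
          print(hit["_score"], hit["_source"].get("title"))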
  • Taku Okamoto, Tomokazu Hayashi and Hisashi Miyamori
    [Pdf] [Table of Content]
    This paper describes the system and results of Team KSU's work on the NTCIR-16 Data Search 2 IR subtask. The documents covered by this task consist of metadata extracted from governmental statistical data and the body of the corresponding statistical tables. The metadata documents are short, and the table bodies consist almost entirely of numbers, apart from titles, headers, and comments. Previous studies on ad hoc search for statistical documents mostly ranked documents using only the metadata, and few methods use the contents of the statistical tables. However, ranking methods using only metadata have not achieved performance comparable to conventional ad hoc search over text documents. Therefore, we propose a method that employs features of the table body of the statistical data together with a re-ranking method based on the neural network models used in neural search, and we verify how much the ranking results improve. For the features of the table body, we use eight types of features: four from the body itself and four from the whole table. As the neural search method, we use re-ranking based on scores predicted from features obtained by BERT and an MLP. The experiments showed that the method combining category search and BM25 achieved an nDCG@10 of 0.314 for Japanese and 0.069 for English, ranking 2nd for Japanese and 6th for English among all teams.
  • Levy Silva, Luciano Barbosa, Sonia Castelo, Haoxiang Zhang, Aécio Santos and Juliana Freire
    [Pdf] [Table of Content]
    In this paper, we describe the approach and results of the NYUCIN team in the NTCIR-16 conference. We participated in the Data Search 2 task, a shared task on ad-hoc retrieval for governmental statistical data composed of multiple subtasks. We report our work on the two subtasks we participated in: the English IR subtask and the UI subtask. For the IR subtask, we explored learning-to-rank approaches based on deep learning models. Given the limited training data available for this task, we employed a transfer learning method to train a deep neural network that learns how to match web tables and news articles using data available on the Web. The official evaluation shows that our approach attained the highest score among all submitted runs across all evaluation metrics. In particular, for the nDCG@5 measure, our score of 0.246 represents a 30% improvement over the second-best result in the NTCIR-16 Data Search 2 task. For the experimental UI subtask, we performed a preliminary user study to evaluate the effectiveness of the user interface of Auctus, a dataset search engine developed by our team.
  • Satanu Ghosh and Jiqun Liu
    [Pdf] [Table of Content]
    In this paper, we report our work and discuss the results for the NTCIR-16 Data Search 2 IR subtask. NTCIR-16 Data Search 2 was organized to improve current knowledge of, and to promote, dataset search among IR researchers. In this subtask, we performed ad-hoc retrieval of datasets for given queries. While the task was offered in English and Japanese, we competed only in the English subtask. We ranked datasets using traditional BM25-based ranking functions and recent language models. During the evaluation experiments, we also explored the impact of metadata features on the performance of dataset ranking. Our best-performing submission achieved scores of 0.153, 0.161, and 0.174 in nDCG@10, nERR@10, and Q-measure, respectively. On all metrics, this run ranked 13th among the 25 submitted runs.


    [DialEval-2]


  • Sijie Tao and Tetsuya Sakai
    [Pdf] [Table of Content]
    This paper provides an overview of the NTCIR-16 Dialogue Evaluation (DialEval-2) task. DialEval-2 is the successor of the NTCIR-15 DialEval-1 task and the NTCIR-14 Short Text Conversation (STC-3) task. DialEval-2 consists of two subtasks: the Dialogue Quality (DQ) subtask and the Nugget Detection (ND) subtask. Both subtasks aim at the automatic evaluation of customer-helpdesk dialogues. The DQ subtask requires participants to estimate three quality scores for each dialogue: task accomplishment, customer satisfaction, and dialogue effectiveness. The ND subtask is a classification task in which participants classify every turn of a dialogue to detect nugget turns, where a nugget is a turn that helps problem solving in the dialogue. In this paper, we introduce the task definition, data collection, evaluation measures, and the official evaluation results for the participants' runs.
  • Fan Li and Tetsuya Sakai
    [Pdf] [Table of Content]
    In this paper, we report the work of the RSLDE team at the dialogue evaluation (DialEval-2) task of NTCIR-16, including the Chinese and English dialogue quality (DQ) and nugget detection (ND) subtasks. We implemented two sentence-level baselines that fine-tune BERT and XLNet along with a linear layer for the ND subtask. In addition, we propose a model based on BERT to capture the structure and context information of a customer-helpdesk dialogue. This dialogue-level model modifies the input and embeddings of BERT. We add a Transformer encoder layer over this model as our third model for the ND subtask and first model for the DQ subtask. The second model for the DQ subtask is the same dialogue-level model but without the Transformer encoder layer. Our XLNet model generated the best run for both English and Chinese ND subtasks. Our dialogue-level model outperformed the baselines for the Chinese DQ subtask but not for the English DQ subtask.
  • Fei Ding, Xin Kang, Yunong Wu and Fuji Ren
    [Pdf] [Table of Content]
    In this paper, we report the work of the TUA1 team in the dialogue evaluation (DialEval-2) task of NTCIR-16, which consists of two subtasks: the Dialogue Quality (DQ) subtask and the Nugget Detection (ND) subtask. Our proposed method consists of two parts: a feature extractor and a feedforward network. The feature extractor employs pre-trained Transformer networks to extract the hidden representations of the dialogue utterances and employs a Latent Dirichlet Allocation (LDA) method to extract the topic information of these utterances. The feedforward network then concatenates the hidden representations and the topics extracted by the feature extractor, compresses them through several feedforward layers into a desired dimension, and finally predicts the quality scores and nugget types of the dialogues. Since the DialEval-2 dataset is composed of one-to-one translated Chinese and English dialogues, we employ pre-trained Transformer networks for Chinese and English, respectively, to extract the hidden representations, which makes it possible to process the subtasks on the Chinese and English datasets simultaneously. We train the neural network models with a mean squared error loss for both the dialogue quality prediction and nugget detection subtasks. Our proposed method achieves the best scores on the RSNOD and NMD metrics in both the Chinese and English dialogue quality subtasks among all participants. The results indicate that the proposed method is promising for learning a dialogue quality prediction system whose predictions are very close to those of the human annotators.
  • Ting-Yun Hsiao, Yung-Wei Teng, Pei-Tz Chiu, Mike Tian-Jian Jiang and Min-Yuh Day
    [Pdf] [Table of Content]
    In recent years, there has been a surge of interest in evaluating the quality of chatbot conversations. We participated in the Dialogue Quality (DQ) and Nugget Detection (ND) subtasks in both Chinese and English. The majority of existing approaches are based on the long short-term memory (LSTM) model; this paper instead develops fine-tuning methodologies for Transformer models on conversation tasks, with the goal of automatically determining the status of dialogue sentences in a dialogue system's logs and thereby helping customers resolve their problems. To evaluate the concept, we built a general framework for testing and reporting the performance of the XLM-RoBERTa model on conversational texts. The experimental findings for the two DialEval-2 subtasks demonstrate the efficacy of our strategy, with performance roughly on par with an LSTM-based baseline model. The main contribution of our study is identifying two elements crucial to improving the dialogue quality and nugget detection subtasks: tokenization methods and fine-tuning procedures.
  • Tao-Hsing Chang and Jian-He Chen
    [Pdf] [Table of Content]
    It is important to evaluate the quality of dialogues generated by chatbots. Most previous automatic evaluation methods have been based on models capable of processing time series (e.g., LSTM). This study presents three models applied to the dialogue quality and nugget detection subtasks. Specifically, the first model uses a Pegasus model that transforms dialogues into short summaries; the second uses a Bi-LSTM with only the internal model structure adjusted; and the third is a multi-agent model simulating situations in which multiple annotators produce different evaluations for the same text. The experimental results suggest that some of these ideas will need more refined experimental designs and broader testing of model parameters before they are applicable to this problem.


    [FinNum-3]


  • Chung-Chi Chen, Hen-Hsen Huang, Yu-Lieh Huang, Hiroya Takamura and Hsin-Hsi Chen
    [Pdf] [Table of Content]
    In the FinNum task series, we proposed the numeral category understanding (FinNum-1) and numeral attachment (FinNum-2) tasks for better comprehension of numerals in financial narratives. In FinNum-3, we present a novel task, fine-grained claim detection, and further integrate the new task with the previous ones. There are two subtasks in FinNum-3: (1) investor's claim detection (Chinese) and (2) manager's claim detection (English). This round of FinNum attracted ten research teams and received 10 submissions for the Chinese subtask and 15 submissions for the English subtask. This paper provides an overview of FinNum-3, including the task definition, data annotation, and participants' results.
  • Sohom Ghosh and Sudip Kumar Naskar
    [Pdf] [Table of Content]
    The third edition of the FinNum shared task, held with NTCIR-16, presented the challenge of classifying numerals in financial texts as in-claim or out-of-claim. It consisted of two claim detection subtasks: (i) professional analysts' reports written in Chinese and (ii) earnings conference calls transcribed in English. In this paper, we describe the approach our team (LIPI) followed in the English subtask of FinNum-3. The approach ensembles transformer-based models with a logistic regression model trained on BERT embeddings and handcrafted features. It outperformed the existing baseline, achieving a macro F1 score of 84.73% and a micro F1 score of 95.59% on the test set.
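    The following minimal sketch illustrates this kind of ensembling, with sentence-transformers embeddings standing in for the paper's BERT embeddings; the texts, the two handcrafted features, and the stand-in transformer probabilities are all illustrative, not the LIPI system.

      # Sketch: logistic regression over embeddings + handcrafted features,
      # averaged with a transformer classifier's probabilities.
      import numpy as np
      from sentence_transformers import SentenceTransformer
      from sklearn.linear_model import LogisticRegression

      texts = ["Revenue grew 12 percent year over year.",
               "The call is scheduled for 9 am Eastern."]
      labels = [1, 0]  # toy labels: 1 = in-claim numeral, 0 = out-of-claim

      emb = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
      handcrafted = np.array([[len(t.split()), sum(c.isdigit() for c in t)]
                              for t in texts], dtype=float)
      X = np.hstack([emb, handcrafted])

      lr = LogisticRegression(max_iter=1000).fit(X, labels)
      p_lr = lr.predict_proba(X)[:, 1]

      p_transformer = np.array([0.9, 0.2])  # stand-in transformer scores
      print((p_lr + p_transformer) / 2)     # simple probability averaging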
  • Xie-Sheng Hong, Jia-Jun Lee, Shih-Hung Wu and Tian-Jian Jiang
    [Pdf] [Table of Content]
    This paper describes our submission to the NTCIR-16 FinNum-3 shared task on fine-grained claim detection in financial documents. We submitted three runs in both the English and Chinese sections of the final test. Run 1 uses MacBERT (for Chinese data) and RoBERTa (for English data) with a classical BiLSTM classifier as the baseline of this study. In Run 2, we change the classifier to AWD-LSTM for comparison. Furthermore, to address the problem of unbalanced training data, we use a data resampling technique in both Run 1 and Run 2. In Run 3, we additionally attempt to extend the training data using GPT-2.
  • Tzu-Ying Chen, Yu-Wen Chiu, Hui-Lun Lin, Chia-Tzu Lin, Yung-Chung Chang and Chun-Wei Tung
    [Pdf] [Table of Content]
    In financial documents, numbers often carry important information beyond the textual data. Understanding the relationship and meaning between these numbers and words is therefore an active research direction. The main goal of the NTCIR-16 FinNum-3 task is to understand the meaning of numerals in financial reports by classifying both the category and the claim status of the target numerals. In this paper, we propose a system that predicts both at the same time. Our results show that an ensemble of fine-tuned BERT models performs best for predicting category and claim, reaching a 94.67% micro F1-score for numeral category classification and a 92.75% micro F1-score for claim detection.
  • Shunsuke Onuma and Kazuma Kadowaki
    [Pdf] [Table of Content]
    This study presents JRIRD's work on the FinNum-3 manager's claim detection subtask. Numeracy is essential in financial documents, and several studies have focused on representations of numerical information in natural language processing. For FinNum-3, we tried four representations of numerical values in text and experimented with joint learning using numeral category information. The results showed that the best numeral format depended on the pre-trained model, and that joint learning with numeral categories improved the performance for some combinations of pre-trained model and numeral format.
  • Yuxuan Liu, Maofu Liu and Mengjie Wu
    [Pdf] [Table of Content]
    This article describes how we dealt with the FinNum-3 task of NTCIR-16, in which the relationship between a numeral and a given label is the object of classification. Given a text, a target numeral, and its offset in the text, models must judge whether the target numeral is in-claim or out-of-claim. In our experiments, we use a BiLSTM architecture to classify the target numeral as in-claim or out-of-claim in two kinds of financial texts.
  • Alaa Alhamzeh, M. Kürsad Lacin and Előd Egyed-Zsigmond
    [Pdf] [Table of Content]
    The FinNum task series aims at a better understanding of numeral information in financial narratives. The goal of FinNum-3, on the English data, is fine-grained detection of managers' claims in earnings conference calls (ECCs) with the help of natural language processing (NLP). To predict in-claim and out-of-claim numerals, we fine-tune the BERT (Bidirectional Encoder Representations from Transformers) base model, which is pre-trained on a large corpus of English data. Our model achieves a macro-F1 score of 86.48% on the validation split and 87.12% on the test data.
  • Yung-Wei Teng, Pei-Tz Chiu, Ting-Yun Hsiao, Mike Tian-Jian Jiang and Min-Yuh Day
    [Pdf] [Table of Content]
    This paper provides a detailed description of the IMNTPU team's work at the NTCIR-16 FinNum-3 shared task on formal financial documents. We used an XLM-RoBERTa-based model with two different data augmentation approaches for the binary classification task in FinNum-3. The first run (IMNTPU-1) is our baseline: fine-tuning XLM-RoBERTa without data augmentation. Because the dataset is imbalanced, we expected data augmentation to improve task performance, so we applied double redaction and translation methods for data augmentation in the second (IMNTPU-2) and third (IMNTPU-3) runs, respectively. The best macro-F1 scores obtained by our team on the Chinese and English datasets are 93.18% and 89.86%, respectively. The major contribution of this study is a better understanding of data augmentation for imbalanced datasets, which may help mitigate class imbalance in the Chinese and English datasets.


    [Lifelog-4]


  • Liting Zhou, Cathal Gurrin, Graham Healy, Hideo Joho, Binh Nguyen, Rami Albatal, Frank Hopfgartner and Duc-Tien Dang-Nguyen
    [Pdf] [Table of Content]
    NTCIR-16 saw the fourth edition of the Lifelog task, which aimed to foster comparative benchmarking of approaches to automatic and interactive information retrieval from multimodal lifelog archives. In this paper, we describe the test collection employed, along with the tasks, the submissions, and the findings from the NTCIR-16 Lifelog-4 LSAT sub-task. We finish by suggesting future plans for lifelog tasks.
  • Zhiyu He, Jiayu Li, Wenjing Wu, Min Zhang, Yiqun Liu and Shaoping Ma
    [Pdf] [Table of Content]
    With the development of digital information storage technology and portable sensing devices, users have gradually become accustomed to recording their personal lives (i.e., lifelogs) in various digital ways. The retrieval of lifelog data has therefore become a new and essential research topic in related fields. Unlike traditional search engines, lifelog search must handle text and other data automatically recorded in real time by sensors, which complicates data organization and search. Because the dataset is highly personalized, interactions and feedback from users should also be considered in the search engine. This paper describes our interactive approach for the NTCIR-16 Lifelog-4 task: searching relevant lifelog images from users' daily lifelogs given an event topic. A significant challenge is bridging the semantic gap between lifelog images and event-level topics. We propose a framework that addresses this problem with a multi-functional and flexible feedback mechanism and result presentation for interaction in a search engine. In addition, we propose a query text parsing procedure that parses long query text into keywords and fills the search fields automatically. We analyzed the interactive lifelog search engine with 12 topics that we constructed following the LSC'18 development topics. Our official result at the NTCIR-16 Lifelog-4 task was a RelRet score of 741 over 48 topics.
  • Thao-Nhu Nguyen, Tu-Khiem Le, Van-Tu Ninh, Ly-Duyen Tran, Manh-Duy Nguyen, Minh-Triet Tran, Thanh-Binh Nguyen, Annalina Caputo, Sinead Smyth, Graham Healy and Cathal Gurrin
    [Pdf] [Table of Content]
    In this paper, we present the DCU and HCMUS teams' participation in the NTCIR-16 Lifelog-4 task using two different retrieval systems, LifeSeeker and Myscéal, which were originally introduced at the Lifelog Search Challenge (LSC) and adapted to address the Lifelog Semantic Access Task (LSAT). To tackle the task in an automatic manner, both LifeSeeker and Myscéal employed pre-processing techniques as part of the retrieval process, and LifeSeeker further applied a post-processing step to refine the retrieval results. For the interactive setting, we evaluated the Myscéal system in a user study with both expert and novice users, under both ad-hoc and known-item search settings.
  • Naushad Alam, Ahmed Alateeq, Yvette Graham, Mark Roantree and Cathal Gurrin
    [Pdf] [Table of Content]
    In this paper, we present two systems from DCU, named DCUMemento and DCUVOX, that earlier participated in the 2021 edition of the Lifelog Search Challenge and were redeveloped to participate in the NTCIR-16 Lifelog-4 task. Both systems use image-text embeddings from various CLIP models to build their search backends, with DCUVOX using the ViT-B/32 model while DCUMemento uses a weighted ensemble of scores from the ViT-L/14 and ResNet-50x64 models. The paper also discusses the query reformulation strategy used by the systems, in addition to the system architecture. Finally, we present the results of our evaluation and discuss the limitations of both systems, with details of improvements planned for future iterations.
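    Both systems rest on CLIP-style joint image-text embeddings. The following minimal sketch, assuming the Hugging Face transformers CLIP API with an illustrative checkpoint and image paths, shows the basic text-to-image ranking step such systems build on.

      # Sketch: rank lifelog images against a text query with CLIP.
      import torch
      from PIL import Image
      from transformers import CLIPModel, CLIPProcessor

      model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
      processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

      images = [Image.open(p) for p in ["img_001.jpg", "img_002.jpg"]]  # illustrative paths
      query = "eating breakfast in a cafe"

      inputs = processor(text=[query], images=images,
                         return_tensors="pt", padding=True)
      with torch.no_grad():
          img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
          txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                            attention_mask=inputs["attention_mask"])

      # Cosine similarity between the query and every image ranks the archive.
      img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
      txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
      print((txt_emb @ img_emb.T).squeeze(0).argsort(descending=True))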


    [QA Lab-PoliInfo-3]


  • Yasutomo Kimura, Hideyuki Shibuki, Hokuto Ototake, Yuzu Uchida, Keiichi Takamaru, Madoka Ishioroshi, Masaharu Yoshioka, Tomoyoshi Akiba, Yasuhiro Ogawa, Minoru Sasaki, Ken-ichi Yokote, Kazuma Kadowaki, Tatsunori Mori, Kenji Araki, Teruko Mitamura and Satoshi Sekine
    [Pdf] [Table of Content]
    The goal of the NTCIR-16 QA Lab-PoliInfo-3 task is to develop real-world complex question answering (QA) techniques using Japanese political information such as local assembly minutes and newsletters. QA Lab-PoliInfo-3 consists of four subtasks: QA Alignment, Question Answering, Fact Verification, and Budget Argument Mining. In this paper, we present the data used and the results of the formal run.
  • Ramon Ruiz-Dolz
    [Pdf] [Table of Content]
    The rVRAIN team tackled the Budget Argument Mining (BAM) task, which combines classification and information retrieval sub-tasks. For argument classification (AC), the team achieved its best results with a five-class BERT-based cascade model complemented with some handcrafted rules. The rules were used to determine whether an expression was monetary or not. Each monetary expression was then classified as a premise or a conclusion in the first level of the cascade model. Finally, each premise was classified into the three premise classes, and each conclusion into the two conclusion classes. For the information retrieval part (i.e., relation ID detection, or RID), our best results were achieved by a combination of a BERT-based binary classifier and the cosine similarity between dense embeddings of the monetary expression and the budget item.
  • Daigo Nishihara, Hokuto Ototake and Kenji Yoshimura
    [Pdf] [Table of Content]
    This paper reports the fuys team's work on the Budget Argument Mining subtask of the NTCIR-16 QA Lab-PoliInfo-3 task. We assigned ArgumentClass and RelatedID in different ways: ArgumentClass was assigned using BERT, and RelatedID was assigned using keyword extraction with TF-IDF. We also expected that accuracy could be improved by adding a flag indicating whether a speaker is a legislator ("giin-flag"). The results were good for ArgumentClass, but no improvement in RelatedID accuracy could be confirmed. Although the "giin-flag" made some difference to the results, the difference was small, and no clear advantage was found either with or without it.
  • Keiyu Nagafuchi, Rin Sasaki, Seiya Oki, Yasutomo Kimura and Kenji Araki
    [Pdf] [Table of Content]
    The OUC team participated in the Budget Argument Mining subtask of the NTCIR-16 Question Answering Lab for Political Information 3 (QA Lab-PoliInfo-3). In this paper, we report our methods for this task and discuss the results. We performed argument classification using a fine-tuned BERT classifier; this method achieved the second-highest score (0.5716) among the participants on the test data. We linked RelatedIDs using TF-IDF vectorization of documents and the calculation of their cosine similarity; this method achieved the highest score (0.6596) among the participants on the test data.
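    A minimal sketch of this TF-IDF linking step, with toy English strings in place of the Japanese assembly minutes and budget items:

      # Sketch: link a speech segment to the most similar budget item
      # via TF-IDF vectors and cosine similarity.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      budget_items = ["road maintenance budget for fiscal 2021",
                      "school lunch subsidy program",
                      "disaster prevention drill expenses"]
      speech = "the member asked about subsidies for school lunches"

      matrix = TfidfVectorizer().fit_transform(budget_items + [speech])
      sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
      print(sims.argmax(), sims.max())  # index and score of the best match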
  • Kohei Seguchi and Minoru Sasaki
    [Pdf] [Table of Content]
    In this study, we construct a system that predicts argument labels for statements in meeting minutes using a sequence labeling methodology, and we validate the prediction performance of various ways of feeding the utterance data to the model so as to predict argument labels for monetary expressions effectively. For evaluation, we use the Budget Argument Mining task data from NTCIR-16 QA Lab-PoliInfo-3, splitting the provided training data into model-training and validation subsets. As the prediction model, we use a BiLSTM-CNNs-CRF model that predicts an argument label for each word in the input and outputs the label sequence. In the experiments, we compare the prediction accuracy of models trained with different input schemes, such as varying the range of sentences containing monetary expressions. We found that prediction accuracy was higher when each sentence was entered into the model individually than when all of an assembly member's statements were entered at once. Furthermore, we found that accuracy can be improved by replacing the numbers in monetary expressions with special tokens.
  • Akio Kobayashi and Hiroki Sakaji
    [Pdf] [Table of Content]
    The SMLAB team participated in the Budget Argument Mining subtask of the NTCIR-16 QA Lab-PoliInfo-3 task. This paper reports our approach to solving this task and discusses the official results. Our model substantially underperformed the approaches of the other teams.
  • Naoki Igarashi, Daiki Iwayama, Hideyuki Shibuki and Tatsunori Mori
    [Pdf] [Table of Content]
    In this paper, we describe the development of a system for QA Alignment and a system for Fact Verification. We submitted 11 results for QA Alignment and 6 results, including 4 late submissions, for Fact Verification. We obtained an F-measure of 0.7753 for QA Alignment and an F-measure of 0.8563 for Fact Verification.
  • Yuuki Tachioka and Atsushi Keyaki
    [Pdf] [Table of Content]
    The ditlab team participated in the QA Alignment and Question Answering subtasks of the NTCIR-16 QA Lab-PoliInfo-3 task. First, we developed a QA alignment system that associates each question with its answer by using heuristic rules, optimized for assembly minutes, to build paragraphs of related sentences and then matching them; we prepared four types of features for the matching. Second, we built a QA system that uses a similarity measure to find the original question most similar to the question summary, and then identifies the answers associated with that question using the QA alignment results described above. A Text-to-Text Transfer Transformer (T5) was used to summarize the associated answer.
  • Yo Amano, Masayuki Matsumoto, Kousuke Sasaki and Kazuhiro Takeuchi
    [Pdf] [Table of Content]
    This paper proposes a method for budget argument mining using topic extraction based on utterance classification. We employ a domain-specific word embedding, which is calculated only from the given data, to link budget descriptions with corresponding arguments.
  • Ryoto Ohsugi, Teruya Kawai, Yuki Gato, Tomoyosi Akiba and Shigeru Masuyama
    [Pdf] [Table of Content]
    The AKBL team participated in the QA Alignment, Question Answering, and Fact Verification subtasks. For the QA Alignment subtask, our method first divides the given question and answer texts into semantically consistent segments, then applies the Hungarian algorithm with the BM25 similarity metric to align those segments. For the Question Answering subtask, our system first selects a short segment relevant to a given question summary from the answer text, then converts it into the answer summary using an abstractive summarizer based on the pre-trained BART. For the Fact Verification subtask, our best system first retrieves a passage relevant to a given claim from the assembly minutes, then checks whether the passage entails the claim using a BERT-based textual entailment classifier.
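    A minimal sketch of the alignment step, assuming the rank_bm25 package for BM25 scoring and SciPy's assignment solver; the segments are toy English stand-ins for the Japanese assembly texts.

      # Sketch: one-to-one segment alignment via the Hungarian algorithm
      # over a BM25 similarity matrix.
      import numpy as np
      from rank_bm25 import BM25Okapi
      from scipy.optimize import linear_sum_assignment

      questions = ["what is the plan for road repair",
                   "how will school lunches be funded"]
      answers = ["lunches will be funded by a municipal subsidy",
                 "roads will be repaired starting next spring"]

      bm25 = BM25Okapi([a.split() for a in answers])
      sim = np.array([bm25.get_scores(q.split()) for q in questions])

      # The Hungarian algorithm finds the assignment maximizing total similarity.
      rows, cols = linear_sum_assignment(sim, maximize=True)
      for q, a in zip(rows, cols):
          print(questions[q], "->", answers[a])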
  • Yasuhiro Ogawa, Yugo Kato and Katsuhiko Toyama
    [Pdf] [Table of Content]
    Our nukl team participated in the question answering (QA) subtask of NTCIR-16 QA Lab-PoliInfo-3. This paper describes our QA system for Japanese assembly member speeches based on T5. We generated answer summaries from two input types: the answerer's entire utterance and the answer text corresponding to the input question. We built one T5 model for each input type and determined the final output according to the length of the answerer's utterance. Our system achieved the highest score in both the automatic and human evaluations of this subtask.
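    A minimal sketch of T5-based answer summarization, using a small public checkpoint and an English toy input rather than the team's Japanese models and data:

      # Sketch: abstractive summarization of an answer passage with T5.
      from transformers import T5ForConditionalGeneration, T5Tokenizer

      tokenizer = T5Tokenizer.from_pretrained("t5-small")
      model = T5ForConditionalGeneration.from_pretrained("t5-small")

      answer_text = ("Regarding the road repair plan, the city will begin "
                     "resurfacing major routes next spring, prioritizing "
                     "school zones, and expects completion within two years.")
      inputs = tokenizer("summarize: " + answer_text,
                         return_tensors="pt", truncation=True)
      ids = model.generate(**inputs, max_new_tokens=40, num_beams=4)
      print(tokenizer.decode(ids[0], skip_special_tokens=True))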
  • Kazuma Kadowaki and Shunsuke Onuma
    [Pdf] [Table of Content]
    The JRIRD team participated in the budget argument mining subtask of the NTCIR-16 QA Lab-PoliInfo-3. This paper reports on our approach to solving this problem and discusses the official results. Our system consists of two BERT models that work independently toward two objectives: argument classification (AC) and related ID detection (RID). The results show that our system performs well, especially for argument classification.


    [WWW-4]


  • Tetsuya Sakai, Sijie Tao, Zhumin Chu, Maria Maistro, Yujing Li, Nuo Chen, Nicola Ferro, Junjie Wang, Ian Soboroff and Yiqun Liu
    [Pdf] [Table of Content]
    This is an overview of the NTCIR-16 We Want Web with CENTRE (WWW-4) task, the fourth round of an evaluation series that aims to quantify the progress and reproducibility of web search algorithms in offline ad hoc retrieval settings. For WWW-4, we introduced a new English web corpus, which we named Chuweb21. Moreover, in addition to bronze relevance assessments (i.e., those given by assessors who are neither topic creators nor topic experts), we collected gold relevance assessments (i.e., those given by topic creators). We received 18 runs from 4 teams, including two runs from the organiser team. We describe the task, data, and evaluation measures, and report on the official evaluation results.
  • Yuya Ubukata, Masaki Muraoka, Sijie Tao and Tetsuya Sakai
    [Pdf] [Table of Content]
    The SLWWW team participated in the NTCIR-16 We Want Web with CENTRE (WWW-4) task. This paper reports our approach and results in the ad hoc web search task. We applied two different methods to generate NEW runs: COIL (Contextualized Inverted List) and PARADE (Passage Representation Aggregation for Document Reranking). We also tried to reproduce the KASYS run, which was a top-performing run in the WWW-3 task.
  • Kota Usuha, Kohei Shinden, Makoto P. Kato and Sumio Fujita
    [Pdf] [Table of Content]
    The KASYS team participated in the English subtask of the NTCIR-16 WWW-4 task. This paper describes our approach to generating NEW runs and REV runs. For the NEW runs, we applied a BERT reading comprehension model to the WWW-4 task and investigated its effectiveness in ad-hoc Web document retrieval. The evaluation results showed that the four runs we submitted outperformed the baseline under the gold relevance assessments. The evaluation results of the REV runs showed that our WWW-3 runs still performed well in WWW-4.
  • Shenghao Yang, Haitao Li, Zhumin Chu, Jingtao Zhan, Yiqun Liu, Min Zhang and Shaoping Ma
    [Pdf] [Table of Content]
    The THUIR team participated in the English subtask of the NTCIR-16 We Want Web with CENTRE (WWW-4) task. This paper elaborates on our methods and discusses the experimental results. We adopted three methods: learning-to-rank models, a pre-trained language model tailored for information retrieval, and BERT with prompt learning. The experimental results demonstrate the importance of designing pre-training tasks specifically for information retrieval. The results also suggest that a relatively simple prompt method cannot effectively improve ranking performance.
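    For the learning-to-rank component, a minimal sketch with LightGBM's LambdaMART objective follows; the random features, labels, and query groups are stand-ins for real query-document data, and this is a generic baseline rather than the THUIR configuration.

      # Sketch: a LambdaMART learning-to-rank baseline with LightGBM.
      import numpy as np
      import lightgbm as lgb

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 10))     # 100 query-document feature vectors
      y = rng.integers(0, 4, size=100)   # graded relevance labels 0-3
      group = [10] * 10                  # 10 queries with 10 documents each

      ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50)
      ranker.fit(X, y, group=group)
      scores = ranker.predict(X[:10])    # scores for one query's documents
      print(scores.argsort()[::-1])      # ranked document indices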



    Pilot Tasks


    [RCIR]


  • Graham Healy, Tu-Khiem Le, Mai Boi Quach, Minh-Triet Tran, Thanh-Binh Nguyen and Cathal Gurrin
    [Pdf] [Table of Content]
    The NTCIR-16 RCIR pilot task aimed to motivate the development of a first generation of personalised retrieval techniques that integrate reading comprehension measures and eye-tracker signals as a source of information when ranking text content. The dataset used in the challenge was newly generated by capturing eye movement measures while experimental participants read text passages on a computer screen. The RCIR challenge included two sub-tasks: a) the comprehension-evaluation task (CET), which involved predicting a measure of a reader's comprehension for text passages, and b) the comprehension-based retrieval task (CRT), which involved retrieving relevant passage texts ranked by comprehension score. The participating teams were ranked using Spearman's correlation coefficient (rho) for the CET sub-task and normalised Discounted Cumulative Gain (nDCG) for the CRT sub-task.
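    The two official measures are standard and easy to compute; a minimal sketch with toy scores, using SciPy and scikit-learn:

      # Sketch: the CET measure (Spearman's rho) and the CRT measure (nDCG).
      from scipy.stats import spearmanr
      from sklearn.metrics import ndcg_score

      true_comprehension = [0.9, 0.4, 0.7, 0.2]
      pred_comprehension = [0.8, 0.5, 0.6, 0.3]

      rho, _ = spearmanr(true_comprehension, pred_comprehension)
      print(f"CET Spearman's rho: {rho:.4f}")

      # ndcg_score expects shape (n_queries, n_docs): the ranking induced by
      # the predicted scores is compared against the true gains.
      print(f"CRT nDCG: {ndcg_score([true_comprehension], [pred_comprehension]):.4f}")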
  • Manh-Duy Nguyen, Thao-Nhu Nguyen, Binh Thanh Nguyen, Annalina Caputo and Cathal Gurrin
    [Pdf] [Table of Content]
    Reading is one of the most common everyday activities: people read in many daily contexts, whether studying or seeking entertainment in their spare time. Despite its critical role in our lives, there has been limited research on how people read and how reading affects their level of understanding. The NTCIR-16 RCIR challenge is the first collaborative evaluation that aims to automatically measure a reader's comprehension and integrate it into the information retrieval process. In this paper, we present our approach for the NTCIR-16 RCIR challenge, in which participants are required to predict reading comprehension from readers' eye movement signals. We used several conventional machine learning techniques to estimate the level of comprehension and combined them with a language model to perform text retrieval. Our extensive experiments, covering both subject-dependent and subject-independent scenarios, showed that our approach with fine-tuning obtained a Spearman's coefficient of 0.5993 on the comprehension-evaluation task and an nDCG of 0.7296 on the comprehension-based retrieval task.
  • Yumi Kim, Aluko Ademola, Jeong Hyeun Ko and Heesop Kim
    [Pdf] [Table of Content]
    We participated in the CET sub-task of the NTCIR-16 reading comprehension information retrieval (RCIR) task, adopting five regression models: Linear Regression, Random Forest Regressor, Gradient Boosting Regressor, eXtreme Gradient Boosting (XGB) Regressor, and Voting Regressor. We submitted the prediction results on the test data to NTCIR-16 and analyzed the obtained results. Throughout the analysis, we found that Gradient Boosting and Random Forest generally perform better, with Spearman's rho of 0.53 and 0.57, respectively. In addition, the feature importance analysis indicated that each participant shows different eye-tracking tendencies related to their reading comprehension. Our findings may bring insight into human reading and information seeking processes as studied with eye-tracking systems and various regression models.
  • Kim-Nghia Liu, Vinh Dang, Thanh-Son Nguyen and Minh-Triet Tran
    [Pdf] [Table of Content]
    The HCMUS team participated in the RCIR (Reading Comprehension in Information Retrieval) task of NTCIR-16. The RCIR task evaluates techniques for ranking text content using eye tracking information. In this paper, we present our methods for the Comprehension-evaluation Task (CET). We follow a feature processing and engineering strategy and adopt techniques such as BERT, PCA, and AutoML to generate the output results. Our best solution achieves a Spearman's correlation coefficient of 0.50846.


    [Real-MedNLP]


  • Shuntaro Yada, Yuta Nakamura, Shoko Wakamiya and Eiji Aramaki
    [Pdf] [Table of Content]
    A standard dataset collection is essential for the development of information science. In the medical field in particular, where privacy protection is a critical issue, such datasets are especially important. To assess the validity of various methods, we built a clinical text dataset, Real-MedNLP, for multiple medical tasks. The goals of Real-MedNLP are threefold: (1) real datasets: previous medical shared tasks (MedNLP, MedNLP2, and MedNLPDoc) were based on pseudo datasets built from medical textbooks or dummy clinical texts, whereas this task provides real radiology reports and case reports; (2) bilingual capability: both English and Japanese data are handled; (3) practicality: both a fundamental task (named entity recognition) and applied practical tasks are offered. This study introduces the task setting of Real-MedNLP and the submitted systems. The methods mostly share a common paradigm based on a foundational language model such as BERT, aiming to separate out the resource problems. Based on the results, this study discusses the feasibility of these approaches and suggests future directions for medical NLP. Note that Real-MedNLP is a shared task that handles real Japanese medical texts.
  • Satoshi Hiai, Shoji Nagayama and Atsushi Kojima
    [Pdf] [Table of Content]
    The AMI team participated in subtasks 1 and 2 of the NTCIR-16 Real-MedNLP task. In this paper, we report the systems we employed for both subtasks. In subtask 1, the organizers provide only a small amount of training data; in recent years, BERT-based approaches have achieved excellent results in such low-resource situations. We construct two systems based on a BERT model pre-trained on biomedical documents (UTH-BERT): an ensemble method built on hidden vectors from multiple layers of UTH-BERT, and a fine-tuning method with a CRF layer. In subtask 2, participants construct their methods based on the annotation guideline. We construct a multistage method to identify named entities, consisting of three stages: candidate extraction, identification, and tag correction. We discuss the effectiveness of our systems on the basis of our preliminary experiments and the formal run results.
  • Yongwei Zhang, Rui Cheng, Lu Luo, Haifeng Gao, Shanshan Jiang and Bin Dong
    [Pdf] [Table of Content]
    The SRCB team participated in Subtask 1, Few-resource Named Entity Recognition (NER), and Subtask 3, Adverse Drug Event detection (ADE), in NTCIR-16 Real-MedNLP. This paper reports our approach and discusses the official results. For the few-resource NER subtask, we developed NER systems based on a pre-trained model, span-based classification, and prompt learning; data augmentation and model ensembling were used to further improve performance. For the ADE subtask, we mainly adopted two methods: multi-class classification and prompt learning. We employed a two-stage training strategy to address the long-tail distribution problem and applied transfer learning to improve the performance of the model.
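    The NER systems above are variants of transformer token classification. A minimal sketch of that underlying setup, with an illustrative checkpoint and label set (the classification head below is untrained; fine-tuning on the annotated reports would precede any real use):

      # Sketch: BERT-style token classification for medical NER.
      import torch
      from transformers import AutoTokenizer, AutoModelForTokenClassification

      labels = ["O", "B-DISEASE", "I-DISEASE", "B-DRUG", "I-DRUG"]
      tok = AutoTokenizer.from_pretrained("bert-base-cased")
      model = AutoModelForTokenClassification.from_pretrained(
          "bert-base-cased", num_labels=len(labels))

      batch = tok("The patient developed a rash after taking amoxicillin.",
                  return_tensors="pt")
      with torch.no_grad():
          logits = model(**batch).logits  # (1, seq_len, num_labels)
      pred = logits.argmax(-1).squeeze(0)
      for token, label_id in zip(tok.convert_ids_to_tokens(batch["input_ids"][0]),
                                 pred):
          print(token, labels[int(label_id)])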
  • Zhongguang Zheng, Lu Fang, Yiling Cao and Jun Sun
    [Pdf] [Table of Content]
    In this paper, we describe the approaches of the FRDC team for the Real-MedNLP task. Specifically, the FRDC team participated in three subtasks: Subtask1-CR-EN, Subtask3-CR-EN (ADE), and Subtask3-RR-EN (CI). The Real-MedNLP task aims to promote approaches that support real medical services under constrained training resources. We applied pre-trained language models (PTLMs) such as BERT and BioBERT to learn sentence and document representations, and designed different networks based on PTLMs for each subtask. Effective methods such as data augmentation were adopted in each subtask. In the official run, we achieved the best score in the CI subtask and ranked 2nd in the ADE subtask.
  • Joseph Cornelius, Oscar Lithgow-Serrano, Vani Kanjirangat, Fabio Rinaldi, Koji Fujimoto, Mizuho Nishio, Osamu Sugiyama, Kana Ichikawa, Farhad Nooralahzadeh, Aron Horvath and Michael Krauthammer
    [Pdf] [Table of Content]
    In this paper, we discuss our contribution to the NII Testbeds and Community for Information Access Research (NTCIR)-16 Real-MedNLP shared task. Our team (ZuKyo) participated in the English subtask: Few-resource Named Entity Recognition. The main challenge in this low-resource task was the small number of training documents, annotated with a large number of tags and attributes. For our submissions, we used different general and domain-specific transfer learning approaches in combination with multiple data augmentation methods. In addition, we experimented with models enriched with biomedical concepts encoded as token-based input features.
  • Koji Fujimoto, Mizuho Nishio, Osamu Sugiyama, Kana Ichikawa, Joseph Cornelius, Oscar Lithgow-Serrano, Vani Kanjirangat, Fabio Rinaldi, Aron Horvath, Farhad Nooralahzadeh and Michael Krauthammer
    [Pdf] [Table of Content]
    We describe our submissions to the NTCIR-16 Real-MedNLP shared task. This paper presents the approach of the ZuKyo-JA subteam to the Japanese parts of Subtask 1 and Subtask 3 (Subtask1-CR-JA, Subtask1-RR-JA, Subtask3-RR-JA), based on a sliding-window approach using a Japanese BERT pre-trained masked language model. The methods used for these subtasks have much in common, despite the differences between the tasks. We also show how to make aggressive use of medical knowledge for data labeling, data augmentation, and same-class identification in Subtask3-RR-JA.
  • Tomohiro Nishiyama, Mihiro Nishidani, Aki Ando, Shuntaro Yada, Shoko Wakamiya and Eiji Aramaki
    [Pdf] [Table of Content]
    This paper describes how we tackled the Real-MedNLP task on medical natural language processing as participants in NTCIR-16. We used a BERT model to solve this task and found that the BERT model we trained achieved the best results in terms of F1-score.
  • Benjamin Holmes, Adam Gagorik, Joshua Loving, Foad Green and Hu Huang
    [Pdf] [Table of Content]
    In this paper, we present our approach to subtasks 1, 2, and 3 of the NTCIR-16 Real-MedNLP challenge, using the English-language corpora (CR-EN and RR-EN). In subtasks 1 and 2, the goal was to create an NLP system that adds tags to case reports (CR) or radiology reports (RR). In subtask 3, two applications of this system were tested: determining which RRs in a group refer to the same sample, and estimating the probability that a medication caused the side effects described in a report. Our approach leveraged keyword extraction through a medical metathesaurus (MetaMap), sentence structuring using a SciSpacy model, and word embeddings from a trained BERT model. Using this approach, we completed the three subtasks with high levels of accuracy.
  • Masao Ideuchi, Masatoshi Tsuchiya, Yiran Wang and Masao Utiyama
    [Pdf] [Table of Content]
    This paper describes the NICTmed team's approach to Subtask1-CR-EN, Subtask1-CR-JA, Subtask3-CR-EN, and Subtask3-CR-JA in NTCIR-16 Real-MedNLP. In Real-MedNLP, approximately 100 annotated real clinical reports in both English and Japanese are given to participants. Subtask1-CR-EN/JA and Subtask3-CR-EN/JA are both based on case reports; Subtask 1 is few-resource named entity recognition (NER), and Subtask 3 is information extraction for adverse drug events (ADE). We used multilingual BERT (mBERT) and XLM-RoBERTa (XLM-R) to compare how well multilingual pre-trained models work on domain-specific downstream tasks in English and Japanese. Our experiments used no external data, to keep the English and Japanese conditions comparable. We confirmed that multilingual pre-trained models provide a similar level of accuracy in Japanese as in English, ranked 3rd in entity F1 over all target entities for Subtask1-CR-JA, and ranked top in report-level precision and F1 for Subtask3-CR-JA.
  • Shuai Shao, Gongye Jin, Daisuke Satoh and Yuji Nomura
    [Pdf] [Table of Content]
    The NTTD team participated in the Subtask1-CR-JA and Subtask1-RR-JA subtasks of the NTCIR-16 Real-MedNLP task. This paper reports our approach to the named entity recognition (NER) problem when only a limited amount of labeled medical text is available. The documents are real Japanese case reports and radiology reports. We first applied our recently developed annotation inconsistency detection tool to detect and correct inappropriate labels in the given training data. We then applied data augmentation methods to create additional labeled data, combining the original and additional data as training data for our model. We fine-tuned Flair on this training data to obtain our results.
  • Rei Noguchi
    [Pdf] [Table of Content]
    Clinical text data are expected to be used directly in medical examination and diagnosis to support doctors' practices. In this study, I propose a framework for identifying similar cases among radiology reports by structuring the reports into a “case matrix” and applying a collaborative filtering algorithm.


    [SS]


  • Jia Chen, Weihao Wu, Jiaxin Mao, Beining Wang, Fan Zhang and Yiqun Liu
    [Pdf] [Table of Content]
    This is an overview of the NTCIR-16 Session Search (SS) task. The task features the Fully Observed Session Search (FOSS) subtask and the Partially Observed Session Search (POSS) subtask. This year, we received 28 runs from 6 teams in total. This paper describes the task background, data, subtasks, evaluation measures, and evaluation results.
  • Haonan Chen and Zhicheng Dou
    [Pdf] [Table of Content]
    This paper presents the participation of RUCIR in the NTCIR-16 Session Search task. We discuss our approach and the experimental results. We use the state-of-the-art session search ranking model COCA, which is based on BERT and contrastive learning. In addition, we use the BM25 algorithm and usefulness labels to make our ranking results more accurate. The official results show that our best run outperforms all other participants' runs on all official metrics in both subtasks.
  • Shengjie Ma, Chuwei Zeng and Jiaxin Mao
    [Pdf] [Table of Content]
    A single query can rarely satisfy a user's information need, so users continually submit further queries to the search system until they are satisfied or give up. This search process is called session search. The MM6 team participated in the IR subtask of the NTCIR-16 Session Search task. This paper reports our three approaches for the FOSS subtask and one approach for the POSS subtask, and presents and discusses the official results.
  • Weihang Su, Xiangsheng Li, Yiqun Liu, Min Zhang and Shaoping Ma
    [Pdf] [Table of Content]
    Our team (THUIR2) participated in both the FOSS and POSS subtasks of the NTCIR-16 Session Search (SS) task. This paper describes our approaches and results. In the FOSS subtask, we submitted five runs using a learning-to-rank model and a fine-tuned pre-trained language model. We fine-tuned the pre-trained language model with both ad hoc data and session information and then combined the models with a learning-to-rank method. The combined model achieved the best performance among all participants in the preliminary evaluation. In the POSS subtask, we used a combined model that also achieved the best performance in the preliminary evaluation.


    [ULTRE]


  • Yurou Zhao, Zechun Niu, Feng Wang, Jiaxin Mao, Qingyao Ai, Tao Yang, Junqi Zhang and Yiqun Liu
    [Pdf] [Table of Content]
    In this paper, we present an overview of the NTCIR-16 Unbiased Learning to Rank Evaluation (ULTRE) task. The ULTRE task is motivated by the ongoing development of unbiased learning to rank (ULTR) research and consists of two subtasks: offline ULTR and online ULTR. In this overview, we introduce the dataset, simulation method, and evaluation protocols of ULTRE, and report the official evaluation results of the received runs.
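    The core idea behind the offline subtask is to correct click signals for position bias. A minimal sketch of a generic inverse-propensity-scored (IPS) listwise objective follows, with toy scores, clicks, and propensities rather than ULTRE's official simulation:

      # Sketch: IPS-weighted softmax cross-entropy for unbiased LTR.
      import torch

      scores = torch.tensor([2.0, 1.0, 0.5], requires_grad=True)  # model scores
      clicks = torch.tensor([1.0, 0.0, 1.0])      # observed clicks, one query
      propensity = torch.tensor([0.9, 0.5, 0.2])  # examination prob. per rank

      # Reweighting each click by 1/propensity corrects for position bias
      # in expectation.
      log_softmax = torch.log_softmax(scores, dim=0)
      loss = -((clicks / propensity) * log_softmax).sum()
      loss.backward()
      print(loss.item(), scores.grad)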
  • Zechun Niu, Yurou Zhao, Feng Wang and Jiaxin Mao
    [Pdf] [Table of Content]
    The RUCIR21 team participated in both the offline and online subtasks of the NTCIR-16 Unbiased Learning to Rank Evaluation (ULTRE) task. This paper describes our approaches and reports our results. In the offline subtask, we tried four learning-to-rank models based on the Mobile Click Model (MCM), as well as a revived Dual Learning Algorithm (DLA) model. In the online subtask, we revived a Pairwise Differentiable Gradient Descent (PDGD) run and two online DLA runs, and also tried an online DLA model based on MCM.
  • Anh Tran, Tao Yang and Qingyao Ai
    [Pdf] [Table of Content]
    The UTIRL team participated in both the offline and online unbiased learning-to-rank (ULTR) (Chinese) subtasks of the NTCIR-16 ULTRE task. This paper describes the algorithms we implemented and analyses the official results. In the offline ULTR subtask, we tried a newly proposed ULTR algorithm and an ensemble of ten models consisting of five different algorithms on two neural networks. In the online ULTR subtask, we used three algorithms trained on a deep neural network.