The 15th NTCIR Conference
Evaluation of Information Access Technologies
December 8-11, 2020
National Institute of Informatics, Tokyo, Japan

    [Preface]


  • Charles L. A. Clarke and Noriko Kando
    [Pdf] [Table of Content]


    [Overview]


  • Yiqun Liu, Makoto P. Kato and Noriko Kando
    [Pdf] [Table of Content]
    This is an overview of NTCIR-15, the fifteenth sesquiannual research project for evaluating information access technologies. NTCIR-15 involved various evaluation tasks related to information retrieval, information recommendation, question answering, natural language processing, etc. (in total, seven tasks were organized at NTCIR-15). This paper outlines the research project, including its organization, schedule, scope, and task designs. In addition, we present brief statistics on participation in the NTCIR-15 Conference. Readers should refer to the individual task overview papers for detailed descriptions and findings.


    [Keynote]


  • Ben Carterette
    [Pdf] [Table of Content]
    The history of experimenting on information access systems using offline test collections---the Cranfield paradigm---goes back many decades and is a major aspect of scientific progress in search and IA. Its wide-scale adoption has been driven in part by its robustness and ease of use, in part by evaluation workshops like NTCIR, and in part by the emergence of new information access scenarios and problems that can adapt it. Despite that, there is a lot we still don't know about the ability of offline experiments to predict online outcomes with real users in real-world conditions. In this talk I discuss a common framework for thinking about experimentation and connect it to both offline Cranfield experiments and online A/B testing. Using examples from Spotify search and recommendation, I show how offline experiments motivate online development and vice versa. Developing a better understanding of how offline experiments translate into online experiences will be key as approaches from our research continue to be adopted into real-world technology.
  • Xiao-Li Meng
    [Pdf] [Table of Content]
    The terms reproducibility and replicability have been used interchangeably by some scientific communities and by the media, and with opposite meanings by others, causing much confusion. The 2019 report on "Reproducibility and Replicability in Science" issued by the US National Academies of Sciences, Engineering, and Medicine (NASEM) made an important contribution to delineating the two terms by equating reproducibility with computational reproducibility and replicability with scientific replicability. However, neither of them in itself can guarantee reliability. Reliability does not imply absolute truth, but it does require that our findings can be triangulated, can pass reasonable stress tests and fair-minded sensitivity tests, and do not contradict the best available theory and scientific understanding, unless the findings are designed to challenge the existing common wisdom. The quality of data and information plays a far more important role than their quantity in ensuring reliability. This talk reflects on these issues based on my statistical research on quantifying the quality of big data, and on my experience as the founding Editor-in-Chief of the Harvard Data Science Review (HDSR), which has provided me with a much broader data science perspective. Along the way, using US election prediction and COVID-19 testing as two recent examples, I will demonstrate how small our big data are when we take into account their quality.


    [Invited Talk]


  • Ellen Voorhees
    [Pdf] [Table of Content]
    To state the obvious, 2020 was an unusual year. The pandemic that caused so many changes large and small impacted the Text REtrieval Conference (TREC), too: TREC implemented its first extra-curricular track (TREC-COVID); both relevance assessing and the conference itself were required to be remote; and many existing TREC tracks either pivoted to be completely focused on COVID-19 or included some topics regarding it. This talk will describe implementing TREC in 2020 and highlight the outcomes of its eight tracks.
  • Nicola Ferro
    [Pdf] [Table of Content]
    The initial part of this talk will discuss the achievements and happenings in CLEF, the European initiative whose main mission is to promote research, innovation, and development of information access systems with an emphasis on multilingual and multimodal information with various levels of structure. We will focus on the just concluded CLEF 2020 edition (https://clef2020.clef-initiative.eu/) and the just started CLEF 2021 edition (http://clef2021.clef-initiative.eu/).
  • Gareth Jones
    [Pdf] [Table of Content]
    MediaEval is a multimedia benchmarking initiative which seeks to evaluate new algorithms for multimedia access and retrieval. MediaEval emphasizes the "multi" in multimedia, including tasks combining various facets of speech, audio, visual content, tags, users, and context. MediaEval innovates new tasks and techniques focusing on the human and social aspects of multimedia content in a community-driven setting. The initiative provides a platform for researchers to organize benchmark tasks within a planned annual timeline and to report results at an end-of-campaign workshop. This presentation will overview the objectives of the MediaEval campaigns and summarize current activities within MediaEval 2020.
  • Cathal Gurrin
    [Pdf] [Table of Content]
    Technology advances mean that we can now gather detailed multimedia traces that model our life activities; these are called lifelogs. When research began in this domain over fifteen years ago, there was little understanding of how lifelogs could positively impact the individual or society, nor did our research community understand how lifelogs could be organised and indexed to provide effective retrieval facilities. This talk will provide an overview of the outputs of the NTCIR-Lifelog task. The motivation for proposing the NTCIR-Lifelog task and the progress made by running the task three times at NTCIR will be discussed, along with the wider impact of NTCIR-Lifelog on related benchmarking activities. Finally, future plans for next-generation lifelog-related tasks at NTCIR will be proposed, along with a possible roadmap for the coming years.



    Core Tasks


    [DialEval-1]


  • Zhaohao Zeng, Sosuke Kato, Tetsuya Sakai and Inho Kang
    [Pdf] [Table of Content]
    In this paper, we provide an overview of the NTCIR-15 Dialogue Evaluation (DialEval-1) task. DialEval-1 consists of two subtasks: Dialogue Quality (DQ) and Nugget Detection (ND). Both DQ and ND subtasks aim to evaluate customer-helpdesk dialogues automatically. The DQ subtask is to assign quality scores to each dialogue in terms of three criteria: task accomplishment, customer satisfaction, and efficiency; and the ND subtask is to classify whether a customer or helpdesk turn is a nugget, where being a nugget means that the dialogue turn helps towards problem solving. In this overview paper, we introduce the task setting, evaluation methods and data collection, and report the official evaluation results of 18 runs received from 7 teams.
  • Yen Chun Huang, Yi Hsuan Huang, Yu Ya Cheng, Jung Y Liao and Yung-Chun Chang
    [Pdf] [Table of Content]
    In this paper, we present our approaches to the Nugget Detection (ND) subtask of the NTCIR-15 DialEval-1 task. The purpose of this subtask is to automatically identify the state of dialogue sentences in the logs of a dialogue system. The proposed model integrates BERT embeddings and BiLSTM through a concatenated attention mechanism. The results demonstrate that BERT embeddings are effective in capturing the semantic relationships between parts of the dialogue in context. As a result, our models surpass two baseline models (i.e., BL-uniform and BL-popularity). In addition, according to our final evaluation results, the attention mechanism plays a crucial role in model optimization.
  • Junjie Wang, Yuxiang Zhang, Tetsuya Sakai and Hayato Yamana
    [Pdf] [Table of Content]
    The NTCIR-15 DialEval-1 task provides both Chinese and English customer-helpdesk dialogues, with corresponding evaluation criteria for dialogue quality and nugget detection, and requires participants to build deep learning models for the automatic evaluation of helpdesk agent systems. We implemented deep learning models (CNN, BiLSTM, etc.) and pre-trained models (ALBERT and DistilRoBERTa). For this task, we propose a label-based training method that transforms the problem from a special multi-label classification task into a multi-class classification task. Our results show that the label-based training method improves performance greatly. The official results show that the average score of team SKYMN ranks third, close to that of the second-place team.
  • Xuan Zhang, Maofu Liu, Zhanzhao Zhou and Junyi Xiang
    [Pdf] [Table of Content]
    The NTCIR-15 Dialogue Evaluation Task (DialEval-1) hosts two subtasks, Dialogue Quality (DQ) and Nugget Detection (ND). The purpose of the DQ subtask is to assess the quality of a dialogue from three aspects, and the ND subtask is to identify the current status of each dialogue turn. Both subtasks aim to evaluate customer-helpdesk dialogues automatically. In this paper, we use a Bidirectional Long Short-Term Memory (Bi-LSTM) network to capture contextual dependencies within a dialogue, and adopt an attention mechanism to better attend to key words and sentences. Compared with feature extraction methods that ignore the dependencies between dialogue turns, our method places a stronger emphasis on context dependency. Finally, the experimental results on the two subtasks show that our method is effective.
  • Xin Kang, Yunong Wu and Fuji Ren
    [Pdf] [Table of Content]
    In this paper we report the work of the TUA1 team on the dialogue evaluation (DialEval-1) task of NTCIR-15, for the Chinese dialogue quality (DQ) and nugget detection (ND) subtasks. In the proposed method, we first employ a pre-trained BERT network to extract features from a dialogue sequence, and feed these feature vectors to a Bi-LSTM network together with a speaker embedding which is learned to separate the customer and helpdesk semantically. We then feed the output, a sequence of semantic vectors, into a self-attention network, where several attention heads learn to assign evaluation weights over the sequence of vectors and summarize them into several high-level semantic vectors. Finally, we concatenate these high-level semantic vectors and pass them through several feed-forward neural network layers to predict the dialogue quality scores. We train the network with a mean squared error criterion for dialogue quality prediction and a Sinkhorn divergence criterion for nugget detection. The results suggest that the proposed method is promising for learning a dialogue quality prediction system that generates predictions very close to those of the human annotators.
  • Ting Cao, Fan Zhang, Haoxiang Shi, Zhaohao Zeng, Sosuke Kato, Tetsuya Sakai, Injae Lee, Kyungduk Kim and Inho Kang
    [Pdf] [Table of Content]
    In this paper, we present three models for the Nugget Detection and Dialogue Quality subtasks of the NTCIR-15 DialEval-1 task. Despite recent progress in dialogue systems, we still face a number of unresolved challenges; for example, dialogue systems often generate responses that cannot satisfy customers or help solve their problems. We therefore submitted three models to the NTCIR-15 DialEval-1 task. The first model (run0) is an LSTM with attention-based dialogue embedding, which uses a recurrent neural network with an attention layer to embed the previous dialogue context; it uses two representation vectors, an extracted dialogue context vector and a sentence vector of the target sentence. The second model (run1) is a Transformer encoder architecture for English nugget detection. The third model (run2) is a BiLSTM with an attention layer that leverages the BiLSTM outputs to obtain a sentence-level representation. On the English dataset of the nugget detection subtask, the run0 model outperforms the baseline, but on the Chinese dataset it does not. This suggests that attention-based dialogue embedding is possibly helpful for the smaller English dataset.
  • Tao-Hsing Chang, Jian-He Chen and Chi-Chia Chen
    [Pdf] [Table of Content]
    Chatbot dialogue quality evaluation is an important topic. Most existing automatic evaluation methods are based on models that can handle time series (for example, the long short-term memory (LSTM) model). However, this research adopted another approach to directly convert a complete dialogue into a semantic vector through Bidirectional Encoder Representations from Transformers (BERT). Subsequently, the vector was entered into a simple classification model for training and prediction. The experimental results for the DialEval-1 task reveal that the performance of the proposed method is reasonably comparable to that of the BL-LSTM model.
  • Mike Tian-Jian Jiang, Zhao-Xian Gu, Cheng-Jhe Chiang, Yueh-Chia Wu, Yu-Chen Huang, Cheng-Han Chiu, Sheng-Ru Shaw and Min-Yuh Day
    [Pdf] [Table of Content]
    Following the third Short-Text Conversation (STC-3) task at NTCIR-14, the first Dialogue Evaluation (DialEval-1) task continues examining, for Chinese and English, how well each participant's system can tackle the two subtasks of Dialogue Quality (DQ) and Nugget Detection (ND). The former estimates three quality scores of a dialogue, namely Accomplishment (A-score), Satisfaction (S-score), and Effectiveness (E-score), each using integer ranks ranging from -2 to 2. The latter categorizes dialogue turns into seven nugget types. For the DQ subtask, the task organizers measure performance by Normalised Match Distance (NMD) and Root Symmetric Normalised Order-aware Divergence (RSNOD); for the ND subtask, the metrics are Root Normalised Sum of Squares (RNSS) and Jensen-Shannon Divergence (JSD). We treat both subtasks as classification problems and tackle them with several Transformer models, to create a reliable and efficient process using the most recent advances in transfer learning. Our approaches involve various tokenization and fine-tuning techniques for these Transformers. This paper describes their usage and the usefulness of our official runs. In terms of NMD, our run2 for the Chinese DQ subtask substantially outperforms the baselines. According to RSNOD, our run0 for the English DQ subtask also achieves a statistically significant difference in S-score. Almost all of our runs for the ND subtask reach first place. These results suggest that one can readily optimize Transformers for the DQ and ND subtasks.


    [FinNum-2]


  • Chung-Chi Chen, Hen-Hsen Huang, Hiroya Takamura and Hsin-Hsi Chen
    [Pdf] [Table of Content]
    In the FinNum task series, we focus on understanding numeral-related information in financial social media data. In FinNum-2, we introduce a new task, named numeral attachment, to identify the relation between the mentioned stock and the numerals in a financial tweet, and propose the NumAttach 2.0 dataset with 10,340 expert-annotated instances. In this paper, we give an overview of the FinNum-2 shared task and analyze the results of 17 submissions from 7 teams. The statistics of NumAttach 2.0 and a comparison of the participants' results with those of baseline models are provided.
  • Yu-Yu Chen and Chao-Lin Liu
    [Pdf] [Table of Content]
    In the FinNum-2 task, the goal is to tell whether a given numeral is related to a given stock in a financial tweet. We employ a transfer-learning mechanism and use Google BERT embeddings, so that we only need to collect and annotate a small amount of data to train the classifiers for the task. In addition, our classifiers consider some intuitive but useful syntactic features, e.g., the positions of words in the tweets. Experimental results indicate that these new features boost prediction quality, and we achieved results better than 68% in the formal run.
  • Jose Moreno, Emanuela Boros and Antoine Doucet
    [Pdf] [Table of Content]
    This paper presents the TLR participation in the FinNum-2 task. Our system is based on a Transformer architecture improved by a pre-processing strategy for numeral attachment identification. Instead of relying on a vanilla attention mechanism, we focus the attention on specific tokens that are essential for the task. The results on an unseen test collection show that our model generalises its predictions well, as our best run outperforms all those of the other participants in terms of F1-macro (the official metric). Furthermore, experiments with two alternatives (with and without parameter tuning) show the robustness of our method, leading to an additional improvement of 4% over our best run.
  • Yu-Chi Liang, Yi-Hsuan Huang, Yu-Ya Cheng and Yung-Chun Chang
    [Pdf] [Table of Content]
    Machine learning methods for financial document analysis have focused mainly on the textual part. However, the numerical parts of these documents are also rich in information content. This paper presents our approach to Numeral Attachment in Financial Tweets (FinNum-2) at NTCIR-15. The purpose of this task is to determine whether there is a relationship between the target cashtag and the target number in financial tweets. We construct a model based on BERT-BiLSTM with an attention mechanism, which is the main architecture for this task. In addition, we add the results from a dependency parser and a CNN model to our main architecture. Our experimental results indicate that the BERT-BiLSTM with attention model has the best performance. More precisely, we obtain F-scores of 87.02% on the development set and 64.74% on the test set.
  • Mike Tian-Jian Jiang, Yi-Kun Chen and Shih-Hung Wu
    [Pdf] [Table of Content]
    This paper describes our submissions to the NTCIR-15 FinNum-2 shared task on financial tweet analysis. We submitted two runs in the final test. The first run is our baseline system, based on the BERT model with our preprocessing strategy. The second run is our fine-tuned system, based on the XLM-RoBERTa pre-trained model with additional tokenization and fine-tuning techniques. The macro-F1 of run 2 is 95.99% on the development set and 71.90% on the formal test, which ranked second best.
  • Xinxin Xia, Wei Wang and Maofu Liu
    [Pdf] [Table of Content]
    This article describes how we dealt with the FinNum-2 task of NTCIR-15, in which the relationship between a numeral and a given cashtag is the object of classification: given a short text, a target numeral, and a cashtag, the system judges whether the numeral is related to the cashtag. The task is thus essentially binary classification. We use an SVM model to classify the text by splicing text features, and analyze the results.


    [QA Lab-PoliInfo-2]


  • Yasutomo Kimura, Hideyuki Shibuki, Hokuto Ototake, Yuzu Uchida, Keiichi Takamaru, Madoka Ishioroshi, Teruko Mitamura, Masaharu Yoshioka, Tomoyosi Akiba, Yasuhiro Ogawa, Minoru Sasaki, Kenichi Yokote, Tatsunori Mori, Kenji Araki, Satoshi Sekine and Noriko Kando
    [Pdf] [Table of Content]
    The NTCIR-15 QA Lab-PoliInfo-2 aims at real-world complex Question Answering (QA) technologies using Japanese political information such as local assembly minutes and newsletters. QA Lab-PoliInfo-2 has four subtasks, namely Stance Classification, Dialog Summarization, Entity Linking, and Topic Detection. We describe the data used, the formal run results, and a comparison between human marks and automatic evaluation scores.
  • Kazuhiro Atsumi and Yoshinobu Kano
    [Pdf] [Table of Content]
    This paper reports the knlab team's approach and results in the Stance Classification task of NTCIR-15 QA Lab-PoliInfo-2. This task predicts stances (Agreement, Disagreement, No Mention) for each party regarding each proposal, using assembly minutes that include statements of politicians. Our team designed features obtained from a sentiment dictionary and BERT, then trained LightGBM to classify the stances.
  • Daiki Shirafuji, Hiromichi Kameya, Rafal Rzepka and Kenji Araki
    [Pdf] [Table of Content]
    Many discussions are held during political meetings, and their transcripts include a large number of utterances on various topics. We need to read all of them if we want to follow speakers' intentions or opinions on a given topic. To avoid such a costly and time-consuming process of grasping often lengthy discussions, NLP researchers work on generating concise summaries of utterances. The Dialog Summarization subtask of the NTCIR-15 QA Lab-PoliInfo-2 task addresses this problem for Japanese utterances in assembly minutes, and our team (SKRA) participated in this subtask. As a first step towards summarizing utterances, we created a new pre-trained sentence embedding model, the Japanese Political Sentence-BERT. With this model, we summarize utterances without labelled data. This paper describes our approach to solving the task and discusses its results.
  • Yuichi Sasazawa and Naoaki Okazaki
    [Pdf] [Table of Content]
    We report the results of the NTCIR-15 QA Lab-PoliInfo-2 Stance Classification task. A party's stance toward a bill is of one of two types: stances explicitly stated in the party's statements, and stances that are not explicitly stated. For the former, we used a rule-based algorithm to extract stances toward the bills; for the latter, we used the names of the bills to classify stances. Our team achieved 99.78% accuracy in the automatic metrics on the test data, a state-of-the-art score.
  • Kazuma Kadowaki
    [Pdf] [Table of Content]
    The JRIRD team participated in the Dialog Summarization subtask of the NTCIR-15 QA Lab-PoliInfo-2 task. This paper describes my approach for the topic-aware summarization of assembly member speeches. The system consists of three modules: (1) a pre-processor that retrieves speeches from minutes, (2) a BERT-based sentence extractor that extracts candidate sentences by predicting the topic-aware importance of each sentence in a speech without annotations, and (3) a UniLM-based summary generator that generates a summary from the extracted sentences while controlling the length of the summary. Results show that my system achieved an outstanding performance among all of the participants in the task, both in the evaluation using ROUGE scores and in human evaluations.
  • Takuma Himori, Yasutomo Kimura and Kenji Araki
    [Pdf] [Table of Content]
    The HUHKA team participated in the Entity Linking subtask of the Question Answering Lab for Political Information 2 (QA Lab-PoliInfo-2) task of NTCIR-15. This report describes our methods for solving the task and discusses the results. We extract law-name mentions with BERT and filters, and link each mention to Wikipedia using knowledge bases such as Wikipedia and e-Gov.
  • Ken-Ichi Yokote
    [Pdf] [Table of Content]
    We propose a system with simple strategies as a conclusion-oriented summarization baseline.
  • Yuji Naraki and Tetsuya Sakai
    [Pdf] [Table of Content]
    The selt team participated in the Entity Linking task of NTCIR-15 QA Lab-PoliInfo-2. This paper describes our entity linking system for assembly member speeches using BERT and wikipedia2vec. Using this system, we can effectively preprocess and postprocess data for mention detection, and (after fixing the format of our run file) we achieved the second-best performance on the NTCIR-15 leaderboard. For mention detection specifically, we achieved the best score.
  • Yasuhiro Ogawa, Yuta Ikari, Takahiro Komamizu and Katsuhiko Toyama
    [Pdf] [Table of Content]
    Our nukl team participated in the NTCIR-15 QA Lab-PoliInfo-2. We submitted runs for three subtasks: Dialog Summarization, Entity Linking, and Topic Detection. This paper describes our three systems. For Dialog Summarization, we used Progressive Ensemble Random Forest (PERF), which we developed at the NTCIR-14 QA Lab-PoliInfo. While we applied PERF only to sentence extraction at NTCIR-14, here we also applied it to sentence reduction and achieved good performance. For the Topic Detection subtask, we used a simple rule-based approach and showed that some of the topics are not described in Togikai dayori, the official summary of assembly members' speeches.
  • Takanori Nekomoto, Ryoto Ohsugi, Tomoyosi Akiba, Shigeru Masuyama and Daiki Shirato
    [Pdf] [Table of Content]
    NTCIR-15 QA Lab-PoliInfo-2 establishes several tasks aimed at presenting pertinent information for resolving political issues. We, the akbl team, tackled the Stance Classification, Dialog Summarization, and Topic Detection tasks. For the Stance Classification task, we first used a rule-based analyzer to extract opinion statements; then, for those left undetermined, we applied a BERT-based stance classifier to the debate statements. For the Dialog Summarization task, we first searched for the relevant segments, then extracted the final sentence to form the output summary. For the Topic Detection task, we employed a clustering algorithm on the BERT embeddings of initial topic candidates extracted using regular expressions, then selected the final topics based on the centroid of each cluster.
  • Hiromu Onogi, Kiichi Kondo, Younghun Lim, Xinnan Shen, Madoka Ishioroshi, Hideyuki Shibuki, Tatsunori Mori and Noriko Kando
    [Pdf] [Table of Content]
    In this paper, we describe one system for stance classification, two systems for dialog summarization, and one system for entity linking. We submitted 5 results including 3 late submissions for stance classification, 10 results including 5 late submissions for dialog summarization, and 4 results for entity linking. As a result, we obtained an accuracy of .9388 for stance classification, a ROUGE-1 score of .2410 for dialog summarization, and an F-measure of .3910 for entity linking.
  • Yuya Hirai, Yo Amano and Kazuhiro Takeuchi
    [Pdf] [Table of Content]
    This paper proposes a method of summarizing discussions using a graph structure that makes their argument structure verifiable in a visual way. First, our proposed method extracts words related to each participant's opinions and positions based on words co-occurring with those of the other participants in the same discussion. Then, LDA (Latent Dirichlet Allocation) is applied to weight the priority of the extracted position and opinion words, while the number of topics for LDA is determined by hierarchical clustering based on co-occurrence frequency. Finally, topic phrases are generated as output using dependency structure analysis.
  • Ryo Kato and Minoru Sasaki
    [Pdf] [Table of Content]
    In this study, we develop a system that automatically identifies whether each party agrees or disagrees with each bill in the minutes of the Tokyo Metropolitan Assembly. From the interpellations given by the members of each party and the answers in the minutes, we predict whether the members of each party are for or against each bill. It is not difficult to predict agreement or disagreement with a bill if members have given their opinions on it. However, it is difficult to predict whether minority parties agree or disagree, because minority parties have little opportunity to give speeches. In this paper, we propose a method for predicting agreement or disagreement with a bill based on external knowledge and past meeting proceedings when the stance cannot be extracted directly.
  • Kouta Nakayama and Satoshi Sekine
    [Pdf] [Table of Content]
    This is the report of the summarization system that our team, LIAT, submitted to the Dialog Summarization task of NTCIR-15 QA Lab-PoliInfo-2. We designed an extractive summarizer by dividing the task into three parts and training a model for each. Analysis of the scores showed that the line-level extractive summarizer we created did not suit the task.


    [SHINRA2020-ML]


  • Satoshi Sekine, Masako Nomoto, Kouta Nakayama, Asuka Sumida and Koji Matsuda
    [Pdf] [Table of Content]
    This paper describes SHINRA, a Knowledge Base (KB) construction project, and SHINRA2020-ML, a shared task we conducted under NTCIR-15. SHINRA is a project to structure Wikipedia based on a pre-defined set of attributes for given categories. The categories and the attributes follow the definition of the Extended Named Entity (ENE). We conducted a shared task of automatic knowledge base construction (AKBC) and, at the same time, used the submitted results to construct a larger and more accurate KB. In the shared task, participants are not told which entities constitute the test data, so they must run their systems on all entities in Wikipedia. By this method, the organizers receive outputs for all entities, which are later made public and used to build structured knowledge by ensemble learning. We call this resource construction scheme "Resource by Collaborative Contribution (RbCC)". SHINRA2020-ML, a shared task of SHINRA, targets the categorization of Wikipedia entities in 30 languages.
  • The Viet Bui and Phuong Le-Hong
    [Pdf] [Table of Content]
    The FPT.AI team participated in the SHINRA2020-ML subtask of the NTCIR-15 SHINRA task. This paper describes our method for solving the problem and discusses the official results. Our method focuses on learning cross-lingual representations, at both the word level and the document level, for page classification. We propose a three-stage approach comprising multilingual model pre-training, monolingual model fine-tuning, and cross-lingual voting. Our system achieves the best scores for 25 out of 30 languages, and its accuracy gaps to the best-performing systems for the other five languages are relatively small.
  • Rúben Cardoso, Afonso Mendes and Andre Lamurias
    [Pdf] [Table of Content]
    Wikipedia is an online encyclopedia available in 285 languages. It constitutes an extremely relevant Knowledge Base (KB), which could be leveraged by automatic systems for several purposes. However, the structure and organisation of its information are not amenable to automatic parsing and understanding, and it is therefore necessary to structure this knowledge. The goal of the current SHINRA2020-ML task is to leverage Wikipedia pages in order to categorise their corresponding entities across 268 hierarchical categories belonging to the Extended Named Entity (ENE) ontology. In this work, we propose three distinct models based on the contextualised embeddings yielded by Multilingual BERT. We explore the performance of a linear layer with and without explicit usage of the ontology's hierarchy, and of a Gated Recurrent Units (GRU) layer. We also test several pooling strategies to leverage BERT's embeddings, and selection criteria based on the labels' scores. We were able to achieve good performance across a large variety of languages, including those not seen during the fine-tuning process (zero-shot languages).
  • Tushar Abhishek, Ayush Agarwal, Anubhav Sharma, Vasudeva Varma and Manish Gupta
    [Pdf] [Table of Content]
    Maintaining a unified ontology across various languages is expected to result in effective and consistent organization of Wikipedia entities. Such organization of the Wikipedia knowledge base (KB) will in turn improve the effectiveness of various KB oriented multi-lingual downstream tasks like entity linking, question answering, fact checking, etc. As a first step toward a unified ontology, it is important to classify Wikipedia entities into consistent fine-grained categories across 30 languages. While there is existing work on fine-grained entity categorization for rich-resource languages, there is hardly any such work for consistent classification across multiple low-resource languages. Wikipedia webpage format variations, content imbalance per page, and imbalance with respect to categories across languages make the problem challenging. We model this problem as a document classification task. We propose a novel architecture, RNN_GNN_XLM-R, which leverages the strengths of various popular deep learning architectures. Across ten participant teams at the NTCIR-15 SHINRA2020-ML Classification Task, our proposed model stands second in the overall evaluation.
  • Hiyori Yoshikawa, Chunpeng Ma, Aili Shen, Qian Sun, Chenbang Huang, Guillaume Pelat, Akiva Miura, Daniel Beck, Timothy Baldwin and Tomoya Iwakura
    [Pdf] [Table of Content]
    The NTCIR-15 SHINRA2020-ML Task is a multilingual text categorization task where a Wikipedia page in a given language should be mapped to one or more of the 219 Extended Named Entity (ENE) categories. The UOM-FJ team participated in 28 out of the 30 subtasks (languages), with a primary focus on English. Our system makes use of different types of information associated with the target Wikipedia articles, as well as the hierarchical structure of ENE. Our system ranked first on the English subtask with an F1-score of 82.73, demonstrating the effectiveness of using different types of document information.
  • Kouta Nakayama and Satoshi Sekine
    [Pdf] [Table of Content]
    This paper reports on the document classification system that our team LIAT submitted to the classification task of NTCIR-15 SHINRA2020-ML. We used the outputs of BERT as document embeddings to deal with the long documents of Wikipedia. We used a Transformer encoder to classify the document embeddings into each class. Our system did not outperform the other submissions, but we hope that our results will also serve as a resource.
  • Masaharu Yoshioka and Yoshiaki Koitabashi
    [Pdf] [Table of Content]
    The HUKB team participated in the SHINRA 2020 ML task of NTCIR-15. This paper introduces our approach to the problem and discusses the official results.
  • Sosuke Nishikawa and Ikuya Yamada
    [Pdf] [Table of Content]
    The SHINRA2020-ML task aims to classify Wikipedia entities in 30 languages into Extended Named Entity categories, based on 920K Japanese Wikipedia entities with gold-standard entity types. To address this task, we propose a novel method to extract effective features from Wikipedia descriptions. In particular, we use two types of features, i.e., text-based and entity-based features, based on state-of-the-art neural embedding models. As a result, we achieve the highest micro F1 score in two languages (i.e., French and German) on the final submission, and competitive results on the other seven languages.
  • Return to Top


    [WWW-3]


  • Tetsuya Sakai, Sijie Tao, Zhaohao Zeng, Yukun Zheng, Jiaxin Mao, Zhumin Chu, Yiqun Liu, Maria Maistro, Zhicheng Dou, Nicola Ferro and Ian Soboroff
    [Pdf] [Table of Content]
    This is an overview of the NTCIR-15 We Want Web with CENTRE (WWW-3) task. The task features the Chinese subtask (adhoc web search) and the English subtask (adhoc web search, replicability and reproducibility), and received 48 runs from 9 teams. We describe the subtasks, data, evaluation measures, and the official evaluation results.
  • Kohei Shinden, Atsuki Maruta and Makoto P. Kato
    [Pdf] [Table of Content]
    The KASYS team participated in the English subtask of the NTCIR-15 WWW-3 Task. This paper describes our approach to generating NEW runs and REP (replicated/reproduced) runs in the NTCIR-15 WWW-3 Task. We applied BERT to the WWW-3 task for generating NEW runs, following a recent BERT-based approach for news retrieval. For replicating and reproducing the WWW-2 runs, we used an open-source information retrieval toolkit, Anserini, with some updates. Our NEW runs achieved the top performance in terms of nDCG, Q-measure, and iRBU, suggesting that BERT-based document ranking is highly effective not only for other ranking tasks, but also for Web document retrieval. The REP runs did not reproduce the original results well on the WWW-3 test collection, for which we discuss possible implementation differences from the original paper.
  • Xiaochen Zuo, Jing Yao and Zhicheng Dou
    [Pdf] [Table of Content]
    The RUCIR team participated in both the Chinese and English subtasks of the NTCIR-15 We Want Web-3 (WWW-3) task. This paper describes our approaches and results in both subtasks. In the Chinese subtask, we use BERT [2] on the SogouQCL [17] dataset and a commercial dataset. In the English subtask, we use BERT and a learning-to-rank method on the TREC Web Track dataset and the MS MARCO Passage Ranking dataset. Our approaches achieved the best performance in the Chinese subtask.
  • Masaki Muraoka, Zhaohao Zeng and Tetsuya Sakai
    [Pdf] [Table of Content]
    The SLWWW team participated in the English subtask of the NTCIR-15 We Want Web-3 (WWW-3) Task. This paper describes our approach and results from the WWW-3 task. We utilized learning to rank models which were trained on the MQ2007 and MQ2008 datasets.
  • Fiana Raiber and Oren Kurland
    [Pdf] [Table of Content]
    Cluster-based document retrieval methods were shown to be highly effective in past research. In our submissions to the WWW-3 task, we experimented with one such method that has demonstrated superior performance compared to other state-of-the-art techniques.
  • Zhumin Chu, Jingtao Zhan, Xiangsheng Li, Jiaxin Mao, Yiqun Liu, Min Zhang and Shaoping Ma
    [Pdf] [Table of Content]
    The THUIR team participated in both the Chinese and English subtasks of the NTCIR-15 We Want Web-3 (WWW-3) task. This paper describes our approaches and results in the WWW-3 task. In the Chinese subtask, we tried two kinds of neural ranking models based on BERT, as well as a revived SDMM model. In the English subtask, we revived three learning-to-rank runs and a BM25 run we had submitted in the WWW-2 English subtask, and also tried a new ranking system based on BERT.
  • Ahmet Aydın, Gökhan Göksel, Ahmet Arslan and Bekir Taner Dinçer
    [Pdf] [Table of Content]
    The ESTUCeng team participated in the English subtask of the We Want Web Task (WWW-3). This paper describes our approach to the ad-hoc Web search problem and discusses the results. We used LambdaMART, a learning-to-rank algorithm, to re-rank samples generated by language modeling with Dirichlet smoothing. We extracted several traditional features adopted from the literature, as well as a family of new HTML document quality features, for the ClueWeb12-B13 dataset. The traditional features are augmented with the newly proposed features. Then, we used feature selection to obtain an optimal subset of features that would produce the highest retrieval effectiveness. The query relevance judgements of the NTCIR-13 We Want Web-1 and NTCIR-14 We Want Web-2 tasks are used as training data. The novel document quality features (type-D) proved useful in achieving competitive retrieval effectiveness on the ClueWeb12-B13 dataset.
  • Canjia Li and Andrew Yates
    [Pdf] [Table of Content]
    MPII participated in the English subtask of WWW-3 at NTCIR-15 with several variants of our recent PARADE model. PARADE aggregates passage-level relevance representations into a document-level representation, which is then used to predict a document's relevance score. We submitted the best-performing PARADE variants in three runs. Our results support the findings in the PARADE paper: aggregating representations is more effective than aggregating scores, and effectiveness increases with the complexity of the aggregation approach.
  • Zhu Liang, Jinling Shang, Xuwen Song and Si Shen
    [Pdf] [Table of Content]
    The NAUIR team participated in the English subtask of the NTCIR-15 We Want Web-3 (WWW-3) task. This article describes our method and results in the English subtask of WWW-3. We used a modified DRMM model and a BERT model for query-document matching, respectively. In the pre-training process, we use the document corpus of the WWW-3 task for word embedding training, and BERT uses the bert-base-uncased pre-trained model officially provided by Google. In terms of results, both our modified DRMM model and the BERT model outperformed the baseline.
  • Return to Top



    Pilot Tasks


    [Data Search]


  • Makoto P. Kato, Hiroaki Ohshima, Ying-Hsang Liu and Hsin-Liang Chen
    [Pdf] [Table of Content]
    NTCIR-15 Data Search is a shared task on ad-hoc retrieval for governmental statistical data. The first round of Data Search focuses on the retrieval of a statistical data collection published by the Japanese government (e-Stat), and one published by the US government (Data.gov). This paper introduces the task definition, test collection, and evaluation methodology of NTCIR-15 Data Search. This round of Data Search attracted six research groups, from which we received 17 submissions for the Japanese subtask, and 37 submissions for the English subtask. The evaluation results of these runs are presented and discussed in this paper.
  • Taku Okamoto and Hisashi Miyamori
    [Pdf] [Table of Content]
    In this paper, we describe the system and results of Team KSU for the Data Search Task at NTCIR-15. The documents covered by this task consist of metadata extracted from governmental statistical data and the body of the corresponding statistical data. The metadata is characterized by short document length, and the body of the statistical data is almost entirely composed of numbers, except for titles, headers, and other such fields. We newly developed a categorical search that narrows the set of documents to be retrieved by category, in order to properly capture the scope of the problem domain intended by the given query. In addition, to compensate for the short document length of the metadata, we implemented a method that extracts the header information of the table from the body of the statistical data to augment the documents to be searched. As the ranking model, we used either BM25 or a reranking method based on BERT. The evaluation in the official run showed that the combined method of categorical search and BM25 scored 0.448 in nDCG@10 in the Japanese subtask and 0.255 in nDCG@10 in the English subtask, in each case the highest score on this measure among all the official runs.
  • Lya Hulliyyatus Suadaa, Lutfi Rahmatuti Maghfiroh, Isfan Nur Fauzi and Siti Mariyah
    [Pdf] [Table of Content]
    The STIS team participated in the English subtask of the NTCIR-15 Data Search Task, exploring document metadata, consisting of the title, description, and tags of documents, as document features. Baseline models used Anserini's traditional information retrieval methods for ad-hoc retrieval of governmental statistical data. The availability of human-annotated query-document relevance datasets allowed us to leverage them to improve the candidate retrieved documents. We proposed a re-ranking document retrieval approach using relevance-level classifiers of query-document pairs to improve document retrieval performance. For the top candidate documents from Anserini, we re-ranked documents using a Bi-LSTM classifier and fine-tuned BERT-based classifiers. Our results show that the re-ranking approach with the fine-tuned BERT-based relevance-level classifier improves the document retrieval quality of Anserini.
  • Phuc Nguyen, Kazutoshi Shinoda, Taku Sakamoto, Diana Petrescu, Hung-Nghiep Tran, Atsuhiro Takasu, Akiko Aizawa and Hideaki Takeda
    [Pdf] [Table of Content]
    In the Open Data era, a large number of datasets have been made available on the Web, leading to new challenges in organizing, managing, and searching those resources. In this paper, we introduce NII Table Linker, a dataset search framework designed for the English and Japanese sub-tasks of the NTCIR-15 Data Search Task (DST). In particular, we study the capacity of standard information retrieval techniques on DST and introduce four re-ranking models based on (1) pre-trained contextualized embeddings, (2) an entity-centric approach, (3) data file content, and (4) a cluster-based approach. On the English sub-task, the model using pre-trained contextualized embeddings ranks 2nd on the primary metric (nDCG@10) and 1st on the remaining metrics in the official evaluation. On the Japanese sub-task, our runs also achieve promising performance in the unofficial evaluation; the fine-tuned BM25 run ranks 1st on the nERR metrics.
  • Ryota Mibayashi, Huulong Pham, Naoaki Matsumoto, Takehiro Yamamoto and Hiroaki Ohshima
    [Pdf] [Table of Content]
    In this paper, we introduce our approach to the NTCIR-15 Data Search Task. We mainly used combinations of three kinds of algorithms in many different configurations. As a result, Query Modification + BM25 gave the best performance.
  • Return to Top


    [MART]


  • Graham Healy, Tu-Khiem Le, Hideo Joho, Frank Hopfgartner and Cathal Gurrin
    [Pdf] [Table of Content]
    MART (Micro-activity Retrieval Task) was an NTCIR-15 collaborative benchmarking pilot task. The NTCIR-15 MART pilot aimed to motivate the development of first-generation techniques for high-precision micro-activity detection and retrieval, supporting the identification and retrieval of activities that occur over short time-scales, such as minutes, rather than the long-duration event segmentation tasks of past work. Participating researchers developed and benchmarked approaches to retrieve micro-activities from rich, time-aligned, multi-modal sensor data. Groups were ranked in decreasing order of micro-activity retrieval accuracy using mAP (mean Average Precision). The dataset used for the task consisted of a detailed lifelog of activities gathered using a controlled protocol of real-world activities (e.g. using a computer, eating, daydreaming, etc.). The data included a lifelog camera data stream, biosignal activity (EOG, HR), and computer interactions (mouse movements, screenshots, etc.). This task presented a novel set of challenging micro-activity-based topics.
  • Takuma Yoshimura, Pham Huulong, Ryota Mibayashi, Rui Kimura and Hiroaki Ohshima
    [Pdf] [Table of Content]
    Human behavior recognition plays an important role in helping computers to understand human behavior. In particular, micro-activity recognition is required for computers to understand behavior at a finer granularity. In this study, we proposed a feature selection method to prevent overfitting on datasets with few samples and a large number of feature dimensions. As the feature selection method, we use Super-LCC, which is fast and has low information entropy loss. The proposed method reduces the feature size from 3108 dimensions to 20.75 dimensions on average. The mAP of the proposed method was evaluated with an SVM (Support Vector Machine), and the result was 0.71707, which confirms the usefulness of the proposed method.
  • Duy-Duc Le Nguyen, Yu-Chi Lang and Yung-Chun Chang
    [Pdf] [Table of Content]
    In this paper, we propose a new method for predicting user activities in the NTCIR-15 MART retrieval task. Extra concepts from ResNet-generated features, followed by a Bidirectional Long Short-Term Memory, help our neural network pay more attention to the corresponding class. Our model received a fair result on the scoreboard.
  • Tu-Khiem Le, Manh-Duy Nguyen, Ly-Duyen Tran, Van-Tu Ninh, Cathal Gurrin and Graham Healy
    [Pdf] [Table of Content]
    The growing attention to lifelogging research has led to the creation of many retrieval systems, most of which employ event segmentation as a key functionality. While previous literature focused on splitting lifelog data into broad segments of daily living activities, less attention was paid to micro-activities, which last for short periods yet carry valuable information for building a high-precision retrieval engine. In this paper, we present our effort in addressing the NTCIR-15 MART challenge, in which the participants were asked to retrieve micro-activities from a multi-modal dataset. We proposed five models which investigate imagery and sensory data, both jointly and separately, using various Deep Learning and Machine Learning techniques, achieved a maximum mAP score of 0.901 with the Image Tabular Pair-wise Similarity model, and ranked second in the competition. The model not only captures information from the temporal visual data combined with sensor signals, but also works as a Siamese network to learn the homogeneity between micro-activities.
  • Tai-Te Chu, Yi-Ting Liu, Chia-Chung Chang, An-Zi Yen, Hen-Hsen Huang and Hsin-Hsi Chen
    [Pdf] [Table of Content]
    This paper presents our approach to the NTCIR-15 MART task. The task is divided into two subtasks: the micro-activity retrieval task (MART) and the insights task. We participated in the micro-activity retrieval task, where the goal is to retrieve the detailed lifelog of activities from multi-modal data, such as first-person perspective images, screenshots, and bio-signals. The major challenge in the task is the semantic gap between textual descriptions and the visual concepts in images. To reduce the semantic gap, we propose a supervised model that encodes activity descriptions as vectors. Our model incorporates the visual features and bio-signal features from seven users to classify the twenty pre-defined micro-activities. In order to recognize computer activities (e.g., reading text on screen and browsing news websites), we utilize RoI (Region of Interest) features. After encoding visual features with a pre-trained computer vision model, we use a Gated Recurrent Unit (GRU) to capture the slight variations of users’ movements in the time series. Experimental results show that our system is effective in the micro-activity retrieval task. In terms of performance, our system achieved an mAP score of 0.85050 and won third place.
  • Jiayu Li, Ziyi Ye, Weizhi Ma, Min Zhang, Yiqun Liu and Shaoping Ma
    [Pdf] [Table of Content]
    Activity recognition is a general but important task in various scenarios for understanding user behaviors, in which some pre-defined human activities are recognized based on heterogeneous sensor data. Previous related studies mainly focus on distinguishing long-term physical states, helping researchers understand the behaviors of individuals. In contrast, the NTCIR-15 Micro-activity Retrieval Task (MART) concentrates on micro-activity recognition, which aims to identify activities occurring over minutes of daily behavior and requires deeper insight into the character of each activity. In this paper, we present the methodologies our team THUIR employed in MART. First, various feature engineering methods are applied to extract valuable features from multi-modal raw data, and feature selection methods are adopted to exclude useless features. Then, we handle the task in two different ways, treating it as either 1) a ranking problem or 2) a multi-label classification problem, and propose two distinct approaches: a similarity-based approach for the former and tree-based classifiers for the latter. In two-fold cross-validation experiments, the combination of the correlation-based feature selection method and the rule-based Gradient Boosting Decision Tree (GBDT) classifier outperforms the other models, achieving an mAP of 0.95 on the test set. This method achieves the best performance among all participants in MART.
  • Return to Top