The 14th NTCIR Conference
Evaluation of Information Access Technologies
June 10-13, 2019
National Institute of Informatics, Tokyo, Japan

    [Preface]


  • Charles L. A. Clarke and Noriko Kando
    [Pdf] [Table of Content]


    [Overview]


  • Makoto P. Kato and Yiqun Liu
    [Pdf] [Table of Content]
    This is an overview of NTCIR-14, the fourteenth sesquiannual research project for evaluating information access technologies. NTCIR-14 involved evaluation tasks related to information retrieval, information recommendation, question answering, natural language processing, and more (in total, seven tasks were organized at NTCIR-14). This paper outlines the research project, including its organization, schedule, scope, and task designs. In addition, we present brief statistics on the participants in the NTCIR-14 Conference. Readers should refer to the individual task overview papers for detailed descriptions and findings.


    [Keynote]


  • Charles L. A. Clarke
    [Pdf] [Table of Content]
    For over two years, I put my academic research career on hold, left the University of Waterloo, moved to California, and worked to make a large commercial search engine better. In September 2019, I returned to my role as a Professor at the University of Waterloo, but the things I learned in the commercial world will impact my academic research for the rest of my career. While I can't tell you deep, dark secrets, I can tell you about problems and ideas that are inspired by my experience. These include changes in the way I think about search evaluation, online search behaviour, personalization and query understanding.


    [Invited Talk]


  • Nicola Ferro
    [Pdf] [Table of Content]
    Evaluation measures are the basis for quantifying the performance of IR systems, and the way in which their values can be processed to perform statistical analyses depends on the scales on which these measures are defined. For example, mean and variance should be computed only when relying on interval scales. In this talk, we will present our formal theory of IR evaluation measures, based on the representational theory of measurement, which allows us to determine whether and when IR measures are interval scales. We found that common set-based retrieval measures – namely Precision, Recall, and F-measure – are always interval scales in the case of binary relevance, while this does not hold in the multi-graded relevance case. Among rank-based retrieval measures – namely AP, gRBP, DCG, and ERR – only gRBP is an interval scale, when we choose a specific value of the parameter p and define a specific total order among systems, while all the other IR measures are not interval scales. We will also introduce some brand-new set-based and rank-based IR evaluation measures which are guaranteed to be interval scales. Finally, we will discuss the outcomes of an extensive evaluation, based on standard TREC collections, studying how our theoretical findings impact the experimental ones. In particular, we conduct a correlation analysis to study the relationship among the above-mentioned state-of-the-art evaluation measures and their scales, and we study how the scales of evaluation measures affect non-parametric and parametric statistical tests for multiple comparisons of IR system performance.
  • Gareth Jones
    [Pdf] [Table of Content]
    MediaEval is a multimedia benchmarking initiative that seeks to evaluate new algorithms for multimedia access and retrieval. MediaEval emphasizes the "multi" in multimedia, including tasks combining various facet combinations of speech, audio, visual content, tags, users, and context. MediaEval innovates new tasks and techniques focusing on the human and social aspects of multimedia content in a community-driven setting. The initiative provides a platform for researchers to organize benchmark tasks within a planned annual timeline and to report results at an end-of-campaign workshop. MediaEval 2019 marks the 10th anniversary of the foundation of MediaEval. This presentation will briefly overview 10 years of MediaEval campaigns and summarize current activities within the MediaEval 2019 campaign.



    Core Tasks


    [Lifelog-3]


  • Cathal Gurrin, Hideo Joho, Frank Hopfgartner, Liting Zhou, Van-Tu Ninh, Tu-Khiem Le, Rami Albatal, Duc Tien Dang Nguyen and Graham Healy
    [Pdf] [Table of Content]
    Lifelog-3 was the third instance of the lifelog task at NTCIR. At NTCIR-14, the Lifelog-3 task explored three different lifelog data access related challenges, the search challenge, the annotation challenge and the insights challenge. In this paper we review the activities of participating teams who took part in the challenges and we suggest next steps for the community.
  • Isadora Nguyen Van Khan, Pranita Shrestha, Min Zhang, Yiqun Liu and Shaoping Ma
    [Pdf] [Table of Content]
    Automatically recognizing a user's status from lifelog data can be used to annotate a user's day and then make personal suggestions based on previous statuses, or as features for other applications. However, this recognition is not yet well studied. In this paper we present a method to automatically recognize a user's status. To achieve this, we use two different feature datasets, a non-visual one and another based on the semantics of the pictures, and use supervised machine learning algorithms for the recognition. We then discuss the impact of the non-visual features on the different statuses we chose and try to find a smaller set of features for each status. Finally, we give some statistics and visual insights about the users. We obtained good results with the non-visual features: 0.89 accuracy for Inside-or-Outside detection, 0.74 for Alone-or-Not and 0.80 for Working-or-Not. The results are better when using the visual features: 0.95 accuracy.
  • Tokinori Suzuki and Daisuke Ikeda
    [Pdf] [Table of Content]
    Our QUIK team participated in the Lifelog Semantic Access subtask (LSAT) of the NTCIR-14 Lifelog-3 task. In the task, given a topic describing users' daily activities or events (e.g. find the moment when a user was taking a train from the city to home) as a query, a system retrieves the relevant images of those moments from the images recording users' daily lives. For the LSAT task, we present an approach that retrieves users' lifelog images by computing the similarity between the lifelog images and images obtained from the web by issuing the LSAT topics to a web search engine. For computing this similarity, we employ a classifier trained on the images collected from the web with a convolutional neural network model. This paper describes our approach to solving the LSAT task and reports the official results we obtained.
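    The retrieval idea described in this abstract, ranking lifelog images by their similarity to images retrieved from the web, can be sketched with a plain cosine similarity over feature vectors. This is a minimal illustration with hypothetical feature vectors; the team's actual system scores images with a CNN-based classifier.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_lifelog_images(web_image_vec, lifelog_vecs):
    """Rank lifelog images (by index) against a web image's feature vector,
    most similar first."""
    scored = [(i, cosine_similarity(web_image_vec, v))
              for i, v in enumerate(lifelog_vecs)]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

    In practice the feature vectors would come from a CNN's penultimate layer rather than being hand-made as here.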
  • Nguyen-Khang Le, Dieu-Hien Nguyen, Trung-Hieu Hoang, Thanh-An Nguyen, Thanh-Dat Truong, Duy-Tung Dinh, Quoc-An Luong, Viet-Khoa Vo-Ho, Vinh-Tiep Nguyen and Minh-Triet Tran
    [Pdf] [Table of Content]
    Lifelogging has been gaining more and more attention in the research community in recent years. Not only can it provide valuable insight and a deeper understanding of human daily activities, but it can also be used to improve personal health and wellness. However, there are many challenging problems in this field. One of the most important tasks in processing lifelog data is to access its semantics, which aims to retrieve the moments of interest from the lifelog. There are many approaches to this problem, two of which are data processing and friendly user interaction. Our proposed system takes both of these approaches: we first extract concepts from the images and build a structure to quickly query images based on these concepts, and we then provide users with a friendly user interface to perform the task.
  • Min-Huan Fu, Chia-Chun Chang, Hen-Hsen Huang and Hsin-Hsi Chen
    [Pdf] [Table of Content]
    This paper presents our approach to the NTCIR-14 Lifelog-3 task. We participated in two of the subtasks, the lifelog semantic access task (LSAT) and the lifelog activity detection task (LADT). We attempt to reduce the semantic gap present in lifelog tasks by introducing textual knowledge derived from external resources. In both subtasks, we extract additional visual concepts with computer vision models, and then incorporate both official and additional concepts into our system using pre-trained word embeddings, in which textual knowledge is inherent. For LSAT, we propose an interactive system that automatically suggests a list of candidate query words to users, and adopt a probabilistic relevance-based ranking function for retrieval. Our system also allows users to refine the retrieval results by filtering out irrelevant images. For LADT, we incorporate visual concepts into our supervised learning framework. We first encode visual concepts with pre-trained word embeddings, and perform unordered aggregation to produce an order-independent representation of visual concepts. In terms of performance, our systems achieve a mean average precision of 0.4727 in LSAT and an F1 score of 0.5439 in LADT.
  • Van-Tu Ninh, Tu-Khiem Le, Liting Zhou, Graham Healy, Kaushik Venkataraman, Minh-Triet Tran, Duc-Tien Dang-Nguyen, Sinead Smyth and Cathal Gurrin
    [Pdf] [Table of Content]
    This paper describes the work of the DCU research team in collaboration with the University of Science, Vietnam, and the University of Bergen, Norway at NTCIR-14. In this paper, we describe how we structure the lifelog data for our interactive retrieval system, as well as how we instruct both novice and expert users to use our search engine properly. We experiment with one simple user-controlled baseline and one advanced retrieval system. In the advanced system, we propose an approach to augment visual object concepts from the lifelogger's images and the Internet. This also initiates the trend of building interactive lifelog search engines oriented to each specific lifelogger's life.


    [OpenLiveQ-2]


  • Makoto P. Kato, Akiomi Nishida, Tomohiro Manabe, Sumio Fujita and Takehiro Yamamoto
    [Pdf] [Table of Content]
    This is an overview of the NTCIR-14 OpenLiveQ-2 task. This task aims to provide an open live test environment of Yahoo Japan Corporation's community question-answering service (Yahoo! Chiebukuro) for question retrieval systems. The task was simply defined as follows: given a query and a set of questions with their answers, return a ranked list of questions. Submitted runs were evaluated both offline and online. In the online evaluation, we employed pairwise preference multileaving, a multileaving method that showed higher efficiency than other methods in a recent study. We describe the details of the task, data, and evaluation methods, and then report the official results of NTCIR-14 OpenLiveQ-2.
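    For readers unfamiliar with interleaved online evaluation, the sketch below shows team-draft interleaving for two rankers, a simpler relative of the pairwise preference multileaving used in the task. It is an illustration of the general idea (mix two rankings, then credit each ranker for clicks on positions it contributed), not the task's actual evaluation code.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=0):
    """Team-draft interleaving: alternate picks between two rankings,
    recording which ranker ("team") contributed each position."""
    rng = random.Random(seed)
    rankings = {"A": list(ranking_a), "B": list(ranking_b)}
    all_docs = set(ranking_a) | set(ranking_b)
    used, count = set(), {"A": 0, "B": 0}
    interleaved, teams = [], []
    while len(used) < len(all_docs):
        # The team with fewer picks goes next; break ties randomly.
        if count["A"] != count["B"]:
            team = "A" if count["A"] < count["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        doc = next((d for d in rankings[team] if d not in used), None)
        if doc is None:  # this team is exhausted; let the other team pick
            team = "B" if team == "A" else "A"
            doc = next(d for d in rankings[team] if d not in used)
        used.add(doc)
        count[team] += 1
        interleaved.append(doc)
        teams.append(team)
    return interleaved, teams

def credit_from_clicks(teams, clicked_positions):
    """Credit each ranker with the clicks on positions it contributed."""
    wins = {"A": 0, "B": 0}
    for pos in clicked_positions:
        wins[teams[pos]] += 1
    return wins
```

    Pairwise preference multileaving generalizes this idea to many rankers at once and infers preferences between document pairs rather than per-position credit.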
  • Hiroki Tanioka
    [Pdf] [Table of Content]
    The AITOK team participated in the NTCIR-14 OpenLiveQ-2 task. This report describes our approach to ranking question-answering lists of Yahoo! Chiebukuro search results for the given queries, and discusses our test results. Our approach aims to estimate how appealing a question-answer pair is, and integrates two strategies. The first is a statistics-based approach using questions and clickthrough data. The second is a natural-language-based approach using question-answer pairs. In the offline test, we worked day by day on increasing the Q-measure using these two strategies. Additionally, we employed a manual dynamic-programming approach to optimize the Q-measure. Although the approach is very simple, a result sorted by a score mainly based on page views was the best among all our results. However, the best result in the offline test was not good enough in the online test. Instead, another result, sorted by a score based on clickthrough with last-updated time in descending order, was ranked in the top group.
  • Tomohiro Manabe, Sumio Fujita and Akiomi Nishida
    [Pdf] [Table of Content]
    We report our work at the NTCIR-14 OpenLiveQ-2 task. From the given data set for question retrieval on a community QA service, we extracted some BM25F-like features and translation-based features in addition to basic features. After that, we constructed multiple ranking models with the data. According to the offline evaluation results, our linear combination model with translation-based features achieved the best Q-measure score among our runs. In the first round of online evaluation, our linear models with BM25F-like features and translation-based features obtained the largest credits among 62 runs including other teams' runs. In the final round, our linear combination model with BM25F-like features and our neural ranking models with basic features obtained the largest amount of credits among the 30 runs which passed the first round. According to the online evaluation results based on real users' feedback, neural ranking is one of the best approaches to improving practical search effectiveness on the service.
  • Takashi Sato, Yuki Nagase and Masato Uraji
    [Pdf] [Table of Content]
    We submitted 18 runs to the NTCIR-14 OpenLiveQ-2 task. In this task we reorder questions by using white and black words, because most questions in the OpenLiveQ question data fit the queries. The white words are selected by their frequency in the questions, Google suggestions, manual selection, and/or popular words found on the word's Wikipedia page, whereas the black words are selected by their rareness in the questions. The evaluation results show that reordering questions by white words is more effective than reordering by black words.
  • Piyush Arora and Gareth Jones
    [Pdf] [Table of Content]
    We describe the DCU-ADAPT team's participation in the NTCIR-14 OpenLiveQ-2 task. In this task, given a query and a set of questions with their answers, we were required to return a ranked list of questions that potentially match and satisfy the user's query effectively. Submitted runs were evaluated using both offline and online measures. Offline evaluation used metrics such as NDCG@10 and ERR@10, while online evaluation was conducted in two phases using a pairwise preference multileaving approach. In this task we focus on exploring different Learning-to-Rank (L2R) models, feature selection, and data normalisation techniques. Overall, we submitted fourteen systems to the benchmark competition, which were evaluated offline and in the first phase of the online evaluation. Five of our best systems (5/14) were selected for the final phase of the online evaluation. Our best run was ranked 6th out of the 65 submissions for the task. A detailed analysis of our system submissions found that the ranking of different systems in this task varies considerably depending on the evaluation metric chosen. The offline and online metrics used in this task do not match well, indicating that relevance-based measures alone might not reflect the manner in which users interact with information in an online setting.
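    NDCG@10, one of the offline metrics mentioned above, can be sketched as follows. This uses the standard 2^rel - 1 gain formulation, which is not necessarily the task's exact implementation.

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for a ranked list of graded relevance labels.

    relevances[i] is the relevance grade of the document at rank i
    (0-indexed). Returns DCG@k normalized by the ideal DCG@k."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

    A ranking that places the most relevant questions first scores 1.0; any misordering of graded documents within the top k lowers the score.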


    [QA Lab-PoliInfo]


  • Yasutomo Kimura, Hideyuki Shibuki, Hokuto Ototake, Yuzu Uchida, Keiichi Takamaru, Kotaro Sakamoto, Madoka Ishioroshi, Teruko Mitamura, Noriko Kando, Tatsunori Mori, Harumichi Yuasa, Satoshi Sekine and Kentaro Inui
    [Pdf] [Table of Content]
    The NTCIR-14 QA Lab-PoliInfo aims at real-world complex Question Answering (QA) technologies using Japanese political information such as local assembly minutes and newsletters. QA Lab-PoliInfo has three tasks, namely the Segmentation, Summarization, and Classification tasks. We describe the data used, the formal run results, and a comparison between human marks and automatic evaluation scores.
  • Jiawei Yong, Shintaro Kawamura, Katsumi Kanasaki, Shoichi Naitoh and Kiyohiko Shinomiya
    [Pdf] [Table of Content]
    Our RICT team tackled the Segmentation and Classification subtasks in NTCIR-14 QA Lab-PoliInfo. The technical characteristic of our segmentation approach is that we regard segments as retrieval objects and utilize a cue-phrase-based semi-supervised learning method to detect segment boundaries; we submitted 5 methods for the formal run. For the classification task, we train our supervised learning model on all utterances without topic effects and utilize outlier detection technologies to alleviate the imbalanced training data problem; we submitted 7 runs for the formal run. Since the evaluation results show that our approach achieves higher scores than average, our main contribution to classification is a feasible system that deals with small amounts of imbalanced training data, especially in regional assembly minutes. For segmentation, we also contributed an efficient method for grasping the distinguishing features of regional assembly minutes.
  • Satoshi Hiai, Yuka Otani, Takashi Yamamura and Kazutaka Shimada
    [Pdf] [Table of Content]
    This paper describes a summarization system for NTCIR-14 QA Lab-PoliInfo. For the summarization task, participants need to generate a summary corresponding to an assemblyperson's speech in the assembly minutes within a length limit. Our method extracts important sentences to summarize an assemblyperson's speech in the minutes, applying a machine learning model to predict the important sentences. However, the given assembly minutes data do not contain information about the importance of the sentences. Therefore, we construct training data for the importance prediction model using word similarity between sentences in a speech and those in its summary. In the formal run, some scores of our method were the best among all submitted runs of all participants. This result shows the effectiveness of our method.
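    The training-data construction described above, labeling a speech sentence as important when it is sufficiently similar to a sentence of the reference summary, can be sketched with a simple word-overlap similarity. Jaccard overlap and the 0.5 threshold here are illustrative stand-ins, not the paper's actual similarity measure or threshold.

```python
def jaccard(words_a, words_b):
    """Jaccard overlap between two bags of words."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def label_importance(speech_sentences, summary_sentences, threshold=0.5):
    """Pseudo-label each speech sentence: 1 (important) if any summary
    sentence is similar enough, else 0. The labels can then train an
    importance prediction model."""
    labels = []
    for sent in speech_sentences:
        score = max((jaccard(sent.split(), s.split())
                     for s in summary_sentences), default=0.0)
        labels.append(1 if score >= threshold else 0)
    return labels
```

    The resulting binary labels serve as supervision for a sentence-importance classifier even though the original minutes carry no importance annotations.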
  • Souichi Furukawa, Yuto Naritomi, Hokuto Ototake, Toshifumi Tanabe and Kenji Yoshimura
    [Pdf] [Table of Content]
    This paper reports the FU-01 team's results on the Classification subtask of the NTCIR-14 QA Lab-PoliInfo task. We proposed two different methods, rule-based and MaxEnt-based, for classifying the pros and cons of a political topic and whether an utterance sentence includes fact-checkable reasons or not. The formal run results of the subtask show that our MaxEnt-based method achieved higher accuracy than the rule-based method.
  • Ginya Nishijima, Masahiro Shiratori, Hokuto Ototake, Toshifumi Tanabe and Kenji Yoshimura
    [Pdf] [Table of Content]
    This paper reports the FU-02 team's results on the Classification subtask of the NTCIR-14 QA Lab-PoliInfo task. We propose a method for classifying the pros and cons of a political topic, whether an utterance sentence includes fact-checkable reasons or not, and whether an utterance sentence is relevant to the topic. Our proposed method consists of three different classifiers, based on a simple rule, keywords, and word embeddings respectively.
  • Yasuhiro Ogawa, Michiaki Satou, Takahiro Komamizu and Katsuhiko Toyama
    [Pdf] [Table of Content]
    The nagoy team participated in the summarization subtask of the NTCIR-14 QA Lab-PoliInfo. This paper describes our summarization system for assemblymen's speeches, which uses random forest classifiers. In this subtask, training on all the data does not achieve good results because the data are imbalanced. To solve this problem, we developed a new summarization system that applies multiple random forest classifiers trained on different-sized data sets step by step. As a result, our system achieved good performance, especially in the evaluation by ROUGE scores. We also compare our system with a single random forest classifier using probability.
  • Kazuki Terazawa, Daiki Shirato, Tomoyoshi Akiba and Shigeru Masuyama
    [Pdf] [Table of Content]
    Local council proceedings are useful as research materials and as a basis for choosing among candidates during elections, but they are large in volume. For this reason, it would be helpful for many people to be able to identify the source from a summary published on the Web and to automatically summarize the utterances. Therefore, in this paper, using the dataset of the Tokyo Metropolitan Assembly proceedings provided by NTCIR-14 QA Lab-PoliInfo, we worked on automatically specifying the extent of the original utterance (Segmentation Task) and summarizing the utterance (Summarization Task). In the Segmentation Task, the extraction extent is specified by line numbers in the proceedings, and the result is evaluated by Precision, Recall, and F-measure. In the Summarization Task, ROUGE and human evaluation are performed. In addition, it is difficult for individuals to conduct fact-checking to cope with fake news disseminated among the large amount of information on the Internet. Therefore, using the same dataset, we also worked on estimating the fact-checkability of utterances and, where fact-checking is possible, classifying the utterances (Classification Task). In the Classification Task, the labels support, against, and other are assigned to an utterance, and the results are evaluated by Precision, Recall, and F-measure. We also made a decision on the relevance of the target policy.
  • Daiki Shirafuji, Sho Takishita, Patrycja Swieczkowska, Rafal Rzepka and Kenji Araki
    [Pdf] [Table of Content]
    The STARS team participated in the Classification task of the Question Answering Lab for Political Information (QA Lab-PoliInfo) task at NTCIR-14. This report describes our methods for solving the task and discusses the results. We identify whether the policy and remarks are relevant or not and whether they contain a verifiable fact or not, and predict the stance (positive, negative or neutral), mainly with machine learning approaches.
  • Tatsuya Ogasawara and Takeru Yokoi
    [Pdf] [Table of Content]
    In this research, we classified the utterances of assembly members from three viewpoints: whether an utterance is 1) relevant to the policy (Relevance Classification), 2) fact-checkable (Fact-checkability Classification), and 3) expressing agreement, disagreement, or neutrality (Stance Classification). In the Relevance Classification experiment, classification was performed using cosine similarity. In the Fact-checkability Classification experiment, classification was performed by a decision tree classifier, with the counts of specific words such as evidence expressions and named entities used as training features. In the Stance Classification experiment, classification was performed by a support vector machine, using as the training feature a vector of polarity values for the words appearing in each utterance; the polarity values were determined by an emotion polarity dictionary. As a result, the accuracy was approximately 80% for each classification. However, for the minority class of each classification experiment, precision and recall were low. To improve these scores, higher-quality training data are necessary for the Relevance and Fact-checkability Classifications. In the Fact-checkability Classification, evidence expressions were not successfully extracted, so extracting evidence expressions with a more sophisticated rule base than the one used in this experiment is future work. In the Stance Classification, the remaining challenges are constructing features that capture polarity reversal and incorporating appropriate knowledge for the data domain.
  • Toshiki Tomihira and Yohei Seki
    [Pdf] [Table of Content]
    News spreads quickly owing to the development of social media, which becomes a problem when fake news goes viral. Since fake news is often related to politics, we focus on automatic fact confirmation using the Japanese Regional Assembly Minutes Corpus. To verify the fact-checkability of a sentence correctly, it is important to focus on sentences that contain evidence of facts. In this paper, we explain our approach to predicting ``Relevance'', ``Fact-checkability'', and ``Stance'' for the sentences of the minutes. Furthermore, we examine whether a model combining CNN and LSTM is effective for fact-checking.
  • Linyuan Tang, Koichiro Watanabe, Shuntaro Yada and Kyo Kageura
    [Pdf] [Table of Content]
    The LISLab team participated in the classification and summarisation subtasks of the NTCIR-14 PoliInfo task. This report describes our approaches to classifying and summarising the opinions of assembly members. In the classification subtask, we applied an SVM classifier with bag-of-n-grams features to detect Relevance, Fact-checkability, and Stance. We also conducted hyperparameter tuning and a feature analysis to improve the performance of our classifier. In the summarisation subtask, we used rhetorical structure modelling and the TextRank method for key-phrase extraction and selection. Our summarisation model showed average performance in both the automatic and manual evaluations. By analysing the evaluation results, we also suggest that the provided gold standards may be unsatisfactory.
  • Tasuku Kimura, Ryo Tagami, Hikaru Katsuyama, Sho Sugimoto and Hisashi Miyamori
    [Pdf] [Table of Content]
    In this paper, the systems and results of team KSU for the QA Lab-PoliInfo task at NTCIR-14 are described. First, for the Segmentation Task, which required extracting primary information correctly from the input data, we proposed a method based on rules and vocabulary distributions. For the Summarization Task, which demanded generating a summary focused on a specific topic, we tried using a framework of query-focused abstractive summarization. Finally, for the Classification Task, which called for classifying the stances of a given text toward a specific topic, we developed a method combining deep learning and two-stage classifiers. As a result, team KSU achieved third place among five teams with an F-measure of 0.855 in the Segmentation Task, and second place among 11 teams with an accuracy of 0.934 in the Classification Task.
  • Minoru Sasaki and Tetsuya Nogami
    [Pdf] [Table of Content]
    Stance classification is defined as automatically identifying a speaker's position on a specific discussion topic from text. Although stance classification has been an active research area, there is no approach that uses external knowledge to improve the classification. In this paper, we propose a stance classification system using a sentiment dictionary. To evaluate the efficiency of the proposed system, we conduct experiments comparing it with a baseline method using a Support Vector Machine on the NTCIR-14 QA Lab-PoliInfo classification task formal run dataset. The results show that the proposed method using the sentiment dictionary obtains higher precision than the SVM baseline for the ``support'' and ``against'' samples. However, the precision of the proposed method decreases by about 10% compared with the baseline system for the ``neutral'' samples.
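    A dictionary-based stance scorer of the kind described can be sketched as follows. The English word lists are toy stand-ins for the actual sentiment dictionary, which targets Japanese assembly minutes, and the counting rule is an illustrative simplification.

```python
# Toy sentiment dictionary; the real system uses a Japanese sentiment lexicon.
POSITIVE = {"support", "agree", "approve", "welcome"}
NEGATIVE = {"oppose", "against", "reject", "criticize"}

def classify_stance(utterance, margin=0):
    """Classify an utterance as support / against / neutral by counting
    dictionary hits; fall back to neutral when neither side dominates."""
    words = utterance.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos - neg > margin:
        return "support"
    if neg - pos > margin:
        return "against"
    return "neutral"
```

    As the abstract notes, this style of scorer tends to be precise for clearly polarized utterances but weak on neutral ones, which contain few dictionary hits either way.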
  • Taiki Shinjo, Hitoshi Nishikawa and Takenobu Tokunaga
    [Pdf] [Table of Content]
    The TTECH team participated in the Classification and Summarization subtasks of the NTCIR-14 QA Lab-PoliInfo task. This paper reports the methods we used for these subtasks and their experimental results.
  • Ken-Ichi Yokote and Makoto Iwayama
    [Pdf] [Table of Content]
    Argument detection is considered a key factor in much previous work related to the QA Lab-PoliInfo Segmentation task. However, there are differing views on what constitutes an ''argument'', so classifying sentences in terms of argument detection may be a noisy process. In this paper, we propose a method that assumes all text segments to be arguments and then applies a filter-by-confidence step, instead of an argument detection step. Our method achieved 93.9 precision and 81.3 recall, indicating that filtering by confidence helps avoid the negative effects of a noisy text classification process.


    [STC-3]


  • Zhaohao Zeng, Sosuke Kato and Tetsuya Sakai
    [Pdf] [Table of Content]
    In this paper, we provide an overview of the NTCIR-14 Short Text Conversation-3 Dialogue Quality (DQ) and Nugget Detection (ND) subtasks. Both DQ and ND subtasks aim to evaluate customer-helpdesk dialogues automatically: (1) DQ subtask is to assign quality scores to each dialogue in terms of three criteria: task accomplishment, customer satisfaction, and efficiency; and (2) ND subtask is to classify whether a customer or helpdesk turn is a nugget, where being a nugget means that the turn helps towards problem solving. In this overview paper, we describe the task details, evaluation methods and dataset collection, and report the official results.
  • Yaoqin Zhang and Minlie Huang
    [Pdf] [Table of Content]
    This paper gives an overview of the Emotion Generation subtask at NTCIR-14. The goal of the subtask is to investigate how well a chatting machine can express feelings by generating a textual response to an input post. The task is defined as follows: given a post and a pre-specified emotion class for the response, generate a response that is appropriate in both topic and emotion. This challenge attracted more than 40 registered teams, of which 11 finally submitted results. In this overview paper, we report the details of the challenge, including the task definition, data preparation, annotation schema, submission statistics, and evaluation results.
  • Yan-Chun Hsing, Chien-Hung Chen and Yung-Chun Chang
    [Pdf] [Table of Content]
    This paper presents our approach to the Chinese emotional conversation generation (CECG) subtask of the short text conversation (STC) task at NTCIR-14. The official training data contains 600,000 post-response pairs from Weibo, from which we remove noisy data and train our model. Methods for generating responses mainly include retrieval-based and generation-based approaches. We construct a sequence-to-sequence-based model, as commonly used in generation-based methods, to generate responses that contain emotions of our choosing. In addition, we propose a refined distributed emotion vector (RDEV) representation model, an emotion detection method based on valence and arousal, to improve the responses so that they contain appropriate content as well as adequate emotion. RDEV combines convolutional and recurrent neural networks, and performs remarkably well on the dataset of the Emotion Analysis in Chinese Weibo Texts task at NLPCC 2014. Our final evaluation results achieve an average score of 0.32 from three annotators. The performance of our system is very promising, although not the best among the competing teams in this challenge.
  • Yangyang Zhou, Zheng Liu, Xin Kang, Yunong Wu and Fuji Ren
    [Pdf] [Table of Content]
    In this paper, we describe the overview of our work in STC-3 Chinese Emotional Conversation Generation (CECG) subtask at NTCIR-14. We propose a Post & Emotion to Response (P&E2R) model to train emotions together with posts to obtain the responses. We then propose another model called Post to Response & Emotion to Response (P2R&E2R) model to separate the training of emotions from that of grammar and semantics on the basis of the prior model. We try to use these models to explore how to combine emotions with the generation model better. In the evaluation section, the average scores of our models are both over 0.8, which suggests that our proposed models have emotional output capabilities in Chinese.
  • Yi-Lin Xie and Wei-Yun Ma
    [Pdf] [Table of Content]
    We describe how we built our system for the NTCIR-14 Short Text Conversation (STC-3) Chinese Emotional Conversation Generation (CECG) subtask. In our approach, we first build an emotion keyword list, which contains emotion vocabulary for 5 different emotion classes. Responses are then generated through an RNN-based inference mechanism that draws emotion words from the previously built keyword list to express emotion. Although our system does not reach top performance, there are still some interesting findings in our results worth discussing.
  • Sosuke Kato, Rikiya Suzuki, Zhaohao Zeng and Tetsuya Sakai
    [Pdf] [Table of Content]
    This paper describes our approaches to the Nugget Detection (ND) and Dialogue Quality (DQ) subtasks of the NTCIR-14 STC-3 task. We made a few changes to the baseline BiLSTM model and submitted three models: BiLSTM with multi-head attention, BiLSTM with multi-task learning, and BiLSTM with BERT. On the Chinese dataset, BiLSTM with multi-task learning and BiLSTM with BERT outperformed the baseline, but the improvements are not statistically significant. On the smaller English dataset, the multi-task learning model is the best of our submitted runs, but it does not outperform the BiLSTM baseline in both the ND and DQ subtasks. The BiLSTM with BERT model also performs better than the baseline on the English dataset, which may suggest that multi-task learning and pre-trained embeddings are helpful on the smaller English dataset.
  • Hsiang-En Cherng and Chia-Hui Chang
    [Pdf] [Table of Content]
    In this paper, we consider the Nugget Detection (ND) and Dialogue Quality (DQ) subtasks for Short Text Conversation 3 (STC-3) using deep learning. The goal of the ND and DQ subtasks is to extend one-round STC to multi-round conversations such as customer-helpdesk dialogues. The DQ subtask aims to judge the quality of a whole dialogue using three measures: Task Accomplishment (A-score), Dialogue Effectiveness (E-score), and Customer Satisfaction (S-score). The ND subtask, on the other hand, is to classify whether an utterance in a dialogue contains a nugget, which is similar to the dialogue act (DA) labeling problem. We applied a general model with an utterance layer, a context layer, and a memory layer to learn dialogue representations for both the DQ and ND subtasks, and used gating and attention mechanisms at multiple layers, including the utterance and context layers. The results show that BERT produces a better utterance representation than a multi-stack CNN for both DQ and ND subtasks and outperforms the baseline models proposed by NTCIR on the Ubuntu customer-helpdesk dialogue corpus.
  • Ming Yan, Maofu Liu and Junyi Xiang
    [Pdf] [Table of Content]
    The purpose of dialogue quality estimation is to assess the degree of completion and satisfaction of a dialogue. Nugget detection aims to automatically identify the status of dialogue sentences in a dialog system, such as problem extraction and problem solving. Existing methods rely on feature extraction tools, which not only results in error accumulation but also ignores the context dependency between dialogue turns and the semantic information of sentences, both of which are helpful for dialogue quality estimation and nugget detection. In this paper, a neural network method is proposed that extracts the context dependency between dialogue turns with a Bi-LSTM and adopts an attention mechanism to learn key sentences or phrases in dialogues. The two kinds of information are combined to improve both dialogue quality estimation and nugget detection. The experimental results on the STC-3 DQ and ND subtasks at NTCIR-14 show that our proposed method is effective.
  • Zhanzhao Zhou, Maofu Liu and Zhenlian Zhang
    [Pdf] [Table of Content]
    As an important factor in the human-computer interaction experience, the generation of emotional dialogue has aroused widespread concern among researchers. We participated in the STC-3 (Short Text Conversation) CECG (Chinese Emotional Conversation Generation) subtask of NTCIR-14 and propose a retrieval-based emotional dialogue system. The WUST system includes three modules, i.e. candidate generation, candidate matching, and candidate ranking. The system traverses the candidates, computes the text similarity between the given post and each candidate dialogue one by one, and finally sorts the candidates by scores calculated with a linear function. The highlight of the system is its ability to generate reliable responses, in contrast to generation-based systems. As the evaluation results show, the system can generate appropriate and reliable responses in both content and emotion, and ranks fifth among all participants.
  • Kai Cong and Wai Lam
    [Pdf] [Table of Content]
    We present the model we submitted to the NTCIR-14 Short Text Conversation-3 Dialogue Quality (DQ) subtask. The DQ subtask is to assign quality scores to each customer-helpdesk dialogue in terms of three criteria: task accomplishment (A), customer satisfaction (S), and efficiency (E). Each dialogue is composed of posts by the customer or the helpdesk, and each post consists of turns of sentences, which naturally forms a hierarchical structure. In this work, we treat the problem as a document classification task and, after trying various approaches, finalize our model as a hierarchical attention network with bidirectional GRUs and Google's BERT (Bidirectional Encoder Representations from Transformers) sentence embeddings. Our model is also augmented with a sender-aware encoding to differentiate the contributions of the customer side and the helpdesk side. According to the official STC evaluation on the test dataset, our proposed system is among the top-performing teams on the English dataset.
  • Wei Shih-Chieh, Cheng Chi-Bin, Cao Guangzhongyi, Chiang Yi-Jing, Wu Chin-Yi, Lin Shih-Hsiang and Tsai Kun-Li
    [Pdf] [Table of Content]
    In this work, we report how we (TKUIM) built a system for the CECG subtask of STC-3. Our system consists of two main parts: a response generation subsystem and an emotion classification subsystem. For the response generation subsystem, we trained five generative models with different training parameters. These models output response candidates based on a seq2seq deep learning architecture with an attention mechanism. For the emotion classification subsystem, we trained an emotion classifier with a probability output for each emotion class. Given the desired response emotion class, the corresponding emotion classifier is used to select the most probable response from the response candidates. An emotion acceptance threshold and a default response library are set up for each response emotion class. When the selected response does not pass the emotion acceptance threshold, a default response from the library for that emotion class is output to replace the poorly generated response. We submitted only one valid run, which received an average total score of 0.726 on a maximum scale of 2.
  • Min-Yuh Day, Yi-Jun Xie, Chi-Sheng Hung, Jhih-Yi Chen, Yu-Ling Kuo and Jian-Ting Lin
    [Pdf] [Table of Content]
    This paper describes the IMTKU (Information Management at Tamkang University) emotional dialogue system for short text conversation at the NTCIR-14 STC-3 Chinese Emotional Conversation Generation (CECG) subtask. The IMTKU team proposes an emotional dialogue system that integrates a retrieval-based model, a generative-based model, and an emotion classification model with deep learning approaches. For the retrieval-based method, the Apache Solr search engine was used to retrieve responses to a given post, and the most similar one for each emotion was obtained with a word2vec similarity ranking model. For the generative-based method, we adopted a sequence-to-sequence model for generating responses, with an emotion classifier labeling the emotion of each response to a given post, again obtaining the most similar one for each emotion with a word2vec similarity ranking model. The official results show that the average score of IMTKU is 0.592 for the retrieval-based model and 0.06 for the generative-based model. The IMTKU self-evaluation indicates an average score of 1.183 for the retrieval-based model and 0.16 for the generative-based model. The best accuracy of IMTKU's emotion classification model is 87.6%, achieved with a bi-directional long short-term memory (Bi-LSTM).
  • Sébastien Montella, Chia-Hui Chang and Frederic Lassabe
    [Pdf] [Table of Content]
    Recent studies have significantly contributed to high-quality text generation, and different frameworks have been leveraged to build sophisticated models for language generation. However, automatic generation of emotional content has barely been studied. We propose in this paper an Attention-Based Sequence Generation Model for Emotionally-Triggered Short-Text Conversation (STC). We use emotion category embeddings to represent different emotions and to trigger generation of the stated emotion. The emotion vectors are learned during training, giving the model a degree of freedom to adjust them accordingly. Our attention mechanism is customized to include emotional information by using gated convolutional neural networks (GCNNs) to create an emotional context vector. Moreover, we use a distinct Start-Of-Sentence (SOS) token for each emotion category to further push our sequence-to-sequence model into a specific emotional generation mode. This approach avoids implementing a different generative model for each emotion. Experimental results demonstrate both the ability of our architecture to generate affective answers and the difficulty of the task.
  • Xiaohe Li, Jiaqing Liu, Weihao Zheng, Xiangbo Wang, Yutao Zhu and Zhicheng Dou
    [Pdf] [Table of Content]
    This paper describes RUCIR's system in the NTCIR-14 Short Text Conversation (STC) Chinese Emotional Conversation Generation (CECG) subtask. In our system, we use the attention-based Sequence-to-Sequence (Seq2seq) method as our basic structure to generate emotional responses. This paper introduces: 1) an emotion-aware Seq2seq model, and 2) several features to boost the performance of emotion consistency. Official results show that we are the best in terms of the overall results across the five given emotion categories.
  • Chih-Chien Wang, Min-Yuh Day, Wei-Jin Gao, Yen-Cheng Chiu and Chun-Lian Wu
    [Pdf] [Table of Content]
    This paper gives an overview of our work for the NII Testbeds and Community for Information access Research (NTCIR)-14 Short Text Conversation (STC)-3 Chinese Emotional Conversation Generation (CECG) subtask. In NTCIR-14 STC-3, the emotion of post-comment pairs is considered in both retrieval-based and generation-based approaches. For this subtask, we developed an emotion classifier and two approaches, generation-based and retrieval-based, to create responses to posts. In the STC-3 CECG subtask repository, each post and comment is labeled with one emotion tag: like, sadness, disgust, anger, or happiness. Posts and comments with emotions other than these are labeled as "other". We developed an emotion classifier to label the comments we created. The purpose of this subtask is to create comments that are coherent and fluent with respect to the post; in addition, the created comments should be emotionally consistent. In the retrieval-based approach, we used Apache Solr to search for an appropriate comment for each post. In the generation-based approach, we used an attention-based sequence-to-sequence (Seq2Seq) model to create a new comment for each post. For the emotion classifier, we used a multilayer perceptron (MLP). In the paper, we describe in detail our procedure for creating new comments for posts. We provided two submissions: a retrieval-based approach with emotion and a generation-based approach with emotion. However, due to a format issue, the evaluation results of the generation-based submission were not provided. For the purpose of self-improvement, we report our self-evaluation results for both submissions. Further suggestions for improvement are also provided in the paper.
  • Return to Top


    [WWW-2]


  • Jiaxin Mao, Tetsuya Sakai, Cheng Luo, Peng Xiao, Yiqun Liu and Zhicheng Dou
    [Pdf] [Table of Content]
    In this paper, we provide an overview of the NTCIR-14 We Want Web-2 (WWW-2) task, which includes Chinese and English subtasks. The WWW tasks are classical ad-hoc textual retrieval tasks. The WWW-2 task received 10 runs from 2 teams for the Chinese subtask and 18 runs from 4 teams for the English subtask. In this overview paper, we describe the task details, data, and evaluation methods, and report the official results.
  • Andrew Yates
    [Pdf] [Table of Content]
    MPII participated in the English subtask of the NTCIR-14 WWW-2 Task. We evaluated several variants of the PACRR neural re-ranking model, considering smaller vs. larger models and the impact of cascade pooling. No significant differences were found between any of the runs.
  • Yukun Zheng, Zhumin Chu, Xiangsheng Li, Jiaxin Mao, Yiqun Liu, Min Zhang and Shaoping Ma
    [Pdf] [Table of Content]
    The THUIR team participated in both the Chinese and English subtasks of the NTCIR-14 We Want Web-2 (WWW-2) task. This paper describes our approaches and results in the WWW-2 task. In the Chinese subtask, we designed and trained two neural ranking models on the Sogou-QCL dataset. In the English subtask, we adopted learning-to-rank models trained on the MQ2007 and MQ2008 datasets. Our methods achieved the best performance in both the Chinese and English subtasks.
  • Xue Yang, Shuqi Lu, Shijun Wang, Han Zhang and Zhicheng Dou
    [Pdf] [Table of Content]
    The RUCIR team participated in the Chinese and English subtasks of the NTCIR-14 We Want Web-2 (WWW-2) Task. In this paper, we describe our approach to the ad hoc Web search problem and introduce the official results. For both the Chinese and English subtasks, we adopted a learning-to-rank framework to re-rank candidate documents for each query. We extracted several traditional ranking features for each query-document pair and, at the same time, trained deep neural models to obtain matching scores, which we used as deep features. The traditional features and deep features are fused by the learning-to-rank model.
  • Peng Xiao and Tetsuya Sakai
    [Pdf] [Table of Content]
    SLWWW participated in the English subtask of the NTCIR-14 We Want Web-2 (WWW-2) task; this paper describes the approaches we implemented during the task.
  • Return to Top



    Pilot Tasks


    [CENTRE]


  • Tetsuya Sakai, Nicola Ferro, Ian Soboroff, Zhaohao Zeng, Peng Xiao and Maria Maistro
    [Pdf] [Table of Content]
    CENTRE is the first-ever metatask that operates across the three major information retrieval evaluation venues: CLEF, NTCIR, and TREC. The task had three subtasks: T1 (Replicability), T2TREC (Reproducibility), and T2OPEN (Reproducibility). The T1 subtask examined whether a particular pair of runs from the NTCIR-13 WWW-1 task can be replicated (on the same data). The T2TREC subtask examined whether a particular pair of runs from TREC 2013 Web track can be reproduced on the NTCIR-13 WWW-1 test collection. T2OPEN encouraged participants to reproduce past runs of their own choice on the WWW-1 test collection. Only one team (MPII) participated in CENTRE, but the team participated in all three subtasks. The NTCIR edition of CENTRE focussed on whether the effect of an Advanced run over a Baseline run can be replicated/reproduced. The results of MPII are quite positive for both T1 and T2TREC subtasks in terms of replicating/reproducing the overall effects, as measured by the Effect Ratio.
  • Andrew Yates
    [Pdf] [Table of Content]
    The MPII team participated in the T1, T2TREC, and T2OPEN subtasks of the NTCIR-14 CENTRE Task. This report describes our approaches, the known ways in which our approaches differed from the runs being reproduced, and the success of our reproductions. While our T1 replication and T2TREC reproduction were successful from an overall perspective, the per-topic results were mixed, and our T2OPEN reproduction was inconclusive. We discuss several factors that may have contributed to these outcomes.
  • Return to Top


    [FinNum]


  • Chung-Chi Chen, Hen-Hsen Huang, Hiroya Takamura and Hsin-Hsi Chen
    [Pdf] [Table of Content]
    Numerals are a crucial part of financial documents. To understand the opinions in financial documents in detail, we should not only analyze the text but also assay the numeric information in depth. Because of the informal writing style, analyzing social media data is more challenging than analyzing news and official documents. In this paper, we give an overview of the results of the FinNum shared task at NTCIR-14 on fine-grained numeral understanding in financial social media data, i.e., identifying the category of a given numeral in a tweet. The task attracted 13 registered participants, received 16 submissions from 9 participants, and finally accepted 6 papers from participants.
  • Abderrahim Ait Azzi and Houda Bouamor
    [Pdf] [Table of Content]
    This paper describes our submission to the NTCIR-14 FinNum Shared Task on Fine-Grained Numeral Understanding in Financial Tweets. We participate in both Subtask-1 and Subtask-2. We formulate the problem as a sequence labeling task and design a hybrid approach in which we use external linguistic and non-linguistic features to enrich the word-level representation within a CNN-based neural network. Since the two subtasks are strongly related, for Subtask-2 we introduce a fusion approach in which our model considers the category predicted in Subtask-1 when assigning a sub-category to each numeral. Our models achieve an F1 score of 93.94% (micro) and 90.05% (macro) on Subtask-1, and 87.17% (micro) and 82.40% (macro) on Subtask-2 on the test set. This ranks us 1st in the competition in both Subtask-1 and Subtask-2.
  • Alan Spark
    [Pdf] [Table of Content]
    This paper describes our effort, approaches, and results in both subtasks of the NTCIR-14 FinNum Task. In this task, the team focuses on feature extraction, experiments with the concatenation of various features, and insights derived from an unsupervised approach to construct and extend the data available for analysis. The feature extraction steps are designed for parallel execution, so the proposed techniques are meant for use at scale.
  • Chao-Chun Liang and Keh-Yih Su
    [Pdf] [Table of Content]
    This paper describes our work on the financial numeral classification problem in the NTCIR-14 FinNum task and discusses experimental results. After implementing the three proposed vanilla neural network models (CNN, RNN, and RNN with CNN filters), we further incorporate POS and NE linguistic features. Inspired by human observation, we also propose a pre-processing procedure that splits numerals in the tweet string in advance to reduce the OOV rate in the test set. Experimental results show that both approaches improve the performance significantly.
  • Wei Wang, Maofu Liu and Zhenlian Zhang
    [Pdf] [Table of Content]
    This paper describes our system for the FinNum task at NTCIR-14, which is dedicated to identifying the category of a numeral in financial social media data, i.e. tweets, for fine-grained numeral understanding. At NTCIR-14, the FinNum task contains two subtasks: the first, named Subtask1 in this paper, classifies a numeral into seven categories, i.e. Monetary, Percentage, Option, Indicator, Temporal, Quantity, and Product/Version Number; the second, named Subtask2 in this paper, extends Subtask1 to the subcategory level, classifying financial numerals into seventeen classes. In our system, we first complete Subtask1 and then, on the basis of the seven categories, separately classify numerals into the corresponding subcategories. Our submitted system constructs a classification model based on Support Vector Machines (SVM) to identify the categories of numerals. In additional experiments, we adopt the Bidirectional Encoder Representations from Transformers (BERT) model as a multi-class classifier, and the experimental results show that the BERT model is superior to the SVM model.
  • Ke Tian and Zi Jun Peng
    [Pdf] [Table of Content]
    This paper describes how we tackled the Fine-Grained Numeral Understanding in Financial Tweets (FinNum) task at NTCIR-14. A deep word- and character-embedding-based attention model is proposed for fine-grained classification of financial numeral tweets. Experiments show that the model performs well: the ensemble result achieves an F1-micro of 87.41% and an F1-macro of 78.04% on Task 1, and a final F1-micro of 80.64% and an F1-macro of 73.43% on Task 2.
  • Qianhui Wu, Guoxin Wang, Yuying Zhu, Haoyan Liu and Börje F. Karlsson
    [Pdf] [Table of Content]
    Numerals carry much information in the financial domain and thus play a crucial role in financial analysis processes. In this paper, we focus on the type classification of numerals in financial tweets and propose a hybrid neural model. A mention model employs a multi-layer perceptron to extract information from target numerals, while a context model uses recurrent neural networks to encode the preceding and following context separately. Moreover, we present several feature templates to replace inputs such as pre-trained word vectors, which help the model handle problems caused by sparse numeral embeddings. Experimental results demonstrate that the proposed approach clearly outperforms baseline methods.
  • Return to Top