Proceedings of the Ninth International Workshop on Evaluating Information Access (EVIA 2019),
a Satellite Workshop of the NTCIR-14 Conference
June 10, 2019
National Institute of Informatics, Tokyo, Japan

Abstracts


    [Preface]


  • Nicola Ferro, Ian Soboroff and Min Zhang


    [NTCIR Book]


  • Douglas W. Oard, Tetsuya Sakai and Noriko Kando


    [EVIA]


  • Charles Clarke
    Document length was rarely a factor in traditional retrieval evaluation metrics. As a result, traditional rankers could take advantage of this property by favoring longer documents, which were more likely to be relevant. Returning a long, non-relevant document was in no way penalized by these traditional metrics, even if in practice the searcher would spend time fruitlessly scanning the document for relevant material. As we show by way of an illustrative experiment, a policy of ignoring document length may have had unfortunate impacts even in traditional contexts. But this policy becomes increasingly inappropriate in an era of neural rankers, where query-document similarity may be computed directly from text. Freed from ranking based on document- and corpus-level statistics, it should be possible to return precisely the required information and no more. Future evaluation efforts and metrics should reflect this goal.
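    A minimal Python sketch of this point, using hypothetical function names, a toy discounting scheme, and invented rankings (not the evaluation the author proposes): a traditional rank-based metric such as precision@k scores a concise ranking and a padded ranking identically, while a length-aware variant would not.

      def precision_at_k(ranking, relevant, k=5):
          # Traditional P@k: document length plays no role at all.
          return sum(1 for doc_id, _length in ranking[:k] if doc_id in relevant) / k

      def length_discounted_gain(ranking, relevant, k=5, unit=1000):
          # Hypothetical variant: discount each unit of gain by the amount of
          # text the searcher has had to scan, so returning long non-relevant
          # documents is no longer "free".
          gain, scanned = 0.0, 0
          for doc_id, length in ranking[:k]:
              scanned += length
              if doc_id in relevant:
                  gain += 1.0 / max(1.0, scanned / unit)
          return gain

      relevant = {"d1", "d2"}
      concise = [("d1", 400), ("d2", 500), ("n1", 300), ("n2", 350), ("n3", 300)]
      padded  = [("d1", 400), ("x1", 90000), ("d2", 500), ("x2", 80000), ("n1", 300)]
      print(precision_at_k(concise, relevant), precision_at_k(padded, relevant))    # identical
      print(length_discounted_gain(concise, relevant),
            length_discounted_gain(padded, relevant))                               # very different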
  • Shih-Hung Wu, Wen-Feng Shih and Sheng-Lun Chien
    As research on automatically generated conversation advances, the evaluation of such systems is becoming increasingly important. Our research goal is how to evaluate the quality of emotional conversational text. The two major evaluation approaches each have drawbacks: automatic evaluation methods can judge a dialogue system quickly, but there is currently no commonly accepted metric, while human judgments suffer from inconsistency, with unstable inter-annotator agreement. In this paper, we study how to make human judgment more stable by analyzing the mutual agreement between different human judges, and we discuss how to systematically design evaluation questions. We consider how to improve the evaluation rules of the NTCIR STC-2 task, which were not originally designed for emotional conversation, and we design a process to identify stable factors related to catharsis-oriented emotional aspects. Dialogue data with catharsis types are gathered from our STC-2 system, and an evaluation questionnaire covering different aspects is tested for whether it yields consistent judgments. By analyzing the questionnaire results, we identify the aspects that achieve higher consistency.
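    One standard way to quantify the mutual agreement between two human judges mentioned above is Cohen's kappa; the short Python sketch below shows the computation on hypothetical quality judgments of dialogue responses (an illustration only, not the authors' questionnaire or analysis procedure).

      from collections import Counter

      def cohens_kappa(labels_a, labels_b):
          # Observed agreement between two annotators, corrected for the
          # agreement expected by chance given each annotator's label distribution.
          assert len(labels_a) == len(labels_b)
          n = len(labels_a)
          observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
          count_a, count_b = Counter(labels_a), Counter(labels_b)
          expected = sum(count_a[c] * count_b[c]
                         for c in set(labels_a) | set(labels_b)) / (n * n)
          return (observed - expected) / (1 - expected) if expected < 1 else 1.0

      # Hypothetical judgments of dialogue quality on a three-point scale.
      judge_1 = ["good", "ok", "bad", "good", "ok", "good", "bad", "ok"]
      judge_2 = ["good", "ok", "ok",  "good", "bad", "good", "bad", "ok"]
      print(round(cohens_kappa(judge_1, judge_2), 3))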
  • Simona Frenda, Noriko Kando, Viviana Patti and Paolo Rosso
    Important issues, such as governmental laws on abortion, are discussed online every day, involving different opinions that may or may not be favorable. Often the debates change tone and become more aggressive, undermining the discussion. In this paper, we analyze the relation between abusive language and stances of disapproval toward controversial issues that involve specific groups of people (such as women), who are also common targets of hate speech. We analyzed the tweets about the feminist movement and the legalization of abortion released by the organizers of the Stance Detection shared task at SemEval 2016. An interesting finding is the usefulness of semantic and lexical features related to misogynistic and sexist speech, which considerably improve the sensitivity of the stance classification system toward the feminist movement. For the abortion issue, we found that the majority of the expressions relevant for classification are negative and aggressive. The improvements in precision, recall, and F-score are confirmed by an analysis of the correctly predicted unfavorable tweets, which are characterized by expressions of hatred against women. The promising results of this initial study demonstrate that disapproval is often expressed using abusive language, and suggest that monitoring hate speech and abusive language during stance detection could be exploited to improve the quality of debates in social media.
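    A minimal sketch of the general idea of combining lexical n-gram features with a small abusive-language lexicon in a stance classifier; the lexicon entries, toy tweets, and scikit-learn pipeline below are assumptions for illustration only, not the system or lexical resources described in the paper.

      import numpy as np
      from scipy.sparse import hstack, csr_matrix
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression

      ABUSIVE_LEXICON = {"shut up", "stupid", "hysterical"}   # placeholder entries

      def lexicon_counts(tweets):
          # One extra feature per tweet: how many lexicon terms it contains.
          return csr_matrix([[sum(term in t.lower() for term in ABUSIVE_LEXICON)]
                             for t in tweets])

      tweets = ["Women deserve equal pay", "shut up, this movement is stupid",
                "Proud to march today", "stop being hysterical about it"]
      stance = ["FAVOR", "AGAINST", "FAVOR", "AGAINST"]

      # Word/bigram tf-idf features augmented with the lexicon-count feature.
      vectorizer = TfidfVectorizer(ngram_range=(1, 2))
      X_train = hstack([vectorizer.fit_transform(tweets), lexicon_counts(tweets)])
      clf = LogisticRegression().fit(X_train, stance)

      test = ["shut up about abortion"]
      X_test = hstack([vectorizer.transform(test), lexicon_counts(test)])
      print(clf.predict(X_test))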
  • Douglas Oard, Marine Carpuat, Petra Galuscakova, Joseph Barrow, Suraj Nair, Xing Niu, Han-Chin Shing, Weijia Xu, Elena Zotkina, Kathleen McKeown, Smaranda Muresan, Efsun Kayi, Ramy Eskander, Chris Kedzie, Yan Virin, Dragomir Radev, Rui Zhang, Mark Gales, Anton Ragni and Kenneth Heafield
    Sixteen years ago, the first "surprise language exercise" was conducted, in Cebuano. The goal of a surprise language exercise is to evaluate how well systems for a new language can be built quickly. This paper briefly reviews the history of surprise language exercises, and includes some details from the most recent exercise, in Lithuanian, to help illustrate how the state of the art has advanced over this period.