NTCIR

NII Testbeds and Community for Information access Research
 

NTCIR-10 Crosslink-2: Task Description

Last Update: 20 November 2012

ChangeLog:
20/Nov/2012      Submission due date extended to 21 December
04/Sep/2012      Evaluation Tool with Wikipedia ground-truth released
24/Aug/2012      Validation Tool released
30/Jul/2012      Submission specification released
01/Jul/2012      Wikipedia CEJK XML collections and topics released
 

IMPORTANT DATES

Run submissions due: December 21, 2012
Evaluation results released: January 31, 2013
Paper submissions due: March 31, 2013
Final workshop meeting: June 18-21, 2013
 

1.    Introduction

Cross-lingual link discovery (CLLD) is concerned with automatically finding potential links between documents in different languages. In contrast to traditional information retrieval tasks, where queries are not attached to explicit context or are only loosely attached to it, CLLD algorithms actively recommend a set of meaningful anchors in the context of a source document and establish links to documents in another language. CLLD is helpful for discovering complementary knowledge in different language domains.

Currently, in a knowledge base such as Wikipedia, articles in different languages are rarely cross-linked except for directly equivalent pages (on the same subject) in different languages. This can pose serious difficulties for users seeking information or knowledge from sources in different languages, or where there is no equivalent page in one language or another.


Figure 1: Lost in translation

Figure 1 shows several different language versions of the page on “Custard”. Note that: 1) anchors are largely linked to articles in the source language; 2) not all cross-language equivalent links exist – the English article “Custard” is not linked to the Italian custard article “Crema pasticcera”, and vice versa; 3) some cross-language equivalent links are incorrect – the Chinese custard article “奶黄” is incorrectly linked to the Italian pudding article “Budino”, and vice versa.

Therefore, the job of CLLD is to identify a set of meaningful anchors in the context of a source document and establish links to documents in an alternative language of the user's preference.

Crosslink was successfully held as a pilot task of NTCIR-9 in 2011. By the end of the experimentation season, a total of 57 runs had been received from 11 teams.

To participate, please visit one of the following registration pages:
(English version)
http://research.nii.ac.jp/ntcir/ntcir-10/howto.html
(Japanese version)
http://research.nii.ac.jp/ntcir/ntcir-10/howto-ja.html

2.    Task Definition

For the Crosslink task at NTCIR-10, we are planning three new subtasks, similar to those of NTCIR-9 but with the opposite link direction. The new subtasks are:

    Chinese to English CLLD (C2E)

    Japanese to English CLLD (J2E)

    Korean to English CLLD (K2E)

The new subtasks are not simple replicas of the previous Crosslink tasks. They allow the CLLD approaches proposed at NTCIR-9 for suggesting good links from English documents to Chinese / Japanese / Korean ones to be re-examined in a different linking environment.

In addition, participants will this time have to deal with an extra problem when cross-linking documents: there are no explicit word boundaries in Chinese / Japanese text, nor within a Korean eojeol (a space-delimited unit). The natural language processing needed for linking CJK-language documents to English ones could make this task even more challenging.

The teams participating in the NTCIR-9 Crosslink task committed huge efforts to creating various CLLD systems from scratch. To allow further evaluation of those systems with different topics, runs are also accepted for the subtasks of the previous evaluation round:
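To illustrate the word-boundary problem, the following is a minimal sketch of one possible way to find anchor candidates in unsegmented text: a greedy longest-match scan against a dictionary of known article titles. It is not a method prescribed by the task. The function name find_anchors, the max_len parameter and the small title mapping are all hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple


def find_anchors(text: str, title_to_target: Dict[str, str],
                 max_len: int = 8) -> List[Tuple[int, str, str]]:
    """Greedy longest-match scan over unsegmented text.

    Returns (offset, anchor, target) triples. No word segmenter is used:
    every character position is treated as a potential anchor start.
    """
    anchors = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first so that a longer title wins
        # over a shorter title that is its prefix.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in title_to_target:
                match = candidate
                break
        if match is None:
            i += 1  # no title starts here; move one character forward
        else:
            anchors.append((i, match, title_to_target[match]))
            i += len(match)  # skip past the matched anchor
    return anchors


# Hypothetical mapping from Japanese titles to English article titles.
titles = {"東京": "Tokyo", "東京大学": "University of Tokyo", "図書館": "Library"}
result = find_anchors("東京大学の図書館", titles)
print(result)
```

Longest-match is only one heuristic. It avoids splitting a long title into shorter ones, but it cannot resolve ambiguous anchors, which a real system would have to rank by context.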
•    English to Chinese CLLD
•    English to Japanese CLLD
•    English to Korean CLLD


Having the same subtasks in the new evaluation round will make it possible to track the continued improvement of existing CLLD systems.

3.    Topic and Document Collection

3.1 Training and Test Topics
Training topics
For system training, please download the test topics and the document collections for the NTCIR-9 Crosslink task when they are available. Participants may use these topics to create dry runs.

Test topics
Four sets of topics, one per language (ECJK) with 25 articles each, have been chosen from the new Wikipedia document collections and are used as test topics. Participants should use these topics to create runs for formal final submissions. These topics will be provided as orphaned Wikipedia articles, with all hyperlinks to and from these documents removed. The corresponding topic pages in English, Chinese, Japanese and Korean will not be contained in the document collections.
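As a rough illustration of what orphaning a topic involves, the sketch below strips outgoing link elements from an XML article while keeping their anchor text. The element name link, the attribute target and the sample article are assumptions made for this example. The actual schema of the collection files is not described here, and this is not the organizers' tool.

```python
import xml.etree.ElementTree as ET


def orphan(xml_text: str, link_tag: str = "link") -> str:
    """Replace every <link_tag> element with its plain anchor text."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        last_kept = None  # most recent sibling that stays in the tree
        for child in list(parent):
            if child.tag != link_tag:
                last_kept = child
                continue
            # Text to keep: the anchor text plus whatever followed the link.
            kept = "".join(child.itertext()) + (child.tail or "")
            if last_kept is None:
                parent.text = (parent.text or "") + kept
            else:
                last_kept.tail = (last_kept.tail or "") + kept
            parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Hypothetical article fragment; the tag and attribute names are assumptions.
sample = ('<article><p>A <link target="1">custard</link> is a '
          '<b>dessert</b> made with <link target="2">egg</link>.</p></article>')
orphaned = orphan(sample)
print(orphaned)
```

This covers only links from the topic. Removing links to the topic would mean applying a similar pass to the other documents, filtered on the link target. Any markup nested inside a link is flattened to text.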

3.2 Document Collection
For the Crosslink-2 task, an English Wikipedia collection and new CJK collections were created for the evaluation of the new subtasks.

The collections consist of search-engine-friendly XML files created from recent dumps of the Wikipedia MySQL database. The details of the collections are given as follows (the language of the corpus, the number of articles, the size of the corpus, and the date of the dump):

Language      Articles       Size     Date of dump
English       3,581,772      33G      04/01/2012
Chinese       404,620        3.6G     11/01/2012
Japanese      858,610        9.8G     04/01/2012
Korean        297,913        2.2G     22/01/2012

4.    Submission
Please refer to the Submission page for detailed information.
 
5.    Evaluation
Please see the evaluation page for Crosslink-1. A copy of the latest evaluation tool is available for download.
 
6.   Task Organizers

Shlomo Geva                Queensland University of Technology, Australia
In-Su Kang                     Kyungsung University, South Korea
Fuminori Kimura           Ritsumeikan University, Japan
Yi-Hsun, Lee                  Academia Sinica, Taiwan
Eric Tang                       Queensland University of Technology, Australia
Andrew Trotman           University of Otago, New Zealand
Yue Xu                           Queensland University of Technology, Australia


Task mailing list:
crosslink


7. Paper   

Participant paper submission page: https://www.easychair.org/conferences/?conf=crosslink2

8. Schedule

~~/02/2012          Call for Task Participation
31/08/2012          Task registration due
07~11/2012         Dry run
09~12/2012         Formal run
01/07/2012          Crosslink test topics release
15/07/2012          Crosslink validation tool for test topics release
27/08/2012          Crosslink validation tool for test topics release  
31/11/2012          Run submissions due (original deadline, since extended)
21/12/2012          Run submissions due (extended deadline)
01/12/2012          Crosslink assessment tool release
01/12/2012          Crosslink evaluation tool with ground-truth release
31/01/2013          Evaluation results due
01/02/2013          Task overview partial release
31/03/2013          Participant paper submission due
20/04/2013          Task overview paper submission due
01/05/2013          All camera-ready copy for the Proceedings due

18-21/06/2013    EVIA 2013/NTCIR-10 Meeting


9. Resources

1)  The overview paper for the NTCIR-9 Crosslink task: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings9/NTCIR/01-NTCIR9-OV-CROSSLINK-TangL.pdf 
  
2)  NTCIR-9 Crosslink Website: http://ntcir.nii.ac.jp/CrossLink
4)  Assessment and evaluation tool-kits project site: http://code.google.com/p/crosslink/ 
5)  Open source link discovery tools and library:
 