NTCIR Project
NTCIR-10 Cross-lingual Link Discovery (CrossLink-2)
Research Purpose Use of Test Collection


NTCIR-10 Cross-lingual Link Discovery (CROSSLINK-2)

The NTCIR-10 CrossLink-2 CEJK Wikipedia Corpora are test collections for experiments on cross-lingual link discovery (CLLD) between the English and CJK (Chinese, Japanese and Korean) Wikipedias. Bidirectional cross-lingual document linking tasks can include:

CJK to English

Chinese to English CLLD (C2E) subtask
Japanese to English CLLD (J2E) subtask
Korean to English CLLD (K2E) subtask

English to CJK

English to Chinese CLLD (E2C) subtask
English to Japanese CLLD (E2J) subtask
English to Korean CLLD (E2K) subtask

For each language subtask, a set of 25 documents is provided as topics. For each topic, participants are required to identify prospective anchors and recommend links for them in the target corpus.

Collection	Document type	Language	Year	Volume (articles)	Topic languages	Test topics	Relevance judgment
NTCIR-10 CrossLink-2	Wikipedia articles	Chinese	2012	404,620	CEJK	25 x 4	Two sets of qrels (from Wikipedia ground truth and manual assessment) for the test topics
		English	2012	3,581,772
		Japanese	2012	858,610
		Korean	2012	297,913

NTCIR-10 CrossLink-2 Document Collections (CEJK Wikipedia Corpora) and Topics are distributed under the conditions of Creative Commons Attribution-Share-Alike License 3.0 (Unported).

For more details, please visit this page:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/10crosslink_documents.html

NTCIR-10 CrossLink-2 Document Collections are available at:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/10crosslink_documents.html

The document sets included in the NTCIR-10 CrossLink-2 test collection consist of search-engine-friendly XML files created from Wikipedia MySQL database dumps taken in 2012. The original article text, which contains Wikipedia-specific markup, was converted into XML using the YAWN system [1]. Details of the collections are given in the following table.

Language	Documents	Size	Dump date (dd/mm/yyyy)
English	3,581,772	33GB	04/01/2012
Chinese	404,620	3.6GB	11/01/2012
Japanese	858,610	9.8GB	04/01/2012
Korean	297,913	2.2GB	22/01/2012

Tags

Most tags are kept the same as those in the original Wikipedia XML dump.
Some important tags are given below:

Mandatory tags
<title>	</title>	The tag for document title
<id>	</id>	The tag "id" of first level is the document identifier
<link>	</link>	The tag for links, covering both general links and language links. A language link carries a special attribute, e.g. xlink:label="ko". The language codes are zh, ja, ko and en.
<timestamp>	</timestamp>	Last update timestamp
<categories>	</categories>	including a list of sub-categories
Other tags
<p>	</p>	Paragraph marker
<sec>	</sec>	Section identifier. An article often includes multiple sections.
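
As an illustration of how these tags might be read, the following is a minimal Python sketch that pulls the title, the first-level id and the language links out of one document. The sample document is hypothetical: the root element name (article), the xlink namespace URI, and all titles and ids are assumptions made for the example and are not taken from the corpus itself.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "xlink" prefix; check the corpus files themselves.
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical document: root element name, ids and titles are illustrative only.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Example</title>
  <id>12345</id>
  <timestamp>2012-01-04T00:00:00Z</timestamp>
  <sec>
    <p>Some text with a <link>general link</link> in it.</p>
  </sec>
  <link xlink:label="ko">Example title in Korean</link>
  <link xlink:label="ja">Example title in Japanese</link>
</article>"""


def read_document(xml_text):
    """Return the title, document id and language links of one document."""
    root = ET.fromstring(xml_text)
    language_links = {}
    for link in root.iter("link"):
        # Only language links carry the xlink:label attribute.
        label = link.get("{%s}label" % XLINK_NS)
        if label is not None:
            language_links[label] = (link.text or "").strip()
    return {
        "title": root.findtext("title"),
        "id": root.findtext("id"),  # direct child of the root, i.e. first level
        "language_links": language_links,
    }


doc = read_document(SAMPLE)
print(doc["title"], doc["id"], sorted(doc["language_links"]))
```

General links have no xlink:label attribute in this sketch, so they are skipped when collecting language links.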

Training and Test Topics are available at:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/10crosslink_documents.html
* All Topics are downloadable from NII/IDR. Please see here.

Test Topics

25 articles in each of the CEJK languages, covering various topics, were chosen and used as formal test topics.

Submission XML File DTD


<!ELEMENT crosslink-submission (details, description, collections, topic+)>
<!ATTLIST crosslink-submission
   participant-id CDATA #REQUIRED
   run-id CDATA #REQUIRED
   task (A2F) #REQUIRED
   source_lang (zh|en|ja|ko) #REQUIRED
   default_lang (zh|en|ja|ko) #REQUIRED
>
<!ELEMENT details (machine, time)>
<!ELEMENT machine (cpu, speed, cores, hyperthreads, memory)>
<!ELEMENT cpu (#PCDATA)>
<!ELEMENT speed (#PCDATA)>
<!ELEMENT cores (#PCDATA)>
<!ELEMENT hyperthreads (#PCDATA)>
<!ELEMENT memory (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT collections (collection+)>
<!ELEMENT collection (#PCDATA)>
<!ELEMENT topic (outgoing)>
<!ATTLIST topic
   file CDATA #REQUIRED
   name CDATA #REQUIRED
>

<!ELEMENT outgoing (anchor+)>

<!ELEMENT anchor (tofile+)>
<!ATTLIST anchor
   name CDATA #REQUIRED
   offset CDATA #REQUIRED
   length CDATA #REQUIRED
>
<!ELEMENT tofile (#PCDATA)>
<!ATTLIST tofile
   bep_offset CDATA #REQUIRED
   lang (zh|en|ja|ko) #REQUIRED
   title CDATA #REQUIRED
>
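
A submission file following the DTD above could be assembled roughly as in the Python sketch below. It is only an illustration under stated assumptions: the helper build_submission and the shape of its topics argument are hypothetical, the machine details are placeholders, and the text content of tofile is assumed to be the target document identifier. The DTD does not say whether offsets are counted in bytes or characters, so that should be checked against the submission format page.

```python
import xml.etree.ElementTree as ET


def build_submission(participant_id, run_id, source_lang, default_lang, topics):
    """Build a crosslink-submission element tree (hypothetical helper)."""
    root = ET.Element("crosslink-submission", {
        "participant-id": participant_id,
        "run-id": run_id,
        "task": "A2F",
        "source_lang": source_lang,
        "default_lang": default_lang,
    })
    # Child order follows the DTD: details, description, collections, topic+.
    details = ET.SubElement(root, "details")
    machine = ET.SubElement(details, "machine")
    for tag in ("cpu", "speed", "cores", "hyperthreads", "memory"):
        ET.SubElement(machine, tag).text = "unknown"  # placeholder value
    ET.SubElement(details, "time").text = "unknown"  # placeholder value
    ET.SubElement(root, "description").text = "Example run"
    collections = ET.SubElement(root, "collections")
    ET.SubElement(collections, "collection").text = "Example collection"
    for topic in topics:
        topic_el = ET.SubElement(
            root, "topic", {"file": topic["file"], "name": topic["name"]})
        outgoing = ET.SubElement(topic_el, "outgoing")
        for anchor in topic["anchors"]:
            anchor_el = ET.SubElement(outgoing, "anchor", {
                "name": anchor["name"],
                "offset": str(anchor["offset"]),
                "length": str(anchor["length"]),
            })
            for target in anchor["targets"]:
                tofile = ET.SubElement(anchor_el, "tofile", {
                    "bep_offset": str(target["bep_offset"]),
                    "lang": target["lang"],
                    "title": target["title"],
                })
                tofile.text = target["id"]  # assumed: target document id
    return root


# Hypothetical topic with one anchor and one recommended target.
topics = [{
    "file": "12345", "name": "Example topic",
    "anchors": [{
        "name": "example anchor", "offset": 10, "length": 14,
        "targets": [{"id": "67890", "bep_offset": 0,
                     "lang": "en", "title": "Example target"}],
    }],
}]
submission = build_submission("example-team", "example-run-01", "zh", "en", topics)
xml_text = ET.tostring(submission, encoding="unicode")
print(xml_text)
```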

For more details about the submission format, please visit this page.

Gold Standard (qrel)

To obtain Gold Standard (qrels, Relevance Judgment Data), please visit here.

An example Gold Standard

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ltwResultsetType>
	<ltw_Topic name="Australia" id="4689264"> 
		<outgoingLinks>
			<outLink>131130</outLink>
			<outLink>108627</outLink>
			<outLink>7208</outLink>
			<outLink>292091</outLink>
			<outLink>1247664</outLink>
			<outLink>1213529</outLink>
                        ...
			<outLink>457369</outLink>
			<outLink>479260</outLink>
		</outgoingLinks>
	</ltw_Topic>
</ltwResultsetType>
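
To show how a file in this format might be consumed, here is a small Python sketch that loads the outgoing links per topic and compares them with a list of submitted target ids. The qrel text is a shortened copy of the example above, the submitted ids are made up, and the set-based precision and recall are a simplification for illustration only; the official measures are those implemented in the CrossLink evaluation toolkit.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example gold standard (first three links only).
QREL = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ltwResultsetType>
	<ltw_Topic name="Australia" id="4689264">
		<outgoingLinks>
			<outLink>131130</outLink>
			<outLink>108627</outLink>
			<outLink>7208</outLink>
		</outgoingLinks>
	</ltw_Topic>
</ltwResultsetType>"""


def load_qrels(xml_text):
    """Map each topic id to the set of its relevant target document ids."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    qrels = {}
    for topic in root.iter("ltw_Topic"):
        links = {out.text.strip() for out in topic.iter("outLink") if out.text}
        qrels[topic.get("id")] = links
    return qrels


def precision_recall(submitted, relevant):
    """Plain set-based precision and recall (not the official measure)."""
    submitted = set(submitted)
    hits = len(submitted & relevant)
    precision = hits / len(submitted) if submitted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


qrels = load_qrels(QREL)
# Hypothetical run output for the topic: two correct ids and one wrong id.
run_links = ["131130", "7208", "999999"]
precision, recall = precision_recall(run_links, qrels["4689264"])
print(precision, recall)
```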

Tools

The toolkits for assessment and evaluation in the NTCIR-10 CrossLink-2 task are available at:
http://code.google.com/p/crosslink/

The test collection and data are available from NII free of charge.

NTCIR-10 CrossLink Gold Standard (qrels, Relevance Judgment Data) and Topics are downloadable from NII/IDR at:
http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html

Please note: The above dataset package also includes the NTCIR-9 CrossLink data. The data for NTCIR-10 are in the folder named 'crosslink2'.

Reference

[1] Schenkel, R., Suchanek, F., and Kasneci, G. (2007). YAWN: A Semantically Annotated Wikipedia XML Corpus. In Proceedings of BTW 2007.

[2] Tang, L.-X., Kang, I.-S., Kimura, F., Lee, Y.-H., Trotman, A., Geva, S., et al. (2013). Overview of the NTCIR-10 Cross-Lingual Link Discovery Task. In Proceedings of NTCIR-10. Tokyo, Japan.

[3] Tang, L.-X., Geva, S., Trotman, A., Xu, Y., and Itakura, K. Y. (2011). Overview of the NTCIR-9 Crosslink Task: Cross-lingual Link Discovery. In Proceedings of NTCIR-9 (pp. 437-463). Tokyo, Japan.

NTCIR-10 CrossLink-2 Task website

CrossLink Toolkits

For the use of the Relevance Judgment Data (Gold Standard, qrels), please refer to the terms of use [PDF].

Contact us : ntc-secretariat

License

Use and/or redistribution of the NTCIR-10 CrossLink-2 CEJK Wikipedia Corpora and Topics is permitted under the conditions of the Creative Commons Attribution-Share-Alike License 3.0 (Unported).
Details can be found at http://creativecommons.org/licenses/by-sa/3.0/.


Updated on : 2013-09-12

ntc-admin
