NTCIR Project
NTCIR-9 Crosslink
(Cross-lingual Link Discovery)
Research Purpose Use of Test Collection


NTCIR-9 CROSSLINK (Cross-lingual Link Discovery)

The NTCIR-9 Crosslink test collection supports experiments in cross-lingual link discovery (CLLD) from English documents to CJK (Chinese, Japanese and Korean) documents. It covers three subtasks:

English to Chinese CLLD (E2C) subtask
English to Japanese CLLD (E2J) subtask
English to Korean CLLD (E2K) subtask

For each subtask, English documents are provided as topics. For each topic, a system must identify prospective anchors and recommend links from them to documents in the CJK document collections.

Collection: NTCIR-9 Crosslink

Document Data
Document Type	Language	Year	Volume (documents)
Wikipedia articles	Chinese	2010	318,736
Wikipedia articles	Japanese	2010	716,088
Wikipedia articles	Korean	2010	201,596

Task Data
Topic Set	Topic Language	# Topics	Relevance Judgment
Training	English	3	One set of qrels (from Wikipedia ground truth) for the training topics
Test	English	25	Two sets of qrels (from Wikipedia ground truth and manual assessment) for the test topics

NTCIR-9 Crosslink Document Collections (CJK XML Corpora) and Topics are distributed under the conditions of Creative Commons Attribution-Share-Alike License 3.0 (Unported).

For more details, please visit this page:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/crosslink_documents.html

NTCIR-9 Crosslink Document Collections are available at:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/crosslink_documents.html

The document sets included in the NTCIR-9 CROSSLINK test collection consist of search-engine-friendly XML files created from Wikipedia MySQL database dumps taken in June 2010. The original article text, which contains Wikipedia-specific markup, was converted into XML using the YAWN system [1]. Details of the collections are given in the following table.

Language	# Docs	Size	Dump Date (DD/MM/YYYY)
Chinese	318,736	2.7 GB	27/06/2010
Japanese	716,088	6.1 GB	24/06/2010
Korean	201,596	1.2 GB	28/06/2010

Tags

Most tags are the same as those in the original Wikipedia XML dump.
Some important tags are listed below:

Mandatory tags
<title>	</title>	The document title
<id>	</id>	The first-level <id> tag holds the document identifier
<link>	</link>	A link, either a general link or a language link. A language link carries a special attribute, e.g. xlink:label="ko". Language codes: zh, ja, ko, en
<timestamp>	</timestamp>	Timestamp of the last update
<categories>	</categories>	Contains a list of sub-categories
Other tags
<p>	</p>	Paragraph marker
<sec>	</sec>	Section identifier. An article often includes multiple sections.
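
As a rough illustration of how these tags might be read, the Python sketch below pulls the title, the first-level id and the language links out of one document. The sample document, the article root element, the summarize function and the namespace URI bound to the xlink prefix are all assumptions made for this example; real corpus files may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Assumption: the xlink prefix is bound to the standard XLink namespace URI.
XLINK_NS = "http://www.w3.org/1999/xlink"
LANGUAGE_CODES = {"zh", "ja", "ko", "en"}

# Hypothetical miniature document, invented for illustration only.
SAMPLE_DOC = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Balloon</title>
  <id>4424</id>
  <timestamp>2010-06-27T00:00:00Z</timestamp>
  <sec>
    <id>7</id>
    <p>A paragraph with a <link>general link</link>.</p>
  </sec>
  <link xlink:label="ko">Korean title</link>
  <link xlink:label="en">Balloon</link>
</article>"""


def summarize(xml_text):
    """Return the title, document id and language links of one document."""
    root = ET.fromstring(xml_text)
    language_links = {}
    for link in root.iter("link"):
        # Namespaced attributes are addressed as {namespace-uri}local-name.
        label = link.get("{%s}label" % XLINK_NS)
        if label in LANGUAGE_CODES:
            language_links[label] = (link.text or "").strip()
    return {
        "title": root.findtext("title"),
        # findtext only looks at direct children, so a nested <id> is ignored.
        "id": root.findtext("id"),
        "language_links": language_links,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_DOC))
```

General links have no xlink:label attribute in this sketch, so they are skipped when collecting language links.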

Training and Test Topics are available at:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/crosslink_documents.html

* All topics are downloadable from NII/IDR: http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html

Training and Test Topics
Training topics

#	Title	ID
1	Australia	4689264
2	Femme fatale	299098
3	Martial arts	19501

Only three topics are used for system training in the NTCIR-9 Crosslink task.

Test topics
A set of 25 articles was randomly chosen from the English Wikipedia and used as the formal test topics.

Both training and test topics can be used to generate evaluation runs for system benchmarking. Note that participants should orphan these topics by removing all hyperlinks to and from them. The corresponding topic pages in Chinese, Japanese and Korean should also be removed from the document collections.
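
One possible way to approach the orphaning step is sketched below in Python. It only covers two parts of it: unwrapping the link tags inside a topic (keeping the anchor text) and dropping the corresponding CJK topic pages from a list of document IDs. The function names unwrap_links and drop_topic_pages are hypothetical, and the regular expressions assume that link elements are not nested. Removing links that point to a topic from other documents depends on how link targets are encoded in the corpus, which is not covered here.

```python
import re

# <link .../> with no content: removed outright.
SELF_CLOSING_LINK = re.compile(r"<link(?=[\s/>])[^>]*/>")
# <link ...>anchor text</link>: replaced by the anchor text alone.
PAIRED_LINK = re.compile(r"<link(?=[\s/>])[^>]*>(.*?)</link>", re.DOTALL)


def unwrap_links(xml_text):
    """Strip link markup from a topic while keeping the linked text."""
    without_empty = SELF_CLOSING_LINK.sub("", xml_text)
    return PAIRED_LINK.sub(r"\1", without_empty)


def drop_topic_pages(doc_ids, topic_page_ids):
    """Return doc_ids without the pages that correspond to topics."""
    excluded = set(topic_page_ids)
    return [doc_id for doc_id in doc_ids if doc_id not in excluded]


if __name__ == "__main__":
    sample = '<p>See <link xlink:label="ja">Tokyo</link> for details.</p>'
    print(unwrap_links(sample))
    print(drop_topic_pages(["1", "2", "3"], ["2"]))
```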

Form

Submission XML File DTD


<!ELEMENT crosslink-submission (details, description, collections, topic+)>

<!ATTLIST crosslink-submission
   participant-id CDATA #REQUIRED
   run-id CDATA #REQUIRED
   task (A2F|A2B) #REQUIRED
   default_lang (zh|ja|ko) #REQUIRED
>

<!ELEMENT details (machine, time)>
<!ELEMENT machine (cpu, speed, cores, hyperthreads, memory)>
<!ELEMENT cpu (#PCDATA)>
<!ELEMENT speed (#PCDATA)>
<!ELEMENT cores (#PCDATA)>
<!ELEMENT hyperthreads (#PCDATA)>
<!ELEMENT memory (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT collections (collection+)>
<!ELEMENT collection (#PCDATA)>
<!ELEMENT topic (outgoing)>

<!ATTLIST topic
   file CDATA #REQUIRED
   name CDATA #REQUIRED
>

<!ELEMENT outgoing (anchor+)>

<!ELEMENT anchor (tofile+)>

<!ATTLIST anchor
   name CDATA #REQUIRED
   offset CDATA #REQUIRED
   length CDATA #REQUIRED
>

<!ELEMENT tofile (#PCDATA)>

<!ATTLIST tofile
   bep_offset CDATA #REQUIRED
   lang (zh|ja|ko) #REQUIRED
   title CDATA #REQUIRED
>

The root element crosslink-submission carries the participant's ID, the run ID (which should include your university affiliation), the task (either A2F or A2B) and the default target language (zh, ja or ko). Describe the linking algorithm in the description element. The collections element lists the document collections used in the run; each collection element should generally contain one of the following: Chinese Wikipedia, Japanese Wikipedia or Korean Wikipedia.

Each topic goes in its own topic element, whose outgoing element contains one anchor element for each anchor text to be linked. Each anchor element has offset, length and name attributes describing the recommended anchor, and one or more tofile sub-elements whose content is the target document ID. Each tofile element also gives the language, title and bep (best entry point) of the linked document in its lang, title and bep_offset attributes respectively.
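
The structure described above could be assembled programmatically along the following lines. This is a minimal Python sketch, not part of the official toolkit: the build_submission function and the dictionary shapes it takes are assumptions made for this example, while the element and attribute names follow the DTD.

```python
import xml.etree.ElementTree as ET

MACHINE_FIELDS = ("cpu", "speed", "cores", "hyperthreads", "memory")


def build_submission(run, topics):
    """Serialize run metadata and topics (hypothetical dict shapes) to XML."""
    root = ET.Element("crosslink-submission", {
        "participant-id": run["participant_id"],
        "run-id": run["run_id"],
        "task": run["task"],                  # A2F or A2B
        "default_lang": run["default_lang"],  # zh, ja or ko
    })
    details = ET.SubElement(root, "details")
    machine = ET.SubElement(details, "machine")
    for field in MACHINE_FIELDS:
        ET.SubElement(machine, field).text = run["machine"][field]
    ET.SubElement(details, "time").text = run["time"]
    ET.SubElement(root, "description").text = run["description"]
    collections = ET.SubElement(root, "collections")
    for name in run["collections"]:
        ET.SubElement(collections, "collection").text = name
    for topic in topics:
        topic_el = ET.SubElement(
            root, "topic", {"file": topic["file"], "name": topic["name"]})
        outgoing = ET.SubElement(topic_el, "outgoing")
        for anchor in topic["anchors"]:
            anchor_el = ET.SubElement(outgoing, "anchor", {
                "name": anchor["name"],
                "offset": str(anchor["offset"]),
                "length": str(anchor["length"]),
            })
            for target in anchor["targets"]:
                tofile = ET.SubElement(anchor_el, "tofile", {
                    "bep_offset": str(target["bep_offset"]),
                    "lang": target["lang"],
                    "title": target["title"],
                })
                tofile.text = target["doc_id"]  # target document ID
    return ET.tostring(root, encoding="unicode")


EXAMPLE_RUN = {
    "participant_id": "QUT", "run_id": "QUT_E2C_A2F_01", "task": "A2F",
    "default_lang": "zh", "time": "3.04 seconds",
    "machine": {"cpu": "Intel Celeron", "speed": "1.06GHz", "cores": "1",
                "hyperthreads": "1", "memory": "128MB"},
    "description": "Describe the approach here, NOT in the run-id.",
    "collections": ["Chinese Wikipedia"],
}
EXAMPLE_TOPICS = [{
    "file": "9638", "name": "99 Luftballons",
    "anchors": [{"name": "Balloons", "offset": 768, "length": 8,
                 "targets": [{"bep_offset": 637, "lang": "zh",
                              "title": "气球", "doc_id": "4424"}]}],
}]

if __name__ == "__main__":
    print(build_submission(EXAMPLE_RUN, EXAMPLE_TOPICS))
```

Building the file through an XML library rather than string concatenation keeps attribute quoting and character escaping consistent.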

Submission Format

Example submission

<crosslink-submission participant-id="QUT"
   run-id="QUT_E2C_A2B_01"
   task="A2F"
   default_lang="zh">

   <details>
      <machine>
         <cpu>Intel Celeron</cpu>
         <speed>1.06GHz</speed>
         <cores>1</cores>
         <hyperthreads>1</hyperthreads>
         <memory>128MB</memory>
      </machine>
      <time>3.04 seconds</time>
   </details>
   <description>Describe the approach here, NOT in the run-id.</description>
   <collections>
      <collection>Chinese Wikipedia</collection>
   </collections>

   <topic file="9638" name="99 Luftballons">
      <outgoing>
         <anchor offset="768" length="8" name="Balloons">
            <tofile bep_offset="637" lang="zh" title="气球">4424</tofile>
            <tofile bep_offset="238343" lang="zh" title="气球炸?">442489</tofile>
            <tofile bep_offset="23438" lang="zh" title="气球男孩事件">64424</tofile>
            <tofile bep_offset="8997" lang="zh" title="?气球之旅">14424</tofile>
            <tofile bep_offset="334" lang="zh" title="猫与气球">43224</tofile>
         </anchor>
         ...
      </outgoing>
   </topic>
</crosslink-submission>

For more details about the submission format, please visit this page.

Gold Standard (qrel)

To obtain Gold Standard (qrels, Relevance Judgment Data), please visit here.

An example Gold Standard

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ltwResultsetType>
	<ltw_Topic name="Australia" id="4689264"> 
		<outgoingLinks>
			<outLink>131130</outLink>
			<outLink>108627</outLink>
			<outLink>7208</outLink>
			<outLink>292091</outLink>
			<outLink>1247664</outLink>
			<outLink>1213529</outLink>
                        ...
			<outLink>457369</outLink>
			<outLink>479260</outLink>
		</outgoingLinks>
	</ltw_Topic>
</ltwResultsetType>
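
To show how a file in this format might be consumed, the Python sketch below loads the outgoing links per topic and compares them with a list of returned document IDs using plain set-based precision and recall. This is a simplified illustration only: the official measures are implemented in the Crosslink toolkit, and the load_qrels and precision_recall functions, the abbreviated sample and the returned IDs are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample in the gold-standard format shown above.
SAMPLE_QRELS = """<ltwResultsetType>
  <ltw_Topic name="Australia" id="4689264">
    <outgoingLinks>
      <outLink>131130</outLink>
      <outLink>108627</outLink>
      <outLink>7208</outLink>
    </outgoingLinks>
  </ltw_Topic>
</ltwResultsetType>"""


def load_qrels(xml_text):
    """Map each topic id to the set of relevant target document ids."""
    root = ET.fromstring(xml_text)
    qrels = {}
    for topic in root.iter("ltw_Topic"):
        qrels[topic.get("id")] = {
            link.text.strip() for link in topic.iter("outLink") if link.text
        }
    return qrels


def precision_recall(returned, relevant):
    """Set-based precision and recall over document ids (not the official metric)."""
    returned_set = set(returned)
    hits = len(returned_set & relevant)
    precision = hits / len(returned_set) if returned_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    qrels = load_qrels(SAMPLE_QRELS)
    # Hypothetical system output for the Australia topic.
    print(precision_recall(["131130", "7208", "999"], qrels["4689264"]))
```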

Tools

The toolkits for assessment and evaluation in the NTCIR-9 Crosslink task are available at:
http://code.google.com/p/crosslink/

The test collection and data are available from NII free of charge.

The NTCIR-9 Crosslink Gold Standard (qrels, Relevance Judgment Data) and Topics are downloadable from NII/IDR at:
http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html

Please note: the above dataset package also includes the NTCIR-10 Crosslink data. The data for NTCIR-9 are in the folder named 'crosslink1'.

Reference

[1] Schenkel, R., F. Suchanek, and G. Kasneci, "YAWN: A Semantically Annotated Wikipedia XML Corpus." In Proceedings of BTW'2007, 2007.
[2] Tang, L.-X., Geva, S., Trotman, A., Xu, Y., & Itakura, K. Y. (2011). Overview of the NTCIR-9 Crosslink Task: Cross-lingual Link Discovery. Proceedings of NTCIR-9 (pp. 437-463). Tokyo, Japan.
[3] NTCIR-9 CROSSLINK Task website
[4] Crosslink Toolkits
For the use of the Relevance Judgment Data (Gold Standard, qrels), please refer to
[5] The terms of use [PDF].

Contact us : ntc-secretariat

License

Use and/or redistribution of the NTCIR-9 Crosslink CJK XML Corpora and Topics is permitted under the conditions of the Creative Commons Attribution-Share-Alike License 3.0 (Unported).
Details can be found at http://creativecommons.org/licenses/by-sa/3.0/.


Updated on : 2015-07-24

