Document data
Collection | Document Type | Language | Year | Volume |
NTCIR-9 Crosslink | Wikipedia articles | Chinese | 2010 | 318,736 |
NTCIR-9 Crosslink | Wikipedia articles | Japanese | 2010 | 716,088 |
NTCIR-9 Crosslink | Wikipedia articles | Korean | 2010 | 201,596 |
Task data
Topic language | Topic set | # Topics | Relevance judgment |
English | Training | 3 | One set of qrels (from Wikipedia ground truth) for training topics |
English | Test | 25 | Two sets of qrels (from Wikipedia ground truth and manual assessment) for test topics |
NTCIR-9 Crosslink Document Collections (CJK XML Corpora) and Topics are distributed under the conditions of Creative Commons Attribution-Share-Alike License 3.0 (Unported).
For more details, please visit this page:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/crosslink_documents.html
NTCIR-9 Crosslink Document Collections are available at:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/crosslink_documents.html
Language | # doc | Size | Dump Date |
Chinese | 318,736 | 2.7G | 27/06/2010 |
Japanese | 716,088 | 6.1G | 24/06/2010 |
Korean | 201,596 | 1.2G | 28/06/2010 |
Tags
Most tags are kept the same as in the original Wikipedia XML dump.
Mandatory tags
Open tag | Close tag | Description |
<title> | </title> | The tag for document title |
<id> | </id> | The tag "id" of first level is the document identifier |
<link> | </link> | The tag for link including general link and language link. |
<timestamp> | </timestamp> | Last update timestamp |
<categories> | </categories> | including a list of sub-categories |
Other tags
Open tag | Close tag | Description |
<p> | </p> | Paragraph marker |
<sec> | </sec> | Section identifier. An article often includes multiple sections. |
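As a rough illustration of how the mandatory tags might be read, the following Python sketch parses a small hand-written article. The sample document, its nesting, the `<article>`, `<revision>` and `<category>` element names, and the helper `read_article` are all assumptions made for illustration; actual corpus files may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample article; real corpus files may nest tags differently.
SAMPLE = """<article>
  <title>Example title</title>
  <id>12345</id>
  <revision><id>999</id></revision>
  <timestamp>2010-06-27T00:00:00Z</timestamp>
  <categories><category>Example category</category></categories>
  <sec><p>Some text with a <link>Linked page</link>.</p></sec>
</article>"""


def read_article(xml_text):
    """Return the mandatory fields of one article as a dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title"),
        # Only a direct child <id> is read, i.e. the first-level id.
        "id": root.findtext("id"),
        "timestamp": root.findtext("timestamp"),
        # iter() walks the whole tree, so links at any depth are found.
        "links": [el.text for el in root.iter("link")],
        # Every direct child of <categories> is treated as one category.
        "categories": [el.text for el in root.findall("categories/*")],
    }


article = read_article(SAMPLE)
print(article["id"], article["title"], article["links"])
```

Reading only the direct child `<id>` reflects the note that the first-level "id" is the document identifier, so a nested identifier is not picked up by mistake.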
Training and Test Topics are available at:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/crosslink_documents.html
* All Topics are downloadable from NII/IDR. Please see here.
Training and Test Topics
Training topics
# | Title | ID |
1 | Australia | 4689264 |
2 | Femme fatale | 299098 |
3 | Martial arts | 19501 |
Only three topics are used for system training in the NTCIR-9 Crosslink task.
Test topics
A set of 25 articles will be randomly chosen from the English Wikipedia
and used as formal test topics.
Both training and test topics can be used to generate evaluation runs for
system benchmarking. Please note that participants should orphan these topics
by removing all hyperlinks to and from these documents. The
corresponding topic pages in Chinese, Japanese and Korean should also be
removed from the document collections.
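The removal of topic pages from a CJK collection can be sketched in Python on a simplified in-memory model. Here a collection is assumed to be a dictionary that maps a document id to the list of ids it links to. This representation, the ids, and the function name `orphan_topics` are illustrative assumptions, not part of the task toolkit. Stripping the links that point to the removed pages is likewise an assumption about how one might keep the collection consistent.

```python
def orphan_topics(collection, topic_ids):
    """Drop topic pages and every link that points to them.

    `collection` maps a document id to the list of document ids it links to
    (a simplified, assumed model of a Wikipedia collection).
    """
    topic_ids = set(topic_ids)
    orphaned = {}
    for doc_id, targets in collection.items():
        if doc_id in topic_ids:
            # Removing the topic page also removes its outgoing links.
            continue
        # Remove incoming links to the topic pages from every other page.
        orphaned[doc_id] = [t for t in targets if t not in topic_ids]
    return orphaned


# Toy collection with hypothetical ids; "t1" stands for a topic page.
toy = {
    "t1": ["100", "200"],
    "100": ["t1", "200"],
    "200": ["100"],
}
print(orphan_topics(toy, ["t1"]))
```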
<!ELEMENT crosslink-submission (details, description, collections, topic+)>
<!ATTLIST crosslink-submission participant-id CDATA #REQUIRED run-id CDATA #REQUIRED task (A2F|A2B) #REQUIRED default_lang (zh|ja|ko) #REQUIRED>
<!ELEMENT details (machine, time)>
<!ELEMENT machine (cpu, speed, cores, hyperthreads, memory)>
<!ELEMENT cpu (#PCDATA)>
<!ELEMENT speed (#PCDATA)>
<!ELEMENT cores (#PCDATA)>
<!ELEMENT hyperthreads (#PCDATA)>
<!ELEMENT memory (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT collections (collection+)>
<!ELEMENT collection (#PCDATA)>
<!ELEMENT topic (outgoing)>
<!ATTLIST topic file CDATA #REQUIRED name CDATA #REQUIRED>
<!ELEMENT outgoing (anchor+)>
<!ELEMENT anchor (tofile+)>
<!ATTLIST anchor name CDATA #REQUIRED offset CDATA #REQUIRED length CDATA #REQUIRED>
<!ELEMENT tofile (#PCDATA)>
<!ATTLIST tofile bep_offset CDATA #REQUIRED lang (zh|ja|ko) #REQUIRED title CDATA #REQUIRED>
<crosslink-submission participant-id="QUT"
    run-id="QUT_E2C_A2B_01"
    task="A2F"
    default_lang="zh">
  <details>
    <machine>
      <cpu>Intel Celeron</cpu>
      <speed>1.06GHz</speed>
      <cores>1</cores>
      <hyperthreads>1</hyperthreads>
      <memory>128MB</memory>
    </machine>
    <time>3.04 seconds</time>
  </details>
  <description>Describe the approach here, NOT in the run-id.</description>
  <collections>
    <collection>Chinese Wikipedia</collection>
  </collections>
  <topic file="9638" name=" 99 Luftballons">
    <outgoing>
      <anchor offset="768" length="8" name="Balloons">
        <tofile bep_offset="637" lang="zh" title="...">4424</tofile>
        <tofile bep_offset="238343" lang="zh" title="...">442489</tofile>
        <tofile bep_offset="23438" lang="zh" title="...">64424</tofile>
        <tofile bep_offset="8997" lang="zh" title="...">14424</tofile>
        <tofile bep_offset="334" lang="zh" title="...">43224</tofile>
      </anchor>
      ...
    </outgoing>
  </topic>
</crosslink-submission>
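A run file with this layout could be generated programmatically. The Python sketch below builds the element tree with the standard library, following the element order given in the DTD. The function name `build_run`, the shape of its input dictionaries, and the title "Example title" are assumptions for illustration only; the sketch does not validate the result against the DTD.

```python
import xml.etree.ElementTree as ET


def build_run(attrs, machine, run_time, description, collections, topics):
    """Build a <crosslink-submission> element (input layout is assumed)."""
    root = ET.Element("crosslink-submission", attrs)
    details = ET.SubElement(root, "details")
    machine_el = ET.SubElement(details, "machine")
    # Child order follows the DTD: cpu, speed, cores, hyperthreads, memory.
    for tag in ("cpu", "speed", "cores", "hyperthreads", "memory"):
        ET.SubElement(machine_el, tag).text = str(machine[tag])
    ET.SubElement(details, "time").text = run_time
    ET.SubElement(root, "description").text = description
    colls = ET.SubElement(root, "collections")
    for name in collections:
        ET.SubElement(colls, "collection").text = name
    for topic in topics:
        topic_el = ET.SubElement(root, "topic",
                                 file=topic["file"], name=topic["name"])
        outgoing = ET.SubElement(topic_el, "outgoing")
        for anchor in topic["anchors"]:
            anchor_el = ET.SubElement(outgoing, "anchor",
                                      name=anchor["name"],
                                      offset=str(anchor["offset"]),
                                      length=str(anchor["length"]))
            for target in anchor["targets"]:
                tofile = ET.SubElement(anchor_el, "tofile",
                                       bep_offset=str(target["bep_offset"]),
                                       lang=target["lang"],
                                       title=target["title"])
                # The element text is the id of the target document.
                tofile.text = str(target["id"])
    return root


run = build_run(
    {"participant-id": "QUT", "run-id": "QUT_E2C_A2B_01",
     "task": "A2F", "default_lang": "zh"},
    {"cpu": "Intel Celeron", "speed": "1.06GHz", "cores": 1,
     "hyperthreads": 1, "memory": "128MB"},
    "3.04 seconds",
    "Describe the approach here, NOT in the run-id.",
    ["Chinese Wikipedia"],
    [{"file": "9638", "name": "99 Luftballons", "anchors": [
        {"name": "Balloons", "offset": 768, "length": 8, "targets": [
            # "Example title" is a placeholder, not a real article title.
            {"id": 4424, "bep_offset": 637, "lang": "zh",
             "title": "Example title"}]}]}],
)
print(ET.tostring(run, encoding="unicode"))
```

Numeric values are converted with `str()` because ElementTree only serializes string attributes and text.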
To obtain Gold Standard (qrels, Relevance Judgment Data), please visit here.
An example Gold Standard
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ltwResultsetType>
  <ltw_Topic name="Australia" id="4689264">
    <outgoingLinks>
      <outLink>131130</outLink>
      <outLink>108627</outLink>
      <outLink>7208</outLink>
      <outLink>292091</outLink>
      <outLink>1247664</outLink>
      <outLink>1213529</outLink>
      ...
      <outLink>457369</outLink>
      <outLink>479260</outLink>
    </outgoingLinks>
  </ltw_Topic>
</ltwResultsetType>
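A gold standard file of this shape can be loaded into a lookup table with a few lines of Python. The sketch below uses a shortened copy of the example; the helper names `load_gold` and `precision` are assumptions, and the precision computed here is only a simple illustration, not the official measure implemented in the evaluation toolkit.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example gold standard (three links only).
GOLD = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ltwResultsetType>
  <ltw_Topic name="Australia" id="4689264">
    <outgoingLinks>
      <outLink>131130</outLink>
      <outLink>108627</outLink>
      <outLink>7208</outLink>
    </outgoingLinks>
  </ltw_Topic>
</ltwResultsetType>"""


def load_gold(xml_bytes):
    """Map each topic id to the set of relevant target document ids."""
    root = ET.fromstring(xml_bytes)
    gold = {}
    for topic in root.iter("ltw_Topic"):
        links = {el.text.strip() for el in topic.iter("outLink")}
        gold[topic.get("id")] = links
    return gold


def precision(predicted, relevant):
    """Fraction of predicted target ids that appear in the gold standard."""
    if not predicted:
        return 0.0
    hits = sum(1 for doc_id in predicted if doc_id in relevant)
    return hits / len(predicted)


gold = load_gold(GOLD)
# Hypothetical system output for the Australia topic; "999" is not relevant.
print(precision(["131130", "999", "7208"], gold["4689264"]))
```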
The toolkits for assessment and evaluation in the NTCIR-9 Crosslink task are available at:
http://code.google.com/p/crosslink/
The test collection and data are available from NII free of charge.
Reference
- NTCIR-9 Crosslink Gold Standard (qrels, Relevance Judgment Data) and Topics are downloadable from NII/IDR at:
http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html
Please note: The above dataset package also includes NTCIR-10 Crosslink data. The data for NTCIR-9 are in the folder named 'crosslink1'.
- [1] Schenkel, R., F. Suchanek, and G. Kasneci, "YAWN: A Semantically Annotated Wikipedia XML Corpus." In Proceedings of BTW'2007, 2007.
- [2] Tang, L.-X., Geva, S., Trotman, A., Xu, Y., & Itakura, K. Y. (2011). Overview of the NTCIR-9 Crosslink Task: Cross-lingual Link Discovery. Proceedings of NTCIR-9 (pp. 437-463). Tokyo, Japan. [poster] [paper PDF]
- [3] NTCIR-9 CROSSLINK Task website
- [4] Crosslink Toolkits
- For the use of the Relevance Judgment Data (Gold Standard, qrels), please refer to [5] The terms of use [PDF].
License
Use and/or redistribution of the NTCIR-9 Crosslink CJK XML Corpora and Topics is permitted under the conditions of Creative Commons Attribution-Share-Alike License 3.0 (Unported).
Details can be found at http://creativecommons.org/licenses/by-sa/3.0/.
Contact us: ntc-secretariat