Collection | Document Type | Language | Year | Volume | Topic Language | Topic # | Relevance judgment |
NTCIR-10 CrossLink-2 | Wikipedia articles | Chinese | 2012 | 404,620 | CEJK | Test: 25 x 4 | Two sets of qrels (from Wikipedia ground truth and manual assessment) for test topics |
NTCIR-10 CrossLink-2 | Wikipedia articles | English | 2012 | 3,581,772 | | | |
NTCIR-10 CrossLink-2 | Wikipedia articles | Japanese | 2012 | 858,610 | | | |
NTCIR-10 CrossLink-2 | Wikipedia articles | Korean | 2012 | 297,913 | | | |
NTCIR-10 CrossLink-2 Document Collections (CEJK Wikipedia Corpora) and Topics are distributed under the conditions of Creative Commons Attribution-Share-Alike License 3.0 (Unported).
For more details, please visit this page:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/10crosslink_documents.html
NTCIR-10 CrossLink-2 Document Collections are available at:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/10crosslink_documents.html
English | 3,581,772 | 33GB | 04/01/2012 |
Chinese | 404,620 | 3.6GB | 11/01/2012 |
Japanese | 858,610 | 9.8GB | 04/01/2012 |
Korean | 297,913 | 2.2GB | 22/01/2012 |
Mandatory tags |
<title> | </title> | The tag for the document title |
<id> | </id> | The first-level "id" tag is the document identifier |
<link> | </link> | The tag for links, including general links and language links |
<timestamp> | </timestamp> | Last update timestamp |
<categories> | </categories> | Includes a list of sub-categories |
Other tags |
<p> | </p> | Paragraph marker |
<sec> | </sec> | Section identifier. An article often includes multiple sections. |
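As a rough illustration of how the tags listed above might be read, here is a minimal Python sketch using the standard library. The sample article is hypothetical: the root element name (`article`), the `category` child element, the identifier and the timestamp format are all assumptions made for the example, and only the tag names title, id, link, timestamp, categories, p and sec come from the table.

```python
import xml.etree.ElementTree as ET

# Hypothetical article. The root element name, the <category> children,
# the id value and the timestamp format are assumptions for illustration.
SAMPLE = """<article>
  <title>Example topic</title>
  <id>12345</id>
  <timestamp>2012-01-04T00:00:00Z</timestamp>
  <categories><category>Examples</category></categories>
  <sec>
    <p>First paragraph with a <link>Linked page</link>.</p>
    <p>Second paragraph.</p>
  </sec>
</article>"""


def summarize(xml_text):
    """Collect the mandatory fields and count the structural tags."""
    root = ET.fromstring(xml_text)
    return {
        # findtext("id") only looks at direct children of the root,
        # which matches the "first-level id" rule in the table.
        "id": root.findtext("id"),
        "title": root.findtext("title"),
        "timestamp": root.findtext("timestamp"),
        # iter() walks the whole tree, so nested links are included.
        "links": [el.text for el in root.iter("link")],
        "sections": len(list(root.iter("sec"))),
        "paragraphs": len(list(root.iter("p"))),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Real corpus files may wrap these tags differently, so the lookups would need adjusting once an actual document has been inspected.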
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/10crosslink_documents.html
* All Topics are downloadable from NII/IDR. Please see here.
25 articles in CEJK languages on various
topics are chosen and used as formal test topics.
<!ELEMENT crosslink-submission (details, description, collections, topic+)>
<!ATTLIST crosslink-submission
participant-id CDATA #REQUIRED
run-id CDATA #REQUIRED
task (A2F) #REQUIRED
source_lang (zh|en|ja|ko) #REQUIRED
default_lang (zh|en|ja|ko) #REQUIRED
>
<!ELEMENT details (machine, time)>
<!ELEMENT machine (cpu, speed, cores, hyperthreads, memory)>
<!ELEMENT cpu (#PCDATA)>
<!ELEMENT speed (#PCDATA)>
<!ELEMENT cores (#PCDATA)>
<!ELEMENT hyperthreads (#PCDATA)>
<!ELEMENT memory (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT collections (collection+)>
<!ELEMENT collection (#PCDATA)>
<!ELEMENT topic (outgoing)>
<!ATTLIST topic
file CDATA #REQUIRED
name CDATA #REQUIRED
>
<!ELEMENT outgoing (anchor+)>
<!ELEMENT anchor (tofile+)>
<!ATTLIST anchor
name CDATA #REQUIRED
offset CDATA #REQUIRED
length CDATA #REQUIRED
>
<!ELEMENT tofile (#PCDATA)>
<!ATTLIST tofile
bep_offset CDATA #REQUIRED
lang (zh|en|ja|ko) #REQUIRED
title CDATA #REQUIRED
>
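The DTD above can be exercised with a short Python sketch that assembles a submission file element by element. This is only an illustration under stated assumptions: the function name `build_submission`, the shape of the `TOPICS` input, all sample values, and the choice to put the target document identifier in the text of `tofile` are hypothetical. The DTD does not say what units `offset`, `length` and `bep_offset` use, so the numbers here are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical run data; every value below is a placeholder.
TOPICS = [{
    "file": "12345.xml",
    "name": "Example topic",
    "anchors": [{
        "name": "linked phrase", "offset": 10, "length": 13,
        "targets": [{"id": "67890", "bep_offset": 0,
                     "lang": "zh", "title": "Target page"}],
    }],
}]


def build_submission(participant_id, run_id, source_lang, default_lang, topics):
    """Build a crosslink-submission document following the DTD element order."""
    root = ET.Element("crosslink-submission", {
        "participant-id": participant_id,
        "run-id": run_id,
        "task": "A2F",  # the only value the DTD allows
        "source_lang": source_lang,
        "default_lang": default_lang,
    })
    # details -> machine (cpu, speed, cores, hyperthreads, memory), time
    details = ET.SubElement(root, "details")
    machine = ET.SubElement(details, "machine")
    for tag in ("cpu", "speed", "cores", "hyperthreads", "memory"):
        ET.SubElement(machine, tag).text = "unknown"
    ET.SubElement(details, "time").text = "unknown"
    ET.SubElement(root, "description").text = "example run"
    collections = ET.SubElement(root, "collections")
    ET.SubElement(collections, "collection").text = "example collection"
    for t in topics:
        topic = ET.SubElement(root, "topic", {"file": t["file"], "name": t["name"]})
        outgoing = ET.SubElement(topic, "outgoing")
        for a in t["anchors"]:
            anchor = ET.SubElement(outgoing, "anchor", {
                "name": a["name"],
                "offset": str(a["offset"]),
                "length": str(a["length"]),
            })
            for target in a["targets"]:
                tofile = ET.SubElement(anchor, "tofile", {
                    "bep_offset": str(target["bep_offset"]),
                    "lang": target["lang"],
                    "title": target["title"],
                })
                # Assumption: the element text carries the target document id.
                tofile.text = target["id"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_submission("team-x", "run-1", "en", "zh", TOPICS))
```

The standard library does not validate against a DTD, so a run produced this way would still need checking with a validating parser or the task toolkit before submission.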
To obtain Gold Standard (qrels, Relevance Judgment Data), please visit here.
An example Gold Standard
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ltwResultsetType>
  <ltw_Topic name="Australia" id="4689264">
    <outgoingLinks>
      <outLink>131130</outLink>
      <outLink>108627</outLink>
      <outLink>7208</outLink>
      <outLink>292091</outLink>
      <outLink>1247664</outLink>
      <outLink>1213529</outLink>
      ...
      <outLink>457369</outLink>
      <outLink>479260</outLink>
    </outgoingLinks>
  </ltw_Topic>
</ltwResultsetType>
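A Gold Standard file in the format shown above could be loaded along the following lines. This Python sketch is an assumption-laden illustration: the helper names `load_gold` and `precision` are hypothetical, the embedded sample keeps only three of the links from the example, and the set-overlap precision is a simple stand-in, not the measure implemented by the official evaluation toolkit.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example, limited to three outLink entries.
GOLD = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ltwResultsetType>
  <ltw_Topic name="Australia" id="4689264">
    <outgoingLinks>
      <outLink>131130</outLink>
      <outLink>108627</outLink>
      <outLink>7208</outLink>
    </outgoingLinks>
  </ltw_Topic>
</ltwResultsetType>"""


def load_gold(xml_bytes):
    """Map each topic id to the set of relevant target document ids."""
    root = ET.fromstring(xml_bytes)
    gold = {}
    for topic in root.iter("ltw_Topic"):
        links = {el.text.strip() for el in topic.iter("outLink") if el.text}
        gold[topic.get("id")] = links
    return gold


def precision(run_ids, gold_ids):
    """Fraction of returned ids that appear in the gold set (0.0 if empty)."""
    if not run_ids:
        return 0.0
    hits = sum(1 for doc_id in run_ids if doc_id in gold_ids)
    return hits / len(run_ids)


if __name__ == "__main__":
    gold = load_gold(GOLD)
    # Hypothetical run output for the topic: one hit, one miss.
    print(precision(["131130", "999999"], gold["4689264"]))
```

For reported results, the assessment and evaluation toolkit referenced in this page remains the authoritative scorer.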
The toolkits for assessment and evaluation in the
NTCIR-10 CrossLink-2 task are available at:
http://code.google.com/p/crosslink/
The test collection and data are available from NII free of charge.
Reference
- NTCIR-10 CrossLink Gold Standard (qrels, Relevance Judgment Data) and Topics are downloadable from NII/IDR at:
http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html
Please note: The above dataset package includes NTCIR-9 CrossLink data. The data for NTCIR-10 are included in the folder named 'crosslink2'.
- Schenkel, R., F. Suchanek, and G. Kasneci, "YAWN: A Semantically Annotated Wikipedia XML Corpus." In Proceedings of BTW'2007, 2007.
- Tang, L.-X., I.-S. Kang, F. Kimura, Y.-H. Lee, A. Trotman, S. Geva, et al. (2013), "Overview of the NTCIR-10 Cross-Lingual Link Discovery Task," in Proceedings of NTCIR-10, Tokyo, Japan. [poster] [paper PDF]
- Tang, L.-X., S. Geva, A. Trotman, Y. Xu, and K. Y. Itakura (2011), "Overview of the NTCIR-9 Crosslink Task: Cross-lingual Link Discovery," in Proceedings of NTCIR-9, pp. 437-463, Tokyo, Japan. [poster] [paper PDF]
- NTCIR-10 CrossLink-2 Task website
- CrossLink Toolkits
- For the use of the Relevance Judgment Data (Gold Standard, qrels), please refer to the terms of use [PDF].
License
Contact us : ntc-secretariat
Use and/or redistribution of the NTCIR-10 CrossLink-2 CEJK
Wikipedia Corpora and Topics is permitted under the conditions of the
Creative Commons Attribution-Share-Alike License 3.0 (Unported).
Details can be found at http://creativecommons.org/licenses/by-sa/3.0/.