Document data
Collection | Document type | Language | Year | # docs |
NTCIR-9 Crosslink | Wikipedia articles | Chinese | 2010 | 318,736 |
NTCIR-9 Crosslink | Wikipedia articles | Japanese | 2010 | 716,088 |
NTCIR-9 Crosslink | Wikipedia articles | Korean | 2010 | 201,596 |
Task data
Topic language | Topic set | # topics | Relevance judgments |
English | Training | 3 | One set of qrels (from Wikipedia ground truth) for the training topics |
English | Test | 25 | Two sets of qrels (from Wikipedia ground truth and manual assessment) for the test topics |
The NTCIR-9 Crosslink document data (CJK XML Corpora) and topic data are provided under the Creative Commons Attribution-Share-Alike License 3.0 (Unported).
For details, see this page:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/crosslink_documents.html
The NTCIR-9 Crosslink document data is available here:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/crosslink_documents.html
The document sets included in the NTCIR-9 CROSSLINK test collection consist of search-engine-friendly XML files created from Wikipedia MySQL database dumps taken in June 2010. The original article text, which contains Wikipedia-specific markup, was converted into XML using the YAWN system [1]. Details of the collections are given in the following table.
Language | # docs | Size | Dump date |
Chinese | 318,736 | 2.7 GB | 27/06/2010 |
Japanese | 716,088 | 6.1 GB | 24/06/2010 |
Korean | 201,596 | 1.2 GB | 28/06/2010 |
Tags
Most tags are the same as in the original Wikipedia XML dump.
Mandatory tags
Opening tag | Closing tag | Description |
<title> | </title> | Document title |
<id> | </id> | Document identifier (the first-level "id" tag) |
<link> | </link> | Link, covering both general links and language links |
<timestamp> | </timestamp> | Timestamp of the last update |
<categories> | </categories> | Contains a list of sub-categories |
Other tags
Opening tag | Closing tag | Description |
<p> | </p> | Paragraph marker |
<sec> | </sec> | Section identifier. An article often includes multiple sections. |
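The tags listed above can be read with a standard XML parser. The following is a minimal sketch, not the format specification: the sample document, its `<article>` root, the `<category>` children and the exact nesting are assumptions made for illustration, and only the tag names come from the table. The helper name `summarize` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The <article> root, the <category> children
# and the nesting are assumptions; only the tag names come from the table.
SAMPLE = """<article>
  <title>气球</title>
  <id>4424</id>
  <timestamp>2010-06-27T00:00:00Z</timestamp>
  <categories><category>玩具</category></categories>
  <sec>
    <p>Some text with a <link>general link</link>.</p>
  </sec>
</article>"""


def summarize(xml_text):
    """Return the mandatory fields of one corpus document as a dict."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title"),
        # Only the first-level <id> is the document identifier, so look
        # at direct children of the root rather than searching recursively.
        "id": root.findtext("id"),
        "timestamp": root.findtext("timestamp"),
        "categories": [c.text for c in root.findall("categories/*")],
        # Anchor text of every link, wherever it appears in the document.
        "links": ["".join(link.itertext()) for link in root.iter("link")],
        "paragraphs": len(list(root.iter("p"))),
    }


doc = summarize(SAMPLE)
print(doc["id"], doc["title"], doc["links"])
```

Reading `<id>` only at the first level matters if nested elements carry their own `id` tags, which the description of the `id` tag suggests.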
The training and test topic data (Topics) are available here:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/crosslink_documents.html
* All topic data can be downloaded together. See this page.
Training and Test Topics
Training topics
# | Title | ID |
1 | Australia | 4689264 |
2 | Femme fatale | 299098 |
3 | Martial arts | 19501 |
Only three topics are used for system training in the NTCIR-9 Crosslink task.
Test topics
A set of 25 articles will be randomly chosen from the English Wikipedia
and used as formal test topics.
Both training and test topics can be used to generate evaluation runs for
system benchmarking. Note that participants should orphan these topics by
removing all hyperlinks to and from these documents. The corresponding
topic pages in Chinese, Japanese and Korean should also be removed from
the document collections.
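The orphaning step described above can be sketched on a simplified link graph. This is only an illustration under stated assumptions: the in-memory layout (a dict mapping a document ID to the set of IDs it links to), the function names `orphan_topics` and `drop_counterparts`, the non-topic IDs "100" and "200", and the "zh-1"/"zh-2" IDs are all hypothetical. Only the three training topic IDs come from the table above. In the actual corpus the links live inside the XML files and would have to be stripped there.

```python
def orphan_topics(links, topic_ids):
    """Return a copy of `links` with every link to or from a topic removed."""
    topics = set(topic_ids)
    orphaned = {}
    for doc_id, targets in links.items():
        if doc_id in topics:
            # Drop all links going out of a topic document.
            orphaned[doc_id] = set()
        else:
            # Drop links coming into a topic document from other pages.
            orphaned[doc_id] = set(targets) - topics
    return orphaned


def drop_counterparts(collection, counterpart_ids):
    """Return a copy of a CJK collection without the topics' counterpart pages."""
    excluded = set(counterpart_ids)
    return {doc_id: doc for doc_id, doc in collection.items()
            if doc_id not in excluded}


# Training topic IDs from the table above.
TRAINING_TOPIC_IDS = ["4689264", "299098", "19501"]

# Hypothetical English link graph; "100" and "200" are made-up document IDs.
english_links = {
    "4689264": {"19501", "100"},
    "100": {"4689264", "200"},
    "200": {"100"},
}
orphaned_links = orphan_topics(english_links, TRAINING_TOPIC_IDS)

# Hypothetical Chinese collection in which "zh-1" is a topic's counterpart page.
chinese_collection = {"zh-1": "<article>...</article>",
                      "zh-2": "<article>...</article>"}
filtered_collection = drop_counterparts(chinese_collection, ["zh-1"])

print(orphaned_links)
print(sorted(filtered_collection))
```

Both functions return new objects rather than modifying their inputs, so the original collection stays available for comparison.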
<!ELEMENT crosslink-submission (details, description, collections, topic+)>
<!ATTLIST crosslink-submission participant-id CDATA #REQUIRED run-id CDATA #REQUIRED task (A2F|A2B) #REQUIRED default_lang (zh|ja|ko) #REQUIRED>
<!ELEMENT details (machine, time)>
<!ELEMENT machine (cpu, speed, cores, hyperthreads, memory)>
<!ELEMENT cpu (#PCDATA)>
<!ELEMENT speed (#PCDATA)>
<!ELEMENT cores (#PCDATA)>
<!ELEMENT hyperthreads (#PCDATA)>
<!ELEMENT memory (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT collections (collection+)>
<!ELEMENT collection (#PCDATA)>
<!ELEMENT topic (outgoing)>
<!ATTLIST topic file CDATA #REQUIRED name CDATA #REQUIRED>
<!ELEMENT outgoing (anchor+)>
<!ELEMENT anchor (tofile+)>
<!ATTLIST anchor name CDATA #REQUIRED offset CDATA #REQUIRED length CDATA #REQUIRED>
<!ELEMENT tofile (#PCDATA)>
<!ATTLIST tofile bep_offset CDATA #REQUIRED lang (zh|ja|ko) #REQUIRED title CDATA #REQUIRED>
The root element crosslink-submission carries the participant's ID, the run ID (which should include your university affiliation), the task (either A2F or A2B) and the default target language (zh, ja, or ko). The linking algorithm should be described in the description element. The collections element lists the document collections used in the run; each collection element should generally contain one of the following: Chinese Wikipedia, Japanese Wikipedia or Korean Wikipedia. Each topic goes in its own topic element, which contains, inside its outgoing element, one anchor element for each anchor text to be linked. Each anchor element gives the offset, length and name of the recommended anchor as attributes, and has one or more tofile sub-elements whose content is the target document ID. Each tofile element also specifies the language, title and best entry point (BEP) of the linked document in its lang, title and bep_offset attributes, respectively.
<crosslink-submission participant-id="QUT"
    run-id="QUT_E2C_A2B_01"
    task="A2F"
    default_lang="zh">
  <details>
    <machine>
      <cpu>Intel Celeron</cpu>
      <speed>1.06GHz</speed>
      <cores>1</cores>
      <hyperthreads>1</hyperthreads>
      <memory>128MB</memory>
    </machine>
    <time>3.04 seconds</time>
  </details>
  <description>Describe the approach here, NOT in the run-id.</description>
  <collections>
    <collection>Chinese Wikipedia</collection>
  </collections>
  <topic file="9638" name="99 Luftballons">
    <outgoing>
      <anchor offset="768" length="8" name="Balloons">
        <tofile bep_offset="637" lang="zh" title="气球">4424</tofile>
        <tofile bep_offset="238343" lang="zh" title="气球炸?">442489</tofile>
        <tofile bep_offset="23438" lang="zh" title="气球男孩事件">64424</tofile>
        <tofile bep_offset="8997" lang="zh" title="?气球之旅">14424</tofile>
        <tofile bep_offset="334" lang="zh" title="猫与气球">43224</tofile>
      </anchor>
      ...
    </outgoing>
  </topic>
</crosslink-submission>
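A run file of the shape shown above can be generated programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`. The dict layout passed to `build_submission` and the function name itself are hypothetical choices made for illustration, while the element and attribute names follow the DTD and the sample values are taken from the example above.

```python
import xml.etree.ElementTree as ET

MACHINE_FIELDS = ("cpu", "speed", "cores", "hyperthreads", "memory")


def build_submission(run):
    """Build a crosslink-submission element from a plain dict (hypothetical layout)."""
    root = ET.Element("crosslink-submission", {
        "participant-id": run["participant_id"],
        "run-id": run["run_id"],
        "task": run["task"],                  # "A2F" or "A2B"
        "default_lang": run["default_lang"],  # "zh", "ja" or "ko"
    })
    # Child order follows the DTD: details, description, collections, topic+.
    details = ET.SubElement(root, "details")
    machine = ET.SubElement(details, "machine")
    for field in MACHINE_FIELDS:
        ET.SubElement(machine, field).text = run["machine"][field]
    ET.SubElement(details, "time").text = run["time"]
    ET.SubElement(root, "description").text = run["description"]
    collections = ET.SubElement(root, "collections")
    for name in run["collections"]:
        ET.SubElement(collections, "collection").text = name
    for topic in run["topics"]:
        topic_el = ET.SubElement(
            root, "topic", {"file": topic["file"], "name": topic["name"]})
        outgoing = ET.SubElement(topic_el, "outgoing")
        for anchor in topic["anchors"]:
            anchor_el = ET.SubElement(outgoing, "anchor", {
                "offset": str(anchor["offset"]),
                "length": str(anchor["length"]),
                "name": anchor["name"],
            })
            for target in anchor["targets"]:
                tofile = ET.SubElement(anchor_el, "tofile", {
                    "bep_offset": str(target["bep_offset"]),
                    "lang": target["lang"],
                    "title": target["title"],
                })
                tofile.text = target["doc_id"]  # target document ID
    return root


run = {
    "participant_id": "QUT", "run_id": "QUT_E2C_A2B_01",
    "task": "A2F", "default_lang": "zh",
    "machine": {"cpu": "Intel Celeron", "speed": "1.06GHz", "cores": "1",
                "hyperthreads": "1", "memory": "128MB"},
    "time": "3.04 seconds",
    "description": "Describe the approach here, NOT in the run-id.",
    "collections": ["Chinese Wikipedia"],
    "topics": [{
        "file": "9638", "name": "99 Luftballons",
        "anchors": [{
            "offset": 768, "length": 8, "name": "Balloons",
            "targets": [{"bep_offset": 637, "lang": "zh",
                         "title": "气球", "doc_id": "4424"}],
        }],
    }],
}
submission = build_submission(run)
xml_text = ET.tostring(submission, encoding="unicode")
print(xml_text)
```

Building the tree with a library rather than by string concatenation leaves quoting and escaping of attribute values to the serializer, which avoids malformed quotes in the output.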
For how to obtain the Gold Standard (qrels, relevance judgment data), see here.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ltwResultsetType>
  <ltw_Topic name="Australia" id="4689264">
    <outgoingLinks>
      <outLink>131130</outLink>
      <outLink>108627</outLink>
      <outLink>7208</outLink>
      <outLink>292091</outLink>
      <outLink>1247664</outLink>
      <outLink>1213529</outLink>
      ...
      <outLink>457369</outLink>
      <outLink>479260</outLink>
    </outgoingLinks>
  </ltw_Topic>
</ltwResultsetType>
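A qrel file in the format shown above can be loaded into a simple lookup structure and compared against a ranked list of target document IDs. The sketch below is not the official evaluation toolkit and makes no claim about the measures used in the task. It computes a plain precision-at-N for illustration. The function names `load_qrels` and `precision_at_n`, the shortened qrel sample and the ranked list (including the IDs "999" and "555") are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened qrel sample in the format shown above (three of the listed links).
QREL = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ltwResultsetType>
  <ltw_Topic name="Australia" id="4689264">
    <outgoingLinks>
      <outLink>131130</outLink>
      <outLink>108627</outLink>
      <outLink>7208</outLink>
    </outgoingLinks>
  </ltw_Topic>
</ltwResultsetType>"""


def load_qrels(xml_bytes):
    """Map each topic ID to the set of target document IDs judged as links."""
    root = ET.fromstring(xml_bytes)
    return {
        topic.get("id"): {link.text.strip() for link in topic.iter("outLink")}
        for topic in root.iter("ltw_Topic")
    }


def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked document IDs that appear in `relevant`."""
    if n <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked[:n] if doc_id in relevant)
    return hits / n


# Parse from bytes so the encoding declaration in the file is honoured.
qrels = load_qrels(QREL.encode("utf-8"))

# Hypothetical ranked output of a system for the topic "Australia".
ranked = ["7208", "999", "131130", "555"]
score = precision_at_n(ranked, qrels["4689264"], 4)
print(score)
```

With this sample, two of the four ranked IDs appear in the qrels, so the printed value is 0.5.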
The toolkits for assessment and evaluation in the NTCIR-9 Crosslink task are available at:
http://code.google.com/p/crosslink/
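Everything distributed by NII is free of charge.
- The NTCIR-9 Crosslink task data (Gold Standard (qrels, relevance judgment data)) and topic data can be downloaded from NII's IDR:
http://www.nii.ac.jp/dsc/idr/ntcir/ntcir.html
Note: The above is a dataset package that also includes the NTCIR-10 Crosslink task data. The NTCIR-9 data is contained in the folder named 'crosslink1'.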
Contact: ntc-secretariat
References
- [1] Schenkel, R., Suchanek, F., & Kasneci, G. (2007). YAWN: A Semantically Annotated Wikipedia XML Corpus. Proceedings of BTW 2007.
- NTCIR-9 Crosslink task overview paper
[2] Tang, L.-X., Geva, S., Trotman, A., Xu, Y., & Itakura, K. Y. (2011). Overview of the NTCIR-9 Crosslink Task: Cross-lingual Link Discovery. Proceedings of NTCIR-9 (pp. 437-463). Tokyo, Japan.
[poster] [paper PDF]
- [3] NTCIR-9 CROSSLINK Task website
- [4] Crosslink Toolkits
- On the use of the relevance judgment data (Gold Standard, qrels)
[5] Terms of use
License
The NTCIR-9 Crosslink CJK XML Corpora and topic data are licensed for use and/or redistribution under the Creative Commons Attribution-Share-Alike License 3.0 (Unported).
For details of the license, see http://creativecommons.org/licenses/by-sa/3.0/.