|
task |
subtask |
data |
data type |
genre/task |
language |
file name |
distribution date |
number of documents/topics (size) |
year |
core |
CrossLink |
Document Data |
Web
(Wikipedia) |
C |
NTCIR-10
Chinese Wikipedia |
Jul 01, 2012 |
432,988
(3.7GB) |
Jun 11, 2012 |
E |
NTCIR-10
English Wikipedia |
Jul 01, 2012 |
3,581,772
(33GB) |
Jun 04, 2012 |
J |
NTCIR-10
Japanese Wikipedia |
Jul 01, 2012 |
937,444
(11GB) |
Jun 04, 2012 |
K |
NTCIR-10
Korean Wikipedia |
Jul 01, 2012 |
405,536
(2.7GB) |
Jun 22, 2012 |
Document Data for system training purposes |
Web
(Wikipedia) |
C |
NTCIR-9
Chinese Wikipedia |
ready to use
**a |
285,624
(1.9GB) |
Jun 27, 2010 |
J |
NTCIR-9
Japanese Wikipedia |
ready to use
**a |
716,088
(6.1GB) |
Jun 24, 2010 |
K |
NTCIR-9
Korean Wikipedia |
ready to use
**a |
201,596
(1.2GB) |
Jun 28, 2010 |
Task Data |
Cross-lingual link discovery |
CEJK |
NTCIR-10 CrossLink Task Data |
Jul 01, 2012 |
25 articles each in four (CEJK) languages |
- |
Task Data for system training purposes |
Cross-lingual link discovery |
E |
NTCIR-9 CrossLink Topics |
ready to use
**a |
Two sets of 25
articles chosen from English Wikipedia |
- |
NTCIR-9 CrossLink Relevance Judgment Data |
Jul 01, 2012 |
- |
INTENT |
Document Data |
Web |
Cs |
SogouT |
ready to use
**b |
ca.130M pages
(ca. 5TB) |
crawled and released in Nov 2008 |
SogouQ |
ready to use
**b |
- |
collected in 2008
(consistent with SogouT) |
J |
ClueWeb09 |
ready to use
**c |
ca. 67M Japanese pages
*A |
crawled during Jan and Feb 2009 |
Task Data |
Subtopic Mining |
CsEJ |
NTCIR-10 INTENT Task Data |
May 31, 2012/
Jun 13, 2012 (English, for Subtopic Mining only) |
100 Queries for each language |
- |
Document Ranking |
CsJ |
- |
Task Data for system training purposes |
Subtopic Mining |
CsJ |
NTCIR-9 INTENT Task Data |
Jul 01, 2012 |
100 Queries for each language |
- |
Document Ranking |
CsJ |
- |
1CLICK |
Task Data |
One Click Access: Main Task |
EJ |
NTCIR-10 1CLICK Task Data |
Aug 31, 2012 |
100 Queries for each language |
- |
Query Classification Subtask |
PatentMT |
C to E |
Task Data |
Patent Translation |
C |
NTCIR-10 PatentMT Test Data |
Oct 15, 2012
**e |
- |
06-07 |
E |
NTCIR-10 PatentMT Reference Data |
Oct 15, 2012
**e |
- |
06-07 |
Document Data for system training purposes |
patent full |
E |
Patent grant data published from USPTO |
Jul 01, 2012 |
- |
93-05 |
Task Data for system training purposes |
Patent Translation |
C-E |
NTCIR-9 PatentMT C-E Parallel Corpus |
Jul 01, 2012
**e |
ca. 1 million sentence pairs |
- |
NTCIR-9 PatentMT C-E Parallel Development Data |
Jul 01, 2012
**e |
2000 sentence pairs |
- |
E to J |
Task Data |
Patent Translation |
E |
NTCIR-10 PatentMT Test Data |
Oct 15, 2012 |
- |
06-07 |
J |
NTCIR-10 PatentMT Reference Data |
Oct 15, 2012 |
- |
06-07 |
Document Data for system training purposes |
patent full |
J |
Publication of unexamined patent applications |
Jul 01, 2012 |
- |
93-05 |
Task Data for system training purposes |
Patent Translation |
J-E |
NTCIR-8 PatentMT J-E Parallel Corpus |
Jul 01, 2012 |
3,186,284 sentence pairs |
- |
NTCIR-8 PatentMT J-E Parallel Development Data |
J to E |
Task Data |
Patent Translation |
J |
NTCIR-10 PatentMT Test Data |
Oct 15, 2012 |
- |
06-07 |
E |
NTCIR-10 PatentMT Reference Data |
Oct 15, 2012 |
- |
06-07 |
Document Data for system training purposes |
patent full |
E |
Patent grant data published from USPTO |
Jul 01, 2012 |
- |
93-05 |
Task Data for system training purposes |
Patent Translation |
J-E |
NTCIR-8 PatentMT J-E Parallel Corpus |
Jul 01, 2012 |
3,186,284 sentence pairs |
- |
NTCIR-8 PatentMT J-E Parallel Development Data |
RITE |
Document Data |
(to be announced) |
- |
- |
- |
- |
- |
Task Data |
Binary-class |
CsCtJ |
NTCIR-10 RITE Task Data |
Nov 14, 2012 |
- |
- |
Multi-class |
Entrance Exam |
J |
Task Data for system training purposes |
Binary-class |
CsCtJ |
NTCIR-9 RITE Task Data |
Jul 01, 2012 |
- |
- |
Multi-class |
Entrance Exam |
J |
SpokenDoc |
Document Data |
spoken documents |
J |
the Corpus of Spontaneous Japanese |
ready to use
**d |
- |
- |
Task Data |
Spoken Term Detection |
CSJ large-size task
*B |
NTCIR-10 SpokenDoc Task Data |
- |
- |
- |
moderate-size task |
Spoken Document Retrieval |
CSJ lecture retrieval task
*B |
- |
- |
- |
passage retrieval task |
Pilot |
MATH |
Document Data |
Scientific Articles |
E |
NTCIR-10 Math Retrieval Document Set |
Oct, 2012 |
100,000 docs |
- |
NTCIR-10 Math Understanding Document Set |
Oct, 2012 |
15 docs |
Document Data for system training purposes |
Scientific Articles |
E |
NTCIR-10 Math Retrieval Document Set for system training purposes |
ready to use |
10,000 docs |
NTCIR-10 Math Understanding Document Set for system training purposes |
ready to use |
10 docs |
Task Data |
Math Retrieval |
E |
NTCIR-10 Math Task Data |
Oct, 2012 |
- |
Math Understanding |
MedNLP |
Document Data |
Imaginary medical history |
J |
NTCIR-10 MedNLP train |
Dec, 2012 |
2244 sentences |
- |
NTCIR-10 MedNLP test |
Jan, 2013 |
1121 sentences |