|
task |
subtask |
data |
| data type |
genre/task |
language |
file name |
distribution date |
number of documents/topics (size) |
year |
| core |
CrossLink |
Document Data |
Web
(Wikipedia) |
C |
NTCIR-10
Chinese Wikipedia |
Jul 01, 2012 |
432,988
(3.7GB) |
Jun 11, 2012 |
| E |
NTCIR-10
English Wikipedia |
Jul 01, 2012 |
3,581,772
(33GB) |
Jun 04, 2012 |
| J |
NTCIR-10
Japanese Wikipedia |
Jul 01, 2012 |
937,444
(11GB) |
Jun 04, 2012 |
| K |
NTCIR-10
Korean Wikipedia |
Jul 01, 2012 |
405,536
(2.7GB) |
Jun 22, 2012 |
| Document Data for system training purposes |
Web
(Wikipedia) |
C |
NTCIR-9
Chinese Wikipedia |
ready to use
**a |
285,624
(1.9GB) |
Jun 27, 2010 |
| J |
NTCIR-9
Japanese Wikipedia |
ready to use
**a |
716,088
(6.1GB) |
Jun 24, 2010 |
| K |
NTCIR-9
Korean Wikipedia |
ready to use
**a |
201,596
(1.2GB) |
Jun 28, 2010 |
| Task Data |
Cross-lingual link discovery |
CEJK |
NTCIR-10 CrossLink Task Data |
Jul 01, 2012 |
25 articles each in four (CEJK) languages |
- |
| Task Data for system training purposes |
Cross-lingual link discovery |
E |
NTCIR-9 CrossLink Topics |
ready to use
**a |
Two sets of 25
articles chosen from English Wikipedia |
- |
| NTCIR-9 CrossLink Relevance Judgment Data |
Jul 01, 2012 |
- |
| INTENT |
Document Data |
Web |
Cs |
SogouT |
ready to use
**b |
ca.130M pages
(ca. 5TB) |
crawled and released in Nov 2008 |
| SogouQ |
ready to use
**b |
- |
collected in 2008
(consistent with SogouT) |
| J |
ClueWeb09 |
ready to use
**c |
ca. 67M Japanese pages
*A |
crawled during Jan and Feb 2009 |
| Task Data |
Subtopic Mining |
CsEJ |
NTCIR-10 INTENT Task Data |
May 31, 2012/
Jun 13, 2012 (English, for Subtopic Mining only) |
100 Queries for each language |
- |
| Document Ranking |
CsJ |
- |
| Task Data for system training purposes |
Subtopic Mining |
CsJ |
NTCIR-9 INTENT Task Data |
Jul 01, 2012 |
100 Queries for each language |
- |
| Document Ranking |
CsJ |
- |
| 1CLICK |
Task Data |
One Click Access: Main Task |
EJ |
NTCIR-10 1CLICK Task Data |
Aug 31, 2012 |
100 Queries for each language |
- |
| Query Classification Subtask |
| PatentMT |
C to E |
Task Data |
Patent Translation |
C |
NTCIR-10 PatentMT Test Data |
Oct 15, 2012
**e |
- |
06-07 |
| E |
NTCIR-10 PatentMT Reference Data |
Oct 15, 2012
**e |
- |
06-07 |
| Document Data for system training purposes |
patent full |
E |
Patent grant data published from USPTO |
Jul 01, 2012 |
- |
93-05 |
| Task Data for system training purposes |
Patent Translation |
C-E |
NTCIR-9 PatentMT C-E Parallel Corpus |
Jul 01, 2012
**e |
ca. 1 million sentence pairs |
- |
| NTCIR-9 PatentMT C-E Parallel Development Data |
Jul 01, 2012
**e |
2,000 sentence pairs |
- |
| E to J |
Task Data |
Patent Translation |
E |
NTCIR-10 PatentMT Test Data |
Oct 15, 2012 |
- |
06-07 |
| J |
NTCIR-10 PatentMT Reference Data |
Oct 15, 2012 |
- |
06-07 |
| Document Data for system training purposes |
patent full |
J |
Publication of unexamined patent applications |
Jul 01, 2012 |
- |
93-05 |
| Task Data for system training purposes |
Patent Translation |
J-E |
NTCIR-8 PatentMT J-E Parallel Corpus |
Jul 01, 2012 |
3,186,284 sentence pairs |
- |
| NTCIR-8 PatentMT J-E Parallel Development Data |
| J to E |
Task Data |
Patent Translation |
J |
NTCIR-10 PatentMT Test Data |
Oct 15, 2012 |
- |
06-07 |
| E |
NTCIR-10 PatentMT Reference Data |
Oct 15, 2012 |
- |
06-07 |
| Document Data for system training purposes |
patent full |
E |
Patent grant data published from USPTO |
Jul 01, 2012 |
- |
93-05 |
| Task Data for system training purposes |
Patent Translation |
J-E |
NTCIR-8 PatentMT J-E Parallel Corpus |
Jul 01, 2012 |
3,186,284 sentence pairs |
- |
| NTCIR-8 PatentMT J-E Parallel Development Data |
| RITE |
Document Data |
(to be announced) |
- |
- |
- |
- |
- |
| Task Data |
Binary-class |
CsCtJ |
NTCIR-10 RITE Task Data |
Nov 14, 2012 |
- |
- |
| Multi-class |
| Entrance Exam |
J |
| Task Data for system training purposes |
Binary-class |
CsCtJ |
NTCIR-9 RITE Task Data |
Jul 01, 2012 |
- |
- |
| Multi-class |
| Entrance Exam |
J |
| SpokenDoc |
Document Data |
spoken documents |
J |
the Corpus of Spontaneous Japanese |
ready to use
**d |
- |
- |
| Task Data |
Spoken Term Detection |
CSJ large-size task
*B |
NTCIR-10 SpokenDoc Task Data |
- |
- |
- |
| moderate-size task |
| Spoken Document Retrieval |
CSJ lecture retrieval task
*B |
- |
- |
- |
| passage retrieval task |
| Pilot |
MATH |
Document Data |
Scientific Articles |
E |
NTCIR-10 Math Retrieval Document Set |
Oct, 2012 |
100,000 docs |
- |
| NTCIR-10 Math Understanding Document Set |
Oct, 2012 |
15 docs |
| Document Data for system training purposes |
Scientific Articles |
E |
NTCIR-10 Math Retrieval Document Set for system training purposes |
ready to use |
10,000 docs |
| NTCIR-10 Math Understanding Document Set for system training purposes |
ready to use |
10 docs |
| Task Data |
Math Retrieval |
E |
NTCIR-10 Math Task Data |
Oct, 2012 |
- |
| Math Understanding |
| MedNLP |
Document Data |
Imaginary medical history |
J |
NTCIR-10 MedNLP train |
Dec, 2012 |
2,244 sentences |
- |
| NTCIR-10 MedNLP test |
Jan, 2013 |
1,121 sentences |