|
The 9th NTCIR Workshop
DATA
NTCIR-9 Test Collections: Documents
The following document collections are used for the 9th NTCIR Workshop.
They are available to participating research groups for task participation
and system evaluation within the 9th NTCIR Workshop (*1). To obtain the
data, a signed user agreement form must be submitted to the NTCIR Project Office at NII.
*1: The test collections and data available from NII are free of charge.
A nominal cost may be required for some datasets provided by other parties.
| task |  | data |  |  |  |  |  |  |
|  |  | Data Type | genre/task | language | file name | Distribution Date | number of documents/topics (size) | year |
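| core | GeoTime | Document Data | news articles | E | New York Times | ready to use **a | 315,417 *B | 02-05 |
| core | GeoTime | Document Data | news articles | E | Xinhua English | ready to use **a | 406,792 | 98-01 |
| core | GeoTime | Document Data | news articles | E | Mainichi Daily | Jan 05, 2011 | 24,878 | 98-01 |
| core | GeoTime | Document Data | news articles | E | Korea Times | Jan 05, 2011 | 50,129 | 98-01 |
| core | GeoTime | Document Data | news articles | J | Mainichi | Jan 05, 2011 | 797,700 | 98-05 |
| core | GeoTime | Document Data | news articles | K | Hankookilbo 1998-2001 | Jan 05, 2011 | 235,171 | 98-01 |
| core | GeoTime | Document Data | news articles | K | Chosenilbo | Jan 05, 2011 | 239,641 | 98-01 |
| core | GeoTime | Task Data | IR | JE | NTCIR-9 GeoTime task Data | - | - | - |
| core | GeoTime | Task Data for system training purposes | IR | JE | NTCIR-8 GeoTime task Data | Jan 05, 2011 | - | - |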
| core | INTENT | Document Data | Web | Cs | SogouT | ready to use **b | ca. 130M pages (ca. 5TB) | crawled and released in Nov 2008 |
| core | INTENT | Document Data | Web | Cs | SogouQ | ready to use **b | - | collected in 2008 (consistent with SogouT) |
| core | INTENT | Document Data | Web | J | ClueWeb09 | ready to use **c | ca. 67M Japanese pages *A | crawled during Jan and Feb 2009 |
| core | INTENT | Task Data | subtopic mining | CsJ | NTCIR-9 INTENT Task Data | Jun/Jul, 2011 | 100 Topics for each language | - |
| core | INTENT | Task Data | document ranking | CsJ | NTCIR-9 INTENT Task Data | Jun/Jul, 2011 | 100 Topics for each language | - |
| core | INTENT | Task Data | One Click Access | J | NTCIR-9 1Click Task Data | Jun/Jul, 2011 | 60 Topics | - |
| core | SpokenDoc | Document Data | spoken documents | J | the Corpus of Spontaneous Japanese | ready to use **d | - | - |
| core | SpokenDoc | Task Data | Spoken Term Detection | J | NTCIR-9 SpokenDoc Task Data | - | - | - |
| core | SpokenDoc | Task Data | Spoken Document Retrieval | J | NTCIR-9 SpokenDoc Task Data | - | - | - |
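| core | RITE | Document Data | news articles | Cs | Xinhua Chinese | ready to use **a | 604,720 | 98-05 |
| core | RITE | Document Data | news articles | Ct | UDN2002-2005 | - | 1,663,517 | 02-05 |
| core | RITE | Document Data | news articles | Ct | CIRB020 (United Daily News, Economic Daily News, Min Sheng Daily, United Evening News, Star News) | - | 249,508 | 98-99 |
| core | RITE | Document Data | news articles | Ct | CIRB040r (United Daily News, United Express, Ming Hseng News, Economic Daily News) | - | 901,446 | 00-01 |
| core | RITE | Document Data | news articles | J | Mainichi | - | 797,700 | 98-05 |
| core | RITE | Task Data | Binary classification | CtCsJ | NTCIR-9 RITE Task Data | - | - | - |
| core | RITE | Task Data | Multi-class classification | CtCsJ | NTCIR-9 RITE Task Data | - | - | - |
| core | RITE | Task Data | RITE4QA | CtCsJ | NTCIR-9 RITE Task Data | - | - | - |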
| Pilot | CrossLink | Document Data | Web (Wikipedia) | C | Chinese Wikipedia | Jan 05, 2011 | 285,624 (1.9GB) | Jun 27, 2010 |
| Pilot | CrossLink | Document Data | Web (Wikipedia) | J | Japanese Wikipedia | Jan 05, 2011 | 716,088 (6.1GB) | Jun 24, 2010 |
| Pilot | CrossLink | Document Data | Web (Wikipedia) | K | Korean Wikipedia | Jan 05, 2011 | 201,596 (1.2GB) | Jun 28, 2010 |
| Pilot | CrossLink | Task Data | Cross-lingual link discovery | E | NTCIR-9 CrossLink Task data | Jan 05, 2011 | Two sets of 25 articles chosen from English Wikipedia | - |
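| Pilot | Vis-EX | Document Data | news articles | E | Xinhua English | ready to use **a | 409,792 | 98-01 |
| Pilot | Vis-EX | Document Data | news articles | J | Mainichi | Jan 05, 2011 | 419,759 (535MB) | 98-01 |
| Pilot | Vis-EX | Task Data | IE/analysis | EJ | NTCIR-9 Vis-EX Dataset | - | - | - |
| Pilot | Vis-EX | Task Data | IE/analysis | EJ | NTCIR-7 ACLIA IR/QA data | Jan 05, 2011 | - | - |
| Pilot | Vis-EX | Task Data | IE/analysis | EJ | NTCIR-7 MuST Dataset | Jan 05, 2011 | 701 (2.9MB) | 98-01 |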
| PatentMT | C to E | Task Data | Patent Translation | C | NTCIR-9 PatentMT Test Data | May 9, 2011 **e | - | 06-07 |
| PatentMT | C to E | Task Data | Patent Translation | E | NTCIR-9 PatentMT Reference Data | - **e | - | 06-07 |
| PatentMT | C to E | Document Data for system training purposes | patent full | E | Patent grant data published from USPTO | Jan 05, 2011 | - | 93-05 |
| PatentMT | C to E | Task Data for system training purposes | Patent Translation | C-E | NTCIR-9 PatentMT E-C Patent Parallel Corpus | Jan 05, 2011 **e | - | - |
| PatentMT | C to E | Task Data for system training purposes | Patent Translation | C-E | NTCIR-9 PatentMT E-C Parallel Development data | Jan 05, 2011 **e | - | - |
| PatentMT | E to J | Task Data | Patent Translation | E | NTCIR-9 PatentMT Test Data | May 9, 2011 | - | 06-07 |
| PatentMT | E to J | Task Data | Patent Translation | J | NTCIR-9 PatentMT Reference Data | - | - | 06-07 |
| PatentMT | E to J | Document Data for system training purposes | patent full | J | Publication of unexamined patent applications | Jan 05, 2011 | - | 93-05 |
| PatentMT | E to J | Task Data for system training purposes | Patent Translation | J-E | NTCIR-9 PatentMT J-E Parallel Corpus | Jan 05, 2011 | - | - |
| PatentMT | E to J | Task Data for system training purposes | Patent Translation | J-E | NTCIR-9 PatentMT J-E Parallel Development Data | Jan 05, 2011 | - | - |
| PatentMT | J to E | Task Data | Patent Translation | J | NTCIR-9 PatentMT Test Data | May 5, 2011 | - | 06-07 |
| PatentMT | J to E | Task Data | Patent Translation | E | NTCIR-9 PatentMT Reference Data | - | - | 06-07 |
| PatentMT | J to E | Document Data for system training purposes | patent full | E | Patent grant data published from USPTO | Jan 05, 2011 | - | 93-05 |
| PatentMT | J to E | Task Data for system training purposes | Patent Translation | J-E | NTCIR-9 PatentMT J-E Parallel Corpus | Jan 05, 2011 | - | - |
| PatentMT | J to E | Task Data for system training purposes | Patent Translation | J-E | NTCIR-9 PatentMT J-E Parallel Development Data | Jan 05, 2011 | - | - |
*A: The full ClueWeb09 collection consists of roughly 1 billion web pages.
*B: There are fewer documents from Feb. 2003 to May 2004, and there is no data from June 2004.
1: For details of the task data (topics and relevance judgments, questions
and answers, summaries, etc.), please visit the web page of each task.
2: For data marked with **, the procedure for obtaining the data is specified below.
**a: The data will be delivered by LDC to Workshop participants who submit
an additional user agreement form to LDC.
**b: The data will be distributed by Sogou Labs for research purposes only.
License information is available at:
http://www.sogou.com/labs/dl/license.html (in Chinese).
**c: The data will be distributed by Carnegie Mellon University for research purposes only. A license agreement can be found at:
http://boston.lti.cs.cmu.edu/Data/clueweb09/.
**d: The data can be provided by the National Institute for Japanese Language
for research purposes, for a fee. A license agreement can be found at:
http://www.kokken.go.jp/katsudo/seika/corpus/releaseinfo/020/ (in Japanese)
**e: The data will be delivered by The Hong Kong Institute of Education (HKIED) to Workshop participants who submit an additional user agreement form
to HKIED.
http://research.nii.ac.jp/ntcir/ntcir-9/ntcir9cepc-patmt.html
3: Please note that the document collections shall be used only for the purpose
of accomplishing the tasks set out in the NTCIR Workshop and for the purpose
of research related to those tasks. The documents cannot be used for "information
purpose".
|