NTCIR (NII Test Collection for IR Systems) Project


The 9th NTCIR Workshop

DATA
mJapanesen

NTCIR-9 Test Collections: Documents

The following document collections are used for the 9th NTCIR Workshop. They are available to participating research groups for task participation and system evaluation within the 9th NTCIR Workshop (*1). To obtain the data, signed user agreement forms must be submitted to the NTCIR Project Office at the NII.

*1: The test collections and data available from NII are free of charge. A nominal cost may be required for some datasets provided by other parties.


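| Task | Data type | Genre / task | Language | File name | Distribution date | Number of documents / topics (size) | Year |
| GeoTime (core) | Document Data | news articles | E | New York Times | ready to use **a | 315,417 *B | 02-05 |
| GeoTime (core) | Document Data | news articles | E | Xinhua English | ready to use **a | 406,792 | 98-01 |
| GeoTime (core) | Document Data | news articles | E | Mainichi Daily | Jan 05, 2011 | 24,878 | 98-01 |
| GeoTime (core) | Document Data | news articles | E | Korea Times | Jan 05, 2011 | 50,129 | 98-01 |
| GeoTime (core) | Document Data | news articles | J | Mainichi | Jan 05, 2011 | 797,700 | 98-05 |
| GeoTime (core) | Document Data | news articles | K | Hankookilbo 1998-2001 | Jan 05, 2011 | 235,171 | 98-01 |
| GeoTime (core) | Document Data | news articles | K | Chosenilbo | Jan 05, 2011 | 239,641 | 98-01 |
| GeoTime (core) | Task Data | IR | JE | NTCIR-9 GeoTime task Data | - | - | - |
| GeoTime (core) | Task Data for system training purposes | IR | JE | NTCIR-8 GeoTime task Data | Jan 05, 2011 | - | - |
| INTENT | Document Data | Web | Cs | SogouT | ready to use **b | ca. 130M pages (ca. 5TB) | crawled and released on Nov 2008 |
| INTENT | Document Data | Web | Cs | SogouQ | ready to use **b | - | collected in 2008 (consistent with SogouT) |
| INTENT | Document Data | Web | J | ClueWeb09 | ready to use **c | ca. 67M Japanese pages *A | crawled during Jan and Feb 2009 |
| INTENT | Task Data | subtopic mining | CsJ | NTCIR-9 INTENT Task Data | Jun/Jul, 2011 | 100 Topics for each language | - |
| INTENT | Task Data | document ranking | CsJ | NTCIR-9 INTENT Task Data | Jun/Jul, 2011 | 100 Topics for each language | - |
| INTENT | Task Data | One Click Access | J | NTCIR-9 1Click Task Data | Jun/Jul, 2011 | 60 Topics | - |
| SpokenDoc | Document Data | spoken documents | J | the Corpus of Spontaneous Japanese | ready to use **d | - | - |
| SpokenDoc | Task Data | Spoken Term Detection | J | NTCIR-9 SpokenDoc Task Data | - | - | - |
| SpokenDoc | Task Data | Spoken Document Retrieval | J | NTCIR-9 SpokenDoc Task Data | - | - | - |
| RITE | Document Data | news articles | Cs | Xinhua Chinese | ready to use **a | 604,720 | 98-05 |
| RITE | Document Data | news articles | Ct | UDN2002-2005 | - | 1,663,517 | 02-05 |
| RITE | Document Data | news articles | Ct | CIRB020 (United Daily News, Economic Daily News, Min Sheng Daily, United Evening News, Star News) | - | 249,508 | 98-99 |
| RITE | Document Data | news articles | Ct | CIRB040r (United Daily News, United Express, Ming Hseng News, Economic Daily News) | - | 901,446 | 00-01 |
| RITE | Document Data | news articles | J | Mainichi | - | 797,700 | 98-05 |
| RITE | Task Data | Binary classification | CtCsJ | NTCIR-9 RITE Task Data | - | - | - |
| RITE | Task Data | Multi-class classification | CtCsJ | NTCIR-9 RITE Task Data | - | - | - |
| RITE | Task Data | RITE4QA | CtCsJ | NTCIR-9 RITE Task Data | - | - | - |
| CrossLink (pilot) | Document Data | Web (Wikipedia) | C | Chinese Wikipedia | Jan 05, 2011 | 285,624 (1.9GB) | Jun 27, 2010 |
| CrossLink (pilot) | Document Data | Web (Wikipedia) | J | Japanese Wikipedia | Jan 05, 2011 | 716,088 (6.1GB) | Jun 24, 2010 |
| CrossLink (pilot) | Document Data | Web (Wikipedia) | K | Korean Wikipedia | Jan 05, 2011 | 201,596 (1.2GB) | Jun 28, 2010 |
| CrossLink (pilot) | Task Data | Cross-lingual link discovery | E | NTCIR-9 CrossLink Task data | Jan 05, 2011 | Two sets of 25 articles chosen from English Wikipedia | - |
| Vis-EX | Document Data | news articles | E | Xinhua English | ready to use **a | 409,792 | 98-01 |
| Vis-EX | Document Data | news articles | J | Mainichi | Jan 05, 2011 | 419,759 (535MB) | 98-01 |
| Vis-EX | Task Data | IE/analysis | EJ | NTCIR-9 Vis-EX Dataset | - | - | - |
| Vis-EX | Task Data | IE/analysis | EJ | NTCIR-7 ACLIA IR/QA data | Jan 05, 2011 | - | - |
| Vis-EX | Task Data | IE/analysis | EJ | NTCIR-7 MuST Dataset | Jan 05, 2011 | 701 (2.9MB) | 98-01 |
| PatentMT (C to E) | Task Data | Patent Translation | C | NTCIR-9 PatentMT Test Data | May 9, 2011 **e | - | 06-07 |
| PatentMT (C to E) | Task Data | Patent Translation | E | NTCIR-9 PatentMT Reference Data | - **e | - | 06-07 |
| PatentMT (C to E) | Document Data for system training purposes | patent full | E | Patent grant data published from USPTO | Jan 05, 2011 | - | 93-05 |
| PatentMT (C to E) | Task Data for system training purposes | Patent Translation | C-E | NTCIR-9 PatentMT E-C Patent Parallel Corpus | Jan 05, 2011 **e | - | - |
| PatentMT (C to E) | Task Data for system training purposes | Patent Translation | C-E | NTCIR-9 PatentMT E-C Parallel Development data | Jan 05, 2011 **e | - | - |
| PatentMT (E to J) | Task Data | Patent Translation | E | NTCIR-9 PatentMT Test Data | May 9, 2011 | - | 06-07 |
| PatentMT (E to J) | Task Data | Patent Translation | J | NTCIR-9 PatentMT Reference Data | - | - | 06-07 |
| PatentMT (E to J) | Document Data for system training purposes | patent full | J | Publication of unexamined patent applications | Jan 05, 2011 | - | 93-05 |
| PatentMT (E to J) | Task Data for system training purposes | Patent Translation | J-E | NTCIR-9 PatentMT J-E Parallel Corpus | Jan 05, 2011 | - | - |
| PatentMT (E to J) | Task Data for system training purposes | Patent Translation | J-E | NTCIR-9 PatentMT J-E Parallel Development Data | Jan 05, 2011 | - | - |
| PatentMT (J to E) | Task Data | Patent Translation | J | NTCIR-9 PatentMT Test Data | May 5, 2011 | - | 06-07 |
| PatentMT (J to E) | Task Data | Patent Translation | E | NTCIR-9 PatentMT Reference Data | - | - | 06-07 |
| PatentMT (J to E) | Document Data for system training purposes | patent full | E | Patent grant data published from USPTO | Jan 05, 2011 | - | 93-05 |
| PatentMT (J to E) | Task Data for system training purposes | Patent Translation | J-E | NTCIR-9 PatentMT J-E Parallel Corpus | Jan 05, 2011 | - | - |
| PatentMT (J to E) | Task Data for system training purposes | Patent Translation | J-E | NTCIR-9 PatentMT J-E Parallel Development Data | Jan 05, 2011 | - | - |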

*A: The full ClueWeb09 collection consists of roughly 1 billion web pages.
*B: There are fewer documents from Feb. 2003 to May 2004, and there is no data from June 2004.

1: For the details of the task data (topics and relevance judgments, questions and answers, summaries, etc), please visit the webpages of each task.

2: For data marked with **, the procedure for obtaining the data is specified as follows.
**a: The data will be delivered by LDC to Workshop participants who submit an additional user agreement form to LDC.

**b: The data will be distributed by Sogou Labs for research purposes only. License information is available at:
http://www.sogou.com/labs/dl/license.html (in Chinese).

**c: The data will be distributed by Carnegie Mellon University for research purposes only. A license agreement can be found at:
http://boston.lti.cs.cmu.edu/Data/clueweb09/.

**d: The data can be provided by the National Institute for Japanese Language for research purposes, for a charge. A license agreement can be found at: http://www.kokken.go.jp/katsudo/seika/corpus/releaseinfo/020/ (in Japanese)

**e: The data will be delivered by The Hong Kong Institute of Education (HKIED) to Workshop participants who submit an additional user agreement form to HKIED.
http://research.nii.ac.jp/ntcir/ntcir-9/ntcir9cepc-patmt.html

3: Please note that the document collections shall be used only for accomplishing the tasks set out in the NTCIR Workshop and for research related to those tasks. The documents cannot be used for "information purposes".






Last Modified: 2011.06.27