NTCIR (NII Testbeds and Community for Information access Research) Project

The 10th NTCIR Workshop

DATA
NTCIR-10 Test Collections: Documents

The following document collections are used for the 10th NTCIR Workshop. They are available to participating research groups for task participation and system evaluation within the 10th NTCIR Workshop (*1). To obtain the data, signed user agreement forms must be submitted to the NTCIR Project Office at the NII.

*1: The test collections and data available from NII are free of charge. A nominal fee may be charged for some datasets provided by other parties.

|CrossLink|INTENT|1CLICK |PatentMT|RITE|SpokenDoc|MATH|MedNLP|

core

| task | subtask | data type | genre/task | language | file name | distribution date | number of documents/topics (size) | year |
| CrossLink | | Document Data | Web (Wikipedia) | C | NTCIR-10 Chinese Wikipedia | Jul 01, 2012 | 432,988 (3.7GB) | Jun 11, 2012 |
| | | | | E | NTCIR-10 English Wikipedia | Jul 01, 2012 | 3,581,772 (33GB) | Jun 04, 2012 |
| | | | | J | NTCIR-10 Japanese Wikipedia | Jul 01, 2012 | 937,444 (11GB) | Jun 04, 2012 |
| | | | | K | NTCIR-10 Korean Wikipedia | Jul 01, 2012 | 405,536 (2.7GB) | Jun 22, 2012 |
| | | Document Data for system training purposes | Web (Wikipedia) | C | NTCIR-9 Chinese Wikipedia | ready to use **a | 285,624 (1.9GB) | Jun 27, 2010 |
| | | | | J | NTCIR-9 Japanese Wikipedia | ready to use **a | 716,088 (6.1GB) | Jun 24, 2010 |
| | | | | K | NTCIR-9 Korean Wikipedia | ready to use **a | 201,596 (1.2GB) | Jun 28, 2010 |
| | | Task Data | Cross-lingual link discovery | CEJK | NTCIR-10 CrossLink Task Data | Jul 01, 2012 | 25 articles each in four (CEJK) languages | - |
| | | Task Data for system training purposes | Cross-lingual link discovery | E | NTCIR-9 CrossLink Topics | ready to use **a | Two sets of 25 articles chosen from English Wikipedia | - |
| | | | | | NTCIR-9 CrossLink Relevance Judgment Data | Jul 01, 2012 | | - |
| INTENT | | Document Data | Web | Cs | SogouT | ready to use **b | ca. 130M pages (ca. 5TB) | crawled and released in Nov 2008 |
| | | | | | SogouQ | ready to use **b | - | collected in 2008 (consistent with SogouT) |
| | | | | J | ClueWeb09 | ready to use **c | ca. 67M Japanese pages *A | crawled during Jan and Feb 2009 |
| | | Task Data | Subtopic Mining | CsEJ | NTCIR-10 INTENT Task Data | May 31, 2012 / Jun 13, 2012 (English, for Subtopic Mining only) | 100 Queries for each language | - |
| | | | Document Ranking | CsJ | | | | - |
| | | Task Data for system training purposes | Subtopic Mining | CsJ | NTCIR-9 INTENT Task Data | Jul 01, 2012 | 100 Queries for each language | - |
| | | | Document Ranking | CsJ | | | | - |
| 1CLICK | | Task Data | One Click Access: Main Task | EJ | NTCIR-10 1CLICK Task Data | Aug 31, 2012 | 100 Queries for each language | - |
| | | | Query Classification Subtask | | | | | |
| PatentMT | C to E | Task Data | Patent Translation | C | NTCIR-10 PatentMT Test Data | Oct 15, 2012 **e | - | 06-07 |
| | | | | E | NTCIR-10 PatentMT Reference Data | Oct 15, 2012 **e | - | 06-07 |
| | | Document Data for system training purposes | patent full | E | Patent grant data published from USPTO | Jul 01, 2012 | - | 93-05 |
| | | Task Data for system training purposes | Patent Translation | C-E | NTCIR-9 PatentMT C-E Parallel Corpus | Jul 01, 2012 **e | ca. 1 million sentence pairs | - |
| | | | | | NTCIR-9 PatentMT C-E Parallel Development Data | Jul 01, 2012 **e | 2000 sentence pairs | - |
| | E to J | Task Data | Patent Translation | E | NTCIR-10 PatentMT Test Data | Oct 15, 2012 | - | 06-07 |
| | | | | J | NTCIR-10 PatentMT Reference Data | Oct 15, 2012 | - | 06-07 |
| | | Document Data for system training purposes | patent full | J | Publication of unexamined patent applications | Jul 01, 2012 | - | 93-05 |
| | | Task Data for system training purposes | Patent Translation | J-E | NTCIR-8 PatentMT J-E Parallel Corpus | Jul 01, 2012 | 3,186,284 sentence pairs | - |
| | | | | | NTCIR-8 PatentMT J-E Parallel Development Data | | | |
| | J to E | Task Data | Patent Translation | J | NTCIR-10 PatentMT Test Data | Oct 15, 2012 | - | 06-07 |
| | | | | E | NTCIR-10 PatentMT Reference Data | Oct 15, 2012 | - | 06-07 |
| | | Document Data for system training purposes | patent full | E | Patent grant data published from USPTO | Jul 01, 2012 | - | 93-05 |
| | | Task Data for system training purposes | Patent Translation | J-E | NTCIR-8 PatentMT J-E Parallel Corpus | Jul 01, 2012 | 3,186,284 sentence pairs | - |
| | | | | | NTCIR-8 PatentMT J-E Parallel Development Data | | | |
| RITE | | Document Data | (to be announced) | - | - | - | - | - |
| | | Task Data | Binary-class | CsCtJ | NTCIR-10 RITE Task Data | Nov 14, 2012 | - | - |
| | | | Multi-class | | | | | |
| | | | Entrance Exam | J | | | | |
| | | Task Data for system training purposes | Binary-class | CsCtJ | NTCIR-9 RITE Task Data | Jul 01, 2012 | - | - |
| | | | Multi-class | | | | | |
| | | | Entrance Exam | J | | | | |
| SpokenDoc | | Document Data | spoken documents | J | the Corpus of Spontaneous Japanese | ready to use **d | - | - |
| | | Task Data | Spoken Term Detection | CSJ large-size task *B | NTCIR-10 SpokenDoc Task Data | - | - | - |
| | | | | moderate-size task | | | | |
| | | | Spoken Document Retrieval | CSJ lecture retrieval task *B | | - | - | - |
| | | | | passage retrieval task | | | | |

Pilot

| task | subtask | data type | genre/task | language | file name | distribution date | number of documents/topics (size) | year |
| MATH | | Document Data | Scientific Articles | E | NTCIR-10 Math Retrieval Document Set | Oct, 2012 | 100,000 docs | - |
| | | | | | NTCIR-10 Math Understanding Document Set | Oct, 2012 | 15 docs | |
| | | Document Data for system training purposes | Scientific Articles | E | NTCIR-10 Math Retrieval Document Set for system training purposes | ready to use | 10,000 docs | |
| | | | | | NTCIR-10 Math Understanding Document Set for system training purposes | ready to use | 10 docs | |
| | | Task Data | Math Retrieval | E | NTCIR-10 Math Task Data | Oct, 2012 | - | |
| | | | Math Understanding | | | | | |
| MedNLP | | Document Data | Imaginary medical history | J | NTCIR-10 MedNLP train | Dec, 2012 | 2244 sentences | - |
| | | | | | NTCIR-10 MedNLP test | Jan, 2013 | 1121 sentences | |

*A: The full ClueWeb09 collection consists of roughly 1 billion web pages.

*B: Participation in this subtask requires possession of "The Corpus of Spontaneous Japanese (CSJ)" released by the National Institute for Japanese Language.

1: For details of the task data (topics and relevance judgments, questions and answers, summaries, etc.), please visit the webpage of each task.

2: For data marked with **, the procedure for obtaining the data is specified below.
**a: NTCIR-9 CrossLink Document Collections and Topics are distributed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. For more details, please visit this page:
http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/crosslink_documents.html 

**b: The data will be distributed by Sogou Labs for research purposes only. License information is available on this page:
http://www.sogou.com/labs/dl/license.html (in Chinese).

**c: The data will be distributed by Carnegie Mellon University for research purposes only. A license agreement can be found on this page:
http://boston.lti.cs.cmu.edu/Data/clueweb09/.

**d: The data can be provided by the National Institute for Japanese Language for research purposes for a fee. A license agreement can be found on this page: http://www.kokken.go.jp/katsudo/seika/corpus/releaseinfo/020/ (in Japanese)

**e: The data will be delivered by The Hong Kong Institute of Education (HKIED) to Workshop participants who submit an additional user agreement form to HKIED. For more details, please visit this page:
http://research.nii.ac.jp/ntcir/ntcir-10/ntcir10cepc-patentmt.html

3: Please note that the document collections shall be used for the purpose of accomplishing the tasks set out in the NTCIR Workshop and for research related to those tasks. The documents cannot be used for "information purposes".



Last Modified: 2013.08.22