NTCIR-11 Workshop

NTCIR-11 Test Collections: data sets for NTCIR-11 Workshop Participants


The following data sets are used in the NTCIR-11 Workshop. They are available to participating research groups for task participation and system evaluation within the NTCIR-11 Workshop (*1). To obtain the data, submit the signed user agreement forms to the NTCIR Project Office at NII.

*1: The test collections and data available from NII are free of charge. A nominal fee may be required for some datasets provided by other parties.

task subtask data
data type genre/task language file name distribution date number of documents/topics (size) year
core   IMine Document Data Web Cs SogouT ready to use **a ca. 130M pages (ca. 5TB) crawled and released in Nov 2008
SogouQ ready to use **a About 4GB collected in 2008/2011
E ClueWeb12-B13 ready to use
**b

crawled during 2012
Task Data  Subtopic Mining CsEJ NTCIR-11 IMine Task Data Topics and non-diversified baseline DR runs released: Jan, 2014 50 Queries for each language -
Document Ranking CsE
Search Task Mining  NTCIR-11 IMine TaskMine Task Data  Topics released: Mar, 2014  50 Queries for each language 
Task Data for system training purposes Subtopic Mining CsEJ NTCIR-9/10 INTENT Task Data   Jan 31, 2014 100 Queries for each language  -
Document Ranking
MATH   Document Data  Scientific Articles  E NTCIR-11 Math Retrieval Document Data (Full dataset) Apr 15, 2014 100,000 docs  2013   
Task Data  Math Retrieval   E NTCIR-11 Math Task Data (Topic)  Jun 2, 2014  50 Topics  -
Task Data for system training purposes  Math Retrieval  E NTCIR-11 Math Task Data  (Initial dataset) Mar 10, 2014  Several Topics
MedNLP   Document Data  Health Record J Training data: mednlp-2-train.txt Mar 10, 2014 100 documents 2013 
Test data for task 1 (NER): test.xml July 11, 2014 49 documents 2014
Test data for task 2 (Normalization/Coding) July 25, 2014
MobileClick Documents and Queries iUnit Retrieval Subtask Information Retrieval E i: NTCIR-11 MobileClick document sets (English) Mar, 2014 60 queries 2014
ii: NTCIR-11 MobileClick query sets (English)
J iii: NTCIR-11 MobileClick document sets (Japanese)
iv: NTCIR-11 MobileClick query sets (Japanese)
Documents, Queries, and iUnits  iUnit Summarization Subtask Summarization E i, ii and NTCIR-11 MobileClick iUnit sets (English)
J iii, iv and NTCIR-11 MobileClick iUnit sets (Japanese)
RITE-VAL      Document Data Fact Validation JA J Wikipedia Apr 30, 2014 1.4GB 2011-2012
Textbooks Apr 30, 2014 (Textbooks1)
July 17, 2014 (Textbooks2)
1MB 2011-2012
Task Data for System Training Purposes  NTCIR-10 RITE2 ExamSearch Task Data JA Apr 30, 2014 1,000 sentences  2011-2012
Task Data NTCIR-11 RITE-VAL Fact Validation Task Data JA Aug 4 / July 25, 2014 1,000 sentences  2013-2014
Document Data  EN E Wikipedia Apr 30, 2014 18GB  2013-2014
Task Data for System Training Purposes  NTCIR-10 RITE2 ExamSearch Task Data EN Apr 30, 2014 600 sentences 2011-2012
Task Data NTCIR-11 RITE-VAL Fact Validation Task Data EN Aug 4 / July 25, 2014 600 sentences 2013-2014
Document Data  CS  Cs Wikipedia Apr 30, 2014 1GB 2014
Task Data for System Training Purposes  NTCIR-11 RITE-VAL Fact Validation CS Training data Apr 30, 2014 45KB 2014
Task Data NTCIR-11 RITE-VAL Fact Validation CS Test data July 25, 2014    2014
Document Data  CT Ct Wikipedia Apr 30, 2014 1GB  2014
Task Data for System Training Purposes  NTCIR-11 RITE-VAL Fact Validation CT Training data Apr 30, 2014 50KB 2014
Task Data NTCIR-11 RITE-VAL Fact Validation CT Test data July 25, 2014   2014
Task Data for System Training Purposes  System Validation JA J NTCIR-10 RITE2 BC, MC, ExamBC, UnitTest Task Data JA Apr 30, 2014 3,788 sentence pairs 2011-2012
Task Data  NTCIR-11 RITE-VAL System Validation Task Data JA Aug 4 / July 25, 2014 100,000 sentence pairs  2013-2014
Task Data for System Training Purposes  CS Cs NTCIR-10 RITE2 BC, MC Task Data CS Apr 30, 2014 7,594 sentence pairs 2011-2012
Task Data NTCIR-11 RITE-VAL Fact Validation, CS Test Data July 25, 2014   2014
Task Data for System Training Purposes  CT Ct NTCIR-10 RITE2 BC, MC Task Data CT Apr 30, 2014 7,594 sentence pairs  2011-2012
Task Data NTCIR-11 RITE-VAL Fact Validation, CT Test Data July 25, 2014   2014
SpokenQuery&Doc  Document Data Spokenquery&SpokenDocument retrieval documents J NTCIR-10 SpokenDoc documents ready to use
114 lectures; total 32 hours (2280 slides) From 2007 To 2013
NTCIR-11 Spokenquery&SpokenDocument retrieval documents Dec, 2013 114 lectures; total 32 hours (2280 slides)
Document Data for system training purposes  Spokenquery&SpokenDocument retrieval documents  NTCIR-10 SpokenDoc documents  ready to use  114 lectures; total 32 hours (2280 slides) From 2007 To 2013 
NTCIR-11 Spokenquery&SpokenDocument retrieval documents  Dec, 2013 114 lectures; total 32 hours (2280 slides)
Task Data  SQ-SCR task  Document retrieval    After Mar. 2014 (during formal run) fewer than 120 topics
SQ-STD subtask Term retrieval
STD-SCR subtask   Document retrieval
Task Data for system training purposes  SQ-SCR task   Document retrieval J   After Jan. 2014 (at dry run)
SQ-STD subtask Term retrieval
STD-SCR subtask   Document retrieval
Pilot     QA Lab Task Data   English Subtask E Center Shiken Exam Data (world_history_B)*:
a. questions: Center Shiken Exam Data Set 1
b. answers: Center Shiken Exam Data Set 2
* Translation of Japanese Subtask Center Shiken Exam Data Set.
ready to use Topic:
36(2007),41(2003) 
2003,2007
Second-stage University Entrance Exam Data*:
a. questions: Second-stage University Entrance Exam Data Set 1
b. answers: Second-stage University Entrance Exam Data Set 2
* Translation of Japanese Subtask Center Shiken Exam Data Set.
ready to use*
* To be announced: Second-stage University Entrance Exam Data Set2  
To be announced 2007
 Task Data for system training purposes Center Shiken Exam Data (world_history_B)*:
a. Sample Questions
* Translation of Japanese Subtask Center Shiken Exam Data Set.
ready to use Topic:
40(1997),41(2001),
36(2005),36(2009) 
1997,2001,2005,2009
Second-stage University Entrance Exam Data*:
a. Sample Questions 
* Translation of Japanese Subtask Center Shiken Exam Data Set.
ready to use To be announced  2005,2009
Document Data Japanese Subtask J Wikipedia Corpus:
a. Wikipedia Data: Wikipedia Indri indexed Dataset1
b. Indexed Data: Wikipedia Indri indexed Dataset2,3
ready to use:
Open Access
 
1.17 GB -
Japanese Textbook Corpus1 - World History Subset:
a. Textbook Data: Tokyo Shoseki Textbook Data Set 0
b. annotations: Tokyo Shoseki Textbook Data Set 1
c. Indexed Data: Tokyo Shoseki Textbook Data Set 2, 3
ready to use 570 KB  2007,2008
Japanese Textbook Corpus2 - World History Subset:
a. Textbook Data: Yamakawa Shuppansha Textbook Data Set 0
b. annotations: Yamakawa Shuppansha Textbook Data Set 1
c. Indexed Data: Yamakawa Shuppansha Textbook Data Set 2, 3
ready to use*:
* To be announced: Yamakawa Shuppansha Textbook Data Set 0
  2010
Task Data   Center Shiken Exam Data (world_history_B):
a. questions: Center Shiken Exam Data Set 1
b. answers: Center Shiken Exam Data Set 2
ready to use Topic:
36(2007),41(2003)
2003,2007
Second-stage University Entrance Exam Data:
a. questions: Second-stage University Entrance Exam Data Set 1 (questions)
b. answers: Second-stage University Entrance Exam Data Set 2 (answers) 
ready to use:
 * To be announced: Second-stage University Entrance Exam Data Set2
 To be announced 2007
Task Data for system training purposes  Center Shiken Exam Data (world_history_B):
a. Sample Questions
ready to use  Topic:
40(1997),41(2001),
36(2005),36(2009) 
1997,2001,2005,2009
Second-stage University Entrance Exam Data:
a. Sample Questions
ready to use   To be announced 2005,2009
Temporalia    Document Data Web (News)  E  LivingKnowledge news and blogs annotated subcollection ready to use
**c
ca. 3.8M docs (ca. 20GB) 2011-2013
Task Data  TQIC Subtask / Classification NTCIR-11 Temporalia Task Data  May 9, 2014 300 queries 2014
TIR Subtask / Retrieval 50 Topics
Task Data for system training purposes  TQIC Subtask / Classification NTCIR-11 Temporalia Task Data  Jan 25, 2014 100 queries
TIR Subtask / Retrieval 15 Topics
RecipeSearch Document Data Cooking Recipe E Yummly Recipe Data **f ready to use **g recipe information for 100,000 recipes (33,605,459 bytes) -
Cooking Recipe J Rakuten Recipe **d ready to use **e recipe information for 440,000 recipes (158,321,432 bytes) -
Task Data * To be announced soon. *  E
J      
  • **a: The data will be distributed by Sogou Labs for research purposes only. License information is available at:
    http://www.sogou.com/labs/dl/license_en.html .
  • **b: The data will be distributed by Carnegie Mellon University for research purposes only. A license agreement can be found at:
    http://lemurproject.org/clueweb12/.
  • **c: The data will be distributed by Internet Memory for research purposes only. Please visit the Temporalia website to learn how to obtain a copy of the document collection.
  • **d: URL: http://rit.rakuten.co.jp/opendata.html .
  • **e: The data will be available for download from NII and ALAGIN for research purposes only. License information is available at:
    http://rit.rakuten.co.jp/rdr_terms.html .
  • **f: URL: http://labs.yummly.com/data/ntcir-11-data/ .
  • **g: The data will be available for download from Yummly for academic research purposes only. License information is available at:
    http://labs.yummly.com/data/agreement.pdf .
  • 1: For details of the task data (topics and relevance judgments, questions and answers, summaries, etc.), please visit the web page of each task.
  • 2: For data marked with **, the procedure for obtaining the data is given in the corresponding note above.
  • 3: Please note that the document collections shall be used for the purpose of accomplishing the tasks set out in the NTCIR Workshop and for research related to those tasks. The documents cannot be used for "information purpose".

Last Modified: 2014-07-28