NTCIR-12

NTCIR-12 Test Collections: data sets for NTCIR-12 Task Participants


The following data sets are used for NTCIR-12. They are available to participating research groups for task participation and system evaluation within NTCIR-12 (*1). To obtain the data, submit the signed user agreement forms to the NTCIR Project Office at NII.
*The test collections and data available from NII are free of charge. A nominal fee may be required for some data sets provided by other parties.

  • 1: For details of the task data (topics and relevance judgments, questions and answers, summaries, etc.), please visit the web page of each task.
  • 2: For data marked with **, the procedure for obtaining the data is specified in the notes following the table.
  • 3: Please note that the document collections shall be used for the purpose of accomplishing the tasks set out in the NTCIR Workshop and for research related to those tasks. The documents cannot be used for "information purposes".



  task | subtask | data type | genre/task | language | file name | number of documents/topics (size) | distribution date | year
  (language codes: C: Chinese; Cs: Simplified Chinese; Ct: Traditional Chinese; E: English; J: Japanese)
core IMine Document Data Web Cs SogouT **a ca. 130M Chinese pages (ca. 5TB) ready to use crawled and released in Nov 2008
SogouQ **a ca. 4GB Chinese query logs ready to use collected in 2008/2011
E ClueWeb12-B13 **b 52M English Web pages ready to use crawled during 2012
Task Data Query Understanding CsEJ NTCIR-12 IMine-2 Task Data 100 Queries for each language June 2015 -
Vertical Incorporating CsE
Task Data for system training purposes Query Understanding CsEJ NTCIR-9 INTENT Task Data, NTCIR-10 INTENT-2 Task Data, NTCIR-11 IMine Task Data  100 Queries (INTENT, INTENT-2) and 50 Queries (IMine-1) for each language  ready to use -
Vertical Incorporating
MedNLPDoc Document Data Health Record J Training data: mednlpdoc-train.xml 200 documents Aug 1, 2015  2015 
Test data for task 1: mednlpdoc-test.xml 100 documents Jan 15, 2016 2016
MobileClick Documents**c and Queries Information Retrieval i: NTCIR-12 MobileClick document sets (English)**c 100 queries Aug, 2015 2015 
ii: NTCIR-12 MobileClick query sets (English)
J iii: NTCIR-12 MobileClick document sets (Japanese)**c
iv: NTCIR-12 MobileClick query sets (Japanese)
Documents**c, Queries, and iUnits Summarization E i, ii and NTCIR-12 MobileClick iUnit sets (English)**c
J iii, iv and NTCIR-12 MobileClick iUnit sets (Japanese)**c
SpokenQuery&Doc
Document Data Spokenquery&SpokenDocument retrieval documents Spokenquery&SpokenDocument retrieval documents (SDPWS data set)    
Task Data SQ-SCR/ Document retrieval NTCIR-12 Spoken Query and Spoken Document Retrieval Dataset -
SQ-STC/ Term retrieval
Task Data for system training purposes NTCIR-11 Spoken Query and Spoken Document Retrieval Dataset/
NTCIR-10 IR for Spoken Documents Dataset/
NTCIR-9 IR for Spoken Documents Dataset
-
Temporalia Document Data  Web(News) C SogouCA**a   ready to use 2012
SogouT**a
E LivingKnowledge news and blogs annotated subcollection**d ca. 3.8M docs (ca. 20GB) ready to use 2011-2013
Task Data TID Subtask NTCIR-12 Temporalia TID Dataset
TDR Subtask NTCIR-12 Temporalia TDR Dataset
Task Data for system training purposes NTCIR-11 Temporalia TQIC Dataset/
NTCIR-11 Temporalia TIR Dataset
-
pilot Lifelog Document Data Images, Visual Concepts, Semantic Content      
Task Data Lifelog data   NTCIR-12 Lifelog Dataset (Dry Run)  
NTCIR-12 Lifelog Dataset (Formal Run)
QA Lab Document Data English Subtask E Wikipedia Corpus: Solr Instance with Indexed Wikipedia Subset ready to use (open access) -
Japanese Subtask J Wikipedia Corpus: NTCIR-11 QA Lab for Entrance Exam Japanese Wikipedia Data Set
Textbook Data: Japanese Textbook Corpus1 - World History Subset
(Tokyo Shoseki Text Data/ Tokyo Shoseki Annotation Data/ Tokyo Shoseki Index Data)
txt/xml/
index data
572KB(Text Data) ready to use 2007,2008
  Textbook Data: Japanese Textbook Corpus2 - World History Subset
(Yamakawa Shuppansha Text Data/Yamakawa Shuppansha Annotation Data/Yamakawa Shuppansha Index Data)
txt/xml/
index data
252KB(Text Data) ready to use 2010
Task Data for system training purposes English Subtask/
Japanese Subtask
E*/J Sample Questions National Center Test Sample Questions HTML, xml  28KB(html) ready to use
Second-stage Examination Sample Questions 78KB(html)
 Training data National Center Test Training Data
(Question)
xml 230 Topics ready to use 1997, 2001, 2003, 2005, 2007, 2009
National Center Test Training Data
(Question Format)
TBA
National Center Test Training Data
(Right Answer)
ready to use
Second-stage Examination Training Data
(Question and Answer Sheet)
661 Topics TBA 2005,2007,2009
Second-stage Examination Training Data
(Question Format)
Second-stage Examination Training Data
(Right Answer and Answer Nugget)
Task Data English Subtask E Phase1 Phase1 Test Data xml Jul. 2015
Phase1 Answer Data Aug. 2015
Phase3 Phase3 Test Data Dec. 2015
Phase3 Answer Data
Japanese Subtask J Phase1 Phase1 Test Data Jul. 2015
Phase1 Answer Data Aug. 2015
Phase2 Phase2 Test Data Oct. 2015
Phase2 Answer Data Nov. 2015
Phase3 Phase3 Test Data Dec. 2015
Phase3 Answer Data
Tools English Subtask/
Japanese Subtask
E*/J Scorer Scorer and Format Checker 64.3MB ready to use
English Subtask E Baseline System NTCIR QALab CMU Baseline 14.7MB
Japanese Subtask J Baseline System Kachako factoidQA-CenterShikenKaitou-ki 179MB
Ontology Event Ontology xml 16.2MB
STC   Document Data Web (retrieval repository, labeled data) C/J    
Task Data Chinese subtask/ Japanese subtask C/J NTCIR-12 Short Text Conversation Task Data

  • **a: The data will be distributed by Sogou Labs for research purposes only. License information is available at:
    http://www.sogou.com/labs/dl/license_en.html
  • **b: The data will be distributed by Carnegie Mellon University for research purposes only. A license agreement can be found at:
    http://lemurproject.org/clueweb12/
  • **c: Please contact the MobileClick-2 organizers (ntcadm-1click) to obtain the document data.
  • **d: The data will be distributed by Internet Memory for research purposes only. Please visit the Temporalia website to learn how to obtain a copy of the document collection.

Last Modified: 2015-06-25