NTCIR-12

NTCIR-12 Test Collections: data sets for NTCIR-12 Task Participants

The following data sets are used in NTCIR-12. They are available to participating research groups for task participation and system evaluation within NTCIR-12 (*1). To obtain the data, signed user agreement forms must be submitted to the NTCIR Project Office at NII.
*The test collections and data available from NII are free of charge. A nominal cost may be required for some datasets provided by other parties.

1: For details of the task data (topics and relevance judgments, questions and answers, summaries, etc.), please visit the webpage of each task.
2: For data marked with **, the procedure for obtaining the data is specified in the corresponding note.
3: Please note that the document collections shall be used for the purpose of accomplishing tasks set out in the NTCIR Workshop and for the purpose of research related to those tasks. The documents cannot be used for "information purpose".

IMine
MedNLPDoc
MobileClick
SpokenQuery&Doc
Temporalia
Lifelog
QA Lab
STC

	task	subtask	data
			data type	genre/task	language (C: Chinese; Cs: Simplified Chinese; Ct: Traditional Chinese; E: English; J: Japanese)	file name			number of documents/topics (size)	distribution date	year
core	IMine		Document Data	Web	Cs	SogouT **a			ca. 130M Chinese pages (ca. 5TB)	ready to use	crawled and released in Nov 2008
						SogouQ **a			ca. 4GB Chinese query logs	ready to use	collected in 2008/2011
					E	ClueWeb12-B13 **b			52M English Web pages	ready to use	crawled during 2012
			Task Data	Query Understanding	CsEJ	NTCIR-12 IMine-2 Task Data			100 Queries for each language	June 2015	-
				Vertical Incorporating	CsE
			Task Data for system training purposes	Query Understanding	CsEJ	NTCIR-9 INTENT Task Data, NTCIR-10 INTENT-2 Task Data, NTCIR-11 IMine Task Data			100 Queries (INTENT, INTENT-2) and 50 Queries (IMine-1) for each language	ready to use	-
				Vertical Incorporating
	MedNLPDoc		Document Data	Health Record	J	Training data: mednlpdoc-train.xml			200 documents	Aug 1, 2015	2015
						Test data for task 1: mednlpdoc-test.xml			100 documents	Jan 15, 2016	2016
	MobileClick		Documents**c and Queries	Information Retrieval	E	i: NTCIR-12 MobileClick document sets (English)**c			100 queries	Aug, 2015	2015
						ii: NTCIR-12 MobileClick query sets (English)
					J	iii: NTCIR-12 MobileClick document sets (Japanese)**c
						iv: NTCIR-12 MobileClick query sets (Japanese)
			Documents**c, Queries, and iUnits	Summarization	E	i, ii and NTCIR-12 MobileClick iUnit sets (English)**c
					J	iii, iv and NTCIR-12 MobileClick iUnit sets (Japanese)**c
	SpokenQuery &Doc		Document Data	Spokenquery&SpokenDocument retrieval documents		Spokenquery&SpokenDocument retrieval documents (SDPWS data set)

			Task Data	SQ-SCR/ Document retrieval		NTCIR-12 Spoken Query and Spoken Document Retrieval Dataset					-
				SQ-STC/ Term retrieval
			Task Data for system training purposes			NTCIR-11 Spoken Query and Spoken Document Retrieval Dataset/ NTCIR-10 IR for Spoken Documents Dataset/ NTCIR-9 IR for Spoken Documents Dataset					-
	Temporalia		Document Data	Web(News)	C	SogouCA **a				ready to use	2012
						SogouT **a
					E	LivingKnowledge news and blogs annotated subcollection**d			ca. 3.8M docs (ca. 20GB)	ready to use	2011-2013
			Task Data	TID Subtask		NTCIR-12 Temporalia TID Dataset
				TDR Subtask		NTCIR-12 Temporalia TDR Dataset
			Task Data for system training purposes			NTCIR-11 Temporalia TQIC Dataset/ NTCIR-11 Temporalia TIR Dataset					-
pilot	Lifelog		Document Data	Images, Visual Concepts, Semantic Content
			Task Data	Lifelog data		NTCIR-12 Lifelog Dataset (Dry Run)
						NTCIR-12 Lifelog Dataset (Formal Run)
	QA Lab		Document Data	English Subtask	E	Wikipedia Corpus:	Solr Instance with Indexed Wikipedia Subset			ready to use (open access)	-
				Japanese Subtask	J	Wikipedia Corpus:	NTCIR-11 QA Lab for Entrance Exam Japanese Wikipedia Data Set
						Textbook Data:	Japanese Textbook Corpus1 -World History Subset (Tokyo Shoseki Text Data/ Tokyo Shoseki Annotation Data/ Tokyo Shoseki Index Data)	txt/xml/ index data	572KB(Text Data)	ready to use	2007,2008
						Textbook Data:	Japanese Textbook Corpus2 - World History Subset (Yamakawa Shuppansha Text Data/Yamakawa Shuppansha Annotation Data/Yamakawa Shuppansha Index Data)	txt/xml/ index data	252KB(Text Data)	ready to use	2010
			Task Data for system training purposes	English Subtask/ Japanese Subtask	E*/J	Sample Questions	National Center Test Sample Questions	HTML, xml	28KB(html)	ready to use
							Second-stage Examination Sample Questions		78KB(html)
						Training data	National Center Test Training Data (Question)	xml	230 Topics	ready to use	1997,2001,2003, 2005,2007,2009
							National Center Test Training Data (Question Format)			TBA
							National Center Test Training Data (Right Answer)			ready to use
							Second-stage Examination Training Data (Question and Answer Sheet)		661 Topics	TBA	2005,2007,2009
							Second-stage Examination Training Data (Question Format)
					J		Second-stage Examination Training Data (Right Answer and Answer Nugget)
			Task Data	English Subtask	E	Phase1	Phase1 Test Data	xml		Jul. 2015
							Phase1 Answer Data			Aug. 2015
						Phase3	Phase3 Test Data			Dec. 2015
							Phase3 Answer Data
				Japanese Subtask	J	Phase1	Phase1 Test Data			Jul. 2015
							Phase1 Answer Data			Aug. 2015
						Phase2	Phase2 Test Data			Oct. 2015
							Phase2 Answer Data			Nov. 2015
						Phase3	Phase3 Test Data			Dec. 2015
							Phase3 Answer Data
			Tools	English Subtask/ Japanese Subtask	E*/J	Scorer	Scorer and Format Checker		64.3MB	ready to use
				English Subtask	E	Baseline System	NTCIR QALab CMU Baseline		14.7MB
				Japanese Subtask	J	Baseline System	Kachako factoidQA-CenterShikenKaitou-ki		179MB
						Ontology	Event Ontology	xml	16.2MB
	STC		Document Data	Web (retrieval repository, labeled data)	C/J
			Task Data	Chinese subtask/ Japanese subtask	C/J	NTCIR-12 Short Text Conversation Task Data

**a: The data will be distributed by Sogou Labs for research purposes only. License information is available at:
http://www.sogou.com/labs/dl/license_en.html
**b: The data will be distributed by Carnegie Mellon University for research purposes only. The license agreement can be found at:
http://lemurproject.org/clueweb12/
**c: Please contact the MobileClick-2 organizers (ntcadm-1click) to obtain the document data.
**d: The data will be distributed by Internet Memory for research purposes only. Please visit the Temporalia website to learn how to obtain a copy of the document collection.

Last Modified: 2015-06-25