NTCIR-12 Test Collections: data sets for NTCIR-12 Task Participants
task | subtask | data | |||||||||
data type | genre/task | language (C: Chinese; Cs: Simplified Chinese; Ct: Traditional Chinese; E: English; J: Japanese) |
file name | number of documents/ topics (size) |
distribution date | year | |||||
core | IMine | Document Data | Web | Cs | SogouT **a | ca.130M Chinese pages (ca. 5TB) |
ready to use | crawled and released in Nov 2008 | |||
SogouQ **a | ca. 4GB Chinese query logs | ready to use | collected in 2008/2011 | ||||||||
E | ClueWeb12-B13 **b | 52M English Web pages | ready to use | crawled during 2012 | |||||||
Task Data | Query Understanding | CsEJ | NTCIR-12 IMine-2 Task Data | 100 Queries for each language | June 2015 | - | |||||
Vertical Incorporating | CsE | ||||||||||
Task Data for system training purposes | Query Understanding | CsEJ | NTCIR-9 INTENT Task Data, NTCIR-10 INTENT-2 Task Data, NTCIR-11 IMine Task Data | 100 Queries (INTENT, INTENT-2) and 50 Queries (IMine-1) for each language | ready to use | - | |||||
Vertical Incorporating | |||||||||||
MedNLPDoc | Document Data | Health Record | J | Training data: mednlpdoc-train.xml | 200 documents | Aug 1, 2015 | 2015 | ||||
Test data for task 1: mednlpdoc-test.xml | 100 documents | Jan 15, 2016 | 2016 | ||||||||
MobileClick | Documents**c and Queries | Information Retrieval | E | i: NTCIR-12 MobileClick document sets (English)**c | 100 queries | Aug, 2015 | 2015 | ||||
ii: NTCIR-12 MobileClick query sets (English) | |||||||||||
J | iii: NTCIR-12 MobileClick document sets (Japanese)**c | ||||||||||
iv: NTCIR-12 MobileClick query sets (Japanese) | |||||||||||
Documents**c, Queries, and iUnits | Summarization | E | i, ii and NTCIR-12 MobileClick iUnit sets (English)**c | ||||||||
J | iii, iv and NTCIR-12 MobileClick iUnit sets (Japanese)**c | ||||||||||
SpokenQuery&Doc |
Document Data | SpokenQuery&SpokenDocument retrieval documents | SpokenQuery&SpokenDocument retrieval documents (SDPWS data set) | ||||||||
Task Data | SQ-SCR/ Document retrieval | NTCIR-12 Spoken Query and Spoken Document Retrieval Dataset | - | ||||||||
SQ-STC/ Term retrieval | |||||||||||
Task Data for system training purposes | NTCIR-11 Spoken Query and Spoken Document Retrieval Dataset/ NTCIR-10 IR for Spoken Documents Dataset/ NTCIR-9 IR for Spoken Documents Dataset |
- | |||||||||
Temporalia | Document Data | Web(News) | C | SogouCA**a | ready to use | 2012 | |||||
SogouT**a | |||||||||||
E | LivingKnowledge news and blogs annotated subcollection**d | ca. 3.8M docs (ca. 20GB) | ready to use | 2011-2013 | |||||||
Task Data | TID Subtask | NTCIR-12 Temporalia TID Dataset | |||||||||
TDR Subtask | NTCIR-12 Temporalia TDR Dataset | ||||||||||
Task Data for system training purposes | NTCIR-11 Temporalia TQIC Dataset/ NTCIR-11 Temporalia TIR Dataset |
- | |||||||||
pilot | Lifelog | Document Data | Images, Visual Concepts, Semantic Content | ||||||||
Task Data | Lifelog data | NTCIR-12 Lifelog Dataset (Dry Run) | |||||||||
NTCIR-12 Lifelog Dataset (Formal Run) | |||||||||||
QA Lab | Document Data | English Subtask | E | Wikipedia Corpus: | Solr Instance with Indexed Wikipedia Subset | ready to use (open access) |
- | ||||
Japanese Subtask | J | Wikipedia Corpus: | NTCIR-11 QA Lab for Entrance Exam Japanese Wikipedia Data Set | ||||||||
Textbook Data: | Japanese Textbook Corpus1 - World History Subset (Tokyo Shoseki Text Data/ Tokyo Shoseki Annotation Data/ Tokyo Shoseki Index Data) |
txt/xml/ index data |
572KB(Text Data) | ready to use | 2007,2008 | ||||||
Textbook Data: | Japanese Textbook Corpus2 - World History Subset (Yamakawa Shuppansha Text Data/Yamakawa Shuppansha Annotation Data/Yamakawa Shuppansha Index Data) |
txt/xml/ index data |
252KB(Text Data) | ready to use | 2010 | ||||||
Task Data for system training purposes | English Subtask/ Japanese Subtask |
E*/J | Sample Questions | National Center Test Sample Questions | HTML, xml | 28KB(html) | ready to use | ||||
Second-stage Examination Sample Questions | 78KB(html) | ||||||||||
Training data | National Center Test Training Data (Question) |
xml | 230 Topics | ready to use | 1997,2001,2003, 2005,2007,2009 |
||||||
National Center Test Training Data (Question Format) |
TBA | ||||||||||
National Center Test Training Data (Right Answer) |
ready to use | ||||||||||
Second-stage Examination Training Data (Question and Answer Sheet) |
661 Topics | TBA | 2005,2007,2009 | ||||||||
Second-stage Examination Training Data (Question Format) |
|||||||||||
J | Second-stage Examination Training Data (Right Answer and Answer Nugget) |
||||||||||
Task Data | English Subtask | E | Phase1 | Phase1 Test Data | xml | Jul.,2015 | |||||
Phase1 Answer Data | Aug.,2015 | ||||||||||
Phase3 | Phase3 Test Data | Dec.,2015 | |||||||||
Phase3 Answer Data | |||||||||||
Japanese Subtask | J | Phase1 | Phase1 Test Data | Jul.,2015 | |||||||
Phase1 Answer Data | Aug.,2015 | ||||||||||
Phase2 | Phase2 Test Data | Oct.,2015 | |||||||||
Phase2 Answer Data | Nov.,2015 | ||||||||||
Phase3 | Phase3 Test Data | Dec.,2015 | |||||||||
Phase3 Answer Data | |||||||||||
Tools | English Subtask/ Japanese Subtask |
E*/J | Scorer | Scorer and Format Checker | 64.3MB | ready to use | |||||
English Subtask | E | Baseline System | NTCIR QALab CMU Baseline | 14.7MB | |||||||
Japanese Subtask | J | Baseline System | Kachako factoidQA-CenterShikenKaitou-ki | 179MB | |||||||
Ontology | Event Ontology | xml | 16.2MB | ||||||||
STC | Document Data | Web (retrieval repository, labeled data) | C/J | ||||||||
Task Data | Chinese subtask/ Japanese subtask | C/J | NTCIR-12 Short Text Conversation Task Data |
Last Modified: 2015-06-25