NTCIR Project
NTCIR-10 SpokenDoc-2 (IR for Spoken Documents)
Research Purpose Use of Test Collection


This test collection is intended to evaluate spoken document retrieval (SDR) targeting lecture speech.
The test collection includes:

Document data is not included in the test collection. Users are required to obtain it themselves.
- Please see: how to obtain

Collection | Task | Target | Query language | Number of queries | Gold standard
NTCIR-10 SpokenDoc-2 | STD large-size | CSJ | Japanese | 100 query terms | IPU (inter-pausal unit) lists in which the query term appeared (automatically determined from the manual transcription)
NTCIR-10 SpokenDoc-2 | STD moderate-size, iSTD | SDPWS | Japanese | 100 query terms + 100 dummy query terms | Same as above
NTCIR-10 SpokenDoc-2 | SCR lecture retrieval | CSJ | Japanese | 120 query topics | Relevance judgment at lecture level
NTCIR-10 SpokenDoc-2 | SCR passage retrieval | SDPWS | Japanese | 120 query topics | Two-level relevance judgment for arbitrary-length passages in the documents

Speech Data

Two sets of speech data are used for this test collection. The STD large-size task and the SCR lecture retrieval task use the Corpus of Spontaneous Japanese (CSJ) as their document data. The other tasks, i.e., the STD moderate-size task, the iSTD task, and the SCR passage retrieval task, use the Corpus of Spoken Document Processing Workshop (SDPWS).

Corpus of Spontaneous Japanese (CSJ)

The document data is a subset of the Corpus of Spontaneous Japanese (CSJ) released by the National Institute for Japanese Language. Users of the test collection are required to purchase the data themselves. See CSJ website.

The CSJ includes several kinds of spontaneous speech data, such as lecture speech and spoken monologues, together with their manual transcriptions. Two kinds of lecture speech, i.e., lectures at academic societies and simulated lectures on a given subject, are used as the document data of the test collection. Together they amount to 2702 lectures and about 600 hours of speech.

Corpus of Spoken Document Processing Workshop (SDPWS)

The SDPWS consists of recordings of the first through sixth annual Spoken Document Processing Workshops and their manual transcriptions. It is available for research purposes from the Speech and Audio Cloud Working Group (formerly the Spoken Document Processing Working Group), organized within the Special Interest Group on Spoken Language Processing (SIG-SLP) of the Information Processing Society of Japan.

Reference Automatic Transcriptions

The reference automatic transcriptions of these document data, which were used in the NTCIR-10 SpokenDoc-2 evaluation, are also available from the Speech and Audio Cloud Working Group (formerly the Spoken Document Processing Working Group), organized within the Special Interest Group on Spoken Language Processing (SIG-SLP) of the Information Processing Society of Japan.

Four kinds of automatic transcriptions are prepared. Each is provided in textual form as an N-best list of word or syllable sequences, depending on which of the two background ASR systems produced it, along with the corresponding lattice and confusion network representations.

Two different kinds of language models are used to obtain these transcriptions: one is trained on matched lecture text and the other on unmatched newspaper articles. Thus, there are four transcriptions for each collection: word-based with high WER, word-based with low WER, syllable-based with high WER, and syllable-based with low WER.

Query Terms (for STD task)

Two query term lists are provided: one for the CSJ (large-size task) and one for the SDPWS (moderate-size task and iSTD task). Each query term consists of one or more words. Query lengths range from 3 to 14 morae for the large-size task and from 3 to 18 morae for the moderate-size (iSTD) task. The format of a query term list is as follows.

TERM-ID term Japanese_katakana_sequence

An example list is:
SpokenDoc2-STD-formal-SDPWS-001 アーティキュレーション アーティキュレーション
SpokenDoc2-STD-formal-SDPWS-002 IBM アイビーエム
SpokenDoc2-STD-formal-SDPWS-003 アカデミックハラスメント アカデミックハラスメント
SpokenDoc2-STD-formal-SDPWS-004 Adaboost アダブースト
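
As an illustration of the list format above, the following Python sketch splits a line into its three fields and estimates the query length in morae from the katakana field. It assumes the fields are separated by whitespace, and it counts morae with a simple rule (every katakana character is one mora except the small vowels and small ya/yu/yo, which attach to the preceding character). Both the separator and the counting rule are assumptions, not the organizers' definition, and the function names are hypothetical.

```python
# Small kana that combine with the preceding character into a single mora.
# The small "tsu" (ッ), the long-vowel mark (ー), and "n" (ン) each count as a mora.
SMALL_KANA = set("ァィゥェォャュョヮ")


def parse_term_line(line):
    """Split one query-term line into (term_id, term, katakana).

    Assumes whitespace-separated fields: the first is the ID, the last is
    the katakana reading, and everything in between is the term text.
    """
    parts = line.split()
    if len(parts) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(parts)}: {line!r}")
    return parts[0], " ".join(parts[1:-1]), parts[-1]


def count_morae(katakana):
    """Approximate the mora count of a katakana string (assumed rule)."""
    return sum(1 for ch in katakana if ch not in SMALL_KANA)


if __name__ == "__main__":
    sample = "SpokenDoc2-STD-formal-SDPWS-002 IBM アイビーエム"
    term_id, term, kana = parse_term_line(sample)
    print(term_id, term, kana, count_morae(kana))
```

Taking the last field as the reading keeps the parser usable even if a term itself contains spaces, which the format line does not rule out.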

Query Topics (for SCR task)

A query topic is represented by a natural language sentence. The format of a query topic list is as follows.

TOPIC-ID question

An example list is:
SpokenDoc1-dry-0001 話者認識の学習データのサイズが知りたい
SpokenDoc1-dry-0002 オークションにおける自動入札戦略を知りたい
SpokenDoc1-dry-0003 日本語話し言葉コーパスを用いている研究を教えてください
SpokenDoc1-dry-0004 情報検索性能を評価するにはどのような方法があるか知りたい
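
A topic list in the format above can be read with a few lines of Python. This is a minimal sketch that assumes the topic ID and the question are separated by whitespace and that the ID itself contains no spaces; `parse_topic_line` is a hypothetical helper name.

```python
def parse_topic_line(line):
    """Split one query-topic line into (topic_id, question).

    Only the first run of whitespace is treated as the separator, so any
    spaces inside the question text are preserved.
    """
    fields = line.strip().split(maxsplit=1)
    if len(fields) != 2:
        raise ValueError(f"expected 'TOPIC-ID question', got: {line!r}")
    return fields[0], fields[1]


if __name__ == "__main__":
    topic_id, question = parse_topic_line(
        "SpokenDoc1-dry-0001 話者認識の学習データのサイズが知りたい"
    )
    print(topic_id, question)
```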

Gold Standard (for STD task)

The file is a well-formed XML document. It has a single root level tag <ROOT>. Under the root tag, it has two main tags, <RUN> and <RESULTS>.

A <RUN> tag indicates the task name and is just written as follows.

# for the large-size task

# for the moderate-size and iSTD task

A <RESULTS> tag includes a list of <QUERY> tags.

A <QUERY> tag has three attributes as follows.

id: Its value is the corresponding query term ID.
term: Its value is the text of the query term.
category: Its value shows the query type: out-of-vocabulary (OOV), in-vocabulary (IV), or inexistent query (iSTD). OOV and IV queries are defined with respect to the reference ASR dictionary of the matched-condition word-based language model provided by the task organizers. The iSTD query terms do NOT appear in any speech of SDPWS.

A <QUERY> also includes a list of <TERM> tags.

A <TERM> has a set of attributes that indicate a relevant occurrence. The attributes are as follows.

document: Its value indicates the document ID.
ipu: Its value indicates the IPU of the correct occurrence.

Note that the moderate-size task and the iSTD task use the same query set. Half of the query terms (100) are used for the moderate-size task, and the other half are the dummy query terms for the iSTD task. Therefore, an iSTD query term has no <TERM> tag. An example of the <RESULTS> section of the file for the moderate-size task is as follows.
<QUERY id="SpokenDoc2-STD-formal-SDPWS-001" term="アーティキュレーション" category="OOV">
<TERM document="09-17" ipu="0189" />
<TERM document="09-17" ipu="0198" />
<TERM document="09-17" ipu="0212" />
The query set for the moderate-size task includes 53 OOV query terms and 47 IV query terms. In the SDPWS speeches, the total numbers of IPUs containing the OOV and IV terms are 480 and 458, respectively.
The query set for the large-size task has 54 OOV and 46 IV query terms. In the CSJ speeches, the total numbers of IPUs containing the OOV and IV terms are 844 and 953, respectively.
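
The <QUERY>/<TERM> structure described in this section can be loaded with Python's standard `xml.etree.ElementTree` module. The sketch below uses a sample built from the excerpt above; the <RUN> element is left out because its content is not shown here, and the closing tags are added only so that the fragment parses on its own. The function name `load_std_gold` and the returned dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample assembled from the excerpt in this section (assumption: <RUN> omitted,
# closing tags added so the fragment is well-formed).
SAMPLE = """<ROOT>
<RESULTS>
<QUERY id="SpokenDoc2-STD-formal-SDPWS-001" term="アーティキュレーション" category="OOV">
<TERM document="09-17" ipu="0189" />
<TERM document="09-17" ipu="0198" />
<TERM document="09-17" ipu="0212" />
</QUERY>
</RESULTS>
</ROOT>"""


def load_std_gold(xml_text):
    """Map each query ID to its term text, category, and correct (document, ipu) pairs."""
    root = ET.fromstring(xml_text)
    gold = {}
    for query in root.iter("QUERY"):
        # A query without <TERM> children (e.g. an iSTD query) yields an empty set.
        hits = {(t.get("document"), t.get("ipu")) for t in query.iter("TERM")}
        gold[query.get("id")] = {
            "term": query.get("term"),
            "category": query.get("category"),
            "hits": hits,
        }
    return gold


if __name__ == "__main__":
    gold = load_std_gold(SAMPLE)
    for query_id, entry in gold.items():
        print(query_id, entry["category"], sorted(entry["hits"]))
```

IPU values are kept as strings so that the zero padding seen in the example survives unchanged.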

Gold Standard (for SCR task)

The file is a well-formed XML document. It has a single root level tag <ROOT>. Under the root tag, it has two main tags, <RUN> and <RESULT>.

A <RUN> tag indicates the task name and is just written as follows.


A <RESULT> tag includes a list of <QUERY> tags.

A <QUERY> tag has an attribute "id", the value of which indicates the corresponding query topic, and includes a list of <CANDIDATE> tags.

A <CANDIDATE> has a set of attributes that indicate a relevant document or a relevant passage. The attributes are as follows.

document: Its value indicates the document ID.
ipu-from: Its value indicates the first IPU of the relevant passage. It is used only for the passage retrieval task.
ipu-to: Its value indicates the last IPU of the relevant passage. It is used only for the passage retrieval task.
relevancy: Its value indicates the relevancy level, which is either "R" (Relevant), "P" (Partially Relevant), or "I" (Irrelevant).

An example of the <RESULT> section is as follows.
<QUERY id="SpokenDoc2-SCR-formal-PAS-001">
<CANDIDATE document="07-01" ipu-from="0063" ipu-to="0071" relevancy="P" />
<CANDIDATE document="07-01" ipu-from="0090" ipu-to="0107" relevancy="R" />
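
The <CANDIDATE> entries above can be read in the same way with Python's `xml.etree.ElementTree`. The following sketch is built from the excerpt in this section: the <RUN> element is omitted because its content is not shown, and closing tags are added so the fragment parses. It assumes IPU identifiers are zero-padded integers and that a passage covers both its first and last IPU. The names `load_scr_gold` and `passage_length` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample assembled from the excerpt in this section (assumption: <RUN> omitted,
# closing tags added so the fragment is well-formed).
SAMPLE = """<ROOT>
<RESULT>
<QUERY id="SpokenDoc2-SCR-formal-PAS-001">
<CANDIDATE document="07-01" ipu-from="0063" ipu-to="0071" relevancy="P" />
<CANDIDATE document="07-01" ipu-from="0090" ipu-to="0107" relevancy="R" />
</QUERY>
</RESULT>
</ROOT>"""


def _to_int(value):
    """Convert an IPU attribute to int; None when absent (lecture-level task)."""
    return int(value) if value is not None else None


def load_scr_gold(xml_text):
    """Map each query topic ID to its list of judged documents or passages."""
    root = ET.fromstring(xml_text)
    gold = {}
    for query in root.iter("QUERY"):
        gold[query.get("id")] = [
            {
                "document": cand.get("document"),
                "ipu_from": _to_int(cand.get("ipu-from")),
                "ipu_to": _to_int(cand.get("ipu-to")),
                "relevancy": cand.get("relevancy"),
            }
            for cand in query.iter("CANDIDATE")
        ]
    return gold


def passage_length(candidate):
    """Number of IPUs in a passage, counting both endpoints (assumed inclusive)."""
    return candidate["ipu_to"] - candidate["ipu_from"] + 1


if __name__ == "__main__":
    for query_id, candidates in load_scr_gold(SAMPLE).items():
        for cand in candidates:
            print(query_id, cand["document"], cand["relevancy"], passage_length(cand))
```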


The following is the procedure to obtain the test collection. The test collection and data are available from NII free of charge.


Contact us: ntc-secretariat


The test collection was constructed and used for the NTCIR project. It is usable only for research purposes.
The document collection included in the test collection was made available to NII for use in the NTCIR project free of charge or for a fee. The providers of the document data understand the importance of such test collections in research on information access technologies and have kindly given their permission to use the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good relationship with the data producers and providers, we researchers must be reliable partners and use the data only for research purposes under the user agreement, and we must use the data carefully so as not to violate copyright.