NTCIR-9 SpokenDoc (IR for Spoken Documents)
The NTCIR-9 SpokenDoc Test Collection is intended to evaluate spoken document retrieval (SDR) targeting lecture speech.
The test collection includes:

Document data is not included in the test collection. Users are required to obtain it themselves, as described under 'Speech Data' and 'Manual Transcriptions' below.

Collection: NTCIR-9 SpokenDoc (query language: Japanese)

Task | Target | Queries | Relevance judgment
STD | CORE | 50 query terms | N/A (automatically determined from the manual transcription)
STD | ALL | 50 query terms | N/A (automatically determined from the manual transcription)
SDR | ALL | 86 query topics | Two-level relevance judgment for arbitrary-length passages in the documents, including supporting passages

'Speech Data' and 'Manual Transcriptions'

The document data of this test collection is a subset of the Corpus of Spontaneous Japanese (CSJ) released by the National Institute for Japanese Language. Users of the test collection are required to purchase the data themselves.
See the CSJ website (currently, information is available in Japanese only).

The CSJ includes several kinds of spontaneous speech data, such as lecture speech and spoken monologues, together with their manual transcriptions. Two kinds of lecture speech, i.e. lectures at academic societies and simulated lectures on a given subject, are employed as the document data of the test collection. Together they amount to 2702 lectures and about 600 hours of speech.

Two sets of these lectures, ALL and CORE, are used as the targets of the tasks in the test collection.

'Reference Automatic Transcriptions'

The reference automatic transcriptions of the document data, which were used in the NTCIR-9 SpokenDoc-1 evaluation, are also available from Reference Automatic Transcriptions for SpokenDoc-1.

Two kinds of automatic transcriptions are provided, one for each of the two background ASR systems. Depending on the system, the textual representation is an N-best list of word sequences or of syllable sequences. Lattice and confusion network representations are provided as well.

Query Terms (for STD task)

Two query term lists are provided, one for the ALL set and one for the CORE set. Each list includes 50 query terms. Each query term consists of one or more words, and query lengths range from 4 to 14 morae. The format of a query term list is as follows.

TERM-ID term Japanese_katakana_sequence

An example list is:
SpokenDoc1-STD-dry-ALL-0001 国立国語研究所 コクリツコクゴケンキュージョ
SpokenDoc1-STD-dry-ALL-0002 統計数理研究所 トーケイスーリケンキュージョ
SpokenDoc1-STD-dry-ALL-0003 大語彙音声認識 ダイゴイオンセーニンシキ
SpokenDoc1-STD-dry-ALL-0004 談話セグメント境界 ダンワセグメントキョーカイ
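
As a minimal sketch of reading such a list, the Python below splits a line into its three fields and estimates the term length in morae from the katakana sequence. It assumes the fields are separated by whitespace and contain no whitespace themselves, and it uses a simple mora rule: every kana counts as one mora except the small kana that attach to the preceding kana. The function names are illustrative and are not part of the test collection.

```python
# Small kana that combine with the preceding kana and add no mora (assumed rule).
SMALL_KANA = set("ャュョァィゥェォヮ")


def count_morae(katakana):
    """Count morae: every character is one mora except small combining kana."""
    return sum(1 for ch in katakana if ch not in SMALL_KANA)


def parse_term_line(line):
    """Split 'TERM-ID term katakana' into a dict (whitespace-separated, assumed)."""
    term_id, term, reading = line.split()
    return {"id": term_id, "term": term, "reading": reading,
            "morae": count_morae(reading)}


example = "SpokenDoc1-STD-dry-ALL-0001 国立国語研究所 コクリツコクゴケンキュージョ"
entry = parse_term_line(example)
print(entry["id"], entry["morae"])
```

Under this rule, the long-vowel mark, the moraic nasal and the small tsu each count as one mora.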

Query Topics (for SDR task)

Each query topic asks for passages of varying lengths from lectures. A query topic is represented by a natural language sentence. The format of a query topic list is as follows.

TOPIC-ID question

An example list is:
SpokenDoc1-dry-0001 話者認識の学習データのサイズが知りたい
SpokenDoc1-dry-0002 オークションにおける自動入札戦略を知りたい
SpokenDoc1-dry-0003 日本語話し言葉コーパスを用いている研究を教えてください
SpokenDoc1-dry-0004 情報検索性能を評価するにはどのような方法があるか知りたい
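
A topic list in this format can be read with a few lines of Python. The sketch below assumes that the topic ID and the question are separated by the first run of whitespace on the line; the function name is illustrative.

```python
def parse_topic_line(line):
    """Split 'TOPIC-ID question' at the first run of whitespace (assumed separator)."""
    topic_id, question = line.strip().split(None, 1)
    return topic_id, question


# Map each topic ID to its question sentence.
topics = dict(parse_topic_line(line) for line in [
    "SpokenDoc1-dry-0001 話者認識の学習データのサイズが知りたい",
    "SpokenDoc1-dry-0002 オークションにおける自動入札戦略を知りたい",
])
print(topics["SpokenDoc1-dry-0002"])
```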

Gold Standard

The file consists of a sequence of blocks, each of which holds the information related to one query. A block consists of a query line followed by an arbitrary number of relevance judgment lines. In addition, lines starting with "#" are comment lines.

The format of the query line <QueryLine> is as follows:

<QueryLine> ::= <Query-ID>: <QuerySentence> <LF>
where <QuerySentence> is a string in Japanese EUC encoding, enclosed in double quotes, and <LF> is a linefeed code.
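
The following Python sketch shows one way to recognize comment lines and query lines after decoding the file with Python's euc_jp codec. The regular expression, the query ID used in the sample and the function name are assumptions for illustration; the actual file may differ in details such as spacing.

```python
import re

# <Query-ID>: "<QuerySentence>"; the exact ID alphabet is an assumption.
QUERY_LINE = re.compile(r'^(?P<id>[^\s:]+):\s*"(?P<sentence>.*)"\s*$')


def classify_line(line):
    """Return ('comment' | 'query' | 'other', payload) for one decoded line."""
    if line.startswith("#"):
        return "comment", line[1:].strip()
    m = QUERY_LINE.match(line)
    if m:
        return "query", (m.group("id"), m.group("sentence"))
    return "other", line.rstrip("\n")


# Bytes as they might appear in the file, decoded with the euc_jp codec.
raw = 'SpokenDoc1-dry-0001: "話者認識の学習データのサイズが知りたい"\n'.encode("euc_jp")
kind, payload = classify_line(raw.decode("euc_jp"))
print(kind, payload)
```

Lines classified as "other" are candidates for relevance judgment lines belonging to the most recent query line.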

The format of the relevance judgment line <RelLine> is as follows:

<RelLine> ::= <DocumentID> [<Interval>] <Judgment> <Support> [<Comment>] <LF>
where <Interval> and <Comment> are optional. <DocumentID> is the ID of a document (lecture) defined in the CSJ.

<Interval> describes an interval of utterances in the document specified by <DocumentID>. The format is as follows.

<Interval> ::= <IPU>-<IPU> | <IPU>
where <IPU> is an ID of the Inter Pausal Unit defined in the CSJ.
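
As an illustration of the <Interval> rule, the sketch below turns an interval string into an inclusive pair of IPU numbers. It assumes that IPU IDs are written as digit strings, which this description does not state; the function name is illustrative.

```python
def parse_interval(text):
    """Parse '<IPU>-<IPU>' or '<IPU>' into an inclusive (start, end) pair."""
    start, sep, end = text.partition("-")
    if not sep:
        # A single IPU: the interval covers just that unit.
        end = start
    if not (start.isdigit() and end.isdigit()):
        raise ValueError("not an interval: %r" % text)
    return int(start), int(end)


print(parse_interval("0012-0015"))
print(parse_interval("0042"))
```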

<Judgment> describes the result of the relevance judgment for the interval of utterances specified by <DocumentID> and <Interval>. The value is either "R" (Relevant), "P" (Partially Relevant), or "I" (Irrelevant).

<Judgment> ::= R | P | I

<Support> describes the support information about the relevancy. The format is as follows.

<Support> ::= N | S | U | <Interval> { <Interval> }
The values have the following meanings:

<Comment> holds any comment about the judgment. Most comments are encoded in Japanese EUC.
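
Putting the <RelLine> grammar together, a relevance judgment line could be parsed roughly as follows. This is a sketch under several assumptions that the description above does not confirm: fields are separated by whitespace, IPU IDs are digit strings, and a comment does not begin with a token that looks like an interval. The document ID in the sample lines is only illustrative.

```python
import re

INTERVAL = re.compile(r"^\d+(-\d+)?$")   # assumes numeric IPU IDs
JUDGMENTS = {"R", "P", "I"}


def parse_rel_line(line):
    """Parse one <RelLine>, assuming whitespace-separated fields."""
    tokens = line.split()
    doc_id, rest = tokens[0], tokens[1:]
    interval = None
    if rest and INTERVAL.match(rest[0]):       # optional <Interval>
        interval, rest = rest[0], rest[1:]
    if not rest or rest[0] not in JUDGMENTS:
        raise ValueError("missing judgment: %r" % line)
    judgment, rest = rest[0], rest[1:]
    if not rest:
        raise ValueError("missing support: %r" % line)
    if rest[0] in {"N", "S", "U"}:             # single-letter support value
        support, rest = rest[0], rest[1:]
    else:                                      # one or more supporting intervals
        support = []
        while rest and INTERVAL.match(rest[0]):
            support.append(rest.pop(0))
        if not support:
            raise ValueError("bad support: %r" % line)
    comment = " ".join(rest) or None           # optional <Comment>
    return {"doc": doc_id, "interval": interval, "judgment": judgment,
            "support": support, "comment": comment}


print(parse_rel_line("A01M0005 0012-0015 R 0020-0022 0030"))
print(parse_rel_line("A01M0005 P N"))
```

The parser returns the support field either as one of the letters N, S or U, or as a list of interval strings, mirroring the two alternatives in the <Support> rule.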


The test collection and data are available from NII free of charge.

The test collection was constructed and used for the NTCIR project. It is usable only for research purposes.
The document collection included in the test collection was made available to NII for use in the NTCIR project, either free of charge or for a fee. The providers of the document data understand the importance of such test collections in research on information access technologies and have kindly given their permission to use the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good relationship with the data producers and providers, we researchers must be reliable partners: we must use the data only for research purposes under the user agreement, and we must handle it carefully so as not to violate copyright.