NTCIR Project
NTCIR-9 SpokenDoc (IR for Spoken Documents)
Research Purpose Use of Test Collection


NTCIR-9 SpokenDoc (IR for Spoken Documents)

The NTCIR-9 SpokenDoc Test Collection is intended to evaluate spoken document retrieval (SDR) targeting lecture speech.
The test collection includes:

100 query terms for evaluating the Spoken Term Detection (STD) task
86 query topics for evaluating the Spoken Document Retrieval (SDR) task
a gold standard: the relevance judgments for the 86 query topics
a scoring tool for the SDR task

The document data is not included in the test collection. Users must obtain it themselves.

For how to obtain it, see 'Speech Data' and 'Manual Transcriptions' below.

Collection	Task	Target	Query language	Number of queries	Relevance judgment
NTCIR-9 SpokenDoc	STD	CORE	Japanese	50 query terms	N/A (automatically determined from the manual transcription)
NTCIR-9 SpokenDoc	STD	ALL	Japanese	50 query terms	N/A (automatically determined from the manual transcription)
NTCIR-9 SpokenDoc	SDR	ALL	Japanese	86 query topics	Two-level relevance judgment for arbitrary-length passages in the documents, including supporting passages.

'Speech Data' and 'Manual Transcriptions'

The document data of this test collection is a subset of the Corpus of Spontaneous Japanese (CSJ) released by the National Institute for Japanese Language. Users of the test collection must purchase the data themselves.
See the CSJ website (currently available in Japanese only).

The CSJ includes several kinds of spontaneous speech data, such as lecture speech and spoken monologues, together with their manual transcriptions. Two kinds of lecture speech, i.e. lectures at academic societies and simulated lectures on a given subject, are used as the document data of the test collection. Together they amount to 2702 lectures and about 600 hours of speech.

Two sets of these lectures are used as the targets for the tasks in the test collection.

ALL: all the 2702 lectures (about 600 hours).
CORE: a subset of 177 lectures (about 44 hours), which is defined in the CSJ.

'Reference Automatic Transcriptions'

The reference automatic transcriptions of the document data, which were used in the NTCIR-9 SpokenDoc-1 evaluation, are also available from Reference Automatic Transcriptions for SpokenDoc-1.

Two kinds of automatic transcriptions are provided, one for each of the two background ASR systems. The textual representation is an N-best list of word or syllable sequences, depending on the system, and lattice and confusion network representations are provided as well.

Word-based transcriptions, obtained with a word-based ASR system, i.e. one that uses a word n-gram model as its language model. The vocabulary list used by the ASR system is provided along with the textual representation.
Syllable-based transcriptions, obtained with a syllable-based ASR system. A syllable n-gram model is used as the language model, and its vocabulary is the set of all Japanese syllables.

Query Terms (for STD task)

Two query term lists are provided: one for the ALL set and one for the CORE set. Each list contains 50 query terms. Each query term consists of one or more words, and query lengths range from 4 to 14 morae. The format of a query term list is as follows.

TERM-ID term Japanese_katakana_sequence

An example list is:

SpokenDoc1-STD-dry-ALL-0001 国立国語研究所 コクリツコクゴケンキュージョ
SpokenDoc1-STD-dry-ALL-0002 統計数理研究所 トーケイスーリケンキュージョ
SpokenDoc1-STD-dry-ALL-0003 大語彙音声認識 ダイゴイオンセーニンシキ
SpokenDoc1-STD-dry-ALL-0004 談話セグメント境界 ダンワセグメントキョーカイ
...
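
As a rough illustration of the list format and the mora-based query length, the following Python sketch parses one line and counts the morae of its katakana reading. It assumes the three fields are separated by whitespace (the exact delimiter is not stated here), and the names QueryTerm, parse_term_line and count_morae are hypothetical helpers, not part of the test collection. The mora count uses the usual convention that small kana such as ャ, ュ and ョ merge with the preceding kana, while ー, ッ and ン each count as one mora.

```python
from typing import NamedTuple

# Small kana that attach to the preceding kana and do not add a mora.
SMALL_KANA = set("ァィゥェォャュョヮ")


class QueryTerm(NamedTuple):
    term_id: str
    term: str
    reading: str  # katakana sequence


def parse_term_line(line: str) -> QueryTerm:
    """Split one list line into TERM-ID, term, and katakana reading.

    Assumption: fields are whitespace-separated. The first token is taken
    as the ID and the last as the reading; anything between is the term.
    """
    parts = line.split()
    if len(parts) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(parts)}")
    return QueryTerm(parts[0], " ".join(parts[1:-1]), parts[-1])


def count_morae(reading: str) -> int:
    """Count morae: every katakana character except the small kana."""
    return sum(1 for ch in reading if ch not in SMALL_KANA)


sample = "SpokenDoc1-STD-dry-ALL-0001 国立国語研究所 コクリツコクゴケンキュージョ"
qt = parse_term_line(sample)
print(qt.term_id, qt.term, count_morae(qt.reading))
```

Under this convention the four example readings above each come to 12 morae, which lies inside the stated range of 4 to 14.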

Query Topics (for SDR task)

Each query topic asks for passages of varying lengths from lectures. A query topic is represented by a natural language sentence. The format of a query topic list is as follows.

TOPIC-ID question

An example list is:

SpokenDoc1-dry-0001 話者認識の学習データのサイズが知りたい
SpokenDoc1-dry-0002 オークションにおける自動入札戦略を知りたい
SpokenDoc1-dry-0003 日本語話し言葉コーパスを用いている研究を教えてください
SpokenDoc1-dry-0004 情報検索性能を評価するにはどのような方法があるか知りたい
...
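
A topic list in this format could be loaded along the following lines. This is only a sketch: it assumes the TOPIC-ID is separated from the question by whitespace and that blank lines can be ignored, and load_topics is a hypothetical helper name. Splitting only on the first run of whitespace keeps any spaces inside the question intact.

```python
from typing import Dict, Iterable


def load_topics(lines: Iterable[str]) -> Dict[str, str]:
    """Map each TOPIC-ID to its natural language question."""
    topics: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        # Split once, so the question keeps any internal whitespace.
        topic_id, question = line.split(maxsplit=1)
        topics[topic_id] = question
    return topics


example_lines = [
    "SpokenDoc1-dry-0001 話者認識の学習データのサイズが知りたい",
    "SpokenDoc1-dry-0002 オークションにおける自動入札戦略を知りたい",
]
topics = load_topics(example_lines)
print(len(topics), topics["SpokenDoc1-dry-0001"])
```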

Gold Standard

The file consists of a sequence of blocks, each of which holds the information for one query. A block consists of a query line followed by any number of relevance judgment lines. In addition, lines starting with "#" are comment lines.
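
A first pass over the gold standard file might only decode it and drop the comment lines, as in the sketch below. It assumes the whole file can be decoded as EUC-JP (the query sentences and comments are described as Japanese EUC) and that blank lines can be skipped. The function name iter_content_lines is hypothetical. The sketch deliberately does not group lines into blocks, since that depends on the query line and relevance judgment line formats.

```python
from typing import Iterator


def iter_content_lines(raw: bytes, encoding: str = "euc_jp") -> Iterator[str]:
    """Yield the non-comment, non-blank lines of a gold standard file.

    Assumptions: the file decodes as EUC-JP, and blank lines are ignorable.
    """
    text = raw.decode(encoding)
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            continue  # comment line
        yield line


# Made-up content, used only to exercise the function.
raw = "# コメント\nfirst content line\n\nsecond content line\n".encode("euc_jp")
content = list(iter_content_lines(raw))
print(content)
```

In practice the bytes would come from reading the gold standard file in binary mode.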

The format of the query line <QueryLine> is as follows:

where <QuerySentence> is a string encoded in Japanese EUC and enclosed in double quotes, and <LF> is a linefeed code.

The format of the relevance judgment line <RelLine> is as follows:

where <Interval> and <Comment> are optional. <DocumentID> is the ID of a document (lecture) defined in the CSJ.

<Interval> describes an interval of utterances in the document specified by <DocumentID>. The format is as follows.

where <IPU> is an ID of the Inter Pausal Unit defined in the CSJ.

<Judgment> describes the result of the relevance judgment for the interval of utterances specified by <DocumentID> and <Interval>. The value is either "R" (Relevant), "P" (Partially Relevant), or "I" (Irrelevant).

<Judgment> ::= R | P | I

<Support> describes the supporting information for the relevance judgment. The format is as follows.

<Support> ::= N | S | U | <Interval> { <Interval> }

The values have the following meanings:

N: The interval needs no support.
S: The interval is supported somewhere in the document, but cannot be specified.
U: The interval is not supported. (Therefore, it is either partially relevant or irrelevant.)
<Interval>: The interval is supported by another interval <Interval> in the same document.
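
The <Judgment> and <Support> values can be modelled roughly as follows. This Python sketch assumes the <Support> field has already been split into tokens, and it treats each <Interval> as an opaque string without interpreting it. Support, parse_support and is_consistent are hypothetical names. The consistency check encodes the note on "U": an interval without support is either partially relevant or irrelevant.

```python
from dataclasses import dataclass, field
from typing import List

JUDGMENTS = {"R": "relevant", "P": "partially relevant", "I": "irrelevant"}
SUPPORT_FLAGS = {"N", "S", "U"}


@dataclass
class Support:
    flag: str = ""  # "N", "S", "U", or "" when supporting intervals are given
    intervals: List[str] = field(default_factory=list)  # opaque interval tokens


def parse_support(tokens: List[str]) -> Support:
    """Interpret a pre-tokenized <Support> field (tokenization is assumed)."""
    if not tokens:
        raise ValueError("empty <Support> field")
    if len(tokens) == 1 and tokens[0] in SUPPORT_FLAGS:
        return Support(flag=tokens[0])
    # Otherwise: one or more supporting intervals in the same document.
    return Support(intervals=list(tokens))


def is_consistent(judgment: str, support: Support) -> bool:
    """Check a judgment against its support information."""
    if judgment not in JUDGMENTS:
        return False
    if support.flag == "U":
        # Not supported, so it cannot be fully relevant.
        return judgment in {"P", "I"}
    return True


print(is_consistent("R", parse_support(["N"])))  # True
print(is_consistent("R", parse_support(["U"])))  # False
```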

<Comment> holds any comment about the judgment. Most comments are encoded in Japanese EUC.

The test collection and data are available from NII free of charge.

The NTCIR-9 SpokenDoc task data can be downloaded from NII/IDR at:
http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html

Reference

The terms of use [PDF]
Overview of the IR for Spoken Documents Task in NTCIR-9 Workshop

NTCIR-9 IR for Spoken Documents (SpokenDoc) Task website

Contact us : ntc-secretariat

Notice

The test collection was constructed and used for the NTCIR project. It is usable only for research purposes.
The document collection included in the test collection was made available to NII for use in the NTCIR project free of charge or for a fee. The providers of the document data understand the importance of such test collections in research on information access technologies and have kindly given their permission to use the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good relationship with the data producers and providers, we researchers must be reliable partners: we must use the data only for research purposes under the user agreement, and we must use it carefully so as not to violate copyright.

Updated on : 2015-07-24
ntc-admin
