NTCIR Project
NTCIR-10 SpokenDoc-2 (IR for Spoken Documents)
Research Purpose Use of Test Collection


NTCIR-10 SpokenDoc-2 (IR for Spoken Documents)

This test collection is intended to evaluate spoken document retrieval (SDR) targeting lecture speech.
The test collection includes:

100 query terms for evaluating Spoken Term Detection (STD) large-size task
100 query terms for evaluating Spoken Term Detection (STD) moderate-size task + 100 dummy query terms for inexistent Spoken Term Detection (iSTD) task
120 query topics for evaluating Spoken Content Retrieval (SCR) lecture retrieval task
120 query topics for evaluating Spoken Content Retrieval (SCR) passage retrieval task
the gold standard for the 100 query terms for STD moderate-size task
the gold standard for the 100 query terms for STD large-size task
the gold standard (the result of the relevance judgment) for the 120 query topics for SCR lecture retrieval task
the gold standard (the result of the relevance judgment) for the 120 query topics for SCR passage retrieval task
the scoring tool for STD and iSTD task
the scoring tool for SCR task

The document data is not included in the test collection. Users are required to obtain it themselves.
- Please see: how to obtain

Collection	Task	Target	Query language	Queries	Gold standard
NTCIR-10 SpokenDoc-2	STD large-size	CSJ	Japanese	100 query terms	Lists of the IPUs (inter-pausal units) in which the query term appears (automatically determined from the manual transcription)
NTCIR-10 SpokenDoc-2	STD moderate-size, iSTD	SDPWS	Japanese	100 query terms + 100 dummy query terms	Lists of the IPUs (inter-pausal units) in which the query term appears (automatically determined from the manual transcription)
NTCIR-10 SpokenDoc-2	SCR lecture retrieval	CSJ	Japanese	120 query topics	Relevance judgment at lecture level
NTCIR-10 SpokenDoc-2	SCR passage retrieval	SDPWS	Japanese	120 query topics	Two-level relevance judgment for arbitrary-length passages in the documents

Speech Data

Two sets of speech data are used for this test collection. The STD large-size task and the SCR lecture retrieval task use the Corpus of Spontaneous Japanese (CSJ) as their document data. The other tasks, i.e., the STD moderate-size task, the iSTD task, and the SCR passage retrieval task, use the Corpus of Spoken Document Processing Workshop (SDPWS).

Corpus of Spontaneous Japanese (CSJ)

The document data is a subset of the Corpus of Spontaneous Japanese (CSJ) released by the National Institute for Japanese Language. Users of the test collection are required to purchase the data themselves. See the CSJ website.

The CSJ includes several kinds of spontaneous speech data, such as lecture speech and spoken monologues, together with their manual transcriptions. Two kinds of lecture speech, i.e. lectures at academic societies and simulated lectures on a given subject, are employed as the document data of the test collection, which amount to 2702 lectures and about 600 hours in length.

Corpus of Spoken Document Processing Workshop (SDPWS)

The SDPWS corpus consists of recordings of the first through sixth annual Spoken Document Processing Workshops and their manual transcriptions. It is available for research purposes from the Speech and Audio Cloud Working Group (formerly the Spoken Document Processing Working Group), organized under the Special Interest Group on Spoken Language Processing (SIG-SLP) of the Information Processing Society of Japan.

Reference Automatic Transcriptions

The reference automatic transcriptions of these document data, which were used in the NTCIR-10 SpokenDoc-2 evaluation, are also available from the Speech and Audio Cloud Working Group (formerly the Spoken Document Processing Working Group), organized under the Special Interest Group on Spoken Language Processing (SIG-SLP) of the Information Processing Society of Japan.

Four kinds of automatic transcriptions are provided. Each is given in textual form as an N-best list of word or syllable sequences, depending on which of the two underlying ASR systems produced it, along with its lattice and confusion network representations.

Word-based transcriptions, obtained by using a word-based ASR system, i.e., one whose language model is a word n-gram model. The vocabulary list used by the ASR system is provided together with the textual representation.
Syllable-based transcriptions, obtained by using a syllable-based ASR system. Its language model is a syllable n-gram model whose vocabulary consists of all Japanese syllables.

Two different kinds of language models are used to obtain these transcriptions: one is trained on matched lecture text and the other on unmatched newspaper articles. Thus, there are four transcriptions for each collection: word-based with high WER, word-based with low WER, syllable-based with high WER, and syllable-based with low WER.

Query Terms (for STD task)

Two query term lists are provided: one for the CSJ (large-size task) and one for the SDPWS (moderate-size task and iSTD task). Each query term consists of one or more words. Query lengths range from 3 to 14 morae for the large-size task and from 3 to 18 morae for the moderate-size (iSTD) task. The format of a query term list is as follows.

TERM-ID term Japanese_katakana_sequence

An example list is:

SpokenDoc2-STD-formal-SDPWS-001 アーティキュレーション アーティキュレーション
SpokenDoc2-STD-formal-SDPWS-002 ＩＢＭ アイビーエム
SpokenDoc2-STD-formal-SDPWS-003 アカデミックハラスメント アカデミックハラスメント
SpokenDoc2-STD-formal-SDPWS-004 Ａｄａｂｏｏｓｔ アダブースト
...
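
As a rough illustration of reading this format, the following Python sketch splits each line into its three fields. It assumes the fields are separated by whitespace (the actual delimiter in the distributed files may differ), and the names `QueryTerm` and `parse_query_terms` are hypothetical helpers, not part of the test collection's tools.

```python
from typing import NamedTuple


class QueryTerm(NamedTuple):
    term_id: str   # e.g. "SpokenDoc2-STD-formal-SDPWS-002"
    term: str      # written form of the query term
    reading: str   # katakana sequence giving the pronunciation


def parse_query_terms(text: str) -> list[QueryTerm]:
    """Parse 'TERM-ID term katakana' lines (whitespace delimiter is assumed)."""
    terms = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "...":
            continue  # skip blank lines and the elision marker
        # First field is the ID, last field is the reading, the rest is the term.
        term_id, rest = line.split(maxsplit=1)
        term, reading = rest.rsplit(maxsplit=1)
        terms.append(QueryTerm(term_id, term, reading))
    return terms


sample = """\
SpokenDoc2-STD-formal-SDPWS-002 ＩＢＭ アイビーエム
SpokenDoc2-STD-formal-SDPWS-004 Ａｄａｂｏｏｓｔ アダブースト
"""
for q in parse_query_terms(sample):
    print(q.term_id, q.term, q.reading)
```

A line with fewer than three fields raises a `ValueError` in this sketch, which may be preferable to silently accepting a malformed list.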

Query Topics (for SCR task)

A query topic is represented by a natural language sentence. The format of a query topic list is as follows.

TOPIC-ID question

An example list is:

SpokenDoc1-dry-0001 話者認識の学習データのサイズが知りたい
SpokenDoc1-dry-0002 オークションにおける自動入札戦略を知りたい
SpokenDoc1-dry-0003 日本語話し言葉コーパスを用いている研究を教えてください
SpokenDoc1-dry-0004 情報検索性能を評価するにはどのような方法があるか知りたい
...

Gold Standard (for STD task)

The file is a well-formed XML document. It has a single root level tag <ROOT>. Under the root tag, it has two main tags, <RUN> and <RESULTS>.

A <RUN> tag indicates the task name and is written as follows.

# for the large-size task
<RUN>
<SUBTASK>STD</SUBTASK>
<TARGET>CSJ</TARGET>
</RUN>

# for the moderate-size and iSTD task
<RUN>
<SUBTASK>STD, iSTD</SUBTASK>
<TARGET>SDPWS</TARGET>
</RUN>

A <RESULTS> tag includes a list of <QUERY> tags.

A <QUERY> tag has three attributes as follows.

id: Its value is the corresponding query term ID.
term: Its value is the text of the query term.
category: Its value shows the query type: out-of-vocabulary (OOV), in-vocabulary (IV), or inexistent (iSTD). Whether a query is OOV or IV is defined with respect to the reference ASR dictionary of the matched-condition word-based language model provided by the task organizers. The iSTD query terms do not occur in any SDPWS speech.

A <QUERY> also includes a list of <TERM> tags.

A <TERM> has a set of attributes that indicate a relevant occurrence. The attributes are as follows.

document: Its value indicates the document ID.
ipu: Its value indicates the IPU of the correct occurrence.

Note that the moderate-size task and the iSTD task share the same query set. Half of its query terms (100) are used for the moderate-size task; the other 100 are the dummy terms for the iSTD task, so no iSTD query term has a <TERM> tag. An example of the <RESULTS> section of the file for the moderate-size task is as follows.

The query set for the moderate-size task includes 53 OOV query terms and 47 IV query terms. The total numbers of IPUs including the OOV and IV terms are 480 and 458 in the SDPWS speeches, respectively.
On the other hand, the query set for the large-size task has 54 OOV and 46 IV query terms. The total numbers of IPUs of the OOV and IV terms are 844 and 953 in the CSJ speeches, respectively.
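
The per-category IPU totals quoted above can in principle be recomputed from the gold standard file. The Python sketch below shows one way to do so with the standard library. The embedded sample is hypothetical: the query IDs, terms, document and IPU values, and the exact spellings of the category values ("OOV", "IV", "iSTD") are invented for illustration and are not taken from the distributed files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample following the structure described in this section.
# All IDs, terms, IPU values and category spellings are assumptions.
SAMPLE = """\
<ROOT>
  <RUN>
    <SUBTASK>STD, iSTD</SUBTASK>
    <TARGET>SDPWS</TARGET>
  </RUN>
  <RESULTS>
    <QUERY id="example-001" term="term-a" category="OOV">
      <TERM document="01-01" ipu="0012" />
      <TERM document="02-03" ipu="0101" />
    </QUERY>
    <QUERY id="example-002" term="term-b" category="IV">
      <TERM document="01-01" ipu="0040" />
    </QUERY>
    <QUERY id="example-003" term="term-c" category="iSTD" />
  </RESULTS>
</ROOT>
"""


def ipus_per_category(xml_text: str) -> Counter:
    """Count relevant IPUs (<TERM> elements) for each query category."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for query in root.iterfind("./RESULTS/QUERY"):
        category = query.get("category", "unknown")
        # An iSTD query has no <TERM> children, so it contributes zero.
        counts[category] += len(query.findall("TERM"))
    return counts


print(ipus_per_category(SAMPLE))
```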

Gold Standard (for SCR task)

The file is a well-formed XML document. It has a single root level tag <ROOT>. Under the root tag, it has two main tags, <RUN> and <RESULT>.

A <RUN> tag indicates the task name and is written as follows.

A <RESULT> tag includes a list of <QUERY> tags.

A <QUERY> tag has an attribute "id", the value of which indicates the corresponding query topic, and includes a list of <CANDIDATE> tags.

A <CANDIDATE> has a set of attributes that indicate a relevant document or a relevant passage. The attributes are as follows.

document: Its value indicates the document ID.
ipu-from: Its value indicates the first IPU of the relevant passage. It is used only for the passage retrieval task.
ipu-to: Its value indicates the last IPU of the relevant passage. It is used only for the passage retrieval task.
relevancy: Its value indicates the relevancy level, which is either "R" (Relevant), "P" (Partially Relevant), or "I" (Irrelevant).

An example of the <RESULT> section is as follows.

<RESULT>
<QUERY id="SpokenDoc2-SCR-formal-PAS-001">
<CANDIDATE document="07-01" ipu-from="0063" ipu-to="0071" relevancy="P" />
<CANDIDATE document="07-01" ipu-from="0090" ipu-to="0107" relevancy="R" />
...
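
A minimal Python sketch of reading the passage retrieval gold standard follows. The two <CANDIDATE> lines are those of the example above; the enclosing <ROOT> and the closing tags are added here only so that the fragment parses. The function name `relevant_passages` is hypothetical, and the sketch assumes that IPU values are always numeric strings and that both `ipu-from` and `ipu-to` are present, which holds only for the passage retrieval task.

```python
import xml.etree.ElementTree as ET

# Candidate lines come from the example in this section; <ROOT> and the
# closing tags are added so the fragment is well-formed.
SAMPLE = """\
<ROOT>
<RESULT>
<QUERY id="SpokenDoc2-SCR-formal-PAS-001">
<CANDIDATE document="07-01" ipu-from="0063" ipu-to="0071" relevancy="P" />
<CANDIDATE document="07-01" ipu-from="0090" ipu-to="0107" relevancy="R" />
</QUERY>
</RESULT>
</ROOT>
"""


def relevant_passages(xml_text: str) -> dict[str, list[tuple[str, int, int, str]]]:
    """Map each query ID to (document, first IPU, last IPU, relevancy) tuples."""
    root = ET.fromstring(xml_text)
    passages = {}
    for query in root.iterfind("./RESULT/QUERY"):
        rows = []
        for cand in query.iterfind("CANDIDATE"):
            if cand.get("relevancy") == "I":
                continue  # skip passages judged irrelevant
            rows.append((
                cand.get("document"),
                int(cand.get("ipu-from")),  # assumes numeric IPU values
                int(cand.get("ipu-to")),
                cand.get("relevancy"),
            ))
        passages[query.get("id")] = rows
    return passages


for query_id, rows in relevant_passages(SAMPLE).items():
    print(query_id, rows)
```

For the lecture retrieval task, where the IPU attributes are not used, the two `int(...)` conversions would have to be dropped.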

The following is the procedure to obtain the test collection. The test collection and data are available from NII free of charge.

NTCIR-10 SpokenDoc Task data are downloadable from NII/IDR at:
http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html

The document data is not included in the download. Users are required to obtain it themselves.
- Please see: how to obtain

Reference

The terms of use [PDF]
Overview of the NTCIR-10 SpokenDoc-2 Task

NTCIR-10 IR for Spoken Documents (SpokenDoc-2) Task website

Tools

Contact us: ntc-secretariat

Notice

The test collection was constructed and used for the NTCIR project. It is usable only for research purposes.
The document collection included in the test collection was made available to NII for use in the NTCIR project free of charge or for a fee. The providers of the document data understand the importance of such test collections in research on information access technologies and have kindly given their permission to use the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good relationship with the data producers/provider, we researchers must be reliable partners and use the data only for research purposes under the user agreement, and we must use the data carefully so as not to violate copyright.

Updated on : 2013-08-27
