This test collection is intended to evaluate spoken document retrieval (SDR) targeting lecture speech.
The test collection includes:
Document Data is not included in it. Users are required to obtain it by themselves.
- please see: how to obtain
Collection | Task | Target | Query language | Number of queries | Golden standard
NTCIR-10 SpokenDoc-2 | STD large-size | CSJ | Japanese | 100 query terms | IPU (inter-pausal unit) lists in which the query term appeared (automatically determined from the manual transcription)
NTCIR-10 SpokenDoc-2 | STD moderate-size, iSTD | SDPWS | Japanese | 100 query terms + 100 dummy query terms | IPU (inter-pausal unit) lists in which the query term appeared (automatically determined from the manual transcription)
NTCIR-10 SpokenDoc-2 | SCR lecture retrieval | CSJ | Japanese | 120 query topics | Relevance judgment at lecture level
NTCIR-10 SpokenDoc-2 | SCR passage retrieval | SDPWS | Japanese | 120 query topics | Two-level relevance judgment for arbitrary-length passages in the documents
The CSJ document data is a subset of the Corpus of Spontaneous Japanese (CSJ) released by the National Institute for Japanese Language. Users of the test collection are required to purchase the data themselves. See CSJ website.
The CSJ includes several kinds of spontaneous speech data, such as lecture speech and spoken monologues, together with their manual transcriptions. Two kinds of lecture speech, i.e. lectures at academic societies and simulated lectures on a given subject, are employed as the document data of the test collection, which amount to 2702 lectures and about 600 hours in length.
The SDPWS document data consists of the recordings of the first to sixth annual Spoken Document Processing Workshop (SDPWS) and their manual transcriptions. It is available for research purposes from the Speech and Audio Cloud Working Group (formerly the Spoken Document Processing Working Group), organized within the Special Interest Group on Spoken Language Processing (SIG-SLP) of the Information Processing Society of Japan.
The reference automatic transcriptions of these document data, which were used in the NTCIR-10 SpokenDoc-2 evaluation, are also available from the Speech and Audio Cloud Working Group (formerly the Spoken Document Processing Working Group), organized within the Special Interest Group on Spoken Language Processing (SIG-SLP) of the Information Processing Society of Japan.
Four kinds of automatic transcriptions are prepared. Each is provided as an N-best list of word or syllable sequences, depending on which of the two background ASR systems was used, together with its lattice and confusion network representations.
Two query term lists are provided: one for the CSJ (large-size task) and one for the SDPWS (moderate-size task and iSTD task). Each query term consists of one or more words. Query lengths range from 3 to 14 morae for the large-size task and from 3 to 18 morae for the moderate-size (iSTD) task. The format of a query term list is as follows.
A query topic is represented by a natural language sentence. The format of a query topic list is as follows.
The file is a well-formed XML document. It has a single root level tag <ROOT>. Under the root tag, it has two main tags, <RUN> and <RESULTS>.
A <RUN> tag indicates the task name and is just written as follows.
A <RESULTS> tag includes a list of <QUERY> tags.
A <QUERY> tag has three attributes as follows.
A <TERM> has a set of attributes that indicate a relevant occurrence. The attributes are as follows.
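As a rough illustration of the nesting described above, the following Python sketch checks only the element structure of such a file using the standard library's `xml.etree.ElementTree`. It deliberately does not assume any attribute names. The function name `check_std_result_structure`, the attribute-free sample string, and the placement of <TERM> elements directly inside <QUERY> are assumptions made for illustration, not part of the official specification.

```python
import xml.etree.ElementTree as ET


def check_std_result_structure(xml_text):
    """Return a list of structural problems (empty if none were found).

    Hypothetical helper: it checks tag nesting only and ignores attributes.
    """
    problems = []
    # ET.fromstring raises ParseError if the text is not well-formed XML.
    root = ET.fromstring(xml_text)
    if root.tag != "ROOT":
        problems.append(f"root tag is <{root.tag}>, expected <ROOT>")
    for name in ("RUN", "RESULTS"):
        if root.find(name) is None:
            problems.append(f"missing <{name}> under <ROOT>")
    results = root.find("RESULTS")
    if results is not None:
        for query in results:
            if query.tag != "QUERY":
                problems.append(f"unexpected <{query.tag}> under <RESULTS>")
                continue
            # Assumption: <TERM> elements are direct children of <QUERY>.
            for term in query:
                if term.tag != "TERM":
                    problems.append(f"unexpected <{term.tag}> under <QUERY>")
    return problems


# Attribute-free sample, used only to exercise the nesting check.
SAMPLE = "<ROOT><RUN/><RESULTS><QUERY><TERM/><TERM/></QUERY></RESULTS></ROOT>"

if __name__ == "__main__":
    print(check_std_result_structure(SAMPLE))
```

A check of this kind catches nesting mistakes early; validating the attributes themselves would require the official attribute list.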
The file is a well-formed XML document. It has a single root level tag <ROOT>. Under the root tag, it has two main tags, <RUN> and <RESULT>.
A <RUN> tag indicates the task name and is just written as follows.
A <RESULT> tag includes a list of <QUERY> tags.
A <QUERY> tag has an attribute "id", the value of which indicates the corresponding query topic, and includes a list of <CANDIDATE> tags.
A <CANDIDATE> has a set of attributes that indicate a relevant document or a relevant passage. The attributes are as follows.
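A minimal Python sketch of reading such a file is given here, assuming the nesting described above (<ROOT>, then <RESULT>, then <QUERY> elements carrying an "id" attribute, each holding <CANDIDATE> elements). The function name `load_scr_candidates`, the sample id value `topic-001`, and the placeholder attribute `example` are hypothetical; the sketch simply collects whatever attributes each <CANDIDATE> carries without interpreting them.

```python
import xml.etree.ElementTree as ET


def load_scr_candidates(xml_text):
    """Map each query id to the attribute dicts of its <CANDIDATE> elements.

    Hypothetical helper: candidate attributes are copied as-is, in file order.
    """
    root = ET.fromstring(xml_text)
    candidates_by_query = {}
    result = root.find("RESULT")
    if result is None:
        return candidates_by_query
    for query in result.findall("QUERY"):
        query_id = query.get("id")  # the "id" attribute names the query topic
        candidates_by_query[query_id] = [
            dict(candidate.attrib) for candidate in query.findall("CANDIDATE")
        ]
    return candidates_by_query


# Illustrative sample only: "topic-001" and the "example" attribute are
# placeholders, not values or attribute names from the specification.
SAMPLE = (
    '<ROOT><RUN/><RESULT>'
    '<QUERY id="topic-001">'
    '<CANDIDATE example="a"/><CANDIDATE example="b"/>'
    '</QUERY>'
    '</RESULT></ROOT>'
)

if __name__ == "__main__":
    for query_id, candidates in load_scr_candidates(SAMPLE).items():
        print(query_id, candidates)
```

Keeping the candidates in file order preserves whatever ordering the file itself uses.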
The following is the procedure to obtain the test collection. The test collection and data are available from NII free of charge.
- NTCIR-10 SpokenDoc Task data are downloadable from NII/IDR at:
http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html
- Document Data is not included in it. Users are required to obtain it by themselves.
- please see: how to obtain
Contact us: ntc-secretariat
Notice
The test collection was constructed and used for the NTCIR project. It is usable only for research purposes.
The document collection included in the test collection was made available to NII for use in the NTCIR project free of charge or for a fee. The providers of the document data understand the importance of such test collections in research on information access technologies and have kindly given their permission to use the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good relationship with the data producers and providers, we researchers must be reliable partners, use the data only for research purposes under the user agreement, and handle the data carefully so as not to violate copyright.
Updated on: 2013-08-27 ntc-admin