The NTCIR-9 SpokenDoc Test Collection is intended to evaluate spoken document retrieval (SDR) targeting lecture
speech.
The test collection includes:
Collection | Task | Target | Query language | Number of queries | Relevance judgment
NTCIR-9 SpokenDoc | STD | CORE | Japanese | 50 query terms | N/A (automatically determined from the manual transcription)
NTCIR-9 SpokenDoc | STD | ALL | Japanese | 50 query terms | N/A (automatically determined from the manual transcription)
NTCIR-9 SpokenDoc | SDR | ALL | Japanese | 86 query topics | Two-level relevance judgment for arbitrary-length passages in the documents, including supporting passages
The document data of this test collection is a subset of the Corpus of Spontaneous Japanese (CSJ) released by the National Institute for Japanese Language. Users of the test collection must purchase the CSJ data themselves.
See the CSJ website (currently available in Japanese only).
The CSJ includes several kinds of spontaneous speech data, such as lecture speech and spoken monologues, together with their manual transcriptions. Two kinds of lecture speech, namely lectures at academic societies and simulated lectures on a given subject, are used as the document data of the test collection. Together they amount to 2702 lectures and about 600 hours of speech.
Two sets of these lectures, ALL and CORE, are used as the targets for the tasks in the test collection.
The reference automatic transcriptions of the document data, which were used in the NTCIR-9 SpokenDoc-1 evaluation, are also available from Reference Automatic Transcriptions for SpokenDoc-1.
Two kinds of automatic transcriptions are provided, one from each of the two background ASR systems: a word-based transcription and a syllable-based transcription. Each is given in textual form as an N-best list of word or syllable sequences, together with its lattice and confusion network representations.
Two query term lists are provided, one for the ALL set and one for the CORE set. Each list includes 50 query terms. Each query term consists of one or more words. Query lengths range from 4 to 14 morae. The format of a query term list is as follows.
Each query topic asks for passages of varying lengths from lectures. A query topic is represented by a natural language sentence. The format of a query topic list is as follows.
The file consists of a sequence of blocks, each of which holds the information related to one query. A block consists of a query line followed by an arbitrary number of relevance judgment lines. In addition to these lines, lines starting with "#" are comment lines.
The format of the query line <QueryLine> is as follows:
The format of the relevance judgment line <RelLine> is as follows:
<Interval> describes an interval of utterances in the document specified by <DocumentID>. The format is as follows.
<Judgment> describes the result of the relevance judgment for the interval of utterances specified by <DocumentID> and <Interval>. The value is either "R" (Relevant), "P" (Partially Relevant), or "I" (Irrelevant).
<Support> describes the support information about the relevancy. The format is as follows.
<Comment> holds any comment about the judgment. Most comments are written in Japanese and encoded in Japanese EUC (EUC-JP).
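As an illustration of the points above, the following is a minimal Python sketch for a first pass over a relevance judgment file. It assumes that the whole file can be decoded as EUC-JP and that blank lines can be ignored; neither is stated by this page. The names `read_content_lines`, `judgment_label`, and `JUDGMENT_LABELS` are hypothetical, and the sketch deliberately does not parse the fields of query lines or relevance judgment lines.

```python
# Hypothetical helpers; the field layout of query lines and relevance
# judgment lines is not assumed here.
JUDGMENT_LABELS = {
    "R": "Relevant",
    "P": "Partially Relevant",
    "I": "Irrelevant",
}


def read_content_lines(raw: bytes) -> list[str]:
    """Decode file bytes as EUC-JP and drop blank lines and '#' comment lines.

    Assumption: the whole file is EUC-JP; undecodable bytes are replaced.
    """
    text = raw.decode("euc_jp", errors="replace")
    return [
        line
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]


def judgment_label(code: str) -> str:
    """Map a <Judgment> code to its label; reject any other value."""
    try:
        return JUDGMENT_LABELS[code]
    except KeyError:
        raise ValueError(f"unknown judgment code: {code!r}") from None
```

The remaining lines can then be grouped into blocks and split into fields according to the query line and relevance judgment line formats of the distributed files.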
The test collection and data are available from NII free of charge.
Reference
- NTCIR-9 SpokenDoc Task data are downloadable from NII/IDR at:
http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html
- The terms of use [PDF]
- Overview of the IR for Spoken Documents Task in NTCIR-9 Workshop
- NTCIR-9 IR for Spoken Documents (SpokenDoc) Task website
Contact us: ntc-secretariat
Notice
The test collection was constructed and used for the NTCIR project. It is usable only for research purposes.
The document collection included in the test collection was made available to NII for use in the NTCIR project free of charge or for a fee. The providers of the document data understand the importance of such test collections in research on information access technologies and have kindly given their permission to use the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good relationship with the data producers and providers, we researchers must be reliable partners: we must use the data only for research purposes under the user agreement, and we must handle it carefully so as not to violate copyright.