NTCIR (NII Test Collection for IR Systems) Project




NTCIR : NACSIS Test Collection Project [1,2]

From Original Proposal: Scope of the Project

From the proposal (originally prepared in October 1997, modified in March 1998)


In this project we conduct research on the evaluation of information retrieval systems and plan (1) to construct large-scale "test collections", which will be available for research purposes, and (2) to organize evaluation workshops, in which participating research groups conduct research using a common test collection, compare results across systems, and exchange research ideas and opinions based on the shared experience.

A test collection is an experimental database that contains (1) database(s), (2) "search topics", which resemble users' information needs, and (3) relevance assessments, i.e. exhaustive lists of the documents that are relevant to each search topic. Test collections are vital for the research and development of information retrieval systems and for comparing the effectiveness of various retrieval models and approaches in the same evaluation environment.
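The three components can be pictured as a small data structure. The following is only a minimal sketch: the class name `Collection`, its field names, the sample IDs and texts, and the `precision` helper are all assumptions made for illustration and are not part of the NACSIS collection or its tools.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Hypothetical in-memory layout of a test collection."""
    # (1) the database: document ID -> document text
    documents: dict[str, str] = field(default_factory=dict)
    # (2) the search topics: topic ID -> statement of the information need
    topics: dict[str, str] = field(default_factory=dict)
    # (3) the relevance assessments: topic ID -> IDs of documents judged relevant
    relevant: dict[str, set[str]] = field(default_factory=dict)

    def precision(self, topic_id: str, retrieved: list[str]) -> float:
        """Fraction of the retrieved documents that are judged relevant."""
        if not retrieved:
            return 0.0
        judged = self.relevant.get(topic_id, set())
        hits = sum(1 for doc_id in retrieved if doc_id in judged)
        return hits / len(retrieved)


# Made-up toy data, for illustration only.
collection = Collection(
    documents={"d1": "transfer in machine translation",
               "d2": "bibliometric laws",
               "d3": "structural changes in translation"},
    topics={"t1": "machine translation"},
    relevant={"t1": {"d1", "d3"}},
)
score = collection.precision("t1", ["d1", "d2"])
```

Because every system is scored against the same `relevant` sets, results from different retrieval approaches become directly comparable, which is the point of a shared collection.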

An evaluation workshop using NACSIS Test Collection 1 was held from November 1998 to September 1999. The second workshop is planned to be held from May 2000 to March 2001.
"NTCIR" is pronounced "EnTi-SaiR".


1. Introduction

In this document we introduce the project to construct the NACSIS Test Collection for the evaluation of information retrieval systems, currently being carried out at the National Center for Science Information Systems (NACSIS), Japan. We first introduce the background and aims of the project and then give the specification of the test collection. Finally, we discuss some technical problems concerning the construction of the collection.


2. Background and Aims

The project is intended to provide a sound infrastructure for evaluating the search effectiveness of information retrieval systems for Japanese text, and to facilitate IR research on Japanese and on cross-lingual retrieval involving Japanese.

The project is also motivated by the recognition of the following needs:

(1) Need for a standard Japanese test collection
(2) Need for cross-lingual retrieval
(3) Need for variety in text types
(4) Need for various components in a document
(5) Need for fundamental data for research into the intersection of IR and NLP

The importance of large-scale standard test collections in IR research is widely recognised. Stopping, stemming and query analysis are language-dependent procedures. In particular, indexing texts written in Japanese or other East Asian languages such as Chinese or Korean is quite different from indexing texts in English, French or other European languages, since there is no explicit boundary (i.e. no space) between words in a sentence. For other East Asian languages there are large-scale test collections, namely the TREC Chinese Collection and the KORDIC Collection for Korean, which contains ca. 50,000 documents. For Japanese, there is only one standard test collection, BMIR-J2, which was published only in March 1998 and consists of 5,080 Japanese newspaper articles and ca. 60 queries. There is still an acute need to enhance the collection in terms of both the variety of text types and the scale.

The need for cross-lingual retrieval is acute in the Internet environment. Moreover, in scientific texts, foreign-language terms, sentences, or abstracts often appear in a Japanese text in their original spelling. Cross-linguistic strategies should therefore be used not only for retrieval from non-equivalent multilingual databases but also for retrieval of Japanese scientific documents [Kando, 1997]. We therefore need a test collection that can be used for cross-lingual retrieval and that consists of scientific texts.

In order to respond to the needs stated above, we aim to construct a large-scale test collection that is also usable for cross-linguistic retrieval and for the application of NLP to IR.


3. The specifications of the collection

The collection contains more than 300,000 documents (more than half of which are Japanese-English paired documents), 100 search topics for each subject domain, and relevance assessments for each search topic.

3.1 The Document

The documents are mainly abstracts of conference papers. Abstract records of conference papers are derived from NACSIS's Academic Conference Papers Database. Because we are still negotiating with the publishers of the journals, we cannot specify the exact number of full-text records that can be included in the collection. The format of a document record is SGML-tagged plain text. An abstract record consists of a document ID, title (Japanese (J) & English (E)), name(s) of authors (J & E), name of conference (J), date of conference, hosting organization (J & E), abstract (segmented into paragraphs; J & E) and keywords manually assigned by the authors of the paper (J & E). An example is shown below:

<TITL TYPE="kanji"><TITL.ORIG>機械翻訳における構造変換の干渉について</TITL.ORIG></TITL>
<TITE TYPE="alpha">Interaction between Structural Changes in Machine Translation</TITE>
<AUPK TYPE="kanji">木下 聡 / 辻井 潤一</AUPK>
<AUPE TYPE="alpha">Kinoshita,Satoshi / John,Phillips / Tsujii,Jun-ichi</AUPE>
<CONF TYPE="kanji"><CONF.ORIG>研究発表会(自然言語処理)</CONF.ORIG></CONF>
<CNFE TYPE="alpha">The Special Interest Group Notes of IPSJ</CNFE>
<ABST TYPE="kanji"><ABST.P><ABST.P.ORIG>語い項目によ
<ABSE TYPE="alpha"><ABSE.P>This paper discusses complex structural changes during transfer in machine translation 
with a non-destructive transfer framework.Though the description of each individual idiosyncratic structural change,
which is mainly caused by lexical items,is not difficult,special provision must be made when they are combined,because 
interaction between them sometimes causes unexpected problems.Transfer of coordinate structures is also discussed as this 
sometimes necessitates a structural change and interacts with other structural changes in a problematic way.We give solutions 
to this problem in our logic-based transfer model.</ABSE.P></ABSE>
<KYWD TYPE="kanji"><KYWD.ORIG>自然言語処理 // 機械翻訳 // 論理 // トランスファー</KYWD.ORIG></KYWD>
<KYWE TYPE="alpha">Natural Language Processing // Machine Translation // Logic // Transfer</KYWE>
<SOCE TYPE="alpha">Information Processing Society of Japan</SOCE>
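
A record in this format can be split into fields with a simple pattern match. The sketch below is an illustration only, not a tool distributed with the collection: it assumes every field is written as `<TAG ...>content</TAG>` as in the sample, uses a shortened copy of the English fields, and the names `FIELD` and `parse_record` are hypothetical.

```python
import re

# Shortened copy of the sample record (English fields only).
record = (
    '<TITE TYPE="alpha">Interaction between Structural Changes '
    'in Machine Translation</TITE>\n'
    '<KYWE TYPE="alpha">Natural Language Processing // Machine Translation '
    '// Logic // Transfer</KYWE>\n'
    '<SOCE TYPE="alpha">Information Processing Society of Japan</SOCE>\n'
)

# <TAG attrs>content</TAG>; the backreference \1 forces the closing tag
# to carry the same name as the opening tag.
FIELD = re.compile(r"<([A-Z]+)\b[^>]*>(.*?)</\1>", re.DOTALL)


def parse_record(text: str) -> dict[str, str]:
    """Map each top-level tag name to its raw content."""
    return {tag: body.strip() for tag, body in FIELD.findall(text)}


fields = parse_record(record)
# Keywords are separated by " // " in the sample record.
keywords = [k.strip() for k in fields["KYWE"].split("//")]
```

A field whose opening and closing tag names differ is silently skipped by this pattern, so a real loader would probably want to report such records rather than drop them.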

3.2 Topics

Search topics are collected from users, and the analysts may rewrite them to make them clearer and more objective. The format of the topics is similar to the one once used in TREC and contains SGML-like tags. A topic consists of a title, a description, a detailed narrative, and a list of concepts. Each narrative may contain a detailed explanation of the topic, term definitions, background knowledge, the purpose of the search, the expected number of relevant documents, preferences in text types, criteria for relevance judgement, and so on. They may be used as a profile of a user's information need in a specific situation to evaluate interactive retrieval systems. The lists of concepts may be used for administrative purposes only. Example topics, translated into English, are shown below.

Examples of topics with English translation

<E>Are there any documents of bibliometrics that deal with the proper
treatment of the types which are unseen in the given sample? </E>
<E>There are many studies that deal with the mathematical structure of
given samples in the field of bibliometrics. I wonder whether there
are studies that deal with the unseen types. Especially, I would
like to know the extension of Lotka's law or Bradford's law to the
theoretical population.</E>
<J>計量書誌学 未知データ 母集団 標本 ロトカの法則
<E>bibliometrics, unseen types, population, sample, Lotka's law,
Bradford's law</E>

<title>complex nouns</title>
<E>Is there any research concerning the automatic analysis of complex nouns
using both statistical and symbolic methods together?</E>
<E>Any work on the automatic analysis of complex nouns is half relevant to
my need. The target languages may be any, e.g. English, Japanese, etc.
But the analysis should cover parsing (bracketing), and the decomposition
is not sufficient.</E>
<J>複合名詞 言語処理 統計的手法 記号的手法 パージング 構造解析</J>
<E>complex nouns, natural language processing, statistical method, symbolic
method, parsing, structural analysis</E>

3.3 Relevance assessment

Relevance is assessed in three grades: relevant, partially relevant, and non-relevant. The top-ranked documents in the search results of various search strategies on content-based IR systems, together with those found by human searchers, are pooled to form a set of candidate relevant documents. The human analyst for each subject domain reviews the candidate set and assesses the relevance of each document in it. The analyst who created or rewrote a topic is also the one who assesses relevance for it.
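
The pooling step can be sketched as follows. This is a simplified illustration under stated assumptions: the pool depth, the run data and the judgments are made up, the integer values attached to the three grades are an arbitrary choice, and `Grade` and `build_pool` are hypothetical names.

```python
from enum import IntEnum


class Grade(IntEnum):
    """The three relevance grades; the numeric values are an assumption."""
    NON_RELEVANT = 0
    PARTIALLY_RELEVANT = 1
    RELEVANT = 2


def build_pool(runs: list[list[str]], depth: int) -> set[str]:
    """Union of the top `depth` documents from every ranked result list."""
    pool: set[str] = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool


# Made-up ranked results for one topic from three systems or searchers.
runs = [
    ["d3", "d1", "d7", "d2"],
    ["d1", "d4", "d3", "d9"],
    ["d5", "d3", "d1", "d8"],
]
pool = build_pool(runs, depth=2)  # candidate set handed to the analyst

# Made-up judgments; only pooled documents are ever assessed.
judgments = {
    "d1": Grade.RELEVANT,
    "d3": Grade.PARTIALLY_RELEVANT,
    "d4": Grade.NON_RELEVANT,
    "d5": Grade.NON_RELEVANT,
}
relevant_or_partial = {d for d, g in judgments.items() if g > Grade.NON_RELEVANT}
```

Documents that no run ranks within the pool depth are never shown to the analyst, so a deeper pool or a more varied set of search strategies brings the judgments closer to exhaustive.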

3.4 Linguistic Analysis

A part of the collection contains detailed part-of-speech tags [Kageura et al., 1997; Koyama et al., 1998]. Because there is no explicit boundary between words in Japanese sentences, we set three levels of lexical boundaries (i.e., word boundary, strong morpheme boundary, and weak morpheme boundary) and assigned detailed POS tags based on the boundaries and types of origin, so that the collection can be used to examine suitable term segmentation of Japanese texts for retrieval purposes.

Example of Lexical Boundaries

w ベクトル m 空間 m モデル w に w 基づく w 情報 m 検索 m システム w は...
(where w: word boundary; m: morpheme boundary)
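
The boundary markers make it possible to derive index terms at more than one granularity from the same text. The sketch below assumes the whitespace-separated "marker segment" layout of the sample line (without its trailing ellipsis) and handles only the two markers shown there; the function name `parse_boundaries` is hypothetical.

```python
def parse_boundaries(annotated: str) -> list[list[str]]:
    """Group segments into words: 'w' opens a new word, 'm' extends it."""
    tokens = annotated.split()
    words: list[list[str]] = []
    # Tokens alternate: a boundary marker, then the segment that follows it.
    for marker, segment in zip(tokens[0::2], tokens[1::2]):
        if marker not in ("w", "m"):
            raise ValueError(f"unknown boundary marker: {marker!r}")
        if marker == "w" or not words:
            words.append([segment])
        else:
            words[-1].append(segment)
    return words


sample = "w ベクトル m 空間 m モデル w に w 基づく w 情報 m 検索 m システム w は"
words = parse_boundaries(sample)

word_units = ["".join(w) for w in words]        # terms at word level
morpheme_units = [m for w in words for m in w]  # terms at morpheme level
```

Indexing the same sentence with `word_units` and with `morpheme_units` gives two candidate term segmentations whose retrieval effectiveness can then be compared.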

4. Workshop

As a part of the project we will organize a competition-type workshop on Japanese text retrieval, like TREC (Text REtrieval Conference). The workshop's objectives are: to encourage research in information retrieval, cross-lingual information retrieval and related areas by providing a large-scale Japanese test collection; to provide a forum for research groups interested in comparing results and exchanging ideas or opinions in an informal atmosphere; and to improve the quality of the test collections based on feedback from participants. Participation is invited from anyone interested in Japanese text retrieval or cross-lingual information retrieval from large-scale collections of scientific documents, and from anyone who wishes to contribute to the effort to construct resources that will be available to the general IR and NLP research community.

5. Summary

This paper has introduced the ongoing project to construct a large-scale English-Japanese bilingual test collection. The project has tried to follow the tradition of standard test collections for IR system evaluation, and at the same time it aims to be applicable to different settings, for example IR for Japanese, retrieval of scientific documents, cross-lingual retrieval, construction of tools or vocabularies for cross-lingual retrieval, combination of manually assigned keywords and content-based IR, interactive IR systems, terminological processing, and terminological research.

For further information about the project, please contact Noriko Kando on kando. Any discussion, comments, and leads about the project are greatly appreciated.