NTCIR-1 and 2

Test Collection 1 (NTCIR-1) consists of three document collections: the JE Collection, the J Collection, and the E Collection. Each contains documents extracted from the "NACSIS Academic Conference Paper Database". The J Collection (mlir/ntc1-j1) and the E Collection (clir/ntc1-e1) are used for the NTCIR Workshop 2.

Test Collection 2 (NTCIR-2) consists of two document collections: the J Collection and the E Collection. Each contains documents extracted from the "NACSIS Academic Conference Paper Database" and the "NACSIS Grant-in-Aid Scientific Research Database".

The J Collection contains Japanese documents with both Japanese titles and Japanese abstracts. It was constructed by extracting the Japanese parts of those documents in the database that have both a Japanese title and a Japanese abstract.

The E Collection contains English documents with both English titles and English abstracts. It was constructed by extracting the English parts of those documents in the database that have both an English title and an English abstract.

Segmented texts: for the Japanese documents in NTCIR-1 and 2, the texts will also be prepared in another form, in which they are segmented into terms and components of terms. They are segmented using a commercially available Japanese morphological analyzer that has been used by several operational IR systems in Japan. There are two kinds of segmentation: "hard segmentation" and "soft segmentation". The former is marked by an EUC double-byte space, which indicates a boundary between two terms, and the latter by an EUC double-byte underscore, which indicates a boundary between two components of a term. The components, combinations of components, and/or terms can be used as index terms. Single-byte spaces, which separate single-byte characters, are left as they are. (Description of how to generate segmented texts)
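
As an illustration of how the two separators might be consumed, here is a minimal Python sketch. It assumes the segmented text has already been decoded from EUC-JP into a Unicode string (for example with bytes.decode("euc_jp")), so that the double-byte space appears as U+3000 and the double-byte underscore as U+FF3F. The function names parse_segmented and candidate_index_terms are hypothetical and not part of the collection or its tools, and treating "combinations of components" as contiguous runs within one term is an assumption of this sketch.

```python
from typing import List

HARD_SEP = "\u3000"  # double-byte (full-width) space: boundary between two terms
SOFT_SEP = "\uff3f"  # double-byte (full-width) underscore: boundary between components


def parse_segmented(line: str) -> List[List[str]]:
    """Split one segmented line into terms, each given as a list of components.

    Single-byte spaces are not treated as separators and are left as they are.
    """
    terms = []
    for term in line.split(HARD_SEP):
        components = [c for c in term.split(SOFT_SEP) if c]
        if components:  # skip empty pieces caused by repeated separators
            terms.append(components)
    return terms


def candidate_index_terms(components: List[str]) -> List[str]:
    """List every contiguous combination of components within one term.

    This covers the single components, the partial combinations, and the
    whole term (assumption: only contiguous runs are of interest).
    """
    n = len(components)
    return ["".join(components[i:j]) for i in range(n) for j in range(i + 1, n + 1)]


if __name__ == "__main__":
    # Hypothetical sample line: two terms, the first made of two components.
    sample = "情報" + SOFT_SEP + "検索" + HARD_SEP + "システム"
    for comps in parse_segmented(sample):
        print(comps, candidate_index_terms(comps))
```

Whether a system indexes components, combinations, whole terms, or a mix of them is left to the participant; the sketch only enumerates the candidates.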