NTCIR-1 and 2
-
Test Collection 1 (NTCIR-1) consists of three
document collections: the JE Collection, the J Collection, and the E Collection.
Each contains documents extracted from the "NACSIS
Academic Conference Paper Database". The J Collection (mlir/ntc1-j1) and
the E Collection (clir/ntc1-e1) are used for NTCIR Workshop 2.
-
Test Collection 2 (NTCIR-2) consists of two document
collections: the J Collection and the E Collection. Each contains
documents extracted from the "NACSIS Academic Conference Paper Database"
and the "NACSIS Grant-in-Aid Scientific Research Database".
-
The J Collection contains Japanese documents with both
Japanese titles and Japanese abstracts. It was constructed by extracting
the Japanese parts of those documents in the database that have both a
Japanese title and a Japanese abstract.
-
The E Collection contains English documents with both
English titles and English abstracts. It was constructed by extracting
the English parts of those documents in the database that have both an
English title and an English abstract.
-
Segmented texts: for the Japanese documents in NTCIR-1
and NTCIR-2, the texts will also be prepared in a second form, segmented
into terms and components of terms. They are segmented using a commercially
available Japanese morphological analyzer that has been used by several
operational IR systems in Japan. There
are two kinds of segmentation: "hard segmentation" and "soft
segmentation". A hard segmentation is an EUC double-byte space, which marks
the boundary between two terms; a soft segmentation is an EUC double-byte
underscore, which marks the boundary between two components of a term. Components,
combinations of components, and/or terms can be used as index terms. Single-byte
spaces, which are used for segmenting single-byte characters, are left as
they are. (Description
of how to generate segmented texts)
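As a rough illustration of how the two separators might be consumed, the following Python sketch splits a segmented line into terms and components and then lists candidate index terms. It is not the tooling used to build the collection. It assumes the text has already been decoded from EUC-JP, where the double-byte space appears as U+3000 and the double-byte underscore as U+FF3F. The function names and the sample string are hypothetical, and leaving single-byte spaces inside a term untouched is an assumption based on the description above.

```python
from typing import List

# Assumption: after decoding EUC-JP, the double-byte space is U+3000 and
# the double-byte underscore is U+FF3F.
HARD_SEP = "\u3000"  # "hard segmentation": boundary between two terms
SOFT_SEP = "\uff3f"  # "soft segmentation": boundary between components of a term


def parse_segmented(line: str) -> List[List[str]]:
    """Split a segmented line into terms, each given as a list of components.

    Single-byte (ASCII) spaces are not treated as separators here; they are
    left inside the term exactly as they appear.
    """
    terms = []
    for term in line.split(HARD_SEP):
        components = [c for c in term.split(SOFT_SEP) if c]
        if components:
            terms.append(components)
    return terms


def candidate_index_terms(components: List[str]) -> List[str]:
    """Return every contiguous combination of components, from single
    components up to the whole term (hypothetical helper)."""
    n = len(components)
    return [
        "".join(components[i:j])
        for i in range(n)
        for j in range(i + 1, n + 1)
    ]


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the collection.
    sample = "情報＿検索　システム"
    for comps in parse_segmented(sample):
        print(comps, candidate_index_terms(comps))
```

When reading the raw files, a line of bytes could first be decoded with `raw.decode("euc_jp")` before being passed to `parse_segmented`. Whether an indexer keeps only components, only whole terms, or every combination is left to the system using the collection.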