TITLE=readme-e.txt DATE=1999-11-01

NACSIS Test Collection for Information Retrieval Systems 1 (NTCIR-1)
README

1 File Description

This CD-ROM contains the following files:

  readme-j.pdf - Japanese version of this file (PDF, fonts embedded)
  readme-j.txt - Japanese version of this file (EUC)
  readme-e.txt - This file
  agreem-j.pdf - Memorandum on the Permission to Use Test Collection 1 (Japanese)
  agreem-e.pdf - Memorandum on the Permission to Use Test Collection 1 (English)
  tagree-j.pdf - Memorandum on the Permission to Use Tagged Data Collection (Japanese)
  tagree-e.pdf - Memorandum on the Permission to Use Tagged Data Collection (English)
  manual-j.pdf - User Manual (Japanese)
  manual-e.pdf - User Manual (English)
  adhoc.tgz    - Document set (Japanese and English) and relevance judgments for ad hoc retrieval
  mlir.tgz     - Document set (Japanese) and relevance judgments for monolingual retrieval
  clir.tgz     - Document set (English) and relevance judgments for cross-lingual retrieval
  topics.tgz   - Search topics (Japanese)
  tmrec.tgz    - Tagged data collection (for research on automatic term recognition)

- Adobe Acrobat Reader is needed to read *.pdf files.
- *.tgz files are tarred, then gzipped. Please use "gzip -dc | tar xvf -",
  giving the archive to gzip as input (for example,
  "gzip -dc adhoc.tgz | tar xvf -"), to extract the original data on a
  UNIX system. Japanese data is in EUC code.

The files included in the *.tgz files, and their original file sizes, are given below. File sizes are stated with 1 MB = 1024 * 1024 bytes.
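On systems without the gzip and tar commands, the archives can also be unpacked from a script. The following is a minimal sketch in Python and is not part of the NTCIR-1 distribution: the function names extract_archive, size_in_mb and read_euc_text are hypothetical, and it assumes that "EUC code" means Japanese EUC, which Python names "euc_jp".

```python
import tarfile
from pathlib import Path

BYTES_PER_MB = 1024 * 1024  # this README's definition of 1 MB


def extract_archive(archive_path, dest_dir):
    """Unpack a gzipped tar archive and return the member names it held."""
    with tarfile.open(archive_path, mode="r:gz") as tar:
        names = tar.getnames()
        # Use the safer "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest_dir, **kwargs)
    return names


def size_in_mb(path):
    """File size in MB (1 MB = 1024 * 1024 bytes), rounded to one decimal."""
    return round(Path(path).stat().st_size / BYTES_PER_MB, 1)


def read_euc_text(path):
    """Read a whole text file, assuming it is stored in Japanese EUC."""
    return Path(path).read_text(encoding="euc_jp")
```

For example, extract_archive("adhoc.tgz", ".") would create the "adhoc/" directory, and size_in_mb("adhoc/ntc1-je1") can then be compared with the size listed for that file. Note that read_euc_text loads the whole file into memory, so for the document sets themselves it is better to iterate over open(path, encoding="euc_jp") line by line.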

(1) adhoc.tgz
    Document set (Japanese and English) and relevance judgments for ad hoc retrieval

    The following files are included under the directory "adhoc/":

    ntc1-je1 (576.6 MB)
      - Document set (JE Collection), Japanese and English documents
    rel1_ntc1-je1_0001-0030
      - Relevance judgments of topic0001-0030 against ntc1-je1
        (Relevant File; A-judgments are treated as "relevant")
    rel2_ntc1-je1_0001-0030
      - Relevance judgments of topic0001-0030 against ntc1-je1
        (Partially Relevant File; A- and B-judgments are treated as "relevant")
    rel1_ntc1-je1_0031-0083
      - Relevance judgments of topic0031-0083 against ntc1-je1
        (Relevant File; A-judgments are treated as "relevant")
    rel2_ntc1-je1_0031-0083
      - Relevance judgments of topic0031-0083 against ntc1-je1
        (Partially Relevant File; A- and B-judgments are treated as "relevant")

(2) mlir.tgz
    Document set (Japanese) and relevance judgments for monolingual retrieval

    The following files are included under the directory "mlir/":

    ntc1-j1 (311.5 MB)
      - Document set (J Collection), Japanese documents
    rel1_ntc1-j1_0001-0030
      - Relevance judgments of topic0001-0030 against ntc1-j1
        (Relevant File; A-judgments are treated as "relevant")
    rel2_ntc1-j1_0001-0030
      - Relevance judgments of topic0001-0030 against ntc1-j1
        (Partially Relevant File; A- and B-judgments are treated as "relevant")
    rel1_ntc1-j1_0031-0083
      - Relevance judgments of topic0031-0083 against ntc1-j1
        (Relevant File; A-judgments are treated as "relevant")
    rel2_ntc1-j1_0031-0083
      - Relevance judgments of topic0031-0083 against ntc1-j1
        (Partially Relevant File; A- and B-judgments are treated as "relevant")

(3) clir.tgz
    Document set (English) and relevance judgments for cross-lingual retrieval

    The following files are included under the directory "clir/":

    ntc1-e1 (275.5 MB)
      - Document set (E Collection), English documents
    rel1_ntc1-e1_0001-0030
      - Relevance judgments of topic0001-0030 against ntc1-e1
        (Relevant File; A-judgments are treated as "relevant")
    rel2_ntc1-e1_0001-0030
      - Relevance judgments of topic0001-0030 against ntc1-e1
        (Partially Relevant File; A- and B-judgments are treated as "relevant")
    rel1_ntc1-e1_0031-0083
      - Relevance judgments of topic0031-0083 against ntc1-e1
        (Relevant File; A-judgments are treated as "relevant")
    rel2_ntc1-e1_0031-0083
      - Relevance judgments of topic0031-0083 against ntc1-e1
        (Partially Relevant File; A- and B-judgments are treated as "relevant")

(4) topics.tgz
    Search topics (Japanese)

    The following files are included under the directory "topics/":

    topic0001-0030 - Used as the training topics at the 1st NTCIR Workshop
    topic0031-0083 - Used as the test topics at the 1st NTCIR Workshop

(5) tmrec.tgz
    Tagged data collection for research on automatic term recognition

    The following files are included under the directory "tmrec/":

    README.j        - Description of the files in this directory (Japanese)
    README          - Description of the files in this directory (English)
    README.termtagj - Description of the selection of term candidates and tagging (Japanese)
    README.termtage - Description of the selection of term candidates and tagging (English)
    ntc1-tt0        - Tagged data collection
    ntc1-tu0        - Untagged data collection
    ntc1-ttg        - Data collection with the tags for term candidates
    ntc1-tml        - List of term candidates extracted from ntc1-ttg

2 Format of the Data and Usage

- Plain text files use EUC code.
- For the format of each file and its usage, please consult the NTCIR-1
  manual (manual-e.pdf or manual-j.pdf).
- Each relevance judgment file is specific to one combination of retrieval
  task, document set, and topic range. Please use the files only in the
  matching combination. For detailed information, please consult
  Section 5.2 and Fig. 5-2 of the manual.
- The use of Test Collection 1 (NTCIR-1) and the Tagged Data Collection is
  permitted under "the Memorandum on the Permission to Use Test
  Collection 1" and "the Memorandum on the Permission to Use Tagged Data
  Collection 1".

3 Notice about Documents

One purpose of the original database is to provide a timely alerting service for papers presented at Japanese academic conferences, so documents are placed in the database without any revision or modification by professional abstractors or editors. The documents are author abstracts, and the discourse-level structure of the texts may differ from that found in abstracts written by professional abstractors.

As part of the philosophy of leaving the data as close to the original as possible, and because it is impossible to check all the data manually, there are many "errors" in the data. These range from errors in the original data and other typographical errors to errors introduced in the reformatting done at NACSIS and by the Test Collection Project Group. Error checking has concentrated on keeping the data readable rather than on correcting content: automated checks were run for control characters, for correct matching of beginning and end tags, and for complete ACCN (accession number) fields.

For automatic term recognition it is appropriate to correct obvious linguistic errors, so we have manually cleaned up the Japanese part of the data in the files ntc1-tt0 and ntc1-tu0. The English part has not been corrected, because the Automatic Term Recognition task in the first NTCIR Workshop was monolingual Japanese and because of a shortage of time.

The documents in NTCIR-1 are extracted from the "NACSIS Academic Conference Paper Database" for use in research on information retrieval and related areas. Please note, therefore, that the documents are only part of the original database and that the coverage is incomplete.
As a result, the documents in NTCIR-1 cannot be used for information purposes. Please understand that neither the organizer of the NTCIR nor NACSIS is responsible for any problems or damage caused by the use of NTCIR-1.

4 Inquiries

NTCIR Project, R & D Dept.,
National Center for Science Information Systems (NACSIS)
Attn: Noriko Kando
Email: ntcadm@nii.ac.jp
Postal address: 3-29-1 Otsuka, Bunkyo-ku, Tokyo 112-8640, JAPAN
Phone: +81-3-3942-6969
Fax: +81-3-5395-7064