#The following is based on two (2) e-mails.
1- Subject: [ntcadm-clir:1141] HK standard problem - duplication
2- Subject: [ntcadm-clir:1143] FAQ : HK standard : Discard one document from each duplicated pair of documents.

-------- Original Message --------
Subject: [ntcadm-clir:1141] HK standard problem - duplication
Date: Wed, 29 Oct 2003 11:02:44 +0900
From: "Kazuaki Kishida"
To: ntc4-clir-participant@nii.ac.jp
CC: ntcadm-clir@nii.ac.jp

Dear Users of Hong Kong Standard Data

I have to report another problem, i.e., duplications in the Hong Kong Standard data set (this is an ENGLISH document set).

-----------------------------------------------------------------------
1. Duplication

The file ntc4-e01-hk98.txt contains a set of duplications. That is, a set of 173 records (HK-199809070280001 - HK-199809070280173) appears twice in the file, i.e.,

  HK-199809070280001...
  ...
  HK-199809070280173...
  HK-199809070280001...
  ...
  HK-199809070280173...

Therefore, there are 173 duplicated records in the document set.

-----------------------------------------------------------------------
2. How to treat duplications

You can choose whichever method you like from (1) to (3).

(1) DOC LIST
  One method of removing these records is simply to delete the document IDs of these duplications from the document list of your search output.

(2) INDEXING
  Another method is to remove the duplications from the document set and run the indexing process again.

(3) OTHERS
  You can select any other method for removing the duplications from your search results.

* Please put the top 1000 documents in each file of your search results.

---
Please discard one set of the 173 documents (HK-199809070280001 - HK-199809070280173), and keep one set of the documents with these IDs in the document collection to be used. This means the total number of documents in the HK collection becomes "96,683" after discarding 173 records from the original collection of 96,856 records.
Please see the NTCIR-4 CLIR Web page at:
http://research.nii.ac.jp/ntcir/ntcir-ws4/clir/index.html#Document_set
The numbers of documents in each sub-file were updated after discarding problematic abnormal documents and one set of the duplicated documents.
---

---------------------------------------------------------------------
3. If you do NOT delete the duplications...

If you do NOT remove the duplications, the task organizers will delete the duplicated records before computing the values of the evaluation indicators. As a result, the number of documents included in your search output may become less than 1000.

---------------------------------------------------------------------
4. Please describe your method for removing duplications in your system description.

A field for this is provided in the system description template. Could you please write your method for removing the duplications in that field? For example,

(a) "Removing duplications: DOC LIST"
(b) "Removing duplications: INDEXING"
(c) "Removing duplications: OTHERS" (please also describe the method)
(d) "Removing duplications: NONE"

------------------------------------------------------------------------
My best regards,

Kazuaki KISHIDA
One of the task organizers

-------------------------------------------------
Kazuaki KISHIDA
Professor,
Faculty of Cultural Information Resources,
Surugadai University
698 Azu, Hanno, Saitama 357-8555 JAPAN
E-mail: kishida@surugadai.ac.jp
        CXE02062@nifty.ne.jp
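
As an illustration of the DOC LIST method in section 2, the following is a minimal Python sketch and not part of the organizers' instructions. It assumes the search output for one topic is available as a ranked list of document ID strings, and that a duplicated record shows up as the same ID occurring more than once in that list. The function names dedupe_ranked_ids and duplicated_id_range are hypothetical, chosen only for this example.

```python
from typing import Iterable, List

# The notice asks for the top 1000 documents in each result file.
MAX_RESULTS = 1000


def duplicated_id_range(first: int = 1, last: int = 173) -> List[str]:
    """Build the IDs HK-199809070280001 .. HK-199809070280173 from the notice."""
    return [f"HK-19980907028{n:04d}" for n in range(first, last + 1)]


def dedupe_ranked_ids(ranked_ids: Iterable[str],
                      limit: int = MAX_RESULTS) -> List[str]:
    """Keep only the first (highest-ranked) occurrence of each document ID.

    Assumption: ranked_ids is ordered best-first, so dropping later
    repeats discards the lower-ranked copy of a duplicated record.
    """
    seen = set()
    kept = []
    for doc_id in ranked_ids:
        if doc_id in seen:
            continue  # a repeated ID: skip the duplicate entry
        seen.add(doc_id)
        kept.append(doc_id)
        if len(kept) == limit:
            break  # stop once the requested number of documents is reached
    return kept


if __name__ == "__main__":
    dup_ids = duplicated_id_range()
    print(len(dup_ids), dup_ids[0], dup_ids[-1])

    # Hypothetical ranked output in which one duplicated record appears twice.
    ranked = ["HK-199809070280001", "DOC-A", "HK-199809070280001", "DOC-B"]
    print(dedupe_ranked_ids(ranked))

    # Collection size after discarding one copy of each duplicated record.
    print(96856 - len(dup_ids))
```

If the system retrieves somewhat more than 1000 candidates per topic before this step, the list can still be filled to 1000 after the repeats are dropped, which avoids the shortfall described in section 3.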