# The following is based on two e-mails:
    1- Subject: [ntcadm-clir:1141] HK standard problem - duplication
    2- Subject: [ntcadm-clir:1143] FAQ : HK standard : Discard one document 
                from each duplicated pair of documents.


-------- Original Message --------
Subject: [ntcadm-clir:1141] HK standard problem - duplication
Date: Wed, 29 Oct 2003 11:02:44 +0900
From: "Kazuaki Kishida" <kishida@surugadai.ac.jp>
To: ntc4-clir-participant@nii.ac.jp
CC: ntcadm-clir@nii.ac.jp


Dear Users of Hong Kong Standard Data

I have to report on another problem, i.e., duplications in the 
Hong Kong Standard data set (this is an ENGLISH document set).

-----------------------------------------------------------------------
1. Duplication

The file ntc4-e01-hk98.txt contains a set of duplications.
That is, a set of 173 records (HK-199809070280001 - HK-199809070280173) 
appears twice in the file, i.e., 

 HK-199809070280001...
...
 HK-199809070280173...
 HK-199809070280001...
...
 HK-199809070280173...

Therefore, there are 173 duplicated records in the document set.
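
As a rough illustration, the affected IDs can be enumerated and a list of
document IDs checked for repeats. This is only a sketch: the function names
are hypothetical, and it assumes the document IDs have already been
extracted from the file in file order (the record format is not described
in this message).

```python
from collections import Counter


def duplicated_id_range(prefix="HK-19980907028", first=1, last=173):
    """Enumerate the IDs HK-199809070280001 .. HK-199809070280173."""
    return [f"{prefix}{n:04d}" for n in range(first, last + 1)]


def find_repeated_ids(doc_ids):
    """Return, in sorted order, every ID that occurs more than once."""
    counts = Counter(doc_ids)
    return sorted(doc_id for doc_id, c in counts.items() if c > 1)
```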
-----------------------------------------------------------------------
2. How to treat duplications

You can choose whichever method you like from (1) to (3).
(1) DOC LIST
One method is simply to delete the document IDs of these 
duplications from the document list in your search output. 
(2) INDEXING
Another method is to remove these duplications from the 
document set and run the indexing process again.
(3) OTHERS
You can use any other method for removing the duplications 
from your search results.

* Please put the top 1000 documents in each file of your search 
results.
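
The DOC LIST method might look roughly like the sketch below. It is not
the organizers' procedure: the function name is hypothetical, and it
assumes your system returns a ranked list of document IDs that is deeper
than 1000, so that 1000 documents still remain after repeats are dropped.

```python
def dedupe_ranked_list(ranked_ids, limit=1000):
    """Keep the first (highest-ranked) occurrence of each document ID
    and stop once `limit` distinct documents have been collected."""
    seen = set()
    result = []
    for doc_id in ranked_ids:
        if doc_id in seen:
            continue  # a lower-ranked copy of an already listed document
        seen.add(doc_id)
        result.append(doc_id)
        if len(result) == limit:
            break
    return result
```

If the ranked list holds fewer than `limit` distinct IDs, the result is
simply shorter, which is the same effect described in section 3.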


---

Please discard one set of the 173 documents (HK-199809070280001 - HK-199809070280173),
and keep the other set of documents with these IDs in the document collection to be used.

This means the total number of records in the HK collection becomes "96,683" 
after discarding 173 records from the original collection of 96,856 records.
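
For those who re-index, a minimal sketch of discarding the second copy and
checking the resulting count is given here. It assumes the collection has
already been parsed into (doc_id, text) pairs in file order; the parsing
step, the function name and the constant names are hypothetical and not
part of the NTCIR tools.

```python
ORIGINAL_TOTAL = 96_856   # records in the original HK collection
DUPLICATED = 173          # records that appear twice
EXPECTED_TOTAL = ORIGINAL_TOTAL - DUPLICATED  # 96,683


def drop_repeated_records(records):
    """Keep only the first record seen for each document ID."""
    seen = set()
    kept = []
    for doc_id, text in records:
        if doc_id not in seen:
            seen.add(doc_id)
            kept.append((doc_id, text))
    return kept
```

After running this over the whole HK collection, comparing the number of
kept records with EXPECTED_TOTAL is a quick sanity check before indexing.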

Please see the NTCIR-4 CLIR Web page at:
http://research.nii.ac.jp/ntcir/ntcir-ws4/clir/index.html#Document_set

The numbers of documents in each sub-file were updated after discarding 
the problematic (abnormal) documents and one set of the duplicated documents.

---


---------------------------------------------------------------------
3. If you do NOT delete the duplications...

If you do NOT remove the duplications, the task organizers will 
delete the duplicated records before computing values of evaluation 
indicators. As a result, it is possible that the number of documents 
included in your search output becomes less than 1000.
---------------------------------------------------------------------
4. Please describe your method for removing duplications in your 
system description.

A field for this is prepared in the template of the system description. 
Could you please write your method for removing the duplications in 
that field? For example,
(a)"Removing duplications: DOC LIST"
(b)"Removing duplications: INDEXING" 
(c)"Removing duplications: OTHERS" (please also describe the method)
(d)"Removing duplications: NONE"
------------------------------------------------------------------------

My best regards,

Kazuaki KISHIDA
One of the task organizers
-------------------------------------------------
Kazuaki KISHIDA
Professor, 
Faculty of Cultural Information Resources,
Surugadai University
698 Azu, Hanno, Saitama 357-8555 JAPAN
E-mail:
kishida@surugadai.ac.jp
CXE02062@nifty.ne.jp