README for a Conversion Script for Xinhua News Service articles
-------------------------------------------------------------------------------
Data conversion script "xie2ntc.pl":
"xie2ntc.pl" is a script that converts "Xinhua News Service" document data
from the AQUAINT Corpus format to the NTCIR-4 WS data format.
USAGE:
% perl xie2ntc.pl <input_dir> <output_file>
<input_dir> is the directory that holds the "Xinhua News Service" document
files for one year (1998 or 1999). <output_file> is the file to which the
converted data is written.
For example, if the "Xinhua News Service" document files for 1998 are in
the directory "XIE/1998", run the script as follows:
% perl xie2ntc.pl XIE/1998 ntc4-e-xie98.txt
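Both years can be converted in one pass with a small shell function. This
is only a sketch: the function name convert_all is hypothetical, and it
assumes that xie2ntc.pl and the XIE directory (with 1998 and 1999
subdirectories) are in the current directory. The output file names follow
the ones used in this README.

```sh
# Convert each year's directory into its own output file.
# convert_all is a hypothetical helper, not part of xie2ntc.pl.
convert_all() {
    for year in 1998 1999; do
        # Drop the first two characters of the year, e.g. 1998 -> 98.
        suffix=${year#??}
        # Stop at the first failed conversion.
        perl xie2ntc.pl "XIE/$year" "ntc4-e-xie$suffix.txt" || return 1
    done
}
```

After defining the function in a POSIX shell, run convert_all to produce
ntc4-e-xie98.txt and ntc4-e-xie99.txt.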
-------------------------------------------------------------------------------
Data file size:
The original "Xinhua News Service" data set has one file for each day
of 1998 and 1999, that is, 365 files for 1998 and 365 files for 1999,
for a total of 730 files.
"xie2ntc.pl" converts the multiple files in a directory into a single file.
If you convert the files for 1998 and 1999 separately with the script,
the sizes of the converted files are as follows:
XIE/1998 -> ntc4-e-xie98.txt   #103,470   144.4MB
XIE/1999 -> ntc4-e-xie99.txt   #104,698   145.9MB
-------------------------------------------------
total                          #208,168   290.2MB
#: the number of documents
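
The file and document counts above can be checked with two small shell
helpers. This is a sketch under stated assumptions: count_files and
count_docs are hypothetical names, count_files assumes the year directory
holds only the daily document files, and count_docs assumes that every
document in the converted file starts on a line beginning with "<DOC>",
which this README does not confirm.

```sh
# Hypothetical helpers, not part of xie2ntc.pl.

# Count the regular files directly inside a directory (no recursion).
count_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Count the documents in a converted file.
# Assumption: every document starts on a line that begins with "<DOC>".
count_docs() {
    grep -c '^<DOC>' "$1"
}
```

If the assumptions hold, "count_files XIE/1998" should report 365 and
"count_docs ntc4-e-xie98.txt" should report 103470. The per-year document
counts sum to the listed total: 103,470 + 104,698 = 208,168.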