README for a Conversion Script for Xinhua News Service articles
-------------------------------------------------------------------------------
Data conversion script "xie2ntc.pl":

"xie2ntc.pl" is a script that converts the "Xinhua News Service" document
data from the AQUAINT Corpus format to the NTCIR-4 WS data format.

USAGE:
  % perl xie2ntc.pl <directory> <output file>

<directory> is supposed to be a directory that contains the "Xinhua News
Service" document files for one year, either 1998 or 1999. <output file> is
the file to which the converted documents are written.

For example, if the "Xinhua News Service" document files for 1998 are in the
directory "XIE/1998", you can run the script as follows:

  % perl xie2ntc.pl XIE/1998 ntc4-e-xie98.txt

-------------------------------------------------------------------------------
Data file size:

The original "Xinhua News Service" data set has one file for each day of 1998
and 1999, that is, 365 files for 1998 and 365 files for 1999, amounting to 730
files in total. "xie2ntc.pl" merges the multiple files in a directory into a
single file. If you convert the files for 1998 and 1999 separately with the
script, the sizes of the converted files are as follows:

  XIE/1998 -> ntc4-e-xie98.txt     #103,470     144.4MB
  XIE/1999 -> ntc4-e-xie99.txt     #104,698     145.9MB
  -------------------------------------------------
  total                            #208,168     290.2MB

  #: the number of documents
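
The two conversions listed above can be run in one pass. The following is only a minimal sketch, assuming that "xie2ntc.pl" is in the current directory and that the yearly files are in "XIE/1998" and "XIE/1999" as in the example. The helper names "out_name" and "convert_year" are hypothetical and are not part of the distributed script.

```shell
#!/bin/sh
# Hypothetical helper (not part of xie2ntc.pl): build the output file name
# used in this README from a four-digit year, e.g. 1998 -> ntc4-e-xie98.txt.
out_name() {
    year=$1
    # ${year#??} drops the first two characters, leaving the two-digit year.
    printf 'ntc4-e-xie%s.txt\n' "${year#??}"
}

# Hypothetical helper: convert one year's directory with xie2ntc.pl.
# Assumes the layout XIE/<year>; skips the year if anything is missing.
convert_year() {
    year=$1
    dir="XIE/$year"
    if [ -d "$dir" ] && [ -f xie2ntc.pl ]; then
        perl xie2ntc.pl "$dir" "$(out_name "$year")"
    else
        echo "skipping $year: $dir or xie2ntc.pl not found" >&2
    fi
}

# Run the conversion for both years covered by the data set.
for year in 1998 1999; do
    convert_year "$year"
done
```

Each year is still converted separately, so the result is the same two output files, "ntc4-e-xie98.txt" and "ntc4-e-xie99.txt", as when the script is run by hand.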