README for Conversion Script for Yomiuri Newspaper '98 and '99
-------------------------------------------------------------------------------
Data conversion script "yomi2ntc.pl":

"yomi2ntc.pl" is a script that converts document data from the "Yomiuri
Newspaper" format to the NTCIR-4 WS data format.

USAGE:

  % jperl yomi2ntc.pl <input directory> <output file>

<input directory> is the directory that holds the "Yomiuri Newspaper"
document files for one year (1998 or 1999). <output file> is the converted
file to be written.

For example, if the "Yomiuri Newspaper" document files of 1998 are in the
directory "Yomiuri/1998", you can run the script as follows:

  % jperl yomi2ntc.pl Yomiuri/1998 ntc4-j01-yomi98.txt

-------------------------------------------------------------------------------
Data file size:

The original "Yomiuri Newspaper" data set has one CSV file for each month of
1998 and 1999, that is, 12 files for 1998 and 12 files for 1999, for a total
of 24 files. They are coded in Shift-JIS (SJIS) code.(*)

"yomi2ntc.pl" converts the multiple files in a directory into a single file
in EUC code.(*) If you convert the files for 1998 and 1999 separately with
the script, the sizes of the converted files are as follows:

  CSV files for 1998 -> ntc4-j01-yomi98.txt   #132,995   182.8MB
  CSV files for 1999 -> ntc4-j01-yomi99.txt   #242,985   311.4MB
  ------------------------------------------------------------
                                      total   #375,980   494.2MB

  #: the number of documents

(*) There are three character codes used for Japanese characters in Japan.
    PCs running Windows use Shift-JIS (SJIS) code, while workstations and PCs
    running UNIX-like OSes use EUC code. We use EUC code for the Japanese
    document data of the NTCIR Workshops. If you need the documents in
    another code (SJIS, Unicode, and so on), please convert them yourself
    (for example with "nkf") or ask us about a code-conversion script for
    the Japanese language.
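
The note above on character codes leaves re-encoding of the converted files to the reader. As an illustration only, the following is a minimal Python sketch of such a conversion; it is not part of "yomi2ntc.pl" or of the NTCIR distribution. The function name convert_encoding is hypothetical, and the sketch assumes that Python's standard "euc_jp", "shift_jis" and "utf-8" codecs cover every character in the data.

```python
def convert_encoding(src_path, dst_path, src_enc="euc_jp", dst_enc="utf-8"):
    """Re-encode a text file from src_enc to dst_enc, one line at a time.

    Hypothetical helper for illustration; not part of yomi2ntc.pl.
    """
    # newline="" disables newline translation, so line endings are copied as-is.
    # Reading line by line avoids loading a file of several hundred MB at once.
    with open(src_path, "r", encoding=src_enc, newline="") as src, \
         open(dst_path, "w", encoding=dst_enc, newline="") as dst:
        for line in src:
            # A character that dst_enc cannot represent raises UnicodeEncodeError
            # here instead of being dropped silently.
            dst.write(line)
```

For instance, calling convert_encoding("ntc4-j01-yomi98.txt", "yomi98-utf8.txt") would write a UTF-8 copy of the 1998 file; the output name "yomi98-utf8.txt" is only an example. Passing dst_enc="shift_jis" would produce an SJIS copy instead.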