Japanese Document Collection (Yomiuri Newspaper articles '98, '99)

The document collection consists of two parts: 132,995 Japanese newspaper articles from the Yomiuri Newspaper in 1998 and 242,985 Japanese newspaper articles from the Yomiuri Newspaper in 1999. The file "ntc4-j01-yomi98.txt" contains the 1998 articles, and the file "ntc4-j01-yomi99.txt" contains the 1999 articles.

All documents (i.e., newspaper articles) are encoded in Japanese EUC (Extended Unix Code). All files are compressed with gzip. The sizes of the text files and their compressed files are as follows (1,024B = 1KB, 1,024KB = 1MB):

  ntc4-j01-yomi98.txt      182.8MB
  ntc4-j01-yomi98.txt.gz    80.3MB
  ntc4-j01-yomi99.txt      311.4MB
  ntc4-j01-yomi99.txt.gz   134.9MB

Notice: There are three character codes commonly used for Japanese characters in Japan: JIS (ISO-2022-JP), Shift-JIS (SJIS), and EUC. PCs running Windows use Shift-JIS, while workstations and PCs running UNIX-like operating systems use EUC. We use EUC for the Japanese document data of the NTCIR Workshops. If you need the documents in another code (SJIS, Unicode, and so on), please convert them yourself (for example with "nkf") or ask us for a code-conversion script for Japanese.
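As an alternative to "nkf", the conversion described in the notice can be sketched in Python. This is a minimal sketch, not a script supplied with the collection. The function name and the output file name are hypothetical, and it assumes Python's standard "euc_jp" codec can decode every byte sequence in the files. If some articles contain vendor-specific characters that the codec rejects, passing errors="replace" is one possible workaround.

```python
import gzip
import shutil


def convert_euc_gz_to_utf8(src_gz, dst_txt, errors="strict"):
    """Decompress a gzip-compressed EUC-JP text file and write it as UTF-8.

    src_gz  : path to the compressed input, e.g. "ntc4-j01-yomi98.txt.gz"
    dst_txt : path of the UTF-8 output file (hypothetical name, chosen by the caller)
    errors  : codec error policy; "strict" raises on undecodable bytes,
              "replace" substitutes U+FFFD instead (assumption: may be
              needed if the data contains non-standard characters)
    """
    # newline="" disables newline translation so line endings are copied unchanged.
    with gzip.open(src_gz, "rt", encoding="euc_jp", errors=errors, newline="") as src, \
            open(dst_txt, "w", encoding="utf-8", newline="") as dst:
        # Stream in chunks so the whole file never has to fit in memory.
        shutil.copyfileobj(src, dst)
```

For example, convert_euc_gz_to_utf8("ntc4-j01-yomi98.txt.gz", "ntc4-j01-yomi98.utf8.txt") would produce a UTF-8 copy of the 1998 file. Changing the output encoding argument in the function to "shift_jis" would yield SJIS instead.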