-----------------------------------------------------------------------
About the Document Data of the NTCIR-5 WEB Test Collection (NW1000G-04)
-----------------------------------------------------------------------

This file describes the files comprising the Document Data of the
NTCIR-5 WEB Test Collection, ``NW1000G-04.''

1. List of data
~~~~~~~~~~~~~~~~

NW1000G-04 consists of the following data:

[lists]
  lists/sitelist/     : list of crawled sites
  lists/doclist/      : list of contained Web pages
  lists/linklist.out/ : list of forward links between pages contained
                        in doclist
  lists/linklist.in/  : list of backward links between pages contained
                        in doclist
  lists/anclist.out/  : list of anchor texts attached to forward links
                        between pages contained in doclist
  lists/anclist.in/   : list of anchor texts attached to backward links
                        between pages contained in doclist

[document data]
  raw/   : Original document data as they were crawled
  euc/   : Document data with Japanese characters converted to EUC code
  cook/  : Document data in EUC code with unnecessary tags removed
  mecab/ : Document data processed by the Japanese morphological
           analyzer MeCab

Note: The document data directories listed above also contain files
with the extensions ".encode" and ".filelist". Please ignore them.

2. "sitelist"
~~~~~~~~~~~~~

This is a list of crawled sites. All the documents in the "doclist"
were crawled from these sites.

Each list item consists of a site ID and a host name, separated by a
single tab character and terminated by a new line character.

The site ID is a string of seven decimal digits and is unique within
the list. Site IDs are assigned in dictionary order of host names, but
they are not necessarily contiguous.

The server type is fixed to "http". The host name is a DNS host name.
The port number of the crawled sites is limited to "80".

The sitelist files are split into groups of 10,000 site IDs. The name
of each sitelist file is the concatenation of the first three
characters of the site IDs and the string "xxxx.sitelist" (e.g. the
site IDs from 1230000 to 1239999 are stored in the file
"123xxxx.sitelist").

-- sample of the sitelist files: 073xxxx.sitelist
0730011 http://www.barnes.co.jp
0730079 http://www.barneys.co.jp
0730203 http://www.barockhaus.co.jp
0730227 http://www.baron-ik.co.jp
0730229 http://www.baron.co.jp
(snip)
0739864 http://www.bec-csk.co.jp
0739869 http://www.bec.co.jp
0739876 http://www.bec1993.co.jp
0739891 http://www.because.co.jp
0739926 http://www.becgroup.co.jp

3. "doclist"
~~~~~~~~~~~~

This is a list of the documents included in the Document Data, which
were crawled from the sites listed in the "sitelist".

Each list item consists of a document ID and a URL, separated by a
single tab character and terminated by a new line character.

A document ID is the concatenation of the site ID (seven digits), '_'
and the page ID (seven digits), and is unique within the list. The
page IDs within each host are assigned in dictionary order of the URLs.

The doclist files are split into groups of 10,000 site IDs. The name
of each doclist file is the concatenation of the first three
characters of the site IDs and the string "xxxx.doclist" (e.g. the
documents of the sites with IDs from 1230000 to 1239999 are listed in
the file "123xxxx.doclist").

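The doclist item format and file naming described above can be handled
with a few lines of code. The following is only a minimal sketch in
Python, assuming the tab-separated layout stated above; the names
DocEntry, parse_doclist_line and list_file_name are hypothetical and
are not part of the collection.

```python
from typing import NamedTuple


class DocEntry(NamedTuple):
    site_id: str  # seven decimal digits
    page_id: str  # seven decimal digits
    url: str


def parse_doclist_line(line: str) -> DocEntry:
    """Split one doclist item ("<site ID>_<page ID><TAB><URL>") into parts."""
    doc_id, url = line.rstrip("\n").split("\t", 1)
    site_id, page_id = doc_id.split("_", 1)
    # Both halves of a document ID are documented as seven digits.
    for part in (site_id, page_id):
        if len(part) != 7 or not part.isdigit():
            raise ValueError(f"malformed document ID: {doc_id!r}")
    return DocEntry(site_id, page_id, url)


def list_file_name(site_id: str, extension: str = "doclist") -> str:
    """Name of the list file that covers the given site ID."""
    # The first three characters of the site ID select the file.
    return f"{site_id[:3]}xxxx.{extension}"
```

For instance, list_file_name("1234567") gives "123xxxx.doclist", which
agrees with the naming example above.
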
-- sample of the doclist files: 073xxxx.doclist
0730011_0000001 http://www.barnes.co.jp/
0730011_0000002 http://www.barnes.co.jp/Dew.htm
0730011_0000003 http://www.barnes.co.jp/Ene-Pow.htm
0730011_0000004 http://www.barnes.co.jp/IR.htm
0730011_0000005 http://www.barnes.co.jp/News.htm
(snip)
0739926_0000314 http://www.becgroup.co.jp/zawaz/catalog.html
0739926_0000315 http://www.becgroup.co.jp/zawaz/home.html
0739926_0000316 http://www.becgroup.co.jp/zawaz/order/order.html
0739926_0000317 http://www.becgroup.co.jp/zawaz/r_index.html
0739926_0000318 http://www.becgroup.co.jp/zawaz/up_bar.html

4. "linklist.out"
~~~~~~~~~~~~~~~~~

This is a list of forward links between pages contained in the
"doclist".

Each list item consists of two pairs of a document ID and a URL, the
first for the originating page and the second for the destination
page.

The linklist.out files are split into groups of 10,000 site IDs. The
name of each file is the concatenation of the first three characters
of the site IDs of the originating pages and the string "xxxx.outlink"
(e.g. the links originating from pages in sites with IDs from 1230000
to 1239999 are stored in the file "123xxxx.outlink").

-- sample of the linklist files: 073xxxx.outlink
0730011_0000001 http://www.barnes.co.jp/ 0730011_0000002 http://www.barnes.co.jp/Dew.htm
0730011_0000001 http://www.barnes.co.jp/ 0730011_0000003 http://www.barnes.co.jp/Ene-Pow.htm
0730011_0000001 http://www.barnes.co.jp/ 0730011_0000004 http://www.barnes.co.jp/IR.htm
0730011_0000001 http://www.barnes.co.jp/ 0730011_0000006 http://www.barnes.co.jp/Non-dest.htm
0730011_0000001 http://www.barnes.co.jp/ 0730011_0000010 http://www.barnes.co.jp/semicon.htm
(snip)
0739926_0000304 http://www.becgroup.co.jp/kentos/umeda/par.html 0739926_0000304 http://www.becgroup.co.jp/kentos/umeda/par.html
0739926_0000305 http://www.becgroup.co.jp/kentos/umeda/sys.html 0739926_0000304 http://www.becgroup.co.jp/kentos/umeda/par.html
0739926_0000312 http://www.becgroup.co.jp/up_bar.html 0739926_0000278 http://www.becgroup.co.jp/home.html
0739926_0000315 http://www.becgroup.co.jp/zawaz/home.html 0739926_0000317 http://www.becgroup.co.jp/zawaz/r_index.html
0739926_0000316 http://www.becgroup.co.jp/zawaz/order/order.html 0739926_0000314 http://www.becgroup.co.jp/zawaz/catalog.html

5. "linklist.in"
~~~~~~~~~~~~~~~~~

This is a list of backward links between pages contained in the
"doclist".

Each list item consists of two pairs of a document ID and a URL, the
first for the destination page and the second for the originating
page.

The linklist.in files are split into groups of 10,000 site IDs. The
name of each file is the concatenation of the first three characters
of the site IDs of the destination pages and the string "xxxx.inlink"
(e.g. the links pointing to pages in sites with IDs from 1230000 to
1239999 are stored in the file "123xxxx.inlink").

-- sample of the linklist files: 073xxxx.inlink
0730011_0000001 http://www.barnes.co.jp/ 1852829_0000024 http://www.semiconbrain.com/50/ni.htm a,href
0730011_0000002 http://www.barnes.co.jp/Dew.htm 0730011_0000001 http://www.barnes.co.jp/ a,href
0730011_0000002 http://www.barnes.co.jp/Dew.htm 0730011_0000002 http://www.barnes.co.jp/Dew.htm a,href
0730011_0000002 http://www.barnes.co.jp/Dew.htm 0730011_0000003 http://www.barnes.co.jp/Ene-Pow.htm a,href
0730011_0000002 http://www.barnes.co.jp/Dew.htm 0730011_0000004 http://www.barnes.co.jp/IR.htm a,href
(snip)
0739990_0000040 http://www.becker-japan.net/totop.html 0739990_0000028 http://www.becker-japan.net/rvolc.html a,href
0739990_0000040 http://www.becker-japan.net/totop.html 0739990_0000029 http://www.becker-japan.net/rvolvp.html a,href
0739990_0000040 http://www.becker-japan.net/totop.html 0739990_0000030 http://www.becker-japan.net/scb.html a,href
0739990_0000040 http://www.becker-japan.net/totop.html 0739990_0000031 http://www.becker-japan.net/scvp.html a,href
0739990_0000040 http://www.becker-japan.net/totop.html 0739990_0000039 http://www.becker-japan.net/toride.html a,href

6. "anclist.out"
~~~~~~~~~~~~~~~~~

This is a list of anchor texts attached to forward links between pages
contained in the "doclist".

Each list item consists of the document ID of the originating page,
the document ID of the destination page and an anchor text.

The anclist.out files are split into groups of 10,000 site IDs. The
name of each file is the concatenation of the first three characters
of the site IDs of the originating pages and the string "xxxx.outlink"
(e.g. the anchor texts of links originating from pages in sites with
IDs from 1230000 to 1239999 are stored in the file "123xxxx.outlink").

-- sample of the anclist files: 073xxxx.outlink
0730011_0000001 0730011_0000001 ボタン
0730011_0000001 0730011_0000002 露点温度測定器
0730011_0000001 0730011_0000003 光パワー/光エネルギー測定機器
0730011_0000001 0730011_0000004 赤外線応用製品
0730011_0000001 0730011_0000005 ボタン
(snip)
0739990_0000038 0739990_0000033 Asia
0739990_0000038 0739990_0000034 Europe
0739990_0000038 0739990_0000035 Japan
0739990_0000038 0739990_0000036 Oceania
0739990_0000038 0739990_0000037 U.S.A

7. "anclist.in"
~~~~~~~~~~~~~~~~~

This is a list of anchor texts attached to backward links between
pages contained in the "doclist".

Each list item consists of the document ID of the destination page,
the document ID of the originating page and an anchor text.

The anclist.in files are split into groups of 10,000 site IDs. The
name of each file is the concatenation of the first three characters
of the site IDs of the destination pages and the string "xxxx.inlink"
(e.g. the anchor texts of links pointing to pages in sites with IDs
from 1230000 to 1239999 are stored in the file "123xxxx.inlink").

-- sample of the anclist files: 073xxxx.inlink
0730011_0000001 0730011_0000011 Top
0730011_0000001 1852829_0000024 http://www.barnes.co.jp
0730011_0000001 0730011_0000002 ホームページ
0730011_0000001 0730011_0000003 ホームページ
0730011_0000001 0730011_0000004 ホームページ
(snip)
0739990_0000038 0739990_0000020 サービス網
0739990_0000038 0739990_0000021 サービス網
0739990_0000039 0739990_0000023 写 真
0739990_0000040 0739990_0000020 トップページへ
0739990_0000040 0739990_0000021 トップページへ

8. "raw" document data
~~~~~~~~~~~~~~~~~~~~~~

This is a set of web data as they were crawled from the sites in the
"sitelist".

The directory structure and the naming scheme are as follows:

- The name of the first level subdirectory is the first three
  characters of the site ID.
- The name of the second level subdirectory is the concatenation of
  the fourth and fifth characters of the site ID and "xx".
- The name of the third level subdirectory is the site ID itself
  (seven digits).
- The name of the fourth level subdirectory is the first three
  characters of the page ID.
- The name of the fifth level subdirectory is the concatenation of
  the fourth and fifth characters of the page ID and "xx".

The web page data crawled from each site are stored in the
corresponding fifth level subdirectory. The file name of each page is
the concatenation of the site ID, "_", the page ID, and the extension
".dat".

e.g. site ID: 1234567, page ID: 0000123
     ==> file path: raw/123/45xx/1234567/000/01xx/1234567_0000123.dat

9. "euc" document data
~~~~~~~~~~~~~~~~~~~~~~~

This is a set of web data produced from the "raw" data by converting
Japanese two-byte characters to EUC code.

The directory structure and the naming scheme are the same as those of
the "raw" document data, except that the extension is ".euc".

e.g. site ID: 1234567, page ID: 0000123
     ==> file path: euc/123/45xx/1234567/000/01xx/1234567_0000123.euc

10. "cook" document data
~~~~~~~~~~~~~~~~~~~~~~~~

This is a set of web data produced from the "euc" data by removing
unnecessary tags and other content.

The directory structure and the naming scheme are the same as those of
the "raw" document data, except that the extension is ".cooked".

e.g. site ID: 1234567, page ID: 0000123
     ==> file path: cook/123/45xx/1234567/000/01xx/1234567_0000123.cooked

The web page data were processed with the following rules:

(1) HTML comments, XML declarations and XML definitions are removed.
(2) tag pairs of "" and their contents are removed.
(3) For each "<meta>" tag, if the value of the "name" attribute is
    either "keywords" or "description", then the value of the
    "content" attribute is output in a single line prefixed with "".
    e.g. ==> information retrieval, test collection
(4) For each "" tag, the value of the "alt" attribute is output in a
    single line prefixed with "".
(5) All the other tags are simply removed.
(6) Character code entity references are removed
    (e.g. &#2345; &#685;).
(7) Character entity references are replaced as follows:
      &amp;   ==> &
      &lt;    ==> <
      &gt;    ==> >
      &nbsp;  ==> ' '
      &quot;  ==> '"'
      &Alpha; - &Omega;
              ==> corresponding Greek upper case letters in EUC
      &alpha; - &omega;
              ==> corresponding Greek lower case letters in EUC
      alphabets with diacritical marks
              ==> corresponding alphabets without diacritical marks
      &AElig; ==> AE
      &ETH;   ==> ETH
      &szlig; ==> ss
      &aelig; ==> ae
      &eth;   ==> eth
      Others  ==> single space character (' ').
(8) Consecutive tabs and spaces are replaced with a single space
    character (' ').
(9) Empty lines and lines containing only tabs and spaces are removed.

11. "mecab" document data
~~~~~~~~~~~~~~~~~~~~~~~~~

This is a set of web data produced from the "cook" data by applying
the Japanese morphological analyzer MeCab.

The directory structure and the naming scheme are the same as those of
the "raw" document data, except that the extension is ".mecab".

e.g. site ID: 1234567, page ID: 0000123
     ==> file path: mecab/123/45xx/1234567/000/01xx/1234567_0000123.mecab
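
The directory scheme shared by the "raw", "euc", "cook" and "mecab"
document data can be expressed as a single path-building function.
The following is only a minimal sketch in Python, assuming the five
subdirectory levels and the extensions described in sections 8 to 11;
the names EXTENSIONS and document_path are hypothetical and are not
part of the collection.

```python
from pathlib import PurePosixPath

# Extension used in each document data directory (sections 8 to 11).
EXTENSIONS = {"raw": "dat", "euc": "euc", "cook": "cooked", "mecab": "mecab"}


def document_path(kind: str, site_id: str, page_id: str) -> PurePosixPath:
    """Build the relative path of one document file from its IDs."""
    if kind not in EXTENSIONS:
        raise ValueError(f"unknown document data directory: {kind!r}")
    for part in (site_id, page_id):
        if len(part) != 7 or not part.isdigit():
            raise ValueError(f"expected seven digits, got {part!r}")
    return PurePosixPath(
        kind,
        site_id[:3],           # level 1: first three characters of site ID
        site_id[3:5] + "xx",   # level 2: fourth and fifth characters + "xx"
        site_id,               # level 3: the site ID itself
        page_id[:3],           # level 4: first three characters of page ID
        page_id[3:5] + "xx",   # level 5: fourth and fifth characters + "xx"
        f"{site_id}_{page_id}.{EXTENSIONS[kind]}",
    )
```

For instance, document_path("mecab", "1234567", "0000123") yields
mecab/123/45xx/1234567/000/01xx/1234567_0000123.mecab, which agrees
with the example path given above.
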