-----------------------------------------------------------------------
About the Document Data of the NTCIR-5 WEB Test Collection (NW1000G-04)
-----------------------------------------------------------------------

This file describes the files comprising the Document Data of 
the NTCIR-5 WEB Test Collection, "NW1000G-04".


1. List of data
~~~~~~~~~~~~~~~
NW1000G-04 consists of the following data:

[lists]
lists/sitelist/
            : list of crawled sites
lists/doclist/
            : list of contained Web pages
lists/linklist.out/
            : list of forward links between pages contained in doclist
lists/linklist.in/
            : list of backward links between pages contained in doclist
lists/anclist.out/
            : list of anchor texts attached to forward links between 
              pages contained in doclist
lists/anclist.in/
            : list of anchor texts attached to backward links between 
              pages contained in doclist

[document data]
raw/        : Original document data as they were crawled
euc/        : Document data with Japanese characters converted to EUC 
              code
cook/       : Document data in EUC code with unnecessary tags removed 
mecab/      : Document data processed by the Japanese morphological
              analyzer MeCab

Note: The document data directories listed above also contain files with 
the extensions ".encode" and ".filelist". Please ignore them.


2. "sitelist"
~~~~~~~~~~~~~
This is a list of crawled sites.

All the documents in the "doclist" were crawled from these sites. 

Each list item consists of a site ID and a host name, separated by a single
tab character and terminated by a newline character.

The site ID is a string of seven decimal digits and is unique within 
the list.

While the site IDs are given in dictionary order of host names, they are 
not necessarily contiguous.

The server type is fixed to "http"; in the list files the host name is 
written with the "http://" prefix.

The host name is a DNS host name.

The port number of each crawled site is limited to 80.

The sitelist files are split by every 10,000 site IDs, and the name of 
each sitelist file is the concatenation of the first three digits of the 
site IDs and "xxxx.sitelist" (e.g. the site IDs from 1230000 to 1239999 
are stored in the file "123xxxx.sitelist").

-- sample of the sitelist files: 073xxxx.sitelist

0730011	http://www.barnes.co.jp
0730079	http://www.barneys.co.jp
0730203	http://www.barockhaus.co.jp
0730227	http://www.baron-ik.co.jp
0730229	http://www.baron.co.jp
(snip)
0739864	http://www.bec-csk.co.jp
0739869	http://www.bec.co.jp
0739876	http://www.bec1993.co.jp
0739891	http://www.because.co.jp
0739926	http://www.becgroup.co.jp
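
As an illustration of the naming rule and the line format described above,
the following Python sketch maps a site ID to the name of its sitelist file
and splits list lines into their two fields. The function names
sitelist_filename and parse_sitelist are hypothetical and are not part of
the collection; the sketch assumes one tab-separated item per line.

```python
from typing import Iterable, Iterator, Tuple


def sitelist_filename(site_id: str) -> str:
    """Return the name of the sitelist file that covers the given site ID."""
    if len(site_id) != 7 or not (site_id.isascii() and site_id.isdigit()):
        raise ValueError(f"site ID must be seven decimal digits: {site_id!r}")
    # First three digits of the site ID followed by the fixed suffix.
    return site_id[:3] + "xxxx.sitelist"


def parse_sitelist(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (site ID, host field) pairs from sitelist lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate empty lines (an assumption, not documented)
        site_id, host = line.split("\t", 1)
        yield site_id, host
```

For example, sitelist_filename("1230000") gives "123xxxx.sitelist", which
agrees with the example in the text.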


3. "doclist"
~~~~~~~~~~~~
This is a list of documents included in the Document Data,
which have been crawled from the sites listed in the "sitelist".

Each list item consists of a document ID and a URL, separated by a 
single tab character and terminated by a newline character.

A document ID is the concatenation of the site ID (seven digits), 
'_' and the page ID (seven digits), and is unique within the list.

Within each host, the page IDs are given in dictionary order of the URLs.

The doclist files are split by every 10,000 site IDs, and the name of 
each doclist file is the concatenation of the first three digits of the 
site IDs and "xxxx.doclist" (e.g. the documents of the sites with IDs from 
1230000 to 1239999 are listed in the file "123xxxx.doclist").

-- sample of the doclist files: 073xxxx.doclist

0730011_0000001 http://www.barnes.co.jp/
0730011_0000002 http://www.barnes.co.jp/Dew.htm
0730011_0000003 http://www.barnes.co.jp/Ene-Pow.htm
0730011_0000004 http://www.barnes.co.jp/IR.htm
0730011_0000005 http://www.barnes.co.jp/News.htm
(snip)
0739926_0000314 http://www.becgroup.co.jp/zawaz/catalog.html
0739926_0000315 http://www.becgroup.co.jp/zawaz/home.html
0739926_0000316 http://www.becgroup.co.jp/zawaz/order/order.html
0739926_0000317 http://www.becgroup.co.jp/zawaz/r_index.html
0739926_0000318 http://www.becgroup.co.jp/zawaz/up_bar.html
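
The structure of a doclist item can be sketched in Python as follows. The
names DocEntry and parse_doclist_line are hypothetical. The sketch splits
on the first run of whitespace rather than strictly on a tab, on the
assumption that this is safer when the separator has been altered in
transit; it also assumes that the document ID always has the 7+1+7
character form described above.

```python
import re
from typing import NamedTuple

# Site ID (seven digits), underscore, page ID (seven digits).
DOC_ID_RE = re.compile(r"^([0-9]{7})_([0-9]{7})$")


class DocEntry(NamedTuple):
    site_id: str
    page_id: str
    url: str


def parse_doclist_line(line: str) -> DocEntry:
    """Split one doclist line into site ID, page ID and URL."""
    doc_id, url = line.rstrip("\r\n").split(None, 1)
    match = DOC_ID_RE.match(doc_id)
    if match is None:
        raise ValueError(f"malformed document ID: {doc_id!r}")
    return DocEntry(match.group(1), match.group(2), url)
```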


4. "linklist.out"
~~~~~~~~~~~~~~~~~
This is a list of forward links between pages contained in the "doclist".

Each list item consists of two pairs of a document ID and a URL, the 
first for the originating page and the other for the destination page.

The linklist.out files are split by every 10,000 site IDs, and the name of 
each file is the concatenation of the first three digits of the site IDs 
of the originating pages and "xxxx.outlink" (e.g. the links originating 
from pages in sites with IDs from 1230000 to 1239999 are stored in the 
file "123xxxx.outlink").

-- sample of the linklist files: 073xxxx.outlink

0730011_0000001	http://www.barnes.co.jp/	0730011_0000002 http://www.barnes.co.jp/Dew.htm
0730011_0000001	http://www.barnes.co.jp/	0730011_0000003 http://www.barnes.co.jp/Ene-Pow.htm
0730011_0000001	http://www.barnes.co.jp/	0730011_0000004 http://www.barnes.co.jp/IR.htm
0730011_0000001	http://www.barnes.co.jp/	0730011_0000006 http://www.barnes.co.jp/Non-dest.htm
0730011_0000001	http://www.barnes.co.jp/	0730011_0000010 http://www.barnes.co.jp/semicon.htm
(snip)
0739926_0000304	http://www.becgroup.co.jp/kentos/umeda/par.html	0739926_0000304 http://www.becgroup.co.jp/kentos/umeda/par.html
0739926_0000305	http://www.becgroup.co.jp/kentos/umeda/sys.html	0739926_0000304 http://www.becgroup.co.jp/kentos/umeda/par.html
0739926_0000312	http://www.becgroup.co.jp/up_bar.html	0739926_0000278 http://www.becgroup.co.jp/home.html
0739926_0000315	http://www.becgroup.co.jp/zawaz/home.html	0739926_0000317 http://www.becgroup.co.jp/zawaz/r_index.html
0739926_0000316	http://www.becgroup.co.jp/zawaz/order/order.html	0739926_0000314 http://www.becgroup.co.jp/zawaz/catalog.html


5. "linklist.in"
~~~~~~~~~~~~~~~~
This is a list of backward links between pages contained in the "doclist".

Each list item consists of two pairs of a document ID and a URL, the 
first for the destination page and the other for the originating page.

The linklist.in files are split by every 10,000 site IDs, and the name of 
each file is the concatenation of the first three digits of the site IDs 
of the destination pages and "xxxx.inlink" (e.g. the links pointing to 
pages in sites with IDs from 1230000 to 1239999 are stored in the file 
"123xxxx.inlink").

-- sample of the linklist files: 073xxxx.inlink

0730011_0000001	http://www.barnes.co.jp/	1852829_0000024	http://www.semiconbrain.com/50/ni.htm	a,href
0730011_0000002	http://www.barnes.co.jp/Dew.htm	0730011_0000001	http://www.barnes.co.jp/	a,href
0730011_0000002	http://www.barnes.co.jp/Dew.htm	0730011_0000002	http://www.barnes.co.jp/Dew.htm	a,href
0730011_0000002	http://www.barnes.co.jp/Dew.htm	0730011_0000003	http://www.barnes.co.jp/Ene-Pow.htm	a,href
0730011_0000002	http://www.barnes.co.jp/Dew.htm	0730011_0000004	http://www.barnes.co.jp/IR.htm	a,href
(snip)
0739990_0000040	http://www.becker-japan.net/totop.html	0739990_0000028	http://www.becker-japan.net/rvolc.html	a,href
0739990_0000040	http://www.becker-japan.net/totop.html	0739990_0000029	http://www.becker-japan.net/rvolvp.html	a,href
0739990_0000040	http://www.becker-japan.net/totop.html	0739990_0000030	http://www.becker-japan.net/scb.html	a,href
0739990_0000040	http://www.becker-japan.net/totop.html	0739990_0000031	http://www.becker-japan.net/scvp.html	a,href
0739990_0000040	http://www.becker-japan.net/totop.html	0739990_0000039	http://www.becker-japan.net/toride.html	a,href
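
The two linklist formats share the same layout (document ID, URL, document
ID, URL), so one reader can serve both. The Python sketch below is only an
illustration: parse_link_line and build_adjacency are hypothetical names,
fields are split on any whitespace (which assumes that URLs contain no
whitespace), and any field after the fourth, such as the "a,href" value
seen in the inlink sample, is ignored because it is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def parse_link_line(line: str) -> Tuple[str, str]:
    """Return the two document IDs of one linklist line, in file order."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least four fields: {line!r}")
    # fields[1] and fields[3] are the URLs; further fields are ignored.
    return fields[0], fields[2]


def build_adjacency(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each first document ID to the set of second document IDs."""
    adjacency = defaultdict(set)
    for line in lines:
        if line.strip():
            first_id, second_id = parse_link_line(line)
            adjacency[first_id].add(second_id)
    return dict(adjacency)
```

Applied to a linklist.out file, the result maps each originating page to
its destinations; applied to a linklist.in file, it maps each destination
page to the pages that link to it.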


6. "anclist.out"
~~~~~~~~~~~~~~~~
This is a list of anchor texts attached to forward links between pages 
contained in the "doclist".

Each list item consists of a document ID of the originating page, a 
document ID of the destination page and an anchor text.

The anclist.out files are split by every 10,000 site IDs, and the name of 
each file is the concatenation of the first three digits of the site IDs 
of the originating pages and "xxxx.outlink" (e.g. the anchor texts of 
links originating from pages in sites with IDs from 1230000 to 1239999 
are stored in the file "123xxxx.outlink").

-- sample of the anclist files: 073xxxx.outlink

0730011_0000001	0730011_0000001	ボタン
0730011_0000001	0730011_0000002	露点温度測定器
0730011_0000001	0730011_0000003	光パワー／光エネルギー測定機器
0730011_0000001	0730011_0000004	赤外線応用製品
0730011_0000001	0730011_0000005	ボタン
(snip)
0739990_0000038	0739990_0000033	Asia
0739990_0000038	0739990_0000034	Europe
0739990_0000038	0739990_0000035	Japan
0739990_0000038	0739990_0000036	Oceania
0739990_0000038	0739990_0000037	U.S.A


7. "anclist.in"
~~~~~~~~~~~~~~~
This is a list of anchor texts attached to backward links between pages 
contained in the "doclist".

Each list item consists of a document ID of the destination page, a 
document ID of the originating page and an anchor text.

The anclist.in files are split by every 10,000 site IDs, and the name of 
each file is the concatenation of the first three digits of the site IDs 
of the destination pages and "xxxx.inlink" (e.g. the anchor texts of 
links pointing to pages in sites with IDs from 1230000 to 1239999 are 
stored in the file "123xxxx.inlink").

-- sample of the anclist files: 073xxxx.inlink

0730011_0000001	0730011_0000011	Top
0730011_0000001	1852829_0000024	http://www.barnes.co.jp
0730011_0000001	0730011_0000002	ホームページ
0730011_0000001	0730011_0000003	ホームページ
0730011_0000001	0730011_0000004	ホームページ
(snip)
0739990_0000038	0739990_0000020	サービス網
0739990_0000038	0739990_0000021	サービス網
0739990_0000039	0739990_0000023	写 真
0739990_0000040	0739990_0000020	トップページへ
0739990_0000040	0739990_0000021	トップページへ
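
Anchor texts may themselves contain spaces (see "写 真" in the sample), so
a reader should split on the first two tabs only. The following Python
sketch shows one way to do this and to group the anchor texts by the first
document ID. The names parse_anchor_line and collect_anchor_texts are
hypothetical, and the sketch assumes the lines have already been decoded
to text and that the three fields are tab-separated.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_anchor_line(line: str) -> Tuple[str, str, str]:
    """Return (first document ID, second document ID, anchor text)."""
    # maxsplit=2 keeps any tabs or spaces inside the anchor text intact.
    first_id, second_id, anchor_text = line.rstrip("\r\n").split("\t", 2)
    return first_id, second_id, anchor_text


def collect_anchor_texts(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group anchor texts by the first document ID of each line."""
    texts = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        first_id, _second_id, anchor_text = parse_anchor_line(line)
        texts[first_id].append(anchor_text)
    return dict(texts)
```

For an anclist.in file the first ID is the destination page, so the result
lists, for each page, the anchor texts of the links pointing to it.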


8. "raw" document data
~~~~~~~~~~~~~~~~~~~~~~
This is a set of web data as they were crawled from sites in the "sitelist". 

The directory structure and the naming scheme are as follows:

- The name of the first level subdirectory is the first three 
  characters of the site ID.

- The name of the second level subdirectory is the concatenation 
  of fourth and fifth characters of the site ID and "xx".

- The name of the third level subdirectory is the site ID itself
  (seven digits). 

- The name of the fourth level subdirectory is the first three 
  characters of the page ID.

- The name of the fifth level subdirectory is the concatenation 
  of fourth and fifth characters of the page ID and "xx".

The web page data crawled from each site are stored in the corresponding 
fifth level subdirectories. 

The file name of each page is the concatenation of the site ID, "_", 
the page ID and the extension ".dat".

  e.g. site ID: 1234567, page ID: 0000123
   ==> file path: raw/123/45xx/1234567/000/01xx/1234567_0000123.dat
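
The directory scheme above can be expressed as a small Python function.
This is a sketch only: document_path and EXTENSIONS are hypothetical names,
and the table of extensions is taken from the descriptions of the "raw",
"euc", "cook" and "mecab" directories in this file.

```python
from pathlib import PurePosixPath

# Extension used in each document data directory, as described in this file.
EXTENSIONS = {"raw": ".dat", "euc": ".euc", "cook": ".cooked", "mecab": ".mecab"}


def document_path(kind: str, site_id: str, page_id: str) -> PurePosixPath:
    """Build the relative path of one document from its site and page IDs."""
    if kind not in EXTENSIONS:
        raise ValueError(f"unknown document data directory: {kind!r}")
    for value in (site_id, page_id):
        if len(value) != 7 or not (value.isascii() and value.isdigit()):
            raise ValueError(f"ID must be seven decimal digits: {value!r}")
    return PurePosixPath(
        kind,
        site_id[:3],           # first level: first three digits of site ID
        site_id[3:5] + "xx",   # second level: fourth and fifth digits + "xx"
        site_id,               # third level: the site ID itself
        page_id[:3],           # fourth level: first three digits of page ID
        page_id[3:5] + "xx",   # fifth level: fourth and fifth digits + "xx"
        f"{site_id}_{page_id}{EXTENSIONS[kind]}",
    )
```

With site ID 1234567 and page ID 0000123, document_path("raw", ...) yields
the same path as the example above.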


9. "euc" document data
~~~~~~~~~~~~~~~~~~~~~~
This is a set of web data produced from the "raw" data by converting
Japanese two-byte characters to EUC code.

The directory structure and the naming scheme are the same as those of 
the "raw" document data, except that the extension is ".euc".

  e.g. site ID: 1234567, page ID: 0000123
   ==> file path: euc/123/45xx/1234567/000/01xx/1234567_0000123.euc
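
A file from this directory can be read as text in Python roughly as shown
below. The sketch assumes that "EUC code" here corresponds to Python's
"euc_jp" codec, and it replaces undecodable bytes instead of failing, since
this file does not say whether every page was converted cleanly. The name
read_euc_document is hypothetical.

```python
from pathlib import Path


def read_euc_document(path: str) -> str:
    """Read one ".euc" file as text, assuming the EUC-JP encoding."""
    # errors="replace" substitutes U+FFFD for byte sequences that are not
    # valid EUC-JP, so a single bad page does not abort a batch run.
    return Path(path).read_text(encoding="euc_jp", errors="replace")
```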
  

10. "cook" document data
~~~~~~~~~~~~~~~~~~~~~~~~
This is a set of web data produced from the "euc" data by removing
unnecessary tags and other markup.

The directory structure and the naming scheme are the same as those of 
the "raw" document data, except that the extension is ".cooked".

  e.g. site ID: 1234567, page ID: 0000123
   ==> file path: cook/123/45xx/1234567/000/01xx/1234567_0000123.cooked

The web page data were processed with the following rules:

(1) HTML comments, XML declarations and XML definitions are removed.

(2) Tag pairs of "<script>" and "</script>" and their contents are removed.

(3) For each "<meta>" tag, if the value of the "name" attribute
is either "keywords" or "description", then the value of the "content"
attribute is output on a single line prefixed with "<NWD:META/>".

  e.g. <meta name="keywords" content="information retrieval, test
       collection">
   ==> <NWD:META/>information retrieval, test collection<newline>

(4) For each "<img>" tag, the value of the "alt" attribute is
output on a single line prefixed with "<NWD:IMG/>".

(5) All the other tags are simply removed.

(6) Numeric character references are removed (e.g. &#2345; &#x2ad;).

(7) Character entity references are replaced as follows:

  &amp;  ==> &
  &lt;   ==> <
  &gt;   ==> >
  &nbsp; ==> ' '
  &quot; ==> '"'

  &Alpha; - &Omega; ==> corresponding Greek upper case letters in EUC
  &alpha; - &omega; ==> corresponding Greek lower case letters in EUC

  letters with diacritical marks
          ==> corresponding letters without diacritical marks

  &AElig; ==> AE
  &ETH;   ==> ETH
  &szlig; ==> ss
  &aelig; ==> ae
  &eth;   ==> eth

  Others  ==> single space character (' ').

(8) Consecutive tabs and spaces are replaced with a single space 
character (' ').

(9) Empty lines and lines containing tabs and spaces only are removed.
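
Rules (6) to (9) can be approximated in Python as follows. This is a sketch
under several assumptions and is not the program that produced the "cook"
data: it works on already decoded text with tags already removed, it treats
a single tab or space the same way as a run of them, and it finds letters
with diacritical marks through Unicode decomposition, so entities such as
&oslash; fall through to the "Others" case. The names FIXED, ENTITY_RE,
replace_entity and normalize_cooked_text are hypothetical.

```python
import re
import unicodedata
from html.entities import name2codepoint

# Explicit replacements listed under rule (7).
FIXED = {
    "amp": "&", "lt": "<", "gt": ">", "nbsp": " ", "quot": '"',
    "AElig": "AE", "ETH": "ETH", "szlig": "ss", "aelig": "ae", "eth": "eth",
}

ENTITY_RE = re.compile(r"&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def replace_entity(match):
    name = match.group(1)
    if name.startswith("#"):
        return ""                       # rule (6): numeric references removed
    if name in FIXED:
        return FIXED[name]
    codepoint = name2codepoint.get(name)
    if codepoint is not None:
        char = chr(codepoint)
        # &Alpha; - &Omega; and &alpha; - &omega; keep their Greek letter.
        if 0x0391 <= codepoint <= 0x03A9 or 0x03B1 <= codepoint <= 0x03C9:
            return char
        # Letters with diacritical marks lose the marks (assumed via NFD).
        base = unicodedata.normalize("NFD", char)[0]
        if base != char and base.isascii() and base.isalpha():
            return base
    return " "                          # rule (7), "Others"


def normalize_cooked_text(text: str) -> str:
    # One regex pass, so text produced by a replacement is not re-scanned.
    text = ENTITY_RE.sub(replace_entity, text)
    kept = []
    for line in text.splitlines():
        line = re.sub(r"[ \t]+", " ", line)   # rule (8)
        if line.strip(" "):                   # rule (9)
            kept.append(line)
    return "\n".join(kept)
```

Replacing all entities in a single pass avoids turning "&amp;lt;" into "<"
by accident, which would happen if the replacements were applied one after
another.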


11. "mecab" document data
~~~~~~~~~~~~~~~~~~~~~~~~~
This is a set of web data produced from the "cook" data by applying the
Japanese morphological analyzer MeCab.

The directory structure and the naming scheme are the same as those of 
the "raw" document data, except that the extension is ".mecab".

  e.g. site ID: 1234567, page ID: 0000123
   ==> file path: mecab/123/45xx/1234567/000/01xx/1234567_0000123.mecab