NTCIR-3/NTCIR-4 WEB Document Data

This README.data file describes the files comprising the Document Data
of the NTCIR-3/NTCIR-4 WEB Test Collection, NW100G-01.

1. List of files
~~~~~~~~~~~~~~~~

NW100G-01 consists of the following files:

README.data : this file
sitelist    : list of crawled sites
aliaslist   : list of aliased sites
doclist     : list of documents with page data
duplist     : list of duplicated pages
targetlist  : list of documents to be search targets
linklist    : list of links from pages in doclist to pages in targetlist
raw/??/????/??????/*    : original document data as they were crawled
                          (files with the extension ".data") and their
                          corresponding metadata (files with the
                          extension ".meta")
euc/??/????/??????/*    : document data with Japanese characters
                          converted to EUC code (extension ".data") and
                          their corresponding metadata (extension ".meta")
cooked/??/????/??????/* : document data with unnecessary tags removed
                          (extension ".data") and their corresponding
                          metadata (extension ".meta")

2. "sitelist"
~~~~~~~~~~~~~

This file contains a list of crawled sites. All the documents in the
"doclist" were crawled from these sites. When a site was detected to
have multiple host names, only the host name found first was selected
and included in this list.

Each list item consists of a site id, a server type, a host name, and a
port number, separated by single tab characters and terminated by a
newline character. The site id is a string of six decimal digits and is
unique within the list. The server type is fixed to "http". The host
name is a DNS host name. The port number is the TCP/IP port number on
which the site accepted HTTP access; it may be a number other than "80".

3. "aliaslist"
~~~~~~~~~~~~~~

This file contains a list of aliased sites. When a site was detected to
have multiple host names, the host names not included in the "sitelist"
file are included in this list.
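The collection does not ship any parsing code; as an illustrative sketch only, the tab-separated "sitelist" record format described above could be read as follows (the function name and the example host are hypothetical):

```python
# Hypothetical sketch: parsing one "sitelist" record.
# Each record: site id (6 decimal digits), server type ("http"),
# host name, port number -- tab-separated, newline-terminated.
def parse_sitelist_line(line: str) -> dict:
    site_id, server_type, host, port = line.rstrip("\n").split("\t")
    if not (len(site_id) == 6 and site_id.isdigit()):
        raise ValueError(f"malformed site id: {site_id!r}")
    return {"id": site_id, "type": server_type,
            "host": host, "port": int(port)}

# Example with a made-up host name and a non-default port:
record = parse_sitelist_line("000123\thttp\twww.example.ac.jp\t8080\n")
```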
Each list item consists of an aliased site name and the corresponding
site id, separated by a single tab character and terminated by a newline
character. An aliased site name takes the form
"http://" <host name> (":" <port number>)?; when the port number is
"80", the port number part is often omitted. The site id always
corresponds to a single list item in the "sitelist".

4. "doclist"
~~~~~~~~~~~~

This file consists of a list of documents (web pages) with searchable
page data, crawled from the sites listed in the "sitelist". This
document set simulates all the crawled pages of a given search engine.
When a document was detected to have multiple URLs, only the URL found
first was selected and included in this list.

Each list item consists of a document id and a URL, separated by a
single tab character and terminated by a newline character. A document
id is a string of nine decimal digits preceded by the prefix "NW", and
is unique within the list.

5. "duplist"
~~~~~~~~~~~~

This file consists of a list of duplicated pages, i.e., pages having the
same content. When a document was detected to have multiple URLs, the
URLs not included in the "doclist" file are included in this list.

Each list item consists of a duplicated URL and the corresponding
document id, separated by a single tab character and terminated by a
newline character. Multiple list items may have the same document id.

6. "targetlist"
~~~~~~~~~~~~~~~

This file consists of a list of documents to be search targets. Any
document id listed in this file may be returned as a search result. All
the documents in this list were actually crawled and their page data
were saved; however, participants can use only the page data of the
documents included in the "doclist". This document set approximately
simulates all the discovered pages of a given search engine. It is a
superset of the "doclist".
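The "doclist" and "targetlist" share the same record format (document id, tab, URL). A minimal, hypothetical Python sketch of reading and validating such a record, assuming the "NW" + nine-digit id format described above:

```python
import re

# Hypothetical sketch: parsing one "doclist" / "targetlist" record.
# Each record: document id ("NW" + nine decimal digits), a tab, a URL.
DOCID_RE = re.compile(r"NW\d{9}")

def parse_doclist_line(line: str):
    doc_id, url = line.rstrip("\n").split("\t", 1)
    if not DOCID_RE.fullmatch(doc_id):
        raise ValueError(f"malformed document id: {doc_id!r}")
    return doc_id, url

# Example with a made-up id and URL:
doc_id, url = parse_doclist_line(
    "NW000012345\thttp://www.example.ac.jp/index.html\n")
```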
The difference between this list and the "doclist" simulates a set of
documents which are linked from documents already crawled but have not
been crawled yet.

Each list item takes the same form as in the "doclist".

7. "linklist"
~~~~~~~~~~~~~

This file consists of a list of links from pages in the "doclist" to
pages in the "targetlist". This link list approximately simulates the
set of all the links discovered by a given search engine once the pages
in the "doclist" have been crawled.

Each list item consists of a pair of a document id and a URL, indicating
the originating page and the destination page respectively.

8. "raw" document data
~~~~~~~~~~~~~~~~~~~~~~

The directory "raw" holds in its subdirectories all the web page data
crawled from the sites in the "sitelist", together with their
corresponding metadata. The page data remain as they were crawled,
without any processing.

The names of the first-level subdirectories are the first two characters
of the site ids. The names of the second-level subdirectories are the
first four characters of the site ids. The names of the third-level
subdirectories are the site ids themselves. The web page data crawled
from each site are stored in the corresponding third-level subdirectory.
The file name of each page data file is the crawling sequence number,
zero-padded to four digits, with the extension ".data". The
corresponding metadata file has the extension ".meta".

Each metadata file, enclosed in a pair of opening and closing tags,
consists of the following data elements:

- NW:DOCID : the document id listed in the "doclist".
- NW:DATE  : the date extracted from the HTTP header field
             "Last-Modified". This may be empty.
- NW:CTYPE : the Content-Type of the web page.
- NW:URL   : the URL of the web page.
- NW:HTTPH : the HTTP header returned from the server when the page was
             fetched.

9. "euc" document data
~~~~~~~~~~~~~~~~~~~~~~

The subdirectory structure and the naming scheme are the same as those
of the "raw" directory, except that the two-byte characters of the JIS
code set in the web page data were converted to Japanese EUC code.

10. "cooked" document data
~~~~~~~~~~~~~~~~~~~~~~~~~~

The subdirectory structure and the naming scheme are the same as those
of the "raw" directory, except that unnecessary tags and data elements
in the web page data were removed. The web page data were processed as
follows:

(1) HTML comments, XML declarations and XML definitions are removed.
(2) <script> tag pairs and their contents are removed.
(3) For each <meta> tag whose "name" attribute is either "keywords" or
    "description", the value of the "content" attribute is output on a
    single line with a marker string prepended at the beginning of the
    line.
    e.g. <meta name="keywords" content="information retrieval, test collection">
         ==> information retrieval, test collection
(4) For each <img> tag, the value of the "alt" attribute is output on a
    single line with a marker string prepended at the beginning of the
    line.
(5) All the other tags are removed.
(6) Character code entity references are removed (e.g. "&#2345;",
    "&#685;").
(7) Character entity references are replaced as follows:
    &amp;             ==> '&'
    &lt;              ==> '<'
    &gt;              ==> '>'
    &nbsp;            ==> ' '
    &quot;            ==> '"'
    &Alpha; - &Omega; ==> corresponding Greek upper-case letters in EUC
    &alpha; - &omega; ==> corresponding Greek lower-case letters in EUC
    alphabets with diacritical marks
                      ==> corresponding alphabets without diacritical
                          marks
    &AElig;           ==> "AE"
    &ETH;             ==> "ETH"
    &szlig;           ==> "ss"
    &aelig;           ==> "ae"
    &eth;             ==> "eth"
    others            ==> a single space character (' ')