NTCIR-3/NTCIR-4 WEB Document Data

This README.data file describes the files comprising the Document Data
of the NTCIR-3/NTCIR-4 WEB Test Collection, NW100G-01.

1. List of files
~~~~~~~~~~~~~~~~

NW100G-01 consists of the following files:

README.data : this file
sitelist    : list of crawled sites
aliaslist   : list of aliased sites
doclist     : list of documents with page data
duplist     : list of duplicated pages
targetlist  : list of documents to be search targets
linklist    : list of links from pages in doclist to pages in targetlist
raw/??/????/??????/*    : original document data as they were crawled
                          (files with the extension ".data") and their
                          corresponding metadata (files with the
                          extension ".meta")
euc/??/????/??????/*    : document data with Japanese characters
                          converted to EUC code (extension ".data") and
                          their corresponding metadata (extension ".meta")
cooked/??/????/??????/* : document data with unnecessary tags removed
                          (extension ".data") and their corresponding
                          metadata (extension ".meta")

2. "sitelist"
~~~~~~~~~~~~~

This file contains a list of crawled sites. All the documents in the
"doclist" were crawled from these sites. When a site was detected to
have multiple host names, only the host name found first was selected
and included in this list.

Each list item consists of a site id, a server type, a host name, and a
port number, separated by single tab characters and terminated by a
newline character. The site id is a string of six decimal digits and is
unique within the list. The server type is fixed to "http". The host
name is a DNS host name. The port number is the TCP/IP port number on
which the site accepted HTTP access; it may be a number other than "80".

3. "aliaslist"
~~~~~~~~~~~~~~

This file contains a list of aliased sites. When a site was detected to
have multiple host names, the host names not included in the "sitelist"
file are included in this list.
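The collection does not ship any parsing code; as an illustrative sketch only, the tab-separated "sitelist" record format described above could be read as follows (the function name and the example host are hypothetical):

```python
# Hypothetical sketch: parsing one "sitelist" record.
# Each record: site id (6 decimal digits), server type ("http"),
# host name, port number -- tab-separated, newline-terminated.
def parse_sitelist_line(line: str) -> dict:
    site_id, server_type, host, port = line.rstrip("\n").split("\t")
    if not (len(site_id) == 6 and site_id.isdigit()):
        raise ValueError(f"malformed site id: {site_id!r}")
    return {"id": site_id, "type": server_type,
            "host": host, "port": int(port)}

# Example with a made-up host name and a non-default port:
record = parse_sitelist_line("000123\thttp\twww.example.ac.jp\t8080\n")
```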
Each list item consists of an aliased site name and the corresponding
site id, separated by a single tab character and terminated by a newline
character. An aliased site name takes the form
"http://" <host name> (":" <port number>)?; when the port number is
"80", the port number part is often omitted. The site id always
corresponds to a single list item in the "sitelist".

4. "doclist"
~~~~~~~~~~~~

This file consists of a list of documents (web pages) with searchable
page data, crawled from the sites listed in the "sitelist". This
document set simulates all the crawled pages of a given search engine.
When a document was detected to have multiple URLs, only the URL found
first was selected and included in this list.

Each list item consists of a document id and a URL, separated by a
single tab character and terminated by a newline character. A document
id is a string of nine decimal digits preceded by the prefix "NW", and
is unique within the list.

5. "duplist"
~~~~~~~~~~~~

This file consists of a list of duplicated pages, i.e., pages having the
same content. When a document was detected to have multiple URLs, the
URLs not included in the "doclist" file are included in this list.

Each list item consists of a duplicated URL and the corresponding
document id, separated by a single tab character and terminated by a
newline character. Multiple list items may have the same document id.

6. "targetlist"
~~~~~~~~~~~~~~~

This file consists of a list of documents to be search targets. Any
document id listed in this file may be returned as a search result. All
the documents in this list were actually crawled and their page data
were saved; however, participants can use only the page data of the
documents included in the "doclist". This document set approximately
simulates all the discovered pages of a given search engine. It is a
superset of the "doclist".
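The "doclist" and "targetlist" share the same record format (document id, tab, URL). A minimal, hypothetical Python sketch of reading and validating such a record, assuming the "NW" + nine-digit id format described above:

```python
import re

# Hypothetical sketch: parsing one "doclist" / "targetlist" record.
# Each record: document id ("NW" + nine decimal digits), a tab, a URL.
DOCID_RE = re.compile(r"NW\d{9}")

def parse_doclist_line(line: str):
    doc_id, url = line.rstrip("\n").split("\t", 1)
    if not DOCID_RE.fullmatch(doc_id):
        raise ValueError(f"malformed document id: {doc_id!r}")
    return doc_id, url

# Example with a made-up id and URL:
doc_id, url = parse_doclist_line(
    "NW000012345\thttp://www.example.ac.jp/index.html\n")
```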
The difference between this list and the "doclist" simulates a set of
documents which are linked from documents already crawled but have not
been crawled yet.

Each list item takes the same form as in the "doclist".

7. "linklist"
~~~~~~~~~~~~~

This file consists of a list of links from pages in the "doclist" to
pages in the "targetlist". This link list approximately simulates the
set of all the links discovered by a given search engine once the pages
in the "doclist" have been crawled.

Each list item consists of a pair of a document id and a URL, indicating
the originating page and the destination page respectively.

8. "raw" document data
~~~~~~~~~~~~~~~~~~~~~~

The directory "raw" holds in its subdirectories all the web page data
crawled from the sites in the "sitelist", together with their
corresponding metadata. The page data remain as they were crawled,
without any processing.

The names of the first-level subdirectories are the first two characters
of the site ids. The names of the second-level subdirectories are the
first four characters of the site ids. The names of the third-level
subdirectories are the site ids themselves. The web page data crawled
from each site are stored in the corresponding third-level subdirectory.
The file name of each page data file is the crawling sequence number,
zero-padded to four digits, with the extension ".data". The
corresponding metadata file has the extension ".meta".

Each metadata file, enclosed in a pair of opening and closing tags,
consists of the following data elements:

- NW:DOCID : the document id listed in the "doclist".
- NW:DATE  : the date extracted from the HTTP header field
             "Last-Modified". This may be empty.
- NW:CTYPE : the Content-Type of the web page.
- NW:URL   : the URL of the web page.
- NW:HTTPH : the HTTP header returned from the server when the page was
             fetched.

9. "euc" document data
~~~~~~~~~~~~~~~~~~~~~~

The subdirectory structure and the naming scheme are the same as those
of the "raw" directory, except that the two-byte characters of the JIS
code set in the web page data were converted to Japanese EUC code.

10. "cooked" document data
~~~~~~~~~~~~~~~~~~~~~~~~~~

The subdirectory structure and the naming scheme are the same as those
of the "raw" directory, except that unnecessary tags and data elements
in the web page data were removed. The web page data were processed as
follows:

(1) HTML comments, XML declarations and XML definitions are removed.
(2) <script> tag pairs and their contents are removed.
(3) For each <meta> tag whose "name" attribute is either "keywords" or
    "description", the value of the "content" attribute is output on a
    single line with a marker string prepended at the beginning of the
    line.
    e.g. <meta name="keywords" content="information retrieval, test collection">
         ==> information retrieval, test collection
(4) For each <img> tag, the value of the "alt" attribute is output on a
    single line with a marker string prepended at the beginning of the
    line.
(5) All the other tags are removed.
(6) Character code entity references are removed (e.g. "&#2345;",
    "&#685;").
(7) Character entity references are replaced as follows:
    &amp;             ==> '&'
    &lt;              ==> '<'
    &gt;              ==> '>'
    &nbsp;            ==> ' '
    &quot;            ==> '"'
    &Alpha; - &Omega; ==> corresponding Greek upper-case letters in EUC
    &alpha; - &omega; ==> corresponding Greek lower-case letters in EUC
    alphabets with diacritical marks
                      ==> corresponding alphabets without diacritical
                          marks
    &AElig;           ==> "AE"
    &ETH;             ==> "ETH"
    &szlig;           ==> "ss"
    &aelig;           ==> "ae"
    &eth;             ==> "eth"
    others            ==> a single space character (' ')