NTCIR-3 Web-IR Task sample data

This file describes the sample files for the Web Task sample data.

1. List of files

Uncompress the zip'ed or tar-and-gzip'ed file, and you will have the following files.

  README              : this file
  aliaslist           : list of aliased sites
  doclist             : list of documents with page data
  duplist             : list of duplicated pages
  sitelist            : list of crawled sites
  targetlist          : list of searched documents
  euc/??/????/??????  : document data in EUC character code
  raw/??/????/??????  : document data in original character code

2. "sitelist"

This file contains a list of crawled sites. All the documents with page data were crawled from these sites. When it was detected that a site had multiple host names, only one representative host name was selected and included in this list.

Each list item consists of a site id, a server type, a host name, and a port number, separated by single tab characters and terminated by a new line character. The site id is a string of six decimal digits, and is unique within the list. The server type is fixed to "http". The host name is a DNS host name. The port number is the TCP/IP port number on which the site accepted HTTP access. It may be a number other than "80".

3. "aliaslist"

This file contains a list of aliased sites. When it was detected that a site had multiple host names, the host names that were not selected as the representative host name are included in this list.

Each list item consists of an aliased site name and the corresponding site id, separated by a single tab character and terminated by a new line character. An aliased site name takes the form ("http://" hostname (":" port)?). When the port number is "80", the port number part is often omitted. The site id always corresponds to exactly one list item in the "sitelist".

4. "doclist"

This file consists of a list of documents (web pages) with searchable page data, crawled from the sites listed in the "sitelist". This document set simulates all the crawled pages for a given search engine.
When it was detected that a document had multiple URLs, only one representative URL was selected and included in this list.

Each list item consists of a document id and a URL, separated by a single tab character and terminated by a new line character. A document id is a string of nine decimal digits preceded by the prefix "NW", and is unique within the list.

5. "duplist"

This file consists of a list of duplicated pages. When it was detected that a document had multiple URLs, the URLs that were not selected as the representative URL are included in this list.

Each list item consists of a duplicated URL and the corresponding document id, separated by a single tab character and terminated by a new line character. Multiple list items may have the same document id.

6. "targetlist"

This file consists of a list of searched documents. Any document id listed in this file may be returned as a search result. All the documents in this list were actually crawled and their page data were saved. However, participants can use only the page data of the documents included in the "doclist".

This document set approximately simulates all the discovered pages for a given search engine. It is a superset of the "doclist". The difference between this list and the "doclist" simulates a set of documents which are linked from documents already crawled but have not been crawled yet.

Each list item takes the same form as in the "doclist".

7. Document data in original character code

Each file under the directory "raw" consists of web page data and the corresponding metadata which were crawled from a site in the "sitelist". The name of the first level subdirectory is the first two characters of the site id. The name of the second level subdirectory is the first four characters of the site id. The file name is the site id itself.

Each file may contain data for multiple documents. Each document's data, beginning with "" and ending with "", consists of a metadata part and a page data part.
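
As a rough illustration of the list format and the directory layout described above, the following Python sketch splits one tab-separated list item and derives the relative path of a site's document data file from its site id. The function names and the sample "sitelist" item are hypothetical and are not part of the distributed data.

```python
from pathlib import PurePosixPath


def parse_list_line(line):
    """Split one list item into its tab-separated fields."""
    return line.rstrip("\n").split("\t")


def site_file_path(site_id, top_dir="raw"):
    """Return the relative path of the document data file for a site id.

    top_dir is "raw" (original character code) or "euc" (EUC character code).
    """
    if len(site_id) != 6 or not (site_id.isascii() and site_id.isdigit()):
        raise ValueError("site id must be six decimal digits: %r" % site_id)
    # First level: first two characters; second level: first four characters;
    # file name: the site id itself.
    return PurePosixPath(top_dir, site_id[:2], site_id[:4], site_id)


# Hypothetical "sitelist" item: site id, server type, host name, port number.
sample = "012345\thttp\twww.example.jp\t8080\n"
site_id, server_type, host_name, port = parse_list_line(sample)

print(site_file_path(site_id))         # raw/01/0123/012345
print(site_file_path(site_id, "euc"))  # euc/01/0123/012345
```

The same parse_list_line helper would apply to the two-field lists ("aliaslist", "doclist", "duplist", "targetlist"), since they use the same separator and terminator.
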
The metadata part, beginning with "" and ending with "", consists of the following data elements.

  - NW:DOCID  The document id listed in the "doclist".
  - NW:DATE   Date extracted from the HTTP header field "Last-Modified". This may be empty.
  - NW:CTYPE  Content-Type of the web page.
  - NW:URL    URL of the web page.
  - NW:HTTPH  HTTP header returned from the server when the page was fetched.

The page data part, beginning with "" and ending with "", contains an "NW:DSIZE" element directly followed by the actual page data. The "NW:DSIZE" element indicates the actual size of the following page data in bytes.

Note that page data in different character codes may exist in a single document data file.

8. Document data in EUC character code

Each file under the directory "euc" consists of document data and the corresponding metadata which were crawled from a site in the "sitelist" and converted to EUC character code. The directory structure and the format of the document data are the same as those described in the previous section.

Note that, although all the page data are in EUC character code, simply converting a document data file to a different character code may change the sizes of the page data, so that the "NW:DSIZE" values no longer match and the integrity of the file is lost. If you rely only on the "NW:DATA" tags and ignore the "NW:DSIZE" element, this may cause no problem. However, we do not guarantee that no "NW:DATA" tags appear inside the page data.
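
Because tag-like strings may appear inside page data, one cautious way to read a document data file is to trust "NW:DSIZE" and skip exactly that many bytes. The Python sketch below shows this idea under several assumptions that this README does not state: elements are written as SGML-style tags such as <NW:DSIZE>...</NW:DSIZE>, the page data part opens with <NW:DATA>, and the page data starts immediately after the closing "NW:DSIZE" tag. The function name and the sample data are hypothetical.

```python
import re

# Assumed tag syntax; the README names the elements but not their exact form.
DOCID_RE = re.compile(rb"<NW:DOCID>(.*?)</NW:DOCID>", re.S)
DSIZE_RE = re.compile(rb"<NW:DATA>\s*<NW:DSIZE>(\d+)</NW:DSIZE>")


def iter_documents(blob):
    """Yield (document id, page data bytes) pairs from one document data file."""
    pos = 0
    while True:
        id_match = DOCID_RE.search(blob, pos)
        if id_match is None:
            return
        size_match = DSIZE_RE.search(blob, id_match.end())
        if size_match is None:
            return
        size = int(size_match.group(1))
        start = size_match.end()
        page = blob[start:start + size]
        if len(page) != size:
            raise ValueError("page data shorter than NW:DSIZE")
        yield id_match.group(1).decode("ascii"), page
        # Jump past the page data so tags inside it are never interpreted.
        pos = start + size


def make_doc(doc_id, page):
    """Build a simplified, hypothetical document entry for demonstration."""
    return b"".join([
        b"<NW:DOCID>", doc_id, b"</NW:DOCID>",
        b"<NW:DATA><NW:DSIZE>", str(len(page)).encode("ascii"), b"</NW:DSIZE>",
        page, b"</NW:DATA>\n",
    ])


# The first page deliberately contains tag-like text inside its page data.
page1 = b"<p>fake <NW:DOCID>X</NW:DOCID> </NW:DATA></p>"
page2 = b"plain text"
blob = make_doc(b"NW000000001", page1) + make_doc(b"NW000000002", page2)

docs = list(iter_documents(blob))
print([doc_id for doc_id, _ in docs])  # ['NW000000001', 'NW000000002']
```

The file is handled as bytes rather than text because "NW:DSIZE" counts bytes. If the page data is converted to another character code, the "NW:DSIZE" value would need to be recomputed from the converted bytes for this approach to keep working.
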