-----------------------------------------------------------------------
About the Document Data of the NTCIR-5 WEB Test Collection (NW1000G-04)
-----------------------------------------------------------------------
This file describes the files that make up the Document Data of
the NTCIR-5 WEB Test Collection, ``NW1000G-04.''
1. List of data
~~~~~~~~~~~~~~~~
NW1000G-04 consists of the following data:
[lists]
lists/sitelist/
: list of crawled sites
lists/doclist/
: list of contained Web pages
lists/linklist.out/
: list of forward links between pages contained in doclist
lists/linklist.in/
: list of backward links between pages contained in doclist
lists/anclist.out/
: list of anchor texts attached to forward links between
pages contained in doclist
lists/anclist.in/
: list of anchor texts attached to backward links between
pages contained in doclist
[document data]
raw/ : Original document data as they were crawled
euc/ : Document data with Japanese characters converted to EUC
code
cook/ : Document data in EUC code with unnecessary tags removed
mecab/ : Document data processed by the Japanese morphological
analyzer MeCab.
Note: The document data directories listed above also contain files with
the extensions ".encode" and ".filelist". Please ignore them.
2. "sitelist"
~~~~~~~~~~~~~
This is a list of crawled sites.
All the documents in the "doclist" were crawled from these sites.
Each list item consists of a site ID and a host name, separated by a single
tab character and terminated by a newline character.
The site ID is a string of seven decimal digits and is unique within
the list.
Site IDs are assigned in dictionary order of the host names, but they are
not necessarily contiguous.
The server type is always "http".
The host name is a DNS host name.
The port number of every crawled site is 80.
The sitelist files are split into blocks of 10,000 site IDs. Each file
name is the first three characters of the site IDs followed by
"xxxx.sitelist" (e.g. the site IDs from 1230000 to 1239999 are stored
in the file "123xxxx.sitelist").
-- sample of the sitelist files: 073xxxx.sitelist
0730011 http://www.barnes.co.jp
0730079 http://www.barneys.co.jp
0730203 http://www.barockhaus.co.jp
0730227 http://www.baron-ik.co.jp
0730229 http://www.baron.co.jp
(snip)
0739864 http://www.bec-csk.co.jp
0739869 http://www.bec.co.jp
0739876 http://www.bec1993.co.jp
0739891 http://www.because.co.jp
0739926 http://www.becgroup.co.jp
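
As a rough illustration of the format and naming rule described above, the
following Python sketch parses one sitelist line and derives the name of
the file that holds a given site ID. The function names
parse_sitelist_line and sitelist_filename are made up for this example and
are not part of the collection. The second field is kept as an opaque
string, as it appears in the sample.

```python
def parse_sitelist_line(line):
    """Split one tab-separated sitelist line into (site_id, host)."""
    site_id, host = line.rstrip("\r\n").split("\t", 1)
    # A site ID is documented as a string of seven decimal digits.
    if len(site_id) != 7 or not site_id.isdigit():
        raise ValueError("unexpected site ID: %r" % site_id)
    return site_id, host


def sitelist_filename(site_id):
    """Name of the sitelist file that holds this site ID."""
    # First three characters of the site ID, then the literal suffix.
    return site_id[:3] + "xxxx.sitelist"
```

For example, sitelist_filename("1234567") gives "123xxxx.sitelist".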
3. "doclist"
~~~~~~~~~~~~
This is a list of the documents included in the Document Data,
all of which were crawled from the sites listed in the "sitelist".
Each list item consists of a document ID and a URL, separated by a
single tab character and terminated by a newline character.
A document ID is the concatenation of the site ID (seven digits),
'_' and the page ID (seven digits), and is unique within the list.
The page IDs within each host are assigned in dictionary order of the URLs.
The doclist files are split into blocks of 10,000 site IDs. Each file
name is the first three characters of the site IDs followed by
"xxxx.doclist" (e.g. the documents of the sites with IDs from 1230000
to 1239999 are stored in the file "123xxxx.doclist").
-- sample of the doclist files: 073xxxx.doclist
0730011_0000001 http://www.barnes.co.jp/
0730011_0000002 http://www.barnes.co.jp/Dew.htm
0730011_0000003 http://www.barnes.co.jp/Ene-Pow.htm
0730011_0000004 http://www.barnes.co.jp/IR.htm
0730011_0000005 http://www.barnes.co.jp/News.htm
(snip)
0739926_0000314 http://www.becgroup.co.jp/zawaz/catalog.html
0739926_0000315 http://www.becgroup.co.jp/zawaz/home.html
0739926_0000316 http://www.becgroup.co.jp/zawaz/order/order.html
0739926_0000317 http://www.becgroup.co.jp/zawaz/r_index.html
0739926_0000318 http://www.becgroup.co.jp/zawaz/up_bar.html
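
The document ID layout described above can be sketched in Python as
follows. This is only an illustration under the stated format (seven-digit
site ID, '_', seven-digit page ID, then a tab and the URL); the function
names split_doc_id and parse_doclist_line are hypothetical.

```python
def split_doc_id(doc_id):
    """Split a document ID such as '0730011_0000002' into its two parts."""
    site_id, sep, page_id = doc_id.partition("_")
    well_formed = (
        sep == "_"
        and len(site_id) == 7 and site_id.isdigit()
        and len(page_id) == 7 and page_id.isdigit()
    )
    if not well_formed:
        raise ValueError("unexpected document ID: %r" % doc_id)
    return site_id, page_id


def parse_doclist_line(line):
    """Split one tab-separated doclist line into (site_id, page_id, url)."""
    doc_id, url = line.rstrip("\r\n").split("\t", 1)
    site_id, page_id = split_doc_id(doc_id)
    return site_id, page_id, url
```

Because the site ID is a prefix of the document ID, the doclist file for a
document can be found from the first three characters of its document ID.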
4. "linklist.out"
~~~~~~~~~~~~~~~~~
This is a list of forward links between pages contained in the "doclist".
Each list item consists of two pairs of a document ID and a URL: the
first pair is for the originating page and the second for the destination
page.
The linklist.out files are split into blocks of 10,000 site IDs. Each
file name is the first three characters of the site IDs of the
originating pages followed by "xxxx.outlink" (e.g. the links originating
from pages in sites with IDs from 1230000 to 1239999 are stored in the
file "123xxxx.outlink").
-- sample of the linklist files: 073xxxx.outlink
0730011_0000001 http://www.barnes.co.jp/ 0730011_0000002 http://www.barnes.co.jp/Dew.htm
0730011_0000001 http://www.barnes.co.jp/ 0730011_0000003 http://www.barnes.co.jp/Ene-Pow.htm
0730011_0000001 http://www.barnes.co.jp/ 0730011_0000004 http://www.barnes.co.jp/IR.htm
0730011_0000001 http://www.barnes.co.jp/ 0730011_0000006 http://www.barnes.co.jp/Non-dest.htm
0730011_0000001 http://www.barnes.co.jp/ 0730011_0000010 http://www.barnes.co.jp/semicon.htm
(snip)
0739926_0000304 http://www.becgroup.co.jp/kentos/umeda/par.html 0739926_0000304 http://www.becgroup.co.jp/kentos/umeda/par.html
0739926_0000305 http://www.becgroup.co.jp/kentos/umeda/sys.html 0739926_0000304 http://www.becgroup.co.jp/kentos/umeda/par.html
0739926_0000312 http://www.becgroup.co.jp/up_bar.html 0739926_0000278 http://www.becgroup.co.jp/home.html
0739926_0000315 http://www.becgroup.co.jp/zawaz/home.html 0739926_0000317 http://www.becgroup.co.jp/zawaz/r_index.html
0739926_0000316 http://www.becgroup.co.jp/zawaz/order/order.html 0739926_0000314 http://www.becgroup.co.jp/zawaz/catalog.html
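
A minimal Python sketch of reading such a file into a forward-link map is
given below. The field separator is not stated in this description, so the
sketch assumes that fields are separated by whitespace and that URLs
contain no whitespace. The names parse_link_line and forward_links are
invented for this example.

```python
from collections import defaultdict


def parse_link_line(line):
    """Return (first_id, first_url, second_id, second_url) from one line.

    Assumption: fields are separated by whitespace and URLs contain none.
    Any fields after the fourth are ignored.
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields: %r" % line)
    return tuple(fields[:4])


def forward_links(lines):
    """Map each originating document ID to the set of its destination IDs."""
    graph = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        src_id, _src_url, dst_id, _dst_url = parse_link_line(line)
        graph[src_id].add(dst_id)
    return graph
```

Since the file is already grouped by the site ID of the originating page,
one file at a time can be loaded this way without reading the whole list.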
5. "linklist.in"
~~~~~~~~~~~~~~~~~
This is a list of backward links between pages contained in the "doclist".
Each list item consists of two pairs of a document ID and a URL: the
first pair is for the destination page and the second for the originating
page.
The linklist.in files are split into blocks of 10,000 site IDs. Each
file name is the first three characters of the site IDs of the
destination pages followed by "xxxx.inlink" (e.g. the links pointing to
pages in sites with IDs from 1230000 to 1239999 are stored in the file
"123xxxx.inlink").
-- sample of the linklist files: 073xxxx.inlink
0730011_0000001 http://www.barnes.co.jp/ 1852829_0000024 http://www.semiconbrain.com/50/ni.htm a,href
0730011_0000002 http://www.barnes.co.jp/Dew.htm 0730011_0000001 http://www.barnes.co.jp/ a,href
0730011_0000002 http://www.barnes.co.jp/Dew.htm 0730011_0000002 http://www.barnes.co.jp/Dew.htm a,href
0730011_0000002 http://www.barnes.co.jp/Dew.htm 0730011_0000003 http://www.barnes.co.jp/Ene-Pow.htm a,href
0730011_0000002 http://www.barnes.co.jp/Dew.htm 0730011_0000004 http://www.barnes.co.jp/IR.htm a,href
(snip)
0739990_0000040 http://www.becker-japan.net/totop.html 0739990_0000028 http://www.becker-japan.net/rvolc.html a,href
0739990_0000040 http://www.becker-japan.net/totop.html 0739990_0000029 http://www.becker-japan.net/rvolvp.html a,href
0739990_0000040 http://www.becker-japan.net/totop.html 0739990_0000030 http://www.becker-japan.net/scb.html a,href
0739990_0000040 http://www.becker-japan.net/totop.html 0739990_0000031 http://www.becker-japan.net/scvp.html a,href
0739990_0000040 http://www.becker-japan.net/totop.html 0739990_0000039 http://www.becker-japan.net/toride.html a,href
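
One possible use of this list is counting the backward links of each page,
sketched below in Python. As with the forward list, whitespace-separated
fields are an assumption. The sample lines end with a fifth field such as
"a,href" that this description does not define, so the sketch only keeps it
as an opaque string. The names parse_inlink_line and inlink_counts are
hypothetical.

```python
from collections import Counter


def parse_inlink_line(line):
    """Return (dst_id, dst_url, src_id, src_url, extra) from one line.

    Assumptions: whitespace-separated fields, no whitespace inside URLs.
    `extra` is the optional fifth field (e.g. "a,href") or None.
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields: %r" % line)
    extra = fields[4] if len(fields) > 4 else None
    return fields[0], fields[1], fields[2], fields[3], extra


def inlink_counts(lines, skip_self_links=True):
    """Count backward links per destination document ID."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        dst_id, _dst_url, src_id, _src_url, _extra = parse_inlink_line(line)
        if skip_self_links and dst_id == src_id:
            continue  # a page linking to itself, as in the sample above
        counts[dst_id] += 1
    return counts
```

Skipping self-links is a design choice of this sketch, not something the
collection prescribes; pass skip_self_links=False to count every line.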
6. "anclist.out"
~~~~~~~~~~~~~~~~~
This is a list of anchor texts attached to forward links between pages
contained in the "doclist".
Each list item consists of the document ID of the originating page, the
document ID of the destination page and an anchor text.
The anclist.out files are split into blocks of 10,000 site IDs. Each
file name is the first three characters of the site IDs of the
originating pages followed by "xxxx.outlink" (e.g. the anchor texts of
links originating from pages in sites with IDs from 1230000 to 1239999
are stored in the file "123xxxx.outlink").
-- sample of the anclist files: 073xxxx.outlink
0730011_0000001 0730011_0000001 ボタン
0730011_0000001 0730011_0000002 露点温度測定器
0730011_0000001 0730011_0000003 光パワー/光エネルギー測定機器
0730011_0000001 0730011_0000004 赤外線応用製品
0730011_0000001 0730011_0000005 ボタン
(snip)
0739990_0000038 0739990_0000033 Asia
0739990_0000038 0739990_0000034 Europe
0739990_0000038 0739990_0000035 Japan
0739990_0000038 0739990_0000036 Oceania
0739990_0000038 0739990_0000037 U.S.A
7. "anclist.in"
~~~~~~~~~~~~~~~~~
This is a list of anchor texts attached to backward links between pages
contained in the "doclist".
Each list item consists of the document ID of the destination page, the
document ID of the originating page and an anchor text.
The anclist.in files are split into blocks of 10,000 site IDs. Each
file name is the first three characters of the site IDs of the
destination pages followed by "xxxx.inlink" (e.g. the anchor texts of
links pointing to pages in sites with IDs from 1230000 to 1239999 are
stored in the file "123xxxx.inlink").
-- sample of the anclist files: 073xxxx.inlink
0730011_0000001 0730011_0000011 Top
0730011_0000001 1852829_0000024 http://www.barnes.co.jp
0730011_0000001 0730011_0000002 ホームページ
0730011_0000001 0730011_0000003 ホームページ
0730011_0000001 0730011_0000004 ホームページ
(snip)
0739990_0000038 0739990_0000020 サービス網
0739990_0000038 0739990_0000021 サービス網
0739990_0000039 0739990_0000023 写 真
0739990_0000040 0739990_0000020 トップページへ
0739990_0000040 0739990_0000021 トップページへ
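
The two anchor-text lists differ only in the order of their first two
columns, which the Python sketch below handles with a single parser. It
assumes that the lines have already been decoded to text, that whitespace
separates the two document IDs, and that the rest of the line is the anchor
text (which may itself contain spaces). The names parse_anchor_line and
anchors_by_destination are invented for this example.

```python
from collections import defaultdict


def parse_anchor_line(line, direction="in"):
    """Return (src_id, dst_id, anchor_text) from one anclist line.

    direction="in":  columns are destination ID, originating ID, text.
    direction="out": columns are originating ID, destination ID, text.
    """
    if direction not in ("in", "out"):
        raise ValueError("direction must be 'in' or 'out'")
    # Split off the two IDs only; keep spaces inside the anchor text.
    parts = line.rstrip("\r\n").split(None, 2)
    if len(parts) < 2:
        raise ValueError("expected two document IDs: %r" % line)
    first, second = parts[0], parts[1]
    text = parts[2] if len(parts) > 2 else ""
    if direction == "in":
        return second, first, text
    return first, second, text


def anchors_by_destination(lines, direction="in"):
    """Collect the anchor texts that point at each destination page."""
    texts = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _src_id, dst_id, text = parse_anchor_line(line, direction)
        texts[dst_id].append(text)
    return texts
```

Grouping by destination in this way gives, for each page, the texts that
other pages use to describe it.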
8. "raw" document data
~~~~~~~~~~~~~~~~~~~~~~
This is the set of web data as crawled from the sites in the "sitelist".
The directory structure and the naming scheme are as follows:
- The name of the first-level subdirectory is the first three
characters of the site ID.
- The name of the second-level subdirectory is the concatenation
of the fourth and fifth characters of the site ID and "xx".
- The name of the third-level subdirectory is the site ID itself
(seven digits).
- The name of the fourth-level subdirectory is the first three
characters of the page ID.
- The name of the fifth-level subdirectory is the concatenation
of the fourth and fifth characters of the page ID and "xx".
The page data crawled from each site are stored in the corresponding
fifth-level subdirectories.
The file name of each page is the concatenation of the site ID, "_",
the page ID and the extension ".dat".
e.g. site ID: 1234567, page ID: 0000123
==> file path: raw/123/45xx/1234567/000/01xx/1234567_0000123.dat
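
The naming scheme above can be written as a small Python function. This is
a sketch only: the function name raw_path and its root parameter are
assumptions made for illustration, and the path is built with forward
slashes as in the example above.

```python
import posixpath


def raw_path(site_id, page_id, root="raw"):
    """Build the path of a raw page file from its site ID and page ID."""
    for label, value in (("site ID", site_id), ("page ID", page_id)):
        if len(value) != 7 or not value.isdigit():
            raise ValueError("unexpected %s: %r" % (label, value))
    return posixpath.join(
        root,
        site_id[:3],                       # first level
        site_id[3:5] + "xx",               # second level
        site_id,                           # third level
        page_id[:3],                       # fourth level
        page_id[3:5] + "xx",               # fifth level
        "%s_%s.dat" % (site_id, page_id),  # file name
    )
```

With the values from the example, raw_path("1234567", "0000123") yields
"raw/123/45xx/1234567/000/01xx/1234567_0000123.dat".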
9. "euc" document data
~~~~~~~~~~~~~~~~~~~~~~~
This is the set of web data produced from the "raw" data by converting
Japanese two-byte characters to EUC code.
The directory structure and the naming scheme are the same as those of the
"raw" document data, except that the extension is ".euc".
e.g. site ID: 1234567, page ID: 0000123
==> file path: euc/123/45xx/1234567/000/01xx/1234567_0000123.euc
10. "cook" document data
~~~~~~~~~~~~~~~~~~~~~~~~
This is the set of web data produced from the "euc" data by removing
unnecessary tags and similar material, according to the rules listed below.
The directory structure and the naming scheme are the same as those of the
"raw" document data, except that the extension is ".cooked".
e.g. site ID: 1234567, page ID: 0000123
==> file path: cook/123/45xx/1234567/000/01xx/1234567_0000123.cooked
The web page data were processed with the following rules:
(1) HTML comments, XML declarations and XML definitions are removed.
(2) tag pairs of "" and their contents are removed.
(3) For each "<meta>" tag, if the value of the "name" attribute
is either "keywords" or "description", the value of the "content"
attribute is output on a single line prefixed with "".
e.g.
==> information retrieval, test collection
(4) For each "<img>" tag, the value of the "alt" attribute is
output on a single line prefixed with "".
(5) All the other tags are simply removed.
(6) Character code entity references are removed (e.g. &#2345; &#685;).
(7) Character entity references are replaced as follows:
&amp; ==> &
&lt; ==> <
&gt; ==> >
&nbsp; ==> ' '
&quot; ==> '"'
&Alpha; - &Omega; ==> corresponding Greek upper case letters in EUC
&alpha; - &omega; ==> corresponding Greek lower case letters in EUC
letters with diacritical marks
==> corresponding letters without diacritical marks
&AElig; ==> AE
&ETH; ==> ETH
&szlig; ==> ss
&aelig; ==> ae
&eth; ==> eth
Others ==> single space character (' ').
(8) Consecutive tabs and spaces are replaced with a single space
character (' ').
(9) Empty lines and lines containing only tabs and spaces are removed.
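
To make rules (6) to (9) more concrete, the following Python sketch applies
them to a piece of text. It is not the program that produced the "cook"
data and makes several assumptions: the rules are applied in the listed
order, a lone tab is also turned into a space, diacritical marks are
recognized from common entity-name suffixes such as "acute", and Greek
letters are returned as Unicode characters to be encoded when written.
Rules (1) to (5) on tags are not covered. The names clean_text and
_replace_named are hypothetical.

```python
import re
from html.entities import name2codepoint

# Explicit replacements listed under rule (7).
_FIXED = {
    "amp": "&", "lt": "<", "gt": ">", "nbsp": " ", "quot": '"',
    "AElig": "AE", "ETH": "ETH", "szlig": "ss", "aelig": "ae", "eth": "eth",
}
# Latin letters with diacritical marks, e.g. "eacute" -> "e" (assumed list).
_DIACRITIC = re.compile(
    r"([A-Za-z])(?:acute|grave|circ|tilde|uml|ring|cedil|slash)$")


def _replace_named(match):
    """Replacement text for one named character entity reference."""
    name = match.group(1)
    if name in _FIXED:
        return _FIXED[name]
    base = _DIACRITIC.match(name)
    if base:
        return base.group(1)
    codepoint = name2codepoint.get(name, 0)
    if 0x391 <= codepoint <= 0x3A9 or 0x3B1 <= codepoint <= 0x3C9:
        return chr(codepoint)  # Greek letter
    return " "  # every other named reference


def clean_text(text):
    """Apply rules (6) to (9) to already tag-free text."""
    # Rule (6): drop numeric character references.
    text = re.sub(r"&#(?:[0-9]+|[xX][0-9A-Fa-f]+);", "", text)
    # Rule (7): replace named references in a single pass.
    text = re.sub(r"&([A-Za-z][A-Za-z0-9]*);", _replace_named, text)
    kept = []
    for line in text.splitlines():
        line = re.sub(r"[ \t]+", " ", line)  # rule (8)
        if line.strip(" \t"):                # rule (9)
            kept.append(line)
    return "\n".join(kept)
```

Replacing all named references in one pass means that text such as
"&amp;lt;" becomes "&lt;" and is not decoded a second time.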
11. "mecab" document data
~~~~~~~~~~~~~~~~~~~~~~~~~
This is the set of web data produced from the "cook" data by applying
the Japanese morphological analyzer MeCab.
The directory structure and the naming scheme are the same as those of the
"raw" document data, except that the extension is ".mecab".
e.g. site ID: 1234567, page ID: 0000123
==> file path: mecab/123/45xx/1234567/000/01xx/1234567_0000123.mecab
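
Because the four document data directories share one layout and differ only
in the top-level directory name and the file extension, a path in one of
them can be mapped to its counterpart in another. The Python sketch below
shows one way to do this; the names EXTENSIONS and sibling_path are
assumptions for illustration, and paths are expected to use forward
slashes.

```python
import posixpath

# Top-level directory -> file extension, as given in sections 8 to 11.
EXTENSIONS = {
    "raw": ".dat",
    "euc": ".euc",
    "cook": ".cooked",
    "mecab": ".mecab",
}


def sibling_path(path, target):
    """Map a document path in one data directory to the `target` one."""
    if target not in EXTENSIONS:
        raise ValueError("unknown data directory: %r" % target)
    parts = path.split("/")
    # A document path ends with: top directory, five levels, file name.
    if len(parts) < 7 or parts[-7] not in EXTENSIONS:
        raise ValueError("not a document data path: %r" % path)
    stem, _old_extension = posixpath.splitext(parts[-1])
    parts[-7] = target
    parts[-1] = stem + EXTENSIONS[target]
    return "/".join(parts)
```

For instance, the "raw" example path from section 8 maps to the "mecab"
example path shown above when target is "mecab".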