DfromHtml - Extract D-records from HTML files

SYNOPSIS

DfromHtml [ -p Xpath ] [ options ] [ input-file.. ]

DESCRIPTION

DfromHtml extracts portions of HTML input files and converts them to D-records. Typically, it is used to derive URL references from an HTML file, extract metadata from the header, or convert HTML tables into a D-file.

To select the elements to convert, use the "-p xpath" option. Xpath is a language for addressing nodes of an XML document. HTML is not strictly a subset of XML, but it can be represented by the same tree structure, so its nodes can be addressed by xpath expressions.

The default "-p" option is "//a/@href/..", which selects all anchor elements that have an href attribute. In an xpath expression, "//" means an arbitrary number of node levels, "@" means an attribute node, and ".." means the parent node.
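As a rough illustration of what the default expression selects, the following Python sketch uses the standard xml.etree.ElementTree module on a small, hypothetical, well-formed sample. It is not how DfromHtml itself works. ElementTree supports only a subset of xpath, so "//a/@href/.." is written in the equivalent predicate form ".//a[@href]".

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample input (not taken from the D distribution).
SAMPLE = """<html><body>
<p><a href="../jpn/Dintro.html">Japanese</a></p>
<p><a name="top">no href here</a></p>
<p><a href="DfromLine.html"><b>DfromLine</b></a></p>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Every <a> element that carries an href attribute; the anchor that
# only has a name attribute is skipped.
anchors = root.findall(".//a[@href]")
hrefs = [a.get("href") for a in anchors]
print(hrefs)
```

The anchor without an href attribute is not selected, which is the point of going through "@href" and back up with "..".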

To convert an HTML table into D-records, "//tr" will work in most cases. It selects all the table row elements, so that table cells and table header cells become D-fields. If there is more than one table in the document, "//table[n]//tr", where n is the sequence number of the table at that level, may work. See the Tips for xpath usage subsections below.
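The row-to-record idea can be sketched in Python with the standard xml.etree.ElementTree module. The sample table is hypothetical, and the printed layout only approximates a D-record: the node field and attribute fields that DfromHtml emits are left out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample table.
SAMPLE = """<html><body><table>
<tr><th>Name</th><th>Role</th></tr>
<tr><td>Dgrep</td><td>select records</td></tr>
<tr><td>Dpr</td><td>print fields</td></tr>
</table></body></html>"""

root = ET.fromstring(SAMPLE)

records = []
for tr in root.findall(".//tr"):
    # One record per row; each child cell becomes a (name, value) field.
    fields = [(cell.tag, "".join(cell.itertext()).strip()) for cell in tr]
    records.append(fields)

# Print the records as "name:value" lines separated by blank lines.
for fields in records:
    for name, value in fields:
        print(f"{name}:{value}")
    print()
```

Each row yields one record, and the header row is distinguished only by its field name (th rather than td).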

To select the title from the HTML header, you can use "/html/head/title". Similarly, "/html/head/meta" selects metadata from the header.

D-record output

In the case of default xpath, output is something like

node:/html/body/p[1]/a
@href:../jpn/Dintro.html
text:Japanese

node:/html/body/table[1]/tbody/tr[2]/td[1]/a
@href:DfromLine.html
b:DfromLine

node:/html/body/address/a
@href:mailto:miyazawa@nii.ac.jp
img:
@src:logo.gif

An output record is made from each selected element. The "node" field at the top of each D-record shows the xpath expression that identifies the selected element. The following fields, whose names start with the COMMERCIAL AT (@) character, are made from the attributes of the selected element: the field name is @attribute-name and the field value is the attribute value. Text of the selected element gets the field name "text".

Elements directly under the selected element are also converted into separate D-fields. The field name is the element name and the field value is the aggregation of the texts and elements under that element. The attributes of that element are converted into separate D-fields with @attribute-name as the field name and the attribute value as the field value. They are placed after the element field (e.g., the @src field after the img field in the example above).
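The field ordering described above can be sketched as a small Python function. This is only an approximation written for illustration, assuming that the node path is supplied by the caller and that only the text directly before the first child counts as "text"; the function name to_fields is hypothetical and is not part of D.

```python
import xml.etree.ElementTree as ET


def to_fields(elem, node_path):
    """Return (name, value) pairs in the order described in the text."""
    fields = [("node", node_path)]
    # Attributes of the selected element come first, as @name fields.
    fields += [("@" + name, value) for name, value in elem.attrib.items()]
    # Direct text of the selected element, if any.
    if elem.text and elem.text.strip():
        fields.append(("text", elem.text.strip()))
    # Each direct child becomes a field, followed by its own attributes.
    for child in elem:
        fields.append((child.tag, "".join(child.itertext()).strip()))
        fields += [("@" + name, value) for name, value in child.attrib.items()]
    return fields


anchor = ET.fromstring(
    '<a href="mailto:miyazawa@nii.ac.jp"><img src="logo.gif"/></a>'
)
fields = to_fields(anchor, "/html/body/address/a")
for name, value in fields:
    print(f"{name}:{value}")
```

For this input the sketch reproduces the third record of the example above: node, @href, an empty img field, and then @src.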

Raw mode output

By default, tags are stripped and character entity references (e.g. &lt;) are converted to normal characters (e.g. <) in the output fields. When the -r option is given, tags and entity references are preserved.

For example, when the input file is

<p><b>Bold</b> &amp; <i>Italic</i>.</p>

and this <p> element becomes a D-field, normal output is:

p:Bold & Italic.

But, in raw mode output, it becomes:

p:<b>Bold</b> &amp; <i>Italic</i>.
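The difference between the two modes can be mimicked in Python with the standard xml.etree.ElementTree module. This is a sketch of the effect only, under the assumption that the input fragment is well-formed; DfromHtml's own implementation is not shown in this manual.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p><b>Bold</b> &amp; <i>Italic</i>.</p>")

# Normal mode: drop the tags and keep the text. The parser has
# already decoded &amp; into a plain ampersand.
normal = "".join(p.itertext())

# Raw mode: keep the inner markup. Serializing each child also
# serializes the text that follows it, with & escaped again.
raw = (p.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in p
)

print("p:" + normal)
print("p:" + raw)
```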

Character set and locale

DfromHtml can handle most encodings. Usually, it detects the character encoding of the input file. But when the input file does not have a proper character encoding designation, you may have to specify the encoding with the -e option. On most UNIX systems, the available encoding names are listed by the command iconv -l. For Windows, see the Encoding name list.

The output D-file is encoded in the current locale character set, unless the UTF I/O feature is invoked for the output encoding. When output D-records are produced, any character not in the locale character set is converted to a QUESTION MARK (?) character. (This does not happen with UTF output.)
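The replacement behaviour can be pictured with Python's codecs. This is an analogy only: ASCII stands in here for a narrow locale character set, and the sample string is hypothetical.

```python
# Hypothetical field containing characters outside ASCII.
text = "title:Café 日本語"

# Narrow character set: each character that cannot be encoded
# becomes a single QUESTION MARK.
ascii_out = text.encode("ascii", errors="replace").decode("ascii")

# UTF output: every character survives the round trip.
utf8_out = text.encode("utf-8").decode("utf-8")

print(ascii_out)
print(utf8_out)
```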

Tips for xpath usage (1)

First of all, you have to make sure that the content you want is actually in the input file. Sometimes, what you see in a browser is not what the browser downloads.

Generally, you don't know how the source HTML file is composed. The node field of the output can help you find the xpath expression you need. You may try

DfromHtml -p "/html/body/*"

This command produces D-records from the first-level elements of the body, for example:

node:/html/body/div[1]
text:
table:
@width:100%

node:/html/body/div[2]
text:
table:Home >
@width:100%

node:/html/body/table
@width:100%
tbody:D is a series of commands to perform .....

node:/html/body/hr
hr:

......

Looking at this, you find that most of the content is in the third element, a table. Then, you try

DfromHtml -p "/html/body/table/*"

In the case of a table, of course, most of the content is in the tbody. So, you try next:

DfromHtml -p "/html/body/table/tbody/*"

Repeating this process, you will get to an xpath expression you need.

Note that HTML tags are case-insensitive. You can use either /TBODY or /tbody, though field names and node field values in the output D-file appear in lower case.

Tips for xpath usage (2)

If the part you want to extract contains a characteristic word, you can use other D-commands to find the xpath. For example, suppose the part you want contains the word "Examples", and the command

DfromHtml -p "//*" input-file \
| Dgrep "Examples" \
| Dpr -p node

produces the following output.

rec# node
 1 /html
 2 /html/body
 3 /html/body/table[5]
 4 /html/body/table[5]/tr[16]
 5 /html/body/table[5]/tr[16]/td[3]
 6 /html/body/table[5]/tr[43]
 7 /html/body/table[5]/tr[43]/td[3]
 8 /html/body/table[5]/tr[52]
 9 /html/body/table[5]/tr[52]/td[3]
10 /html/body/table[5]/tr[81]
11 /html/body/table[5]/tr[81]/td[3]

The xpath expression "//*" extracts all the elements from the input. The second D-command, Dgrep, selects the D-records that contain the word "Examples" somewhere. The third D-command, Dpr, prints the node field of the input records.

From this output, you can see that

-p "/html/body/table[5]/tr"

may be the xpath you need.
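The same search can be sketched in Python with the standard xml.etree.ElementTree module. The sample document and the helper node_paths are hypothetical, and the path format (a position index only where sibling names collide) is an assumption modelled on the node values shown in this manual.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample input.
SAMPLE = """<html><body>
<p>Intro</p>
<table>
<tr><td>Name</td><td>Usage</td></tr>
<tr><td>Dgrep</td><td>Examples of selection</td></tr>
</table>
</body></html>"""


def node_paths(elem, path):
    """Yield (path, element) for elem and all of its descendants."""
    yield path, elem
    children = list(elem)
    for child in children:
        same = [c for c in children if c.tag == child.tag]
        step = child.tag
        # Add a position index only when sibling names collide.
        if len(same) > 1:
            step += f"[{same.index(child) + 1}]"
        yield from node_paths(child, f"{path}/{step}")


root = ET.fromstring(SAMPLE)

# Keep the path of every element whose text content has the word.
hits = [
    path
    for path, elem in node_paths(root, "/" + root.tag)
    if "Examples" in "".join(elem.itertext())
]
for number, path in enumerate(hits, start=1):
    print(f"{number:2d} {path}")
```

As in the Dgrep output, every ancestor of a matching cell is listed too, so the deepest paths are the ones worth reading.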

OPTIONS

-p xpath
Xpath expression for the element selection. Default is "//a/@href/..", which selects all anchor elements that have an href attribute. Xpath for HTML is case-insensitive.
-e
Encoding of the input file. See Character set and locale subsection.
-F
adds "filename" field to each output record.
-n newline-replacement
gives the character string that replaces the newline character. If this option is not given, the replacement is the NULL string (the newline character is simply removed).
-r
raw mode output; see Raw mode output.
-D odatautf=8|16|32
UTF I/O feature (see manual page of UTF I/O feature.)
idatautf is ignored.

FIXED NAME FIELDS

filename:
at the top of each record, when -F is specified; the value is the input file name as given in the command arguments after globbing by the shell. This field is not added when the input is the standard input.
node:
at the top of other fields, after filename field if any; xpath expression to identify the element.
text:
direct texts under the selected element.
comment:
when the selected element has comment directly under it.

EXAMPLES

The simplest use of DfromHtml extracts hypertext links with their texts and attributes.

DfromHtml foo.html

Convert an HTML table (the first one) into D-records, with a table row as a D-record.

DfromHtml -p "/html/body/table[1]/tbody/tr" foo.html

Extract titles with filenames.

DfromHtml -F -p "/html/head/title" *.html

Convert the option list in this manual (it is the first <dl> element) to a D-record.

DfromHtml -p "/html/body/dl[1]" DfromHtml.html

Extract <img> elements. In the output, the @src field can be used as an image file list.

DfromHtml -p //img bar.html

ENVIRONMENT

Dodatautf
for UTF I/O feature.
Didatautf is ignored.

DIAGNOSTICS

See the manual of D_msg.

There are some messages from HTML parser and XPATH processor. Please consult libxml2 manuals for these messages.

SEE ALSO

Dintro, DfromXml, DtoXml, D_msg.

AUTHOR

MIYAZAWA Akira


miyazawa@nii.ac.jp
2006