DfromHtml : D-2.6

DESCRIPTION

DfromHtml extracts portions of HTML input files and converts them to D-records. Typically, it is used to derive URL references from an HTML file, extract metadata from the header, or convert HTML tables into a D-file.

To select the elements to convert, use the "-p xpath" option. Xpath is a language for addressing nodes of an XML document. HTML is not strictly a subset of XML, but it can be represented by the same tree structure, so its nodes can be addressed by xpath expressions.

The default "-p" option is "//a/@href/..", which selects all anchor elements that have an href attribute. In an xpath expression, "//" means an arbitrary number of node levels, "@" means an attribute node, and ".." means the parent node.
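The selection made by the default expression can be illustrated outside of DfromHtml. The following is a minimal Python sketch, not DfromHtml's implementation. It uses the standard xml.etree.ElementTree module, which supports only a subset of xpath, so the equivalent predicate form ".//a[@href]" stands in for "//a/@href/..". The sample markup is hypothetical and well-formed XML, which real HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (an assumption for illustration).
SAMPLE = """<html><body>
<p><a href="intro.html">Intro</a> and <a name="top">an anchor without href</a></p>
<div><a href="mailto:someone@example.org">Mail</a></div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# ".//a[@href]" picks every <a> element, at any depth, that has an href
# attribute. It selects the same elements as "//a/@href/..".
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]
print(hrefs)
```

The anchor that has only a name attribute is skipped, which is the effect of going through "@href" before stepping back to the parent with "..".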

To convert an HTML table into D-records, "//tr" works in most cases. It selects all the table row elements, so that table data cells and table header cells become D-fields. If there is more than one table in the document, "//table[n]//tr", where n is the sequence number of the table at that level, may work. See the "Tips for xpath usage" subsections below.
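To see what "//table[n]//tr" picks out, here is a small Python sketch using the standard xml.etree.ElementTree module. It only imitates the selection and is not DfromHtml itself. The sample table and the (name, value) pair layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with two sibling tables.
SAMPLE = """<html><body>
<table><tr><td>ignored</td></tr></table>
<table>
  <tr><th>name</th><th>size</th></tr>
  <tr><td>alpha</td><td>10</td></tr>
  <tr><td>beta</td><td>20</td></tr>
</table>
</body></html>"""

root = ET.fromstring(SAMPLE)

# "table[2]" is the second <table> among its siblings; "//tr" then
# reaches every row below it, at any depth.
rows = root.findall(".//table[2]//tr")

# One list of (cell tag, cell text) pairs per selected row.
records = [[(cell.tag, "".join(cell.itertext()).strip()) for cell in row]
           for row in rows]

for record in records:
    print(record)
```

The row of the first table is not selected, because the position predicate [2] restricts the match to the second table.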

To select the title from the HTML header, you can use "/html/head/title". Similarly, "/html/head/meta" selects metadata from the header.

D-record output

With the default xpath, the output looks something like this:

node:/html/body/p[1]/a
@href:../jpn/Dintro.html
text:Japanese

node:/html/body/table[1]/tbody/tr[2]/td[1]/a
@href:DfromLine.html
b:DfromLine

node:/html/body/address/a
@href:mailto:miyazawa@nii.ac.jp
img:
@src:logo.gif

An output record is made from each selected element. The "node" field at the top of each D-record shows the xpath expression that identifies the selected element. The following fields, whose names start with a COMMERCIAL AT (@) character, are made from the attributes of the selected element: the field name is @attribute-name and the field value is the attribute value. Text of the selected element gets the field name "text".

Elements directly under the selected element are also converted into separate D-fields. The field name is the element name, and the field value is the aggregation of the text and elements under that element. The attributes of that element are converted into separate D-fields, with @attribute-name as the field name and the attribute value as the field value. They are placed after the element field (e.g., the @src field after the img field in the example above).
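The field rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DfromHtml's actual code: the function name to_fields is hypothetical, the node field is left out, a "text" field is emitted only when the element has non-blank text of its own, and text that follows a child element is ignored.

```python
import xml.etree.ElementTree as ET

def to_fields(elem):
    """Return (name, value) pairs for one selected element (node field omitted)."""
    # Attributes of the selected element come first, as @attribute-name fields.
    fields = [("@" + name, value) for name, value in elem.attrib.items()]
    # Text directly inside the selected element (assumption: skipped when blank).
    if elem.text and elem.text.strip():
        fields.append(("text", elem.text.strip()))
    for child in elem:
        # The child's value aggregates all text beneath it.
        fields.append((child.tag, "".join(child.itertext()).strip()))
        # The child's attributes follow the child's own field.
        fields.extend(("@" + name, value) for name, value in child.attrib.items())
    return fields

# Hypothetical sample elements.
link = ET.fromstring('<a href="DfromLine.html"><b>DfromLine</b></a>')
image = ET.fromstring('<a href="index.html"><img src="logo.gif"/></a>')

for name, value in to_fields(link) + to_fields(image):
    print(f"{name}:{value}")
```

For the second element, the img field is empty and its @src field comes right after it, matching the ordering described above.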

Raw mode output

By default, tags are stripped and character entity references (e.g. &lt;) are converted to normal characters (e.g. <) in the output fields. When the -r option is given, however, tags are preserved.

For example, when the input file is

<p><b>Bold</b> &amp; <i>Italic</i>.</p>

and this <p> element becomes a D-field, normal output is:

p:Bold & Italic.

But in raw mode output, it becomes:

p:<b>Bold</b> &amp; <i>Italic</i>.
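The default (non-raw) conversion amounts to two steps: strip the tags, then decode the character entity references. The Python sketch below imitates those two steps with the standard html and re modules. It is an approximation for illustration only; the regular expression is a crude tag matcher, the function name plain_value is hypothetical, and the sample fragment is an assumption.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")  # crude tag matcher, good enough for a sketch

def plain_value(fragment):
    """Strip tags, then decode character entity references."""
    return html.unescape(TAG.sub("", fragment))

# Hypothetical content of a selected element, with markup left in.
fragment = "<b>Bold</b> &amp; <i>Italic</i>."

print(plain_value(fragment))  # Bold & Italic.
```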

Character set and locale

DfromHtml can handle most encodings. Usually, it detects the character encoding of the input file. But when the input file does not have a proper character encoding declaration, you may have to specify the encoding with the -e option. Available encoding names are listed by the command iconv -l on most UNIX systems. For Windows, see Encoding name list.

The output D-file is encoded in the character set of the current locale, unless the UTF I/O feature is invoked for the output encoding. When producing output D-records, a character not in the locale character set is converted to a QUESTION MARK (?) character. (This does not happen when using UTF output.)
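The question-mark substitution can be reproduced with Python's own codecs, whose "replace" error handler behaves in a similar way. This is only an analogy, not DfromHtml's code, and the encoding names used here are Python's, which may differ from the names accepted by the -e option.

```python
text = "café 東京"

# Each character missing from the target character set becomes "?".
as_latin1 = text.encode("latin-1", errors="replace").decode("latin-1")
as_ascii = text.encode("ascii", errors="replace").decode("ascii")

# A UTF encoding can represent every character, so nothing is replaced.
as_utf8 = text.encode("utf-8").decode("utf-8")

print(as_latin1)  # café ??
print(as_ascii)   # caf? ??
print(as_utf8)    # café 東京
```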

Tips for xpath usage (1)

First of all, make sure that the content you want is actually in the input file. Sometimes, what you see in a browser may not be what you download with it.

Generally, you don't know how the source HTML file is composed. The node field of the output can help you find the xpath expression you need. You may try

DfromHtml -p "/html/body/*"

This command produces D-records from the first-level elements of the body, for example:

node:/html/body/div[1]
text:
table:
@width:100%

node:/html/body/div[2]
text:
table:Home >
@width:100%

node:/html/body/table
@width:100%
tbody:D is a series of commands to perform .....

node:/html/body/hr
hr:

......

Looking at this, you find that most of the content is in the third element, a table. Then, you try

DfromHtml -p "/html/body/table/*"

In the case of a table, of course, most of the content is in the tbody. So, you try next:

DfromHtml -p "/html/body/table/tbody/*"

Repeating this process, you will get to an xpath expression you need.
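The same drill-down can be tried on a small document in Python with the standard xml.etree.ElementTree module. This sketch only mirrors the idea of descending one level at a time; the sample markup and the helper name child_tags are assumptions, and the relative "./body/*" form is used because ElementTree searches from the root element rather than from the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
SAMPLE = """<html><body>
<div><table width="100%"><tr><td>Home</td></tr></table></div>
<table width="100%"><tbody><tr><td>Main content</td></tr></tbody></table>
<hr/>
</body></html>"""

root = ET.fromstring(SAMPLE)

def child_tags(path):
    """Return the tag names of the elements selected by path."""
    return [elem.tag for elem in root.findall(path)]

# Descend one level at a time, as in the commands above.
level1 = child_tags("./body/*")              # ['div', 'table', 'hr']
level2 = child_tags("./body/table/*")        # ['tbody']
level3 = child_tags("./body/table/tbody/*")  # ['tr']

print(level1, level2, level3)
```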

Note that HTML tags are case-insensitive. You can use either /TBODY or /tbody, though field names and node field values in the output D-file appear in lower case.

Tips for xpath usage (2)

If the part you want to extract contains some characteristic word, you can use other D-commands to find the xpath. For example, suppose the part you want contains the word "Examples", and the command

DfromHtml -p "//*" input-file \
| Dgrep "Examples" \
| Dpr -p node

produces the following output.

rec#  node
1     /html
2     /html/body
3     /html/body/table[5]
4     /html/body/table[5]/tr[16]
5     /html/body/table[5]/tr[16]/td[3]
6     /html/body/table[5]/tr[43]
7     /html/body/table[5]/tr[43]/td[3]
8     /html/body/table[5]/tr[52]
9     /html/body/table[5]/tr[52]/td[3]
10    /html/body/table[5]/tr[81]
11    /html/body/table[5]/tr[81]/td[3]

The xpath expression "//*" extracts all the elements from the input. The second D-command, Dgrep, selects the D-records that contain the word "Examples" somewhere. The third D-command, Dpr, prints the node field of the input records.

From this output, you can see that

-p "/html/body/table[5]/tr"

may be the xpath you need.
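The search for a characteristic word can also be imitated in Python, which may help in understanding why a matching word shows up at every level from /html down to the cell that holds it. This is a sketch under assumptions, not a replacement for the D-commands: the sample document is hypothetical, the helper name walk is made up, and the rule for adding an index (only when siblings share a tag name) is inferred from the sample outputs above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def walk(elem, path):
    """Yield (path, element) for elem and all of its descendants."""
    yield path, elem
    totals = Counter(child.tag for child in elem)
    seen = Counter()
    for child in elem:
        seen[child.tag] += 1
        # Assumption: add an index only when siblings share the tag name.
        index = f"[{seen[child.tag]}]" if totals[child.tag] > 1 else ""
        yield from walk(child, f"{path}/{child.tag}{index}")

# Hypothetical, well-formed sample document.
SAMPLE = """<html><body>
<table><tr><td>Usage</td></tr></table>
<table>
  <tr><td>1</td><td>Options</td></tr>
  <tr><td>2</td><td>Examples</td></tr>
</table>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Keep the path of every element whose aggregated text contains the word.
hits = [path for path, elem in walk(root, "/" + root.tag)
        if "Examples" in "".join(elem.itertext())]

for path in hits:
    print(path)
```

Every ancestor of the matching cell is listed, because each ancestor's aggregated text contains the word too. The longest paths point at the rows and cells of interest.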

DfromHtml - Extract D-records from Html file
