DfromHtml [ -p Xpath ] [ options ] [ input-file.. ]
DfromHtml extracts portions of HTML input files and converts them to D-records. Typically, it is used to derive URL references from an HTML file, extract metadata from the header, or convert HTML tables into a D-file.
To select the elements to convert, use the "-p xpath" option. Xpath is a language for addressing nodes of an XML document. HTML is not strictly a subset of XML, but it can be represented by the same tree structure, so its nodes can be addressed by xpath expressions.
The default "-p" option is "//a/@href/..", which selects all anchor elements that have an href attribute. In an xpath expression, "//" means any number of node levels, "@" means an attribute node, and ".." means the parent node.
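The effect of the default expression can be tried outside DfromHtml. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, not DfromHtml itself. The sample document is hypothetical, and because ElementTree supports only a subset of xpath and cannot select attribute nodes, ".//a[@href]" is used as a stand-in for "//a/@href/..".

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (not taken from the manual).
SAMPLE = """<html><body>
<p><a href="../jpn/Dintro.html">Japanese</a></p>
<p><a name="top">No link here</a></p>
<address><a href="mailto:someone@example.org">Mail</a></address>
</body></html>"""

root = ET.fromstring(SAMPLE)

# ".//a[@href]" picks the <a> elements that carry an href attribute,
# which is the same set of elements "//a/@href/.." addresses.
anchors = root.findall(".//a[@href]")
hrefs = [a.get("href") for a in anchors]
print(hrefs)
```

The anchor that has only a name attribute is not selected, which is the point of going through "@href" and back up with "..".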
To convert an HTML table into D-records, "//tr" works in most cases. It selects all the table row elements, so that each row becomes a D-record and its table data cells or table header cells become D-fields. If there is more than one table in the document, "//table[n]//tr", where n is the sequence number of the table at that level, may work. See the tips for xpath usage below.
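To see what "//table[n]//tr" addresses, here is a small sketch with Python's standard xml.etree.ElementTree module. It only illustrates the xpath selection, not DfromHtml's own processing; the sample document is hypothetical, and ElementTree needs the leading "." to make the path relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with two tables at the same level.
SAMPLE = """<html><body>
<table><tr><td>menu</td></tr></table>
<table>
<tr><th>Command</th><th>Purpose</th></tr>
<tr><td>Dgrep</td><td>select records</td></tr>
</table>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Select the rows of the second table only.
rows = root.findall(".//table[2]//tr")

# One list per row; each cell is kept as (element name, text),
# mirroring the idea that a row is a record and a cell is a field.
cells = [[(c.tag, "".join(c.itertext())) for c in row] for row in rows]
for row in cells:
    print(row)
```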
To select the title from the HTML header, you can use "/html/head/title". Similarly, "/html/head/meta" selects metadata from the header.
With the default xpath, the output looks something like
node:/html/body/p[1]/a
@href:../jpn/Dintro.html
text:Japanese
node:/html/body/table[1]/tbody/tr[2]/td[1]/a
@href:DfromLine.html
b:DfromLine
node:/html/body/address/a
@href:mailto:miyazawa@nii.ac.jp
img:
@src:logo.gif
An output record is made from each selected element. The "node" field at the top of each D-record shows the xpath expression that identifies the selected element. The following fields, whose names start with a COMMERCIAL AT (@) character, are made from the attributes of the selected element: the field name is @attribute-name and the field value is the attribute value. Text of the selected element gets the field name "text".
Elements directly under the selected element are also converted into separate D-fields. The field name is the element name and the field value is the aggregation of the texts and elements under that element. The attributes of that element are converted into separate D-fields with @attribute-name as the field name and the attribute value as the field value. They are placed after the element field (e.g., the @src field after the img field in the example above).
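The record layout described above can be imitated roughly as follows. This is an illustrative sketch in Python, not DfromHtml's actual implementation: the function name to_record is hypothetical, the node path is passed in by hand, and details such as whitespace handling and the treatment of text that follows a child element are assumptions.

```python
import xml.etree.ElementTree as ET

def to_record(elem, node_path):
    """Return D-record-like lines for one selected element (illustrative only)."""
    lines = ["node:" + node_path]
    # Attributes of the selected element become @name fields.
    for name, value in elem.attrib.items():
        lines.append("@%s:%s" % (name, value))
    # Text directly inside the selected element becomes the "text" field.
    if elem.text and elem.text.strip():
        lines.append("text:" + elem.text.strip())
    # Each direct child becomes a field named after the child element,
    # followed by that child's own attributes.
    for child in elem:
        lines.append("%s:%s" % (child.tag, "".join(child.itertext()).strip()))
        for name, value in child.attrib.items():
            lines.append("@%s:%s" % (name, value))
    return lines

# Hypothetical input modelled on the last record of the example output.
anchor = ET.fromstring('<a href="mailto:someone@example.org"><img src="logo.gif"/></a>')
record = to_record(anchor, "/html/body/address/a")
print("\n".join(record))
```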
By default, tags are stripped and character entity references (e.g. &lt;) are converted to normal characters (e.g. <) in the output fields. When the -r option is given, tags are preserved.
For example, when the input file is
<p><b>Bold</b> & <i>Italic</i>.</p>
and this <p> element becomes a D-field, normal output is:
p:Bold & Italic.
But, in raw mode output, it becomes:
p:<b>Bold</b> & <i>Italic</i>.
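The difference between the two modes can be reproduced with a short Python sketch. It uses the standard xml.etree.ElementTree module and only imitates the idea; it makes no claim about how DfromHtml implements -r. The sample input writes the ampersand as &amp; so that it is well-formed, and the serializer used here writes it back as &amp;, which may differ from what DfromHtml prints.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p><b>Bold</b> &amp; <i>Italic</i>.</p>")

# Normal mode analogue: drop the tags and keep only the text. The parser
# has already turned the entity reference &amp; into "&".
stripped = "".join(p.itertext())

# Raw mode analogue: keep the markup of everything inside <p>.
# ET.tostring() includes the text that follows each child element.
raw = (p.text or "") + "".join(ET.tostring(c, encoding="unicode") for c in p)

print("p:" + stripped)
print("p:" + raw)
```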
DfromHtml can handle most encodings. Usually, it detects the character encoding of the input file. When the input file does not have a proper character encoding designation, you may have to specify the encoding with the -e option. On most UNIX systems, the available encoding names are listed by the command iconv -l. For Windows, see the Encoding name list.
The output D-file is encoded in the current locale character set, unless the UTF I/O feature is invoked for the output encoding. When producing output D-records, a character not in the locale character set is converted to a QUESTION MARK (?) character. (This does not happen with UTF output.)
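The replacement by a question mark is the usual behaviour of a lossy character set conversion. A minimal Python sketch of the same effect, assuming for illustration that the locale character set is Latin-1 (the sample string is hypothetical):

```python
# "Café" fits into Latin-1; the two Japanese characters do not.
text = "Caf\u00e9 \u6771\u4eac"

# errors="replace" substitutes "?" for each character that cannot be encoded.
encoded = text.encode("latin-1", errors="replace")
print(encoded)

# With a UTF encoding nothing is lost.
utf8 = text.encode("utf-8")
print(utf8.decode("utf-8") == text)
```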
First of all, make sure that the content you want is actually in the input file. Sometimes, what you see in a browser is not what you get when you download the page with the browser.
Generally, you do not know how the source HTML file is structured. The node field of the output can help you find the xpath expression you need. You may try
DfromHtml -p "/html/body/*"
This command produces D-records from the first-level elements of the body, for example:
node:/html/body/div[1]
text:
table:
@width:100%
node:/html/body/div[2]
text:
table:Home >
@width:100%
node:/html/body/table
@width:100%
tbody:D is a series of commands to perform .....
node:/html/body/hr
hr:
......
Looking at this, you find that most of the content is in the third element, a table. Then, you try
DfromHtml -p "/html/body/table/*"
In the case of a table, of course, most of the content is in the tbody. So, you try next:
DfromHtml -p "/html/body/table/tbody/*"
Repeating this process, you will get to an xpath expression you need.
Note that HTML tags are case-insensitive. You can use either /TBODY or /tbody, though field names and node field values in the output D-file appear in lower case.
If the part you want to extract contains some characteristic word, you can use other D-commands to find the xpath. For example, suppose the part you want contains the word "Examples"; then the command
DfromHtml -p "//*" input-file \
| Dgrep "Examples" \
| Dpr -p node
produces the following output.
rec# node
1 /html
2 /html/body
3 /html/body/table[5]
4 /html/body/table[5]/tr[16]
5 /html/body/table[5]/tr[16]/td[3]
6 /html/body/table[5]/tr[43]
7 /html/body/table[5]/tr[43]/td[3]
8 /html/body/table[5]/tr[52]
9 /html/body/table[5]/tr[52]/td[3]
10 /html/body/table[5]/tr[81]
11 /html/body/table[5]/tr[81]/td[3]
The xpath expression "//*" extracts all the elements from the input. The second D-command, Dgrep, selects the D-records that have the word "Examples" somewhere. The third D-command, Dpr, prints the node field of the input records.
From this output, you can see that
-p "/html/body/table[5]/tr"
may be the xpath you need.
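The same search can be approximated without the D-commands. The sketch below is a rough Python analogue, assuming well-formed input; the function name paths_containing and the sample document are hypothetical, and the way an index is added only when an element has same-named siblings is modelled on the node values shown in this manual.

```python
import xml.etree.ElementTree as ET

def paths_containing(root, word):
    """Yield xpath-like paths of every element whose text content contains word."""
    def walk(elem, path):
        if word in "".join(elem.itertext()):
            yield path
        # Count same-named siblings so an index is added only when needed.
        totals = {}
        for child in elem:
            totals[child.tag] = totals.get(child.tag, 0) + 1
        seen = {}
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            step = child.tag
            if totals[child.tag] > 1:
                step += "[%d]" % seen[child.tag]
            yield from walk(child, path + "/" + step)
    yield from walk(root, "/" + root.tag)

# Hypothetical sample: only the second table mentions "Examples".
SAMPLE = """<html><body>
<table><tr><td>Name</td><td>Usage</td></tr></table>
<table>
<tr><td>Dgrep</td><td>Examples of Dgrep</td></tr>
<tr><td>Dpr</td><td>Options</td></tr>
</table>
</body></html>"""

found = list(paths_containing(ET.fromstring(SAMPLE), "Examples"))
print("\n".join(found))
```

As with the Dgrep pipeline, every ancestor of a matching cell is listed too, so the longest paths are the ones to look at.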
The simplest use of DfromHtml extracts hypertext links with their texts and attributes.
DfromHtml foo.html
Convert an HTML table (the first one) into D-records, with a table row as a D-record.
DfromHtml -p "/html/body/table[1]/tbody/tr" foo.html
Extract titles with filenames.
DfromHtml -F -p "/html/head/title" *.html
Convert the option list in this manual (it is the first <dl> element) to a D-record.
DfromHtml -p "/html/body/dl[1]" DfromHtml.html
Extract <img> elements. In the output, the @src fields can be used as an image file list.
DfromHtml -p //img bar.html
See the manual of D_msg.
There are also some messages from the HTML parser and the xpath processor. Please consult the libxml2 manuals for these messages.
MIYAZAWA Akira