DfromXml - Extract D-records from Xml file

SYNOPSIS

DfromXml [ -p Xpath ] [ options ] [ input-file.. ]

DESCRIPTION

DfromXml reads XML files and makes D-records. An output D-record is generated for each node selected by the -p xpath option. When no -p option is given, the default xpath expression //* is applied. This selects all element nodes, including the root node, regardless of their level below the root.

Each input file must be a valid XML file. DfromXml reads an input file and parses it. When XML parsing fails, no output is produced from that file. After the file is parsed, the xpath expression is applied to it and a D-record is generated for each selected node.

An input XML file can be in any character encoding. The output D-file is encoded in the character code of the current locale, unless the UTF I/O feature is invoked. A character not in the output character set is converted to a QUESTION MARK (?) character.
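
The replacement of unrepresentable characters can be pictured with a small Python sketch. This is not DfromXml's implementation. Python's "replace" error handler merely substitutes the same QUESTION MARK, and the ISO-8859-1 output character set is an assumption chosen for illustration.

```python
# Illustration only: Python's "replace" error handler mimics the documented
# behaviour of turning characters outside the output character set into "?".
text = "caf\u00e9 \u4e00"   # LATIN SMALL LETTER E WITH ACUTE, then a CJK ideograph

# Assumed output character set for this sketch: ISO-8859-1 (Latin-1).
encoded = text.encode("iso-8859-1", errors="replace")

# The accented letter exists in Latin-1 and survives; the ideograph does not.
print(encoded)
```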

Xpath

Xpath is a language for addressing the nodes of an XML tree. The default xpath in this program, "//*", means any element node at any level under the root node. Usually the default selects far more than you need, so use the -p xpath option to select the nodes you want. The following example is used for the explanation:

<?xml version="1.0" encoding="UTF-8"?>
<doc>                          /doc
  <a>text-a                    /doc/a
    <a>text-aa                 /doc/a/a
      <b>text-aab</b>          /doc/a/a/b
    </a>
    <b>text-ab</b>             /doc/a/b
  </a>
  <b>text-b1                   /doc/b[1]
    <a>text-b1a</a>            /doc/b[1]/a
    <b>text-b1b</b>            /doc/b[1]/b
  </b>
  <b>text-b2                   /doc/b[2]
    <a>text-b2a</a>            /doc/b[2]/a
    <b>text-b2b</b>            /doc/b[2]/b
  </b>
</doc>

In the above example, the right-hand column shows the xpath expression corresponding to the node that begins on that line.

A subscript ([]) distinguishes nodes of the same name on the same level. Without a subscript, all nodes of the same name on the same level are selected. For example, xpath "/doc/b" selects /doc/b[1] and /doc/b[2].

The wild card "*" corresponds to any element node. For example, "/doc/*" selects /doc/a, /doc/b[1] and /doc/b[2].

"//" corresponds to an arbitrary number of "/*" steps, or the root ("/"). For example, "//b" selects /doc/a/a/b, /doc/a/b, /doc/b[1], /doc/b[1]/b, /doc/b[2] and /doc/b[2]/b. The default xpath "//*" selects all the nodes listed in the example.

The Xpath language has many other functions and operators that you can use, but in most cases "[]", "*" and "//" will be enough.
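
The three constructs can be tried out with a short Python sketch. It uses the standard xml.etree.ElementTree module, which supports only a subset of Xpath and is not the libxml2 processor that DfromXml uses. ElementTree paths are relative to an element, so "/doc/b" is written "b" and "//b" is written ".//b" when applied to the <doc> root. The helper name first_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# The example document from this section, without the xpath column.
XML = """<doc>
  <a>text-a
    <a>text-aa
      <b>text-aab</b>
    </a>
    <b>text-ab</b>
  </a>
  <b>text-b1
    <a>text-b1a</a>
    <b>text-b1b</b>
  </b>
  <b>text-b2
    <a>text-b2a</a>
    <b>text-b2b</b>
  </b>
</doc>"""

root = ET.fromstring(XML)  # root is the <doc> element


def first_text(elem):
    """Leading text of an element, used here only to tell the nodes apart."""
    return (elem.text or "").strip()


doc_b = [first_text(e) for e in root.findall("b")]        # like /doc/b
doc_star = [e.tag for e in root.findall("*")]             # like /doc/*
any_b = [first_text(e) for e in root.findall(".//b")]     # like //b
second_b = [first_text(e) for e in root.findall("b[2]")]  # like /doc/b[2]

for label, found in [("/doc/b", doc_b), ("/doc/*", doc_star),
                     ("//b", any_b), ("/doc/b[2]", second_b)]:
    print(label, "->", found)
```

The "//b" result lists six nodes in document order, matching the six paths named in the text above.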

Namespace

When the input XML file uses namespaces, you also have to use prefixes in the xpath. There are two ways to bind a prefix to a URI. One is implicit binding, which uses the prefix bindings in the input file. The other is explicit binding, which uses the -N option. If you don't give any -N option and the input file uses namespaces, implicit binding is assumed.

Implicit binding

With implicit binding, you can use the tags as they appear in the input file. The right column in the example below shows the xpath you can use.

<?xml version="1.0"?>
<oai_dc:dc                                      /oai_dc:dc
    xmlns:oai_dc="http://www.openarchives..."
    xmlns:dc="http://purl.org/dc/elements...">
  <dc:title>DfromXml : D-2.4</dc:title>         /oai_dc:dc/dc:title
  <dc:creator>Akira Miyazawa</dc:creator>       /oai_dc:dc/dc:creator
</oai_dc:dc>

But when the input uses the default namespace, things are more complicated.

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/...">   /D:OAI-PMH
  <ListRecords>                                     /D:OAI-PMH/D:ListRecords
    <record>                                        /D:OAI-PMH/D:ListRecords/D:record
    ........
    </record>
  </ListRecords>
</OAI-PMH>

As shown in the right column, you have to use the prefix "D:" for tags under the default namespace in the input. You may use "-p //D:record" in this case.

The Xpath language has no default namespace. Names without a prefix in an xpath expression address only tags outside any namespace scope in the input document. When there are two different default namespaces in the input document, DfromXml gives the prefix D_ to the second default namespace. See the third xpath in the right column of the following example.

<?xml version="1.0"?>
<out>                               /out
  <in-1 xmlns="http://foo.org" >    /out/D:in-1
    <in-2 xmlns="http://bar.org">   /out/D:in-1/D_:in-2
    </in-2>
  </in-1>
</out>

Implicit binding in DfromXml follows a general rule to avoid prefix conflicts. For each input document, the program inspects the namespace-prefix bindings. When it finds the same prefix bound to a different namespace name, it appends a LOW LINE (_) to the second prefix and binds it to the second namespace name. The third one gets two LOW LINEs (__), and so on. Similarly, "D_" is bound to the second default namespace name and "D__" to the third one.
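
The renaming rule can be sketched in a few lines of Python. This is only a reading of the rule as described above, not DfromXml's code. The function name assign_prefixes, the input format (a list of prefix/URI pairs in document order, with "" standing for the default namespace) and the example.org URIs are all assumptions made for illustration.

```python
def assign_prefixes(bindings):
    """Map each namespace binding to a conflict-free xpath prefix.

    bindings: (prefix, namespace_uri) pairs in document order; the default
    namespace is given with the empty prefix "".
    Returns a dict {xpath_prefix: namespace_uri}.
    """
    assigned = {}
    for prefix, uri in bindings:
        name = prefix or "D"  # the default namespace is addressed as "D"
        # Append LOW LINEs until the name is unused or already bound to this URI.
        while name in assigned and assigned[name] != uri:
            name += "_"
        assigned[name] = uri
    return assigned


result = assign_prefixes([
    ("", "http://foo.org"),            # first default namespace  -> D
    ("", "http://bar.org"),            # second default namespace -> D_
    ("dc", "http://example.org/one"),  # hypothetical URI         -> dc
    ("dc", "http://example.org/two"),  # same prefix, other URI   -> dc_
])
for name, uri in result.items():
    print(name, "=", uri)
```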

Explicit binding

Explicit binding is simpler. You give the bindings of all the prefixes used in the xpath explicitly with -N options. The format is "prefix=URI", and you can give more than one -N option. These prefixes are local to the -p option and need not be the same as the prefixes in the input document. Names are compared not as prefix and local name pairs, but as pairs of the URI bound to the prefix and the local name.
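
The comparison by URI rather than by prefix can be observed with Python's xml.etree.ElementTree, which accepts a prefix-to-URI mapping much like a set of -N options. The sketch below is an analogy, not DfromXml itself. The namespace URI http://example.org/ns and the prefix "x" are hypothetical.

```python
import xml.etree.ElementTree as ET

# The input uses a default namespace and declares no prefix at all.
XML = """<OAI-PMH xmlns="http://example.org/ns">
  <ListRecords>
    <record>r1</record>
    <record>r2</record>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(XML)

# Explicit binding: the prefix "x" exists only in the query, not in the input.
ns = {"x": "http://example.org/ns"}
records = root.findall(".//x:record", ns)
print([r.text for r in records])

# A name without a prefix addresses only elements outside any namespace.
print(root.findall(".//record"))
```

Following the documented option syntax, the corresponding DfromXml call would take the form -p //x:record -N x=http://example.org/ns.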

D-record output

Record node

For convenience of explanation, a record node is defined as a node selected by the -p xpath option to produce an output record.

Field node

Each node directly under the record node makes a D-field. These nodes are called field nodes. The field order in the D-record follows the sequence of the field nodes in the input XML file. The element name of a field node becomes the D-field name, and the field value is composed of the texts under the field node. If the record node has text directly under it, that text becomes a D-field with the field name "text" and the text as the field value.

Let the record node be /doc/a from the Xpath example above.

<a>text-a
  <a>text-aa
    <b>text-aab</b>
  </a>
  <b>text-ab</b>
</a>

DfromXml produces the following D-record.

node:/doc/a
text:text-a
a:text-aa text-aab
text:
b:text-ab
text:

The "node" field is a fixed name field that identifies the record node. Its value can be used as an xpath expression to select the node if no namespace is used. See the FIXED NAME FIELDS section.

The "a" field comprises the texts of the node /doc/a/a and the node /doc/a/a/b. The second and third "text" fields contain only white space. These fields come from the newline and leading spaces between an end-tag (e.g. </a>) and a start-tag (e.g. <b>). They may seem useless, but as an XML processor, DfromXml passes all characters in a document to the application. It is up to your application to decide what is needed. (If the /doc/a node has element content, i.e. contains no character data such as "text-a" in this case, no "text" field is produced for white space.)

Conversion to a D-field replaces predefined entities (&lt;, &gt;, &amp;, &quot;, &apos;) and character references (&#x4e00; etc.) with normal characters.

The namespace prefix is not reflected in the D-field name. Only the local name part is used.
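
The way the /doc/a record node breaks down into "text" and element fields can be approximated in Python. This is a rough sketch of the rules described in this section, not DfromXml's implementation. The function name sketch_fields is hypothetical, newlines are simply deleted (the documented default of the -n option), and the handling of the remaining white space is an assumption that may differ from the real output.

```python
import xml.etree.ElementTree as ET

XML = """<a>text-a
  <a>text-aa
    <b>text-aab</b>
  </a>
  <b>text-ab</b>
</a>"""


def sketch_fields(record):
    """Return (field name, value) pairs for one record node."""
    fields = []

    def add(name, value):
        # Assumed default: newline characters are removed from field values.
        fields.append((name, value.replace("\n", "")))

    if record.text is not None:
        add("text", record.text)             # text directly under the record node
    for child in record:                     # field nodes, in document order
        add(child.tag, "".join(child.itertext()))
        if child.tail is not None:
            add("text", child.tail)          # text following the field node
    return fields


fields = sketch_fields(ET.fromstring(XML))
for name, value in fields:
    print(name + ":" + value)
```

The field names come out in the order text, a, text, b, text, as in the D-record shown above, and the last two "text" values hold nothing but white space.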

Raw mode output

With the -r option, the output D-record is in raw mode. In raw mode, a D-field value is basically the field node in its original XML form, without its start and end tags. For the same record node as in the example above:

<a>text-a
  <a>text-aa
    <b>text-aab</b>
  </a>
  <b>text-ab</b>
</a>

DfromXml -r produces the following D-record.

node:/doc/a
text:text-a
a:text-aa <b>text-aab</b>
text:
b:text-ab
text:

See the Special markups section below for comments and other special constructs. Most XML constructs, including entity references, are kept as they are, but character references are replaced by normal characters.

Attributes

By default, attributes are not included in the output. The -a option makes attributes appear in the output. There are two ways to do so: the -a f option (attribute field output) and the -a n option (field name variation).

Attribute field output

With the -a f or "attribute field output" option, an attribute is treated as if it were an element and produces a D-field called an attribute field. The field name of an attribute field is the attribute name preceded by a COMMERCIAL AT (@) character. The field value is the attribute value with the enclosing QUOTATION MARKs (") stripped. Only attributes of the record node or of a field node are converted to attribute fields. Attributes at other levels are ignored. Record node attribute fields come just after the node field and before the fields made from field nodes. Field node attribute fields come just after the field created from that node. See the next example.

<doc lang="en">
  <rec a="va" b="vb">               record node
    <f1 c="vc">                     field node
      <f11 d="vd">value11</f11>
      <f12 d="vd">value12</f12>
    </f1>
    <f2 e="ve">valuef2</f2>         field node
  </rec>
</doc>

With the "-a f" option, you will get the following D-record.

node:/doc/rec
@a:va
@b:vb
f1:value11value12
@c:vc
f2:valuef2
@e:ve

Note that the attributes of the outer node (<doc>) and of the inner nodes (<f11>, <f12>) are not reflected.
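
The ordering rules for attribute fields can be reproduced with a short Python sketch. It only imitates the documented behaviour for this one example and is not DfromXml's code. The function name attribute_field_record is hypothetical, and stripping white space around each text piece is an assumption that happens to suit this element-content example.

```python
import xml.etree.ElementTree as ET

XML = """<doc lang="en">
  <rec a="va" b="vb">
    <f1 c="vc">
      <f11 d="vd">value11</f11>
      <f12 d="vd">value12</f12>
    </f1>
    <f2 e="ve">valuef2</f2>
  </rec>
</doc>"""


def attribute_field_record(record, path):
    """Return D-record lines for one record node, in the style of -a f."""
    lines = ["node:" + path]
    # Record node attributes come right after the node field.
    lines += ["@%s:%s" % item for item in record.attrib.items()]
    for field in record:
        # The field value joins all texts under the field node.
        value = "".join(piece.strip() for piece in field.itertext())
        lines.append("%s:%s" % (field.tag, value))
        # Field node attributes follow the field made from that node.
        lines += ["@%s:%s" % item for item in field.attrib.items()]
    return lines


rec = ET.fromstring(XML).find("rec")
lines = attribute_field_record(rec, "/doc/rec")
print("\n".join(lines))
```

Because only the attributes of the record node and of its direct children are read, lang on <doc> and d on <f11> and <f12> never appear in the result.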

Field name variation

With the -a n or "field name variation" option, an attribute becomes part of the D-field name. This applies only to field node attributes. Record node attributes are ignored.

A typical use of the -a n option is:

<title lang="en">Algorithms</title>
<title lang="fr">Algorithmique</title>

When these nodes become field nodes with the -a n option, they are converted to:

title-en:Algorithms
title-fr:Algorithmique

You can give a fieldname-specification with the "-a nfieldname-specification" option. It is a character string that is appended to the element name, once for each attribute of the field node, to form the D-field name. Within the fieldname-specification, "\a" is converted to the attribute name, "\v" to the attribute value, and "\\" to "\". When no fieldname-specification is given, the default is "-\v". In the above example, "-en" and "-fr" are appended to the element name "title".

The following artificial example illustrates the fieldname-specification. Given the field node:

<field a="va" b="vb">value</field>

With the "-a n-\a=\v" option, DfromXml produces

field-a=va-b=vb:value
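
The expansion of a fieldname-specification is easy to mimic in Python. The sketch below follows the substitution rules as described above. The function name field_name and its argument layout are hypothetical and say nothing about how DfromXml implements the option.

```python
import re


def field_name(element_name, attributes, spec="-\\v"):
    """Build a D-field name from an element name and its attributes.

    attributes: (name, value) pairs in document order.
    spec: fieldname-specification; the default corresponds to "-\\v".
    """
    def expand(name, value):
        replacements = {"\\a": name, "\\v": value, "\\\\": "\\"}
        # A single left-to-right pass, so an escaped backslash is never
        # re-read as the start of another escape.
        return re.sub(r"\\[av\\]", lambda m: replacements[m.group(0)], spec)

    # The expanded specification is appended once per attribute.
    return element_name + "".join(expand(n, v) for n, v in attributes)


print(field_name("title", [("lang", "en")]))                        # default spec
print(field_name("field", [("a", "va"), ("b", "vb")], "-\\a=\\v"))  # -a n-\a=\v
```

The two calls correspond to the "title" and "field" examples of this section.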

Special markups

Aside from element nodes and attributes, XML has some special markups: comments, processing instructions, CDATA sections and entity references. The following table shows how they are treated in the output D-record.

markup                  syntax                  under field node  under field node (raw mode)  field node             field node (raw mode)
comment                 <!--text-->             ignored           <!--text-->                  comment:text           comment:<!--text-->
processing instruction  <?target instruction?>  ignored           <?target instruction?>       PI:target instruction  PI:<?target instruction?>
CDATA section           <![CDATA[text]]>        text              <![CDATA[text]]>             CDATA:text             CDATA:<![CDATA[text]]>
entity reference        &name;                  replacement text  &name;                       ENTITY:name            ENTITY:&name;

Usually, you do not choose these constructs as a record node. But when you do, the record node is treated as if it were a field node.

OPTIONS

-p xpath
Xpath expression for record node selection. Default is "//*", which means all element nodes.
-N prefix=URI
Namespace binding for the xpath expression (explicit binding). This option may be given more than once.
-e encoding
Encoding of the input file. Usually, you don't need this option. But if the input file does not carry proper encoding information and is not encoded in UTF-8, you have to give this option. The encoding name is given in a form acceptable to the iconv(1) command. In most UNIX environments, the iconv -l command lists the acceptable encoding names. For Windows, see Encoding name list.
-a f
outputs attributes as attribute fields (attribute field output).
-a nfieldname-specification
outputs attributes as part of the field name (field name variation).
-r
raw mode output; output fields contain lower-level tags; entity references are left as they are (&lt; etc.).
-n newline-replacement
gives the character string that replaces each newline character. If this option is not given, the replacement is the null string (newline characters are simply removed).
-F
adds a "filename" field to each output record.
-D odatautf=8|16|32
UTF I/O feature (see the manual page of the UTF I/O feature).
idatautf is ignored.

FIXED NAME FIELDS

filename:
placed at the top of each record when -F is specified; the value is the input file name as given in the command arguments, after globbing by the shell. This field is not added when the input is the standard input.
node:
placed before the other fields, after the filename field if any; the value is an xpath expression that identifies the record node. You can use this field value as the -p xpath option if no namespace is used. When namespaces are used in the input document, each step of the path carries the prefix (if any) used in the input document. Generally, such a value cannot be used in the -p option as it is. But if all of the following conditions are satisfied, you can use the value in the -p option.
  1. You use implicit binding.
  2. No default namespace is used.
  3. No prefix is bound to more than one namespace.
text:
when the record node has text directly under it.
comment:
when the record node has a comment directly under it.
PI:
when the record node has a processing instruction directly under it.
CDATA:
when the record node has a CDATA section directly under it.
ENTITY:
when the record node has text using a non-predefined entity reference directly under it.

EXAMPLES

The simplest use produces all nodes as D-records with the default xpath "//*".

DfromXml foo.xml

Select Dublin Core metadata from an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) response coded as in the example of OAI-PMH.

DfromXml -p //oai_dc:dc response.xml

Select <record> nodes from an XML file that uses just one default namespace.

DfromXml -p //D:record bar.xml

ENVIRONMENT

Dodatautf
for UTF I/O feature.
Didatautf is ignored.

DIAGNOSTICS

See the manual of D_msg.

There are also some messages from the XML parser and the XPath processor. Please consult the libxml2 manuals for these messages.

SEE ALSO

Dintro, DtoXml, DfromHtml, D_msg.

AUTHOR

MIYAZAWA Akira

This program uses Daniel Veillard's libxml2.


miyazawa@nii.ac.jp
2006