DfromXml : D-2.6

DESCRIPTION

DfromXml reads XML files and makes D-records. An output D-record is generated for each node selected by the -p xpath option. When no -p option is given, the default xpath expression //* is applied. It selects every element node, including the root element, regardless of its depth below the root.

Each input file must be a valid XML file. DfromXml reads an input file and parses it. When XML parsing fails, no output is produced from that file. After the file is parsed, the xpath expression is applied to it and a D-record is generated for each selected node.
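The parse-then-select flow described above can be sketched in Python with the standard xml.etree.ElementTree module. This is only an illustration under stated assumptions, not DfromXml's implementation: the function name select_nodes is hypothetical, and ElementTree supports only a subset of xpath, so paths are written relative to the root element.

```python
import xml.etree.ElementTree as ET

def select_nodes(xml_text, path=None):
    """Return the elements selected from xml_text, or [] if parsing fails."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # A file that cannot be parsed yields no output at all.
        return []
    if path is None:
        # Counterpart of the default "//*": every element, root included.
        return list(root.iter())
    # ElementTree paths are relative to the root element.
    return root.findall(path)

print(len(select_nodes("<doc><a/><b/></doc>")))  # doc, a and b are selected
print(select_nodes("<doc><a></doc>"))            # mismatched tags: nothing selected
```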

An input XML file can be in any character encoding. The output D-file is encoded in the character code of the current locale, unless the UTF I/O feature is invoked. A character that is not in the output character set is converted to a QUESTION MARK (?) character.
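The replacement rule can be illustrated with Python's built-in codec error handling. This is a sketch of the behaviour only, not of DfromXml itself; ASCII stands in here for the character code of the locale, which is an assumption made for the example.

```python
# Two characters that ASCII cannot represent: U+00E9 and U+4E00.
text = "caf\u00e9 \u4e00"

# errors="replace" substitutes "?" for every unencodable character.
encoded = text.encode("ascii", errors="replace")
print(encoded)
```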

Xpath

Xpath is a language for addressing the nodes of an XML tree. The default xpath in this program, "//*", means any element node at any level under the root node. Usually the default selects far more than you need, so you use the -p xpath option to select the nodes you want. The explanation below uses the following example:

`<?xml version="1.0" encoding="UTF-8"?>`
`<doc>`	`/doc`
`  <a>text-a`	`/doc/a`
`    <a>text-aa`	`/doc/a/a`
`      <b>text-aab</b>`	`/doc/a/a/b`
`    </a>`
`    <b>text-ab</b>`	`/doc/a/b`
`  </a>`
`  <b>text-b1`	`/doc/b[1]`
`    <a>text-b1a</a>`	`/doc/b[1]/a`
`    <b>text-b1b</b>`	`/doc/b[1]/b`
`  </b>`
`  <b>text-b2`	`/doc/b[2]`
`    <a>text-b2a</a>`	`/doc/b[2]/a`
`    <b>text-b2b</b>`	`/doc/b[2]/b`
`  </b>`
`</doc>`

In the above example, the right-hand column shows the xpath expression corresponding to the node that begins on that line.

A subscript ([]) distinguishes nodes of the same name on the same level. Without a subscript, all nodes with the same name on the same level are selected. For example, xpath "/doc/b" selects /doc/b[1] and /doc/b[2].

The wild card "*" corresponds to any element node. For example, "/doc/*" selects /doc/a, /doc/b[1] and /doc/b[2].

"//" corresponds to an arbitrary number of "/*", or the root ("/"). For example, "//b" selects /doc/a/a/b, /doc/a/b, /doc/b[1], /doc/b[1]/b, /doc/b[2] and /doc/b[2]/b. The default xpath "//*" selects all the nodes listed in the example.

The xpath language has many other functions and operators that you can use, but in most cases "[]", "*" and "//" will be enough.
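The selections described in this section can be checked with a short Python sketch using the standard xml.etree.ElementTree module. It is an illustration, not part of DfromXml: ElementTree implements only a subset of xpath and evaluates paths relative to the root element, so "/doc/b" is written as "b" and "//b" as ".//b".

```python
import xml.etree.ElementTree as ET

DOC = """<doc>
 <a>text-a
  <a>text-aa
   <b>text-aab</b>
  </a>
  <b>text-ab</b>
 </a>
 <b>text-b1
  <a>text-b1a</a>
  <b>text-b1b</b>
 </b>
 <b>text-b2
  <a>text-b2a</a>
  <b>text-b2b</b>
 </b>
</doc>"""

root = ET.fromstring(DOC)  # root is the <doc> element

counts = {
    "/doc/b": len(root.findall("b")),      # both b children of doc
    "/doc/*": len(root.findall("*")),      # a, b[1] and b[2]
    "//b": len(root.findall(".//b")),      # every b at any depth
    "//*": len(list(root.iter())),         # every element, doc included
}
print(counts)

# A subscript picks one of several same-name siblings.
print(root.find("b[2]/a").text)
```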

Namespace

When the input XML file uses namespaces, you also have to use prefixes in the xpath. There are two ways to bind a prefix to a URI. One is implicit binding, which uses the prefix bindings in the input file. The other is explicit binding, which uses the -N option. If you give no -N option and the input file uses namespaces, implicit binding is assumed.

Implicit binding

With implicit binding, you can use the tags as they appear in the input file. The right column in the example below shows the xpath you can use.

`<?xml version="1.0"?>`
`<oai_dc:dc`	`/oai_dc:dc`
`    xmlns:oai_dc="http://www.openarchives..."`
`    xmlns:dc="http://purl.org/dc/elements...">`
`  <dc:title>DfromXml : D-2.4</dc:title>`	`/oai_dc:dc/dc:title`
`  <dc:creator>Akira Miyazawa</dc:creator>`	`/oai_dc:dc/dc:creator`
`</oai_dc:dc>`

But when the input uses the default namespace, things are more complicated.

`<?xml version="1.0" encoding="UTF-8"?>`
`<OAI-PMH xmlns="http://www.openarchives.org/...">`	`/D:OAI-PMH`
`  <ListRecords>`	`/D:OAI-PMH/D:ListRecords`
`    <record>`	`/D:OAI-PMH/D:ListRecords/D:record`
`      ........`
`    </record>`
`  </ListRecords>`
`</OAI-PMH>`

As shown in the right column, you have to use the prefix "D:" for tags under the default namespace in the input. You may use "-p //D:record" in this case.
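The need for a prefix can be reproduced with Python's xml.etree.ElementTree, which also matches unprefixed names only against elements outside any namespace. In this sketch the namespace URI is a placeholder, and the prefix "D" is bound by hand in the mapping passed to findall, whereas DfromXml generates it through implicit binding.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://example.org/oai"  # placeholder URI, not the real one

DOC = f"""<OAI-PMH xmlns="{NS_URI}">
 <ListRecords>
  <record>first</record>
  <record>second</record>
 </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(DOC)

# An unprefixed name addresses only elements in no namespace: nothing matches.
unprefixed = root.findall(".//record")

# Binding a prefix to the default namespace URI makes the records reachable.
prefixed = root.findall(".//D:record", {"D": NS_URI})

print(len(unprefixed), len(prefixed))
```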

The xpath language has no default namespace. Names without a prefix in an xpath expression address only tags outside any namespace scope in the input document. When there are two different default namespaces in the input document, DfromXml gives the prefix D_ to the second default namespace. See the third xpath in the right column of the following example.

`<?xml version="1.0"?>`
`<out>`	`/out`
`  <in-1 xmlns="http://foo.org" >`	`/out/D:in-1`
`    <in-2 xmlns="http://bar.org">`	`/out/D:in-1/D_:in-2`
`    </in-2>`
`  </in-1>`
`</out>`

Implicit binding in DfromXml follows a general rule to avoid prefix conflicts. For each input document, the program inspects the namespace-prefix bindings. When it finds the same prefix bound to a different namespace name, it appends a LOW LINE (_) to the second prefix and binds it to the second namespace name. The third one gets two LOW LINEs (__), and so on. Similarly, "D_" is bound to the second default namespace name and "D__" to the third one.
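The renaming rule can be sketched in a few lines of Python. This is a reading of the rule as stated above, not DfromXml's actual code: the function name implicit_bindings is hypothetical, and the sketch does not handle a document that itself declares a prefix named "D".

```python
import io
import xml.etree.ElementTree as ET

def implicit_bindings(xml_bytes):
    """Map generated prefixes to namespace URIs, in declaration order."""
    bindings = {}  # generated prefix -> namespace URI
    seen = {}      # base prefix -> URIs already bound under that base
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    for _event, (prefix, uri) in events:
        base = prefix or "D"  # the default namespace gets the prefix "D"
        uris = seen.setdefault(base, [])
        if uri not in uris:
            uris.append(uri)
            # First URI keeps the base; each later one gets one more "_".
            bindings[base + "_" * (len(uris) - 1)] = uri
    return bindings

DOC = b"""<out>
 <in-1 xmlns="http://foo.org">
  <in-2 xmlns="http://bar.org"></in-2>
 </in-1>
</out>"""

print(implicit_bindings(DOC))
```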

Explicit binding

Explicit binding is simpler. You explicitly give bindings for all the prefixes used in the xpath with -N options. The format is "prefix=URI", and you can give more than one -N option. The prefix is local to the -p option and need not be the same as the prefix in the input document. Names are compared not by the pair of prefix and local name, but by the pair of the URI (bound to the prefix) and the local name.
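The point that only the URI matters, not the prefix spelling, can be demonstrated with Python's xml.etree.ElementTree. In this sketch the helper parse_bindings is hypothetical and merely mimics the "prefix=URI" format; the URIs are placeholders. The document uses the prefix "dc", while the query uses "x" bound to the same URI.

```python
import xml.etree.ElementTree as ET

def parse_bindings(options):
    """Turn ["prefix=URI", ...] into a {prefix: URI} mapping."""
    return dict(option.split("=", 1) for option in options)

DOC = """<r:root xmlns:r="http://example.org/r" xmlns:dc="http://example.org/dc">
 <dc:title>A title</dc:title>
</r:root>"""

# The query prefix "x" differs from the document prefix "dc".
bindings = parse_bindings(["x=http://example.org/dc"])

root = ET.fromstring(DOC)
titles = root.findall(".//x:title", bindings)
print([t.text for t in titles])
```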

D-record output

Record node

For convenience of explanation, a record node is defined as a node selected by the -p xpath option to produce an output record.

Field node

Each node directly under the record node makes a D-field. These nodes are called field nodes. The field order in the D-record follows the sequence of the field nodes in the input XML file. The element name of a field node becomes the D-field name, and the field value is composed of the text under the field node. If the record node has text directly under it, that text becomes a D-field with the field name "text" and the text as the field value.

Let the record node be /doc/a of the above example.

<a>text-a <a>text-aa <b>text-aab</b> </a> <b>text-ab</b> </a>

DfromXml produces the following D-record.

node:/doc/a text:text-a a:text-aa text-aab text: b:text-ab text:

The "node" field is a fixed-name field that identifies the record node. Its value can be used as an xpath expression to select the node, if no namespace is used. See the FIXED NAME FIELDS section.

The "a" field comprises the texts of the node /doc/a/a and the node /doc/a/a/b. The second and third "text" fields comprise only white space. These fields come from the new line and leading spaces between an end-tag (e.g. </a>) and a start-tag (e.g. <b>). They may seem useless, but as an XML processor, DfromXml passes all characters in a document to the application. It is up to your application to decide what is needed. (If the /doc/a node has element content, that is, it contains no character data (no "text-a" in this case), no "text" field is produced for white space.)

Conversion to a D-field replaces predefined entity references (&lt;, &gt;, &amp;, &quot;, &apos;) and character references (&#x4e00; etc.) with normal characters.

The namespace prefix is not reflected in the D-field name. Only the local name part is used.
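The field rules of this section can be approximated in Python with xml.etree.ElementTree. This is a sketch of the rules as described, not DfromXml's implementation: the names d_fields and local_name are hypothetical, the record path is passed in by the caller, and fields are returned as (name, value) pairs rather than written in D format.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the "{uri}" part that ElementTree puts before a local name."""
    return tag.rsplit("}", 1)[-1]

def d_fields(record, path):
    """Return (name, value) pairs for one record node."""
    # Text directly under the record node: before the first child and
    # after each child (ElementTree calls the latter the child's tail).
    pieces = [record.text] + [child.tail for child in record]
    # White space becomes "text" fields only if the node has character data.
    has_chardata = any(p and p.strip() for p in pieces)

    fields = [("node", path)]

    def add_text(piece):
        if has_chardata and piece:
            fields.append(("text", piece))

    add_text(record.text)
    for child in record:
        # A field value is all the text under the field node.
        fields.append((local_name(child.tag), "".join(child.itertext())))
        add_text(child.tail)
    return fields

doc = ET.fromstring(
    "<doc><a>text-a <a>text-aa <b>text-aab</b> </a> <b>text-ab</b> </a></doc>"
)
for name, value in d_fields(doc.find("a"), "/doc/a"):
    print(f"{name}:{value}")
```

For the /doc/a record node, the field names come out in the order node, text, a, text, b, text, which matches the D-record shown earlier in this section.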

Raw mode output

With the -r option, the output D-record is in raw mode. In raw mode, a D-field value is basically the field node in its original XML form, without its start and end tags. For the same record node as the example above:

<a>text-a <a>text-aa <b>text-aab</b> </a> <b>text-ab</b> </a>

DfromXml -r produces the following D-record.

node:/doc/a text:text-a a:text-aa <b>text-aab</b> text: b:text-ab text:

See the Special markups section below for comments and other special constructs. Most XML constructs, including entity references, are kept as they are, but character references are replaced by normal characters.
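The idea of a raw field value, the field node's content without its own start and end tags, can be sketched in Python. This is a rough approximation only and the helper name inner_xml is hypothetical: ElementTree re-serializes the markup instead of copying the original bytes, so unlike the raw mode described above it expands entity references and may rewrite namespace prefixes.

```python
import xml.etree.ElementTree as ET

def inner_xml(element):
    """Serialize the content of element without its own start/end tags."""
    parts = [element.text or ""]
    for child in element:
        # tostring() includes the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

record = ET.fromstring(
    "<a>text-a <a>text-aa <b>text-aab</b> </a> <b>text-ab</b> </a>"
)
raw_fields = [(child.tag, inner_xml(child)) for child in record]
print(raw_fields)
```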

Attributes

By default, attributes are not included in the output. The -a option makes attributes appear in the output. There are two ways to do so. One is the -a f option, or attribute field output, and the other is the -a n option, or field name variation.

Attribute field output

With the -a f or "attribute field output" option, an attribute is treated as if it were an element and makes a D-field called an attribute field. The field name of an attribute field is the attribute name preceded by a COMMERCIAL AT (@) character. The field value is the attribute value with the enclosing QUOTATION MARKs (") stripped. Only the attributes of the record node or a field node are converted to attribute fields; attributes at other levels are ignored. Record node attribute fields come just after the node field and before the field node fields. Field node attribute fields come just after the field created from that node. See the next example.

`<doc lang="en">`
`  <rec a="va" b="vb">`	`record node`
`    <f1 c="vc">`	`field node`
`      <f11 d="vd">value11</f11>`
`      <f12 d="vd">value12</f12>`
`    </f1>`
`    <f2 e="ve">valuef2</f2>`	`field node`
`  </rec>`
`</doc>`

With the "-a f" option, you will get the following D-record.

node:/doc/rec @a:va @b:vb f1:value11value12 @c:vc f2:valuef2 @e:ve

Note that the attributes of the outer node (<doc>) and the inner nodes (<f11>, <f12>) are not reflected.
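The ordering of attribute fields can be sketched in Python with xml.etree.ElementTree. This is an illustration of the rules stated above, not DfromXml's code: fields_with_attributes is a hypothetical name, the record path is supplied by the caller, and the input is written without white space between tags so that the field values stay short.

```python
import xml.etree.ElementTree as ET

def fields_with_attributes(record, path):
    """Return (name, value) pairs, including "@" attribute fields."""
    fields = [("node", path)]
    # Record node attributes come right after the node field.
    fields += [("@" + name, value) for name, value in record.attrib.items()]
    for child in record:
        fields.append((child.tag, "".join(child.itertext())))
        # Field node attributes follow the field made from that node.
        fields += [("@" + name, value) for name, value in child.attrib.items()]
    # Attributes of outer or deeper nodes are never visited.
    return fields

DOC = (
    '<doc lang="en"><rec a="va" b="vb">'
    '<f1 c="vc"><f11 d="vd">value11</f11><f12 d="vd">value12</f12></f1>'
    '<f2 e="ve">valuef2</f2></rec></doc>'
)
rec = ET.fromstring(DOC).find("rec")
for name, value in fields_with_attributes(rec, "/doc/rec"):
    print(f"{name}:{value}")
```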

Field name variation

With the -a n or "field name variation" option, an attribute becomes a part of the D-field name. This applies only to field node attributes; record node attributes are ignored.

A typical use of the -a n option is for input such as:

<title lang="en">Algorithms</title> <title lang="fr">Algorithmique</title>

When these nodes become field nodes with the -a n option, they are converted to:

title-en:Algorithms title-fr:Algorithmique

You can give a fieldname-specification with the "-a nfieldname-specification" option. It is a character string that is appended to the element name, once for each attribute of the field node, to form the D-field name. Within the fieldname-specification, "\a" is converted to the attribute name, "\v" to the attribute value, and "\\" to "\". When no fieldname-specification is given, the default is "-\v". In the above example, "-en" and "-fr" are appended to the element name "title".

The following artificial example illustrates the fieldname-specification. Given the field node:

<field a="va" b="vb">value</field>

With the "-a n-\a=\v" option, DfromXml produces:

field-a=va-b=vb:value
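The expansion of a fieldname-specification can be sketched as a small Python function. The function name field_name is hypothetical and the code only follows the substitution rules stated above; how DfromXml treats a backslash followed by any other character is not covered here.

```python
import re

def field_name(element_name, attributes, spec="-\\v"):
    """Append the expanded spec to element_name once per attribute."""
    name = element_name
    for attr_name, attr_value in attributes.items():
        replacements = {"a": attr_name, "v": attr_value, "\\": "\\"}
        # Replace \a, \v and \\ in a single left-to-right pass.
        name += re.sub(r"\\([av\\])", lambda m: replacements[m.group(1)], spec)
    return name

# Default specification "-\v": only the attribute value is appended.
print(field_name("title", {"lang": "en"}))

# Specification "-\a=\v": attribute name and value for every attribute.
print(field_name("field", {"a": "va", "b": "vb"}, r"-\a=\v"))
```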

Special markups

XML has some special markups aside from element nodes and attributes: comments, processing instructions, CDATA sections and entity references. The following table shows how they are treated in the output D-record.

Special markups
markup	under field node	under field node(raw mode)	field node	field node(raw mode)
comment `<!--text-->`	`ignored`	`<!--text-->`	`comment:text`	`comment:<!--text-->`
processing instruction `<?target instruction?>`	`ignored`	`<?target instruction?>`	`PI:target instruction`	`PI:<?target instruction?>`
CDATA section `<![CDATA[text]]>`	`text`	`<![CDATA[text]]>`	`CDATA:text`	`CDATA:<![CDATA[text]]>`
entity reference `&name;`	`replacement text`	`&name;`	`ENTITY:name`	`ENTITY:&name;`

Usually you do not choose these constructs as a record node. When you do, the record node is treated as if it were a field node.

DfromXml - Extract D-records from Xml file
