Dintro - introduction to D-commands

DESCRIPTION

D is a series of commands that perform file operations on D-files. The names of D-commands always start with a capital letter D.

Each D-command provides a basic file operation, such as selection or join (matching), on files in D-file format. The result of an operation is also in D-file format and is written to the standard output. D-commands are invoked from a shell and combined with each other, or with other commands, to perform complex data processing.

D is not an "all in one" package. It has only limited functions for computation or string manipulation. The basic philosophy of D is "open system": it is meant to be used together with sed, awk, perl or any other program.

D-COMMANDS LISTING

Conversion from/to other formats

Name         Description
DfromLine    Line format to D-file
DfromChunk   Creation of D-records from chunks of lines of text files
DtoLine      D-file to Line format
DfromCsv     Csv file to D-file
DtoCsv       D-file to Csv file
DfromHtml    Extract D-records from Html file
DfromXml     Extract D-records from Xml file
DtoXml       D-file to Xml portion
DtoTex       D-file to TeX

Display/print

Name         Description
Dpr          Display/print D-files

Selection

Name         Description
Dselect      Selection by a conditional expression
Dgrep        Selection by a regular expression matching
Dextract     Selection by the record number list
Dhead        Selection of the top n records
Dtail        Selection of the last n records
Dmax         Selection of the maximum/minimum value records

File operation

Name         Description
Dproj        Projection
Djoin        Join/matching of D-files by key values
Dpaste       Horizontal merge of D-files
Drename      Renaming the field names
Dorder       Changing field sequence of D-records
Dsort        Sorting by key values
Dcat         Concatenation of D-files
Dupdate      Update a D-file
Dfill        Filling empty fields from the preceding records
Ddecompose   Decomposition of D-records into field instance unit representation
Dcompose     Composition of D-records from field instance unit representation

File operation with repeating fields

Name         Description
Dbundle      Vertical merge of D-records by key values
Dunbundle    Slicing D-records with a repeating group
Dtie         Concatenation of group fields
Duntie       Decomposition of a field into some fields
Dpack        Concatenation of repeating fields into a field
Dunpack      Decomposition of a field into repeating fields

Simple statistics

Name         Description
Dfd          Field description
Dfdp         Dfd and display (not available in Windows)
Dfreq        Frequency count of field values
Drc          Record count
Drcp         Drc and display (not available in Windows)
Dmeans       Calculation of max., min., average and standard deviation
Dmax         Selection of the maximum/minimum value records

D-file stream editor

Name         Description
Ded          D-file stream editor with Dl interpreter

D-FILE

Data model

The data model of a D-file is simple:

<file>::= an ordered set of <record>s
<record>::= an ordered set of <field>s
<field>::= a pair of a <field name> and a <value>

D-file conventions

A D-file is a text file following the D-file conventions, which are:

  1. One line in the text file is a <field>, except for null lines.
  2. In a <field>, the <field name> and the <value> are separated by the first COLON (:).
  3. One or more null lines mark the end of a <record>.
  4. EOF of the text file is the end of a <file>.

By the conventions above, <value> can include any character but the newline character (nl). However, for the convenience of the implementation, we add:

  5. A line must not contain the null character (nul) or the next smallest control character (soh).

By this rule, the <value> must not contain nl, nul or soh, while the <field name> must not contain COLON in addition to these. Any other characters supported by the operating system under the current locale can be used in both <field name> and <value>.

It is worth noting that, according to rule 4 above, the last record in the file need not be followed by a null line. Also, as a consequence of rule 3, there is no null record (i.e. a record whose field count is zero), because two or more consecutive null lines are regarded as one delimiter.
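The conventions above can be sketched as a small reader. This is only an illustration in Python, not the D implementation; the function name parse_dfile and the choice to keep a line without a COLON under the name None are assumptions of this sketch.

```python
def parse_dfile(text):
    """Parse D-file text into a list of records.

    Each record is a list of (field_name, value) pairs in file order.
    """
    records, current = [], []
    for line in text.split("\n"):
        if line == "":
            # A null line ends the current record. Consecutive null
            # lines act as one delimiter, so no empty record is made.
            if current:
                records.append(current)
                current = []
            continue
        name, colon, value = line.partition(":")  # split at the first COLON
        if not colon:
            # No COLON at all: kept as-is under the name None
            # (an assumption of this sketch).
            current.append((None, line))
        else:
            current.append((name, value))
    if current:
        # EOF ends the last record, with or without a trailing null line.
        records.append(current)
    return records


sample = "a:1\nb: x:y\n\n\na:2\nb:"
records = parse_dfile(sample)
print(records)
```

In the sample, the value of the first "b" field keeps its leading space and its second COLON, and the two consecutive null lines produce no empty record.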

There is no limit on the length of a <field name> or a <value>. Nor is there a limit on the number of <field>s in a <record> or the number of <record>s in a <file>, other than system limits.

Unlike mail header (RFC 822) fields, the D-file convention does not treat space characters as special. Spaces after the field delimiter COLON are part of the value. There is no way to split a field into a multiple-line representation.

Null value

The <value> and the <field name> can be a null string (i.e. of length zero). However, using a null string as the field name is discouraged. Some D-commands interpret a null string field name as a special function, such as skipping the field.

There is nothing special about using a null string as a <value>. However, it must be distinguished from the null value (no <field> with the given <field name>). In the following example, the first record has a null value for the field "b" (no field "b"), while the second has a field "b" with a null string value.

Example:

a:1

a:2
b:

You may use either way to represent a missing value, but be careful about the difference between the two representations: they are never the same in the D-file context.

Errors

A line which does not have any COLON is an error field. Handling of error fields is implementation dependent. In the current implementation, they are generally left in the output record as they are. When a field name selection is performed, they are never selected.

Bytes which do not obey the character encoding of the current locale are also error data. D-commands print a warning message for encoding errors.

Null characters in the input are also error data. But, in this case, the current implementation tacitly discards null characters. They don't appear in the output records.

Numeric values

In a D-file, numeric values are represented by character strings. When a D-command creates a numeric value field, it converts the value with the printf format %d (integer) or %g (non-integer). When a D-command evaluates a value as numeric, the string must be a decimal representation acceptable to strtod, or a 0x-prefixed hexadecimal representation acceptable to strtol, optionally preceded and followed by space characters. (printf, strtod and strtol are standard C functions.) Any string which does not follow these formats is evaluated as zero. For example,

a:1/2
b:123c
c:1 2 3

are all evaluated as zero. Note that D-commands do not report these zero evaluations. Also note that decimal point character depends on the locale and not always FULL STOP (.) character.
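The evaluation rule can be approximated as follows. This is a simplified Python sketch, not the actual D code: the name numeric_value is hypothetical, only plain decimal and 0x forms are recognized (strtod itself accepts more, such as "inf"), only spaces and tabs are stripped, and the decimal point is fixed to FULL STOP rather than taken from the locale.

```python
import re

# Decimal forms accepted by this sketch: optional sign, digits with an
# optional fraction, optional exponent.
_DECIMAL = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")
# Hexadecimal forms: optional sign, 0x prefix, hex digits.
_HEX = re.compile(r"[+-]?0[xX][0-9a-fA-F]+")


def numeric_value(text):
    """Evaluate a field value as a number; anything invalid becomes 0."""
    s = text.strip(" \t")  # leading and trailing spaces are allowed
    if _HEX.fullmatch(s):
        return float(int(s, 16))
    if _DECIMAL.fullmatch(s):
        return float(s)
    return 0.0  # silently zero, as described above


for value in ["1/2", "123c", "1 2 3", " 2.5 ", "0x10", "2,5"]:
    print(repr(value), numeric_value(value))
```

The last value, "2,5", shows why a file written under a locale with a COMMA decimal point evaluates to zero when the decimal point is FULL STOP.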

Repeating fields

A D-record may have two or more fields with the same field name. For example:

a:1
b:2
a:3
b:4

D-commands, in principle, handle repeating field as one-dimensional dynamic array of elements; i.e., like an array of perl, it has a sequence of zero or more simple values handled under a given name. In the above example, field "a" has value { 1 3 } and field "b" has value { 2 4 }, while field "c", which does not exist actually, has value { }, or null value.
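The array view of a repeating field can be illustrated with a short Python sketch. The record is modelled as a list of (field name, value) pairs; that representation and the name field_values are choices made for this illustration only.

```python
def field_values(record, name):
    """Return the values of all fields called `name`, in record order."""
    return [value for field, value in record if field == name]


record = [("a", "1"), ("b", "2"), ("a", "3"), ("b", "4")]

print(field_values(record, "a"))
print(field_values(record, "b"))
print(field_values(record, "c"))  # no such field: an empty array
```

In this model the null value is the empty list, while a field with a null string value yields a list containing one empty string, so the two stay distinct.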

Elements of an array are always simple values. Therefore, two dimensional array or matrix is not handled by D-commands. Also note that when evaluating a repeating field as numeric, all elements are evaluated as numeric. This array is homogeneous and can't mix numeric and string values in it.

D-commands' parameters don't use suffix for a fieldname. If you really want to change the handling of a field by its relative position, you should give it a separate field name. Dl is an exception, which has a suffix operation.

Field order

The order of fields with different field names is not significant. Under this principle, the next two records have the same values.

a:1
b:2
a:3
b:4

a:1
a:3
b:2
b:4

But the order of fields with the same field name is significant. The next record is different from the examples above, because the "a" field value { 3 1 } is not the same as { 1 3 }.

a:3
b:2
a:1
b:4
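The three example records can be compared with a small Python sketch. Records are modelled as lists of (field name, value) pairs, and the names grouped and same_values are hypothetical.

```python
def grouped(record):
    """Collect the values of each field name, keeping their order."""
    groups = {}
    for name, value in record:
        groups.setdefault(name, []).append(value)
    return groups


def same_values(r1, r2):
    # Dict comparison ignores the order of the names, but the value
    # lists are compared element by element, in order.
    return grouped(r1) == grouped(r2)


first = [("a", "1"), ("b", "2"), ("a", "3"), ("b", "4")]
second = [("a", "1"), ("a", "3"), ("b", "2"), ("b", "4")]
third = [("a", "3"), ("b", "2"), ("a", "1"), ("b", "4")]

print(same_values(first, second))
print(same_values(first, third))
```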

D-commands which handle repeating groups (Dunbundle and Dtie) may use field order to identify leaves. In this case only, the order of different fields is significant. (See the manual of D_lsa.)

Unless explicitly specified, a D-command preserves the input record's field order. For example, Dorder keeps the relative order of the fields whose names are not in the field-list.

Comparison of values

Values can be compared as strings or as numeric values. The default is string comparison. Unless a key-flag specifies numeric comparison (see the section Key-field-list below), two values are compared as strings.

String comparison follows the internal process code values of the characters. On Windows, the internal code is UNICODE (ISO 10646 UCS2). On GNU/Linux, the internal code is UNICODE (ISO 10646 UCS4). On other UNIX systems, the internal process code may depend on the locale. See the operating system's documentation on character set handling.

When numeric comparison is applied, values are converted to numeric values. Note that a field not representing a valid numeric form is converted to 0. (See the Numeric Values section above).

When a field repeats, the first elements are compared first; if they are equal, the second ones are compared, and so on, until unequal elements are found or the end of the elements is reached. When one side is used up, that side is smaller. In other words, a non-existing field is treated as the smallest value (smaller than the null string, and smaller than -Infinity).

Two or more fields can be compared similarly. When fields "a,b" are compared, the field "a" is compared first, and if they are equal then the field "b" is compared. In this case the order of field "a" and field "b" in a record does not affect the comparison.
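The ordering described above matches the way Python compares lists, so it can be sketched briefly. Only string comparison is shown (Python compares strings by code point); records are lists of (field name, value) pairs, and sort_key is a hypothetical name.

```python
def sort_key(record, names):
    """Build a comparison key: one list of values per key field."""
    return [[v for f, v in record if f == name] for name in names]


no_b = [("a", "1")]
empty_b = [("a", "1"), ("b", "")]
one_b = [("a", "1"), ("b", "x")]
two_b = [("b", "x"), ("a", "1"), ("b", "y")]

keys = [sort_key(r, ["a", "b"]) for r in (no_b, empty_b, one_b, two_b)]
print(keys)
print(keys == sorted(keys))
```

List comparison proceeds element by element and treats the list that runs out first as smaller, so a missing field "b" sorts before a "b" holding the null string, and { x } sorts before { x y }.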

Repeating group

A repeating group is a set of fields which repeats in a record. For example, in the next D-record,

continent:Asia
code:CN
name:China
code:JP
name:Japan
code:KR
name:Korea (the Republic of)

the three pairs of field "code" and field "name" form a "repeating group". There is no explicit design to accommodate this type of structure, such as nested fields, in the D-file convention. But some D-commands can handle repeating groups using the leaf separation algorithm.

Typically, Dunbundle is used to flatten a level of the tree structure, or Dtie converts a repeating group to repeating fields. Some examples are found in D_tutorial.

COMMAND ARGUMENTS

D-command arguments follow these rules:

  1. Command names begin with capital letter "D" followed by small letters, or (in limited cases) by capital letters.
  2. Option names are one character long.
  3. All options are preceded by "-".
  4. Options with no arguments may be grouped after a single "-".
  5. An option's argument may follow the option name preceded by space characters, or directly follow the option name. However, if the option argument is a null string, it must be preceded by tab or space characters.
  6. All options precede operands on the command line.
  7. "--" may be used to indicate the end of the options.
  8. The order of options has no significance, except when the same option name repeats. A repeated option is handled in the way determined by each command. Generally, only the last one is effective, but in some commands repeating an option is significant.
  9. After the options, the mandatory operand(s), if any, follow. After that, the input file names optionally follow.
  10. If there is no input file name, or the number of input files is less than the mandatory number (Djoin, Dpaste and Dupdate require at least two input files), standard input is used as the last input file.
  11. "-" as an input file name means standard input.
  12. Two long options, --help and --version, are supported despite the rules above.

Dhead and Dtail are exceptions to the above rules. By convention they accept options like -10, as the UNIX commands head(1) and tail(1) do.

Common command options

The following options are used in many D-commands and have the same meaning in each.

-? help
-F file by file process; "filename" field output.
-g group by.
-k default key options. (See Key-field-list subsection below).
-t default delimiter string for the format. (See the manual of D_fmt(1)).
-z default options for the format. (See the manual of D_fmt(1)).
-D Definition of D-command internal variables. (See D-command internal variable subsection below).
--help help; same as -?
--version print version

Regular expression

Some D-commands use regular expressions in their arguments. These follow the UNIX egrep specification.

| or; matches left side or right side
* zero or more repeat of preceding one
+ one or more repeat of the preceding one
? optional; zero or one occurrence of the preceding one
( ) grouping; make a regular expression one unit
. single character; matching any single character
^ top; matching the null string at the beginning of the string
$ end; matching the null string at the end of the string
\ escape; matching the following character
[...] range; matching any one of enclosed characters
[^...] excluding range; matching any one character not included in the enclosed characters but for the top "^"
- in [], shorthand for the full list of characters in internal code between left hand character and right hand character

Note that in [ ], "\" does not work as an escape. To use "]", "^" or "-" in [ ], put them in an out-of-context position. If you want to use "]", put it at the first position, or just after the "^" at the first position. Similarly, "^" not at the first position is just a normal character, and "-" at the first or the last position is also a normal character. Be careful when using "-" as a range, because its meaning depends on the internal code.

The regular expression handler used in the current implementation is modified from Henry Spencer's V8 regular expression package. It is character-based, not byte-based: matching is always made against a character and not against a byte.

Field-list

In many D-commands, field-lists are used as operands or in option arguments. A field-list is a COMMA separated list of field names.

Before D-2.6, spaces could be used as field name separators in a field-list. It caused complexity in using spaces in the field-list, especially in field-format list. From D-2.6, separator in the field-list is COMMA only.

Semi-formal syntax of the field-list is as follows:

<field-list> ::= [^] [<fspec> [{,<fspec>}..]]
<fspec> ::= <field name>[:<additional-inf>]

Note that SPACE after COMMA will be a part of the next field name.

"a, b"

In the example above, the first field name is the one character "a" and the second field name is the two characters " b", not the one character "b".

The character REVERSE SOLIDUS (\) is used to escape COMMA (,) and REVERSE SOLIDUS itself. A CIRCUMFLEX ACCENT (^) at the top of the field-list can also be escaped with a REVERSE SOLIDUS. For example:

"\^ab\,cd,\\ef"
"\ef,^ab\,cd"

In the first line, two fields "^ab,cd" and "\ef" are listed. Note that a REVERSE SOLIDUS is removed only when it appears before a COMMA or a REVERSE SOLIDUS (or when it appears at the top and the following character is CIRCUMFLEX ACCENT). All other REVERSE SOLIDUSes are left intact. Thus, the second line has the same two fields, "\ef" and "^ab,cd".
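The splitting and escape rules can be sketched in Python as below. This is an illustration, not the D parser: the name parse_field_list is hypothetical, additional information after a COLON is not separated from the field name, and treating an empty list as having no fields is an assumption.

```python
def parse_field_list(text):
    """Return (exclusive, names) for a field-list string."""
    exclusive = False
    if text.startswith("^"):
        exclusive, text = True, text[1:]   # leading ^ means exclusion
    elif text.startswith("\\^"):
        text = text[1:]                    # escaped ^ at the top is literal
    if text == "":
        return exclusive, []               # assumption: no fields at all
    names, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in ",\\":
            current.append(text[i + 1])    # escaped COMMA or REVERSE SOLIDUS
            i += 2
        elif ch == ",":
            names.append("".join(current)) # unescaped COMMA separates names
            current = []
            i += 1
        else:
            current.append(ch)             # any other character is kept
            i += 1
    names.append("".join(current))
    return exclusive, names


print(parse_field_list(r"\^ab\,cd,\\ef"))
print(parse_field_list(r"\ef,^ab\,cd"))
print(parse_field_list("a, b"))
```

The two escaped examples yield the same pair of names in opposite order, and "a, b" yields a second name with a leading space.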

A field-list may have additional information for each field name in it. It is separated from the field name by a COLON, and the REVERSE SOLIDUS escape mechanism also works in this additional information part. Additional information is used to specify key-flags, input-output format, a new field name, etc. Depending on the type of its additional information, a field-list may be called a key-field-list, field-format-list, leaf-field-list, or field-rename-list. See the following sections for these specific field-lists.

Field-format-list

The general form of a field-format-list is a list of:

field-name[:format]

A field-format-list specifies how to convert a character string to/from D-fields. Two options, -t and -z, may be used with a field-format-list to give default values for the format.

See the manual of D_fmt for details.

Key-field-list

The general form of a key-field-list is a list of:

field-name[:key-flags]

A key-field-list specifies the fields to be used as key values, together with their attributes (i.e. numeric or string, etc.). The option -k is used with a key-field-list to give the default value of the key-flags.

Key-flags here is a string of the following characters (a subset of those used in the UNIX command sort) and gives the key attributes of the field.

n numeric; values are converted to numeric (double)
r reverse order (descending order)
f capital letters; small letters are converted to capital
d dictionary order; alpha-numeric and spaces only
i non-print characters are eliminated

The -k default takes effect only for the fields which have no key-flags in the list. Therefore, the following examples have the same meaning:

"a,b:nr"
-k nr "a: b"
-k n "a: b:nr"

But the next one is different:

-k n "a: b:r"

because field "b" is not numeric in this case.

Key-flags "f" and "d" are effective for non-ASCII characters if the locale supports those characters.

A duplicate name in the list is allowed, but its semantics depend on each command's interpretation. In many cases, only the first one is effective.

Leaf-field-list

The general form of the leaf-field-list is:

fieldname[:*]

The leaf-field-list is used in Dunbundle and Dtie to give the fields to be handled as "leaves", together with their repeatability. In the case of Dtie, the leaf-field-list is at the same time a field-format-list. Actually, a leaf-field-list is a subset of the field-format-list, of which only the field name and "*" are in use.

See the manual of D_lsa for details.

Field-rename-list

The general form is:

old-field-name:new-field-name

A field-rename-list is used in Drename to specify field renaming. Note that the "\" escape mechanism also works for the new-field-name. For example, to change the field name "a-b" to "a,b", use:

"a-b:a\,b"

There is no COLON checking for the new-field-name. You can write

"a:b:c"

as a field-rename-list. But the result is to change the field "a" to the field "b" with values always preceded by "c:". This type of usage might be checked in future versions, and it is recommended to avoid it.
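Why "a:b:c" behaves this way follows from the rule that a field line is split at its first COLON. The Python sketch below traces it; split_rename is a hypothetical helper that ignores the "\" escape mechanism, and it is not a model of Drename itself.

```python
def split_rename(spec):
    """Split 'old:new' at the first COLON; the rest is the new name."""
    old, _, new = spec.partition(":")
    return old, new


old, new = split_rename("a:b:c")
print(old, new)

# Renaming the input field "a:1" writes the new name, a COLON, then the value.
line = new + ":" + "1"
print(line)

# Any reader of the output splits that line at its first COLON.
name, _, value = line.partition(":")
print(name, value)
```

The written line is "b:c:1", which is read back as field "b" with the value "c:1".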

Exclusive field-list

A field-list that begins with CIRCUMFLEX ACCENT (^) means exclusion. (This was suggested by [^..] in regular expressions.) An exclusive field-list can be used wherever a D-command requests a field-list argument, unless it contradicts the command semantics. In the extreme case, the one character "^" is an exclusive field-list that means "all fields".

Note that additional information in an exclusive field-list has no meaning and is simply ignored. But you may use an exclusive field-list as a field-format-list or key-field-list. When an exclusive field-list is used as one of these special field-lists, the default value for the format or key-flags is applied.

When processing an exclusive field-list, a D-command first establishes a corresponding inclusive field-list with no entry in it. During processing, each time the command encounters a new field that is neither in the exclusive field-list nor in the corresponding inclusive field-list, it adds this field name to the corresponding inclusive field-list. This corresponding inclusive field-list is the one used in the actual processing. Therefore, the field order depends on the actual input file.

This is important when the field order has special meaning as in Dorder, or in a key-field-list. For example, in the next case,

Dsort ^a file1 > tmp1
Dsort ^a file2 > tmp2
Djoin ^a tmp1 tmp2

the Djoin command may fail, because corresponding inclusive lists of the first Dsort and the second Dsort may be different.
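The way the corresponding inclusive list is built can be sketched in Python, which also shows how two inputs end up with different lists. Records are lists of (field name, value) pairs, and inclusive_list is a hypothetical name. The sample data is made up for the illustration.

```python
def inclusive_list(excluded, records):
    """Build the inclusive field-list in order of first encounter."""
    included = []
    for record in records:
        for name, _ in record:
            if name not in excluded and name not in included:
                included.append(name)
    return included


file1 = [[("a", "1"), ("b", "x"), ("c", "y")]]
file2 = [[("a", "1"), ("c", "y"), ("b", "x")]]

print(inclusive_list({"a"}, file1))
print(inclusive_list({"a"}, file2))
```

With these inputs "^a" becomes "b,c" for the first file and "c,b" for the second, so the two sorts would use different key orders.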

D-command internal variable

Some D-command internal variables are open to users to control the command's behavior. The command argument -D is used to give such a variable a value. The general form is:

-D variable-name=value[,.. ]

Variable names and their valid values are command dependent, and are described in each command's manual page. But the D-command internal variable for the UTF I/O feature is general. For example,

-D datautf=8

works for most commands to process UTF-8 encoded files.

OUTPUT

D-commands write their result to the standard output, and messages (if any) to the standard error. As a rule, the result is also a D-file. DtoLine, DtoTex, and Dpr are the exceptions, which output something other than a D-file. Dsort may use temporary files for sort work, and Dpr for Windows may use temporary files for printer output. These are the only cases in which a D-command writes anywhere other than standard output or standard error.

Pre-defined field names

Some D-commands use pre-defined field names in their output. For example, Dcat and some other D-commands may add "filename" fields.

A pre-defined field name is never changed, even if the input file already has the same field name. In such a case, the output records will have a repeating field, for example a repeating "filename" field. Whether the newly added field comes before or after the existing one depends on each D-command's specification. In the case of Dcat, the "filename" field is always added at the top of the record.

To avoid undesirable repeating, use Drename to change existing field names. There is no other way to change these pre-defined field names.

INTERNATIONALIZATION AND CHARACTER CODE

D-commands are internationalized in character code, numeric representation and date-time representation. Messages are not localized and always in English.

D-files are basically locale dependent. When you process a D-file created under a different locale, the D-command may fail unless the character code and numeric representation of the creation locale are the same as those of the processing locale.

For example, if your current locale is "ja_JP.eucJP" and you read a D-file created under the "ja_JP.UTF-8" locale, the D-command may report code error warnings, or may finish normally with a wrong result.

A difference in numeric representation may also cause errors. For example, suppose you use the Dmeans command under a French locale and get a result like:

avg.size:2,5

When you read this result under an English locale, the value is evaluated as zero, not 2.5. (See Numeric Values.)

As a rule, the input data, output data, messages and command arguments of a D-command are coded in the character code of the processing locale. But there are two exceptions to this rule.

One is DfromHtml and DfromXml. HTML or XML input file has (usually) encoding designation in it, and these commands can use it. Output file, command arguments and messages are not affected by the input file encoding. They are always in the current locale encoding.

Another exception is UTF I/O feature described in the following section.

When the input character code is different from the locale character code to output, there may be characters not in the output locale code. A character not in the output character set is converted to a QUESTION MARK (?) character.

UTF I/O Feature

D-commands can process UTF-8/16/32 encoded input/output files, regardless of the locale character code. The UTF I/O feature is invoked by an environment variable, or by a command argument giving a D-command internal variable. The UTF I/O feature affects only the character encoding of input/output files. Command arguments and messages are in the locale character code. The three variables in the next table invoke the UTF I/O feature.

Environment variable   D-command internal variable   Valid values          Description
Ddatautf               datautf                       8, 16, 32             input/output data encoding
Didatautf              idatautf                      8, 16, 32, -16, -32   input data encoding
Dodatautf              odatautf                      8, 16, 32             output data encoding

When both an environment variable and a D-command internal variable are given, the D-command internal variable value overrides the other. When both datautf and i/odatautf are given, the i/odatautf value is used.

The meaning of each value is shown in the following table:

[i/o]datautf value   Meaning
8                    encoding is UTF-8
16                   encoding is UTF-16 in native endian
32                   encoding is UTF-32 in native endian
-16                  encoding is UTF-16 in opposite endian
-32                  encoding is UTF-32 in opposite endian
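The precedence rules and the value table can be sketched in Python. This is one possible reading, not the D implementation: the names input_encoding and codec_name are hypothetical, and letting an idatautf setting from either source win over any datautf setting is an assumption, since the text does not say how an environment Didatautf ranks against a -D datautf.

```python
import sys


def input_encoding(environ, internal):
    """Pick the effective value for input data, or None if unset.

    `environ` holds environment variables, `internal` holds the
    variables given with -D.
    """
    for name in ("idatautf", "datautf"):  # specific before general
        if name in internal:              # -D overrides the environment
            return internal[name]
        if "D" + name in environ:
            return environ["D" + name]
    return None


def codec_name(value):
    """Map an [i/o]datautf value to a Python codec name."""
    native, opposite = ("le", "be") if sys.byteorder == "little" else ("be", "le")
    return {
        "8": "utf-8",
        "16": "utf-16-" + native,
        "32": "utf-32-" + native,
        "-16": "utf-16-" + opposite,
        "-32": "utf-32-" + opposite,
    }[value]


value = input_encoding({"Ddatautf": "16"}, {"datautf": "8"})
print(value, codec_name(value))
```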

For example,

Ded -D datautf=8 -f program.dl input.d

tells Ded that the input file input.d, the Dl program file program.dl and the output file are all encoded in UTF-8.

Note that the UTF I/O feature is available only when the system internal code is ISO 10646 compliant. That means Windows and all recent distributions of Linux can use it. But under Solaris, when you are using an EUC code locale, you cannot use this feature. (When you are using a UTF-8 locale, UTF I/O works. But in this case you can use UTF-8 anyway.)

The UTF I/O feature is designed for two purposes. One is to use UTF-8 on Windows, where UTF-8 encoding is not supported by locales. The other is to use the UTF-32 or UTF-16 internal code for I/O, in order to reduce inter-command I/O overhead.

RETURN VALUES

The return value of a D-command is uniformly 0 to 3, meaning:

0 Normal.
1 Warning; the program reaches the end with some error messages on standard error.
2 Error; errors are found during parameter analysis, and no output is made.
3 Emergency; the program is stopped during the process.

SEE ALSO

D_fmt, D_lsa, D_msg, D_tutorial

AUTHOR

MIYAZAWA Akira


miyazawa@nii.ac.jp
2013