Dintro - introduction to D-commands

DESCRIPTION

D is a series of commands that perform file operations on D-files. The names of D-commands always start with the capital letter D.

Each D-command provides a basic file operation, such as selection or join (matching), on files in D-file format. The result of an operation is also in D-file format and is written to the standard output. D-commands are invoked from a shell and combined with each other, or with other commands, to perform complex data processing.

D is not an "all in one" package. It has only limited functions for computation and string manipulation. The basic philosophy of D is that of an "open system": it is most effective when used together with sed, awk, perl, or any other program.

D-COMMANDS LISTING

Short names are not available in the Windows version.

Conversion from/to other formats

Name        Short name  Description
DfromLine   DfL         Line format to D-file
DtoLine     DtL         D-file to Line format
DfromCsv                Csv file to D-file
DtoCsv                  D-file to Csv file
DfromHtml               Extract D-records from Html file
DfromXml                Extract D-records from Xml file
DtoXml                  D-file to Xml portion
DtoTex                  D-file to TeX

Display/print

Name        Short name  Description
Dpr                     Display/print D-files

Selection

Name        Short name  Description
Dselect     Dsel        Selection by a conditional expression
Dgrep       Dgp         Selection by a regular expression matching
Dextract    Dex         Selection by the record number list
Dhead       Dhd         Selection of the top n records
Dtail       Dta         Selection of the last n records
Dmax                    Selection of the maximum/minimum value records

File operation

Name        Short name  Description
Dproj       Dpj         Projection
Djoin       Djn         Join/matching of D-files by key values
Dpaste      Dpt         Horizontal merge of D-files
Drename     Drn         Renaming the field names
Dorder      Dod         Changing field sequence of D-records
Dsort       Dst         Sorting by key values
Dcat                    Concatenation of D-files
Dupdate     Dupd        Update a D-file

File operation with repeating fields

Name        Short name  Description
Dbundle     Dbn         Vertical merge of D-records by key values
Dunbundle   Dubn        Slicing D-records with a repeating group
Dtie        Dti         Concatenation of group fields
Duntie      Duti        Decomposition of a field into some fields
Dpack       Dpk         Concatenation of repeating fields into a field
Dunpack     Dupk        Decomposition of a field into repeating fields

Simple statistics

Name        Short name  Description
Dfd                     Field description
Dfdp                    Dfd and display (not available in Windows)
Dfreq       Dfq         Frequency count of field values
Drc                     Record count
Drcp                    Drc and display (not available in Windows)
Dmeans      Dmn         Calculation of max., min., average and standard deviation
Dmax                    Selection of the maximum/minimum value records

D-file editor

Name        Short name  Description
Ded                     D-file editor with Dl interpreter

D-FILE

Data model

The data model of a D-file is simple:

<file>   ::= an ordered set of <record>s
<record> ::= an ordered set of <field>s
<field>  ::= a pair of a <field name> and a <value>

D-file conventions

A D-file is a text file that follows the D-file conventions:

  1. One line in the text file is a <field>, except for null lines.
  2. In a <field>, the <field name> and the <value> are separated by the first COLON (:).
  3. One or more null lines mark the end of a <record>.
  4. EOF of the text file is the end of a <file>.

By the conventions above, <value> can include any character but the newline character (nl). However, for the convenience of the implementation, we add:

  5. A line must not contain the null character (nul) or the next smallest control character (soh).

By this rule, the <value> must not contain nl, nul or soh, and the <field name> must additionally not contain COLON. Any other character supported by the operating system under the current locale can be used in both the <field name> and the <value>.

It is worth noting that, according to rule 4 above, the last record in the file need not be followed by a null line. Also, there is no null record (a record whose field count is zero), as a consequence of rule 3: two or more consecutive null lines are regarded as one delimiter.

There is no limit on the length of a <field name> or a <value>, and no limit on the number of <field>s in a <record> or the number of <record>s in a <file>, other than the limits of the system.

Unlike mail header (RFC 822) fields, the D-file convention does not treat space characters as special. Spaces after the field delimiter COLON are part of the value. There is no way to split a field into a multiple-line representation.
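
The conventions above can be sketched as a small reader. The following Python is only an illustration of rules 1 to 4, not the D implementation; the function name parse_dfile and the use of None to mark a line without a COLON are assumptions made for this sketch.

```python
def parse_dfile(text):
    """Parse D-file text into a list of records (illustrative sketch).

    Each record is a list of (field_name, value) pairs in input order.
    A line without a COLON is kept as (None, line); this marker is an
    assumption of the sketch, not something D defines.
    """
    records = []
    current = []
    for line in text.split("\n"):
        if line == "":
            # Rule 3: a run of null lines is a single record delimiter,
            # so an empty "current" never produces a null record.
            if current:
                records.append(current)
                current = []
            continue
        # Rule 2: split at the first COLON only. Later colons and any
        # spaces after the delimiter belong to the value.
        name, sep, value = line.partition(":")
        current.append((name, value) if sep else (None, line))
    # Rule 4: EOF ends the file; the last record needs no null line.
    if current:
        records.append(current)
    return records


sample = "a:1\nb: x:y\n\n\na:2\nb:"
records = parse_dfile(sample)
print(records)
```

Note that the value of the first "b" field keeps both its leading space and its second colon, and that the two consecutive null lines produce no empty record.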

Null value

The <value> and the <field name> can each be a null string (i.e. of length zero). However, using a null string as a field name is discouraged: some D-commands interpret a null string field name as a special function, such as skipping the field.

There is nothing special about using a null string as a <value>. However, it must be distinguished from the NULL value (no <field> with the given <field name>). In the following example, the first record has the NULL value for the field "b" (there is no field "b"), while the second has a field "b" with a null string value.

Example:

a:1

a:2
b:

You may use either way to represent a missing value, but be careful about the difference between the two representations: in the D-file context they are never the same.
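
To make the distinction concrete, here is a minimal Python sketch. It assumes, for illustration only, that a record is held as a list of (field name, value) pairs; the helper field_values is hypothetical and not part of D.

```python
def field_values(record, name):
    """Return every value the record holds for the given field name."""
    return [value for field, value in record if field == name]


first = [("a", "1")]                # no field "b": NULL value
second = [("a", "2"), ("b", "")]    # field "b" with a null string value

print(field_values(first, "b"))     # []
print(field_values(second, "b"))    # ['']
```

An empty list (NULL) and a list holding one empty string are different results, which mirrors the difference between the two records in the example.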

Errors

A line which does not contain any COLON is an error field. The handling of error fields is implementation dependent. In the current implementation they are generally left in the output record as they are, and they are never selected when a field name selection is performed.

Bytes which do not conform to the character encoding of the current locale are also error data. D-commands print a warning message for encoding errors.

Null characters in the input are also error data. In this case, however, the current implementation silently discards the null characters, so they do not appear in the output records.

Numeric values

In a D-file, numeric values are represented by character strings. When a D-command creates a numeric value field, it converts the value using the printf format %d (integer) or %g (non-integer). When a D-command evaluates a value as numeric, the string must be a decimal representation acceptable to strtod, or a 0x-prefixed hexadecimal representation acceptable to strtol, optionally preceded and followed by space characters. (printf, strtod and strtol are standard C functions.) Any string which does not follow these formats is evaluated as zero. For example,

a:1/2
b:123c
c:1 2 3

are all evaluated as zero. Note that D-commands do not report these zero evaluations. Also note that the decimal point character depends on the locale and is not always the FULL STOP (.) character.
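
The evaluation rule can be approximated in Python as follows. This is a simplified sketch, not the D implementation: it assumes a locale whose decimal point is FULL STOP, and it accepts only plain decimal and 0x-prefixed integer forms, whereas strtod itself also accepts forms such as "inf". The name d_numeric is hypothetical.

```python
import re

# Assumption: "." is the decimal point (as in the C locale).
_DECIMAL = re.compile(r"[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?")
_HEX = re.compile(r"[+-]?0[xX][0-9a-fA-F]+")


def d_numeric(value):
    """Evaluate a field value as a number; invalid forms become zero."""
    s = value.strip()  # spaces may precede and follow the number
    if _HEX.fullmatch(s):
        return float(int(s, 16))
    if _DECIMAL.fullmatch(s):
        return float(s)
    # The whole string must be a valid number; otherwise it is zero,
    # and no error is reported.
    return 0.0


for text in ["1/2", "123c", "1 2 3", " 2.5 ", "0x1F"]:
    print(repr(text), d_numeric(text))
```

The first three strings are the examples above and evaluate to zero; " 2.5 " and "0x1F" evaluate to 2.5 and 31.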

Repeating fields

A D-record may have two or more fields with the same field name. For example:

a:1
b:2
a:3
b:4

D-commands, in principle, handle a repeating field as a one-dimensional dynamic array of elements; i.e., like a perl array, it is a sequence of zero or more simple values handled under a given name. In the above example, field "a" has the value { 1 3 } and field "b" has the value { 2 4 }, while field "c", which does not actually exist, has the value { }, or the NULL value.

Elements of an array are always simple values. Therefore, two-dimensional arrays or matrices are not handled by D-commands. Also note that when a repeating field is evaluated as numeric, all of its elements are evaluated as numeric. The array is homogeneous and cannot mix numeric and string values.

D-command parameters do not use a suffix on a field name. If you really want to handle a field differently according to its relative position, you should give it a separate field name. Dl is an exception, which has a suffix operation.

Field order

The order of fields with different field names is not significant. Under this principle, the next two records have the same value.

a:1
b:2
a:3
b:4

a:1
a:3
b:2
b:4

But the order of fields with the same field name is significant. The next record is different from the examples above, because the "a" field value { 3 1 } is not the same as { 1 3 }.

a:3
b:2
a:1
b:4
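
One way to picture this principle is to reduce each record to a mapping from field name to its array of values, and compare the mappings. The Python below is a sketch under that assumption; record_value is a hypothetical helper, and records are represented as lists of (field name, value) pairs for illustration.

```python
def record_value(record):
    """Map each field name to the ordered list of its values."""
    value = {}
    for name, field_value in record:
        value.setdefault(name, []).append(field_value)
    return value


r1 = [("a", "1"), ("b", "2"), ("a", "3"), ("b", "4")]
r2 = [("a", "1"), ("a", "3"), ("b", "2"), ("b", "4")]
r3 = [("a", "3"), ("b", "2"), ("a", "1"), ("b", "4")]

print(record_value(r1) == record_value(r2))  # True
print(record_value(r1) == record_value(r3))  # False
```

Dictionary comparison ignores the order of the names but list comparison respects the order of the elements, which matches the two rules stated above.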

D-commands which handle repeating groups (Dunbundle and Dtie) may use field order to find leaves. In this case only, the order of fields with different names is significant. (See the manual of D_lsa.)

Unless explicitly specified otherwise, a D-command preserves the input record's field order. For example, Dorder keeps the order of the fields whose names are not in the field-list.

Comparison of values

Values can be compared as strings or as numeric values. The default is string comparison: unless a key-flag specifies numeric comparison (see the Key-field-list section below), two values are compared as strings.

String comparison follows the internal process code values of the characters. On Windows, the internal code is Unicode (ISO 10646). On UNIX, the internal process code depends on the locale. See the operating system's documentation on character set handling.

When numeric comparison is applied, values are converted to numeric values. Note that a field not representing a valid numeric form is converted to 0. (See the Numeric Values section above).

When a field repeats, the first elements are compared first; if they are equal, the second ones are compared, and so on, until unequal elements are found or the end of the elements is reached. When one side is used up, that side is smaller. In other words, a non-existing field is treated as the smallest value (smaller than the null string, and smaller than -Infinity).

Two or more fields can be compared similarly. When fields "a,b" are compared, the field "a" is compared first, and if the values are equal then the field "b" is compared. In this case the order of field "a" and field "b" within a record does not affect the comparison.
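
The comparison rule happens to coincide with how Python compares lists, so it can be sketched briefly. This is an illustration only; sort_key and its convert parameter are hypothetical names, and Python's code point ordering of strings stands in for the internal process code.

```python
def sort_key(record, names, convert=str):
    """Build a comparison key: one list of values per key field."""
    key = []
    for name in names:
        # A missing field yields [], which sorts before every non-empty list.
        key.append([convert(v) for f, v in record if f == name])
    return key


missing = [("b", "x")]              # no field "a" at all
empty = [("a", "")]                 # "a" is a null string
one = [("a", "1")]
two = [("a", "1"), ("a", "2")]      # repeating field "a"

ordered = sorted([two, one, empty, missing], key=lambda r: sort_key(r, ["a"]))
print(ordered)

# The position of "a" and "b" inside a record does not change the key.
print(sort_key([("b", "2"), ("a", "1")], ["a", "b"]))
```

The sorted result is missing, empty, one, two: the record without "a" comes first, and the record whose "a" elements run out first is smaller than the one with a further element.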

COMMAND ARGUMENTS

D-command arguments follow these rules:

  1. Command names begin with capital letter "D" followed by small letters, or (in limited cases) by capital letters.
  2. Option names are one character long.
  3. All options are preceded by "-".
  4. Options with no arguments may be grouped after a single "-".
  5. An option's argument may follow the option name either separated by space characters or directly. However, if the option argument is a null string, it must be separated from the option name by tab or space characters.
  6. All options precede operands on the command line.
  7. "--" may be used to indicate the end of the options.
  8. The order of options is not significant, except when the same option name repeats. A repeated option is handled in the way determined by each command. Generally only the last one is effective, but in some commands each repetition is significant.
  9. After the options come the mandatory operand(s), if any, optionally followed by the input file names.
  10. If there is no input file name, or the number of input files is less than the mandatory number (Djoin, Dpaste and Dupdate require at least two input files), the standard input is used as the last input file.
  11. "-" as an input file name means standard input.
  12. Two long options --help and --version are supported despite the rules above.

Dhead and Dtail are exceptions to the rules above. By convention they accept options such as -10, as the UNIX commands head(1) and tail(1) do.

Common command options

The following options are used in many D-commands and have the same meaning in each.

-?         help
-F         file by file process; "filename" field output.
-g         group by.
-k         default key options. (See Key-field-list subsection below).
-t         default delimiter string for the format. (See the manual of D_fmt(1)).
-z         default options for the format. (See the manual of D_fmt(1)).
-D         Definition of D-command internal variables. (See D-command internal variable subsection below).
--help     help; same as -?
--version  print version

Regular expression

Some D-commands use regular expressions in their arguments. These follow the UNIX egrep specification.

|       or; matches the left side or the right side
*       zero or more repeats of the preceding one
+       one or more repeats of the preceding one
?       optional; zero or one occurrence of the preceding one
( )     grouping; makes a regular expression one unit
.       single character; matches any single character
^       top; matches the null string at the beginning of the string
$       end; matches the null string at the end of the string
\       escape; matches the following character
[...]   range; matches any one of the enclosed characters
[^...]  excluding range; matches any one character not among the enclosed characters (the leading "^" itself is not one of them)
-       in [ ], shorthand for the full list of characters whose internal code lies between the left hand character and the right hand character

Note that in [ ], "\" does not work as an escape. To use "]", "^" or "-" in [ ], put them in a position where they have no special meaning. To use "]", put it at the first position, or just after a "^" at the first position. Similarly, a "^" that is not at the first position is just a normal character, and a "-" at the first or the last position is also a normal character. Be careful when using "-" for ranges, because the result depends on the internal code.

The regular expression handler used in the current implementation is modified from Henry Spencer's V8 regular expression package. It is based on characters, not bytes: matching is always made against a character and not against a byte.

Field-list

In many D-commands, field-lists are used as operands or in option arguments. A field-list is a COMMA or SPACE separated list of field names. In the shell, quoting is required for a SPACE separated field-list. For example, the following three field-lists

a,b,c
"a b c"
a\ b\ c

are all identical in the UNIX shell (though few would use the last form).

Semi-formal syntax of the field-list is as follows:

<field-list> ::= [^] [<fspec> [{<dlm><fspec>}..]]
<fspec> ::= <field name>[:<additional-inf>]
<dlm> ::= <sp>[<sp>..]
        | [<sp>..],[<sp>..]
<sp> ::= SPACE | TAB | NEWLINE

The character REVERSE SOLIDUS (\) is used to escape COMMA (,), CIRCUMFLEX ACCENT (^) or the <sp> characters above, so a field name that includes these characters can be written using this escape mechanism. For example, in the UNIX shell,

"\^a\,b\ \"
\\\^a\\,b\\\ \\

both represent the field name "^a,b \". Note that a REVERSE SOLIDUS (\) is removed only when it appears before a COMMA, a CIRCUMFLEX ACCENT or a space character; any other REVERSE SOLIDUS is left intact. For example

"\START\,\END"

is a field name "\START,\END".
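
The escape rule can be sketched as a small tokenizer. The Python below is an illustration of the syntax described above, not the D implementation; parse_field_list is a hypothetical name, the input is the string as the D-command would receive it (after the shell has removed its own quoting), and any COLON-separated additional information is simply left attached to the name.

```python
SPACES = " \t\n"
ESCAPABLE = ",^" + SPACES


def parse_field_list(text):
    """Return (exclusive, names) for a field-list string (sketch)."""
    # An unescaped CIRCUMFLEX ACCENT at the very start marks exclusion.
    exclusive = text.startswith("^")
    if exclusive:
        text = text[1:]
    names, token, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in ESCAPABLE:
            # The backslash is removed only before COMMA, "^" or a space.
            token.append(text[i + 1])
            i += 2
            continue
        if ch == "," or ch in SPACES:
            # An unescaped delimiter ends the current name, if any.
            if token:
                names.append("".join(token))
                token = []
        else:
            # Any other character, including a lone backslash, is literal.
            token.append(ch)
        i += 1
    if token:
        names.append("".join(token))
    return exclusive, names


print(parse_field_list("a,b,c"))
print(parse_field_list("a b c"))
print(parse_field_list("\\^a\\,b\\ \\"))
print(parse_field_list("\\START\\,\\END"))
```

The last two calls receive the strings \^a\,b\ \ and \START\,\END and yield the single field names ^a,b \ and \START,\END respectively, as described in the text.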

A field-list may have additional information for each field name in it. It is separated from the field name by a COLON, and the "\" escape mechanism also works in the additional information part. Additional information is used to specify key-flags, input-output format, new field name, etc. Depending on the type of its additional information, a field-list becomes a key-field-list, field-format-list, leaf-field-list, or field-rename-list. See the following sections for these special field-lists.

Field-format-list

General form of field-format-list is a list of:

field-name[:format]

A field-format-list specifies how to convert a character string to/from D-fields. Two options -t and -z may be used with a field-format-list to give default values of the format.

See the manual of D_fmt for details.

Key-field-list

General form of a key-field-list is a list of:

field-name[:key-flags]

A key-field-list specifies the fields to be used as key values, together with their attributes (numeric or string, etc.). The option -k is used with a key-field-list to give the default value of the key-flags.

Key-flags is a string of the following characters (a subset of those used by the UNIX sort command), which gives the key attributes of the field.

n  numeric; values are converted to numeric (double)
r  reverse order (descending order)
f  capital letters; small letters are converted to capital
d  dictionary order; alpha-numeric and spaces only
i  non-print characters are eliminated

The -k default takes effect only for fields which have no key-flags part in the list. Therefore, the following examples have the same meaning:

"a,b:nr"
-k nr "a: b"
-k n "a: b:nr"

But the next one is different:

-k n "a: b:r"

because field "b" is not numeric in this case.
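
The way the -k default combines with per-field key-flags can be sketched as follows. This Python is an illustration inferred from the examples above, not the D implementation: it assumes that a field written with a COLON (even with nothing after it, as in "a:") counts as having its own key-flags, and it ignores the "\" escape mechanism. The name parse_key_field_list is hypothetical.

```python
import re


def parse_key_field_list(text, default_flags=""):
    """Return a list of (field_name, key_flags) pairs (sketch)."""
    keys = []
    for spec in re.split(r"[,\s]+", text.strip()):
        if not spec:
            continue
        name, colon, flags = spec.partition(":")
        # Assumption: the default applies only when no COLON part is given.
        keys.append((name, flags if colon else default_flags))
    return keys


print(parse_key_field_list("a,b:nr"))
print(parse_key_field_list("a: b", default_flags="nr"))
print(parse_key_field_list("a: b:nr", default_flags="n"))
print(parse_key_field_list("a: b:r", default_flags="n"))
```

Under this assumption the first three calls give the same result, with "a" unflagged and "b" flagged "nr", while the last gives "b" only the flag "r", so "b" is not numeric.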

Key-flags "f" and "d" are effective for non-ASCII characters if the locale supports those characters.

Duplicate names in the list are allowed, but their semantics depend on each command's interpretation. In many cases, only the first one is effective.

Leaf-field-list

General form of the leaf-field-list is:

fieldname[:*]

The leaf-field-list is used in Dunbundle and Dtie to give the fields to be handled as "leaves", together with their repeatability. In the case of Dtie, the leaf-field-list is at the same time a field-format-list. Actually, a leaf-field-list is a subset of the field-format-list, of which only the field name and "*" are in use.

See the manual of D_lsa for details.

Field-rename-list

General form is

old-field-name:new-field-name

A field-rename-list is used in Drename to specify how fields are renamed. Note that the "\" escape mechanism also works for the new-field-name. For example, to change the field name "a-b" to "a b", use:

"a-b:a\ b"

The new-field-name is not checked for COLON. You can write

"a:b:c"

as a field-rename-list, but the result is to change the field "a" to a field "b" whose values are always prefixed by "c:". This type of usage might be rejected in future versions, and it is recommended to avoid it.

Exclusive field-list

A field-list that begins with CIRCUMFLEX ACCENT (^) means exclusion. (The notation is borrowed from [^..] in regular expressions.) An exclusive field-list can be used wherever a D-command requests a field-list argument, unless it contradicts the command's semantics. In the extreme case, the single character "^" is an exclusive field-list that means "all fields".

Note that additional information in an exclusive field-list has no meaning and is simply ignored. You may nevertheless use an exclusive field-list as a field-format-list or key-field-list. When an exclusive field-list is used as one of these special field-lists, the default values for the format or key-flags are applied.

When processing an exclusive field-list, a D-command first establishes a corresponding inclusive field-list with no entries in it. During processing, each time the command encounters a new field that is neither in the exclusive field-list nor in the corresponding inclusive field-list, it adds that field name to the inclusive field-list. This inclusive field-list is what the command actually uses. Therefore, the field order depends on the actual input file.

This is important when the field order has special meaning as in Dorder, or in a key-field-list. For example, in the next case,

Dsort ^a file1 > tmp1
Dsort ^a file2 > tmp2
Djoin ^a tmp1 tmp2

the Djoin command may fail, because the corresponding inclusive lists of the first Dsort and the second Dsort may be different.
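
The process described above can be sketched in a few lines of Python. This is an illustration only: inclusive_list is a hypothetical name, and files are represented as lists of records, each a list of (field name, value) pairs.

```python
def inclusive_list(records, excluded):
    """Build the inclusive field-list that corresponds to an exclusive one."""
    excluded = set(excluded)
    inclusive = []
    for record in records:
        for name, _value in record:
            # Add each new field name the first time it is encountered.
            if name not in excluded and name not in inclusive:
                inclusive.append(name)
    return inclusive


file1 = [[("a", "1"), ("b", "x"), ("c", "y")]]
file2 = [[("a", "1"), ("c", "y"), ("b", "x")]]

print(inclusive_list(file1, ["a"]))  # ['b', 'c']
print(inclusive_list(file2, ["a"]))  # ['c', 'b']
```

The same exclusive field-list "^a" expands to "b,c" for the first file and "c,b" for the second, which is why the two sorts in the example above may order their records by different keys.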

D-command internal variable

Some internal variables of D-commands are open to users, to control the command's behavior. The command argument -D is used to give such a variable a value. The general form is:

-D variable-name=value[,.. ]

Variable names and their valid values are command dependent, and are described in each command's manual page. The D-command internal variables for the UTF I/O feature, however, are general. For example,

-D datautf=8

works for most commands to process UTF-8 encoded files.

OUTPUT

D-commands write their result to the standard output, and messages (if any) to the standard error. As a rule, the result is also a D-file. DtoLine, DtoTex, and Dpr are the exceptions, which output non-D-files. Dsort may use temporary files for sort work, and Dpr for Windows may use temporary files for printer output. These are the only cases in which a D-command writes to anything other than the standard output or the standard error.

Pre-defined field names

Some D-commands use pre-defined field names in their output. For example, Dcat and some other D-commands may add "filename" fields.

A pre-defined field name is never changed, even if the input file already has a field with the same name. In such a case, the output records will have a repeating field (for example, a repeating "filename" field). Whether the newly added field comes before or after the existing one depends on each D-command's specification. In the case of Dcat, the "filename" field is always added at the top of the record.

To avoid undesirable repeating, use Drename to change existing field names. There is no other way to change these pre-defined field names.

INTERNATIONALIZATION AND CHARACTER CODE

D-commands are internationalized in character code, numeric representation and date-time representation. Messages are not localized and always in English.

D-files are basically locale dependent. When you process a D-file created under a different locale, the D-command may fail unless the character code and numeric representation of the creation locale are the same as those of the processing locale.

For example, if your current locale is "ja_JP.eucJP" and you read a D-file created under the "ja_JP.UTF-8" locale, the D-command may report code error warnings, or may finish normally with a wrong result.

A difference in numeric representation may also cause errors. For example, suppose you use the Dmeans command under a French locale and get a result like:

avg.size:2,5

When you read this result under an English locale, the value is evaluated as zero, not 2.5. (See Numeric Values.)

As a rule, the input data, output data, messages and command arguments of a D-command are coded in the character code of the processing locale. There are two exceptions to this rule.

One is DfromHtml and DfromXml. An HTML or XML input file (usually) has an encoding designation in it, and these commands can use it. The output file, command arguments and messages are not affected by the input file encoding. They are always in the current locale encoding.

Another exception is UTF I/O feature described in the following section.

When the input character code is different from the locale character code to output, there may be characters not in the output locale code. A character not in the output character set is converted to a QUESTION MARK (?) character.

UTF I/O Feature

D-commands can process UTF-8/16/32 encoded input/output files, regardless of the locale character code. The UTF I/O feature is invoked by an environment variable, or by a command argument giving a D-command internal variable. The UTF I/O feature affects only the character encoding of input/output files. Command arguments and messages are in the locale character code. The three variables in the next table invoke the UTF I/O feature.

Environment variable  D-command internal variable  Valid values         Description
Ddatautf              datautf                      8, 16, 32            input/output data encoding
Didatautf             idatautf                     8, 16, 32, -16, -32  input data encoding
Dodatautf             odatautf                     8, 16, 32            output data encoding

When both an environment variable and a D-command internal variable are given, the D-command internal variable value overrides the other. When both datautf and i/odatautf are given, the i/odatautf value is used.

The meaning of each value is shown in the following table:

[i/o]datautf value  Meaning
8                   encoding is UTF-8
16                  encoding is UTF-16 in native endian
32                  encoding is UTF-32 in native endian
-16                 encoding is UTF-16 in opposite endian
-32                 encoding is UTF-32 in opposite endian

For example,

Ded -D datautf=8 -f program.dl input.d

tells Ded that the input file input.d, the Dl program file program.dl and the output file are all encoded in UTF-8.
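
The precedence rules and the value table can be sketched in Python. This is an illustration only, not the D implementation. The text does not say which wins when, say, datautf is given with -D and Didatautf is given in the environment; the sketch assumes each variable is first resolved on its own (internal over environment) and only then does i/odatautf override datautf. The names resolve_utf and codec_name are hypothetical, and the mapping to Python codec names is an assumption for illustration.

```python
import sys


def resolve_utf(env, internal):
    """Return (input_value, output_value); None means "use the locale"."""
    def pick(name):
        # A D-command internal variable overrides the environment variable.
        if name in internal:
            return int(internal[name])
        if "D" + name in env:
            return int(env["D" + name])
        return None

    both, i, o = pick("datautf"), pick("idatautf"), pick("odatautf")
    # Assumption: the more specific i/odatautf wins over datautf.
    return (both if i is None else i, both if o is None else o)


def codec_name(value):
    """Map a [i/o]datautf value to a Python codec name."""
    if value not in (8, 16, 32, -16, -32):
        raise ValueError("invalid datautf value: %r" % value)
    if value == 8:
        return "utf-8"
    native_little = sys.byteorder == "little"
    # A negative value selects the endianness opposite to the native one.
    little = native_little if value > 0 else not native_little
    return "utf-%d-%s" % (abs(value), "le" if little else "be")


print(resolve_utf({"Ddatautf": "8"}, {"idatautf": "16"}))
print(codec_name(8), codec_name(16), codec_name(-16))
```

With Ddatautf=8 in the environment and -D idatautf=16 on the command line, this sketch reads UTF-16 in native endian and writes UTF-8.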

Note that the UTF I/O feature is available only when the system internal code is ISO 10646 compliant. That means Windows and all recent distributions of Linux can use it. Under Solaris, however, you cannot use this feature with an EUC code locale. (With a UTF-8 locale UTF I/O works, but in that case you can use UTF-8 anyway.)

The UTF I/O feature is designed for two purposes. One is to use UTF-8 on Windows, where UTF-8 encoding is not supported by locales. The other is to use the UTF-32 or UTF-16 internal code for I/O, so as to reduce inter-command I/O overhead.

RETURN VALUES

The return value of a D-command is uniformly 0 to 3, with the following meanings:

0  Normal.
1  Warning; the program reaches the end with some error messages on standard error.
2  Error; errors are found during parameter analysis, and no output is made.
3  Emergency; the program is stopped during the process.
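
A script that wraps D-commands might name these values rather than test bare numbers. The Python below is a minimal sketch for that purpose; the names DStatus and ran_to_end are hypothetical and not part of D.

```python
from enum import IntEnum


class DStatus(IntEnum):
    """Return values of a D-command, as listed above."""
    NORMAL = 0
    WARNING = 1
    ERROR = 2
    EMERGENCY = 3


def ran_to_end(code):
    """True when the command reached the end (with or without warnings)."""
    return DStatus(code) in (DStatus.NORMAL, DStatus.WARNING)


print(DStatus(1).name, ran_to_end(1))
print(DStatus(2).name, ran_to_end(2))
```

A return value outside 0 to 3 raises ValueError here, which makes an unexpected exit status visible instead of silently treating it as success or failure.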

SEE ALSO

D_fmt, D_lsa, D_msg

AUTHOR

MIYAZAWA Akira


miyazawa@nii.ac.jp
2003