D is a series of commands that perform file operations on D-files. The names of D-commands always start with the capital letter D.
Each D-command provides a basic file operation, such as selection or join (matching), on files in D-file format. The result of an operation is also in D-file format and is written to the standard output. D-commands are invoked from a shell and combined with each other or with other commands to perform complex data processing.
D is not an "all in one" package. It has only limited functions for computation and string manipulation. The basic philosophy of D is "open system": it is meant to be used together with sed, awk, perl, or any other program.
D-commands can handle any character that the operating system supports under the current locale. Note that reading D-files produced under a different locale may cause encoding errors.
Messages are not internationalized; only English is available in the current implementation.
Short names are not available in the Windows version.
Name | Short name | Description |
DfromLine | DfL | Line format to D-file |
DtoLine | DtL | D-file to Line format |
Dpr | | Display/print D-files |
DtoTex | | D-file to TeX |
Name | Short name | Description |
Dselect | Dsel | Selection by a conditional expression |
Dgrep | Dgp | Selection by a regular expression matching |
Dhead | Dhd | Selection of the top n records |
Dtail | Dta | Selection of the last n records |
Dextract | Dex | Selection by the record number list |
Name | Short name | Description |
Dcat | | Concatenation of D-files |
Dpaste | Dpt | Horizontal merge of D-files |
Djoin | Djn | Join/matching of D-files by key values |
Dproj | Dpj | Projection |
Dsort | Dst | Sorting by key values |
Dorder | Dod | Changing field sequence of D-records |
Drename | Drn | Renaming the field names |
Dupdate | Dupd | Update a D-file |
Name | Short name | Description |
Dbundle | Dbn | Vertical merge of D-records by key values |
Dunbundle | Dubn | Slicing D-records with a repeating group |
Dpack | Dpk | Concatenation of repeating fields into a field |
Dunpack | Dupk | Decomposition of a field into repeating fields |
Dtie | Dti | Concatenation of group fields |
Duntie | Duti | Decomposition of a field into some fields |
Name | Short name | Description |
Drc | | Record count |
Drcp | | Drc and display (not available in Windows) |
Dfd | | Field description |
Dfdp | | Dfd and display (not available in Windows) |
Dfreq | Dfq | Frequency count of field values |
Dmeans | Dmn | Calculation of max., min., average and standard deviation |
Name | Short name | Description |
Ded | | D-file editor with Dl interpreter |
The data model of a D-file is simple:
<file> | ::= | an ordered set of <record>s |
<record> | ::= | an ordered set of <field>s |
<field> | ::= | a pair of a <field name> and a <value> |
D-file is a text file following the D-file conventions, which are:
By the conventions above, <value> can include any character but the newline character (nl). However, for the convenience of the implementation, we add:
By this rule, the <value> must not contain nl, nul, or soh, and the <field name> must not contain COLON in addition to these. Any other character supported by the operating system under the current locale can be used in both <field name> and <value>.
Note that, according to rule 4 above, the last record in the file is not necessarily followed by a null line. Also, as a consequence of rule 3, there is no null record (i.e. a record whose field count is zero), because two or more consecutive null lines are regarded as one delimiter.
There is no limit on the length of a <field name> or a <value>, nor on the number of <field>s in a <record> or of <record>s in a <file>, other than the limits of the system.
Unlike mail header (RFC 822) fields, the D-file convention does not treat space characters as special. Spaces after the field delimiter COLON are part of the value. There is no way to split a field across multiple lines.
Both the <value> and the <field name> can be a null string (i.e. of length zero). However, using a null field name is discouraged: some D-commands interpret a null field name as a special function, such as skipping the field.
There is nothing special about using a null string as a <value>. It must, however, be distinguished from the NULL value (no <field> with the given <field name>). In the following example, the first record has a NULL value for the field "b" (no field "b"), while the second has a field "b" with a null string value.
Example:
a:1

a:2
b:
You may use either way to represent a missing value, but be careful about the difference between the two representations, because they are never the same in the D-file context.
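The difference between a NULL value and a null string can be made concrete with a small reader. The following Python sketch is not part of D. It assumes the conventions described above (one field per line, the first COLON separating the field name from the value, null lines separating records), and the helper names `parse_dfile` and `values` are hypothetical. As a simplification, lines without a COLON are skipped here.

```python
def parse_dfile(text):
    """Parse D-file text into a list of records.

    Each record is a list of (field_name, value) pairs in input order.
    Assumptions (illustration only): one field per line, the first COLON
    separates name from value, and one or more null lines separate records.
    """
    records, current = [], []
    for line in text.split("\n"):
        if line == "":
            # A run of null lines is a single delimiter, so an empty
            # record is never produced.
            if current:
                records.append(current)
                current = []
            continue
        name, sep, value = line.partition(":")
        if sep:  # lines with no COLON are skipped in this sketch
            current.append((name, value))
    if current:  # the last record may lack a trailing null line
        records.append(current)
    return records


def values(record, name):
    """Return all values of a field; an empty list stands for NULL."""
    return [v for n, v in record if n == name]


recs = parse_dfile("a:1\n\na:2\nb:\n")
print(values(recs[0], "b"))  # [] : NULL value, there is no field "b"
print(values(recs[1], "b"))  # [''] : field "b" exists with a null string
```

Representing a record as an ordered list of pairs, rather than a dictionary, keeps repeated field names and the input field order intact.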
A line that does not contain any COLON is an error field. The handling of error fields is implementation-dependent. In the current implementation, they are generally left in the output record as they are. When a field name selection is performed, they are never selected.
Null characters and bytes that do not conform to the character encoding of the current locale are also error data. The current implementation silently suppresses these error characters, and they do not appear in the output records.
In a D-file, numeric values are represented by character strings. When a D-command produces a numeric field, it converts the value with the printf format %d (integer) or %g (non-integer). When a D-command evaluates a value as numeric, the string must be a decimal representation acceptable to strtod, or a 0x-prefixed hexadecimal representation acceptable to strtol, optionally preceded and followed by space characters. Any string that does not follow these formats is evaluated as zero. For example,
a:1/2
b:123c
c:1 2 3
are all evaluated as zero. Note that D-commands do not report these zero evaluations.
(printf, strtod and strtol are standard C library functions.)
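The evaluation rule can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not the actual implementation: the function name `numeric_value` is hypothetical, the decimal pattern covers plain and exponent notation only (strtod itself accepts further forms, which this sketch would evaluate as zero), and all whitespace is stripped from both ends.

```python
import re

# Decimal forms: optional sign, digits with optional fraction, optional exponent.
_DECIMAL = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")
# Hexadecimal integers with a 0x prefix.
_HEX = re.compile(r"[+-]?0[xX][0-9a-fA-F]+")


def numeric_value(s):
    """Evaluate a field value as a number; anything unrecognized is 0."""
    s = s.strip()  # leading and trailing spaces are allowed
    if _HEX.fullmatch(s):
        return float(int(s, 16))
    if _DECIMAL.fullmatch(s):
        return float(s)
    return 0.0  # silently zero, with no error report


samples = ["1/2", "123c", "1 2 3", " 12 ", "0x1F", "-1.5e2"]
print([numeric_value(s) for s in samples])
# [0.0, 0.0, 0.0, 12.0, 31.0, -150.0]
```

Because an invalid value and a genuine zero become indistinguishable after evaluation, it can be worth validating numeric fields before passing them to a numeric operation.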
A D-record may have two or more fields with the same field name. For example:
a:1
b:2
a:3
b:4
D-commands, in principle, handle a repeating field as a one-dimensional dynamic array of elements; i.e., like an array in perl, it is a sequence of zero or more simple values handled under a given name. In the above example, field "a" has the value { 1 3 } and field "b" has the value { 2 4 }, while field "c", which does not actually exist, has the value { }, or the NULL value.
Elements of an array are always simple values. Therefore, two-dimensional arrays and matrices are not handled by D-commands. Also note that when a repeating field is evaluated as numeric, all of its elements are evaluated as numeric. The array is homogeneous and cannot mix numeric and string values.
D-command parameters do not use a suffix for a field name. If you really want a field to be handled differently according to its relative position, you should give it a separate field name. Dl is an exception, which has a suffix operation.
The order of fields with different field names is not significant. Under this principle, the next two records have the same values.
a:1
b:2
a:3
b:4

a:1
a:3
b:2
b:4
But the order of fields with the same field name is significant. The next record is different from the examples above because the "a" field value { 3 1 } is not the same as { 1 3 }.
a:3
b:2
a:1
b:4
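The array view of repeating fields, and the resulting notion of record equality, can be sketched in Python as follows. The sketch assumes a record is held as a list of (name, value) pairs in input order; the helper name `field_arrays` is hypothetical and not part of D.

```python
def field_arrays(record):
    """Group a record's (name, value) pairs into per-name value lists."""
    arrays = {}
    for name, value in record:
        arrays.setdefault(name, []).append(value)
    return arrays


r1 = [("a", "1"), ("b", "2"), ("a", "3"), ("b", "4")]
r2 = [("a", "1"), ("a", "3"), ("b", "2"), ("b", "4")]
r3 = [("a", "3"), ("b", "2"), ("a", "1"), ("b", "4")]

# Dictionary equality ignores the order of different names but
# list equality respects the order of values under one name.
print(field_arrays(r1) == field_arrays(r2))  # True
print(field_arrays(r1) == field_arrays(r3))  # False
print(field_arrays(r1).get("c", []))         # [] : NULL value for "c"
```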
D-commands that handle repeating groups (Dunbundle and Dtie) may use the field order to identify leaves. In this case only, the order of fields with different names is significant. (See the manual of D_lsa.)
Unless explicitly specified otherwise, a D-command preserves the field order of the input record. For example, Dorder keeps the relative order of the fields whose names are not in the field-list.
Values can be compared as strings or as numeric values. String comparison is the default: unless a key-flag requests numeric comparison (see the section Key-field-list below), two values are compared as strings. (In )
String comparison follows the internal process code values of the characters. On Windows, the internal code is Unicode (ISO 10646). On UNIX, the internal process code depends on the locale. See the operating system's documentation on character set handling.
When numeric comparison is applied, values are converted to numeric values. Note that a field that does not represent a valid numeric form is converted to 0. (See the Numeric Values section above.)
When a field repeats, the first elements are compared first; if they are equal, the second elements are compared, and so on, until unequal elements are found or the end of the elements is reached. When one side is used up, that side is the smaller. In other words, a non-existing field is treated as the smallest value (smaller than the null string, and smaller than -Infinity).
Two or more fields can be compared similarly. When fields "a,b" are compared, field "a" is compared first, and if the values are equal then field "b" is compared. In this case, the order of field "a" and field "b" within a record does not affect the comparison.
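The comparison rules above can be sketched in Python. This is an illustration under assumptions, not the actual implementation: the names `compare_field`, `compare_records` and `to_number` are hypothetical, string comparison uses Python's code point order as a stand-in for the internal process code, and `to_number` is a simplified numeric conversion (Python's `float` accepts a somewhat different set of strings than the one D describes).

```python
def to_number(s):
    """Simplified numeric evaluation: invalid strings become 0."""
    try:
        return float(s)
    except ValueError:
        return 0.0


def compare_field(x, y, numeric=False):
    """Compare two value lists element by element; return -1, 0 or 1."""
    conv = to_number if numeric else (lambda s: s)
    for a, b in zip(x, y):
        a, b = conv(a), conv(b)
        if a != b:
            return -1 if a < b else 1
    # All shared elements are equal: the side used up first is smaller.
    return (len(x) > len(y)) - (len(x) < len(y))


def compare_records(r1, r2, keys):
    """Compare records key by key; field order across names is ignored."""
    for name in keys:
        v1 = [v for n, v in r1 if n == name]
        v2 = [v for n, v in r2 if n == name]
        c = compare_field(v1, v2)
        if c:
            return c
    return 0


print(compare_field([], [""]))                    # -1 : NULL < null string
print(compare_field(["1", "3"], ["1"]))           # 1
print(compare_field(["10"], ["9"]))               # -1 : string comparison
print(compare_field(["10"], ["9"], numeric=True)) # 1
print(compare_records([("a", "1"), ("b", "2")],
                      [("b", "3"), ("a", "1")], ["a", "b"]))  # -1
```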
D-command arguments follow the next rules:
The following options are used in many D-commands and have the same meaning in each.
-? | help |
-F | file by file process; "filename" field output. |
-g | group by. |
-k | default key options. (See Key-field-list subsection below). |
-t | default delimiter string for the format. (See the manual of D_fmt(1)). |
-z | default options for the format. (See the manual of D_fmt(1)). |
Some D-commands use regular expressions in their arguments, which follow the UNIX egrep specifications.
| | or; matches left side or right side |
* | zero or more repeat of preceding one |
+ | one or more repeat of the preceding one |
? | optional; zero or one occurrence of the preceding one |
( ) | grouping; make a regular expression one unit |
. | single character; matching any single character |
^ | top; matching the null string at the beginning of the string |
$ | end; matching the null string at the end of the string |
\ | escape; matching the following character |
[...] | range; matching any one of enclosed characters |
[^...] | excluding range; matching any one character not included in the enclosed characters but for the top "^" |
- | in [], shorthand for the full list of characters in internal code between left hand character and right hand character |
Note that in [ ], "\" does not work as an escape. To use "]", "^" or "-" in [ ], put them in an out-of-context position. To use "]", put it at the first position, or just after the "^" at the first position. Similarly, "^" not at the first position is just a normal character, and "-" at the first or the last position is also a normal character. Be careful when using "-" as a range, because the range depends on the internal code.
The regular expression handler used in the current implementation is modified from Henry Spencer's V8 regular expression implementation. It is based on characters, not bytes: matching is always performed on a character and never on a byte.
In many D-commands, field-lists are used as operands or in option arguments. A field-list is a COMMA or SPACE separated list of field names. In the shell, quoting is required for a SPACE separated field-list. For example, the next three field-lists
a,b,c
"a b c"
a\ b\ c
are all identical in the UNIX shell (though few would use the last form).
Semi-formal syntax of the field-list is as follows:
<field-list> | ::= | [^] [<fspec> [{<dlm><fspec>}..]] |
<fspec> | ::= | <field name>[:<additional-inf>] |
<dlm> | ::= | <sp>[<sp>..] |
| | [<sp>..],[<sp>..] | |
<sp> | ::= | SPACE | TAB | NEWLINE |
The character REVERSE SOLIDUS (\) is used to escape COMMA(,), CIRCUMFLEX ACCENT(^) or <sp> above. Field name including these characters can be written with an escape mechanism. For example in UNIX shell,
"\^a\,b\ \"
\\\^a\\,b\\\ \\
both represent the field name "^a,b \". Note that a REVERSE SOLIDUS (\) is removed only when it appears before a COMMA, a CIRCUMFLEX ACCENT, or a space; any other REVERSE SOLIDUS is left intact. For example
"\START\,\END"
is a field name "\START,\END".
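The splitting and escape rules can be illustrated with a short Python sketch. It is not the actual parser: the function name `parse_field_list` is hypothetical, it operates on the string as received by the command (after shell quoting has been removed), it leaves any COLON-separated additional information attached to each entry, and, as a simplification, it collapses consecutive delimiters instead of producing null field names.

```python
ESCAPABLE = {",", "^", " ", "\t", "\n"}
DELIMITERS = {",", " ", "\t", "\n"}


def parse_field_list(spec):
    """Return (exclusive, entries) for a field-list string."""
    exclusive = spec.startswith("^")  # an unescaped leading ^ means exclusion
    i = 1 if exclusive else 0
    entries, current, started = [], [], False
    while i < len(spec):
        ch = spec[i]
        if ch == "\\" and i + 1 < len(spec) and spec[i + 1] in ESCAPABLE:
            # The backslash is dropped and the next character is literal.
            current.append(spec[i + 1])
            started = True
            i += 2
            continue
        if ch in DELIMITERS:
            if started:
                entries.append("".join(current))
                current, started = [], False
            i += 1
            continue
        current.append(ch)  # any other backslash is kept as it is
        started = True
        i += 1
    if started:
        entries.append("".join(current))
    return exclusive, entries


print(parse_field_list("a,b,c"))            # (False, ['a', 'b', 'c'])
print(parse_field_list("a b c"))            # (False, ['a', 'b', 'c'])
print(parse_field_list("\\^a\\,b\\ \\"))    # (False, ['^a,b \\'])
print(parse_field_list("\\START\\,\\END"))  # (False, ['\\START,\\END'])
print(parse_field_list("^a"))               # (True, ['a'])
```

Python prints each backslash doubled inside the quoted results; the actual strings are `^a,b \` and `\START,\END`.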
A field-list may have additional information for each field name in it. The additional information is separated from the field name by a COLON, and the "\" escape mechanism also works in the additional information part. Additional information is used to specify key-flags, input-output format, a new field name, etc.; depending on the type of additional information, a field-list becomes a key-field-list, field-format-list, leaf-field-list, or field-rename-list. See the following sections for these special field-lists.
The general form of a field-format-list is a list of:
field-name[:format]
A field-format-list specifies how to convert a character string to/from D-fields. The two options -t and -z may be used with a field-format-list to give default values for the format.
See the manual of D_fmt for details.
The general form of a key-field-list is a list of:
field-name[:key-flags]
A key-field-list specifies the fields to be used as key values, together with their attributes (i.e. numeric or string, etc.). The option -k is used with a key-field-list to give the default value of the key-flags.
Key-flags here is a string of the following characters (the same as, and a subset of, those used in the UNIX sort command) that gives the key attributes of the field.
The -k default takes effect only for fields that have no key-flags in the list. Therefore, the following examples have the same meaning:
"a,b:nr"
-k nr "a: b"
-k n "a: b:nr"
But the next one is different:
-k n "a: b:r"
because field "b" is not numeric in this case.
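The way the -k default interacts with per-field key-flags can be sketched in Python. This is an interpretation of the examples above rather than the actual implementation: it assumes that a field name followed by a COLON and nothing else (as in "a:") carries explicit, empty key-flags, so the default is not applied to it. The function name `resolve_key_flags` is hypothetical, and escape handling is omitted.

```python
def resolve_key_flags(key_field_list, default=""):
    """Return [(name, flags)] with the -k default applied where needed."""
    resolved = []
    for fspec in key_field_list.replace(",", " ").split():
        name, sep, flags = fspec.partition(":")
        # The default applies only when the entry has no COLON part at all.
        resolved.append((name, flags if sep else default))
    return resolved


print(resolve_key_flags("a,b:nr"))        # [('a', ''), ('b', 'nr')]
print(resolve_key_flags("a: b", "nr"))    # [('a', ''), ('b', 'nr')]
print(resolve_key_flags("a: b:nr", "n"))  # [('a', ''), ('b', 'nr')]
print(resolve_key_flags("a: b:r", "n"))   # [('a', ''), ('b', 'r')]
```

The first three calls resolve to the same flags, while in the last one field "b" ends up with "r" only, which is why it is no longer numeric.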
Key-flags "f" and "d" are effective for non-ASCII characters if the locale supports those characters.
Duplicate names in the list are allowed, but their semantics depend on each command's interpretation. In many cases, only the first one is effective.
The general form of a leaf-field-list is:
fieldname[:*]
The leaf-field-list is used in Dunbundle and Dtie to give the fields to be handled as "leaves", together with their repeatability. In the case of Dtie, the leaf-field-list is at the same time a field-format-list. In fact, a leaf-field-list is a subset of the field-format-list in which only the field name and "*" are used.
See the manual of D_lsa for details.
The general form of a field-rename-list is:
old-field-name:new-field-name
A field-rename-list is used in Drename to specify how fields are renamed. Note that the "\" escape mechanism also works for the new-field-name. For example, to change the field name "a-b" to "a b", use:
"a-b:a\ b"
There is no COLON checking for the new-field-name. You can write
"a:b:c"
as a field-rename-list, but the result is that field "a" is changed to field "b" with values always preceded by "c:". This type of usage might be checked in future versions, and it is recommended to avoid it.
A field-list that begins with a CIRCUMFLEX ACCENT (^) means exclusion (suggested by [^..] in regular expressions). An exclusive field-list can be used wherever a D-command requests a field-list argument, unless it contradicts the command semantics. In the extreme case, the single character "^" is an exclusive field-list that means "all fields".
Note that additional information in an exclusive field-list has no meaning and is simply ignored. You may still use an exclusive field-list as a field-format-list or key-field-list; in that case, the default values for the format or key-flags are applied.
When processing an exclusive field-list, a D-command first establishes a corresponding inclusive field-list with no entries. During processing, each time the command encounters a new field that is neither in the exclusive field-list nor in the corresponding inclusive field-list, it adds that field name to the inclusive field-list. This corresponding inclusive field-list is the one actually used for processing. Therefore, the field order depends on the actual input file.
This is important when the field order has special meaning as in Dorder, or in a key-field-list. For example, in the next case,
Dsort ^a file1 > tmp1
Dsort ^a file2 > tmp2
Djoin ^a tmp1 tmp2
the Djoin command may fail, because the corresponding inclusive lists of the first Dsort and the second Dsort may be different.
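The dependence on input order can be illustrated with a short Python sketch of the process described above. It assumes records are lists of (name, value) pairs in input order; the function name `inclusive_list` and the sample data are hypothetical.

```python
def inclusive_list(records, excluded):
    """Build the inclusive field-list implied by an exclusive one."""
    included = []
    for record in records:
        for name, _value in record:
            # Add each new field name in the order it is first encountered.
            if name not in excluded and name not in included:
                included.append(name)
    return included


# Two hypothetical inputs with the same fields in a different order.
file1 = [[("a", "1"), ("b", "x"), ("c", "y")]]
file2 = [[("c", "y"), ("a", "1"), ("b", "x")]]

print(inclusive_list(file1, ["a"]))  # ['b', 'c']
print(inclusive_list(file2, ["a"]))  # ['c', 'b']
```

With these inputs, "^a" stands for the key order b,c in one file and c,b in the other, so the two sorted files are not ordered by the same keys. Writing the key-field-list out explicitly avoids the problem.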
D-commands write the resulting D-file to the standard output, and messages (if any) to the standard error. DtoLine, DtoTex, and Dpr are exceptions in that their output is not a D-file. Dsort may use temporary files for sort work, and Dpr for Windows may use temporary files for printer output. These are the only cases in which a D-command writes anywhere other than the standard output or standard error.
Some D-commands use pre-defined field names in their output. For example, Dcat and some other D-commands may add "filename" fields.
A pre-defined field name is never changed, even if the input file already has a field with the same name. In such a case, the output records will have a repeating field, for example a repeating "filename" field. Whether the newly added field comes before or after the existing one depends on the specification of each D-command. In the case of Dcat, the "filename" field is always added at the top of the record.
To avoid undesirable repetition, use Drename to change the existing field names. There is no other way to change these pre-defined field names.
D-command return value is uniformly 0 to 3; which means
MIYAZAWA Akira