D_fmt is a form for specifying how a character string is converted to and from D-fields. It is written as additional information attached to a field-list (see the manual of Dintro). A specification for converting a character string to D-fields is called an "input" format; it is typically used in DfromLine, and also in Duntie and Dunpack. A specification for converting D-fields to a character string is called an "output" format; it is typically used in DtoLine, and also in Dtie and Dpack. The output format is partially used in Dpr as well. Input and output formats share the same syntax but differ in semantics.
An input or output format action maintains a "current position" on the character string during the conversion. It is expressed as the number of characters before the position or, more simply, as the character position counted from 0. D_fmt knows nothing about the encoding and has no concept of a byte or octet.
field-format-list ::= field-list of field-name[:format]
format            ::= [start] [end] [options] [repeat] [:cfmt]
start             ::= absolute | relative
absolute          ::= DIGITS
relative          ::= +DIGITS
end               ::= end-position | length | delimiter | pattern
end-position      ::= -DIGITS
length            ::= (DIGITS)
delimiter         ::= /STRING/
pattern           ::= @STRING@
options           ::= { alignment | quoting | with-field-name }..
alignment         ::= l | r | n
quoting           ::= q | x | Q | b | n
with-field-name   ::= f | n
repeat            ::= *
cfmt              ::= scanf(3) format with one %
DIGITS here is a string of "0"-"9". STRING here is an arbitrary character string. (See the Special Characters section below).
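As a rough illustration of the syntax above, the following Python sketch splits one field entry into its syntactic parts with a regular expression. The names FORMAT_RE and parse_entry and the returned dictionary are invented for this illustration; this is not how the D-commands are implemented, and the shell and field-list escaping described in the Special Characters section is ignored.

```python
import re

# One optional group per syntax element, in the order given by the grammar.
FORMAT_RE = re.compile(
    r"(?P<start>\+?\d+)?"          # absolute (DIGITS) or relative (+DIGITS)
    r"(?P<end>-\d+"                # end-position
    r"|\(\d+\)"                    # length
    r"|/(?:\\.|[^/\\])*/"          # delimiter
    r"|@(?:\\.|[^@\\])*@)?"        # pattern
    r"(?P<options>[lrnqxQbf]*)"    # alignment, quoting, with-field-name
    r"(?P<repeat>\*)?"             # repeat
    r"(?::(?P<cfmt>.*))?"          # cfmt
)


def parse_entry(entry):
    """Split 'field-name[:format]' into its parts (hypothetical helper)."""
    name, _, fmt = entry.partition(":")
    m = FORMAT_RE.fullmatch(fmt)
    if m is None:
        raise ValueError("not a valid format: %r" % fmt)
    parts = m.groupdict()
    parts["name"] = name
    return parts


print(parse_entry("c:(1)*"))
print(parse_entry("v::%x"))
```

Elements that are absent come back as None (or as an empty string for options), which corresponds to the defaults described later in this manual.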
Two command options -t and -z are valid wherever a D-command takes field-format-list arguments. They are used to give the default values of end and options.
A D-command may take two field-format-list arguments. In this case, the -t and -z command options give the default values for both field-format-lists.
Input format action starts from the top field entry of the field-format-list, with the current position (cp) value zero.
The input scanner moves the cp to the start position given by the format and reads the string up to the end. When options are given with the field entry, they control the read-in action. After reading, the read-in string may be converted with the cfmt (if any) to produce one D-field. After producing a D-field, the cp is moved to a new position determined by the end specification.
If repeat is specified, the same field entry is processed again with the new cp, until the cp reaches the end of the input string or the pattern match fails.
When the processing of a field entry ends, the input scanner moves to the next field entry of the field-format-list (even if the cp is at the end of the input string). If the field name is a null string, the corresponding field is skipped (i.e., it is read in but no D-field is produced). This process is repeated until the end of the field-format-list is reached.
Default start is cp.
The default end is the delimiter given by the -t command option, or /TAB/ (the control character tab) if -t is not provided.
Alignment options are used for fixed length (i.e., end-position or length) fields to remove leading or trailing space characters. They have no effect on varying length (i.e., delimiter or pattern) fields.
When both l and r are specified, both leading and trailing space characters are removed.
Quoting options are used for varying length (i.e., delimiter or pattern) fields to escape delimiter characters in the field. They have no effect on fixed length fields.
When q, Q or b are used together, the first one encountered suppresses the other quoting mechanisms until it closes. That is, within a QUOTATION MARK (") quotation, APOSTROPHE (') and REVERSE SOLIDUS (\) are normal characters, and within an APOSTROPHE (') quotation, QUOTATION MARK (") and REVERSE SOLIDUS (\) are normal characters. A REVERSE SOLIDUS (\) outside a quotation can cancel an opening quote. Option x does not affect the above rule. It is subordinate to option q and effective only within QUOTATION MARKs.
With-field-name option is a bit tricky.
When the f option is used, the field name in the field-format-list has no meaning, because the field name is taken from the input string. The field name in the field-format-list is ignored, so any name may be written there, but it is recommended to use "." as the field name of an entry that has the f option.
The current implementation does not check whether input read with the f option really contains a COLON. Users are responsible for supplying proper input so that the result is a valid D-file.
Option n cancels all other options regardless of their category. It is used with the -z option to nullify its effect for one field-list entry. For example, assume a line has four fields "a", "b", "c", "d" separated by TAB, and you want to read fields "b", "c", "d" with the q option and field "a" with no options:
-z q "a:n,b,c,d"
This is the same as:
"a,b:q,c:q,d:q"
For absolute start position and end-position field entry, repeat is invalid.
When cfmt is specified, the string read in does not directly yield the D-field. It is passed to the sscanf function (C language), with the given cfmt as the format and the read-in string as the input string. The cfmt should contain exactly one % element (not counting %%).
If it is %d, %o, %x, %i, %n, %u, %f, %g, %c or %wc, the input string is received by a numeric variable and then converted to a string, which becomes the D-field value. If it is %s, %S or %ws, the input string is received by a string and it becomes the D-field value.
When using %c, %C or %wc, the first byte of the read in string (in file code) is read by an integer variable, and the value is converted to a string which becomes the D-field value. This is the only case in which D-command handles encoding.
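The effect of a numeric cfmt on input can be pictured with the small Python sketch below. It only imitates sscanf(3) for the bare formats %d, %o, %x and %s; the function name convert_input, the restriction to these four formats, and the choice to return None when nothing can be converted are assumptions made for the illustration, not documented behavior of the D-commands.

```python
import re

# Prefix patterns loosely modelled on what sscanf(3) accepts for each conversion.
_NUMERIC = {
    "%d": (r"[+-]?\d+", 10),
    "%o": (r"[+-]?[0-7]+", 8),
    "%x": (r"[+-]?(?:0[xX])?[0-9a-fA-F]+", 16),
}


def convert_input(text, cfmt):
    """Return the D-field value for `text` read with `cfmt`, or None on failure."""
    text = text.lstrip()                 # these conversions skip leading white space
    if cfmt == "%s":
        words = text.split()
        return words[0] if words else None
    if cfmt not in _NUMERIC:
        raise ValueError("cfmt not covered by this sketch: %r" % cfmt)
    pattern, base = _NUMERIC[cfmt]
    m = re.match(pattern, text)
    if m is None:
        return None
    return str(int(m.group(0), base))    # the number is converted back to a string


print(convert_input("ff", "%x"))
```

With the entry v::%x, an input string such as ff would in this model become the decimal string 255.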
When a field entry has no format information, the default field entry is applied. (This is in fact the most common case.) It is not a real field entry but one built from the default values described above: it takes cp as its start, the -t command option value (or TAB if there is no -t) as its delimiter, the -z command option value (or none if there is no -z) as its options, and nothing for its repeat and cfmt.
When the cp is at the end of the input string, and the start position is not reset by absolute positioning, usually no field is read in. But there is an exception. For example, when the delimiter is a comma and the data has a leading and a trailing comma:
DfromLine -t "," "a,b,c"
,word,
the natural result will be:
a:
b:word
c:
However, consider the case where the delimiter is one or more spaces and the data has a leading and a trailing space instead of commas:
-t " +" "a,b,c"
word
In this case
a:word
seems to be the acceptable result.
To generalize this situation, we introduce a concept of "hard" delimiter and "soft" delimiter.
For a given delimiter pattern, if a string matches the pattern but the same string repeated twice does not, the string is defined as "hard" for that pattern. When the doubled string also matches the pattern, the string is defined as "soft".
For example, "," matches the delimiter pattern /,/, but ",," does not. Thus "," is a "hard" delimiter for /,/. By contrast, one SPACE and two SPACEs both match the pattern / +/, so a SPACE is "soft" for / +/.
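The definition can be restated as a short Python sketch. Python's re module stands in here for whatever regular expression engine the D-commands use, which is an assumption, and the function name classify_delimiter is invented for the illustration.

```python
import re


def classify_delimiter(pattern, text):
    """Return 'hard', 'soft', or None if `text` is not a delimiter for `pattern`."""
    regex = re.compile(pattern)
    if regex.fullmatch(text) is None:
        return None
    # Soft: the doubled string still matches.  Hard: it no longer does.
    return "soft" if regex.fullmatch(text + text) else "hard"


print(classify_delimiter(",", ","))      # hard
print(classify_delimiter(" +", " "))     # soft
```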
When the cp is at the end of the input string and the preceding delimiter is "hard", the corresponding field is read in as a null string. In addition, if there is a "soft" delimiter at the cp, it is skipped before reading. (If there is a "hard" delimiter at the cp, a null string is read in.) This applies, of course, only to delimiter field entries and not to fixed length or pattern field entries.
For example:
DfromLine -t " *, *| +" a,b,c,d,e
A B , ,D ,
result is:
a:A
b:B
c:
d:D
e:
Note that " " at the top of the input line is "soft" delimiter, while " , " after "B" and " ," at the end are "hard" delimiters for the pattern " *, *| +".
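The rules above can be traced with the following Python sketch of a delimiter-only input scanner. It is a model written for this explanation, not the DfromLine implementation: the names from_line and is_hard are invented, Python's re module is assumed to behave like the pattern matcher of the D-commands for these patterns, the pattern is assumed never to match an empty string, and start, options, repeat and cfmt are all left out. The input line is written with the leading space that the note above refers to.

```python
import re


def is_hard(regex, text):
    # A delimiter is "hard" when its doubled form no longer matches the pattern.
    return regex.fullmatch(text + text) is None


def from_line(line, names, delimiter):
    """Read delimiter-separated fields from `line` (illustrative model only)."""
    regex = re.compile(delimiter)
    fields = []
    cp = 0
    prev_hard = False                  # nothing read yet: treated as "soft"
    for name in names:
        here = regex.match(line, cp)
        if here and here.end() > cp and not is_hard(regex, here.group(0)):
            cp = here.end()            # a soft delimiter at the cp is skipped
        if cp >= len(line):
            # At the end of input, a field is produced only after a hard delimiter.
            if prev_hard:
                fields.append((name, ""))
                prev_hard = False
            continue
        found = regex.search(line, cp)
        if found and found.end() > found.start():
            fields.append((name, line[cp:found.start()]))
            cp = found.end()
            prev_hard = is_hard(regex, found.group(0))
        else:
            fields.append((name, line[cp:]))   # no delimiter left: read the rest
            cp = len(line)
            prev_hard = False
    return fields


for name, value in from_line(" A B , ,D ,", ["a", "b", "c", "d", "e"], " *, *| +"):
    print("%s:%s" % (name, value))
```

Under these assumptions the sketch prints the five fields listed above, and it also gives a:, b:word, c: for ",word," with the pattern "," and only a:word for " word " with the pattern " +".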
When the input string is a null string, the cp is at the end of the input string from the start, and it cannot be determined whether the preceding delimiter is "hard" or "soft". In this case, if the first field's delimiter is a null string, which means reading the whole string, "hard" is assumed; otherwise "soft" is assumed. In other words, a null input string usually yields nothing, but yields a field with a null string value when the field is to read the whole string in by -t "" or ://.
This definition of "hard" and "soft" may not be precise enough to satisfy a mathematician's standard of rigor. In fact, some oddities are observed with -t ",,?,?", where "," is "soft" but ",," is "hard". An improved definition of hard and soft may exist. However, the current definition works well for most usual cases, and considering the cost of detection, it is not wise to employ a more complicated one. (A simpler definition is also conceivable, for example that spaces are soft and all other characters are hard.)
There are two types of output action. One is D-record order output, which is the default. The other is field-format-list order output, which is used when the -p command option is given to DtoLine. The two types differ only in the order in which fields are output.
In D-record order output, the action starts with the first D-field of the given D-record as the current field or, in the case of Dtie, with the first D-field of the subset of the D-record. The cp value is set to zero. The output routine then searches the field-format-list for a field entry that has the same name as the current field. If one is found, it becomes the current field entry and controls the following output action. If none is found, the default field entry becomes the current field entry. After the current field is output, the next D-field of the D-record, or of the subset in the case of Dtie, becomes the current field. This is repeated until all the D-fields in the D-record or in the subset have been processed.
In field-format-list order output, the first field entry of the field-format-list becomes the current field entry, and the cp is set to zero. The output routine then searches the D-record for fields with the same field name. If one is found, it becomes the current field and the output action is taken. If two or more fields with that name are found, the second and following ones become the current field in turn. After all the fields found have been processed, or if no field was found at all, the current field entry moves to the next field entry of the field-format-list. This is repeated until the whole field-format-list has been processed.
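The difference between the two orders can be shown with a small Python sketch. A D-record is modelled as a list of (name, value) pairs and a field-format-list as a list of field names; both representations and the function names record_order and list_order are assumptions made for the illustration. Each result row carries the field name, its value, and the field entry that would control its output.

```python
def record_order(record, entries):
    """D-record order: every field is output, in the order it occurs in the record."""
    out = []
    for name, value in record:
        entry = name if name in entries else "<default>"   # default field entry
        out.append((name, value, entry))
    return out


def list_order(record, entries):
    """Field-format-list order: for each entry in turn, all fields with that name."""
    out = []
    for entry in entries:
        for name, value in record:
            if name == entry:
                out.append((name, value, entry))
    return out


record = [("b", "1"), ("a", "2"), ("c", "3"), ("b", "4")]
entries = ["a", "b"]
print(record_order(record, entries))
print(list_order(record, entries))
```

In this model the field "c", which has no entry in the list, is output with the default field entry in D-record order and is not output at all in field-format-list order.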
Once the current field and the current field entry are set, the cfmt conversion (if any) is made. Then the start specification of the current field entry adjusts the cp, and the end specification either controls the output length or appends a delimiter after the output field. After the output, the cp is at the position just after the output string.
There is no "buffering" at the output routine level. Once a field is output, the cp cannot move back to a smaller value. If you want to output the field "b" in columns 1-4 and the field "a" in columns 6-9, the field "b" must be processed before the field "a".
Default start is cp.
There is no pattern for the output formats.
Alignment options are used for fixed length (i.e., end-position or length) fields. SPACE characters are padded when the output value is shorter than the field length, and the value is truncated when it is longer. They have no effect on varying length (i.e., delimiter) fields.
The alignment options of the output format are mutually exclusive. When both l and r are specified, the l option takes precedence.
Quoting options are used for varying length (i.e., delimiter) fields to escape delimiter characters in the field. They have no effect on fixed length fields.
The quoting options of the output format are mutually exclusive. When q, Q or b are specified together, q takes the highest precedence, followed by Q.
Option n cancels all other options. This is used with -z option to nullify its effect for a field-format-list entry.
Repeat in the output format is ignored, because whether a field actually repeats is determined by its occurrences in the D-record, even in field-format-list order output. Repeat has meaning only when the format is used as a leaf-field-list. See the manual of D_lsa.
When cfmt is specified, the field value is converted before output. The conversion is made by the sprintf function (C language), with the cfmt as the format and the field value as the argument. The cfmt must contain exactly one % element (not counting %%). If it is %d, %o, %x, %i, %n, %u, %c or %wc, the field value is converted to an integer, which becomes the argument. If it is %f or %g, the field value is converted to a double. If it is %s, %S or %ws, the field value string is passed as the argument, as a multi-byte string or a wide character string. The sprintf(3) conversion is then made and the resulting string is treated as the value to be output.
When %wc is used, the field value must be a valid internal code value.
Option f is not allowed with cfmt.
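The output-side conversion can be approximated in Python, whose % operator follows printf-style formats closely. The sketch below is only an approximation: the function name apply_cfmt is invented, and %n, %S, %ws and %wc are not covered because Python has no equivalent for them.

```python
import re

# Either an escaped "%%" or one conversion element; group 1 is the conversion letter.
_SPEC = re.compile(r"%%|%[-+ #0]*\d*(?:\.\d+)?([diouxXfgcs])")


def apply_cfmt(value, cfmt):
    """Format a field value (a string) with a one-element printf-style format."""
    kinds = [m.group(1) for m in _SPEC.finditer(cfmt) if m.group(1)]
    if len(kinds) != 1:
        raise ValueError("cfmt must contain exactly one % element")
    kind = kinds[0]
    if kind in "diouxXc":
        arg = int(value)        # integer conversions receive the value as an integer
    elif kind in "fg":
        arg = float(value)      # floating conversions receive it as a double
    else:
        arg = value             # %s receives the string itself
    return cfmt % arg


print(apply_cfmt("ab", "(%s)"))
print(apply_cfmt("255", "%x"))
```

How the D-commands treat a field value that cannot be converted to a number is not stated here; in this sketch int() or float() simply raises an exception.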
When there is no format information in the field-format-list, or the field entry is not found, the default field entry is applied. It is not a real field entry but one built from the default values described above: it takes cp as its start value, the -t command option value (or TAB if there is no -t) as its delimiter, the -z command option value (or none if there is no -z) as its options, and nothing for its cfmt.
Care is needed to escape the special meanings that the shell and the D-commands give to characters such as SPACE, REVERSE SOLIDUS (\) or SOLIDUS (/) when they are used in a format.
Within /STRING/ of a format, you need extra REVERSE SOLIDUS(\) before spaces, COMMA(,), CIRCUMFLEX ACCENT(^), SOLIDUS(/) and REVERSE SOLIDUS(\). Within @STRING@, instead of SOLIDUS, COMMERCIAL AT(@) needs extra REVERSE SOLIDUS.
It may help to understand how command arguments are processed by the shell and the D-commands. First, the shell parses your input and separates the arguments at space characters.
In the UNIX shells, a space character can be included in a command argument by placing REVERSE SOLIDUS (\) before it, or by quoting with QUOTATION MARK (") or APOSTROPHE ('). Other special characters such as GREATER-THAN SIGN (>) or AMPERSAND (&) can be included with the same quoting mechanisms. There are minor differences in quoting between sh(1), csh(1) and other shells. It is generally safe to quote your field-format-list with APOSTROPHE ('), provided that the list itself does not contain an APOSTROPHE.
In the Windows command window shell, space characters can be included in a command argument only by quoting with QUOTATION MARK ("). Other special characters of the Windows shell, i.e., GREATER-THAN SIGN (>), VERTICAL LINE (|) or AMPERSAND (&), can be included either by QUOTATION MARK quoting or by putting CIRCUMFLEX ACCENT (^) before them. To use CIRCUMFLEX ACCENT in a command argument, put it in QUOTATION MARKs or double the CIRCUMFLEX ACCENT. In addition to the above characters, PERCENT SIGN (%) has to be escaped when it appears in a context that forms a valid environment variable reference. For example, %path% is replaced by the shell with the path directory names. In most cases you need not worry about this, but if a command argument contains a word of the form %xx%, put CIRCUMFLEX ACCENT before the second % (or before x). It cannot be escaped with QUOTATION MARKs.
These quoting characters are removed before the argument is handed to a D-command.
After the shell has processed it, a field-format-list argument is parsed by a D-command in two stages. The first stage is the field-list parsing, in which space characters, COMMA (,) and CIRCUMFLEX ACCENT (^) have special syntactic functions. These special characters can be included in a field name or in a format by placing REVERSE SOLIDUS before them. After the parsing, these REVERSE SOLIDUSs are removed. REVERSE SOLIDUSs before other characters are left intact. Note that "\^" is changed to "^" regardless of its position, even though CIRCUMFLEX ACCENT has a special meaning only at the top of the field-list.
The second stage is the format parsing, in which SOLIDUS (/) or COMMERCIAL AT (@) is used as a syntactical element. Again this can be escaped with the REVERSE SOLIDUS before it. In addition to these characters, REVERSE SOLIDUS itself needs to be doubled. (Otherwise you can't make delimiter strings end with '\'). Finally, this REVERSE SOLIDUS before REVERSE SOLIDUS and before SOLIDUS (in the case of delimiter) or before COMMERCIAL AT (in the case of pattern) is removed.
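The two stages can be followed step by step with the Python sketch below. It is a simplified model: only COMMA is treated as a separator in stage 1, the delimiter is assumed to be the first element of the format, and the function names split_field_list and delimiter_of are invented. How the real parser handles corner cases such as a doubled REVERSE SOLIDUS directly before a COMMA is not modelled.

```python
import re


def split_field_list(arg):
    # Stage 1: split at unescaped commas.  An escaping backslash before
    # SPACE, COMMA or CIRCUMFLEX ACCENT is dropped; other backslashes are kept.
    entries, current, i = [], "", 0
    while i < len(arg):
        if arg[i] == "\\" and arg[i + 1:i + 2] in (" ", ",", "^"):
            current += arg[i + 1]
            i += 2
        elif arg[i] == ",":
            entries.append(current)
            current = ""
            i += 1
        else:
            current += arg[i]
            i += 1
    entries.append(current)
    return entries


def delimiter_of(entry):
    # Stage 2: find /STRING/ at the start of the format, then drop the backslash
    # that escapes a SOLIDUS or another REVERSE SOLIDUS.
    fmt = entry.partition(":")[2]
    m = re.match(r"/((?:\\.|[^/\\])*)/", fmt)
    if m is None:
        return None
    return re.sub(r"\\([\\/])", r"\1", m.group(1))


arg = r"name:/:\ /,point:/\,\ /"     # the argument as received from the shell
for entry in split_field_list(arg):
    print(repr(entry), "->", repr(delimiter_of(entry)))
```

For this argument the model yields the delimiter strings ": " for "name" and ", " for "point".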
See examples below.
Read field "a", "b", "c" separated by TAB in a line:
"a,b,c"
Read a "csv" file with field names "a", "b", "c":
-t "," -z q "a,b,c"
(A csv (comma-separated values) file has fields separated by "," with "" quoting for string fields. The example above shows the case of a file that has only data lines.)
Read words of a C source file:
-t "[^a-zA-Z0-9_]+" -z qQ "words:*"
Read characters one by one:
"c:(1)*"
Read hexadecimal value:
v::%x
Conversion to .csv file (data lines only):
-t "," -z q
Assuming the input D-file is like:
name:MIYAZAWA
point:67
point:72
point:36
a line like:
MIYAZAWA: 67, 72, 36
can be produced with:
"name:/:\ /,point:/\,\ /"
(Note SPACE and COMMA are escaped with REVERSE SOLIDUS. Note also repeat is not required for the field "point" entry. COMMA and SPACE are inserted only between the "point" fields, and not at the end of the output line.)
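The behavior described in the note can be modelled in a few lines of Python. The sketch assumes, as the note implies, that a field's delimiter is written between that field and the next one and is omitted after the last field of the line; the function name to_line and the dictionary of delimiters are invented for the illustration.

```python
def to_line(record, delimiters, default="\t"):
    """Join field values, writing each field's delimiter between it and the next."""
    parts = []
    for index, (name, value) in enumerate(record):
        parts.append(value)
        if index < len(record) - 1:          # no delimiter after the last field
            parts.append(delimiters.get(name, default))
    return "".join(parts)


record = [("name", "MIYAZAWA"), ("point", "67"), ("point", "72"), ("point", "36")]
print(to_line(record, {"name": ": ", "point": ", "}))   # MIYAZAWA: 67, 72, 36
```

Fields without an entry in the dictionary fall back to TAB here, mirroring the default field entry.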
Using cfmt to enclose each value in ():
value::(%s)
Dintro, D_lsa, DfromLine, DtoLine, Dtie, Duntie, Dpack, Dunpack, Dpr.
MIYAZAWA Akira