D_fmt is a notation that specifies how a character string is converted to and from D-fields. It is written as additional information in a field-list (see the manual of Dintro). A specification for converting a character string to D-fields is called an "input" format; it is typically used in DfromLine, and also in Duntie and Dunpack. A specification for converting D-fields to a character string is called an "output" format; it is typically used in DtoLine, and also in Dtie and Dpack. The output format is partially used in Dpr, too. Input and output formats share the same syntax but differ in semantics.
An input or output format action maintains a "current position" on the character string during the conversion process. It is expressed as the number of characters before the position or, simply, as the character position counted from 0. D_fmt knows nothing about the encoding and has no concept of bytes or octets, except in C-format usage.
field-format-list ::= field-list of field-name[:format]
format            ::= [start] [end] [options] [repeat] [:C-format]
start             ::= absolute | relative
absolute          ::= DIGITS
relative          ::= +DIGITS
end               ::= end-position | length | delimiter | pattern
end-position      ::= -DIGITS
length            ::= (DIGITS)
delimiter         ::= /STRING/
pattern           ::= @STRING@
options           ::= { alignment | quoting | with-field-name }..
alignment         ::= l | r | n
quoting           ::= q | x | Q | b | n
with-field-name   ::= f | n
repeat            ::= *
C-format          ::= scanf(3) or printf(3) format with one %
DIGITS here is a string of "0"-"9". STRING here is an arbitrary character string. (See the Special Characters section below).
Two command options -t and -z are valid wherever a D-command takes field-format-list arguments. They are used to give the default values of end and options.
A D-command may take two field-format-list arguments. In this case, the -t and -z command options give default values for both field-format-lists.
Input format action starts from the top field entry of the field-format-list, with the current position (cp) value zero.
The input scanner moves the cp to the start position given by the format and reads the string up to the end. When options are given with the field entry, they control the read-in action. After reading, the string may be converted with the C-format (if any) to produce one D-field. After a D-field is produced, the cp is moved to a new position determined by the end specification.
If repeat is specified, the same field entry is processed again with the new cp, until the cp reaches the end of the input string or the pattern match fails.
When the process of a field entry ends, the input scanner moves to the next field entry of the field-format-list (even if the cp is at the end of the input string). If the field name is a null string, the corresponding field is skipped (i.e., it is read in but no D-field is produced). This process is repeated until the end of the field-format-list is reached.
Default start is cp.
Default action of end is delimiter given by -t command option, or /TAB/ (control character tab) if -t is not provided.
Alignment options are used for fixed length (i.e., end-position or length) fields to remove leading or trailing space characters. They have no effect on varying length (i.e., delimiter or pattern) fields.
When both l and r are specified, both leading and trailing space characters are removed.
Quoting options are used for varying length (i.e., delimiter or pattern) fields to escape delimiter characters in the field. They have no effect on fixed length fields.
When q, Q and b are used together, the first one encountered suppresses the other quoting mechanisms until it closes. That is, inside a QUOTATION MARK (") quotation, an APOSTROPHE (') or REVERSE SOLIDUS (\) is a normal character; inside an APOSTROPHE (') quotation, a QUOTATION MARK (") or REVERSE SOLIDUS (\) is a normal character. A REVERSE SOLIDUS (\) outside a quotation can cancel an opening quote. Option x does not affect the above rule; it is subordinate to option q and is effective only within QUOTATION MARKs.
With-field-name option is a bit tricky.
When the f option is used, the field name in the field-format-list has no meaning, because the field name is given by the input string. The field name in the field-format-list is ignored and you may write any name there. However, it is recommended to use "." as the field name of an entry that has the f option.
The current implementation does not check whether input read with option f really contains a COLON. Users are responsible for feeding proper input to get a valid D-file as a result.
Option n cancels all other options regardless of their category. It is used with the -z option to nullify its effect for one field-list entry. For example, suppose there are four fields "a", "b", "c", "d" separated by TAB in a line, and you want to read fields "b", "c", "d" with the q option and field "a" with no options:
-z q "a:n,b,c,d"
This is the same as the next example:
"a,b:q,c:q,d:q"
For absolute start position and end-position field entry, repeat is invalid.
When the C-format is specified, the input string is further converted by the sscanf function (C language) to yield the final string. The input string is scanned by sscanf using the C-format as the format, and the scanned value is converted to the result string.
The C-format must have one and only one effective format specifier in it. A format specifier begins with PERCENT SIGN (%) and has the following form:
%[*][width][size]type
A format specifier marked with * is not counted as effective. A sequence %% which matches single PERCENT SIGN is not counted as effective, either.
In the C-format, type must be one of the following:
'd', 'i', 'o', 'u', 'x', 'X', 'f', 'e', 'E', 'g', 'G', 's', 'S', 'c', 'C'
If present, size must be one of the following:
'h', 'hh', 'l', 'll', 'L', 'w', 'I64'
Width is a decimal integer that controls the maximum number of characters used for the conversion. In the C-format, when the type is 'c' or 'C', width must be either omitted or 1. If you want to read multiple characters regardless of spaces, use the length in the end specification of the D_fmt instead.
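The "one and only one effective format specifier" rule can be checked mechanically. The following Python sketch is not part of the D-commands; the function names are hypothetical, and the regular expression merely encodes the specifier form, types and sizes listed above. It counts specifiers while ignoring "%%" and specifiers marked with *.

```python
import re

# Specifier form from this manual: %[*][width][size]type
# Types and sizes are the ones listed above for the input C-format.
_SPEC = re.compile(
    r"%(?:%"                            # "%%" matches a literal PERCENT SIGN
    r"|(?P<star>\*)?(?P<width>\d+)?"
    r"(?P<size>hh|h|ll|l|L|w|I64)?"
    r"(?P<type>[diouxXfeEgGsScC]))"
)

def effective_specifiers(c_format):
    """Return the effective specifiers found in an input C-format."""
    found = []
    for m in _SPEC.finditer(c_format):
        if m.group("type") is None:     # this match was "%%"
            continue
        if m.group("star"):             # "%*..." is not counted as effective
            continue
        found.append(m.group(0))
    return found

def is_valid_input_c_format(c_format):
    """True when exactly one effective specifier is present."""
    return len(effective_specifiers(c_format)) == 1
```

Under these assumptions, "%x" and "%*d %s" pass, while "%d%d", "100%%" and "%n" are rejected. The sketch does not check the extra restriction on width for types 'c' and 'C'.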
When the type is 's' or 'S', the converted string is received by a string variable and becomes the final value. In all other types, the converted value is received by an appropriate type variable, and then converted to a numeric value. In the current implementation of D-commands, internal representation of numeric values is "double" type. Conversion from each type to "double" type follows the type conversion rules of the C language. Finally, the numeric value is represented by character string following numeric value representation of D-commands.
The type 'c' is no exception to the rule above. One character (byte) of the input string is received by a "char" type variable and then converted to "double". Therefore the result value is the internal code value of the character. When size 'l' or 'w' is used with the type 'c', or when the type is 'C', one multi-byte character is converted to its internal code. The type 'c' of the C-format is the only place where D_fmt handles character codes in D-commands.
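As a rough illustration of the "scan, then hold as double" path described above, here is a Python sketch for three of the types. It is only an analogy: the function name is hypothetical, Python's int() is stricter than sscanf (it rejects trailing garbage instead of stopping at it), and ord() returns a Unicode code point, which equals the byte value only for ASCII characters. How the resulting number is finally written as a string is left to the numeric value representation of D-commands and is not modeled here.

```python
def scan_numeric(text, c_type):
    """Mimic 'scan with the C-format, then convert to double' for a few types."""
    if c_type == "x":                   # like "%x": hexadecimal integer
        return float(int(text, 16))
    if c_type == "d":                   # like "%d": decimal integer
        return float(int(text, 10))
    if c_type == "c":                   # like "%c": one character -> its code value
        return float(ord(text[0]))
    raise ValueError("type not covered by this sketch: " + c_type)
```

For instance, scanning "ff" as 'x' gives the numeric value 255, and scanning "A" as 'c' gives 65.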
When C-format is specified, you can not use the option f in the same format entry.
The C-format specification of D-commands may not be accepted by your runtime library. For example, size 'I64' is accepted by Windows, but not by other runtime environments. Standard C accepts type 'n', but the C-format of D-commands does not. For the C-format conversion to work properly, a C-format specification must be accepted both by this D-commands specification and by the sscanf specification of your runtime library.
There are some more points to note when using a C-format with non-ASCII characters. The sscanf function operates on the locale character code. When you are using the UTF I/O feature, some characters may not be representable in your locale character code. In this case, these characters are discarded before the sscanf operation. With a locale character code that has a multi-byte representation, some C-format specifications may cause character code errors. For example, the C-format "%1s" may pick up only a part of a multi-byte character. In this case, such erroneous characters are discarded from the final value. If you are using ASCII characters only, these problems do not occur.
When a field entry does not have any format information, the default field entry is applied (this is in fact the most common case). It is an imaginary field entry with the default values described above. It takes cp as its start, the -t command option value (or TAB if there is no -t) as its delimiter, the -z command option value (or none if there is no -z) as its options, and none for its repeat and C-format.
When the cp is at the end of the input string and the start position is not reset by absolute positioning, usually no field is read in. But there is an exception. For example, suppose the delimiter is a comma and the data has a leading and a trailing comma:
DfromLine -t "," "a,b,c"
,word,
The natural result will be:
a:
b:word
c:
However, consider the case where the delimiter is one or more spaces and the data has a leading and a trailing space instead of commas:
-t " +" "a,b,c"
 word 
In this case
a:word
seems to be the acceptable result.
To generalize this situation, we introduce a concept of "hard" delimiter and "soft" delimiter.
For a given delimiter pattern, if a string matches the pattern but the same string doubled does not, the string is defined as "hard" for that pattern. When the doubled string also matches the pattern, the string is defined as "soft".
For example, "," matches the delimiter pattern /,/, but ",," does not. Thus, "," is a "hard" delimiter for /,/. In contrast, one SPACE and two SPACEs both match the pattern / +/, thus a SPACE is "soft" for / +/.
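The definition can be written down directly. The Python sketch below is an illustration, not the D-commands implementation; the function name is hypothetical, and Python's re module is assumed to be close enough to the regular expression dialect of D-commands for these simple patterns.

```python
import re

def delimiter_kind(pattern, s):
    """Classify s as 'hard' or 'soft' for a delimiter pattern.

    Returns None when s does not match the pattern at all.
    """
    rx = re.compile(pattern)
    if rx.fullmatch(s) is None:
        return None
    # Soft: the doubled string still matches. Hard: it does not.
    return "soft" if rx.fullmatch(s + s) else "hard"
```

With this definition, delimiter_kind(",", ",") is "hard" and delimiter_kind(" +", " ") is "soft".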
When the cp is at the end of the input string and the preceding delimiter is "hard", the corresponding field is read in as a null string. In addition, if there is a "soft" delimiter at the cp, it is skipped before reading. (If there is a "hard" delimiter at the cp, a null string is read in.) This applies, of course, only to delimiter field entries and not to fixed length or pattern field entries.
In the following example:
DfromLine -t " *, *| +" a,b,c,d,e
  A B , ,D ,
the delimiter means a COMMA optionally surrounded by SPACEs, or one or more SPACEs. The result is:
a:A
b:B
c:
d:D
e:
Note that the two SPACEs at the top of the input line are a "soft" delimiter, while "SPACE COMMA SPACE" after "B" and "SPACE COMMA" at the end of the line are "hard" delimiters for the pattern " *, *| +".
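The three examples above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the actual scanner: the function names are hypothetical, Python's re stands in for the regular expression engine of D-commands, only delimiter field entries with the default start are modeled, and the special case of a null input string is left out.

```python
import re

def is_soft(rx, s):
    # "Soft" per the definition above: the doubled string also matches.
    return rx.fullmatch(s + s) is not None

def read_delimited(line, pattern, names):
    """Read delimiter-terminated fields, applying the hard/soft rule."""
    rx = re.compile(pattern)
    cp = 0
    prev_hard = False               # no delimiter has been consumed yet
    fields = []
    for name in names:
        # Skip a soft delimiter that starts at the current position.
        if cp < len(line):
            m = rx.match(line, cp)
            if m and m.end() > cp and is_soft(rx, m.group()):
                cp = m.end()
        # At end of input, only a preceding hard delimiter yields a field.
        if cp >= len(line):
            if prev_hard:
                fields.append((name, ""))
                prev_hard = False
            continue
        # Read up to the next delimiter, or to the end of the line.
        m = rx.search(line, cp)
        if m and m.end() > m.start():
            fields.append((name, line[cp:m.start()]))
            prev_hard = not is_soft(rx, m.group())
            cp = m.end()
        else:
            fields.append((name, line[cp:]))
            prev_hard = False
            cp = len(line)
    return fields
```

Under these assumptions, reading ",word," with the pattern "," gives a null "a", "word" for "b" and a null "c"; reading " word " with " +" gives only "a" with the value "word"; and the five-field example with " *, *| +" gives A, B, null, D, null.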
When the input string is a null string, the cp is at the end of the input string from the beginning, and you cannot tell whether the preceding delimiter is "hard" or "soft". In this case, if the delimiter of the first field is a null string, which means the whole string is to be read, "hard" is assumed; otherwise "soft" is assumed. In other words, a null input string usually yields nothing, but yields a field with a null string value when the field wants to read the whole string in by -t "" or ://.
This definition of "hard" and "soft" may not be precise enough to satisfy a mathematician's accuracy. In fact, some oddities are observed with -t ",,?,?", where "," is "soft" but ",," is "hard". There may be a better definition of hard and soft. But the current definition works well for most usual cases, and considering the cost of detection, it is not wise to employ a more complicated definition. (There may, however, be a simpler definition, such as treating spaces as soft and all other characters as hard.)
There are two types of output action. One is D-record order output, which is the default action. The other is field-format-list order output. The latter action is taken when the -p command option is given to, for example, DtoLine. The two types of action differ only in the order of the output fields.
In D-record order output, the action starts with the first D-field of the given D-record as the current field or, in the case of Dtie, with the first D-field of the subset of the D-record. The cp value is set to zero. Then the output routine searches the field-format-list for the field entry that has the same name as the current field. If it is found, it becomes the current field entry and the following output action is controlled by that field entry. If it is not found, the default field entry becomes the current field entry. After the current field is output, the next D-field of the D-record, or the next D-field of the subset (in the case of Dtie), becomes the current field. This action is repeated until all the D-fields in the D-record or in the subset are processed.
In field-format-list order output, the first field entry of the field-format-list becomes the current field entry, and the cp is set to zero. Then the output routine searches the D-record for fields with the same field name. If one is found, it becomes the current field and the output action is taken. If two or more fields with the same name are found, the second and following ones become the current field in turn. After all found fields are processed, or if no field was found at all, the current field entry moves to the next field entry of the field-format-list. This action is repeated until the whole field-format-list is processed.
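The difference between the two output orders can be sketched in a few lines of Python. This is only a model of the ordering described above: a D-record is represented as a list of (name, value) pairs, a field-format-list as a list of (name, format) pairs, and the function names and the "default" marker are hypothetical.

```python
def record_order(record, entries):
    """D-record order: walk the record; look up the entry for each field."""
    formats = dict(entries)
    # Fields with no matching entry get the default field entry.
    return [(name, value, formats.get(name, "default")) for name, value in record]

def list_order(record, entries):
    """Field-format-list order: walk the entries; collect matching fields."""
    out = []
    for entry_name, fmt in entries:
        for name, value in record:
            if name == entry_name:      # repeated fields are taken in turn
                out.append((name, value, fmt))
    return out
```

In this model, a field whose name does not appear in the field-format-list is still output in D-record order (with the default entry), but never becomes the current field in field-format-list order.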
Once the current field and the current field entry are set, the C-format conversion is made, if any. Then the start specification of the current field entry adjusts the cp, and according to the end specification, either the output length is controlled or a delimiter is appended after the output field. After the output, the cp is at the position just after the output string.
There is no "buffering" at the output routine level. Once a field is output, the cp cannot go back to a smaller value. If you want to output the field "b" at columns 1-4 and the field "a" at columns 6-9, the field "b" must be processed before the field "a".
Default start is cp.
There is no pattern for the output formats.
Alignment options are used for fixed length (i.e., end-position or length) fields. SPACE characters are padded when the output value length is shorter than the field length. Or truncation is made when the output value length is longer. They have no effect on varying length (i.e., delimiter) fields.
Alignment options of the output format are mutually exclusive. When both l and r are specified, the l option takes precedence over the r option.
Quoting options are used for varying length (i.e., delimiter) fields to escape delimiter characters in the field. They have no effect on fixed length fields.
Quoting options of the output format are mutually exclusive. When q, Q and b are used at the same time, q has the highest precedence, followed by Q.
Option n cancels all other options. This is used with -z option to nullify its effect for a field-format-list entry.
Repeat in the output format is ignored, because whether the field actually repeats is determined by the occurrences of the field in the D-record, even in field-format-list order output. Repeat has meaning only when the format is used as a leaf-field-list. See the manual of D_lsa.
When the C-format is specified, the field value is converted by the sprintf function (C language) before output. The C-format is used as the format and the field value as the variable, and the result of the conversion is handed to the ordinary D_fmt process.
The C-format must have one and only one format specifier in it. A format specifier begins with PERCENT SIGN (%) and has the following form:
%[flags][width][.precision][size]type
A sequence "%%" denotes a character PERCENT SIGN and is not regarded as the format specifier, here.
The flags must be sequence of following characters:
'#', '0', '-', ' ', '+'
These flags alter the way of representation such as justification, sign or hexadecimal prefixes. See the manual of sprintf (C language function) for the detail.
The type must be one of the following:
'd', 'i', 'o', 'u', 'x', 'X', 'f', 'e', 'E', 'g', 'G', 's', 'S', 'c', 'C'
If present, size must be one of the following:
'h', 'hh', 'l', 'll', 'L', 'w', 'I64'
The width and precision are decimal integers which control a certain number of characters in the result string, depending on the type. See the manual of sprintf (C language function) for the detail. It should be noted here that the width and the precision do not limit the length of result string in the case of numeric conversion. This is by the specification of sprintf. It is recommended to use D_fmt length in the end specification with C-format to control maximum length.
When the type is numeric (i.e. neither 's' nor 'S'), the original field value is converted to a numeric value (double type in this implementation) and then cast to the type specified by the type and the size before it is converted by the sprintf function. After the sprintf operation, the converted string is normalized to the internal string value (wchar string in this implementation) and then handed to the D_fmt process.
The type 'c' or 'C' is also numeric. The sprintf operation in this case is the conversion from an internal character code to the character. If the size is 'l', the character can be a multi-byte character; otherwise the character is limited to the one-byte character range.
When the type is 's' or 'S', the original value is basically not changed by the sprintf function. The C-format "%s" does nothing. This type is usually used to add some characters to the original string, such as the C-format "(%s)" to parenthesize the original value. The size 'l' may be used with the type 's', but the result is the same unless it is used with the width and the precision, which is not recommended.
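Python's % operator follows printf conventions for these simple specifiers, so it can be used to illustrate the behavior described above. The sketch below is an analogy only; the function names are hypothetical, and int() stands in for the C cast from double to an integer type (both truncate toward zero).

```python
def format_integer(field_value, c_format):
    """Numeric path for an integer type such as 'd': string -> double -> int."""
    as_double = float(field_value)
    return c_format % int(as_double)    # cast truncates toward zero

def format_floating(field_value, c_format):
    """Numeric path for a floating type such as 'f'."""
    return c_format % float(field_value)

def format_string(field_value, c_format):
    """String path: the value itself is unchanged, only decorated."""
    return c_format % field_value
```

For example, format_integer("12345.678", "%2d") gives "12345": the width 2 does not truncate a numeric result, which is why a D_fmt length is the recommended way to bound the output. format_string("abc", "(%s)") gives "(abc)".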
When C-format is specified, you can not use the option f in the same format entry.
The C-format specification of D-commands may not be accepted by your runtime library. For example, size 'w', which is valid in the D-commands, does not work with Windows. The type 'p', which is in the standard C language, is rejected by the C-format of the D-commands. For the C-format conversion to work properly, a C-format specification must be accepted both by this D-commands specification and by the sprintf specification of your runtime library.
There are some more points to note when using a C-format with non-ASCII characters. The sprintf function operates on the locale character code. When you are using the UTF I/O feature, some characters may not be representable in your locale character code. The result of a "%lc" conversion for such a character code depends on your runtime environment. In addition, the width and the precision are counted in bytes in the C-format "%s". This may produce an invalid code representation, but the result depends on your runtime environment. If you are using ASCII characters only, these problems do not occur.
When there is no format information in the field-format-list, or the field entry is not found, the default field entry is applied. It is an imaginary field entry with the default values described above. It takes cp as its start value, the -t command option value (or TAB if there is no -t) as its delimiter, the -z command option value (or none if there is no -z) as its options, and none for its C-format.
You may have to use special characters such as REVERSE SOLIDUS (\), DOLLAR SIGN ($) or AMPERSAND (&) in your field-format-list, especially in an input delimiter. To give the proper string to the D-format processor, you have to escape carefully against shell special character handling, the D-field-list parser and the D-format parser. Here is a practical way to do this.
It may be useful to learn how the command arguments are processed by the shell and D-commands. There are two levels of character escape processes between your command argument and the string to be handled. The first level is the SHELL (cmd shell in Windows), the second is the processes by D-commands. The latter is further divided into some levels. The first is D-field-list parsing process, the second is D-format parsing process, and the deepest is regular expression matching process which is called only for input delimiter or pattern. In each process, some "escape" characters may be stripped off. The following paragraphs describe the details.
In the UNIX shells, space characters can be included in a command argument by putting a REVERSE SOLIDUS (\) before each of them, or by quoting with QUOTATION MARK (") or APOSTROPHE ('). Other special characters like GREATER-THAN SIGN (>), DOLLAR SIGN ($) or AMPERSAND (&) can be included using the same quoting mechanisms. There are minor differences in the quoting mechanisms and in the repertoire of special characters between sh(1) and csh(1) (and other shells as well). It is beyond the scope of this tutorial to go into the details. Generally, it is recommended to quote your field-format-list with APOSTROPHEs. If you need to use an APOSTROPHE in your format, quote it with QUOTATION MARKs. You can use more than one quoting in an argument. A command argument '"DON'"'"'T"' becomes the seven character string "DON'T" when it is handed to the command.
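The effect of mixed quoting can be checked without running a shell. Python's standard shlex module splits a command line using POSIX sh-style quoting rules, which is a reasonable stand-in for sh(1) in simple cases like this one (csh and other shells may differ). The helper name below is hypothetical.

```python
import shlex

def shell_words(command_line):
    """Split a command line the way a POSIX-style shell would quote-strip it."""
    return shlex.split(command_line)

# The argument '"DON'"'"'T"' is three quoted pieces written back to back:
#   '"DON'  ->  "DON      (APOSTROPHE quoting)
#   "'"     ->  '         (QUOTATION MARK quoting)
#   'T"'    ->  T"        (APOSTROPHE quoting)
argument = "'\"DON'\"'\"'T\"'"
words = shell_words("echo " + argument)
```

Here words[1] is the seven character string "DON'T", including both QUOTATION MARKs.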
In the Windows command window shell, space characters can be included in a command argument only by the QUOTATION MARK quoting mechanism. Other special characters of the Windows shell, i.e., GREATER-THAN SIGN, VERTICAL LINE or AMPERSAND, can be included by using the QUOTATION MARK quoting mechanism, or by putting a CIRCUMFLEX ACCENT (^) before them. QUOTATION MARK itself can be used within QUOTATION MARKs by doubling it (""). To use a CIRCUMFLEX ACCENT in a command argument, put it in QUOTATION MARKs or double the CIRCUMFLEX ACCENT. In addition to the above characters, PERCENT SIGN (%) has to be escaped when it appears in a context that forms a valid environment variable reference. For example, %path% is replaced with the path directory names by the shell. In most cases, you need not worry about this. But if you use words like %xx% in a command argument, you can put a CIRCUMFLEX ACCENT before the second % (or before x). You cannot escape it with QUOTATION MARKs.
These quoting characters of the shell are removed before the argument is handed to a D-command.
After the shell process, D-commands receive the processed argument. D-field-list parsing does following process.
<additional-inf> is processed by D-format parser. In this process:
The result is handed to the D-format formatting process including regular expression matching.
Suppose you want any one of SOLIDUS (/), VERTICAL LINE (|) or REVERSE SOLIDUS (\) to act as the delimiter of field foo.
The regular expression of this delimiter should be
/|\||\\
The first SOLIDUS is a normal character. The next VERTICAL LINE is a regular expression syntax element meaning OR. The next REVERSE SOLIDUS makes the following VERTICAL LINE a normal character. The next VERTICAL LINE is again a syntax element, and the last two REVERSE SOLIDUSes make one normal character REVERSE SOLIDUS. This is the form that D-format should hand to the regular expression parser.
In the field-format-list, you have to write this as /STRING/, so you have to escape the first SOLIDUS. Furthermore, you have to double each of the last two REVERSE SOLIDUSes, because the D-format parser removes a REVERSE SOLIDUS that precedes another REVERSE SOLIDUS:
/\/|\||\\\\/
As this has no COMMA, you don't need further escape in the field-list.
'foo:/\/|\||\\\\/'
This should be the command argument. In Windows, use QUOTATION MARK instead of APOSTROPHE as "foo:/\/|\||\\\\/".
You may also use the regular expression construct []. In this case 'fn:/[\/|\\]/' works. This is a simpler and better answer; the example above is given for explanation.
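The innermost level, the regular expression itself, can be tried out in isolation. The Python sketch below uses Python's re module as a stand-in for the regular expression engine of D-commands (an assumption; the dialects may differ in detail), and the helper name is hypothetical. It shows only the pattern that the regular expression parser finally receives, after the shell and the D-format parser have removed their escapes.

```python
import re

# The alternation form handed to the regular expression parser: /|\||\\
ALTERNATION = r"/|\||\\"
# The bracket form. In Python's re the REVERSE SOLIDUS inside [] must be
# doubled; in POSIX bracket expressions a single one is already literal.
BRACKET = r"[/|\\]"

def split_on(pattern, line):
    """Split a line at every match of the delimiter pattern."""
    return re.split(pattern, line)
```

Both patterns split the string a/b|c\d into the four values a, b, c and d.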
Read field "a", "b", "c" separated by TAB in a line:
"a,b,c"
Read a "csv" file with field names "a", "b", "c":
-t "," -z q "a,b,c"
(A csv (comma separated value) file has fields separated by "," with "" quoting for string fields. The example above covers the case of a file that has only data lines.)
Read words of a C source file:
-t "[^a-zA-Z0-9_]+" -z qQ "words:*"
Read characters one by one:
"c:(1)*"
Read hexadecimal value:
v::%x
Conversion to .csv file (data lines only):
-t "," -z q
To get the following line:
MIYAZAWA: 67, 72, 36
From the following input D-record:
name:MIYAZAWA
point:67
point:72
point:36
Use the following output format.
"name:/: /,point:/\, /"
(Note COMMA is escaped with REVERSE SOLIDUS. Note also repeat is not required for the field "point" entry. COMMA and SPACE are inserted only between the "point" fields, and not at the end of the output line.)
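The manual does not spell out the exact rule by which the last delimiter is suppressed. One model that reproduces the line shown above is a "pending delimiter": the delimiter of a field is written only when another field follows it. The Python sketch below implements that model; it is an assumption made for illustration, not the documented algorithm, and the function name and the dictionary of delimiters are hypothetical.

```python
def to_line(record, delimiters, default="\t"):
    """Join D-fields in record order, emitting delimiters only between fields."""
    out = []
    pending = ""                        # delimiter owed by the previous field
    for name, value in record:
        out.append(pending)
        out.append(value)
        pending = delimiters.get(name, default)
    return "".join(out)                 # the final pending delimiter is dropped

record = [("name", "MIYAZAWA"), ("point", "67"), ("point", "72"), ("point", "36")]
line = to_line(record, {"name": ": ", "point": ", "})
```

With this model, line is the string MIYAZAWA: 67, 72, 36, and no repeat is needed on the "point" entry because the repetition comes from the record itself.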
Using a C-format to get each value enclosed in ():
value::(%s)
MIYAZAWA Akira