D_fmt is a notation that specifies how a character string is converted to or from D-fields. It is written as additional information attached to a field-list (see the manual of Dintro). A specification for converting a character string to D-fields is called an "input" format; it is typically used in DfromLine, and also in Duntie and Dunpack. A specification for converting D-fields to a character string is called an "output" format; it is typically used in DtoLine, and also in Dtie and Dpack. The output format is partially used in Dpr, too. Input and output formats share the same syntax but differ in semantics.
An input or output format action maintains a "current position" on the character string during the conversion process. It is expressed as the number of characters before the position, that is, a character position counted from 0. D_fmt knows nothing about the encoding and has no concept of bytes or octets, except in C-format usage.
field-format-list ::= field-list of field-name[:format]
format            ::= [start] [end] [options] [repeat] [:C-format]
start             ::= absolute | relative
absolute          ::= DIGITS
relative          ::= +DIGITS
end               ::= end-position | length | delimiter | pattern
end-position      ::= -DIGITS
length            ::= (DIGITS)
delimiter         ::= /STRING/
pattern           ::= @STRING@
options           ::= { alignment | quoting | with-field-name }..
alignment         ::= l | r | n
quoting           ::= q | x | Q | b | n
with-field-name   ::= f | n
repeat            ::= *
C-format          ::= scanf(3) or printf(3) format with one %
DIGITS here is a string of "0"-"9". STRING here is an arbitrary character string. (See the Special Characters section below).
Two command options -t and -z are valid wherever a D-command takes field-format-list arguments. They are used to give the default values of end and options.
A D-command may take two field-format-list arguments. In this case, the -t and -z command options give default values for both field-format-lists.
The input format action starts from the first field entry of the field-format-list, with the current position (cp) set to zero.
The input scanner moves the cp to the start position given by the format and reads the string up to the end. When options are given in the field entry, they control the read-in action. After reading, the string read in may be converted with the C-format (if any) to produce one D-field. After a D-field is produced, the cp is moved to a new position determined by the end specification.
If repeat is specified, the same field entry process is repeated from the new cp until the cp reaches the end of the input string or the pattern match fails.
When the process of a field entry ends, the input scanner moves to the next field entry of the field-format-list (even if the cp is at the end of the input string). If the field name is a null string, the corresponding field is skipped (i.e., it is read in but no D-field is produced). This process is repeated until the end of the field-format-list is reached.
The default start is the cp.
The default end is the delimiter given by the -t command option, or /TAB/ (the tab control character) if -t is not provided.
Alignment options are used for fixed length (i.e., end-position or length) fields to remove leading or trailing space characters. They have no effect on varying length (i.e., delimiter or pattern) fields.
When both l and r are specified, both leading and trailing space characters are removed.
Quoting options are used for varying length (i.e., delimiter or pattern) fields to escape delimiter characters in the field. They have no effect on fixed length fields.
When q, Q or b are used together, the first one encountered suppresses the other quoting mechanisms until it closes. That is, within a quotation opened by QUOTATION MARK ("), APOSTROPHE (') or REVERSE SOLIDUS (\), the characters of the other quoting mechanisms are treated as normal characters. A REVERSE SOLIDUS (\) outside a quotation can cancel an opening quote. Option x does not affect the above rule. It is subordinate to option q and is effective only inside QUOTATION MARKs.
With-field-name option is a bit tricky.
When the f option is used, the field name in the field-format-list has no meaning, because the field name is given by the input string. The field name in the field-format-list is ignored and you may write any name there, but it is recommended to use "." as the field name of an entry that has the f option.
The current implementation does not check whether input read with option f actually contains a COLON. Users are responsible for feeding proper input to get a valid D-file as a result.
Option n cancels all other options regardless of their category. It is used with the -z option to nullify its effect for a single field-list entry. For example, assume there are four fields "a", "b", "c", "d" separated by TAB in a line, and you want to read fields "b", "c", "d" with the q option while reading field "a" with no options:
-z q "a:n,b,c,d"
This is the same as the following:
"a,b:q,c:q,d:q"
For absolute start position and end-position field entry, repeat is invalid.
When the C-format is specified, the input string is further converted by sscanf (C language) function to yield the final string. The input string is scanned by sscanf function using the C-format as the format, and the scanned value is converted to the result string.
The C-format must have one and only one effective format specifier in it. The format specifier begins with PERCENT SIGN (%) and has the following form:
%[*][width][size]type
A format specifier marked with * is not counted as effective. A sequence %% which matches single PERCENT SIGN is not counted as effective, either.
In the C-format, type must be one of the following:
'd', 'i', 'o', 'u', 'x', 'X', 'f', 'e', 'E', 'g', 'G', 's', 'S', 'c', 'C'
If present, size must be one of the following:
'h', 'hh', 'l', 'll', 'L', 'w', 'I64'
Width is a decimal integer that controls the maximum number of characters used for the conversion. In the C-format, when the type is 'c' or 'C', width must be either omitted or 1. If you want to read multiple characters regardless of spaces, use the length in the end specification of D_fmt.
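The rules above for an input C-format can be checked mechanically. The following is a minimal Python sketch, not the D-commands implementation: the function name is_valid_input_cformat is hypothetical, and the regular expression simply encodes the form %[*][width][size]type with the type and size lists given above.

```python
import re

# One scanf-style specifier: %[*][width][size]type, using the type and
# size lists from this manual. (Illustrative only; hypothetical helper.)
SPEC = re.compile(r"%(\*?)(\d*)(hh|ll|I64|h|l|L|w)?([diouxXfeEgGsScC])")


def is_valid_input_cformat(cformat):
    """Return True if cformat has exactly one effective specifier."""
    effective = 0
    i = 0
    while i < len(cformat):
        if cformat[i] != "%":
            i += 1
            continue
        if cformat.startswith("%%", i):   # "%%" matches a literal percent sign
            i += 2
            continue
        m = SPEC.match(cformat, i)
        if m is None:                     # e.g. "%n": type not in the list
            return False
        star, width, _size, ctype = m.groups()
        if ctype in "cC" and width not in ("", "1"):
            return False                  # width must be omitted or 1 for c/C
        if not star:                      # "%*..." is not counted as effective
            effective += 1
        i = m.end()
    return effective == 1
```

With this sketch, "%x" and "%*d%d" are accepted, while "%d%d", "%n" and "%3c" are rejected. Whether your runtime library's sscanf also accepts a given format is a separate question.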
When the type is 's' or 'S', the converted string is received by a string variable and becomes the final value. In all other types, the converted value is received by an appropriate type variable, and then converted to a numeric value. In the current implementation of D-commands, internal representation of numeric values is "double" type. Conversion from each type to "double" type follows the type conversion rules of the C language. Finally, the numeric value is represented by character string following numeric value representation of D-commands.
The type 'c' is no exception to the rule above. One character (byte) of the input string is received by a "char" type variable and then converted to "double". The result value is therefore the internal code value of the character. When size 'l' or 'w' is used with the type 'c', or the type is 'C', one multi-byte character is converted to its internal code. This type 'c' of C-format is the only case in which D_fmt deals with character codes in D-commands.
When a C-format is specified, you cannot use the option f in the same format entry.
The C-format specification of D-commands may not be accepted by your runtime library. For example, size 'I64' is accepted on Windows but not in other runtime environments. Conversely, standard C accepts type 'n', but the C-format of D-commands does not. For C-format conversion to work properly, a C-format specification must be accepted both by this D-commands specification and by your runtime library's sscanf specification.
There are some further points to note when using C-format with non-ASCII characters. The sscanf function operates on the locale character code. When you are using the UTF I/O feature, some characters may not be representable in your locale character code. In this case, these characters are discarded before the sscanf operation. With a locale character code that uses multi-byte representation, some C-format specifications may cause character code errors. For example, the C-format "%1s" may pick up only part of a multi-byte character. In this case, such erroneous characters are discarded from the final value. If you are using ASCII characters only, these problems do not occur.
When a field entry has no format information, the default field entry is applied (this will in fact be the most common case). It is a virtual field entry built from the default values described above: it takes the cp as its start, the -t command option value (or TAB if -t is absent) as its delimiter, the -z command option value (or none if -z is absent) as its options, and nothing for its repeat and C-format.
When the cp is at the end of the input string and the start position is not reset by absolute positioning, usually no field is read in. But there is an exception. For example, suppose the delimiter is a comma and the data has leading and trailing commas:
DfromLine -t "," "a,b,c"
,word,
The natural result would be:
a:
b:word
c:
However, suppose the delimiter is one or more spaces and the data has a leading and a trailing space instead of commas:
-t " +" "a,b,c"
word
In this case
a:word
seems to be the acceptable result.
To generalize this situation, we introduce a concept of "hard" delimiter and "soft" delimiter.
For a given delimiter pattern, if a string matches the pattern but the same string repeated twice does not, the string is defined as "hard" for that pattern. When the doubled string also matches the pattern, the string is defined as "soft".
For example, "," matches the delimiter pattern /,/, but ",," does not. Thus "," is a "hard" delimiter for /,/. In contrast, one SPACE and two SPACEs both match the pattern / +/, so a SPACE is "soft" for / +/.
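The definition can be written down directly. The sketch below is an illustration only: the function name classify_delimiter is hypothetical, "matches" is assumed to mean that the whole string matches the pattern, and Python's re module stands in for whatever regular expression dialect D-commands actually use.

```python
import re


def classify_delimiter(pattern, text):
    """Classify text as 'hard' or 'soft' for a delimiter pattern.

    Returns None when text does not match the pattern at all.
    """
    rx = re.compile(pattern)
    if rx.fullmatch(text) is None:
        return None
    # Soft: the doubled string also matches. Hard: it does not.
    return "soft" if rx.fullmatch(text + text) else "hard"
```

For instance, classify_delimiter(",", ",") gives "hard" and classify_delimiter(" +", " ") gives "soft", in line with the examples above.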
When the cp is at the end of the input string and the preceding delimiter is "hard", the corresponding field is read in as a null string. In addition, if a "soft" delimiter begins at the cp, it is skipped before reading. (If a "hard" delimiter begins at the cp, a null string is read in.) This applies, of course, only to delimiter field entries, not to fixed-length or pattern field entries.
For example:
DfromLine -t " *, *| +" a,b,c,d,e
 A B , ,D ,
result is:
a:A
b:B
c:
d:D
e:
Note that the " " at the top of the input line is a "soft" delimiter, while the " , " after "B" and the " ," at the end are "hard" delimiters for the pattern " *, *| +".
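The behaviour in the examples above can be reproduced with a short scanner. This is a sketch under stated assumptions, not the D-commands implementation: the function name scan_delimited is hypothetical, every field entry is assumed to use the same delimiter pattern with no start, options, repeat or C-format, Python's re module stands in for the real pattern matcher, and a trailing "hard" delimiter is assumed to produce exactly one extra null field.

```python
import re


def scan_delimited(pattern, names, line):
    """Split line into (name, value) pairs using the hard/soft rule."""
    rx = re.compile(pattern)

    def is_hard(delim):
        # Hard: the delimiter matches, but the same string doubled does not.
        return rx.fullmatch(delim + delim) is None

    fields = []
    cp = 0
    prev_hard = False              # kind of the delimiter just before cp
    for name in names:
        if cp >= len(line):
            # At the end of input, only a preceding hard delimiter
            # yields one more field, holding a null string.
            if prev_hard:
                fields.append((name, ""))
                prev_hard = False
            continue
        m = rx.match(line, cp)
        if m and m.end() > cp and not is_hard(m.group()):
            cp = m.end()           # skip a soft delimiter that begins at cp
            if cp >= len(line):
                prev_hard = False
                continue
        m = rx.search(line, cp)
        if m is None or m.end() == m.start():
            fields.append((name, line[cp:]))   # no delimiter: read to the end
            cp, prev_hard = len(line), False
        else:
            fields.append((name, line[cp:m.start()]))
            cp, prev_hard = m.end(), is_hard(m.group())
    return fields
```

Calling scan_delimited(" *, *| +", ["a", "b", "c", "d", "e"], " A B , ,D ,") yields the pairs a:A, b:B, c: (null), d:D and e: (null), and the two earlier examples with /,/ and / +/ come out as described.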
When the input string is a null string, the cp is at the end of the input string from the start, and there is no preceding delimiter to classify as "hard" or "soft". In this case, if the first field's delimiter is a null string, which means to read the whole string, "hard" is assumed; otherwise "soft" is assumed. In other words, a null input string usually yields nothing, but yields a field with a null string value when the field asks to read the whole string by -t "" or ://.
The definition of "hard" and "soft" may not be precise enough to satisfy a mathematician's standards. In fact, some oddities are observed with -t ",,?,?", where "," is "soft" but ",," is "hard". A better definition of hard and soft may exist. But the current definition works well for most usual cases, and considering the cost of detection, it is not wise to employ a more complicated definition. (There may, however, be a simpler definition, such as treating spaces as soft and all other characters as hard.)
There are two types of output action. One is D-record order output, which is the default. The other is field-format-list order output, which is taken when the -p command option is given to DtoLine. The two types differ only in the order in which fields are output.
In D-record order output, the action starts with the first D-field of the given D-record as the current field or, in the case of Dtie, with the first D-field of the subset of the D-record. The cp is set to zero. The output routine then searches the field-format-list for a field entry with the same name as the current field. If one is found, it becomes the current field entry and controls the output action that follows. If none is found, the default field entry becomes the current field entry. After the current field is output, the next D-field of the D-record (or of the subset, in the case of Dtie) becomes the current field. This is repeated until all the D-fields in the D-record or in the subset are processed.
In field-format-list order output, the first field entry of the field-format-list becomes the current field entry, and the cp is set to zero. The output routine then searches the D-record for fields with the same name. If one is found, it becomes the current field and the output action is taken. If two or more fields with the same name are found, the second and following ones become the current field in turn. After all the fields found are processed, or if no field was found at all, the next field entry of the field-format-list becomes the current field entry. This is repeated until the whole field-format-list is processed.
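The two orders can be contrasted with a small model. In this sketch a D-record is assumed to be a list of (name, value) pairs and the field-format-list a list of entry names; the function names and the "<default>" marker are hypothetical and are not part of D-commands.

```python
def record_order(record, entry_names):
    """Visit fields in D-record order, pairing each with its field entry."""
    visited = []
    for name, value in record:
        # A field without a matching entry uses the default field entry.
        entry = name if name in entry_names else "<default>"
        visited.append((name, value, entry))
    return visited


def list_order(record, entry_names):
    """Visit fields in field-format-list order (as with DtoLine -p)."""
    visited = []
    for entry in entry_names:
        # Same-name fields are taken in turn, in their record order.
        for name, value in record:
            if name == entry:
                visited.append((name, value, entry))
    return visited
```

For a record with fields name, point, id, point and a list with entries point, name, record_order visits name, point, id, point, while list_order visits point, point, name. In this sketch a field with no entry (id) is never visited in list order, which is how the description above reads.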
Once the current field and the current field entry are set, C-format conversion is performed, if any. Then the start specification of the current field entry adjusts the cp, and the end specification either controls the output length or appends a delimiter after the output field. After the output, the cp is at the position just after the output string.
There is no "buffering" at the output routine level. Once a field is output, the cp cannot move back to a smaller value. If you want to output the field "b" at columns 1-4 and the field "a" at columns 6-9, the field "b" must be processed before the field "a".
The default start is the cp.
There is no pattern for the output formats.
Alignment options are used for fixed length (i.e., end-position or length) fields. The value is padded with SPACE characters when it is shorter than the field length, and truncated when it is longer. Alignment options have no effect on varying length (i.e., delimiter) fields.
The alignment options of the output format are mutually exclusive. When both l and r are specified, the l option takes precedence over the r option.
Quoting options are used for varying length (i.e., delimiter) fields to escape delimiter characters in the field. They have no effect on fixed length fields.
The quoting options of the output format are mutually exclusive. When q, Q or b are specified at the same time, q takes the highest precedence, followed by Q.
Option n cancels all other options. This is used with -z option to nullify its effect for a field-format-list entry.
Repeat in the output format is ignored, because whether a field actually repeats is determined by the occurrences of the field in the D-record, even in field-format-list order output. Repeat has meaning only when the format is used as a leaf-field-list. See the manual of D_lsa.
When the C-format is specified, the field value is converted by sprintf (C language) function before output. The C-format is used as the format and the field value as the variable, and the result of the conversion is handed to ordinary D_fmt process.
The C-format must have one and only one format specifier in it. The format specifier begins with PERCENT SIGN (%) and has the following form:
%[flags][width][.precision][size]type
A sequence "%%" denotes a character PERCENT SIGN and is not regarded as the format specifier, here.
The flags must be a sequence of the following characters:
'#', '0', '-', ' ', '+'
These flags alter the way of representation such as justification, sign or hexadecimal prefixes. See the manual of sprintf (C language function) for the detail.
The type must be one of the following:
'd', 'i', 'o', 'u', 'x', 'X', 'f', 'e', 'E', 'g', 'G', 's', 'S', 'c', 'C'
If present, size must be one of the following:
'h', 'hh', 'l', 'll', 'L', 'w', 'I64'
The width and precision are decimal integers which control a certain number of characters in the result string, depending on the type. See the manual of sprintf (C language function) for the detail. It should be noted here that the width and the precision do not limit the length of result string in the case of numeric conversion. This is by the specification of sprintf. It is recommended to use D_fmt length in the end specification with C-format to control maximum length.
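The point about width and precision is easy to see in practice. The sketch below uses Python's % operator, which follows the printf conventions for these conversions, as a stand-in for sprintf; the helper name formatted_length is hypothetical.

```python
def formatted_length(cformat, value):
    """Return the length of the string produced by a printf-style format."""
    return len(cformat % value)


# A width of 3 pads short numbers but never truncates long ones.
short_number = "%3d" % 7
long_number = "%3d" % 123456

# For strings, the precision does limit the number of characters.
clipped_text = "%.3s" % "abcdef"
```

Here long_number keeps all six digits even though the width is 3, which is why a D_fmt length in the end specification is the safer way to bound the output.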
When the type is numeric (i.e. neither 's' nor 'S'), the original field value is converted to numeric value (double type in this implementation), then is cast to appropriate type specified by the type and the size before converted by sprintf function. After sprintf operation, the converted string is normalized to the internal string value (wchar string in this implementation), then, is handed to the D_fmt process.
The type 'c' or 'C' is also numeric. In this case the sprintf operation is the conversion from an internal character code to the character. If the size is 'l', the character can be a multi-byte character; otherwise the character is limited to the one-byte character range.
When the type is 's' or 'S', the original value is basically not changed by the sprintf function. The C-format "%s" does nothing. This type is usually used to add some characters to the original string, such as the C-format "(%s)" to parenthesize the original value. The size 'l' may be used with the type 's', but the result is the same unless it is combined with the width and the precision, which is not recommended.
When a C-format is specified, you cannot use the option f in the same format entry.
The C-format specification of D-commands may not be accepted by your runtime library. For example, size 'w', which is valid in D-commands, does not work on Windows. Conversely, the type 'p', which is in standard C, is rejected by the C-format of D-commands. For C-format conversion to work properly, a C-format specification must be accepted both by this D-commands specification and by your runtime library's sprintf specification.
There are some further points to note when using C-format with non-ASCII characters. The sprintf function operates on the locale character code. When you are using the UTF I/O feature, some characters may not be representable in your locale character code. The result of "%lc" conversion for such a character code depends on your runtime environment. In addition, the width and the precision are counted in bytes in the C-format "%s". This may produce an invalid code representation, but the result depends on your runtime environment. If you are using ASCII characters only, these problems do not occur.
When there is no format information in the field-format-list, or the field entry is not found, the default field entry is applied. It is a virtual field entry built from the default values described above: it takes the cp as its start, the -t command option value (or TAB if -t is absent) as its delimiter, the -z command option value (or none if -z is absent) as its options, and nothing for its C-format.
Take care to escape characters that are special to the shell or to D-commands, such as SPACE, REVERSE SOLIDUS (\) or SOLIDUS (/), when you use them in a format.
Within /STRING/ of a format, you need extra REVERSE SOLIDUS(\) before spaces, COMMA(,), CIRCUMFLEX ACCENT(^), SOLIDUS(/) and REVERSE SOLIDUS(\). Within @STRING@, instead of SOLIDUS, COMMERCIAL AT(@) needs extra REVERSE SOLIDUS.
It may be useful to learn how the command arguments are processed by the shell and by D-commands. First, the shell parses your input and separates the arguments at space characters.
In the UNIX shells, space characters can be included in a command argument by putting REVERSE SOLIDUS (\) before them or by quoting with QUOTATION MARK (") or APOSTROPHE ('). Other special characters like GREATER-THAN SIGN (>) or AMPERSAND (&) can be included using the same quoting mechanisms. There are minor differences in the quoting mechanisms between sh(1) and csh(1) (and other shells as well). Generally it is safe to quote your field-format-list with APOSTROPHE ('), provided that the list does not contain an APOSTROPHE.
In the Windows command window shell, space characters can be included in a command argument only by the QUOTATION MARK (") quoting mechanism. Other special characters in the Windows shell, i.e., GREATER-THAN SIGN (>), VERTICAL LINE (|) or AMPERSAND (&), can be included by using the QUOTATION MARK quoting mechanism or by putting CIRCUMFLEX ACCENT (^) before them. To use CIRCUMFLEX ACCENT in a command argument, put it in QUOTATION MARKs or double the CIRCUMFLEX ACCENT. In addition to the above characters, PERCENT SIGN (%) has to be escaped when it would form a valid environment variable reference. For example, %path% is replaced by the shell with the path directory names. In most cases, you need not worry about this. But if you use words of the form %xx% in a command argument, you can put CIRCUMFLEX ACCENT before the second % (or before x). You cannot escape it with QUOTATION MARKs.
These quoting characters are removed before the argument is handed to a D-command.
After the shell processing, a field-format-list argument is parsed by a D-command in two stages. The first stage is the field-list parsing, in which space characters, COMMA (,) and CIRCUMFLEX ACCENT (^) have special syntactic functions. These special characters can be included in a field name or in a format by placing REVERSE SOLIDUS before them. After the parsing, these REVERSE SOLIDUSes are removed. REVERSE SOLIDUSes before other characters are left intact. Note that "\^" is changed to "^" regardless of its position, despite the fact that CIRCUMFLEX ACCENT has a special meaning only at the top of the field-list.
The second stage is the format parsing, in which SOLIDUS (/) or COMMERCIAL AT (@) is used as a syntactical element. Again this can be escaped with the REVERSE SOLIDUS before it. In addition to these characters, REVERSE SOLIDUS itself needs to be doubled. (Otherwise you can't make delimiter strings end with '\'). Finally, this REVERSE SOLIDUS before REVERSE SOLIDUS and before SOLIDUS (in the case of delimiter) or before COMMERCIAL AT (in the case of pattern) is removed.
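The two unescaping stages can be modelled as two small functions. This is a sketch of the removal rules only, under the assumption that each stage scans left to right; it does not split the field-list, the function names are hypothetical, and "space characters" is narrowed to SPACE for simplicity.

```python
import re


def field_list_unescape(text):
    """Stage 1: drop REVERSE SOLIDUS before SPACE, COMMA or CIRCUMFLEX ACCENT."""
    # A backslash before any other character is left intact.
    return re.sub(r"\\([ ,^])", r"\1", text)


def format_unescape(text, closer):
    """Stage 2: drop REVERSE SOLIDUS before REVERSE SOLIDUS or the closer.

    closer is "/" for a delimiter and "@" for a pattern.
    """
    return re.sub(r"\\([\\" + re.escape(closer) + r"])", r"\1", text)


# The delimiter body written as \,\  (escaped COMMA and SPACE) becomes ", ".
comma_space = format_unescape(field_list_unescape(r"\,\ "), "/")

# A doubled REVERSE SOLIDUS survives stage 1 and collapses in stage 2,
# which is how a delimiter string can end with a single backslash.
ends_with_backslash = format_unescape(field_list_unescape(r"a\\"), "/")
```

Note that stage 2 depends on the kind of end specification: an escaped SOLIDUS is only unescaped inside /STRING/, and an escaped COMMERCIAL AT only inside @STRING@.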
See examples below.
Read field "a", "b", "c" separated by TAB in a line:
"a,b,c"
Read a "csv" file with field names "a", "b", "c":
-t "," -z q "a,b,c"
(Csv (comma separated value) file has fields separated by "," with "" quoting for string fields. Above example shows the case that has only data lines.)
Read words of a C source file:
-t "[^a-zA-Z0-9_]+" -z qQ "words:*"
Read characters one by one:
"c:(1)*"
Read hexadecimal value:
v::%x
Conversion to .csv file (data lines only):
-t "," -z q
Assuming the input D-file is:
name:MIYAZAWA
point:67
point:72
point:36
a line like:
MIYAZAWA: 67, 72, 36
can be produced with:
"name:/:\ /,point:/\,\ /"
(Note SPACE and COMMA are escaped with REVERSE SOLIDUS. Note also repeat is not required for the field "point" entry. COMMA and SPACE are inserted only between the "point" fields, and not at the end of the output line.)
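The effect of this format can be modelled in a few lines. The sketch below is not how Dtie or DtoLine are implemented: the function name tie_line and the dictionary of per-field delimiters are hypothetical, and it assumes, following the note above, that a delimiter is written after every field except the last one of the line.

```python
def tie_line(record, delimiters, default="\t"):
    """Join the fields of a record, each followed by its entry's delimiter."""
    parts = []
    last = len(record) - 1
    for i, (name, value) in enumerate(record):
        parts.append(value)
        if i < last:
            # No delimiter is written after the final field of the line.
            parts.append(delimiters.get(name, default))
    return "".join(parts)


record = [("name", "MIYAZAWA"), ("point", "67"), ("point", "72"), ("point", "36")]
line = tie_line(record, {"name": ": ", "point": ", "})
```

With the record above, line is "MIYAZAWA: 67, 72, 36". Fields without an entry fall back to TAB, mirroring the default field entry.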
Using C-format to enclose each value in parentheses:
value::(%s)
Dintro, D_lsa, DfromLine, DtoLine, Dtie, Duntie, Dpack, Dunpack, Dpr.
MIYAZAWA Akira