D_fmt - field-format-list of D-commands

DESCRIPTION

D_fmt is a notation that specifies how a character string is converted to and from D-fields. It is written as additional information attached to a field-list (see the manual of Dintro). A specification for converting a character string into D-fields is called an "input" format; it is typically used in DfromLine, and also in Duntie and Dunpack. A specification for converting D-fields into a character string is called an "output" format; it is typically used in DtoLine, and also in Dtie and Dpack. Dpr uses part of the output format as well. Input and output formats share the same syntax but differ in semantics.

An input or output format action maintains a "current position" (cp) on the character string during the conversion. It is expressed as the number of characters before the position, that is, a character position counted from 0. D_fmt knows nothing about the encoding and has no concept of bytes or octets.

SYNTAX

field-format-list ::= field-list of field-name[:format].
format ::= [start] [end] [options] [repeat] [:cfmt]
start ::= absolute | relative
  absolute ::= DIGITS
  relative ::= +DIGITS
end ::= end-position | length | delimiter | pattern
  end-position ::= -DIGITS
  length ::= (DIGITS)
  delimiter ::= /STRING/
  pattern ::= @STRING@
options ::= { alignment | quoting | with-field-name }..
  alignment ::= l | r | n
  quoting ::= q | x | Q | b | n
  with-field-name ::= f | n
repeat ::= *
cfmt ::= scanf(3) format with one %

DIGITS is a string of the characters "0" to "9". STRING is an arbitrary character string (see the SPECIAL CHARACTERS IN THE FORMAT section below).
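
As a rough illustration of the grammar above, the following Python sketch splits one field entry into its parts with a regular expression. It is not the parser used by the D-commands: the names FORMAT_RE and parse_entry are invented for this illustration, and the sketch assumes the entry has already passed the shell and the field-list parsing stage.

```python
import re

# Hypothetical helper, not part of the D-commands: one regular expression
# that mirrors  format ::= [start] [end] [options] [repeat] [:cfmt]
FORMAT_RE = re.compile(
    r"""
    (?P<start>\+?\d+)?              # absolute (DIGITS) or relative (+DIGITS)
    (?P<end>-\d+                    # end-position
        |\(\d+\)                    # length
        |/(?:\\.|[^/\\])*/          # delimiter, REVERSE SOLIDUS escapes allowed
        |@(?:\\.|[^@\\])*@          # pattern
    )?
    (?P<options>[lrnqxQbf]*)        # alignment, quoting, with-field-name
    (?P<repeat>\*)?                 # repeat
    (?::(?P<cfmt>.*))?              # :cfmt
    """,
    re.VERBOSE | re.DOTALL,
)


def parse_entry(entry):
    """Split 'field-name[:format]' into the name and a dict of format parts."""
    name, _, fmt = entry.partition(":")
    m = FORMAT_RE.fullmatch(fmt)
    if m is None:
        raise ValueError(f"not a valid format: {fmt!r}")
    # Keep only the parts that are actually present in the entry.
    parts = {key: value for key, value in m.groupdict().items() if value}
    return name, parts


print(parse_entry("c:(1)*"))
print(parse_entry("v::%x"))
```

The sketch only recognizes the shape of an entry; it does not check semantic rules such as the restriction on repeat.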

COMMAND OPTIONS AND FIELD-FORMAT-LIST

Two command options -t and -z are valid wherever a D-command takes field-format-list arguments. They are used to give the default values of end and options.

-t STRING
gives the default delimiter STRING. The enclosing SOLIDUS (/) characters are not used in the -t option. STRING is mandatory here. Use -t "" to specify the null string. When -t is not present, the default delimiter is the control character TAB.
-z options
gives the default options. It applies to field entries without any options and to the default format. If any option is given for a field entry, the -z option has no effect on that entry, even if the option belongs to a different category (alignment, quoting, with-field-name).

A D-command may take two field-format-list arguments. In this case, the -t and -z command options give the default values for both field-format-lists.

INPUT FORMAT

Action

The input format action starts from the first field entry of the field-format-list, with the current position (cp) set to zero.

The input scanner moves the cp to the start position given by the format and reads the string up to the end given by the format. When options are given with the field entry, they control the reading. After reading, the string may be converted with the cfmt (if any) to produce one D-field. After a D-field is produced, the cp is moved to a new position determined by the end specification.

If repeat is specified, the same field entry is processed again from the new cp, until the cp reaches the end of the input string or the pattern match fails.

When the processing of a field entry ends, the input scanner moves to the next field entry of the field-format-list (even if the cp is at the end of the input string). If the field name is the null string, the corresponding field is skipped (i.e., it is read in but no D-field is produced). This process is repeated until the end of the field-format-list is reached.

Start

DIGITS
absolute; value is the start position
+DIGITS
relative; the value added to the cp becomes the start position.

Default start is cp.

End

-DIGITS
end-position; the value is the field end position. The new cp moves to the position after it.
(DIGITS)
length; the value is the number of characters to be read in. The new cp moves to the position after the last character read.
/STRING/
delimiter; the regular expression STRING is searched for from the start position, and the string before the match becomes the field. If there is no match or the STRING is null (//), the string up to the end of the input string becomes the field. The new cp moves to the position after the matched delimiter.
@STRING@
pattern; the regular expression STRING is matched from the start position, and the field spans as far as the pattern matches. Usually, but not necessarily, the pattern begins with "^" to anchor the match at the cp. If the pattern does not match from the start position, nothing is read in and the corresponding D-field is not produced (note that this is different from producing a D-field with a null string value). The new cp moves to the character just after the matched pattern. When the pattern does not match, the cp does not move.

The default end is the delimiter given by the -t command option, or /TAB/ (the control character TAB) if -t is not provided.
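
The way start and end move the cp for fixed length fields can be sketched in Python as follows. This is only an illustration under stated assumptions: the function read_fixed and its tuple encoding of start and end are invented here, the end-position is taken to be inclusive (consistent with the output rule that the new cp is the end position plus one), and delimiter and pattern ends are left out.

```python
def read_fixed(line, entries):
    """Read fixed length fields; entries are (name, start, end) tuples.

    start is None (use the cp), ("abs", n) or ("rel", n).
    end is ("end", n) for -n or ("len", n) for (n).
    This encoding is hypothetical and exists only for this sketch.
    """
    cp = 0
    fields = []
    for name, start, end in entries:
        if start is None:
            begin = cp                 # default start is the cp
        elif start[0] == "abs":
            begin = start[1]           # DIGITS: absolute position
        else:
            begin = cp + start[1]      # +DIGITS: relative to the cp
        kind, n = end
        # Assumption: the end-position is inclusive, so the slice stops at n + 1.
        stop = n + 1 if kind == "end" else begin + n
        fields.append((name, line[begin:stop]))
        cp = stop                      # the new cp is the next position
    return fields


# Corresponds to the field-format-list "a:0-3,b:+1(2)".
print(read_fixed("ABCD-EF", [("a", ("abs", 0), ("end", 3)),
                             ("b", ("rel", 1), ("len", 2))]))
# [('a', 'ABCD'), ('b', 'EF')]
```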

Options

Alignment options

Alignment options are used for fixed length (i.e., end-position or length) fields to remove leading or trailing space characters. They have no effect on varying length (i.e., delimiter or pattern) fields.

l
left alignment; trailing space characters are truncated.
r
right alignment; leading space characters are removed.

When both l and r are specified, both leading and trailing space characters are removed.
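
In Python terms, the effect of the input alignment options on a fixed length field can be pictured roughly like this. The helper name align_input is invented for this illustration, and it assumes that "space characters" means the SPACE character only.

```python
def align_input(raw, options):
    """Strip SPACEs from a fixed length field according to l and r."""
    if "l" in options:
        raw = raw.rstrip(" ")   # left alignment: drop trailing SPACEs
    if "r" in options:
        raw = raw.lstrip(" ")   # right alignment: drop leading SPACEs
    return raw


print(repr(align_input("ab  ", "l")))    # 'ab'
print(repr(align_input("  ab", "r")))    # 'ab'
print(repr(align_input("  ab ", "lr")))  # 'ab'
```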

Quoting options

Quoting options are used for varying length (i.e., delimiter or pattern) fields to escape delimiter characters in the field. They have no effect on fixed length fields.

q
"" quoting; if a QUOTATION MARK (") is encountered during the delimiter search, the string up to the matching QUOTATION MARK is not treated as a delimiter, and the enclosing QUOTATION MARKs are removed. If a QUOTATION MARK is encountered during pattern matching (and this QUOTATION MARK itself matches the pattern), the string up to the matching QUOTATION MARK becomes part of the field regardless of whether it matches the pattern, and the enclosing QUOTATION MARKs are removed. Unless the x option is given, two consecutive QUOTATION MARKs ("") inside the quoted string become one QUOTATION MARK and do not end the quotation. See the x option below.
x
UNIX mode quotation escape; used with the q option, it controls how a QUOTATION MARK is written inside a quoted string. There are two conventions. In "csv" files (and historically in much software run on IBM mainframes), "abc""def" yields abc"def; this is the default of the D format. In the UNIX world, "abc\"def" does that. The x option turns on the UNIX mode escape (and turns off the csv mode). In addition, a double REVERSE SOLIDUS (\\) becomes one REVERSE SOLIDUS.
Q
'' quoting; same as q, but the quoting character is APOSTROPHE ('). There is no way to put an APOSTROPHE itself in a string quoted this way; use the q option to include an APOSTROPHE in a quoted string.
b
\ quoting; if a back-slant or REVERSE SOLIDUS (\) is encountered during the delimiter search, this REVERSE SOLIDUS is removed and the following character is not treated as a delimiter. During pattern matching, if the REVERSE SOLIDUS matches the pattern, it is removed and the following character is forced into the field.

When q, Q and b are used together, the first quoting mechanism encountered suppresses the others until it closes. That is, inside a QUOTATION MARK (") quotation, APOSTROPHE (') and REVERSE SOLIDUS (\) are ordinary characters, and inside an APOSTROPHE (') quotation, QUOTATION MARK (") and REVERSE SOLIDUS (\) are ordinary characters. A REVERSE SOLIDUS (\) outside a quotation can cancel an opening quote. Option x does not affect this rule; it is subordinate to option q and is effective only inside QUOTATION MARKs.
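
The csv mode of the q option can be illustrated with a small Python scanner. This is a simplified sketch, not the D-command implementation: split_quoted is an invented name, the delimiter is assumed to be a single literal character rather than a regular expression, the x, Q and b options are not handled, and the hard and soft delimiter rules described later are ignored.

```python
def split_quoted(line, delim=","):
    """Split line on delim, honouring "" quoting in csv mode (q without x)."""
    fields, buf, in_quote, i = [], [], False, 0
    while i < len(line):
        ch = line[i]
        if in_quote:
            if ch == '"':
                if line[i + 1:i + 2] == '"':
                    buf.append('"')      # "" inside a quotation: one mark
                    i += 2
                    continue
                in_quote = False         # closing mark, removed from the field
            else:
                buf.append(ch)           # delimiters are ordinary in here
        elif ch == '"':
            in_quote = True              # opening mark, removed from the field
        elif ch == delim:
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    fields.append("".join(buf))
    return fields


print(split_quoted('a,"b,c","d""e"'))   # ['a', 'b,c', 'd"e']
```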

With field name option

The with-field-name option is a bit tricky.

f
the input string carries not only the value but also its field name. The string read in becomes a whole D-field, including the field name. Therefore, the string has to be in field-name:value form.

When the f option is used, the field name in the field-format-list has no meaning, because the field name is given by the input string. The field name in the field-format-list is ignored and you may write any name there, but it is recommended to use "." as the field name of an entry that has the f option.

The current implementation does not check whether input read with option f really contains a COLON. Users are responsible for feeding proper input to get a valid D-file as a result.

Canceling options

Option n cancels all other options regardless of their category. It is used with the -z option to nullify its effect for a field-list entry. For example, assume a line has four fields "a", "b", "c", "d" separated by TAB, and you want to read fields "b", "c", "d" with the q option and field "a" with no options:

-z q "a:n,b,c,d"

This is the same as the next example:

"a,b:q,c:q,d:q"

Repeat

*
the field is processed repeatedly until the cp reaches the end of the input string (when end is length or delimiter), or until the string from the cp no longer matches the pattern.

For absolute start position and end-position field entry, repeat is invalid.

Cfmt

When cfmt is specified, the string read in does not directly yield the D-field. It is passed to the C language function sscanf, with the cfmt as the format and the string read in as the input string. The cfmt should contain exactly one % element (not counting %%).

If it is %d, %o, %x, %i, %n, %u, %f, %g, %c or %wc, the input string is received by a numeric variable and then converted to a string, which becomes the D-field value. If it is %s, %S or %ws, the input string is received by a string and it becomes the D-field value.

When using %c, %C or %wc, the first byte of the read in string (in file code) is read by an integer variable, and the value is converted to a string which becomes the D-field value. This is the only case in which D-command handles encoding.
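
For the integer conversions, the effect of an input cfmt can be approximated in Python as below. This is a loose analogy, not a reimplementation of sscanf: the name apply_input_cfmt is invented, only the bare forms %d, %o and %x are covered, and it is an assumption of this sketch that the numeric variable is written back as a decimal string.

```python
# Number base implied by each conversion (only three are sketched here).
BASES = {"%d": 10, "%o": 8, "%x": 16}


def apply_input_cfmt(text, cfmt):
    """Read text as an integer in the base of cfmt and return it as a string."""
    value = int(text, BASES[cfmt])   # plays the role of the numeric variable
    return str(value)                # assumption: rendered in decimal


# With the field entry "v::%x", an input of "ff" would then give v:255.
print(apply_input_cfmt("ff", "%x"))
```

Unlike sscanf, int() rejects trailing non-digit characters, so the sketch is stricter than the real conversion.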

Default field entry

When a field entry has no format information, the default field entry is applied (this is in fact the most common case). It is a virtual field entry built from the default values described above: it takes the cp as its start, the -t command option value (or TAB if there is no -t) as its delimiter, the -z command option value (or none if there is no -z) as its options, and nothing for its repeat and cfmt.

Hard and soft delimiter

When the cp is at the end of the input string and the start position is not reset by absolute positioning, usually no field is read in. But there is an exception. For example, suppose the delimiter is a comma and the data has a leading and a trailing comma:

DfromLine -t "," "a,b,c"
,word,

The natural result will be:

a:
b:word
c:

However, consider the case where the delimiter is one or more spaces and the data has a leading and a trailing space instead of commas:

DfromLine -t " +" "a,b,c"
 word 

In this case

a:word

seems to be the acceptable result.

To generalize this situation, we introduce a concept of "hard" delimiter and "soft" delimiter.

For a given delimiter pattern, if a string matches the pattern but the same string doubled does not, the string is defined as "hard" for that pattern. When the doubled string also matches the pattern, the string is defined as "soft".

For example, "," matches the delimiter pattern /,/, but ",," does not. Thus "," is a "hard" delimiter for /,/. By contrast, one SPACE and two SPACEs both match the pattern / +/, so a SPACE is "soft" for / +/.
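
The definition can be tried out directly with Python's re module. The function name classify is invented for this illustration, and the sketch assumes that "matches" means that the whole string matches the pattern.

```python
import re


def classify(pattern, text):
    """Classify text as a 'hard' or 'soft' delimiter for pattern."""
    if re.fullmatch(pattern, text) is None:
        return "no match"
    # Soft if the doubled string also matches, hard otherwise.
    return "soft" if re.fullmatch(pattern, text + text) else "hard"


print(classify(",", ","))    # hard
print(classify(" +", " "))   # soft
```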

When the cp is at the end of the input string and the preceding delimiter is "hard", the corresponding field is read in as a null string. In addition, if there is a "soft" delimiter at the cp, it is skipped before reading. (If there is a "hard" delimiter at the cp, a null string is read in.) This applies, of course, only to delimiter field entries, not to fixed length or pattern field entries.

For example:

DfromLine -t " *, *| +" a,b,c,d,e
  A B , ,D  ,

result is:

a:A
b:B
c:
d:D
e:

Note that "  " at the top of the input line is "soft" delimiter, while " , " after "B" and " ," at the end are "hard" delimiters for the pattern " *, *| +".
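
The rules above can be put together in a short Python sketch that reproduces the three examples of this section. It is a reading of the description, not the DfromLine source: the names is_hard and from_line are invented, only default delimiter field entries are handled (no options, repeat or cfmt), and resetting the "hard" state after the trailing null field is an assumption of this sketch.

```python
import re


def is_hard(pattern, text):
    """True if text matches pattern as a whole but text doubled does not."""
    return (re.fullmatch(pattern, text) is not None
            and re.fullmatch(pattern, text + text) is None)


def from_line(line, names, delimiter):
    """Split line into (name, value) pairs using delimiter field entries."""
    rx = re.compile(delimiter)
    cp = 0
    prev_hard = False            # kind of the delimiter consumed just before cp
    fields = []
    for name in names:
        m = rx.match(line, cp)
        if m and m.end() > cp and not is_hard(delimiter, m.group()):
            cp = m.end()         # skip a soft delimiter found at the cp
        if cp >= len(line):
            if prev_hard:        # a hard delimiter closed the line: null field
                fields.append((name, ""))
                prev_hard = False
            continue
        m = rx.search(line, cp)
        if m is None or m.end() == m.start():
            fields.append((name, line[cp:]))   # no delimiter: read to the end
            cp, prev_hard = len(line), False
        else:
            fields.append((name, line[cp:m.start()]))
            cp, prev_hard = m.end(), is_hard(delimiter, m.group())
    return fields


for name, value in from_line("  A B , ,D  ,", list("abcde"), " *, *| +"):
    print(f"{name}:{value}")
```

Run on the other two examples, from_line(",word,", ["a", "b", "c"], ",") gives three fields with null first and last values, and from_line(" word ", ["a", "b", "c"], " +") gives the single field a:word.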

When the input string is the null string, the cp is at the end of the input string from the beginning, and you cannot tell whether the preceding delimiter is "hard" or "soft". In this case, "hard" is assumed if the delimiter of the first field is the null string (which means reading the whole string), and "soft" is assumed otherwise. In other words, a null input string usually yields nothing, but yields a field with a null string value when the field is meant to read in the whole string by -t "" or ://.

This definition of "hard" and "soft" may not be precise enough to satisfy a mathematician's standards. In fact, some oddities are observed with -t ",,?,?", where "," is "soft" but ",," is "hard". A better definition of hard and soft may exist. But the current definition works well for most usual cases, and considering the cost of detection, it is not wise to employ a more complicated definition. (There may, however, be a simpler definition, such as treating spaces as soft and all other characters as hard.)

OUTPUT FORMAT

Action

There are two types of output action. One is D-record order output, which is the default. The other is field-format-list order output, which is used when the -p command option is given to DtoLine. The two types differ only in the order in which fields are output.

In D-record order output, the action starts with the first D-field of the given D-record as the current field (in the case of Dtie, the first D-field of the subset of the D-record). The cp is set to zero. Then the output routine searches the field-format-list for a field entry that has the same name as the current field. If one is found, it becomes the current field entry and controls the following output action. If none is found, the default field entry becomes the current field entry. After the current field is output, the next D-field of the D-record (or of the subset, in the case of Dtie) becomes the current field. This action is repeated until all the D-fields in the D-record or in the subset have been processed.

In field-format-list order output, the first field entry of the field-format-list becomes the current field entry, and the cp is set to zero. Then the output routine searches the D-record for fields with the same field name. If one is found, it becomes the current field and the output action is taken. If two or more fields with the same name are found, the second and following ones become the current field in turn. After all the fields found have been processed, or if no field was found at all, the next field entry of the field-format-list becomes the current field entry. This action is repeated until the whole field-format-list has been processed.
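
The difference between the two orders can be shown with a small Python sketch. A D-record is modelled here as a list of (name, value) pairs and a field-format-list as a list of field names; both models and the function names are invented for this illustration.

```python
def d_record_order(record, entry_names):
    """Pair each field, in record order, with the entry that controls it."""
    return [((name, value), name if name in entry_names else "<default>")
            for name, value in record]


def list_order(record, entry_names):
    """Pair fields with entries, walking the field-format-list first."""
    # As read from the description, fields whose name is not in the list
    # never become the current field, so they do not appear here.
    return [((name, value), entry)
            for entry in entry_names
            for name, value in record
            if name == entry]


record = [("b", "1"), ("a", "2"), ("c", "9"), ("b", "3")]
print([field for field, _ in d_record_order(record, ["a", "b"])])
print([field for field, _ in list_order(record, ["a", "b"])])
```

In the first result the fields keep the order b, a, c, b; in the second they come out as a, b, b.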

Once the current field and the current field entry are set, the cfmt conversion (if any) is made. Then the start specification of the current field entry adjusts the cp, and the end specification either controls the output length or appends a delimiter after the output field. After the output, the cp is at the position just after the output string.

There is no "buffering" at the output routine level. Once a field is output, the cp cannot go back to a smaller position. If you want to output the field "b" in columns 1-4 and the field "a" in columns 6-9, the field "b" must be processed before the field "a".

Start

DIGITS
absolute; value is the start position; if the cp is smaller than the value, SPACEs are padded. If the cp is already larger than the value, a warning message is printed and output starts from the cp.
+DIGITS
relative; the value added to the cp becomes the start position; SPACEs are inserted before output.

Default start is cp.

End

-DIGITS
end-position; the field ends at the value position; truncation or SPACE padding is made depending on the alignment option. If the start position is already larger than the end-position, a warning message is printed and no output is made. The new cp moves to the end position plus one.
(DIGITS)
length; output field is truncated or SPACE padded to be the value length depending on the alignment option; New cp moves to the start position plus the length.
/STRING/
delimiter; the STRING is added after the field unless it is the last field to be output. New cp moves to the start position plus output field length and STRING length.

There is no pattern for the output formats.

Options

Alignment options

Alignment options are used for fixed length (i.e., end-position or length) fields. SPACE characters are padded when the output value is shorter than the field length, and the value is truncated when it is longer. They have no effect on varying length (i.e., delimiter) fields.

l
left alignment; SPACEs are added after the value, or truncation from the end of the string is made.
r
right alignment; SPACEs are added before the value, or truncation from the top of the string is made.

The alignment options of the output format are mutually exclusive. When both l and r are specified, however, the l option takes precedence over the r option.
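
Padding and truncation for a fixed length output field can be sketched in Python as follows. The function name fit is invented for this illustration; because the text does not say what happens when neither l nor r is given, the sketch raises an error in that case rather than guessing.

```python
def fit(value, width, options):
    """Pad or truncate value to width characters according to l or r."""
    if "l" in options:                      # l takes precedence over r
        return value[:width].ljust(width)   # cut the end, pad after
    if "r" in options:
        if len(value) > width:
            return value[len(value) - width:]   # cut from the top
        return value.rjust(width)               # pad before
    raise ValueError("default alignment is not covered by this sketch")


print(repr(fit("abc", 5, "l")))      # 'abc  '
print(repr(fit("abcdef", 4, "r")))   # 'cdef'
print(repr(fit("abcdef", 4, "lr")))  # 'abcd'
```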

Quoting options

Quoting options are used for varying length (i.e., delimiter) fields to escape delimiter characters in the field. They have no effect on fixed length fields.

q
"" quoting; the whole value is enclosed in QUOTATION MARKs ("). If the value contains a QUOTATION MARK, it is doubled or preceded by a REVERSE SOLIDUS, depending on the x option.
x
UNIX mode QUOTATION MARK escape; used with the q option. By default, a QUOTATION MARK in the value is doubled within "". When this option is specified, it is instead converted to REVERSE SOLIDUS and QUOTATION MARK (\"). In addition, a REVERSE SOLIDUS before a REVERSE SOLIDUS or QUOTATION MARK is doubled.
Q
'' quoting; same as q, but the quoting character is APOSTROPHE ('). An APOSTROPHE in the value is left as it is, so you should avoid using this option for values that contain an APOSTROPHE.
b
\ quoting; if the first character of the delimiter appears in the value, it is preceded by REVERSE SOLIDUS (\). REVERSE SOLIDUS in the value is doubled.

The quoting options of the output format are mutually exclusive. When q, Q and b are given at the same time, however, q takes the highest precedence, followed by Q.
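
Two of the output quoting rules are simple enough to sketch in Python: q in its default csv mode, and b. The function names are invented for this illustration, and the x and Q options are left out.

```python
def quote_q(value):
    """q without x: enclose in QUOTATION MARKs and double the inner ones."""
    return '"' + value.replace('"', '""') + '"'


def quote_b(value, delimiter):
    """b: escape the first delimiter character and double REVERSE SOLIDUS."""
    first = delimiter[:1]
    out = []
    for ch in value:
        if ch == "\\":
            out.append("\\\\")       # REVERSE SOLIDUS in the value is doubled
        elif ch == first:
            out.append("\\" + ch)    # first delimiter character is escaped
        else:
            out.append(ch)
    return "".join(out)


print(quote_q('say "hi"'))        # "say ""hi"""
print(quote_b("a,b\\c", ","))     # a\,b\\c
```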

With field name option

f
the value is output with its field name and a COLON before it.

Canceling options

Option n cancels all other options. This is used with -z option to nullify its effect for a field-format-list entry.

Repeat

Repeat in the output format is ignored, because whether a field actually repeats is determined by the occurrences of the field in the D-record, even in field-format-list order output. Repeat has meaning only when the format is used as a leaf-field-list; see the manual of D_lsa.

Cfmt

When cfmt is specified, the field value is converted before output. The conversion is made by the C language function sprintf, with the cfmt as the format and the field value as the variable. The cfmt must contain exactly one % element (not counting %%). If it is %d, %o, %x, %i, %n, %u, %c or %wc, the field value is converted into an integer, which becomes the variable. If it is %f or %g, the field value is converted to a double, which becomes the variable. If it is %s, %S or %ws, the field value string is passed as the variable, as a multi-byte string or a wide character string. Then the sprintf(3) conversion is made and the resulting string is treated as the value to be output.

When using %wc, the field value must be a valid internal code value.

Option f is not allowed with cfmt.
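
Python's % operator follows printf conventions closely enough to illustrate the output cfmt. The helper apply_output_cfmt is invented for this illustration; it covers only the conversions d, o, x, i, u, c, f, g and s, and it uses int() and float(), which are stricter than the conversions a C program would typically use.

```python
import re


def apply_output_cfmt(value, cfmt):
    """Format a field value with a printf-like cfmt holding one % element."""
    # Find the conversion letter of the single % element, ignoring %%.
    m = re.search(r"%[^%a-zA-Z]*([a-zA-Z])", cfmt.replace("%%", ""))
    if m is None:
        raise ValueError("cfmt needs one % element")
    conv = m.group(1)
    if conv in "doxiuc":
        arg = int(value)      # integer conversions
    elif conv in "fg":
        arg = float(value)    # floating point conversions
    else:
        arg = value           # %s: the string itself
    return cfmt % arg


print(apply_output_cfmt("abc", "(%s)"))   # (abc)
print(apply_output_cfmt("255", "%x"))     # ff
```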

Default field entry

When there is no format information in the field-format-list or the field entry is not found, the default field entry is applied. It is a virtual field entry built from the default values described above: it takes the cp as its start value, the -t command option value (or TAB if there is no -t) as its delimiter, the -z command option value (or none if there is no -z) as its options, and nothing for its cfmt.

SPECIAL CHARACTERS IN THE FORMAT

You have to be careful to escape the special effects of the shell and of D-commands when using characters such as SPACE, REVERSE SOLIDUS (\) or SOLIDUS (/) in a format.

Within /STRING/ of a format, you need an extra REVERSE SOLIDUS (\) before each space, COMMA (,), CIRCUMFLEX ACCENT (^), SOLIDUS (/) and REVERSE SOLIDUS (\). Within @STRING@, COMMERCIAL AT (@) needs the extra REVERSE SOLIDUS instead of SOLIDUS.

It may be useful to learn how command arguments are processed by the shell and by D-commands. First, the shell parses your input and separates the arguments at space characters.

In the UNIX shells, space characters can be included in a command argument by putting REVERSE SOLIDUS (\) before them or by quoting with QUOTATION MARK (") or APOSTROPHE ('). Other special characters like GREATER-THAN SIGN (>) or AMPERSAND (&) can be included using the same quoting mechanisms. There are minor differences in quoting mechanisms between sh(1) and csh(1) (and other shells). Generally it is safe to quote your field-format-list with APOSTROPHE ('), provided that the list does not contain an APOSTROPHE.

In the Windows command window shell, space characters can be included in a command argument only by quoting with QUOTATION MARK ("). Other special characters of the Windows shell, i.e., GREATER-THAN SIGN (>), VERTICAL LINE (|) or AMPERSAND (&), can be included by quoting with QUOTATION MARK or by putting CIRCUMFLEX ACCENT (^) before them. To use CIRCUMFLEX ACCENT in a command argument, put it in QUOTATION MARKs or double the CIRCUMFLEX ACCENT. In addition to the above characters, PERCENT SIGN (%) has to be escaped when it would otherwise form a valid environment variable reference. For example, %path% is replaced with the path directory names by the shell. In most cases you need not worry about this. But if you use words like %xx% in a command argument, you can put CIRCUMFLEX ACCENT before the second % (or before x). You cannot escape it with QUOTATION MARKs.

These quoting characters are removed before the argument is handed to a D-command.

After the shell processing, a field-format-list argument is parsed by a D-command in two stages. The first stage is the field-list parsing, in which space characters, COMMA (,) and CIRCUMFLEX ACCENT (^) have special syntactic functions. These special characters can be included in a field name or in a format by placing REVERSE SOLIDUS before them. After the parsing, these REVERSE SOLIDUSs are removed. REVERSE SOLIDUSs before other characters are left intact. Note that "\^" is changed to "^" regardless of its position, despite the fact that CIRCUMFLEX ACCENT has a special meaning only at the top of the field-list.

The second stage is the format parsing, in which SOLIDUS (/) or COMMERCIAL AT (@) is used as a syntactic element. Again, these can be escaped by putting REVERSE SOLIDUS before them. In addition to these characters, REVERSE SOLIDUS itself needs to be doubled. (Otherwise you could not make a delimiter string that ends with '\'.) Finally, the REVERSE SOLIDUS before REVERSE SOLIDUS, and before SOLIDUS (in the case of delimiter) or before COMMERCIAL AT (in the case of pattern), is removed.
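
The two D-command stages can be imitated in Python to see what a format looks like after each of them. This is a sketch under several assumptions: the function names are invented, the first stage splits on unescaped COMMA only (the syntactic roles of unescaped spaces and CIRCUMFLEX ACCENT are not modelled), and a REVERSE SOLIDUS followed by any other character is copied as a pair.

```python
def split_field_list(arg):
    """Stage 1: split on unescaped COMMA; unescape SPACE, COMMA and ^."""
    entries, buf, i = [], [], 0
    while i < len(arg):
        ch = arg[i]
        if ch == "\\" and i + 1 < len(arg):
            nxt = arg[i + 1]
            # Drop the REVERSE SOLIDUS before a special character,
            # keep it before anything else.
            buf.append(nxt if nxt in " ,^" else ch + nxt)
            i += 2
        elif ch == ",":
            entries.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(ch)
            i += 1
    entries.append("".join(buf))
    return entries


def unescape_body(body, closer="/"):
    """Stage 2: unescape REVERSE SOLIDUS and the closing / (or @)."""
    out, i = [], 0
    while i < len(body):
        if body[i] == "\\" and body[i + 1:i + 2] in ("\\", closer):
            out.append(body[i + 1])
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


# The argument as the D-command receives it, after the shell removed the quotes.
print(split_field_list(r"name:/:\ /,point:/\,\ /"))
# ['name:/: /', 'point:/, /']
print(unescape_body(r"a\/b\\"))
```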

See examples below.

EXAMPLES (Input)

Read fields "a", "b", "c" separated by TAB in a line:

"a,b,c"

Read a "csv" file with field names "a", "b", "c":

-t "," -z q "a,b,c"

(A csv (comma separated value) file has fields separated by "," with "" quoting for string fields. The example above covers the case of a file that has only data lines.)

Read words of a C source file:

-t "[^a-zA-Z0-9_]+" -z qQ "words:*"

Read characters one by one:

"c:(1)*"

Read hexadecimal value:

v::%x

EXAMPLES (Output)

Conversion to a .csv file (data lines only):

-t "," -z q

Assuming the input D-file is like:

name:MIYAZAWA
point:67
point:72
point:36

to get a line like:

MIYAZAWA: 67, 72, 36

use this field-format-list:

"name:/:\ /,point:/\,\ /"

(Note that SPACE and COMMA are escaped with REVERSE SOLIDUS. Note also that repeat is not required for the "point" field entry. COMMA and SPACE are inserted only between the "point" fields, not at the end of the output line.)
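
The delimiter rule behind this example, that the delimiter is appended after every field except the last one output, can be sketched in Python. The function to_line and the dictionary of delimiters are invented for this illustration; the delimiters are written as they stand after the escapes have been removed, and options, cfmt and fixed length fields are not modelled.

```python
def to_line(record, delimiters, default="\t"):
    """Join field values in D-record order, each followed by its delimiter."""
    parts = []
    for index, (name, value) in enumerate(record):
        parts.append(value)
        if index != len(record) - 1:       # no delimiter after the last field
            parts.append(delimiters.get(name, default))
    return "".join(parts)


record = [("name", "MIYAZAWA"), ("point", "67"), ("point", "72"), ("point", "36")]
print(to_line(record, {"name": ": ", "point": ", "}))
# MIYAZAWA: 67, 72, 36
```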

Using cfmt to get each value enclosed in ():

value::(%s)

SEE ALSO

Dintro, D_lsa, DfromLine, DtoLine, Dtie, Duntie, Dpack, Dunpack, Dpr.

AUTHOR

MIYAZAWA Akira


miyazawa@nii.ac.jp
2003