D_fmt : D-2.6

INPUT FORMAT

Action

Input format processing starts at the first field entry of the field-format-list, with the current position (cp) set to zero.

The input scanner moves the cp to the start position given by the format and reads the string up to the end given by the format. When options are given with the field entry, they control the read-in action. After reading, the string may be converted with the C-format (if any) to produce one D-field. After producing a D-field, cp is moved to a new position determined by the end specification.

If repeat is specified, the same field entry is processed again with the new cp, until the cp reaches the end of the input string or the pattern match fails.

When the processing of a field entry ends, the input scanner moves to the next field entry of the field-format-list (even if the cp is at the end of the input string). If the field name is a null string, the corresponding field is skipped (i.e., it is read in but no D-field is produced). This process is repeated until the end of the field-format-list is reached.

Start

DIGITS: absolute; value is the start position
+DIGITS: relative; the value added to the cp becomes the start position.

Default start is cp.

End

-DIGITS: end-position; the value is the field end position; New cp moves to the next position.
(DIGITS): length; the value is number of characters to be read in; New cp moves to the next position.
/STRING/: delimiter; a regular expression STRING is scanned from the start position, and the string before the matched string becomes the field. If there is no matched string or the STRING is null (//), the string up to the end of input string becomes the field. New cp moves to the position after the matched delimiter.
@STRING@: pattern; the regular expression STRING is matched from the start position, and the field spans as far as the pattern matches. Usually, but not necessarily, the pattern begins with "^" to ensure that it matches exactly at the cp. If the pattern does not match from the start position, nothing is read in and the corresponding D-field is not produced. (Note that it does not produce a D-field with a null string value.) New cp moves to the character just after the matched pattern. When the pattern does not match, cp moves to the next character.

The default end is the delimiter given by the -t command option, or /TAB/ (the control character tab) if -t is not provided.
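The delimiter and pattern rules above can be sketched in Python as follows. This is only an illustration, not the D-commands implementation: the function names are hypothetical, Python's re module stands in for whatever regular expression engine D-commands uses, positions are treated as zero-based offsets, and the pattern is assumed to be anchored at the start position.

```python
import re

def read_delimited(line, cp, delimiter):
    """Sketch of a /STRING/ end: return (field, new_cp)."""
    # A null delimiter (//) or no match reads up to the end of the input.
    match = re.compile(delimiter).search(line, cp) if delimiter else None
    if match is None:
        return line[cp:], len(line)
    # The field is the text before the delimiter; cp moves past the delimiter.
    return line[cp:match.start()], match.end()

def read_pattern(line, cp, pattern):
    """Sketch of an @STRING@ end: return (field or None, new_cp)."""
    # Assumption: the pattern must match starting exactly at cp.
    match = re.match(pattern, line[cp:])
    if match is None:
        # No D-field is produced and cp moves to the next character.
        return None, cp + 1
    return match.group(0), cp + match.end()

first = read_delimited("ab,cd", 0, ",")
second = read_delimited("ab,cd", first[1], ",")
digits = read_pattern("123abc", 0, "^[0-9]+")
no_match = read_pattern("abc", 0, "^[0-9]+")
```

Here first is the field "ab" with the cp moved past the comma, and no_match carries None to show that a failed pattern produces no D-field at all.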

Options

Alignment options

Alignment options are used for fixed length (i.e., end-position or length) fields to remove leading or trailing space characters. They have no effect on varying length (i.e., delimiter or pattern) fields.

l: left alignment; trailing space characters are removed.
r: right alignment; leading space characters are removed.

When both l and r are specified, both leading and trailing space characters are removed.
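A minimal Python sketch of the input alignment options for a fixed length field. The function name is hypothetical, and "space characters" is assumed here to mean the SPACE character only.

```python
def align_input(raw, options):
    """Apply the l and r input alignment options to a fixed length field."""
    if "l" in options:
        raw = raw.rstrip(" ")   # left alignment: drop trailing SPACEs
    if "r" in options:
        raw = raw.lstrip(" ")   # right alignment: drop leading SPACEs
    return raw

left = align_input("ab  ", "l")
right = align_input("  ab", "r")
both = align_input(" ab ", "lr")
```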

Quoting options

Quoting options are used for varying length (i.e., delimiter or pattern) fields to escape delimiter characters in the field. They have no effect on fixed length fields.

q: "" quoting; if a QUOTATION MARK (") is encountered during the delimiter search, the string up to the matching QUOTATION MARK is not treated as a delimiter, and the enclosing QUOTATION MARKs are removed. If a QUOTATION MARK is encountered during pattern matching (and this QUOTATION MARK matches the pattern), the string up to the matching QUOTATION MARK becomes part of the field regardless of whether it matches the pattern, and the enclosing QUOTATION MARKs are removed. Unless the x option is given (see below), two consecutive QUOTATION MARKs ("") in the quoted string become one QUOTATION MARK and do not end the quotation.
x: UNIX mode quotation escape; used with the q option, it controls how a QUOTATION MARK is written inside the quoted string. There are two conventions. In "csv" files (and historically, in much software run on IBM mainframes) "abc""def" yields abc"def, and this is the default of the D format. In the UNIX world, "abc\"def" does that. The x option turns on the UNIX mode escape (and turns off the csv mode). In addition, a doubled REVERSE SOLIDUS (\\) becomes one REVERSE SOLIDUS.
Q: '' quoting; same as q, but the quoting character is APOSTROPHE ('). There is no way to put an APOSTROPHE itself in this quoted string. Use the q option to include an APOSTROPHE in a quoted string.
b: \ quoting; if a REVERSE SOLIDUS (\), or backslash, is encountered during the delimiter search, it is removed and the following character is not treated as a delimiter. During the pattern match, if the REVERSE SOLIDUS matches the pattern, it is removed and the following character is forced into the field.

When q, Q or b are used together, the first one encountered suppresses the other quoting mechanisms until it closes. That means that in a QUOTATION MARK (") quotation, APOSTROPHE (') and REVERSE SOLIDUS (\) are normal characters, and in an APOSTROPHE (') quotation, QUOTATION MARK (") and REVERSE SOLIDUS (\) are normal characters. A REVERSE SOLIDUS (\) outside a quotation can cancel an opening quote. Option x does not affect the above rule. It is subordinate to option q and effective only inside QUOTATION MARKs.
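The two conventions for a QUOTATION MARK inside a q-quoted string can be sketched in Python as below. The function name and its unix_mode parameter are hypothetical, and the sketch only handles a string that is already known to be enclosed in QUOTATION MARKs; it does not model the delimiter search itself.

```python
def unquote(text, unix_mode=False):
    """Remove the enclosing QUOTATION MARKs and resolve embedded escapes."""
    if len(text) < 2 or text[0] != '"' or text[-1] != '"':
        raise ValueError("expected a string enclosed in QUOTATION MARKs")
    body = text[1:-1]
    out = []
    i = 0
    while i < len(body):
        pair = body[i:i + 2]
        if unix_mode and pair in ('\\"', '\\\\'):
            out.append(pair[1])      # x option: \" gives ", \\ gives \
            i += 2
        elif not unix_mode and pair == '""':
            out.append('"')          # csv mode: "" gives "
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)

csv_value = unquote('"abc""def"')
unix_value = unquote('"abc\\"def"', unix_mode=True)
```

Both calls yield the six character value abc"def.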

With field name option

With-field-name option is a bit tricky.

f: the input string carries not only the value but also its field name. The string read in becomes a whole D-field, including the field name. Therefore, the string has to be in the form field-name:value.

When the f option is used, the field name in the field-format-list has no meaning, because the field name is given by the input string. The field name in the field-format-list is ignored and you may write any name there. However, it is recommended to use "." as the field name for an entry that has the f option.

The current implementation does not check whether input read with option f really contains a COLON. Users are responsible for feeding proper input to get a valid D-file as a result.

Canceling options

Option n cancels all other options regardless of their category. It is used with the -z option to nullify its effect for a field-format-list entry. For example, assume a line has four fields "a", "b", "c", "d" separated by TAB, and you want to read fields "b", "c", "d" with the q option and field "a" with no options.

-z q "a:n,b,c,d"

This is the same as the following:

"a,b:q,c:q,d:q"

Repeat

*: the field is processed repeatedly, until the cp reaches the end of input string (when end is length, or delimiter).

For absolute start position and end-position field entry, repeat is invalid.

C-format

When the C-format is specified, the input string is further converted by sscanf (C language) function to yield the final string. The input string is scanned by sscanf function using the C-format as the format, and the scanned value is converted to the result string.

The C-format must have one and only one effective format specifier in it. The format specifier begins with PERCENT SIGN (%) and has the following form:

%[*][width][size]type

A format specifier marked with * is not counted as effective. A sequence %% which matches single PERCENT SIGN is not counted as effective, either.

In the C-format, type must be one of the following:

'd', 'i', 'o', 'u', 'x', 'X', 'f', 'e', 'E', 'g', 'G', 's', 'S', 'c', 'C'

If present, size must be one of the following:

'h', 'hh', 'l', 'll', 'L', 'w', 'I64'

Width is a decimal integer which controls the maximum number of characters used for the conversion. In the C-format, when the type is 'c' or 'C', width must be either omitted or 1. If you want to read multiple characters regardless of spaces, use the length in the end specification of the D_fmt.

When the type is 's' or 'S', the converted string is received by a string variable and becomes the final value. In all other types, the converted value is received by an appropriate type variable, and then converted to a numeric value. In the current implementation of D-commands, internal representation of numeric values is "double" type. Conversion from each type to "double" type follows the type conversion rules of the C language. Finally, the numeric value is represented by character string following numeric value representation of D-commands.

The type 'c' is no exception to the rule above. One character (byte) of the input string is received by a "char" type variable and then converted to "double". Therefore the result value is the internal code value of the character. When size 'l' or 'w' is used with the type 'c', or the type is 'C', one multi-byte character is converted to its internal code. The type 'c' of the C-format is the only place where D_fmt handles character codes in D-commands.

When a C-format is specified, you cannot use the option f in the same format entry.

The C-format specification of D-commands may not be accepted by your runtime library. For example, size 'I64' is accepted by Windows, but not by other runtime environments. Standard C accepts type 'n', but it is not accepted by the C-format of D-commands. For the C-format conversion to work properly, a C-format specification must be accepted both by this D-commands specification and by your runtime library's sscanf specification.

There are some more points to note when using a C-format with non-ASCII characters. The sscanf function operates on the locale character code. When you are using the UTF I/O feature, some characters may not be representable in your locale character code. In this case, these characters are discarded before the sscanf operation. With a locale character code that has a multi-byte representation, some C-format specifications may cause character code errors. For example, the C-format "%1s" may pick up only part of a multi-byte character. In this case, such erroneous characters are discarded from the final value. If you are using ASCII characters only, these problems do not occur.

Default field entry

When a field entry does not have any format information, the default field entry is applied. (In practice this is the most common case.) It is an imaginary field entry with the default values described above. It takes cp as its start, the -t command option value (or TAB if there is no -t) as its delimiter, the -z command option value (or none if there is no -z) as its options, and none for its repeat and C-format.

Hard and soft delimiter

When the cp is at the end of the input string, and the start position is not reset by absolute positioning, usually no field is read in. But there is an exception. For example, suppose the delimiter is a comma and the data has a leading and a trailing comma:

DfromLine -t "," "a,b,c" ,word,

the natural result is:

a: b:word c:

However, when the delimiter is one or more spaces and the data has leading and trailing spaces instead of commas:

-t " +" "a,b,c" word

In this case

a:word

seems to be the acceptable result.

To generalize this situation, we introduce a concept of "hard" delimiter and "soft" delimiter.

For a given delimiter pattern, if a string matches the pattern but the doubled string of the same one does not, the string is defined as "hard" for that pattern. When the doubled string also matches the given pattern, it is defined as "soft".

For example, "," matches the delimiter pattern /,/, but ",," does not. Thus, "," is "hard" delimiter for /,/. On the contrary, one SPACE and two SPACEs both match the pattern / +/, thus it is "soft" for / +/.
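The definition can be sketched in Python. The function name is hypothetical, "matches" is taken to mean that the whole string matches the pattern, and Python's re module is assumed to agree with the D-commands regular expression engine for these simple patterns.

```python
import re

def delimiter_kind(pattern, text):
    """Classify text as a "hard" or "soft" delimiter for pattern."""
    regex = re.compile(pattern)
    if regex.fullmatch(text) is None:
        return None                      # not a delimiter at all
    # Soft: the doubled string still matches; hard: it does not.
    return "soft" if regex.fullmatch(text + text) else "hard"

comma = delimiter_kind(",", ",")
space = delimiter_kind(" +", " ")
spaced_comma = delimiter_kind(" *, *| +", " , ")
```

With these inputs, comma is "hard", space is "soft", and spaced_comma is "hard".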

When the cp is at the end of the input string, and the preceding delimiter is "hard", the corresponding field is read in as a null string. In addition, if there is a "soft" delimiter at the cp, it is skipped before reading. (If there is a "hard" delimiter at the cp, a null string is read in.) This, of course, applies only to delimiter field entries and not to fixed length or pattern field entries.

In the following example:

DfromLine -t " *, *| +" a,b,c,d,e A B , ,D ,

the delimiter means a COMMA optionally surrounded by SPACEs, or one or more SPACEs. The result is:

a:A b:B c: d:D e:

Note that the two SPACEs at the top of the input line are a "soft" delimiter, while "SPACE COMMA SPACE" after "B" and "SPACE COMMA" at the end of the line are "hard" delimiters for the pattern " *, *| +".

When the input string is a null string, cp is at the end of the input string from the beginning, and there is no way to tell whether the preceding delimiter is "hard" or "soft". In this case, if the first field's delimiter is a null string, which means to read the whole string, "hard" is assumed; otherwise "soft" is assumed. In other words, a null input string usually yields nothing, but yields a field with a null string value when the field asks to read the whole string by -t "" or ://.

This definition of "hard" and "soft" may not be precise enough to satisfy a mathematician's standard of accuracy. In fact, some oddities are observed with -t ",,?,?" where "," is "soft" but ",," is "hard". A better definition of hard and soft may exist. However, the current definition works well for most usual cases, and considering the cost of detection, it is not wise to employ a more complicated definition. (There may also be a simpler definition, such as spaces are soft and other characters are hard.)

OUTPUT FORMAT

Action

There are two types of output action. One is D-record order output, which is the default action. The other is field-format-list order output. The latter is taken when the -p command option is given to, for example, DtoLine. The two types of action differ only in the order of the output fields.

In D-record order output, the action starts with the first D-field of the given D-record as the current field, or in the case of Dtie, with the first D-field of the subset of the D-record. The cp value is set to zero. Then the output routine searches the field-format-list for the field entry which has the same name as the current field. If found, it becomes the current field entry and the following output action is controlled by that field entry. If not found, the default field entry becomes the current field entry. After the current field is output, the next D-field of the D-record, or the next D-field of the subset (in the case of Dtie), becomes the current field. This action is repeated until all the D-fields in the D-record or in the subset are processed.

In field-format-list order output, the first field entry of the field-format-list becomes the current field entry, and the cp is set to zero. Then the output routine searches the D-record for fields with the same field name. If one is found, it becomes the current field and the output action is taken. If two or more fields with the same name are found, the second and following ones become the current field in turn. After all the found fields are processed, or if no field was found at all, the current field entry moves to the next field entry of the field-format-list. This action is repeated until the whole field-format-list is processed.

Once the current field and the current field entry are set, the C-format conversion is made, if any. Then the start specification of the current field entry adjusts the cp, and the end specification controls the output length or appends a delimiter after the output field. After the output, cp is at the position just after the output string.

There is no "buffering" at the output routine level. Once a field is output, the cp cannot move back to a smaller value. If you want to output the field "b" at columns 1-4 and the field "a" at columns 6-9, the field "b" must be processed before the field "a".

Start

DIGITS: absolute; value is the start position; if the cp is smaller than the value, SPACEs are padded. If the cp is already larger than the value, a warning message is printed and output starts from the cp.
+DIGITS: relative; the value added to the cp becomes the start position; SPACEs are inserted before output.

Default start is cp.

End

-DIGITS: end-position; the field ends at the value position; truncation or SPACE padding is made depending on the alignment option. If the start position is already larger than end-position, warning message is printed and no output is made. New cp moves to the end position plus one.
(DIGITS): length; output field is truncated or SPACE padded to be the value length depending on the alignment option; New cp moves to the start position plus the length.
/STRING/: delimiter; the STRING is added after the field unless it is the last field to be output. New cp moves to the start position plus output field length and STRING length.

There is no pattern for the output formats.

Options

Alignment options

Alignment options are used for fixed length (i.e., end-position or length) fields. SPACE characters are padded when the output value length is shorter than the field length. Or truncation is made when the output value length is longer. They have no effect on varying length (i.e., delimiter) fields.

l: left alignment; SPACEs are added after the value, or truncation from the end of the string is made.
r: right alignment; SPACEs are added before the value, or truncation from the top of the string is made.

The alignment options of the output format are mutually exclusive. When both l and r are specified, the l option takes precedence over the r option.
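The padding and truncation rules for a fixed length output field can be sketched in Python as follows. The function name is hypothetical. The sketch requires l or r to be given, since this section describes only the behavior of those two options.

```python
def fit_output(value, length, options):
    """Pad or truncate value to exactly length characters."""
    if "l" in options:
        # l wins when both are given: truncate from the end, pad after.
        return value[:length].ljust(length)
    if "r" in options:
        # Truncate from the top (keep the tail), pad before the value.
        tail = value[len(value) - length:] if len(value) > length else value
        return tail.rjust(length)
    raise ValueError("this sketch needs the l or r alignment option")

left_padded = fit_output("ab", 4, "l")
right_cut = fit_output("abcdef", 4, "r")
```

Here left_padded is "ab" followed by two SPACEs, and right_cut is "cdef".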

Quoting options

Quoting options are used for varying length (i.e., delimiter) fields to escape delimiter characters in the field. They have no effect on fixed length fields.

q: "" quoting; the whole value is enclosed in QUOTATION MARKs ("). If the value contains a QUOTATION MARK, it is doubled or preceded by a REVERSE SOLIDUS, depending on the x option.
x: UNIX mode QUOTATION MARK escape; used with the q option; by default, a QUOTATION MARK in the value is doubled within "". When this option is specified, it is instead converted to REVERSE SOLIDUS and QUOTATION MARK (\"). In addition, a REVERSE SOLIDUS before a REVERSE SOLIDUS or QUOTATION MARK is doubled.
Q: '' quoting; same as q, but the quoting character is APOSTROPHE ('). An APOSTROPHE in the value is left as it is. You should avoid such usage.
b: \ quoting; if the first character of the delimiter appears in the value, it is preceded by REVERSE SOLIDUS (\). REVERSE SOLIDUS in the value is doubled.

The quoting options of the output format are mutually exclusive. When q, Q or b are specified at the same time, q has the highest precedence, followed by Q.
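A short Python sketch of the q option in its default (csv) mode and of the b option. The function names are hypothetical, and the x option and Q option are deliberately left out.

```python
def quote_q(value):
    """q option, default mode: enclose in "" and double embedded marks."""
    return '"' + value.replace('"', '""') + '"'

def quote_b(value, delimiter):
    """b option: escape the first delimiter character and double each \\."""
    out = []
    for ch in value:
        if ch == "\\":
            out.append("\\\\")           # REVERSE SOLIDUS is doubled
        elif delimiter and ch == delimiter[0]:
            out.append("\\" + ch)        # delimiter character is escaped
        else:
            out.append(ch)
    return "".join(out)

quoted = quote_q('abc"def')
escaped = quote_b("a,b", ",")
```

The value abc"def is written as "abc""def", and a,b is written as a\,b when the delimiter is a COMMA.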

With field name option

f: value is output with its field-name and COLON before it.

Canceling options

Option n cancels all other options. This is used with -z option to nullify its effect for a field-format-list entry.

Repeat

Repeat in the output format is ignored, because whether the field actually repeats or not is determined by the occurrences of the field in the D-record, even in field-format-list order output. Repeat has meaning only when the format is used as a leaf-field-list. See the manual of D_lsa.

C-format

When the C-format is specified, the field value is converted by sprintf (C language) function before output. The C-format is used as the format and the field value as the variable, and the result of the conversion is handed to ordinary D_fmt process.

The C-format must have one and only one format specifier in it. The format specifier begins with PERCENT SIGN (%) and has the following form:

%[flags][width][.precision][size]type

A sequence "%%" denotes a character PERCENT SIGN and is not regarded as the format specifier, here.

The flags must be sequence of following characters:

'#', '0', '-', ' ', '+'

These flags alter the way of representation such as justification, sign or hexadecimal prefixes. See the manual of sprintf (C language function) for the detail.

The type must be one of the following:

'd', 'i', 'o', 'u', 'x', 'X', 'f', 'e', 'E', 'g', 'G', 's', 'S', 'c', 'C'

If present, size must be one of the following:

'h', 'hh', 'l', 'll', 'L', 'w', 'I64'

The width and precision are decimal integers which control a certain number of characters in the result string, depending on the type. See the manual of sprintf (C language function) for the detail. It should be noted here that the width and the precision do not limit the length of result string in the case of numeric conversion. This is by the specification of sprintf. It is recommended to use D_fmt length in the end specification with C-format to control maximum length.

When the type is numeric (i.e. neither 's' nor 'S'), the original field value is converted to numeric value (double type in this implementation), then is cast to appropriate type specified by the type and the size before converted by sprintf function. After sprintf operation, the converted string is normalized to the internal string value (wchar string in this implementation), then, is handed to the D_fmt process.

The type 'c' or 'C' is also numeric. The sprintf operation in this case is the conversion from an internal character code to the character. If the size is 'l', the character can be a multi-byte character; otherwise the character is limited to the one byte character range.

When the type is 's' or 'S', the original value is basically not changed by the sprintf function. The C-format "%s" does nothing. This type is usually used to add some characters to the original string, such as the C-format "(%s)" to parenthesize the original value. The size 'l' may be used with the type 's', but the result is the same unless it is used with the width and the precision, which is not recommended.

When a C-format is specified, you cannot use the option f in the same format entry.

The C-format specification of D-commands may not be accepted by your runtime library. For example, size 'w', which is valid in D-commands, does not work with Windows. The type 'p', which is in the standard C language, is rejected by the C-format of D-commands. For the C-format conversion to work properly, a C-format specification must be accepted both by this D-commands specification and by your runtime library's sprintf specification.

There are some more points to note when using a C-format with non-ASCII characters. The sprintf function operates on the locale character code. When you are using the UTF I/O feature, some characters may not be representable in your locale character code. The result of a "%lc" conversion for such a character code depends on your runtime environment. In addition, the width and the precision are counted in bytes in the C-format "%s". This may produce an invalid code representation, but the result depends on your runtime environment. If you are using ASCII characters only, these problems do not occur.
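The conversion steps above can be imitated with Python's printf-style % operator. This is an approximation only: the function name is hypothetical, the type letter is passed in separately instead of being parsed from the C-format, size modifiers are not handled, and Python's formatting is assumed to behave like sprintf for these simple cases.

```python
def apply_c_format(value, c_format, conversion):
    """Format a field value with a single-specifier C-format."""
    if conversion in "sS":
        return c_format % value          # string types pass through
    number = float(value)                # numeric value is a double first
    if conversion in "diouxXc":
        number = int(number)             # then cast to an integer type
    return c_format % number

rounded = apply_c_format("3.14159", "%.2f", "f")
wrapped = apply_c_format("abc", "(%s)", "s")
too_long = apply_c_format("123456", "%3d", "d")
```

The last call returns all six digits, which illustrates that the width does not limit the length of a numeric result; a D_fmt length in the end specification is needed for that.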

Default field entry

When there is no format information in the field-format-list or the field entry is not found, the default field entry is applied. It is an imaginary field entry with the default values described above. It takes cp as its start value, the -t command option value (or TAB if there is no -t) as its delimiter, the -z command option value (or none if there is no -z) as its options, and none for its C-format.

SPECIAL CHARACTERS IN THE FORMAT

You may have to use special characters such as REVERSE SOLIDUS (\), DOLLAR SIGN ($) or AMPERSAND (&) in your field-format-list, especially in an input delimiter. To pass the intended string to the D-format processor, you have to escape it carefully against the shell's special character handling, the D-field-list parser and the D-format parser. Here is a practical way to do this.

Use APOSTROPHE (') quoting in a UNIX shell or QUOTATION MARK (") quoting in the Windows cmd shell.
Use a doubled REVERSE SOLIDUS (\\) for a REVERSE SOLIDUS.
Use REVERSE SOLIDUS with COMMA (\,) for a COMMA not used as a separator in a field-list.
Use REVERSE SOLIDUS with SOLIDUS (\/) for a SOLIDUS inside /STRING/.
Use REVERSE SOLIDUS with COMMERCIAL AT (\@) for a COMMERCIAL AT inside @STRING@.

It may be useful to learn how the command arguments are processed by the shell and D-commands. There are two levels of character escape processes between your command argument and the string to be handled. The first level is the SHELL (cmd shell in Windows), the second is the processes by D-commands. The latter is further divided into some levels. The first is D-field-list parsing process, the second is D-format parsing process, and the deepest is regular expression matching process which is called only for input delimiter or pattern. In each process, some "escape" characters may be stripped off. The following paragraphs describe the details.

In the UNIX shells, space characters can be included in a command argument by putting a REVERSE SOLIDUS (\) before them or by quoting with QUOTATION MARK (") or APOSTROPHE ('). Other special characters like GREATER-THAN SIGN (>), DOLLAR SIGN ($), or AMPERSAND (&) can be included using the same quoting mechanisms. There are minor differences in the quoting mechanisms and in the repertoire of special characters between sh(1) and csh(1) (and other shells). Going into the details is beyond the scope of this tutorial. Generally, it is recommended to quote your field-format-list with APOSTROPHEs. If you need to use an APOSTROPHE in your format, quote it with QUOTATION MARKs. You can use more than one quoting in an argument. A command argument '"DON'"'"'T"' becomes the seven character string "DON'T" when it is handed to the command.

In the Windows command window shell, space characters can be included in a command argument only by the QUOTATION MARK quoting mechanism. Other special characters in the Windows shell, i.e., GREATER-THAN SIGN, VERTICAL LINE or AMPERSAND, can be included by using the QUOTATION MARK quoting mechanism, or by putting a CIRCUMFLEX ACCENT (^) before them. A QUOTATION MARK itself can be used within QUOTATION MARKs by doubling it (""). To use a CIRCUMFLEX ACCENT in a command argument, put it in QUOTATION MARKs or double the CIRCUMFLEX ACCENT. In addition to the above characters, PERCENT SIGN (%) has to be escaped when it appears in a valid environment variable context. For example, %path% is replaced with the path directory names by the shell. In most cases, you need not worry about this. But if you use words like %xx% in a command argument, you can put a CIRCUMFLEX ACCENT before the second % (or before x). You cannot escape it with QUOTATION MARKs.
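Mixed quoting in a UNIX shell argument can be checked with Python's standard shlex module, which splits text using POSIX shell quoting rules. The variable names below are only illustrative. The argument consists of three quoted pieces written next to each other: an APOSTROPHE-quoted piece, a QUOTATION MARK-quoted APOSTROPHE, and another APOSTROPHE-quoted piece.

```python
import shlex

# The argument text exactly as it would be typed on the command line.
shell_text = """'"DON'"'"'T"'"""

# shlex.split applies POSIX quoting rules and joins adjacent quoted pieces.
args = shlex.split(shell_text)
received = args[0]   # the string the command would actually receive
```

The command receives one argument of seven characters, including the two QUOTATION MARKs and the APOSTROPHE.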

These quoting characters of the shell are removed before the argument is handed to a D-command.

After the shell processing, D-commands receive the processed argument. D-field-list parsing performs the following steps.

If the field-list starts with CIRCUMFLEX ACCENT (^), it is removed (exclusive field list).
If the field-list starts with REVERSE SOLIDUS and following CIRCUMFLEX ACCENT, REVERSE SOLIDUS at the top is removed.
In each <field name> part, REVERSE SOLIDUSs preceding COMMA or REVERSE SOLIDUS are removed.
In each <additional-inf> part, REVERSE SOLIDUSs preceding COMMA are removed. REVERSE SOLIDUSs preceding REVERSE SOLIDUS are preserved for further parsing.

<additional-inf> is processed by D-format parser. In this process:

In the delimiter (/STRING/), REVERSE SOLIDUSs preceding SOLIDUS (/) or REVERSE SOLIDUS are removed.
In the pattern (@STRING@), REVERSE SOLIDUSs preceding COMMERCIAL AT (@) or REVERSE SOLIDUS are removed.
In the C-format, REVERSE SOLIDUSs preceding REVERSE SOLIDUS are removed.

The result is handed to the D-format formatting process including regular expression matching.

Example

Suppose you want to use any one of SOLIDUS (/), VERTICAL LINE (|) or REVERSE SOLIDUS (\) as the delimiter of the field foo.

The regular expression of this delimiter should be

/|\||\\

The first SOLIDUS is a normal character. The next VERTICAL LINE is the regular expression syntax element meaning OR. The next REVERSE SOLIDUS makes the following VERTICAL LINE a normal character. The next VERTICAL LINE is again a syntax element, and the last two REVERSE SOLIDUSs make one normal REVERSE SOLIDUS character. This is the form D-format should hand to the regular expression parser.

In the field-format-list, you have to write this as /STRING/, and you have to escape the first SOLIDUS. Furthermore, you have to double the last two REVERSE SOLIDUSs, because the D-format parser removes a REVERSE SOLIDUS preceding a REVERSE SOLIDUS:

/\/|\||\\\\/

As this has no COMMA, you don't need further escape in the field-list.

'foo:/\/|\||\\\\/'

This should be the command argument. In Windows, use QUOTATION MARK instead of APOSTROPHE as "foo:/\/|\||\\\\/".

You may also use the regular expression construct []. In this case 'foo:/[\/|\\]/' works. This is a simpler and better answer. The example above is for explanation.
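The escape removal done by the D-format parser for a delimiter, as described above, can be traced with a small Python sketch. The function name is hypothetical, and Python's re module is assumed to interpret the resulting alternation in the same way as the regular expression engine used by D-commands. The bracket form is not checked here, because Python treats a REVERSE SOLIDUS inside [] differently from some other engines.

```python
import re

def strip_delimiter_escapes(body):
    """Remove each REVERSE SOLIDUS that precedes a SOLIDUS or REVERSE SOLIDUS."""
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body) and body[i + 1] in "/\\":
            out.append(body[i + 1])      # drop the escape, keep the character
            i += 2
        else:
            out.append(body[i])          # e.g. the \ before | is preserved
            i += 1
    return "".join(out)

written = r"\/|\||\\\\"                  # the text between the outer / and /
regex = strip_delimiter_escapes(written)
fields = re.split(regex, r"a/b|c\d")
```

The stripped text is the regular expression shown at the start of the example, and splitting the sample line on it separates the four values a, b, c and d.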

field-format-list		::=	field-list of field-name[`:`format].
format		::=	[start] [end] [options] [repeat] [`:`C-format]
start		::=	absolute | relative
	absolute	::=	DIGITS
	relative	::=	`+`DIGITS
end		::=	end-position | length | delimiter | pattern
	end-position	::=	`-`DIGITS
	length	::=	`(`DIGITS`)`
	delimiter	::=	`/`STRING`/`
	pattern	::=	`@`STRING`@`
options		::=	{ alignment | quoting | with-field-name }..
	alignment	::=	`l` | `r` | `n`
	quoting	::=	`q` | `x` | `Q` | `b` | `n`
	with-field-name	::=	`f` | `n`
repeat		::=	`*`
C-format		::=	scanf(3) or printf(3) format with one `%`

D_fmt - field-format-list of D-commands

DESCRIPTION

SYNTAX

COMMAND OPTIONS AND FIELD-FORMAT-LIST

INPUT FORMAT

Action

Start

End

Options

Alignment options

Quoting options

With field name option

Canceling options

Repeat

C-format

Default field entry

Hard and soft delimiter

OUTPUT FORMAT

Action

Start

End

Options

Alignment options

Quoting options

With field name option

Canceling options

Repeat

C-format

Default field entry

SPECIAL CHARACTERS IN THE FORMAT

Example

EXAMPLES (Input)

EXAMPLES (Output)

SEE ALSO

AUTHOR