DfromChunk - Creation of D-records from chunks of lines in text files.

SYNOPSIS

DfromChunk [ options ] [ -h header-line pattern | -d delimiter-line pattern ] [ -s field-name-separator pattern ] [ input-file.. ]

DESCRIPTION

DfromChunk converts chunks of lines into D-records. There are many ways to represent logical records in a text file. One way is to represent a logical record with one text line, as in a CSV file. Another way is to represent a logical field with one text line, and a logical record with a group of lines (a chunk of lines). The D-record is one such representation, but there are many others. One example is the e-mail header. It follows D-record-like rules, but allows line continuation, and by tradition the field-name separator COLON (:) may be followed by a SPACE. Another example is the ".ini" file used in Windows. It uses a section header line to mark the beginning of a section, and its field-name separator is EQUAL SIGN (=). DfromChunk is a generalized converter from these types of representation to a D-file.

By default, DfromChunk acts just like Dcat. With the -d (delimiter-line pattern), -h (header-line pattern), -s (field-name-separator pattern) and other options, DfromChunk can handle various types of input.

It identifies chunks of lines either by delimiter lines (-d option) or by header lines (-h option). Then, comments (-r or -R option) are removed. Finally, a field name and a value are parsed from each line according to the -s option, or the -n and -v options.

Delimiter line

When no -h option is given, the lines after a delimiter-line (or from the top of the file) and before the next delimiter-line (or the end of the file) form a "chunk". A delimiter-line is a line that matches the regular expression given by the -d option. The default for the -d option is "^$", which means that an empty line is the delimiter-line.
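The delimiter rule above can be modelled in a few lines of Python. This is only a sketch, not DfromChunk's implementation: the function name split_by_delimiter is hypothetical, Python's re module stands in for whatever regular-expression dialect DfromChunk uses, and dropping empty chunks between consecutive delimiter-lines is an assumption.

```python
import re

def split_by_delimiter(lines, delimiter_pattern=r"^$"):
    """Group lines into chunks separated by lines matching delimiter_pattern."""
    delimiter = re.compile(delimiter_pattern)
    chunks, current = [], []
    for line in lines:
        if delimiter.search(line):
            # A delimiter-line closes the current chunk and belongs to no chunk.
            if current:  # assumption: empty chunks are not emitted
                chunks.append(current)
            current = []
        else:
            current.append(line)
    if current:
        chunks.append(current)
    return chunks

text = "name:alice\nage:30\n\nname:bob\nage:25\n"
print(split_by_delimiter(text.splitlines()))
# [['name:alice', 'age:30'], ['name:bob', 'age:25']]
```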

Header line

When the -h option is given, the lines from a header-line to the line before the next header-line (or the end of the file) form a "chunk". A header-line is a line that matches the regular expression given by the -h option. The lines between the beginning of the file and the first header-line are discarded. When both the -h and -d options are given, the -d option is ignored.
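As an illustration of the header rule, the following Python sketch groups lines by header-lines. The function name split_by_header is hypothetical, and Python's re module is assumed as a stand-in for DfromChunk's own regular-expression engine.

```python
import re

def split_by_header(lines, header_pattern):
    """Group lines into chunks, each starting at a line matching header_pattern."""
    header = re.compile(header_pattern)
    chunks, current = [], None
    for line in lines:
        if header.search(line):
            # A header-line closes the previous chunk and opens a new one.
            if current is not None:
                chunks.append(current)
            current = [line]
        elif current is not None:
            current.append(line)
        # Lines before the first header-line are discarded.
    if current is not None:
        chunks.append(current)
    return chunks

lines = ["; preamble", "[a]", "x=1", "[b]", "y=2"]
print(split_by_header(lines, r"^\[(.*)\]"))
# [['[a]', 'x=1'], ['[b]', 'y=2']]
```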

Line continuation

Line continuation is processed within a "chunk" if the -c or -C option is specified. If the -c option is given, a line ending with REVERSE SOLIDUS ('\') is joined to the next line, and the REVERSE SOLIDUS is discarded. If the -C option is given, a line starting with white space is joined to the previous line, and the leading white space is reduced to one SPACE character.

When both the -c and -C options are given, both line continuation processes are applied. In this case, the -c process is performed first, and then the -C process.
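The two continuation styles can be sketched in Python as follows. The function names are hypothetical, and two edge cases are assumptions not covered by this manual: a REVERSE SOLIDUS on the last line of a chunk simply ends the joined line, and an indented first line of a chunk is left as it is.

```python
import re

def join_backslash(lines):
    """-c style: join a line ending with a backslash to the next line."""
    out, pending = [], None
    for line in lines:
        continues = line.endswith("\\")
        body = line[:-1] if continues else line  # drop the backslash
        pending = body if pending is None else pending + body
        if not continues:
            out.append(pending)
            pending = None
    if pending is not None:  # assumption: dangling backslash on the last line
        out.append(pending)
    return out

def join_leading_space(lines):
    """-C style: join a line starting with white space to the previous line."""
    out = []
    for line in lines:
        if out and re.match(r"\s+", line):
            # Leading white space is reduced to a single SPACE.
            out[-1] += " " + re.sub(r"^\s+", "", line)
        else:
            out.append(line)
    return out

chunk = ["Received: from a \\", "by b", "Subject: hello", "   world"]
# With both options, the -c process runs first, then the -C process.
print(join_leading_space(join_backslash(chunk)))
# ['Received: from a by b', 'Subject: hello world']
```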

Remarks (Comment lines and inline comments)

If a remarks (comment-line) pattern is given with the -r option, lines that match the pattern are removed from the chunk.

If a remarks (inline comment) pattern is given with the -R option, the first string that matches the pattern is removed from each line in the chunk.
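A minimal Python sketch of both kinds of remark removal follows. The function name strip_remarks and its parameter names are hypothetical, and applying the comment-line pattern before the inline pattern is an assumption; this manual does not state the order.

```python
import re

def strip_remarks(lines, comment_line=None, inline_comment=None):
    """Drop comment lines (-r style) and the first inline comment (-R style)."""
    out = []
    for line in lines:
        if comment_line and re.search(comment_line, line):
            continue  # the whole line is a remark
        if inline_comment:
            # count=1 removes only the first match in the line.
            line = re.sub(inline_comment, "", line, count=1)
        out.append(line)
    return out

chunk = ["; a comment line", "a=1 # x # y", "b=2"]
print(strip_remarks(chunk, comment_line=r"^;", inline_comment=r"#[^#]*"))
# ['a=1 # y', 'b=2']
```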

Field name and value

From the header-line, the fixed-name field "HDR" is produced. The field value is the whole line. However, if the header-line regular expression pattern contains "()", only the part that matches the first "()" becomes the value.

The field name "HDR" can be changed with the -H option.
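The header-field rule can be sketched in Python like this. The function name header_field is hypothetical, and Python's re module is assumed as a stand-in for DfromChunk's regular expressions.

```python
import re

def header_field(line, header_pattern, field_name="HDR"):
    """Return (name, value) for a header-line, or None if the line is not one."""
    m = re.search(header_pattern, line)
    if m is None:
        return None
    # With "()" in the pattern the first group is the value,
    # otherwise the value is the whole line.
    value = m.group(1) if m.re.groups >= 1 else line
    return field_name, value

print(header_field("[core]", r"^\[(.*)\]", field_name="section"))
# ('section', 'core')
print(header_field("[core]", r"^\["))
# ('HDR', '[core]')
```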

For non-header lines, separator inspection is performed unless the -n option is specified. Each line is matched against the regular expression field-name-separator pattern given with the -s option. If it matches, the string before the matched part becomes the field name, the string after the matched part becomes the value, and a D-field is produced. If the field-name-separator pattern doesn't match, the line is handled as an "error field" (see below). When neither the -s nor the -n option is given, ":" is used as the field-name-separator pattern.
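The separator rule can be illustrated with a short Python sketch. The function name parse_with_separator is hypothetical, and splitting at the leftmost match of the pattern is an assumption.

```python
import re

def parse_with_separator(line, separator_pattern=":"):
    """Split a line into (name, value) around the separator, or None on no match."""
    m = re.search(separator_pattern, line)  # assumption: leftmost match is used
    if m is None:
        return None  # would be treated as an error line
    return line[:m.start()], line[m.end():]

print(parse_with_separator("Subject: Re: hi", r": *"))
# ('Subject', 'Re: hi')
print(parse_with_separator("key=value", "="))
# ('key', 'value')
print(parse_with_separator("no separator here"))
# None
```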

Advanced field parsing

When the -n option is given, advanced field parsing is performed for the lines in a chunk. In this case, the field-name separator given with the -s option (if present) is ignored.

The field-name-part pattern given with the -n option is matched against the line. If it matches, the string matched by the first "()" in the pattern becomes the field name. If the pattern contains no "()", the whole matched string becomes the field name. When the pattern does not match the line at all, the line is regarded as an error line.

When the -n pattern matches a line, the line is then matched against the field-value-part pattern given with the -v option. If it matches, the string matched by the first "()" in the pattern becomes the field value. If the pattern contains no "()", the whole matched string becomes the field value. When the field-value-part pattern does not match the line, the value is the null string.

When no -v option is given with the -n option, the string after the part matched by the field-name-part pattern becomes the field value.
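The advanced parsing rules can be modelled in Python as below. This is a sketch under assumptions: the names parse_advanced and first_group_or_whole are hypothetical, Python's re module stands in for DfromChunk's regular expressions, and without a value pattern the value is taken from the end of the whole name match.

```python
import re

def first_group_or_whole(m):
    """First "()" group if the pattern has one, else the whole matched string."""
    return m.group(1) if m.re.groups >= 1 else m.group(0)

def parse_advanced(line, name_pattern, value_pattern=None):
    """Model of -n / -v parsing; returns (name, value) or None for an error line."""
    name_match = re.search(name_pattern, line)
    if name_match is None:
        return None  # error line
    name = first_group_or_whole(name_match)
    if value_pattern is None:
        # No -v: the value is whatever follows the name match.
        return name, line[name_match.end():]
    value_match = re.search(value_pattern, line)
    value = first_group_or_whole(value_match) if value_match else ""
    return name, value

print(parse_advanced("set PATH = /usr/bin", r"^set +(\w+)", r"= *(.*)$"))
# ('PATH', '/usr/bin')
print(parse_advanced("set DEBUG", r"^set +(\w+)", r"= *(.*)$"))
# ('DEBUG', '')
```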

Error fields

Lines from which a field name cannot be parsed are error lines and are simply discarded. However, you can keep these lines as "ERR" fields with the -k option. The field name "ERR" can be changed with the -K option.
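The effect of keeping error lines can be sketched in Python. The function name build_fields and its parameters are hypothetical, a literal separator is used instead of a regular expression for brevity, and storing the whole line as the value of the error field is an assumption that this manual does not confirm.

```python
def build_fields(lines, keep_errors=False, error_name="ERR", separator=":"):
    """Turn lines into (name, value) pairs, optionally keeping error lines."""
    fields = []
    for line in lines:
        name, sep, value = line.partition(separator)
        if sep:
            fields.append((name, value))
        elif keep_errors:
            # Assumption: the error field carries the unparsed line as its value.
            fields.append((error_name, line))
        # Without keep_errors, the error line is discarded.
    return fields

chunk = ["name:alice", "garbage line", "age:30"]
print(build_fields(chunk))
# [('name', 'alice'), ('age', '30')]
print(build_fields(chunk, keep_errors=True, error_name="BAD"))
# [('name', 'alice'), ('BAD', 'garbage line'), ('age', '30')]
```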

OPTIONS

-d delimiter-line pattern
gives the delimiter-line regular expression. Default is "^$". See Delimiter line in the DESCRIPTION.
-h header-line pattern
gives the header-line regular expression. See Header line in the DESCRIPTION.
-H field-name
gives the field name for the header-line field. Default is "HDR".
-s field-name-separator pattern
gives the regular expression to separate the field name and the value in a line. Default is ":". See Field name and value in the DESCRIPTION.
-n field-name-part pattern
specifies the field name part. See Advanced field parsing in the DESCRIPTION.
-v field-value-part pattern
specifies the field value part. See Advanced field parsing in the DESCRIPTION.
-r comment-line pattern
gives the remarks (comment-line) regular expression. See Remarks in the DESCRIPTION.
-R inline-comment pattern
gives the remarks (inline comment) regular expression. See Remarks in the DESCRIPTION.
-c
Line continuation with REVERSE SOLIDUS (\). See Line continuation in the DESCRIPTION.
-C
Line continuation with leading white space. See Line continuation in the DESCRIPTION.
-k
keeps the error fields.
-K field-name
gives the field name for the error fields. Default is "ERR".
-F
adds a "filename" field to each output record; the value is the input file name as given in the command arguments, after globbing by the shell. This field is not added when the input is the standard input.
-D [i/o]datautf=8|16|32
UTF I/O feature (see manual page of UTF I/O feature.)

FIXED NAME FIELDS

HDR:
at the top of each record, when -h is specified (without -H).
ERR:
when -k is specified (without -K).
filename:
at the bottom of each record, when -F is specified.

ENVIRONMENT

Ddatautf, Didatautf, Dodatautf
for UTF I/O feature.

EXAMPLES

Read ".ini" files.

DfromChunk -h "^\[(.*)\]" -H section -s "=" -r "^;" input-file
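To make the effect of these options concrete, here is a rough Python model of the same processing. It is not DfromChunk itself: the function name ini_to_records is hypothetical, records are shown as lists of (name, value) pairs instead of the D-file text format, and Python's re module is assumed as a stand-in for DfromChunk's regular expressions.

```python
import re

def ini_to_records(text):
    """Model of: -h "^\\[(.*)\\]" -H section -s "=" -r "^;" """
    header = re.compile(r"^\[(.*)\]")   # -h: section header starts a chunk
    comment = re.compile(r"^;")         # -r: comment lines are removed
    records, current = [], None
    for line in text.splitlines():
        m = header.search(line)
        if m:
            # -H section: the first "()" group becomes the "section" field.
            current = [("section", m.group(1))]
            records.append(current)
            continue
        if current is None or comment.search(line):
            continue  # before the first header, or a comment line
        name, sep, value = line.partition("=")  # -s "="
        if sep:
            current.append((name, value))
        # Lines without "=" are error lines; dropped because -k is not given.
    return records

sample = "; settings\n[db]\nhost=localhost\nport=5432\n\n[cache]\n; seconds\nttl=60\n"
for record in ini_to_records(sample):
    print(record)
# [('section', 'db'), ('host', 'localhost'), ('port', '5432')]
# [('section', 'cache'), ('ttl', '60')]
```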

Read e-mail header (without mail body).

DfromChunk -s ": *" -C input-file

DIAGNOSTICS

See the manual of D_msg.

SEE ALSO

Dintro, DfromLine, DfromCsv, DfromHtml, D_msg.

AUTHOR

MIYAZAWA Akira


miyazawa@nii.ac.jp
2013