DfromChunk [ options ] [ -h header-line pattern | -d delimiter-line pattern ] [ -s field-name-separator pattern ] [ input-file.. ]
DfromChunk converts a chunk of lines into a D-record. There are many ways to represent logical records in a text file. One way is to represent a logical record with one text line, as in a CSV file. Another way is to represent a logical field with a text line, and a logical record with a group of lines (a chunk of lines). The D-record is one such representation, but there are many others. One example is the e-mail header. It follows D-record-like rules, but allows line continuation, and the field-name separator COLON (:) is traditionally followed by a SPACE. Another example is the ".ini" file used in Windows. It uses a section header line to mark the beginning of a section, and its field-name separator is EQUAL SIGN (=). DfromChunk is a generalized converter from these types of representation to a D-file.
By default, DfromChunk acts just like Dcat. With the -d (delimiter-line pattern) option, the -h (header-line pattern) option, the -s (field-name-separator pattern) option and other options, DfromChunk can handle various types of input.
It identifies chunks of lines either by delimiter lines (-d option) or by header lines (-h option). Then, comments (-r or -R option) are removed. Finally, a field name and a value are parsed from each line according to the -s option, or the -n and -v options.
When no -h option is given, the lines after a delimiter line (or from the top of the file) and before the next delimiter line (or the end of the file) form a "chunk". A delimiter line is a line that matches the regular expression given by the -d option. The default for the -d option is "^$", which means that an empty line is a delimiter line.
When the -h option is given, the lines from a header line up to the line before the next header line (or the end of the file) form a "chunk". A header line is a line that matches the regular expression given by the -h option. The lines between the beginning of the file and the first header line are discarded. When both the -h and -d options are given, the -d option is ignored.
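The two chunking rules above can be sketched in Python as follows. This is only an illustration, not DfromChunk's actual implementation: the function name split_chunks is hypothetical, Python's re syntax may differ from the regular expression dialect DfromChunk accepts, and dropping empty chunks between consecutive delimiter lines is an assumption that the text above does not state.

```python
import re

def split_chunks(lines, delimiter=r"^$", header=None):
    """Group lines into chunks (hypothetical helper, not part of DfromChunk)."""
    chunks = []
    if header is not None:
        # Header mode: the delimiter pattern is ignored entirely.
        head_re = re.compile(header)
        current = None
        for line in lines:
            if head_re.search(line):
                # A header line starts a new chunk and belongs to it.
                if current is not None:
                    chunks.append(current)
                current = [line]
            elif current is not None:
                current.append(line)
            # Lines before the first header line are discarded.
        if current is not None:
            chunks.append(current)
        return chunks
    # Delimiter mode: delimiter lines separate chunks and are not kept.
    delim_re = re.compile(delimiter)
    current = []
    for line in lines:
        if delim_re.search(line):
            if current:  # assumption: empty chunks are dropped
                chunks.append(current)
            current = []
        else:
            current.append(line)
    if current:
        chunks.append(current)
    return chunks
```

With the default pattern, empty lines separate the chunks; with a header pattern, each matching line opens a new chunk and anything before the first match is lost.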
Line continuation is processed within a "chunk" if the -c or -C option is specified. If the -c option is given, a line that ends with REVERSE SOLIDUS ('\') is joined to the next line, and the REVERSE SOLIDUS is discarded. If the -C option is given, a line that starts with white space is joined to the previous line, and the leading white space is reduced to one SPACE character.
When both the -c and -C options are given, both kinds of line continuation are applied. In this case, the -c processing is done first, and then the -C processing.
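The following Python sketch illustrates the two continuation rules and their order. The function name join_continuations and its keyword arguments are hypothetical. How DfromChunk treats a trailing REVERSE SOLIDUS on the last line of a chunk is not stated above; the sketch simply drops it, which is an assumption.

```python
def join_continuations(chunk, backslash=False, leading_space=False):
    """Apply -c style, then -C style, continuation to one chunk (sketch only)."""
    lines = list(chunk)
    if backslash:
        # -c style: a line ending with '\' is joined to the next line.
        joined, pending = [], None
        for line in lines:
            prefix = "" if pending is None else pending
            if line.endswith("\\"):
                pending = prefix + line[:-1]  # discard the backslash
            else:
                joined.append(prefix + line)
                pending = None
        if pending is not None:
            joined.append(pending)  # assumption: keep a dangling last line
        lines = joined
    if leading_space:
        # -C style: a line starting with white space is joined to the
        # previous line, with the leading white space reduced to one SPACE.
        joined = []
        for line in lines:
            if joined and line[:1].isspace():
                joined[-1] += " " + line.lstrip()
            else:
                joined.append(line)
        lines = joined
    return lines
```

For example, join_continuations(["Subject: hello", "   world"], leading_space=True) gives ["Subject: hello world"].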
If a remarks (comment-line) pattern is given with the -r option, lines that match the pattern are removed from the chunk.
If a remarks (in-line comment) pattern is given with the -R option, the string that matches the pattern (only the first match) is removed from each line in the chunk.
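The difference between the two remark options can be sketched as below. The helper strip_remarks and its parameter names are hypothetical, and Python's re module stands in for whatever regular expression engine DfromChunk uses.

```python
import re

def strip_remarks(chunk, line_pattern=None, inline_pattern=None):
    """Remove comment lines (-r style) and in-line comments (-R style)."""
    lines = list(chunk)
    if line_pattern is not None:
        # -r style: drop every line that matches the pattern.
        lines = [l for l in lines if not re.search(line_pattern, l)]
    if inline_pattern is not None:
        # -R style: delete only the first match in each line.
        lines = [re.sub(inline_pattern, "", l, count=1) for l in lines]
    return lines
```

So a line pattern such as "^;" removes whole lines, while an in-line pattern such as "\s*#.*" only cuts the matched part out of each line.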
From the header line, a field with the fixed name "HDR" is produced. The field value is the whole line. However, if the header-line regular expression uses "()", only the part that matches the first "()" becomes the value.
The field name "HDR" can be changed with the -H option.
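A minimal sketch of how the header field could be derived, assuming Python regular expressions. The function header_field is hypothetical and only mirrors the rule described above.

```python
import re

def header_field(line, header_pattern, field_name="HDR"):
    """Return (name, value) for a header line, or None if it is not one."""
    m = re.search(header_pattern, line)
    if m is None:
        return None
    # With "()" in the pattern, the first group is the value;
    # otherwise the whole line is the value.
    value = m.group(1) if m.re.groups >= 1 else line
    return field_name, value
```

For instance, header_field("[core]", r"^\[(.*)\]", "section") yields ("section", "core").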
For non-header lines, the separator is looked for unless the -n option is specified. Each line is matched against the field-name-separator regular expression given with the -s option. If it matches, the string before the matched part becomes the field name and the string after the matched part becomes the value, and a D-field is produced. If the field-name-separator pattern does not match, the line is handled as an "error field" (see below). When neither the -s option nor the -n option is given, ":" is used as the field-name-separator pattern.
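The separator rule can be illustrated with a short Python sketch. The name parse_field is hypothetical; the sketch assumes the first match in the line is the one used, which is what a leftmost regular expression search gives.

```python
import re

def parse_field(line, separator=":"):
    """Split a line into (name, value) at the first separator match."""
    m = re.search(separator, line)
    if m is None:
        return None  # no separator: an error line
    # Text before the match is the name, text after it is the value.
    return line[:m.start()], line[m.end():]
```

Because only the first match splits the line, parse_field("title:A:B") gives ("title", "A:B"), and a pattern such as ": *" also swallows the spaces after the colon.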
When the -n option is given, advanced field parsing is performed for the lines in a chunk. In this case, the field-name separator given with the -s option (if present) is ignored.
The field-name-part pattern given with the -n option is matched against the line. If it matches, the string matched by the first "()" in the pattern becomes the field name. If no "()" is present in the pattern, the whole matched string becomes the field name. When the pattern does not match the line at all, the line is regarded as an error line.
When the -n pattern matches a line, the line is then matched against the field-value-part pattern given with the -v option. If it matches, the string matched by the first "()" in the pattern becomes the field value. If no "()" is present in the pattern, the whole matched string becomes the field value. When the field-value-part pattern does not match the line, the value is the null string.
When no -v option is given with the -n option, the string after the part matched by the field-name-part pattern becomes the field value.
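The -n and -v rules can be sketched as follows. The names parse_advanced and first_group_or_match are hypothetical. The sketch assumes that the value pattern is searched in the whole line, and that without a value pattern the value starts after the entire match of the name pattern (not after its first group); both are readings of the description above rather than confirmed behavior.

```python
import re

def first_group_or_match(m):
    """First "()" group if the pattern has one, else the whole match."""
    if m.re.groups >= 1:
        return m.group(1) or ""
    return m.group(0)

def parse_advanced(line, name_pattern, value_pattern=None):
    """Return (name, value), or None for an error line (sketch only)."""
    nm = re.search(name_pattern, line)
    if nm is None:
        return None  # the name pattern does not match: error line
    name = first_group_or_match(nm)
    if value_pattern is None:
        # No value pattern: the value is whatever follows the name match.
        return name, line[nm.end():]
    vm = re.search(value_pattern, line)
    # A value pattern that does not match gives the null string.
    value = first_group_or_match(vm) if vm else ""
    return name, value
```

For example, parse_advanced("name = value", r"^(\w+)\s*=\s*") gives ("name", "value").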
Lines from which a field name cannot be parsed are error lines and are discarded. However, you can keep these lines as "ERR" fields with the -k option. The field name "ERR" can be changed with the -K option.
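A small sketch of the error-line handling, using the default separator rule. The function chunk_to_fields and its parameters are hypothetical, and using the whole line as the value of the "ERR" field is an assumption; the text above only says that the line is kept.

```python
import re

def chunk_to_fields(chunk, separator=":", keep_errors=False, error_name="ERR"):
    """Parse a chunk into (name, value) pairs, optionally keeping error lines."""
    fields = []
    for line in chunk:
        m = re.search(separator, line)
        if m:
            fields.append((line[:m.start()], line[m.end():]))
        elif keep_errors:
            # -k style: keep the unparsable line as an error field.
            fields.append((error_name, line))
        # Otherwise the error line is silently discarded.
    return fields
```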
Read ".ini" files.
DfromChunk -h "^\[(.*)\]" -H section -s "=" -r "^;" input-file
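To make the effect of these options concrete, here is a rough Python imitation of the ".ini" command above. It is not the tool itself: ini_to_records is a hypothetical function, and records are shown as lists of (name, value) pairs rather than in D-file syntax.

```python
import re

def ini_to_records(text):
    """Imitate -h "^\\[(.*)\\]" -H section -s "=" -r "^;" on a string."""
    records, current = [], None
    for line in text.splitlines():
        m = re.search(r"^\[(.*)\]", line)
        if m:
            # Header line: start a new record with a "section" field.
            current = [("section", m.group(1))]
            records.append(current)
        elif current is None or re.search(r"^;", line):
            # Before the first header, or a comment line: discard.
            continue
        else:
            sep = re.search(r"=", line)
            if sep:
                current.append((line[:sep.start()], line[sep.end():]))
            # Lines without "=" are error lines and are dropped.
    return records
```

Given the input "[display]", "width=800", "; note", "[audio]", "volume=7" on separate lines, this sketch returns one record per section: [("section", "display"), ("width", "800")] and [("section", "audio"), ("volume", "7")].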
Read e-mail header (without mail body).
DfromChunk -s ": *" -C input-file
See the manual of D_msg.
MIYAZAWA Akira