Dfreq - Frequency count of field values

SYNOPSIS

Dfreq [ options ] [ -g group-by-key-field-list ] key-field-list [ input-file.. ]

DESCRIPTION

Dfreq produces frequency count records for the fields given in key-field-list. Each output record consists of the fields in the key-field-list and a predefined field "count". Each key value appears only once in the output, and the "count" field value is the number of D-records having that key value. Output records are sequenced by key-field-list, in the same order that Dsort uses.
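
The counting rule above can be sketched in Python. This is only an illustrative model, not Dfreq's implementation: D-records are represented as plain dicts, the function name dfreq is hypothetical, every record is assumed to carry all key fields, and plain string ordering stands in for the Dsort ordering.

```python
from collections import Counter

def dfreq(records, key_fields):
    """Hypothetical model of default counting (no -m, -p, -F, or -g)."""
    counts = Counter()
    for rec in records:
        # Assumption: every record carries all of the key fields.
        key = tuple(rec[f] for f in key_fields)
        counts[key] += 1
    rows = []
    # Plain tuple ordering stands in for the Dsort ordering.
    for key in sorted(counts):
        row = dict(zip(key_fields, key))  # key fields, in key-field-list order
        row["count"] = counts[key]        # predefined "count" field follows them
        rows.append(row)
    return rows

# Input field order (year, lang) differs from the key-field-list (lang, year).
records = [
    {"year": "2001", "lang": "en"},
    {"year": "2001", "lang": "ja"},
    {"year": "2001", "lang": "en"},
]
result = dfreq(records, ["lang", "year"])
```

Here result holds one row per distinct (lang, year) pair, with the fields ordered lang, year, count regardless of their order in the input.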

Values of the key fields in the output may be altered depending on the key flags, as follows: the f flag converts lowercase letters to uppercase, the n flag converts values to normalized numeric form, the d flag removes delimiters from the values, and the i flag removes non-printing characters. The order of the key fields in the output follows the order in the key-field-list; the field order of the input records is not preserved.

Output records also have a "percent" field when the -p option is given.
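
As a rough illustration of the "percent" field, the sketch below adds it to already counted rows. The function name add_percent and the dict representation are hypothetical, and the value is left as an unrounded float because the actual output format of "percent" is not stated here.

```python
def add_percent(rows):
    """Hypothetical helper: append "percent" to each frequency row."""
    total = sum(r["count"] for r in rows)  # total count over all rows
    # Each row's share of the total, as a percentage; "percent" comes last.
    return [{**r, "percent": 100.0 * r["count"] / total} for r in rows]

rows = [{"lang": "en", "count": 3}, {"lang": "ja", "count": 1}]
with_percent = add_percent(rows)
```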

By default, a missing value (i.e., an input record without any of the key fields) is not counted as input. Such records can be counted by giving the -m option. The output record for the missing value has only the "count" field (and the "percent" field, if requested), without any key fields.
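
The effect of -m can be modeled as follows for a single key field. This is a sketch under assumptions: records are dicts, the names dfreq_missing and count_missing are hypothetical, and placing the missing-value record last is an arbitrary choice, since its position in the real output is not specified here.

```python
from collections import Counter

def dfreq_missing(records, key_field, count_missing=False):
    """Hypothetical model of counting with and without -m."""
    counts = Counter()
    missing = 0
    for rec in records:
        if key_field in rec:
            counts[rec[key_field]] += 1
        elif count_missing:
            missing += 1  # record has no key field; counted only with -m
    rows = [{key_field: k, "count": counts[k]} for k in sorted(counts)]
    if missing:
        # The missing-value record carries "count" but no key field.
        rows.append({"count": missing})
    return rows

records = [{"lang": "en"}, {"title": "untitled"}, {"lang": "en"}]
default_rows = dfreq_missing(records, "lang")
with_m_rows = dfreq_missing(records, "lang", count_missing=True)
```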

When there is more than one input file, Dfreq reads them as one consecutive file. When the -F option is given, however, Dfreq produces frequency count records for each input file separately, adding a "filename" field to each output record.
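
The difference between the default and -F behavior can be sketched like this. It is a simplified model, not the real implementation: each input file is a (name, records) pair, records are dicts, and the file names a.d and b.d and all function names are hypothetical.

```python
from collections import Counter

def count_rows(records, key_field, extra=None):
    """Count one key field; `extra` fields are placed before the key field."""
    counts = Counter(r[key_field] for r in records if key_field in r)
    return [{**(extra or {}), key_field: k, "count": counts[k]}
            for k in sorted(counts)]

def dfreq_combined(files, key_field):
    # Default: all input files are read as one consecutive file.
    merged = [r for _, records in files for r in records]
    return count_rows(merged, key_field)

def dfreq_per_file(files, key_field):
    # -F: one set of frequency records per file, each tagged with "filename".
    rows = []
    for filename, records in files:
        rows.extend(count_rows(records, key_field, {"filename": filename}))
    return rows

files = [
    ("a.d", [{"lang": "en"}, {"lang": "ja"}]),
    ("b.d", [{"lang": "en"}]),
]
combined = dfreq_combined(files, "lang")
per_file = dfreq_per_file(files, "lang")
```

In this model, combined counts "en" twice across both files, while per_file reports it once for each file.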

When -g group-by-key-field-list is given, Dfreq outputs frequency count records each time it encounters a sequence break in the group-by fields, i.e., when the key value of the group-by fields differs from that of the previous record, and at the end of all input files. The group-by key fields are added to the output records. Generally (but not necessarily), this option is used on input that has been sorted by Dsort with the same group-by-key-field-list. In group-by processing, the "percent" field value (if any) is calculated over the records in the same group.

In practice, the following two commands:

Dfreq -g a b
Dfreq a,b

generate the same result except for the "percent" field, provided the input is already in sequence on field a. (In the former case, each group totals 100%, while in the latter case, the whole input totals 100%.) The major difference between the two cases is memory usage. Dfreq keeps all the key values in memory until it flushes records out. In group-by processing, Dfreq flushes records out each time the group-by key value changes; in normal processing, records are flushed out only at the end of the last input file. Group-by processing therefore uses less memory, and is useful for very large input files or when the key field takes a large variety of values.
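
The flushing behavior described above can be modeled with a generator. This is a minimal sketch, not Dfreq's actual code: records are dicts, the name dfreq_grouped is hypothetical, a single group-by field and a single key field are assumed, and itertools.groupby stands in for sequence-break detection because it also splits on consecutive runs of equal keys.

```python
from collections import Counter
from itertools import groupby

def dfreq_grouped(records, group_field, key_field):
    """Hypothetical model of 'Dfreq -p -g group_field key_field'."""
    # groupby starts a new group whenever the group value changes,
    # which mirrors a sequence break in the group-by field.
    for gval, run in groupby(records, key=lambda r: r[group_field]):
        # Only the key values of the current group are held in memory.
        counts = Counter(r[key_field] for r in run if key_field in r)
        total = sum(counts.values())
        for k in sorted(counts):
            # Flush this group's rows; percent is relative to the group.
            yield {group_field: gval, key_field: k, "count": counts[k],
                   "percent": 100.0 * counts[k] / total}

records = [
    {"a": "x", "b": "1"},
    {"a": "x", "b": "1"},
    {"a": "x", "b": "2"},
    {"a": "y", "b": "1"},
]
rows = list(dfreq_grouped(records, "a", "b"))
```

In this model, the single record in group y gets a percent of 100, whereas counting on a,b over the whole input would give it 25. If the input is not in sequence on a, the same group value can occur in several separate runs, and each run is flushed on its own.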

OUTPUT RECORD

Each output record has the following fields, in this order.

filename
when the -F option is given; the value is the command-line argument as expanded (globbed) by the shell. This field is not added when the input is the standard input.
group-by-key-fields
when the -g option is given; the field order follows the group-by-key-field-list, and the values may be altered depending on the key flags.
key-fields
the field order follows the key-field-list, and the values may be altered depending on the key flags.
count
number of D-records that have the key field value.
percent
when the -p option is given; the "count" value as a percentage of the total count.

OPTIONS

-F
input files are processed individually; frequency count records are output at the end of each input file. It also adds a "filename" field to each output record.
-g group-by-key-field-list
group-by processing; frequency count records are output each time the group-by key value changes; adds the group-by-key-field-list fields to the output records.
-k key-flags
default key-flags for both the key-field-list and the group-by-key-field-list. See the manual of Dintro.
-m
missing values are counted.
-p
percentage is calculated; adds the "percent" field. The value is the present count as a percentage of the total count. The total is taken within one input file when the -F option is given, and within one key group when the -g option is given.
-D [i/o]datautf=8|16|32
UTF I/O feature (see the manual page of the UTF I/O feature).

ENVIRONMENT

Ddatautf, Didatautf, Dodatautf
for UTF I/O feature.

DIAGNOSTICS

See the manual of D_msg.

SEE ALSO

Dintro, Dmeans, D_msg.

AUTHOR

MIYAZAWA Akira


miyazawa@nii.ac.jp
2003