Djoin - join a D-file with key matching

SYNOPSIS

Djoin [ -c ] [ -k key-flags ] [ -o output-spec ] key-field-list input-file [ input-file.. ]

DESCRIPTION

Djoin joins the input files on the key values specified by key-field-list. When only one input-file is given, standard input is used as the second input file. More than two input-files may be given, in which case all of them are joined together. All input files must be sorted in key-field-list key order unless the -c option is given; otherwise Djoin reports an error and terminates.

Matching

All input files are read in parallel. Records that have the same key value are grouped, and the output record(s) are composed from these records as follows.

  1. When the group has up to one record from each input file, just one record is composed. It has all fields of the first record, followed by all fields except the key fields of the second record, then all fields except the key fields of the third record, and so on.
  2. When the group has more than one record from an input file, the cross product of these records is output, each output record having the same field composition as in case 1. The output records of the cross product are ordered by the "low order records change first" rule, that is, records from later input files vary fastest.

To demonstrate these rules with examples, we introduce the Dl constant notation for representing D-records.

{ a:1 b:2 }

is the same as

a:1
b:2

which represents a D-record with field "a" value 1 and field "b" value 2.

When file1 has { k:K a:1 }, file2 has { b:1 k:K }, and file3 has { c:1 k:K }, then

Djoin k file1 file2 file3

will produce the record

{ k:K a:1 b:1 c:1 }

When the input records are:

file1: { k:K a:1 }, { k:K a:2 }
file2: { k:K b:1 }
file3: { k:K c:1 }, { k:K c:2 }

the output records are

{ k:K a:1 b:1 c:1 }
{ k:K a:1 b:1 c:2 }
{ k:K a:2 b:1 c:1 }
{ k:K a:2 b:1 c:2 }

in this order.
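
The composition rules above can be sketched in Python. This is only an illustration of the documented behavior, not Djoin's implementation: the function name compose_group and the representation of a D-record as a list of (name, value) pairs are assumptions made for the example, and the sketch covers only a group in which every input file contributes at least one record.

```python
from itertools import product

def compose_group(key_fields, records_per_file):
    """Compose the output records for one key group (hypothetical helper).

    records_per_file holds one list per input file; each list contains the
    records of that file sharing the key value. A record is a list of
    (name, value) pairs, so field order and repeated fields are preserved.
    """
    outputs = []
    # itertools.product varies the last iterable fastest, which corresponds
    # to the "low order records change first" rule.
    for combo in product(*records_per_file):
        out = list(combo[0])  # all fields of the first record
        for rec in combo[1:]:
            # all but the key fields of each subsequent record
            out.extend((n, v) for n, v in rec if n not in key_fields)
        outputs.append(out)
    return outputs

file1 = [[("k", "K"), ("a", "1")], [("k", "K"), ("a", "2")]]
file2 = [[("k", "K"), ("b", "1")]]
file3 = [[("k", "K"), ("c", "1")], [("k", "K"), ("c", "2")]]

result = compose_group({"k"}, [file1, file2, file3])
for rec in result:
    print("{ " + " ".join(f"{n}:{v}" for n, v in rec) + " }")
```

The loop prints one line per output record in Dl constant notation, in cross product order.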

Note that Djoin does not check whether the joined files share field names. When they share a field name (other than a key field), that field becomes a repeating field in the output. To distinguish such fields, rename them beforehand.

Output selection

By default, Djoin outputs only fully matched records (i.e., key groups that contain records from all input files) or, in the core join case (-c), all records from the first file. With an output-spec given by the -o option, an arbitrary combination of matches can be selected.

An output-spec is a string whose i-th character corresponds to the i-th input-file. Each character has one of the following meanings:

1
MATCH: output records include the corresponding input-file records.
0
UNMATCH: output records do not include the corresponding input-file records.
x
DON'T CARE: output records may or may not include the corresponding input-file records.

More than one -o option may be given. In that case, the output is the union of the records selected by each output-spec.
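
The selection logic can be sketched in Python as follows. This is an illustration under stated assumptions, not Djoin's code: the names spec_matches and selected are hypothetical, and a key group is reduced to a list of booleans saying which input files contributed a record. Padding short specs with DON'T CARE and treating any character other than 1 and 0 as DON'T CARE follow the description of -o under OPTIONS.

```python
def spec_matches(spec, present):
    """Return True if one output-spec accepts a key group.

    present[i] is True when the i-th input file has a record in the group.
    """
    # A spec shorter than the number of files is padded with DON'T CARE.
    padded = spec.ljust(len(present), "x")
    for ch, has_record in zip(padded, present):
        if ch == "1" and not has_record:   # MATCH required but file absent
            return False
        if ch == "0" and has_record:       # UNMATCH required but file present
            return False
        # any other character is DON'T CARE
    return True

def selected(specs, present):
    """Several -o options form a union: any accepting spec selects the group."""
    return any(spec_matches(s, present) for s in specs)

# -o 10 -o 01 keeps groups found in exactly one of two files.
only_first = selected(["10", "01"], [True, False])
in_both = selected(["10", "01"], [True, True])
```

Here only_first is True and in_both is False, so the two specs together select exactly the unmatched records.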

Key fields

The join key is specified with the general key-field-list syntax of the D-commands. Two or more fields can be used, with numeric, case-insensitive, or reverse-order matching.

However, there is no way to join D-files whose key fields have different names. To use Djoin, you must first give the key field the same name in all input files, using Drename or other commands.

Core join

Under certain conditions, you may join files that are not sorted in key sequence order. This is called a core join, and it is invoked with the -c option. In a core join, the first input file has the special privilege of determining the output record sequence. The second and subsequent files are subordinate files: all of their records are read into memory and then looked up while the records of the first file are processed. Core join is typically used for table-lookup joins.

The conditions for using core join concern

  1. memory size, and
  2. output (-o) option.

You must have enough memory to hold the subordinate files. Note that characters are stored in an internal code that takes two or four bytes per character, depending on the operating system environment. Extra space is also required for the keys and the B+tree structure. The exact memory size is difficult to estimate, but you should expect to need at least twice the size of the bare input files.

The output option has a limitation: the output-spec for the first file must be MATCH (i.e., the first character of every -o option must be '1'). When no -o option is given, the default is -o 1xxx.... This differs from the normal join's default of -o 1111....

Core join and normal join yield the same result except for the output record sequence. In particular, when the first file is already sorted by the key, the results are exactly the same (assuming, of course, that the same -o option is in effect). In this case normal join is slightly faster.
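
The table-lookup character of core join can be sketched in Python. This is a simplified illustration, not Djoin's implementation: it handles two files and a single key field, uses a Python dict where the text above mentions a B+tree, assumes every record carries the key field, and models only the default -o 1x behavior. The function name core_join and the record layout (a list of (name, value) pairs) are assumptions made for the example.

```python
from collections import defaultdict

def core_join(key, first, subordinate):
    """Join `first` against an in-memory table built from `subordinate`."""
    def key_of(rec):
        # value of the key field; assumes the field is present
        return next(v for n, v in rec if n == key)

    # Read the whole subordinate file into memory, indexed by key.
    table = defaultdict(list)
    for rec in subordinate:
        table[key_of(rec)].append(rec)

    out = []
    # The first file alone determines the output sequence.
    for rec in first:
        matches = table.get(key_of(rec))
        if not matches:
            out.append(list(rec))  # 'x': unmatched record passes unchanged
            continue
        for m in matches:
            out.append(list(rec) + [(n, v) for n, v in m if n != key])
    return out

# Neither list needs to be sorted by countrycode.
cities = [
    [("city", "New York"), ("countrycode", "us")],
    [("city", "Tokyo"), ("countrycode", "jp")],
    [("city", "Nowhere"), ("countrycode", "zz")],  # made-up unmatched record
]
countries = [
    [("countrycode", "jp"), ("countryname", "Japan")],
    [("countrycode", "us"), ("countryname", "United States")],
]
joined = core_join("countrycode", cities, countries)
```

The records in joined follow the order of cities, and the record with the made-up code zz is carried over without a countryname field.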

OPTIONS

-c
core join; input files need not be in key field order.
-k key-flags
default key-flags for the key-field-list. These are applied to the key-field-list entries that have no flags of their own. See the manual of Dintro.
-o output-spec
selects the output records by matching type. When an output-spec is shorter than the number of input files, DON'T CARE is assumed for the remaining input files.
Instead of x, you may use any character (except 1 and 0) for the DON'T CARE value, but x is recommended for readability.
The default is '1111...' for the normal case, and '1xxx...' for the core join case.
-D [i/o]datautf=8|16|32
UTF I/O feature (see the manual page of the UTF I/O feature).

ENVIRONMENT

Ddatautf, Didatautf, Dodatautf
for UTF I/O feature.

EXAMPLES

Code table lookup

Assume a file "countrycode.d" contains records like:

countrycode:jp
countryname:Japan

countrycode:us
countryname:United States

File "city.d" has records like:

city:Tokyo
countrycode:jp

city:New York
countrycode:us

...

If both files are sorted by "countrycode", then

Djoin -o 1x countrycode city.d countrycode.d

adds the "countryname" field to the "city.d" records:

city:Tokyo
countrycode:jp
countryname:Japan

city:New York
countrycode:us
countryname:United States

...

When a countrycode is not found in the "countrycode.d" file, the "city.d" record is output unchanged.

When the files are not sorted by countrycode, you may use

Djoin -c countrycode city.d countrycode.d

In this case, the output record sequence follows that of the "city.d" file.

Stopwords elimination

Assume file "stopwds.d" contains records like:

wd:of

wd:the

...

and is sorted by wd:f (case-insensitive alphabetical order).

File "words.d" has records like:

wd:Djoin

wd:joins

wd:the

...

To pick out only the non-stop words:

Dsort wd:f words.d | Djoin -o 10 wd:f - stopwds.d

Unmatch detection

Output only unmatched records from two input files:

Djoin -o 10 -o 01 key-field input-file-1 input-file-2

Cross product

The key-field-list can be a null list:

Djoin "" input-file-1 input-file-2

This operation produces the cross product of input-file-1 and input-file-2. Note, however, that you need enough memory to hold both files.

Record sequence check

To check whether a file is sorted in key sequence, you can use Djoin with a null file. The null file is /dev/null on Unix and NUL in the Windows shell.

Djoin key input-file /dev/null
Djoin key input-file NUL

Note that Dsort has no sequence check function like that of the Unix sort -c command.
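
Conceptually, a sequence check compares each key with its predecessor. The following Python sketch illustrates the idea only: the function name first_out_of_sequence is hypothetical, and plain string comparison is assumed, whereas Djoin's actual comparison depends on the key flags and is not reproduced here.

```python
def first_out_of_sequence(keys):
    """Return the index of the first key that is smaller than its
    predecessor, or None when the keys are in non-descending order."""
    for i in range(1, len(keys)):
        if keys[i] < keys[i - 1]:
            return i
    return None

sorted_keys = ["jp", "jp", "us"]    # equal neighbouring keys are allowed
unsorted_keys = ["us", "jp", "us"]

ok_result = first_out_of_sequence(sorted_keys)
bad_result = first_out_of_sequence(unsorted_keys)
```

For the sorted list the function returns None; for the unsorted list it returns the position of the offending key.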

DIAGNOSTICS

See the manual of D_msg.

SEE ALSO

Dintro, Dpaste, Dsort, D_msg.

AUTHOR

MIYAZAWA Akira

miyazawa@nii.ac.jp
2003-2004