Djoin [ -c ] [ -k key-flags ] [ -o output-spec ] key-field-list input-file [ input-file ... ]
Djoin joins the input files on the key values of the fields named in key-field-list. When only one input-file is given, standard input is used as the second input file. More than two input-files may be given, in which case all of them are joined together. Unless the -c option is given, all input files must be sorted in key-field-list order; otherwise Djoin reports an error and terminates.
All input files are read in parallel. Records that have the same key value are grouped, and the output record or records are composed from each group as follows.
To demonstrate the above rules with examples, we introduce the Dl constant notation for presenting D-records.
{ a:1 b:2 }
is the same as
a:1
b:2
which represents a D-record with field "a" value 1 and field "b" value 2.
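The correspondence between the two forms can be sketched in Python. This is only an illustration of the notation as used in the examples here; the helper name dl_to_lines is hypothetical, and the sketch assumes whitespace-separated field:value tokens with no spaces inside a value.

```python
def dl_to_lines(constant):
    """Expand one Dl constant such as '{ a:1 b:2 }' into 'field:value' lines.

    Handles only the simple form used in the examples: whitespace-separated
    field:value tokens, with no spaces inside a value (an assumption).
    """
    tokens = constant.split()
    if len(tokens) < 2 or tokens[0] != "{" or tokens[-1] != "}":
        raise ValueError("expected a constant of the form '{ field:value ... }'")
    lines = []
    for token in tokens[1:-1]:
        field, sep, value = token.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in {token!r}")
        lines.append(f"{field}:{value}")
    return lines

print("\n".join(dl_to_lines("{ a:1 b:2 }")))
```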
When the file1 has { k:K a:1 }, the file2 has { b:1 k:K } and the file3 has { c:1 k:K }, then
Djoin k file1 file2 file3
will produce the record
{ k:K a:1 b:1 c:1 }
When the input records are:
file1: { k:K a:1 }, { k:K a:2 }
file2: { k:K b:1 }
file3: { k:K c:1 }, { k:K c:2 }
output records are
{ k:K a:1 b:1 c:1 }
{ k:K a:1 b:1 c:2 }
{ k:K a:2 b:1 c:1 }
{ k:K a:2 b:1 c:2 }
in this order.
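One way to picture this grouping is as a nested loop over the files in command-line order, with the last file varying fastest. The following Python sketch models that behavior; it is not Djoin's implementation. The function name join_group, the list-of-pairs record representation, and the choice to emit key fields first are assumptions made for illustration.

```python
from itertools import product

def join_group(key_fields, *groups):
    """Combine records sharing one key value; each group is one file's records.

    A record is a list of (field, value) pairs. Key fields are taken from
    the first file's record; non-key fields are appended file by file.
    """
    out = []
    for combo in product(*groups):  # last file varies fastest
        record = [(f, v) for f, v in combo[0] if f in key_fields]
        for rec in combo:
            record.extend((f, v) for f, v in rec if f not in key_fields)
        out.append(record)
    return out

file1 = [[("k", "K"), ("a", "1")], [("k", "K"), ("a", "2")]]
file2 = [[("k", "K"), ("b", "1")]]
file3 = [[("k", "K"), ("c", "1")], [("k", "K"), ("c", "2")]]

for rec in join_group({"k"}, file1, file2, file3):
    print("{ " + " ".join(f"{f}:{v}" for f, v in rec) + " }")
```

With two records in file1, one in file2, and two in file3, the sketch yields 2 x 1 x 2 = 4 combined records.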
Note that Djoin does not check whether the joined files share field names. When they share a field name, that field becomes a repeating field in the output. To distinguish such fields, rename them beforehand.
By default, Djoin outputs only fully matched records (that is, key groups that contain records from all input files) or, in the core join case (-c), all records from the first file. With an output-spec given by the -o option, however, an arbitrary combination of matches can be selected.
Output-spec is a string whose i-th character corresponds to the i-th input-file. Each character has the following meaning:
There may be more than one -o option. In this case, the output is the union of the records selected by all output-specs.
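The selection logic can be modeled in a few lines of Python. The character meanings used here are an assumption inferred from the examples in this manual ('1' = the file has a record for the key, '0' = it has none, 'x' = either); the function names are hypothetical and this is a sketch, not Djoin's code.

```python
def spec_matches(spec, present):
    """Return True if one output-spec accepts a key group.

    present[i] is True when the i-th input file has a record for the key.
    Assumed meanings: '1' = must be present, '0' = must be absent,
    'x' = either.
    """
    for ch, has in zip(spec, present):
        if ch == "1":
            ok = has
        elif ch == "0":
            ok = not has
        elif ch == "x":
            ok = True
        else:
            raise ValueError(f"unknown output-spec character: {ch!r}")
        if not ok:
            return False
    return True

def group_is_output(specs, present):
    """Several -o options act as a union: any accepting spec suffices."""
    return any(spec_matches(s, present) for s in specs)

# Two files, specs "10" and "01": only unmatched groups are kept.
print(group_is_output(["10", "01"], [True, False]))  # True
print(group_is_output(["10", "01"], [True, True]))   # False
```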
The join key is specified with the general key-field-list of D-commands. Two or more fields may be used, each with numeric, case-insensitive, or reverse-order matching.
However, there is no way to join D-files whose key fields have different names. To use Djoin, first give the key field the same name in all input files, using Drename or another command.
Under certain conditions, you may join files that are not sorted in key sequence. This is called core join and is invoked by the -c option. In a core join, the first input file has the special privilege of determining the output record sequence. The second and subsequent files are subordinate files: all of their records are read into memory and then looked up while the records of the first file are processed. Core join is usually used for table-lookup joins.
The conditions for using core join are:
You must have enough memory to hold the subordinate files. Note that characters are represented in an internal code, which takes two or four bytes per character depending on the operating system's environment. Note also that extra space is required for the keys and the B+tree structure. It is difficult to estimate the exact memory size, but you should expect to need at least twice the size of the bare input files.
The output option has a limitation. The output-spec for the first file must be MATCH (i.e., the first character of every -o option must be '1'). When no -o option is given, the default is -o 1xxx.... This differs from the normal join's default of -o 1111....
Core join and normal join yield the same result except for the output record sequence. In particular, when the first file is already sorted by the key, the results are exactly the same (assuming, of course, that the -o options have the same effect). In this case normal join is slightly faster.
Assume a file "countrycode.d" contains records like:
countrycode:jp
countryname:Japan
countrycode:us
countryname:United States
File "city.d" has records like
city:Tokyo
countrycode:jp
city:New York
countrycode:us
...
When both files are sorted by "countrycode",
Djoin -o 1x countrycode city.d countrycode.d
adds the "countryname" field to the "city.d" records:
city:Tokyo
countrycode:jp
countryname:Japan
city:New York
countrycode:us
countryname:United States
...
When a countrycode is not found in the "countrycode.d" file, the "city.d" record is output unchanged.
When the same files are not sorted by countrycode, you may use
Djoin -c countrycode city.d countrycode.d
In this case, the output record sequence follows the "city.d" file.
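The table-lookup behavior of core join can be sketched in Python as follows. This is a model under stated assumptions, not Djoin's implementation: records are plain dicts (so repeating fields are not represented), a dict of lists stands in for the in-memory B+tree, and only the default -o 1xxx... behavior is modeled. The name core_join is hypothetical.

```python
from collections import defaultdict
from itertools import product

def core_join(key, first, *subordinates):
    """Table-lookup join; output follows the order of `first`.

    Models the default -o 1xxx... behavior: every record of the first file
    is output, extended by whatever the subordinate files hold for the
    same key value.
    """
    tables = []
    for records in subordinates:      # subordinate files are held in memory
        table = defaultdict(list)
        for rec in records:
            table[rec[key]].append(rec)
        tables.append(table)

    for rec in first:                 # the first file is streamed
        # A file with no record for this key contributes one empty record.
        groups = [t.get(rec[key], [{}]) for t in tables]
        for combo in product(*groups):
            out = dict(rec)
            for extra in combo:
                out.update(extra)
            yield out

countries = [
    {"countrycode": "jp", "countryname": "Japan"},
    {"countrycode": "us", "countryname": "United States"},
]
# Deliberately not sorted by countrycode.
cities = [
    {"city": "New York", "countrycode": "us"},
    {"city": "Tokyo", "countrycode": "jp"},
]

for rec in core_join("countrycode", cities, countries):
    print(rec)
```

Because only the subordinate files are loaded into the lookup tables, the first file can be arbitrarily large and in any order.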
Assume file "stopwds.d" contains records like:
wd:of
wd:the
...
and is sorted by wd:f (case-insensitive alphabetical order).
File "words.d" has records like:
wd:Djoin
wd:joins
wd:the
...
To pick out only the words that are not stop words:
Dsort wd:f words.d | Djoin -o 10 wd:f - stopwds.d
To output only the unmatched records from two input files:
Djoin -o 10 -o 01 key-field input-file-1 input-file-2
The key-field-list can be a null list:
Djoin "" input-file-1 input-file-2
This operation produces the cross product of input-file-1 and input-file-2. Note, however, that you need enough memory to hold both files.
To check whether a file is sorted in key sequence, you can run Djoin with a null file as the second input. The null file is /dev/null on Unix and NUL in the Windows shell.
Djoin key input-file /dev/null
Djoin key input-file NUL
Note that Dsort has no sequence check function like that of the Unix sort -c command.
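For comparison, the kind of check involved can be sketched in Python. This is an illustration only: the function name first_out_of_order is hypothetical, records are modeled as dicts, and plain string comparison is assumed, which may differ from the collation Djoin actually applies.

```python
def first_out_of_order(records, key, transform=lambda v: v):
    """Return the index of the first record that breaks ascending key order.

    Returns None when the records are in order. Equal keys are allowed.
    `transform` stands in for key flags, e.g. str.lower for a
    case-insensitive comparison.
    """
    previous = None
    for i, rec in enumerate(records):
        current = transform(rec[key])
        if previous is not None and current < previous:
            return i
        previous = current
    return None

words = [{"wd": "Djoin"}, {"wd": "joins"}, {"wd": "the"}]
print(first_out_of_order(words, "wd", str.lower))            # None
print(first_out_of_order([{"wd": "the"}, {"wd": "of"}], "wd"))  # 1
```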
See the manual of D_msg.
MIYAZAWA Akira