SSPC: Similar Set Pair Comparison

Coded by Takeaki Uno, e-mail:uno@nii.jp,

homepage: http://research.nii.ac.jp/~uno/index.html

*This program is available for only academic use, basically. Anyone can modify this program, but he/she has to write down the change of the modification on the top of the source code. Neither contact nor appointment to Takeaki Uno is needed. If one wants to re-distribute this code, do not forget to refer the newest code, and show the link to homepage of Takeaki Uno, to notify the news about this code for the users. For the commercial use, please make a contact to Takeaki Uno. *

Usage

InputFileFormat

UseGeneralNamesforVariablesandOtherFormats

BatchFilesforSimpleuse

OutputFormat

Performance

AlgorithmsandImplementationIssue

References

Let I={1,...,n} be a set of items. An itemset is a subset of I. Let D be a transaction database such that each record (called transaction) is an itemset.In the other words, D is a (multi) subset family defined on I, i.e., D = {T1,...,Tm}, s.t., each Ti is a subset of I. For a given threshold/ratio value theta, the problem is to find all pairs Ti and Tj of D, such that

1: (intersection) having theta common items, i.e. |Ti \cap Tj|\ge theta,

2: (inclusion) ratio theta items of Ti is included in Tj, i.e.,

|Ti \cap Tj| \ge theta |Ti|

3: (both-inclusion) ratio theta items of Ti is included in Tj, and ratio theta

items of Tj is included in Ti, i.e.,

|Ti \cap Tj| \ge theta |Ti|, and |Ti \cap Tj| \ge theta |Tj|

4: (resemblance): find pairs (A,B) s.t. |A \cap B|/|A \cup B| >= [threshold]

5: (cosign distance): find pairs s.t. the inner product of their

normalized vectors >= [threshold]

6: (set similarity max): find pairs A and B s.t. |A \cap B| / max{|A|,|B|}

7: (set similarity min): find pairs A and B s.t. |A \cap B| / max{|A|,|B|}

The input file is in the form such that the i-th line is the list of items included in Ti.

** ==== How to Compile ====**

Unzip the file into arbitrary directory, and execute "make". Then you can see "sspc" (or sspc.exe) in the same directory.

** ==== Command Line Options ====**

To execute SSPC, just type sspc and give some parameters as follows. (Execute without any option shows a brief introduction)

% sspc ISCfQq [options] input-filename ratio/threshold [output-filename]

"%" is not needed to type. It is a symbol to represent command line.

"input-filename" is the filename of the input transaction database (D in the above). The 1st letter of input-filename must not be '-'. Otherwise it is regarded as an option. The input file format is written below. "ratio/threshold" is a real-number/integer which is used as the ratio of inclusion, similarity/ threshold value for the minimum size of the intersection. "output-filename" is the name of file to write the pairs of transactions the program finds. You can omit the output file to see only the number of pairs. If the output file name is "-", the solutions will be output to the standard output.

The first parameter is given to the program to indicate the task.

i (inclusion):

find pairs (A,B) s.t. [ratio] of A's items are included in B

I (both-inclusion):

find pairs (A,B) s.t. [ratio] of A's items are included in B, and

[ratio] of B's items are included in A.

S (set similarity max):

find pairs A and B s.t. |A \cap B| / max{|A|,|B|}

s (set similarity min):

find pairs A and B s.t. |A \cap B| / min{|A|,|B|}

T (intersection size):

find pairs s.t. the size of their intersection is no less than [threshold]

R (Resemblance):

find pairs (A,B) s.t. |A \cap B|/|A \cup B| >= [threshold]

C (cosign distance):

find pairs s.t. the inner product of their normalized vectors >= [threshold]

For example, if you want to find all pairs having intersection of size at least 30, type "T" in the first parameter. Additionally, we can give the following commands, to specify the output style:

f,Q: output the ratio/size of each pair following/preceding to the pair

q: no output to standard output (including messages w.r.t. input data)

V: show progress of the computation

N: (valid only with -c option) normalize the ID of latter sets, i.e.,

the ID of latter sets will start from 0.

t: transpose the database so that item i will be transaction i, thus

if item i is included in j-th transaction, then item j will be

included in i-th transaction.

The output format is explained below. The following options are to restrict the itemset to be found. We can type them between the first parameter and the second parameter. Of course the option can be nothing.

-K [num]: output the threshold/ratio of [num]-th largest pair

-w [filename]: read weights of items from the file; with weights, the

size of transactions are evaluated by the sum of weights of the items

it includes

-l,-u [num]: ignore the transactions with size (item weight sum)

less than [num]

-L,-U [num]: ignore the items included in more/less than [num] transactions

-c [num]:compare transactions of IDs less than num and the others

-S [num]: stop after outputting [num] solutions

-, [char]: give the separator of the numbers in the output

the numbers in the output file are separated by the given

character [char].

-Q [filename]: replace the output numbers

according to the permutation table written in the file of

[filename], replace the numbers in the output. The numbers in the

file can be separated by any non-numeric character such as newline

character.

If we specify "-w" and filename, then SSPC reads the file as a weight database of transactions(items, in the term of D). Each i-th line of the file is regarded as the weight of i-th transaction(items, in D). With transaction weights, the size of intersection is evaluated by the weight sum of the items in the intersection. Transaction A is included in B if the weight sum of the intersection of A and B is no less than the

(ratio)\times(the weight sum of items in A).

The weight can be real value with negative weights, but it does not admit the style "3.25E03".

When we give -K option, SSPC computes the pair which has [num]-th largest intersection/inclusion ratio/similarity, and outputs the size/ratio/similarity. ("Similarity" is the minimum of the inclusion ratio of A for B, and that of B for A). By giving the threshold/ratio/similarity to SSPC, we can enumerate top-[num] pairs.

If we give -S option, then SSPC terminates immediately after finding [num] pairs.

Examples)

**- **Find pairs I(j) and I(j') having intersection of size at least 10 in

"test.dat", and output the pairs to "out.dat". Stop if 1000 pairs will be

found. Ignore j such that I(j) includes less than 10 items.

% sspc C -S 1000 -l 10 test.dat 10 out.dat

**- **Find pairs I(j) and I(j') such that 80% of the items in I(j) is

included in I(j') with weight file "weight.dat", and output the pairs

in "out.dat" with their inclusion ratio following to the pair.

Output nothing to the standard output.

% sspc Ifq -w weight.dat test.dat 0.8 out.dat

Each line (row) of the input file is a transaction(in the term of D, I(j)). The items included in a transaction (transactions in I(j)) are listed in a line. The items must be numbers begin from 1. They can not be minus. The item numbers do not have to be continuous, but notice that the program takes memory linear in the maximum item number. The separator of numbers can be any non-numeric letter, such as "," " " ":" "a", etc.

Example) ( "[EOF]" is the end of file )

0 1 2

1

2 3 4

4,1 2 3

2,1

[EOF]

We can transform variable names in general strings to numbers so that we can input the data to the program, by some script files.

**- ** sortout < input-file > output-file

Sort the file in the lexicographical order. It is used to sort the output file.

**- ** transnum.pl table-file [separator] < input-file > output-file

Read file from standard input, and give a unique number to each name written by general strings (such as ABC, ttt), and transform every string name to a number, and output it to standard output. The mapping from string names to numbers is output to table-file. The default character for the separator of the string names is " "(space). It can be changed by giving a character for the option [separator]. For example, A,B is a string name, if the separator is space, but if we set the separator to ",", it is regarded as two names A and B. This is executed by "transnum.pl table-file "," < input-file...".

**- ** untransnum.pl table-file < input-file > output-file

According to the table-file output by transnum.pl, un-transform numbers to string names. The output of the program is composed of numbers, thus it is used if we want to transform to the original string names. It reads file from standard output, and output to the standard output.

**- ** appendnum.pl < input-file > output-file

When we want to distinct the same words in different columns, use this script. This append column number to each words, so we can distinct them. Then, by using transnum.pl, we transform the strings to numbers.

**- ** transpose.pl < input-file > output-file

Transpose the file. In the other words, consider the file as an adjacency matrix, and output the transposed matrix, or exchange the positions of items and transactions. For an input file, output the file in which the i-th line corresponds to item i, and includes the numbers j such that i is included in the j-th line of the input file. This script is used to convert transaction database D to D'.

**- ** 01conv.pl [separator] < input-file > output-file

Transform the transaction database written in the style of 01 matrix to our input style. The file has to be a list of 0 and 1, separated by [separator]. If [separator] is omitted, the default separator " " is used.

For general string names, we have several batch files scripts for basic usages "exec_sspc", "exec_sspc_", "sep_sspc", or "sep_sspc_". For example, when a database with "general item names" is,

dog pig cat

cat mouse

cat mouse dog pig

cow horse

horse mouse dog

[EOF]

All these replace strings in the input database by numbers, execute SSPC, and replace the numbers in the output file by the original strings. The usage of the scripts are

% exec_sspc [commands] input-filename ratio/threshold output-filename [options] You have to give the commands in the first parameter, but the options have to locate at the end. The separator of the items is " " (blank, space). If you want to use other character as a separator, use "sep_sspc". The usage is

% sep_sspc separator [commands] input-filename ratio/threshold output-filename [options]

It is almost same as "exec_sspc" but you have to specify separator at the fourth parameter. "exec_sspc_" and "sep_sspc_" are both for the aim to distinct the same items in the different columns. For example, it is used to the database such that different items are there in different columns, but some special symbols, such as "- is for missing data", are used commonly. An example is;

A true small

C true -

A false middle

B - -

C - middle

A true -

[EOF]

In the output file, the items are followed by "." and numbers where the numbers are the column number. For example, "dog.0" means the item "dog" on the 0th(first) column.

The usage of them are the same as "exec_sspc" and "sep_sspc", respectively. All these scripts use files of the names "__tmp1__", "__tmp2__", and "__tmp3__". The files of these names will be deleted after the execution.

Example)

% exec_sspc IQ test2.dat 0.7 out.dat -w weight.dat -l 2

% sep_sspc_ "," Cf test3.dat 3 out.dat -S 1000000

When the program is executed, the program prints out the #items, #transactions, and other features of the input database to standard error. After the termination of the enumeration, it outputs the total number of pairs found. If "q" is given in the first parameter, these do not appear in the standard output.

If output-filename was given, then the pairs found are written to the output file. Each line of the output file is the pair j and j' separated by " " such that I(j) and I(j') satisfies the given condition. If "f"/"Q" is given in the first parameter, the size/ratio of the pair is written following/preceding to the pair. For example,

4 13 (22)

means that the size of the intersection of I(4) and I(13) is 22.

0.8 : 232 45

means that 80% of I(232) is included in I(45) (in the sense of weighted version), or the similarity of I(232) and I(45) is 80%. The numbers in each line is separated by " ", and by giving "-," option we can change the separator. The output pairs are not sorted. If you want to sort it, use the script "sortout.pl". The usage is just,

% sortout.pl < input-file > output-file

"input-file" is the name of file to which SSPC outputs, and the sorted output will be written in the file of the name "output-file".

The performance of SSPC is pretty stable, for both computation time and memory use. The initialization and preprocess time of SSPC is linear in the size of input database. The computation time for finding pairs depends on the sparsity of the database. In exact, memory usage of SSPC is very stable. The memory usage is always linear in the size of the input database. Approximately SSPC uses integers at most two times as much as the database size, which is the sum of the number of items of each transaction.

The basic idea of the algorithm is motivated by LCM, the computation done in frequent itemset mining. Let D be a transaction database, 1,...,n be items, T1,...,Tm be the transactions in D. I(j) denotes the set of transactions including the item j. The frame work of the algorithm

SSPC (D:database, th:ratio/threshold )

1. sort the items in the increasing order of |I(j)|.

2. Compute I(j) for each i

3. For each item j<i, compute the size of the intersection of I(i) and I(j)

4. If i and j satisfying the given condition, output the pair i and j

The condition in step 4 can be checked easily by using the intersection, |I(i)\cap I(j)|. In step 2, we compute I(j) by using the following algorithm.

Delivery(S: set of transactions, i:item)

1. Set B[j] := empty set for each i

2. For each T in S

3. For each j<i in T, insert T to B[j]

By giving D and n to this algorithm, B[j] is equal to I(j) for any j after the termination. It takes time linear in the size of D. In step 3 of SSPC, we compute the intersection of I(i) and I(j) for all j. It can be also done by Delivery, efficiently. We give I(i) and i to Delivery, then the after the termination B[j] is equal to I(j) \cap I(i). The computation time is ||I(i)_{\le i}|| where ||I(i)|| is the sum of the size of the transactions in I(i). Thus, the total computation time is sum_{i=1}^n ||I(i)_{\le i}||. If almost all transactions have sizes at most a constant c, then the computation time is almost equal to c||D||. Thus, if the database is very sparse, and satisfies the zip law, the computation time will be very short.