SimSet: Similarity based multi-clustering and itemSET mining
Coded by Takeaki Uno, e-mail:uno@nii.jp,
homepage: http://research.nii.ac.jp/~uno/index.html
This program is available for only academic use, basically. Anyone can modify this program, but he/she has to write down the change of the modification on the top of the source code. Neither contact nor appointment to Takeaki Uno is needed. If one wants to re-distribute this code, do not forget to refer the newest code, and show the link to homepage of Takeaki Uno, to notify the news about codes for the users. For the commercial use, please make a contact to Takeaki Uno.
Simset inputs a set of itemsets (records, transactions), and find frequent patterns according to the clusters of the transactions. Simset first compute the similarity of the transactions, and then finds all maximal sets of transactions such that the similarity of any pair of its transactions is no less than the threshold (the sets are actually maximal cliques of the similarity graph in which pairs of transactions are connected when their similarity is no less than theta). Those sets of size less than gamma (given parameter) will be ignored. For each set, Simset then computes the items that appears in at least sigma % of the transactions (sigma is given as a parameter). The items are output as an itemset.
Compared to usual itemset mining, it does not find all the frequently appearing itemsets. However, the output itemsets by SimSet always corresponds to a cluster, thus can be considered to be meaningful.
==== How to Compile ====
Unzip the file into arbitrary directory, and execute "make". You then can see "simset" (or simset.exe), and other additional executable files which are components of SimSet, in the same directory.
==== Command Line Options ====
To execute SimSet, just type simset and give some parameters as follows.
% simset [command] [options] input-filename theta gamma [output-filename]
"%" is not needed to type. It is a symbol to represent command line. To see a simple explanation, just execute "simset" without parameters. An example is
%simset C test.dat 2 4 out
"input-filename" is the filename of the input file. The 1st letter of input-filename must not be '-'. Otherwise it is regarded as an option. The input file format is written below. "theta" is the similarity threshold, the pair of records (transactions) will be considered to be similar when the similarity is no less than theta. "gamma" is the threshold for the clique size. The cliques (similar transaction sets) of size less than "gamma" are ignored.
The first parameter [command] is composed of some letters, each of them corresponds to a command. This is given to the program to indicate the task and some other execution modes.
M: output intersection of each clique, instead of IDs of its members
when this command is given, SimSet outputs a frequent pattern.
Without this command, it outputs the ID of the transactions in each
clique.
O: repeatedly similarity clustering until convergence
repeatedly compute the similarity, according to the number of
common similar transactions. In this mode, after finding all pairs
of similar transactions, two transactions having [gamma] transactions
to those both are similar. Simset repeatedly compute the similarity
in this measure, until the convergence.
t: transpose the input database
with this command, simset transposes the input database so that the
i-th line is the item i, and the numbers in the line represents the
transactions that include the item i.
E: read edge list file
with this command, simset reads a file as a graph such that each line
is two vertex IDs of an edge. The sets to be compared are adjacency
lists of the vertices of the graph.
The following commands gives the similarity measure. When the following is given, the similarity measure is set to,
i (inclusion):
ratio of A's items are included in B
I (both-inclusion):
ratio of A's items are included in B, and ratio of B's items are
included in A.
S (set similarity max): |A \cap B| / max{|A|,|B|}
s (set similarity min): |A \cap B| / min{|A|,|B|}
T (intersection size): the size of their intersection
R (Resemblance): |A \cap B|/|A \cup B| >= [threshold]
C (cosign distance):
the inner product of their normalized vectors
Following to the first parameter (commands), we can give several options described below. Basically any option accompanies one or more values, and they are in the form "-? [num]" where ? is a letter. We can give no option.
-m,-M [num]: clustering cliques
after finding the cliques, SimSet finds all pairs of cliques whose
similarities are no less than [num]. SimSet then find the clusters
according to the graph given by the pairs, and output clusters
instead of cliques (or intersection of the members in cliques).
When -M is given, each connected component will be a cluster.
When -m is given, Simset first finds a maximal independent set as
a seed, and let other vertices (cliques) belong to the neighboring
one in the independent set. The similarity measure is given as
-MT. This means intersection size is used as the similarity measure.
-v [num]: specify majority threshold (default=0.5)
when M command is given, the items included in the transactions of
ratio at least [num] will be output. The default setting is 0.5.
-u [num]:ignore records of size larger than [num]
Simset ignores the records (transactions) of
-U [num]:ignore items of frequency larger than [num]
this is for ignoring the items that appear at least [num] records.
The items will be removed from all the records. When many such
items exist, many pairs of records will be quite similar, thus
to look the details, we remove such items.
-O [num]:specify the number of repetitions
this option is for O command. It given the maximum number of
repetition of computing the similarity, thus when the number of
execution of the repetitions, simset terminates the repetition
and goes to the next phase, finding the maximal cliques.
-, [char]: give the separator of the numbers in the output
the numbers in the output file are separated by the given
character [char].
-Q [filename]: replace the output numbers
according to the permutation table written in the file of
[filename], replace the numbers in the output. The numbers in the
file can be separated by any non-numeric character such as newline
character.
-W [dir]:specify working directory. All working files created during
the execution are created in the directory, and will be removed
after the execution of the program.
Examples)
- find all the pairs of records whose intersection size is no less
than 2, and find maximal cliques of size at least 4
%simset T test.dat 2 4 out
- find all the pairs of records whose consign distance is no less than
0.9, find maximal cliques of size at least 5, and output the items
appearing records of each clique with ratio at least 0.7.
%simset CM -v 0.7 test.dat 0.9 5 out
Each line (row) of the input file is a transaction. The items included in the transaction of ID i are listed in the (i+1)th line. The transaction ID begins from 0. The items must be numbers begin from 1. They can not be minus. The item numbers do not have to be used continuously, but notice that the program takes memory linear in the maximum item number. The separator of numbers can be any non-numeric letter, such as "," " " ":" "a", etc.
Example) ( "[EOF]" is the end of file )
0 1 2
1
2 3 4
4,1 2 3
2,1
[EOF]
When M command is not given, SimSet outputs the transaction ID's of each maximal clique. Each line of the output file corresponds to a maximal clique, and is a list of numbers that are ID's of the transactions. If M is given, SimSet outputs the items included in many transactions of each clique. For example, (1:0.91) (8:0.91) (2:0.73)
means that item 1 is included in 91% of the transactions in the clique, and item 8 is included in 91%, and item 2 is included in 73%. The items are sorted in the decreasing order of the ratio.