SimSet: Similarity based multi-clustering and itemSET mining

Coded by Takeaki Uno,,

This program is available for only academic use, basically. Anyone can modify this program, but he/she has to write down the change of the modification on the top of the source code. Neither contact nor appointment to Takeaki Uno is needed. If one wants to re-distribute this code, do not forget to refer the newest code, and show the link to homepage of Takeaki Uno, to notify the news about codes for the users. For the commercial use, please make a contact to Takeaki Uno.



Simset inputs a set of itemsets (records, transactions), and find frequent patterns according to the clusters of the transactions. Simset first compute the similarity of the transactions, and then finds all maximal sets of transactions such that the similarity of any pair of its transactions is no less than the threshold (the sets are actually maximal cliques of the similarity graph in which pairs of transactions are connected when their similarity is no less than theta). Those sets of size less than gamma (given parameter) will be ignored. For each set, Simset then computes the items that appears in at least sigma % of the transactions (sigma is given as a parameter). The items are output as an itemset.
Compared to usual itemset mining, it does not find all the frequently appearing itemsets. However, the output itemsets by SimSet always corresponds to a cluster, thus can be considered to be meaningful.


==== How to Compile ====

Unzip the file into arbitrary directory, and execute "make". You then can see "simset" (or simset.exe), and other additional executable files which are components of SimSet, in the same directory.

==== Command Line Options ====

To execute SimSet, just type simset and give some parameters as follows.

% simset [command] [options] input-filename theta gamma [output-filename]

"%" is not needed to type. It is a symbol to represent command line. To see a simple explanation, just execute "simset" without parameters. An example is

%simset C test.dat 2 4 out

"input-filename" is the filename of the input file. The 1st letter of input-filename must not be '-'. Otherwise it is regarded as an option. The input file format is written below. "theta" is the similarity threshold, the pair of records (transactions) will be considered to be similar when the similarity is no less than theta. "gamma" is the threshold for the clique size. The cliques (similar transaction sets) of size less than "gamma" are ignored.

The first parameter [command] is composed of some letters, each of them corresponds to a command. This is given to the program to indicate the task and some other execution modes.

   M: output intersection of each clique, instead of IDs of its members
     when this command is given, SimSet outputs a frequent pattern.
     Without this command, it outputs the ID of the transactions in each
   O: repeatedly similarity clustering until convergence
     repeatedly compute the similarity, according to the number of
     common similar transactions. In this mode, after finding all pairs
     of similar transactions, two transactions having [gamma] transactions
     to those both are similar. Simset repeatedly compute the similarity
     in this measure, until the convergence.
   t: transpose the input database
     with this command, simset transposes the input database so that the
     i-th line is the item i, and the numbers in the line represents the
     transactions that include the item i.
   E: read edge list file
     with this command, simset reads a file as a graph such that each line
     is two vertex IDs of an edge. The sets to be compared are adjacency
     lists of the vertices of the graph.
The following commands gives the similarity measure. When the following is given, the similarity measure is set to,

   i (inclusion):
     ratio of A's items are included in B
   I (both-inclusion):
     ratio of A's items are included in B, and ratio of B's items are
     included in A.
   S (set similarity max): |A \cap B| / max{|A|,|B|}
   s (set similarity min): |A \cap B| / min{|A|,|B|}
   T (intersection size): the size of their intersection
   R (Resemblance): |A \cap B|/|A \cup B| >= [threshold]
   C (cosign distance):
     the inner product of their normalized vectors

Following to the first parameter (commands), we can give several options described below. Basically any option accompanies one or more values, and they are in the form "-? [num]" where ? is a letter. We can give no option.

   -m,-M [num]: clustering cliques
     after finding the cliques, SimSet finds all pairs of cliques whose
     similarities are no less than [num]. SimSet then find the clusters
     according to the graph given by the pairs, and output clusters
     instead of cliques (or intersection of the members in cliques).
     When -M is given, each connected component will be a cluster.
     When -m is given, Simset first finds a maximal independent set as
     a seed, and let other vertices (cliques) belong to the neighboring
     one in the independent set. The similarity measure is given as
     -MT. This means intersection size is used as the similarity measure.
   -v [num]: specify majority threshold (default=0.5)
     when M command is given, the items included in the transactions of
     ratio at least [num] will be output. The default setting is 0.5.
   -u [num]:ignore records of size larger than [num]
     Simset ignores the records (transactions) of
   -U [num]:ignore items of frequency larger than [num]
     this is for ignoring the items that appear at least [num] records.
     The items will be removed from all the records. When many such
     items exist, many pairs of records will be quite similar, thus
     to look the details, we remove such items.
   -O [num]:specify the number of repetitions
     this option is for O command. It given the maximum number of
     repetition of computing the similarity, thus when the number of
     execution of the repetitions, simset terminates the repetition
     and goes to the next phase, finding the maximal cliques.
   -, [char]: give the separator of the numbers in the output
     the numbers in the output file are separated by the given
     character [char].
   -Q [filename]: replace the output numbers
     according to the permutation table written in the file of
     [filename], replace the numbers in the output. The numbers in the
     file can be separated by any non-numeric character such as newline
   -W [dir]:specify working directory. All working files created during
     the execution are created in the directory, and will be removed
     after the execution of the program.


- find all the pairs of records whose intersection size is no less
   than 2, and find maximal cliques of size at least 4

%simset T test.dat 2 4 out

- find all the pairs of records whose consign distance is no less than
   0.9, find maximal cliques of size at least 5, and output the items
   appearing records of each clique with ratio at least 0.7.

%simset CM -v 0.7 test.dat 0.9 5 out

Input File Format

Each line (row) of the input file is a transaction. The items included in the transaction of ID i are listed in the (i+1)th line. The transaction ID begins from 0. The items must be numbers begin from 1. They can not be minus. The item numbers do not have to be used continuously, but notice that the program takes memory linear in the maximum item number. The separator of numbers can be any non-numeric letter, such as "," " " ":" "a", etc.

Example) ( "[EOF]" is the end of file )
0 1 2
2 3 4
4,1 2 3

Output File Format

When M command is not given, SimSet outputs the transaction ID's of each maximal clique. Each line of the output file corresponds to a maximal clique, and is a list of numbers that are ID's of the transactions. If M is given, SimSet outputs the items included in many transactions of each clique. For example, (1:0.91) (8:0.91) (2:0.73)
means that item 1 is included in 91% of the transactions in the clique, and item 8 is included in 91%, and item 2 is included in 73%. The items are sorted in the decreasing order of the ratio.