######################################################################
SimSet: Similarity based multi-clustering and itemSET mining
   Coded by Takeaki Uno,   e-mail:uno@nii.jp, 
        homepage:   http://research.nii.ac.jp/~uno/index.html
######################################################################

** This program is available for only academic use, basically.   **
** Anyone can modify this program, but he/she has to write down  **
** the change of the modification on the top of the source code. **
** Neither contact nor appointment to Takeaki Uno is needed.     **
** If one wants to re-distribute this code, do not forget to     **
** refer the newest code, and show the link to homepage          **
** of Takeaki Uno, to notify the news about codes for the users. **
** For the commercial use, please make a contact to Takeaki Uno. **

1. Function
2. Usage
3. Input File Format
4. Output File Format
5. Acknowledgments
6. References

######################
####   Function   ####
######################

Simset inputs a set of itemsets (records, transactions), and find
frequent patterns according to the clusters of the transactions.
Simset first compute the similarity of the transactions, and then
finds all maximal sets of transactions such that the similarity of any
pair of its transactions is no less than the threshold (the sets are
actually maximal cliques of the similarity graph in which pairs of
transactions are connected when their similarity is no less than
theta). Those sets of size less than gamma (given parameter) will be
ignored. For each set, Simset then computes the items that appears in
at least sigma % of the transactions (sigma is given as a parameter).
The items are output as an itemset.
Compared to usual itemset mining, it does not find all the frequently
appearing itemsets. However, the output itemsets by SimSet always
corresponds to a cluster, thus can be considered to be meaningful.


#####################
####    Usage    ####
#####################

    ==== How to Compile ====

Unzip the file into arbitrary directory, and execute "make". You then
can see "simset" (or simset.exe), and other additional executable files
which are components of SimSet, in the same directory.


    ==== Command Line Options  ====

To execute SimSet, just type simset and give some parameters as follows.

% simset [command] [options] input-filename theta gamma [output-filename]

"%" is not needed to type. It is a symbol to represent command line.
To see a simple explanation, just execute "simset" without parameters.
An example is

%simset C test.dat 2 4 out

"input-filename" is the filename of the input file. The 1st letter of
input-filename must not be '-'. Otherwise it is regarded as an option.
The input file format is written below. "theta" is the similarity
threshold, the pair of records (transactions) will be considered to be
similar when the similarity is no less than theta. "gamma" is the
threshold for the clique size. The cliques (similar transaction sets) 
of size less than "gamma" are ignored.

The first parameter [command] is composed of some letters, each of 
them corresponds to a command. This is given to the program to
indicate the task and some other execution modes.

 %:show progress, 
 _:no message is shown
 +:write solutions to the output file with appending mode, thus the 
   output file is preserved, and output is written from the last of the
   file
 =:do not remove temporal files that are made by simset
 @:do not execute data polishing, thus directly enumerate the maximal
   cliques in the similarity graph.
 M: output intersection of each clique, instead of IDs of its members
   when this command is given, SimSet outputs a frequent pattern. 
   Without this command, it outputs the ID of the transactions in each
   clique.
 O: repeatedly similarity clustering until convergence
   repeatedly compute the similarity, according to the number of 
   common similar transactions. In this mode, after finding all pairs
   of similar transactions, two transactions having [gamma] transactions
   to those both are similar. Simset repeatedly compute the similarity
   in this measure, until the convergence.
 t: transpose the input database
   with this command, simset transposes the input database so that the
   i-th line is the item i, and the numbers in the line represents the
   transactions that include the item i.
 M:output intersection of each clique, instead of IDs of its members
 v (with M): output ratio of records, including each item
 m:do not remove edges in the data polishing phase
 O:repeatedly similarity clustering until convergence
 
The following commands gives the similarity measure. When the following
is given, the similarity measure is set to,

 i (inclusion): 
  ratio of A's items are included in B
 I (both-inclusion): 
  ratio of A's items are included in B, and ratio of B's items are
  included in A.
 S (set similarity max): |A \cap B| / max{|A|,|B|}
 s (set similarity min): |A \cap B| / min{|A|,|B|}
 T (intersection size): the size of their intersection
 R (Resemblance): |A \cap B|/|A \cup B| 
 C (cosign distance):
  the inner product of their normalized vectors
 P (PMI): log (|A\cap B|*|all| / |A|*|B|)
    (where |all| is the number of all items)

Following to the first parameter (commands), we can give several
options described below. Basically any option accompanies one or more
values, and they are in the form "-? [num]" where ? is a letter.
We can give no option. 

 -m,-M [num]: clustering cliques
   after finding the cliques, SimSet finds all pairs of cliques whose
   similarities are no less than [num]. SimSet then find the clusters
   according to the graph given by the pairs, and output clusters
   instead of cliques (or intersection of the members in cliques).
   When -M is given, each connected component will be a cluster.
   When -m is given, Simset first finds a maximal independent set as
   a seed, and let other vertices (cliques) belong to the neighboring
   one in the independent set. The similarity measure is given as
   -MT. This means intersection size is used as the similarity measure.
 
 -v [num]: specify majority threshold (default=0.5)
   when M command is given, the items included in the transactions of
   ratio at least [num] will be output. The default setting is 0.5.
 -l [num]:ignore vertices of degree less than [num]
 -u [num]:ignore vertices of degree more than [num]
 -U [num]:ignore items of frequency larger than [num]
   this is for ignoring the items that appear at least [num] records.
   The items will be removed from all the records. When many such
   items exist, many pairs of records will be quite similar, thus 
   to look the details, we remove such items.
-L [num]:ignore transactions of size less than [num] (with -G)
-I [num]:ignore items of frequency more than [num] (with -G)
-i [num]:ignore items of frequency less than [num] (with -G)
-T [num]:ignore pairs whose intersection size is less than [num] (only for first phase, with -G option)

 -O [num]:specify the number of repetitions
   this option is for O command. It given the maximum number of
   repetition of computing the similarity, thus when the number of 
   execution of the repetitions, simset terminates the repetition 
   and goes to the next phase, finding the maximal cliques.
 -, [char]: give the separator of the numbers in the output
    the numbers in the output file are separated by the given
    character [char].
 -Q [filename]: replace the output numbers 
    according to the permutation table written in the file of
    [filename], replace the numbers in the output. The numbers in the
    file can be separated by any non-numeric character such as newline
    character. 
 -W [dir]:specify working directory. All working files created during
   the execution are created in the directory, and will be removed
   after the execution of the program.

Examples)

- find all the pairs of records whose intersection size is no less
 than 2, and find maximal cliques of size at least 4

%simset T test.dat 2 4 out

- find all the pairs of records whose consign distance is no less than
 0.9, find maximal cliques of size at least 5, and output the items
 appearing records of each clique with ratio at least 0.7.

%simset TM -v 0.7 test.dat 0.9 5 out



###############################
####   Input File Format   ####
###############################

Each line (row) of the input file is a transaction. The items included
in the transaction of ID i are listed in the (i+1)th line. The
transaction ID begins from 0. The items must be numbers begin from 1.
They can not be minus. The item numbers do not have to be used
continuously, but notice that the program takes memory linear in the
maximum item number. The separator of numbers can be any non-numeric
letter, such as ","  " " ":" "a", etc.

Example)  ( "[EOF]" is the end of file )
0 1 2
1
2 3 4
4,1 2 3
2,1
[EOF]



##################################
####    Output File Format    ####
##################################

When M command is not given, SimSet outputs the transaction ID's 
of each maximal clique. Each line of the output file corresponds to a 
maximal clique, and is a list of numbers that are ID's of the
transactions. If M is given, SimSet outputs the items included in 
many transactions of each clique. For example, 
(1:0.91) (8:0.91) (2:0.73) 
means that item 1 is included in 91% of the transactions in the
clique, and item 8 is included in 91%, and item 2 is included in 73%.
The items are sorted in the decreasing order of the ratio.


#############################
####   Acknowledgments   ####
#############################



########################
####   References   ####
########################




// E: read edge list file
//   with this command, simset reads a file as a graph such that each line
//   is two vertex IDs of an edge. The sets to be compared are adjacency
//   lists of the vertices of the graph.
