acla readme

ACLA: Approximate Cover Listing Algorithm

Coded by Takeaki Uno, e-mail:uno@nii.jp,
homepage: http://research.nii.ac.jp/~uno/index.html

This program is available for only academic use, basically. Anyone can modify this program, but he/she has to write down the change of the modification on the top of the source code. Neither contact nor appointment to Takeaki Uno is needed. If one wants to re-distribute this code, do not forget to refer the newest code, and show the link to homepage of Takeaki Uno, to notify the news about codes for the users. For the commercial use, please make a contact to Takeaki Uno.

ProblemDefinition
Usage
InputFileFormat
UseGeneralNamesforVariablesandOtherFormats
BatchFilesforSimpleuse
OutputFormat
Performance
SolvingOtherProblems
AlgorithmsandImplementationIssue
Acknowledgments
References

Problem Definition

Let I be a set of items. An itemset is a subset of I. Let D be a transaction database such that each record (called transaction) is an itemset. The coverage of an itemset is the number of transactions including at least one item of the itemset. For a given threshold number t, an itemset is called an approximate cover if the coverage is no less than t. An itemset is also called covering. For given a database D, threshold t, size s, the problem ACLA solves is to enumerate all approximate coverings in D of size equal to s. Actually, ACLA does not output some approximate coverings which include smaller approximate coverings, because of redundancy.

Usage

==== How to Compile ====

Unzip the file into arbitrary directory, and execute "make". Then you can see "acla" (or acla.exe) in the same directory.

==== How to Use ====

To execute ACLA, just type acla and give some parameters as follows.

% acla FCfQIVq [options] input-filename threshold size [output-filename]

"%" is not needed to type. It is a symbol to represent command line. To see a simple explanation, just execute "acla" without parameters.

"input-filename" is the filename of the input transaction database. The 1st letter of input-filename must not be '-'. Otherwise it is regarded as an option. The input file format is written below. "threshold" is the threshold, which is for given the threshold of approximate covering. "size" is the size of approximate covering to be output. "output-filename" is the name of file to write the coverings the program finds. You can omit the output file to see only the number of approximate coverings in the database. If the output file name is "-", the solutions will be output to the standard output.

The first parameter is given to the program to indicate the task.
   F: enumerate approximate coverings of given size
   C: Enumerate approximate coverings of size at most given number
   f: output the coverage on the end of each covering found,
   Q: output the coverage on the head of each covering found,
   q: no output to standard output (including messages w.r.t. input data)
   I: output ID's of transactions covered by the approximate cover
   V: show the progress of computation
   t: transpose the database so that item i will be transaction i, thus
     if item i is included in j-th transaction, then item j will be
     included in i-th transaction.

We always have to give "F" or "C" parameter. The difference of "F" and "C" is that in the case of "F", ACLA outputs only coverings of the size equal to the given number, while it outputs smaller coverings if it satisfies the threshold conditions, and not output a part of its supersets.

The output format is explained below. The following options are to restrict the coverings to be found. We can type them between the first parameter and the second parameter. Of course the option can be nothing.

   -U [num]: enumerate coverings with coverage at most [num]
   -w [filename]: read weights of transactions from the file
   -K [num]: output the coverage of the approximate covering having the
     [num]-th largest coverage
   -S [num]: stop after outputting [num] solutions
   -o [num]: lower bound for the positive weight sum of the occurrences
     ACLA outputs only itemsets such that the sum of positive transaction
     weights in the occurrence is no less than [num]
   -O [num]: upper bound for the positive weight sum
   -n [num]: lower bound for the negative weight sum of the occurrences
     ACLA outputs only itemsets such that the sum of negative transaction
     weights in the occurrence is no less than [num]
   -N [num]: upper bound for the negative weight sum
   -, [char]: give the separator of the numbers in the output
     the numbers in the output file are separated by the given
     character [char].
   -Q [filename]: replace the output numbers
     according to the permutation table written in the file of
     [filename], replace the numbers in the output. The numbers in the
     file can be separated by any non-numeric character such as newline
     character.

-U options specifies the upper bound of coverage. Approximate coverings with coverage larger than [num] will not be output. coverings exceed the upper/lower bound will be not output. When we give -S option and a number [num], the program will terminate if it finds [num] solutions, even if there are more solutions.

If we specify "-w" and filename, then ACLA reads the file as a weight database of transactions. Each i-th line of the file is regarded as the weight of i-th transaction. With transaction weights, ACLA finds the coverings such that the sum of weights of the transaction covered by the covering is no less than the threshold. The weights can be real numbers. We can give negative weights.

When we give -K option, ACLA computes the coverage of the covering having [num]-th largest coverage, and output it to the standard output (print it to the command line). By giving the output coverage to ACLA, we can enumerate the top-[num] coverage coverings.

Examples)

- Find all approximate coverings in "test.dat" for threshold 4, and of
size 3, and output coverage of each coverings found. Do not output
to file. Show progress, and stop if 1,000,000 solutions will be found.

% acla FfV -l 3 -S 1000000 test.dat 4 3

- Find all approximate coverings in "test.dat" with weight file
     "weight.dat" with coverage at least 8, of size 3, and output to "out.dat"
     with transaction IDs, and no output for standard output.

% acla MqI -w weight.dat test.dat 8 3 out.dat

Input File Format

Each line (row) of the input file is a transaction. The items included in a transaction are listed in a line. The items must be numbers begin from 1. They can not be minus. The item numbers do not have to be continuous, but notice that the program takes memory linear in the maximum item number. The separator of numbers can be any non-numeric letter, such as "," " " ":" "a", etc.

Example) ( "[EOF]" is the end of file )
0 1 2
1
2 3 4
4,1 2 3
2,1
[EOF]

The file format of item constraints (or complement) graphs is as follows. The ith row (line) of the file corresponds to node i-1. The first line corresponds to the node 0, and the 10th line corresponds to node 9. The node larger than i-1 and adjacent to i-1 is listed in the i-th line. The nodes are written by numbers. Separator to the number can be ",", but the graph load routine accepts any letter for the separator but not a number, +, and -. If the edge set is {(0,1),(0,2),(1,1),(1,3),(2,3)}, the file will be

===========
1,2
1 3
3

[EOF]
=========

where "[EOF]" is a symbol for the end of file.

Use General Names for Variables and Other Formats

We can transform variable names in general strings to numbers so that we can input the data to the program, by some script files.

- sortout < input-file > output-file
Sort the file in the lexicographical order. It is used to sort the output file.

- transnum.pl table-file [separator] < input-file > output-file
Read file from standard input, and give a unique number to each name written by general strings (such as ABC, ttt), and transform every string name to a number, and output it to standard output. The mapping from string names to numbers is output to table-file. The default character for the separator of the string names is " "(space). It can be changed by giving a character for the option [separator]. For example, A,B is a string name, if the separator is space, but if we set the separator to ",", it is regarded as two names A and B. This is executed by "transnum.pl table-file "," < input-file...".

- untransnum.pl table-file < input-file > output-file
According to the table-file output by transnum.pl, un-transform numbers to string names. The output of the program is composed of numbers, thus it is used if we want to transform to the original string names. It reads file from standard output, and output to the standard output.

- appendnum.pl < input-file > output-file
When we want to distinct the same words in different columns, use this script. This append column number to each words, so we can distinct them. Then, by using transnum.pl, we transform the strings to numbers.

- transpose.pl < input-file > output-file
Transpose the file. In the other words, consider the file as an adjacency matrix, and output the transposed matrix, or exchange the positions of items and transactions. For an input file, output the file in which the i-th line corresponds to item i, and includes the numbers j such that i is included in the j-th line of the input file.

- 01conv.pl [separator] < input-file > output-file
Transform the transaction database written in the style of 01 matrix to our input style. The file has to be a list of 0 and 1, separated by [separator]. If [separator] is omitted, the default separator " " is used.

Batch Files for Simple use

For general string names, we have several batch files scripts for basic usages "exec_acla", "exec_acla_", "sep_acla", or "sep_acla_". For example, when a database with "general item names" is,

dog pig cat
cat mouse
cat mouse dog pig
cow horse
horse mouse dog
[EOF]

All these replace strings in the input database by numbers, execute ACLA, and replace the numbers in the output file by the original strings. The usage of the scripts are

% exec_acla [commands] input-filename threshold size output-filename [options]

You have to give the commands in the first parameter, but the options have to locate at the end. The separator of the items is " " (blank, space). If you want to use other character as a separator, use "sep_acla". The usage is

% sep_acla separator [commands] input-filename threshold size output-filename [options]

Almost same as "exec_acla" but you have to specify separator at the fourth parameter. "exec_acla_" and "sep_acla_" are both for the aim to distinct the same items in the different columns. For example, it is used to the database such that different items are there in different columns, but some special symbols, such as "- is for missing data", are used commonly. An example is;

A true small
B true -
C false middle
B - -
C - middle
A true -
[EOF]

In the output file, the items are followed by "." and numbers where the numbers are the column number. For example, "dog.0" means the item "dog" on the 0th(first) column.

The usage of them are the same as "exec_acla" and "sep_acla", respectively. All these scripts use files of the names "__tmp1__", "__tmp2__", and "__tmp3__". The files of these names will be deleted after the execution.

Example)

% exec_acla FI test2.dat 10 3 out.dat -w weight.dat

% sep_acla_ "," FQ test3.dat 3 3 out.dat -U 5

Output Format

When the program is executed, the program prints out the #items, #transactions, and other features of the input database to standard error. After the termination of the enumeration, it outputs the total number of approximate coverings found, and the numbers of coverings of each size. Note that only coverings of given size are found by this program. For example, if there are 4 approximate coverings of size 3,
the output to standard output will be,

4 <= total #approximate coverings
0 <= #approximate coverings of size 0 (empty set)
0 <= #approximate coverings of size 1
0 <= #approximate coverings of size 2
4 <= #approximate coverings of size 3

If "q" is given in the first parameter, these do not appear in the standard output.

If output-filename was given, then the coverings found are written to the output file. Each line of the output file is the list of items included in a covering found, separated by " ". By giving "-," option we can change the separator. If "f" is given in the first parameter, the coverage follows each covering, for example,

1 5 10 2 4 (22)

which means covering {1,2,4,5,10} has non-empty intersection with 22
transactions. When "Q" option is given, the output will be

(22) 1 5 10 2 4

The output coverings are not sorted. If you want to sort it, use the script "sortout.pl". The usage is just,

% sortout.pl < input-file > output-file

"input-file" is the name of file to which ACLA outputs, and the sorted output will be written in the file of the name "output-file". The items of each covering will be sorted in the increasing order of items, and all the coverings (lines) will be also sorted, by the lexicographical order (considered as a string). (Actually, you can specify separator like sortout.pl ",").

Performance

The performance of ACLA is stable, for both computation time and memory use. The initialization and preprocess time of ACLA is linear in the size of input database. The computation time for finding approximate coverings depends on the number of coverings found, and the coverage threshold. If the number of approximate coverings is large, ACLA works almost linear time in the number of output, that is, throughput is stable, such as 100,000 coverings in one second.
Memory usage of ACLA is very stable. It is an advantage compared to other implementations. The memory usage of ACLA is always linear in the size of the input database. Approximately ACLA uses integers at most three times as much as the database size, which is the sum of the number of items of each transaction.

Solving Other Problems

- Combination with Pattern Mining -

Pattern mining is a problem of finding all patterns which are included in many records of the given database. In some sense, a pattern included in many records can be considered as an explanation of the records. if all almost all the records are explained by a pattern, then

- Mining Cascade Rule -
A rule "X1 \cup X2 \cup ... \cup Xn => A, otherwise not A" is called a cascade rule. ACLA can find such a rule by giving options -o [num1] and -n [num2] as bounds for the accuracy. First, we make a weight file such that transactions of positive examples have positive weights, and transactions of negative examples have negative weights. The accuracies are given by absolute values. [num1] means the lower bound for the true-positive examples in the rule, and [num2] means the upper bound for the false-negative examples. Note that if we want to set the upper bound to 5, we have to give "-n -5", because this is a bound for the sum of negative weights. The support has to be very small negative value. Both bounds are given by weighted sum, thus we can change the intensities of the transactions.

Algorithms and Implementation Issue

The basic idea of the algorithm is depth first search. Let D be a transaction database, t be the coverage threshold, 1,...,n be items, T1,...,Tm be the transactions in D. D'(I) denotes the set of transactions not including any item of I. We denote the largest item of a covering I by tail(I). ACLA first computes the coverage of each covering composed of one item. Then, for each covering {i}, the algorithm recursively enumerates coverings obtained by adding one item to {i}. In this way, recursively, ACLA enumerates all coverings. To avoid duplications, ACLA adds items j to {i} only if j>i. In this way, we can find all coverings of given size s, but the computation time will be too long since it visits all possible combinations of s items. Thus, we give a pruning method to the algorithm. When we are operating a covering I, if there is no possibility to reach to coverage t, then we prune the iteration. In precise, at the time, we can add k = s-|I| items to the covering. Suppose that i1,...,ik are items of top-k largest coverages for database D'(I). If the sum of the coverages of i1,...,ik is less than m - (coverage of I), then there is no hope to be an approximate covering by adding k items. Thus, we can prune the iteration. This algorithm is written as follows:

ACLA (D:database, I:covering )
     if |I| = s, then
     if I is approximate covering, then output I;
     return;
     if the sum of top-k coverages of items in D is less than m-t, then return
     for each item j > tail(I), call ACLA (D'(I), I\cup{j})

By calling ACLA (D, emptyset), we can enumerate all approximate coverings.

However, a straightforward implementation of this algorithm is very slow, since computing coverage of (I\cup{j}) takes long time. To be fast, we use conditional database and occurrence deliver as follows.

==== Conditional database of covering I ===
Conditional database of covering I, denoted by D(I), is given by the database obtained by

     1. D(I) := all transactions including I
     2. remove all items larger than tail(I) from each transaction of D(I)
     3. merge the identical transactions into one.
     (do for all such identical transactions)

Then, the coverage of covering I plus J is the same, in D and D(I), thus ACLA uses D(I) instead of D in the recursive call with respect to I. This technique is quite common in implementations of frequent itemset mining.

Constructing D(I) possibly takes much memory, when the recursion is deep. Thus, ACLA uses a slight modification.

     1. D(I) := IDs of all transactions including I
     2. If there are transactions T1,...,Tk become the same after deleting
     unnecessary items, make a new transaction equal to it, and replace the
     IDs of T1,...,Tk by the ID of new transaction

After terminating the recursive call with respect to I, we delete the transactions generated in the process. By this, we can bound the memory usage for the transactions by twice the database size. From this, ACLA allocates the memory in the initialization, and never do again in the main routine. Thus the memory usage is very stable, and is very fast since computation time for memory allocation is not small in the frequent covering mining implementations.

=== occurrence deliver ===
Let T be a transaction database (subset family), a collection of subsets of E = {1,...,n}, and T(S) be the set of transactions (subsets) in T including a subset S. We suppose that each transaction is sorted. The occurrence deliver computes T(i) for all i>j, for a given j, in time linear in the sum of their sizes. First, for each i for which we compute T(i), we give an empty bucket. Then, for each transaction t in T, we scan it in decreasing order of items until we meet the item less than j, and for each item i>j, insert t to the bucket of i. After scanning all the transactions, the bucket of i is T(i). When we have T(S) and want to compute T(S\cup {i}) for all i>tail(S), we can do in the same way by setting T:=T(S), and j:=tail(S). For the check whether the current covering is a maximal frequent covering of not, we also use the occurrence deliver. If there is no item i such that S\cup {i} is frequent, then the frequent covering S is maximal frequent covering.

Acknowledgments

We thank to Hiroki Arimura of Hokkaido University for his contribution for the research, and Koji Tsuda of National Institute of Advanced Industrial Science and Technology.

A part of the research of ACLA is supported by joint-research fund of National Institute of Informatics Japan, and Grant-in-aid from the Ministry of Education, Science, Sport and Culture of Japan (Monbu-Kagaku-Sho).

References

Bart Goethal, "FIMI repository", http://fimi.cs.helsinki.fi/ (various frequent itemset implementation, benchmark databases, computational experiments comparing all implementations, and papers explaining the implementations, very useful site)

Takeaki Uno, Masashi Kiyomi, Hiroki Arimura, ACLA ver.3: Collaboration of Array, Bitmap and Prefix Tree for Frequent Itemset Mining, Open Source Data Mining Workshop on Frequent Pattern Mining Implementations 2005, Aug/2005

Takeaki Uno, Masashi Kiyomi, Hiroaki Arimura, "LCM ver.2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets," in Proceedings of IEEE ICDM'04 Workshop FIMI'04, 1/Nov/2004, http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS//Vol-126/

Takeaki Uno and Tatsuya Asai, Hiroaki Arimura and Yuzo Uchida, "LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets," Workshop on Frequent Itemset Mining Implementations (FIMI'03), http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS//Vol-90/

Takeaki Uno,Tatsuya Asai,Yuzo Uchida, Hiroki Arimura, "An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases," Lecture Notes in Artificial Intelligence 3245 (Proceedings of Discovery Science 2004), 4/Oct/2004