afim readme

AFIM: Ambiguous Frequent Itemset Miner

Coded by Takeaki Uno, e-mail:uno@nii.jp,
homepage: http://research.nii.ac.jp/~uno/index.html

This program is available for only academic use, basically. Anyone can modify this program, but he/she has to write down the change of the modification on the top of the source code. Neither contact nor appointment to Takeaki Uno is needed. If one wants to re-distribute this code, do not forget to refer the newest code, and show the link to homepage of Takeaki Uno, to notify the news about codes for the users. For the commercial use, please make a contact to Takeaki Uno.

ProblemDefinition
Usage
InputFileFormat
UseGeneralNamesforVariablesandOtherFormats
BatchFilesforSimpleuse
OutputFormat
Performance
SolvingOtherProblems
Acknowledgments
References

Problem Definition

Let I be a set of items. An itemset is a subset of I. Let D be a transaction database such that each record (called transaction) is an itemset. For a pair of an itemset I and a transaction set T, the density is defined by the average number of items of I included in a transaction of t. For given a frequency threshold s and density threshold d, an itemset I is called an ambiguous frequent itemset if there is a transaction set T of size at least s such that the density of I and T is no less than d. This program enumerates all ambiguous frequent itemsets, or all maximal ambiguous frequent itemsets, for transaction database D, frequency threshold s, and density threshold d.

Usage

==== How to Compile ====

Unzip the file into arbitrary directory, and execute "make". Then you can see "afim" (or afim.exe) in the same directory.

==== Command Line Options ====

To execute AFIM, just type lcm and give some parameters as follows.

% afim [command] [options] input-filename support density-threshold [output-filename]

"%" is not needed to type. It is a symbol to represent command line. To see a simple explanation, just execute "afim" without parameters.

"input-filename" is the filename of the input transaction database. The 1st letter of input-filename must not be '-'. Otherwise it is regarded as an option. The input file format is written below. "support" and "density" are the frequency-threshold and
density-threshold, given to define ambiguous frequent itemsets.
"output-filename" is the name of file to write the itemsets the program finds. You can omit the output file to see only the number of solutions to output the file. If the output file name is "-", the solutions will be output to the standard output.

The first parameter [command] is composed of some letters, and given to the program to indicate the task.

F: enumerate ambiguous frequent itemsets,
M: enumerate maximal ambiguous frequent itemsets

For example, if you want to enumerate maximal frequent itemset, type "M" in the first parameter. Additionally, we can give the following commands, to specify the output style:

   q: no output to standard output (including messages w.r.t. input data)
   f: output the frequency on the end of each itemset found,
   Q: output the frequency on the head of each itemset found,
   i: ignore the items that appear less than the [support].
   I: output ID's of transactions forming dense structure
     the list of the ID's of the transactions with which the itemset
     form a dense submatrix is printed in the line just below the
     itemset. In exact, the algorithm in greedy selects the transactions
     which have larger intersections with the itemset (so, dense
     submatrix is obtained). The algorithm selects the transactions
     unless the density of the submatrix is no less than the density
     threshold. In the case there are several transaction with the same
     intersection size, and the density will be smaller than the
     density threshold when all the transaction are added, the algorithm
     adds none of them.
   m: output ID's of transactions with selecting maximally.
     same as I command, but when there are transactions of the same
     intersection size of violating the density threshold constraint,
     the algorithm selects all them.
   V: show the progress of computation
   t: transpose the database so that item i will be transaction i, thus
     if item i is included in j-th transaction, then item j will be
     included in i-th transaction.

The output format is explained below. The following options are to restrict the itemset to be found. We can type them between the first parameter and the second parameter. Of course the option can be omitted.

   -l,-u [num]: enumerate itemsets with size at least/most [num]
   -U [num]: enumerate itemsets with frequency at most [num]
   -K [num]: output the frequency of [num]-th largest frequent(closed/maximal)
     itemsets. Notice that it outputs only the frequency, thus no
     itemset will be output
   -S [num]: stop after outputting [num] solutions
   -, [char]: give the separator of the numbers in the output
     the numbers in the output file are separated by the given
     character [char].
   -Q [filename]: replace the output numbers
     according to the permutation table written in the file of
     [filename], replace the numbers in the output. The numbers in the
     file can be separated by any non-numeric character such as newline
     character.
   -w [filename]: read weights of transactions from the file
     the frequency of itemset will be the sum of weights of the transactions
     which include the itemset.

-l, -u, and -U specify the upper/lower bound for itemset/frequency. itemsets exceed the upper/lower bound will be not output. When we give -S option and a number X, the program will terminate if it finds X solutions, even if there are more solutions.

If we specify "-w" and filename, then AFIM reads the file of the filename as a weight database of the transactions; the weight of each transaction. The numbers can be separated by any non-numeric characters except for '.' and '-'. Given transaction weights, the density of the itemsets I and transactions T is defined by (sum of w(t\cap I), t\in T) / (|I| * |T|), where w(t) is the weight of transaction t. In this definition, the above definition of the density is considered as a special case in that all the transaction weights are equal to 1. In the weighted case, density can be above one, thus density threshold can be above one. The negative weights might cause badly to the computation, so that we may miss several ambiguous frequent itemsets because negative weights violates the monotone condition.

When we give -K option, AFIM computes the frequency of the [num]-th largest frequent(closed/maximal) itemsets, and output it to the standard output (print it to the command line). By giving the output frequency to AFIM, we can enumerate top-[num] frequent itemsets (closed, or maximal).

Examples)

- to find all ambiguous frequent itemsets in "test.dat" for support 4,
     and density 0.8, and size of no less than 3, output frequencies of
     each itemsets found, do not output to file, show progress, and
     stop if 1,000,000 solutions will be found;

% afim FfV -l 3 -S 1000000 test.dat 4 0.8

- to find maximal ambiguous frequent itemsets in "test.dat" with

frequency at least 6, and density at least 0.9, sizes from 5 to 10,
and output itemsets to "out.dat";

% afim M -l 5 -u 10 test.dat 6 0.9 out.dat

Input File Format

Each line (row) of the input file is a transaction. The items included in a transaction are listed in a line. The items must be numbers begin from 1. They can not be minus. The item numbers do not have to be continuous, but notice that the program takes memory linear in the maximum item number. The separator of numbers can be any non-numeric letter, such as "," " " ":" "a", etc.

Example) ( "[EOF]" is the end of file )
0 1 2
1
2 3 4
4,1 2 3
2,1
[EOF]

The file format of item constraints (or complement) graphs is as follows. The ith row (line) of the file corresponds to node i-1. The first line corresponds to the node 0, and the 10th line corresponds to node 9. The node larger than i-1 and adjacent to i-1 is listed in the i-th line. The nodes are written by numbers. Separator to the number can be ",", but the graph load routine accepts any letter for the separator but not a number, +, and -. If the edge set is {(0,1),(0,2),(1,1),(1,3),(2,3)}, the file will be

===========
1,2
1 3
3

[EOF]
=========

where "[EOF]" is a symbol for the end of file.

Use General Names for Variables and Other Formats

We can transform variable names in general strings to numbers so that we can input the data to the program, by some script files.

- sortout < input-file > output-file
Sort the file in the lexicographical order. It is used to sort the output file.

- transnum.pl table-file [separator] < input-file > output-file
Read file from standard input, and give a unique number to each name written by general strings (such as ABC, ttt), and transform every string name to a number, and output it to standard output. The mapping from string names to numbers is output to table-file. The default character for the separator of the string names is " "(space). It can be changed by giving a character for the option [separator]. For example, A,B is a string name, if the separator is space, but if we set the separator to ",", it is regarded as two names A and B. This is executed by "transnum.pl table-file "," < input-file...".

- untransnum.pl table-file < input-file > output-file
According to the table-file output by transnum.pl, un-transform numbers to string names. The output of the program is composed of numbers, thus it is used if we want to transform to the original string names. It reads file from standard output, and output to the standard output.

- appendnum.pl < input-file > output-file
When we want to distinct the same words in different columns, use this script. This append column number to each words, so we can distinct them. Then, by using transnum.pl, we transform the strings to numbers.

- transpose.pl < input-file > output-file
Transpose the file. In the other words, consider the file as an adjacency matrix, and output the transposed matrix, or exchange the positions of items and transactions. For an input file, output the file in which the i-th line corresponds to item i, and includes the numbers j such that i is included in the j-th line of the input file.

- 01conv.pl [separator] < input-file > output-file
Transform the transaction database written in the style of 01 matrix to our input style. The file has to be a list of 0 and 1, separated by [separator]. If [separator] is omitted, the default separator " " is used.

Batch Files for Simple use

For general string names, we have several batch files scripts for basic usages "exec_afim", "exec_afim_", "sep_afim", or "sep_afim_". For example, when a database with "general item names" is,

dog pig cat
cat mouse
cat mouse dog pig
cow horse
horse mouse dog
[EOF]

All these replace strings in the input database by numbers, execute AFIM, and replace the numbers in the output file by the original strings. The usage of the scripts are

% exec_afim [command] input-filename support output-filename [options]

You have to give the commands in the first parameter, but the options have to locate at the end. The separator of the items is " " (blank, space). If you want to use other character as a separator, use "sep_afim". The usage is

% sep_afim separator [command] input-filename support output-filename [options]

Almost same as "exec_afim" but you have to specify separator at the fourth parameter. "exec_afim_" and "sep_afim_" are both for the aim to distinct the same items in the different columns. For example, it is used to the database such that different items are there in different columns, but some special symbols, such as "- is for missing data", are used commonly. An example is;

A true small
B true -
A false middle
B - -
C - middle
A true -
[EOF]

In the output file, the items are followed by "." and numbers where the numbers are the column number. For example, "dog.0" means the item "dog" on the 0th(first) column.

The usage of them are the same as "exec_afim" and "sep_afim", respectively. All these scripts use files of the names "__tmp1__", "__tmp2__", and "__tmp3__". The files of these names will be deleted after the execution.

Example)

% exec_afim F test2.dat 10 out.dat -w weight.dat -l 2

% sep_afim_ "," C test3.dat 3 out.dat -U 5

Output Format

When the program is executed, the program prints out the #items, #transactions, and other features of the input database to standard error. After the termination of the enumeration, it outputs the total number of solutions found ((maximal) ambiguous frequent itemsets), and the numbers of solutions of each size. For example, if there are 4 ambiguous frequent itemsets of size 1, 2 ambiguous frequent itemsets of size 3, and 1 ambiguous frequent itemset of size 3, then the output to standard output will be,

9 <= total #ambiguous frequent itemsets
1 <= #ambiguous frequent itemsets of size 0 (empty set), it always exists
4 <= #ambiguous frequent itemsets of size 1
3 <= #ambiguous frequent itemsets of size 2
1 <= #ambiguous frequent itemsets of size 3

If "q" is given in the first parameter, these do not appear in the standard output.

If output-filename was given, then the itemsets found are written to the output file. Each line of the output file is the list of items included in an itemset found, separated by " ". By giving "-," option we can change the separator. If "f" is given in the first parameter, the frequency follows each itemset, for example,

1 5 10 2 4 (22)

which means itemset {1,2,4,5,10} is included in 22 transactions. When "Q" option is given, the output will be

(22) 1 5 10 2 4

When I command is given, the list of the ID's of the transactions with which gives the ambiguous frequency of the itemset is printed in the line just below the itemset. The first letter of the line is always a blank (so an indent), and the ID's follow in the same format. The output itemsets are not sorted. If you want to sort it, use the script "sortout.pl". The usage is just,

% sortout.pl < input-file > output-file

"input-file" is the name of file to which AFIM outputs, and the sorted output will be written in the file of the name "output-file". The items of each itemset will be sorted in the increasing order of items, and all the itemsets (lines) will be also sorted, by the lexicographical order (considered as a string). (Actually, you can specify separator like sortout.pl ",").

Performance

The performance of AFIM is stable, for both computation time and memory use. The initialization and preprocess time of AFIM is linear in the size of input database. The computation time for finding ambiguous frequent itemsets basically depends on the number of itemsets found; it takes long time if solutions to be output is many, and the computation time is basically linear to the number of solutions output. When the minimum support is large and the number of solutions output is small, the computation time is short, so the bottle neck of the computation is the initialization. However, when the solutions output increases, the total computation time increases but the computation time per solution decreases. Roughly speaking when the size of output file is equal to the input database size, the computation time per solution will be constant, such as 1/1,0,000 sec (by a PC with CPU of 2GHz.) On contrary, when the database size increases, the computation time per solution increases, but if the number of solutions also increases (it is quite natural), it is sublinear to the database size; it increases very slowly compared to the increase of the database size.

Memory usage of AFIM is very stable. The memory usage of AFIM is always linear in the size of the input database. Approximately AFIM uses integers at most two times as much as the database size, which is the sum of the number of items of each transaction. The memory usage of the implementations for other pattern mining problems increases as the increase of the number of frequent itemsets, but that of AFIM does not.

Solving Other Problems

- Enumerating Dense Bipartite Subgraphs in a Bipartite Graph -

Enumerating all maximal bipartite cliques in a bipartite graph is equivalent to enumerating all ambiguous frequent itemsets. AFIM can be used for this task. For a bipartite graph of two vertex sets A and B, construct the database such that each line is the list of vertices incident to a vertex in A. Then, by executing

% afim FI (or MI) input-filename 1 [density] output-filename

you can enumerate (maximal) dense bipartite cliques.

Acknowledgments

We thank to Ken Satoh of National Institute of Informatics Japan, Hiroki Arimura of Hokkaido University for their contribution for the research. We would also like to present our sincere thanks to Bart Goethal, one of the organizers of FIMI (Frequent Itemset Mining Implementation) for his organizing the workshop to which the first version of AFIM was submitted, which was the start of our research. We also thank to Yukinobu Hamuro of Kwansei gakuin University, JAPAN, Hiroyuki Morita of Osaka Prefecture University, JAPAN for their bug
reports.

A part of the research of AFIM is supported by joint-research fund of National Institute of Informatics Japan, and Grant-in-aid from the Ministry of Education, Science, Sport and Culture of Japan (Monbu-Kagaku-Sho).

References

Takeaki Uno, Hiroki Arimura: Ambiguous Frequent Itemset Mining and Polynomial Delay Enumeration. PAKDD 2008: 357-368

Takeaki Uno, Masashi Kiyomi, Hiroaki Arimura, "LCM ver.2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets," in Proceedings of IEEE ICDM'04 Workshop FIMI'04, 1/Nov/2004, http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS//Vol-126/

Takeaki Uno,Tatsuya Asai,Yuzo Uchida, Hiroki Arimura, "An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases," Lecture Notes in Artificial Intelligence 3245 (Proceedings of Discovery Science 2004), 4/Oct/2004

Bart Goethal, "FIMI repository", http://fimi.cs.helsinki.fi/ (various frequent itemset implementation, benchmark databases, computational experiments comparing all implementations, and papers explaining the implementations, very useful site)