LCM_seq: LCM algorithm for enumerating all frequently appearing sequences
Coded by Takeaki Uno, e-mail:uno@nii.jp,
homepage: http://research.nii.ac.jp/~uno/index.html
This program is available for only academic use, basically. Anyone can modify this program, but he/she has to write down the change of the modification on the top of the source code. Neither contact nor appointment to Takeaki Uno is needed. If one wants to re-distribute this code, do not forget to refer the newest code, and show the link to homepage of Takeaki Uno, to notify the news about codes for the users. For the commercial use, please make a contact to Takeaki Uno.
Let I be a set of items. An item sequence is a sequence composed of items I. Let D be a sequence database such that each record is a item sequence (record is called transaction in itemset mining). For item sequences I={i_1,...,i_n} and J={j_1,...,j_m}, if i_1 = j_k1, i_2 = j_k2,...,i_n = j_kn holds for some k1<k2<...
==== How to Compile ====
Unzip the file into arbitrary directory, and execute "make". Then you can see "lcm_seq" (or lcm_seq.exe) in the same directory.
==== Command Line Options ====
To execute LCM_seq, just type lcm and give some parameters as follows.
% lcm_seq FCfQIq [options] input-filename support [output-filename]
"%" is not needed to type. It is a symbol to represent command line. To see a simple explanation, just execute "lcm" without parameters.
"input-filename" is the filename of the input database. The 1st letter of input-filename must not be '-'. Otherwise it is regarded as an option. The input file format is written below. "support" is the support, which is for given the threshold of frequency. "output-filename" is the name of file to write the sequences the program finds. You can omit the output file to see only the number of frequent sequences in the database. If the output file name is "-", the solutions will be output to the standard output.
The first parameter is given to the program to indicate the task.
F: define frequency by position occurrence,
C: define frequency by document occurrence
For example, if you want to enumerate frequent sequence in the sense of document occurrence, give "F" in the first parameter. Additionally, we can give the following commands, to specify the output style:
m (extendable maximality): output only sequences that are not extendable
to frequent sequences
c (extendable closedness): output only sequences that are not extendable
to sequences with the same frequency
q: no output to standard output (including messages w.r.t. input data)
i: do not output itemset to the output file (only rules are written)
f: output the frequency on the end of each sequence found,
Q: output the frequency on the head of each sequence found,
A: output positive/negative frequency, and (frequency)/(absolute
frequency); positive/negative frequency is the sum of weights of
the occurrence which have positive/negative weights, respectively.
Absolute frequency is the sum of absolute value of the weights of
the occurrences.
s: output confidence and item frequency by absolute values; in default,
they are written by ratio, but by this command they are written by
#transactions/sum of the transaction weights
I: output ID's of transactions including each itemset; ID of a
transaction is given by the number of line in which the transaction
is written. The ID starts from 0.
V: show the progress of computation
t: transpose the database so that item i will be transaction i, thus
if item i is included in j-th transaction, then item j will be
included in i-th transaction.
The confidence of a rule A->B is the ratio (#transactions including the sequence A+B) / (#transactions including A). Intuitively, the confidence is the how much the rule is confident.
m command restricts the output sequences to only frequent sequences that can not extendable to frequent sequences. Precisely, for a frequent sequence S, if there is a frequent sequence S+a (satisfying the gap or window constraints described below, in the case we give these constraints), S will not be output. Only in the case that no S+a is frequent, S will be output (this is called "extendable maximality"). Similarly, c commands restricts the output sequences only to the frequent sequences that can not extendable to a sequence of the same frequency, with keeping the given gap/window constraints (this is called "extendable closedness").
Giving I command, LCMseq outputs the transaction ID's including the pattern, on the following line to the sequence. For example, if (1,3,4) is included in transactions 1, 4, 7, 8, 13 and 14, the output for this sequence will be
1 3 4 (6)
1 4 7 8 13 14
If the frequency is given by position occurrence, i.e., when we give F command, each occurrence is a pair of transaction ID and the position of the item corresponding to the last item of the sequence pattern. the position is reverse ordering, thus the last item is 0. For example, for the following database and pattern,
pattern = (1,3)
database:
1 3 5 6
1 4 5 3
1 2
4
1 2 3 1 3
the output will be
1 3
0 2 1 0 4 2 4 0
where 0 2 means the the second item 3 (2nd item from the last) of the first transaction (transaction ID 0) corresponds to the last item (item 3) of the pattern sequence. The following 1 0 means that the the last item 3 (0th item from the last) of the second transaction (transaction ID 1) corresponds to the last item (item 3) of the pattern sequence.
The output format is explained below. The following options are to restrict the sequence to be found. We can type them between the first parameter and the second parameter. Of course the option can be nothing.
-l,-u [num]: enumerate sequences with size at least/most [num]
-U [num]: enumerate sequences with frequency at most [num]
-w [filename]: read weights of records from the file; the frequency
of itemset will be the sum of weights of the transactions which
include the itemset.
-K [num]: output the frequency of [num]-th largest frequent(closed/maximal)
sequences. Notice that it outputs only the frequency, thus no
sequence will be output
-S [num]:stop after outputting [num] solutions
-, [char]: give the separator of the numbers in the output
the numbers in the output file are separated by the given
character [char].
-Q [filename]: replace the output numbers
according to the permutation table written in the file of
[filename], replace the numbers in the output. The numbers in the
file can be separated by any non-numeric character such as newline
character.
-g [num]: restrict gap length of each consecutive items by [num]; gap is
the number of items between two embedded items, and this option limits
the gap length.
-G [num]: restrict window size of each consecutive items by [num]; window
size is the limit of the length of embedding.
-f,-F [ratio]: let p be the product of (frequency of i)/(#sequences in
the database) for all item i in the sequence. By this option, a
sequence will be output only when its frequency is no greater/no less
than [ratio] times p \times #sequences in the database.
-i [num]: find association rule for item [num]; output all rules of the form
{1,3,5} => [num]. The criteria to output a rule can be given by other
options.
-a,-A [ratio]: find association rules of confidence at least [ratio]; an
association rule of {1,3,5} => 2 will be output if the frequency of
{1,3,5,2} is no less/no more than the frequency of {1,3,5} times [ratio].
-r,-R [ratio]: find association rules of relational confidence at least
[ratio]; an association rule of (1,3,5) => 2 will be output if the
frequency of (1,3,5,2) is no less/no more than the frequency of (1,3,5)
times (frequency of {2}/(#transactions in the database)) times [ratio].
-p,-P [num]: output pattern sequence only if (frequency) / (absolute
frequency) is no less/no greater than [num]. Absolute frequency is the
sum of absolute weights of the occurrences of the pattern sequence.
-n,-N [num]: output pattern sequence only if its negative frequency is
no less/no greater than [num]. Negative frequency is the sum of
weights of the occurrences which have negative weights
-o,-O [num]: output pattern sequence only if its positive frequency is
no less/no greater than [num]. Positive frequency is the sum of
weights of the occurrences which have positive weights
-s [num]: output itemset rule (of the form (a,b,c) => (d,e)) with
confidence at least [num] (only those whose frequency of the result
is no less than the support
-l, -u, and -U specify the upper/lower bound for sequence/frequency. sequences exceed the upper/lower bound will be not output. When we give -S option and a number X, the program will terminate if it finds X solutions, even if there are more solutions.
If -g [num] option is given, LCMseq finds embeds a sequence with satisfying that each consecutive two items will not be embedded two items with distance larger than [num]. The distance is the number of items between the embedded items, thus if the pattern sequence (1,2,3) is embedded to a sequence (1,2,4,5,7,3), the gap between 2 and 3 is 3, coming from that there are three items between 2 and 3 in the sequence. If -g 2 is given, then the sequence (1,2,4,5,7,3) is not regarded as an occurrence of the pattern sequence (1,2,3).
-G [num] option restricts the window size. Window size is the limit of the length of embedding. For example, the length of embedding of (1,2,3) into (6,1,2,4,5,7,3,4) is 6, thus if -G 5 is given, this is not counted as an occurrence.
If we specify "-w" and filename, then LCMseq reads the file as a weight database of records. Each i-th line of the file is regarded as the weight of i-th record. With record weights, LCMseq finds the sequences such that the sum of weights of the record including the sequence is no less than the support. The weights can be real numbers. We can give negative weights.
Examples)
- To find all frequent item sequences in "test.dat" for support 4, and
size of no less than 3. Output frequencies of each sequences found.
Do not output to file.
% lcm_seq -l 3 -f test.dat 4
- To find frequent item sequences in "test.dat" with frequency at least 6,
and sizes from 5 to 10. Output sequences to "out.dat".
% lcm_seq F -l 5 -u 10 test.dat 6 out.dat
-f and -F options are for finding sequences with high relational frequency. For each item i, let us consider (frequency of {i}) / (#records in D) as the appearance probability that item i is included in a record. Note that the value (frequency of {i}) / (#records in D) changes according to the definition of occurrence and frequency. When F command is given and the occurrence is defined by the document occurrence, the value never exceed 1, but in the case of position occurrence, the value can exceed 1.
If the appearance probability of each item is independent to each other, the frequency of a sequence S is expected to be the product of these appearance probabilities of its items times (#records in D). We here define the relational frequency of S by
(frequency of S) / (expected frequency of S). When we give -f [num] option, LCMseq outputs frequent sequences with relational frequency no less than [num]. Similarly, frequent sequences with relational frequency no greater than [num] is output when -F [num] option is given.
The remaining options are for the association rule mining. An association rule is a pair of a sequence S and an item x with the form "S => x". The confidence of the rule is the ratio of records including x, in the set of records including S. If the confidence is high, we can say that a record including S may include x with high probability. The remaining options are for enumerating all such association rules with higher or lower confidence
-a [ratio] and -A [ratio] options are for specifying the threshold value for the confidence. When -a option and [ratio] are given, LCMseq finds all association rules with confidence at least [ratio]. Conversely, association rules with confidence at most [ratio] are found when we give -A option.
-r [ratio] and -R [ratio] options are also for specifying the threshold, but the threshold values changes according to the frequency of each item. Precisely, when we give -R and [ratio], an association rule S => x is output if its confidence is no greater than [ratio] * (frequency of {x}), thus it is output only if the confidence is small compared to the frequency of x in the database. Conversely, when we give -r and [ratio], an association rule S => x is output if (1 - its confidence) is no greater than [ratio] * (1- frequency of {x}).
When -i [num] option is given, LCMseq finds only the rules of the form S => [num] ([num] is an item). By giving -s [num] options, LCMseq outputs association rule of the form (1,4,5) => (2,3), with confidence at least [num]. It means that an association rule is output if frequency of the pattern (1,4,5,2,3) is no less than the frequency of (1,4,5) times [num].
The following options are valid only when some transactions have negative weights. -p,-P [num] options restrict the output pattern sequences to those satisfy (frequency) / (absolute frequency) is no less/no greater than [num]. Here, absolute frequency is the sum of absolute weights of the occurrences of the pattern sequence.
-n,-N [num] options give bound for negative weights. If these options are given, LCMseq outputs a pattern sequence only if its negative frequency is no less/no greater than [num]. Here negative frequency is the sum of weights of the occurrences which have negative weights.
Similarly, -o,-O [num] options let the LCMseq output a pattern sequence only if its positive frequency is no less/no greater than [num]. Positive frequency is the sum of weights of the occurrences which have positive weights.
Examples)
- to find all association rule "A => b" such that frequency of A is no less
than 200, and the confidence is no less than 0.9; The input file is
"test.dat" and the output file is "out".
% lcm_seq C -a 0.9 test.dat 200 out
- to find all association rule "A => b" such that frequency of A is no less
than 200, and the confidence is no greater than
0.1 * (frequency of {x}/#records); the item b is specified as item 5.
The input file is "test.dat" and the output file is "out".
% lcm_seq C -R 0.1 -i 5 test.dat 200 out
Each line (row) of the input file is a record. The items included in a record is listed in a line. The items must be numbers begin from 0. They can not be minus. The item numbers do not have to be continuous, but notice that the program takes memory linear in the maximum item number. The separator of numbers can be any non-numeric letter, such as "," " " ":" "a", etc.
Example) ( "[EOF]" is the end of file )
0 1 2
1
2 3 4
4,1 2 3
2,1
[EOF]
We can transform variable names in general strings to numbers so that we can input the data to the program, by some script files.
- transnum.pl table-file [separator] < input-file > output-file
Read file from standard input, and give a unique number to each name written by general strings (such as ABC, ttt), and transform every string name to a number, and output it to standard output. The mapping from string names to numbers is output to table-file. The default character for the separator of the string names is " "(space). It can be changed by giving a character for the option [separator]. For example, A,B is a string name, if the separator is space, but if we set the separator to ",", it is regarded as two names A and B. This is executed by "transnum.pl table-file "," < input-file...".
- untransnum.pl table-file < input-file > output-file
According to the table-file output by transnum.pl, un-transform numbers to string names. The output of the program is composed of numbers, thus it is used if we want to transform to the original string names. It reads file from standard output, and output to the standard output.
For general string names, we have several batch files scripts for basic usages "exec_lcm_seq" and "sep_lcm_seq". For example, when a database with "general item names" is,
dog pig cat
cat mouse
cat mouse dog pig
cow horse
horse mouse dog
[EOF]
All these replace strings in the input database by numbers, execute lcm_seq, and replace the numbers in the output file by the original strings. The usage of the scripts are
% exec_seq [commands] input-filename support output-filename [options]
You have to give the commands in the first parameter, but the options have to locate at the end. The separator of the items is " " (blank, space). If you want to use other character as a separator, use "sep_lcm_seq". The usage is
% sep_seq separator [commands] input-filename support output-filename [options]
The scripts use files of the names "__tmp1__", "__tmp2__", and "__tmp3__", The file of these names will be deleted after the execution.
Example)
% exec_seq F test2.dat 10 out.dat -w weight.dat -l 2
% sep_seq "," C test3.dat 3 out.dat -U 5
When the program is executed, the program prints out the #items, #records(transactions), and other features of the input database to standard error. After the termination of the enumeration, it outputs the total number of item sequences found, and the numbers of item sequences of each size. For example, if there are 4 frequent item sequences of size 1, 2 frequent item sequences of size 3, and 1 frequent item sequence of size 3, then the output to standard output will be,
9 <= total #frequent item sequences
1 <= #frequent item sequences of size 0 (empty set), it is always frequent
4 <= #frequent item sequences of size 1
3 <= #frequent item sequences of size 2
1 <= #frequent item sequences of size 3
If "q" is given in the first parameter, these do not appear in the standard output.
If output-filename was given, then the item sequences found are written to the output file. Each line of the output file is the list of items included in an item sequence found, separated by " ". By giving "-," option we can change the separator. If "-f" option was given, the frequency follows each item sequence, for example,
1 5 10 2 4 (22)
which means item sequence (1,2,4,5,10) is included in 22 records. When "Q" option is given, the output will be
(22) 1 5 10 2 4
An association rule S => x, for example (1,2,3) => 5 is output in the form
[xxx yyy] 5 <= 1 2 3
where xxx is the confidence of the rule, and yyy is (frequency of y)/#records in D. Note that (1,2,3) => 5 means that if a record includes sequence (1,2,3), the 5 likely follows in the same record.
The performance of LCMseq is stable, for both computation time and memory use. The initialization and preprocess time of LCMseq is linear in the size of input database. The computation time for finding frequent item sequences depends on the number of item sequences found, and the minimum support. The computation time of LCMseq is scalable, i.e., almost linear in the number item sequences found, thus we here show the computation time for one item sequence found. When the minimum support is large and the number of the item sequences found is small, the computation time for each item sequence is relatively large. However, computation time per item sequence decreases as the increase of the item sequences found, and roughly speaking when the size of output file is equal to the input database size, it will be constant, such as 1/1,000,000 sec (by PC with CPU of 2GHz.)
Memory usage of LCMseq is also stable. It is an advantage compared to other implementations. The memory usage of LCMseq is always almost linear in the size of the input database. Approximately LCMseq uses integers at most three times as much as the database size, which is the sum of the number of items of each record, plus small portion of the database, which is needed to memory the information of occurrence deliver for the recursive calls. The memory usage of the other implementations increases as the increase of the number of frequent item sequences, but that of LCMseq does not.
The basic idea of the algorithm is depth first search, so called prefix span. Let D be a sequence database, t be the minimum support, 1,...,n be items, T1,...,Tm be the records in D.
For item sequences I={i_1,...,i_n} and J={j_1,...,j_m}, if i_1 = j_k1, i_2 = j_k2,...,i_n = j_kn holds for some k1<k2<... A position occurrence of pattern sequence I is actually defined by the pair (p,t) of position p and record t such that I is embedded (included) in t so that pth letter of t is corresponding to the first letter of I. We denote the set of records including I by D(I), which is the set of document occurrences of I, and all position occurrences of I by S(I). LCMseq first computes the frequency of each item sequence composed of one item. If an item sequence {i} is frequent, then enumerate frequent item sequences obtained by appending one item to the tail of {i}. In this way, recursively, LCMseq enumerates all frequent item sequences. It makes no duplication. In the case of document occurrence, the attachment never increase the frequency, thus in this way we can generate all frequent sequence. In the case of position occurrence, an addition may increase the frequency. However, the position occurrences are always much than document occurrences, thus we find all frequent sequences in the sense of the document occurrence and output only those frequent in the sense of the position occurrence. This algorithm is written as follows: FrequentItemSequenceMining (D:sequence database, I:item sequence ) By calling FrequentItemSequenceMining (D, emptyset), we can enumerate all frequent item sequences. The test of inclusion is done as follows. The leftmost position of each letter i_1,...,i_h. The leftmost position of i_k is the earliest (leftmost) position of the letter j_l that can correspond to i_k. That is, between the leftmost positions of i_k-1 and i_k, there is no letter equal to i_k. When we want to check whether I={i_1,...,i_h} is included in J={j_1,...,j_t} or not, for each i_k, we compute the position p_k such that p_k is the smallest position such that p_k > p_k-1 and j_p_k = i_k. In this way, if the last letter i_h has a corresponding letter, that is, p_k is computed, I is included in J. In the case of position occurrence, we compute all positions that can correspond to each letter. Thus, the frequency of a sequence can be computed in time linear in the product of sequence size and the size of sequence database. However, a straightforward implementation of this check is very slow, since computing frequency of (I+j) takes long time. To be fast, we occurrence deliver and conditional database explained below. === occurrence deliver === === conditional database === 1. D'(I) := all records of D including I Then, the frequency of item sequence I plus j is the same, in D and D'(I), thus LCMseq uses D'(I) instead of D in the recursive call with respect to I+j. This technique is quite common in implementations of frequent item sequence mining. The details of the algorithm is described as follows. 1. Loading the input file and initialize the memory 2. LCM_seq (occ, t_new, buf) - call Delivery(occ) to compute Occ[i] for each item i with occ_pw[i]>=th - for each item i in the copy of jump, === Delivery is done as follows Delivery (occ) === Memory allocation and loading the transactions are described as follows. 1. allocate int array of size #transactions + (sum of frq(i) with frq(i)>=th) for "buf" We thank to Ken Satoh of National Institute of Informatics Japan, and Hiroki Arimura of Hokkaido University for their contribution for the research, We also thank to Yukinobu Hamuro of Kwansei gakuin University, JAPAN, Hiroyuki Morita of Osaka Prefecture University, JAPAN, Yasuyuki Shirai of Mitsubishi Research Institute, Osamu Imamura of Tohoku University, Takanobu Nakahara of kansai University, for their bug reports. Bart Goethal, "FIMI repository", http://fimi.cs.helsinki.fi/ (various frequent itemset implementation, benchmark databases, computational experiments comparing all implementations, and papers explaining the implementations, very useful site) Takeaki Uno, Masashi Kiyomi, Hiroaki Arimura, "LCM ver.2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets," in Proceedings of IEEE ICDM'04 Workshop FIMI'04, 1/Nov/2004, http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS//Vol-126/ Takeaki Uno,Tatsuya Asai,Yuzo Uchida, Hiroki Arimura, "An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases," Lecture Notes in Artificial Intelligence 3245 (Proceedings of Discovery Science 2004), 4/Oct/2004
Output I
For each item j,
if (I + j) is frequent in the sense of the document occurrence,
call FrequentItemSequenceMining (D, I+j)
The occurrence deliver is originally developed for itemset mining. Here we explain its sequence version. The occurrence deliver computes the leftmost position and position occurrence of I+j for all item j, in time linear in D(I). First, for each j, we give an empty bucket. Then, for each record T in D, we scan it from the leftmost matching of i_n. Then, for each item i we meet, we insert the tuple (T, position of i) to the bucket of i. If we meet the same item during the scan of one record, we operate only for the first one and ignore the rest for leftmost position, and insert all in the case of the position occurrence. After scanning all the records including I, the tuples in the bucket of item j gives the set of records including I+j, and the (leftmost) position occurrence for them.
Conditional database of item sequence I, denoted by D'(I), is given by the database obtained by
2. from each record, remove all items preceding to leftmost position
3. merge the identical records of D'(I) into one.
(do for all such identical records)
- scan the file and for each item i, count #transactions, sum of transaction size, frq(i), where frq(i) = #items equal to i (position occurrence), or #transactions including i (document occurrence)
- allocate pointer array for T[][] (2d array for transactions) , Occ[][] (2d array of 3 integers)
- allocate memory for w[] (transaction weights), "buf" for T[][] and occ_buf for Occ[][] (size of buf is ||D||*2, and size of occ_buf = ||D||*3, where D is the removal of infrequent items (items i, frq(i)<th had removed) ). Size of Occ[i] is equal to frq(i)*3
- load the file to buffer. If neither gap nor window constraint is given, skip items i with frq(i) < th,
set T[t] to the pointer to the beginning of T[i] in "buf", put "end mark" (large int) to the last of each T[]
- load the weight file, or set the transaction weights to 1
- find the same transactions by radix sort (with using Occ[][])
- unify the same transactions into one, with adding the transaction weights
- set occ to the array composed of (0,0,0) (1,0,0),..., (#transactions-1,0,0)
- allocate int array "jump" of size #items
- call LCM_seq ( occ, #transactions, current end of buffer )
- call Delivery(occ), to compute for each item i, sum of transaction weights (occ_w[i]), and positive transactions weights (occ_pw[i]) in transactions in occ including i.
( Delivery set "jump" to items where occ_pw[i] or occ_w[i] is not equal to 0)
- for each i in jump with occ_pw[i] < th, set occ_w[i] and occ_pw[i] to 0 and remove i from jump
- if the sum of weights in occ >= th, output sequence, or rules
(in the case of document occurrence, operate the transaction at most once, so ignore the second or later appearance)
- if jump_t = 0, clear occ_w, occ_pw and jump, return
- find the same transactions by radix sort, with ignoring the items not included in "jump"
- create a new transaction for the same transactions detected above into one, with adding the transaction weights.
(the buffer and ID of created transactions start from buf, and t_new
update buf/t_new to the end of buffer/ID )
- replace the "merged" transactions from occ, and insert the newly created transactions to occ.
- if the current pattern is empty sequence, replace each tuple (t,p,s) in Occ[i] by (t,p,p)
- make a copy of jump, and a copy of Occ[i] for items i in jump
- clear Occ[i], occ_w[i], and occ_pw[i] for items in jump, and clear jump
call LCM_seq (copy of Occ[i], t_new, buf ) (note that t_new and buf are updated)
end for
end of LCM_seq
- for each tuple (t, p, s) in occ
- for each position j from 0 to p-1,
i = item at the position j in transaction t
( for computing occ_pw and occ_w)
if occ_pw[i] = 0 and occ_w[i] = 0, insert i to "jump"
add the weight of t to occ_w[i] and add to occ_pw[i] if the weight is positive
( for computing Occ[i])
if occ_pw[i] >= th, then insert tuple (t, j, s ) to Occ[i]
!! in the case of document occurrence without gap/window constraint, if there are two or more same items in positions 0 to p-1, operate only for the biggest position !!
end for
end for
end of Delivery
This data storage method is called "forward star", in graph data structures.
2. allocate pointer (to int) array of size #transactions for T[][]
3. buf' := pointer to the start position of buf, t=0
4. while ( not end of the file )
4.1 set T[t] to the pointer to buf'
4.2 read integer to memory starting from buf', until newline
4.3 put "end mark" to the tail of the integers
4.4 buf' = the position next to "end mark", t = t+1
end while
Acknowledgments
References