Program Codes of Takeaki Uno and Hiroki Arimura
On this page you can download several programs implementing algorithms for enumeration problems and data mining problems. All these programs were written by me (Takeaki Uno) or my research colleague Prof. Hiroki Arimura of Hokkaido University. Each code has a responsible person, so any questions or bug reports should be sent to that person by e-mail: uno@nii.ac.jp for Takeaki Uno, and arim@ist.hokudai.ac.jp for Hiroki Arimura.
!!! Recent versions had an error in reading Macintosh-format files. The current versions do not have this error (the bug has been fixed).
@
LCM (Linear time Closed itemset Miner) ver.2 (C code, Takeaki Uno) [reference 1] Explanation
LCM finds all itemsets, called patterns, that appear frequently in a database. LCM can also enumerate maximal patterns and closed patterns. This code is a bug-fixed version of the code that received the best implementation award at FIMI 2004, a workshop organized as an implementation competition in data mining. It can also enumerate bipartite cliques in a bipartite graph.
LCM ver. 2.5 (slightly faster version only for frequent itemsets and closed itemsets)
LCM ver. 3 (improved version of the code submitted to OSDM05, possibly fastest among these) [reference 1a]
LCM ver. 5.3 (has many additional functions, such as association rule mining, giving (possibly negative) weights to transactions, upper/lower bounds on the frequency and the itemset size, and printing the IDs of the transactions including each pattern. In exchange for these functions and the loss of simplicity, it takes more time, about twice as much.)
LCM_basic: simple codes written in Perl/C for studying mining algorithms. LCM1 is just a depth-first algorithm with the delivery technique for computing occurrences, and LCM2 adds equi-support, database reduction, and conditional databases.
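To illustrate the depth-first scheme used in these basic codes, here is a minimal sketch in C that enumerates all frequent itemsets of a tiny hard-coded database by recursive depth-first search with plain support counting. The database, the item universe, and the minimum support are assumptions made only for this example, and the sketch omits the delivery, database-reduction, and conditional-database techniques of LCM itself.

/* A minimal depth-first frequent itemset enumerator (toy sketch, not LCM itself).
   Transactions are bit masks over at most 32 items; the database and the
   minimum support below are hard-coded assumptions for illustration. */
#include <stdio.h>

#define NITEMS  5
#define NTRANS  6
#define MINSUP  3

static const unsigned db[NTRANS] = {
  0x0b, /* {0,1,3} */ 0x07, /* {0,1,2} */ 0x19, /* {0,3,4} */
  0x03, /* {0,1}   */ 0x1a, /* {1,3,4} */ 0x0d  /* {0,2,3} */
};

/* count transactions containing all items of 'itemset' */
static int support(unsigned itemset) {
  int i, c = 0;
  for (i = 0; i < NTRANS; i++)
    if ((db[i] & itemset) == itemset) c++;
  return c;
}

static void print_itemset(unsigned itemset, int sup) {
  int i;
  printf("{");
  for (i = 0; i < NITEMS; i++)
    if (itemset & (1u << i)) printf(" %d", i);
  printf(" } support=%d\n", sup);
}

/* extend the current frequent itemset by items larger than 'last'
   so that each itemset is generated exactly once */
static void expand(unsigned itemset, int last) {
  int i;
  for (i = last + 1; i < NITEMS; i++) {
    unsigned child = itemset | (1u << i);
    int sup = support(child);
    if (sup >= MINSUP) {        /* anti-monotonicity: prune infrequent branches */
      print_itemset(child, sup);
      expand(child, i);
    }
  }
}

int main(void) {
  expand(0u, -1);   /* start from the empty itemset */
  return 0;
}

Each itemset is printed exactly once because a child only adds items larger than the last item added on the current branch.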
@
LCMseq ver 2.1 (C code, Takeaki Uno) Explanation
LCM_seq is a variation of LCM for sequence mining. It finds all frequently appearing sequence patterns in a given database, where each transaction is considered to be a sequence. The algorithm follows the so-called PrefixSpan scheme, but the data structures and processing methods are LCM based.
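The following toy sketch in C illustrates the PrefixSpan-style projection scheme mentioned above: a pattern is grown one symbol at a time, and each sequence keeps the offset from which its projected suffix starts. The sequence database, the alphabet 'a'..'d', and the minimum support are assumptions for the example; LCM_seq itself uses LCM-based data structures rather than this naive projection.

/* A toy PrefixSpan-style frequent sequence miner (illustrative sketch only).
   Sequences are lowercase strings; database and minimum support are assumptions. */
#include <stdio.h>
#include <string.h>

#define NSEQ   5
#define MINSUP 3
#define MAXPAT 32

static const char *db[NSEQ] = { "abcab", "acbd", "abd", "cab", "bca" };

/* occ[i] = offset in db[i] from which the projected suffix starts,
   or -1 if db[i] does not contain the current prefix pattern */
static void expand(char *pat, int len, const int *occ) {
  char c;
  for (c = 'a'; c <= 'd'; c++) {        /* small assumed alphabet */
    int newocc[NSEQ], sup = 0, i;
    for (i = 0; i < NSEQ; i++) {
      newocc[i] = -1;
      if (occ[i] >= 0) {
        const char *p = strchr(db[i] + occ[i], c);
        if (p) { newocc[i] = (int)(p - db[i]) + 1; sup++; }
      }
    }
    if (sup >= MINSUP && len < MAXPAT - 1) {
      pat[len] = c; pat[len + 1] = '\0';
      printf("%s  support=%d\n", pat, sup);
      expand(pat, len + 1, newocc);     /* grow the pattern depth-first */
    }
  }
}

int main(void) {
  char pat[MAXPAT] = "";
  int occ[NSEQ], i;
  for (i = 0; i < NSEQ; i++) occ[i] = 0;  /* empty prefix occurs at offset 0 */
  expand(pat, 0, occ);
  return 0;
}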
@
ACLA (Approximate Cover Listing Algorithm) (C code, Takeaki Uno) Explanation
For a given set S of itemsets, ACLA finds all minimal subsets of S that cover many items, i.e., whose union has size close to the number of items. Combined with pattern mining, we can find many ways of explaining the database with few patterns, and build cascade-form classification machines.
@
SSPC (Similar Set Pair Comparison) (C code, Takeaki Uno) Explanation
SSPC finds all pairs of records (transactions) that are similar, i.e., have a large intersection. For a transaction database in which each record is a subset of an itemset, it finds pairs of records whose intersection is larger than a threshold value, or whose intersection ratio is no less than the threshold. It does not perform an all-pairs comparison, so its performance is much better than that of ordinary methods. It needs a transposition of the database.
@
MaxMotif 2.0 (C code, Takeaki Uno) [reference 2] Explanation
MaxMotif (Java code, Hiroki Arimura) [reference 2]
MaxMotif enumerates all maximal (closed) motif patterns in a string, where a motif pattern is a string with wildcards, and a wildcard can match any letter.
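As a small illustration of the pattern definition (not MaxMotif code), the sketch below counts the occurrences of a motif with wildcards in a string; the choice of '.' as the wildcard symbol and the example strings are assumptions.

/* Toy illustration of motif-pattern matching with wildcards (not MaxMotif code).
   Here '.' denotes a wildcard that matches any single letter; this choice of
   wildcard symbol is an assumption for the example. */
#include <stdio.h>
#include <string.h>

/* return 1 if the motif matches the text starting at position pos */
static int match_at(const char *text, size_t pos, const char *motif) {
  size_t i, m = strlen(motif);
  if (pos + m > strlen(text)) return 0;
  for (i = 0; i < m; i++)
    if (motif[i] != '.' && motif[i] != text[pos + i]) return 0;
  return 1;
}

/* count all occurrence positions of the motif in the text */
static int count_occurrences(const char *text, const char *motif) {
  size_t pos, n = strlen(text);
  int c = 0;
  for (pos = 0; pos < n; pos++)
    if (match_at(text, pos, motif)) c++;
  return c;
}

int main(void) {
  const char *text = "abracadabra";
  printf("a.ra occurs %d times\n", count_occurrences(text, "a.ra"));  /* 2 */
  printf("a.a  occurs %d times\n", count_occurrences(text, "a.a"));   /* 2 */
  return 0;
}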
@
FREQT, FREQT ver.4 (Java code, Hiroki Arimura) [reference 3]
FREQT is a program for mining frequently appearing labeled ordered tree patterns. A labeled ordered tree is a tree with an ordering of the children and a label for each vertex. For a given labeled ordered tree G and a threshold value t, FREQT enumerates all labeled ordered trees which can be embedded into G in at least t ways, with a correspondence of the vertex labels.
@
UNOT (Java code, Hiroki Arimura) [reference 4]
UNOT is a program for mining frequently appearing labeled tree patterns. A labeled tree is a tree with a label for each vertex. For a given labeled tree G and a threshold value t, UNOT enumerates all labeled trees which can be embedded into G in at least t ways, with a correspondence of the vertex labels.
@
SHD (Sparsity-based Hypergraph Dualization, ver. 3.1) (C code, Takeaki Uno) [reference 5] Explanation
Fast algorithms for the problem called hypergraph dualization, also known as minimal set covering, minimal transversal, minimal hitting set, etc. The algorithm is designed with sparsity in mind, and thus performs well even on large-scale databases.
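To fix the terminology, the following sketch checks whether a given vertex set is a minimal hitting set (minimal transversal) of a hypergraph; the hyperedges and candidate sets are toy assumptions, and the code only illustrates the definition SHD works with, not the SHD algorithm.

/* Toy check of the minimal-hitting-set (minimal transversal) property.
   Hyperedges and candidate sets are bit masks over at most 32 vertices;
   the hypergraph below is an assumed small example. */
#include <stdio.h>

#define NEDGES 4

/* hyperedges of a small assumed hypergraph over vertices 0..4 */
static const unsigned edges[NEDGES] = {
  0x03, /* {0,1} */ 0x0c, /* {2,3} */ 0x11, /* {0,4} */ 0x06 /* {1,2} */
};

/* a hitting set intersects every hyperedge */
static int hits_all(unsigned s) {
  int i;
  for (i = 0; i < NEDGES; i++)
    if ((s & edges[i]) == 0) return 0;
  return 1;
}

/* minimal: removing any single vertex destroys the hitting property */
static int is_minimal_hitting_set(unsigned s) {
  int v;
  if (!hits_all(s)) return 0;
  for (v = 0; v < 32; v++)
    if ((s & (1u << v)) && hits_all(s & ~(1u << v))) return 0;
  return 1;
}

int main(void) {
  printf("{0,2}   -> %d\n", is_minimal_hitting_set(0x05)); /* hitting and minimal -> 1 */
  printf("{0,1,2} -> %d\n", is_minimal_hitting_set(0x07)); /* hitting, but {0,2} suffices -> 0 */
  return 0;
}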
@
MACE (MAximal Clique Enumerater, ver. 2.2) (C code, Takeaki Uno) [reference 6] Explanation
MACE enumerates all cliques / all maximal cliques in a given graph (an implementation of the Makino & Uno algorithm). (The bug reported by Prof. David Eppstein was fixed in 11/2010.) Tomita et al.'s algorithm is also implemented here [X] (for maximal clique enumeration and maximum clique computation) (C code, Takeaki Uno).
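For comparison with the problem definition, here is a textbook Bron-Kerbosch-style recursion that lists all maximal cliques of a tiny assumed graph. It is an illustration only; it is neither the Makino & Uno algorithm nor Tomita et al.'s pivoting algorithm implemented in the codes above.

/* Textbook Bron-Kerbosch-style maximal clique enumeration (illustration only).
   Vertex sets are bit masks over at most 32 vertices; the graph is a toy example. */
#include <stdio.h>

#define N 5

/* adjacency as bit masks: adj[v] = neighbors of v (assumed small graph) */
static const unsigned adj[N] = {
  /* 0 */ 0x06, /* {1,2} */
  /* 1 */ 0x0d, /* {0,2,3} */
  /* 2 */ 0x0b, /* {0,1,3} */
  /* 3 */ 0x16, /* {1,2,4} */
  /* 4 */ 0x08  /* {3} */
};

static void print_set(unsigned s) {
  int v;
  printf("{");
  for (v = 0; v < N; v++) if (s & (1u << v)) printf(" %d", v);
  printf(" }\n");
}

/* R: current clique, P: candidates that extend R, X: already-processed vertices */
static void bron_kerbosch(unsigned R, unsigned P, unsigned X) {
  int v;
  if (P == 0 && X == 0) { print_set(R); return; }   /* R is maximal */
  for (v = 0; v < N; v++) {
    unsigned bit = 1u << v;
    if (!(P & bit)) continue;
    bron_kerbosch(R | bit, P & adj[v], X & adj[v]);
    P &= ~bit;      /* move v from the candidate set ... */
    X |= bit;       /* ... to the excluded set */
  }
}

int main(void) {
  bron_kerbosch(0u, (1u << N) - 1, 0u);   /* prints {0,1,2}, {1,2,3}, {3,4} */
  return 0;
}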
@
PCE (Pseudo Clique Enumerater) (C code, Takeaki Uno) [reference 7] Explanation
PCE enumerates all pseudo cliques (dense subgraphs) in a given graph.
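Here a pseudo clique is understood as a vertex set whose induced subgraph contains at least a theta fraction of all possible edges; the small sketch below only evaluates that density for an assumed toy graph and threshold, and is not the PCE enumeration algorithm.

/* Toy density check for pseudo cliques (illustration of the definition only).
   A vertex set S is treated as a theta-pseudo clique if it spans at least
   theta * |S|*(|S|-1)/2 edges.  Graph and theta are assumptions. */
#include <stdio.h>

#define N 5

static const int edge[N][N] = {      /* adjacency matrix of a toy graph */
  {0,1,1,0,0},
  {1,0,1,1,0},
  {1,1,0,1,0},
  {0,1,1,0,1},
  {0,0,0,1,0}
};

/* density of the subgraph induced by the vertices listed in s[0..k-1] */
static double density(const int *s, int k) {
  int i, j, e = 0;
  if (k < 2) return 1.0;
  for (i = 0; i < k; i++)
    for (j = i + 1; j < k; j++)
      e += edge[s[i]][s[j]];
  return 2.0 * e / (k * (k - 1));
}

int main(void) {
  int s1[] = {0, 1, 2, 3};           /* 5 of the 6 possible edges are present */
  int s2[] = {0, 1, 2};              /* an ordinary (full) clique */
  printf("density {0,1,2,3} = %.2f\n", density(s1, 4));  /* 0.83 */
  printf("density {0,1,2}   = %.2f\n", density(s2, 3));  /* 1.00 */
  printf("{0,1,2,3} is a 0.8-pseudo clique: %s\n",
         density(s1, 4) >= 0.8 ? "yes" : "no");
  return 0;
}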
@
AFIM (Ambiguous Frequent Itemset Miner) (C code, Takeaki Uno) [reference 8] Explanation
AFIM finds all itemsets which are approximately included in many transactions. The task of this program is equivalent to enumerating all pseudo bipartite cliques (dense subgraphs) in a given bipartite graph.
@
CYPATH (enumeration for CYcles and PATHs) (C code, Takeaki Uno) Explanation Instances in the experiments
Enupath enumerates all cycles/paths connecting two specified vertices in a given graph/directed graph, with lower/upper bounds on the length. It can also enumerate those including no shortcut (i.e., chordless cycles/paths).
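The task can be illustrated by the following simple backtracking sketch, which enumerates all simple paths between two vertices whose number of edges lies between a lower and an upper bound; the graph, the endpoints, and the bounds are assumptions, and CYPATH's own enumeration is considerably more refined.

/* A simple DFS sketch that enumerates all simple paths between two vertices
   with lower/upper bounds on the number of edges (illustration of the task only).
   Graph, endpoints and bounds below are assumptions. */
#include <stdio.h>

#define N 6
#define LOWER 2
#define UPPER 4

static const int edge[N][N] = {      /* adjacency matrix of a toy undirected graph */
  {0,1,1,0,0,0},
  {1,0,1,1,0,0},
  {1,1,0,1,1,0},
  {0,1,1,0,1,1},
  {0,0,1,1,0,1},
  {0,0,0,1,1,0}
};

static int path[N + 1], used[N];

static void print_path(int len) {
  int i;
  for (i = 0; i <= len; i++) printf(i ? "-%d" : "%d", path[i]);
  printf("\n");
}

/* extend the current path ending at v; len = number of edges used so far */
static void dfs(int v, int t, int len) {
  int w;
  if (v == t) {
    if (len >= LOWER) print_path(len);   /* report only paths within the bounds */
    return;
  }
  if (len == UPPER) return;
  for (w = 0; w < N; w++) {
    if (edge[v][w] && !used[w]) {
      used[w] = 1; path[len + 1] = w;
      dfs(w, t, len + 1);
      used[w] = 0;                        /* backtrack */
    }
  }
}

int main(void) {
  int s = 0, t = 5;
  used[s] = 1; path[0] = s;
  dfs(s, t, 0);
  return 0;
}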
@
TGE (enumeration for Trees and subGraphs/Connected components) (C code, Takeaki Uno) Explanation
Enutree enumerates all directed/undirected subtrees/rooted trees/subgraphs/connected components, with lower/upper bounds on the size.
@
SACHICA 3.4 (Scalable Algorithm for Characteristic/Homogenous Interval CAlculation) (C code, Takeaki Uno) [reference 9] Explanation
For an input string, SACHICA finds all pairs of substrings of a specified length such that the Hamming distance between them is no greater than a given threshold d. Its performance is quite good compared to ordinary algorithms. It was mainly developed for genome homology search, and thus has many options for genomic problems: genome comparison, Solexa sequencing analysis, assembly, etc.
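The key observation exploited by this kind of algorithm (see [9]) is a pigeonhole argument: if two length-l strings are within Hamming distance d, then after splitting both into d+1 blocks at least one block matches exactly, so candidate pairs can be collected by grouping on each block instead of comparing all pairs. The sketch below demonstrates only this observation on assumed toy strings; it is not SACHICA's multi-sorting implementation.

/* Pigeonhole idea behind finding substring pairs within Hamming distance d:
   split each length-L string into D+1 blocks; if two strings are within
   distance D, at least one block matches exactly.  Toy data, sketch only. */
#include <stdio.h>
#include <string.h>

#define L 12   /* substring length */
#define D 2    /* Hamming distance threshold */

static int hamming(const char *a, const char *b) {
  int i, d = 0;
  for (i = 0; i < L; i++) if (a[i] != b[i]) d++;
  return d;
}

/* 1 if block k (of D+1 nearly equal blocks) of a and b match exactly */
static int block_equal(const char *a, const char *b, int k) {
  int start = k * L / (D + 1), end = (k + 1) * L / (D + 1);
  return memcmp(a + start, b + start, (size_t)(end - start)) == 0;
}

int main(void) {
  const char *x = "ACGTACGTACGT";
  const char *y = "ACGTACCTACGA";   /* differs from x in 2 positions */
  int k, shared = 0;
  printf("Hamming distance = %d\n", hamming(x, y));
  for (k = 0; k <= D; k++)
    if (block_equal(x, y, k)) { printf("block %d matches exactly\n", k); shared = 1; }
  if (hamming(x, y) <= D && shared)
    printf("as guaranteed, a pair within distance %d shares a block\n", D);
  return 0;
}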
SHEAP 1.2 (Similarity/Homology Efficient Analyze Procedure) (C code, Takeaki Uno) Explanation
This program draws a picture (dot plot) of similarity between two texts (files, or genome sequences), or shows the areas of a text including many substrings frequently appearing in other areas of the text.
ENTRAD 1.0 (Envelope-in-tree algorithm for geometrical edit distance) (python, Takeaki Uno, coded by Tetsushi Matsui) Explanation
It computes the fragment edit distance of two strings, which is a maximum matching of locally similar structures between the two strings, in quite a short time compared to the edit distance. The program is written in Python, and can be executed on any PC whose OS has Python installed, such as Linux, Cygwin, or macOS.
SIMSTR (Similarity-based frequent STRing, ver. 1.1) (C code, Takeaki Uno) Explanation Instances in the experiments
SIMSTR finds approximately frequently appearing strings in string data, based on the enumeration of pairs of strings with small Hamming distance. It finds many (though not a huge number of) relatively long pattern strings in quite a short time, even for strings of 100 million letters.
Epi-approx (Mining frequent approximate episodes) (C code, Yuzo Uchida, Hiroki Arimura)
Epi-approx enumerates all approximately frequent episodes in a set of sequences, where an episode is just a sequence of letters and its approximate frequency is defined as the number of input sequences in which the episode appears within edit distance no more than a given threshold K.
@
POWIC (POWer Index Calculation) (C code, Takeaki Uno) Explanation Instances in the experiments
POWIC computes the power indices of a weighted majority game, such as a council or the general meeting of a company, in which players have different quantities of voting rights. The indices it computes are the Banzhaf, Shapley-Shubik, and Deegan-Packel indices, all of which represent the power of each player.
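As a concrete example of what such an index measures, the sketch below computes the raw Banzhaf counts of a small weighted majority game by brute force over all coalitions; the weights and the quota are assumptions, and POWIC of course handles much larger games than this exhaustive enumeration could.

/* Computing the raw Banzhaf index of a small weighted majority game by brute
   force over all coalitions (illustration of the definition only; weights and
   quota are assumptions).  Player i is a "swing" for a coalition S not
   containing i if S loses but S+{i} wins; the index counts each player's swings. */
#include <stdio.h>

#define N 4

static const int weight[N] = { 4, 3, 2, 1 };   /* voting weights */
static const int quota = 6;                    /* votes needed to win */

static int total_weight(unsigned coalition) {
  int i, w = 0;
  for (i = 0; i < N; i++)
    if (coalition & (1u << i)) w += weight[i];
  return w;
}

int main(void) {
  int swings[N] = { 0 };
  unsigned s;
  int i;
  for (s = 0; s < (1u << N); s++) {            /* all coalitions */
    int w = total_weight(s);
    for (i = 0; i < N; i++)
      if (!(s & (1u << i)) && w < quota && w + weight[i] >= quota)
        swings[i]++;                           /* i turns a losing coalition into a winning one */
  }
  for (i = 0; i < N; i++)                      /* expected counts for this game: 5, 3, 3, 1 */
    printf("player %d (weight %d): Banzhaf count = %d\n", i, weight[i], swings[i]);
  return 0;
}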
@
HornSAT (Solving Horn Satisfiability Problems) (C code, Takeaki Uno)
It finds a satisfying assignment for a Horn formula.
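Because every clause of a Horn formula has at most one positive literal, satisfiability can be decided by forward chaining (unit propagation) starting from the all-false assignment; the sketch below illustrates this on an assumed toy formula in implication form and is independent of the distributed HornSAT code.

/* Deciding Horn satisfiability by forward chaining / unit propagation
   (a simple quadratic-time sketch; the toy formula below is an assumption).
   Each Horn clause is written in implication form  body => head, where the
   body is a set of variables and head is a variable, or 0 meaning "false". */
#include <stdio.h>

#define NVARS    5      /* variables are numbered 1..NVARS */
#define NCLAUSES 5
#define MAXBODY  3

typedef struct { int body[MAXBODY]; int nbody; int head; } clause;

/* toy formula:  x1,  x1=>x2,  x1&x2=>x3,  x3=>x4,  x2&x5=>false */
static const clause f[NCLAUSES] = {
  { {0},     0, 1 },
  { {1},     1, 2 },
  { {1, 2},  2, 3 },
  { {3},     1, 4 },
  { {2, 5},  2, 0 }
};

int main(void) {
  int value[NVARS + 1] = { 0 };   /* start from the all-false assignment */
  int changed = 1, i, j;
  while (changed) {
    changed = 0;
    for (i = 0; i < NCLAUSES; i++) {
      int fired = 1;
      for (j = 0; j < f[i].nbody; j++)
        if (!value[f[i].body[j]]) { fired = 0; break; }
      if (!fired) continue;
      if (f[i].head == 0) { printf("unsatisfiable\n"); return 1; }
      if (!value[f[i].head]) { value[f[i].head] = 1; changed = 1; }
    }
  }
  printf("satisfying assignment:");         /* the minimal model of the Horn formula */
  for (i = 1; i <= NVARS; i++) printf(" x%d=%d", i, value[i]);
  printf("\n");
  return 0;
}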
@
[1] Takeaki Uno, Tatsuya Asai, Yuzo Uchida, Hiroki Arimura, "An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases", Discovery Science 2004, LNAI 3245, pp.16-31
[1] Takeaki Uno, Masashi Kiyomi, Hiroki Arimura, "LCM ver.2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets", Proceedings of IEEE ICDM'04 Workshop FIMI'04, available at http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS//Vol-126/, 2004
[1] (version 1) Takeaki Uno, Tatsuya Asai, Hiroki Arimura and Yuzo Uchida, "LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets", Workshop on Frequent Itemset Mining Implementations (FIMI'03), available at http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS//Vol-90/
[1a] Takeaki Uno, Masashi Kiyomi, Hiroki Arimura, "LCM ver.3: Collaboration of Array, Bitmap and Prefix Tree for Frequent Itemset Mining", Open Source Data Mining Workshop on Frequent Pattern Mining Implementations 2005, available at home page of Open Source Data Mining Workshop on Frequent Pattern Mining Implementations 2005, 2005
[2] Hiroki Arimura, Takeaki Uno, "An Efficient Polynomial Space and Polynomial Delay Algorithm for Enumeration of Maximal Motifs in a Sequence", Journal of Combinatorial Optimization 13, pp. 243-262, 2007
[3] Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Hiroki Arimura, Hiroshi Sakamoto, Setsuo Arikawa: Efficient Substructure Discovery from Large Semi-structured Data. SDM 2002
[4] Tatsuya Asai, Hiroki Arimura, Takeaki Uno and Shin-ichi Nakano, "Discovering Frequent Substructures in Large Unordered Trees", Lecture Notes in Artificial Intelligence 2843 (Proceedings of 6th International Conference on Discovery Science (DS2003)), Springer-Verlag, pp. 47-60, 2003
[5] Keisuke Murakami, Takeaki Uno: Efficient Algorithms for Dualizing Large-Scale Hypergraphs, arXiv, CoRR abs/1102.3813: (2011)
[6] Kazuhisa Makino, Takeaki Uno, "New Algorithms for Enumerating All Maximal Cliques", Lecture Notes in Computer Science 3111 (Proceedings of SWAT 2004), Springer, pp. 260-272, 2004
[7] Takeaki Uno, An Efficient Algorithm for Solving Pseudo Clique Enumeration Problem, Algorithmica 56(1), pp. 3-16, 2010
[8] Takeaki Uno and Hiroki Arimura, "Ambiguous Frequent Itemset Mining and Polynomial Delay Enumeration", Lecture Notes in Artificial Intelligence 5012, pp. 357-368, 2008
[9] Takeaki Uno, "An Efficient Algorithm for Finding Similar Short Substrings from Large Scale String Data", Lecture Notes in Artificial Intelligence 5012, pp. 345-356, 2008 (Best Paper Runner-up Award)
[9] Takeaki Uno, Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data, Knowledge and Information Systems 25(2), pp. 229-251, 2010
[X] Etsuji Tomita, Akira Tanaka, Haruhisa Takahashi, "The worst-case time complexity for generating all maximal cliques and computational experiments", Theoretical Computer Science 363, pp. 28-42, 2006
@