SACHICA: Scalable Algorithm for Characteristic/Homogeneous Interval Calculation

Coded by Takeaki Uno, e-mail:uno@nii.jp,
homepage: http://research.nii.ac.jp/~uno/index.html

This program is available for only academic use, basically. Anyone can modify this program, but he/she has to write down the change of the modification on the top of the source code. Neither contact nor appointment to Takeaki Uno is needed. If one wants to re-distribute this code, do not forget to refer the newest code, and show the link to homepage of Takeaki Uno, to notify the news about codes for the users. For the commercial use, please make a contact to Takeaki Uno.

ProblemDefinition
InputFileFormat
Usage
OutputFormat
ExampleofExecution
AdvancedUsage
Performance
Acknowledgments
References

Problem Definition

Let S be a set of strings of the same length l, and d be a threshold value. The problem to be solved is to find all the pairs of strings with Hamming distance at most d. The Hamming distance between two strings S1 and S2 is the number of positions i such that the i-th letters of S1 and S2 differ (here we say two strings are similar when their Hamming distance at most d, and a pair of similar strings similar pair). SACHICA inputs S directly, or inputs a long string and takes all the substrings of length l. Instead of outputting the pairs, SACHICA can count the number of strings similar to a string, for each string.

Input File Format

Sachica inputs a text file, and compare the strings in the file, so that each character is a letter (thus basically can read any binary file). Strings to be compared are all substrings of the given length in the text file. By giving a command, SACHICA regards two or more consecutive newline characters (such as "\n") as a separator of (long) strings, so that we can give several (long) strings. In this case, SACHICA does not take any substring that start from a string and end at another string (this mode is called "multi-string mode"). Note that the newline characters at the beginning of the strings are considered as a part of a separator, thus any string start with non- newline characters. SACHICA accepts any newline characters, windows ("\n\r"), UNIX ("\n") and Macintosh ("\r"). Suppose that we want to compare many strings of the same length l, but not the substrings of them. This can be done by make a text file including the strings separated by double newline characters, and setting the substring length (to be compared) l.

Usage

==== How to Compile ====

Unzip the file into arbitrary directory, and execute "make". You then can see "sachica" (or sachica.exe), and other additional executable files for various additional tools, in the same directory. sachica is for the mode of regarding one byte as one character, and sachica16 is for that of two bytes, and sachica32 is for four bytes. Note that sachica, sachica16 and sachica32 accepts characters of at most 255, and 65535, and 9,999,999, respectively.

==== Command Line Options ====

To execute SACHICA, just type sachica and give some parameters as follows.

% sachica [command] [options] input-filename length #mismatch [output-filename]

"%" is not needed to type. It is a symbol to represent command line. To see a simple explanation, just execute "sachica" without parameters. An example is

%sachica piclZ -l 60 test.txt 30 2 out

"input-filename" is the filename of the input file. The 1st letter of input-filename must not be '-'. Otherwise it is regarded as an option. The input file format is written below. "length" is the length of the strings to be compared, and "mismatch" is the threshold value for the Hamming distance. "output-filename" is the name of file to write the output by the program. If the file name is "-", the program outputs to the standard output. You can omit the output file to see only the number of pairs.

The first parameter [command] is composed of some letters, each of them corresponds to a command. This is given to the program to indicate the task and some other execution modes. The tasks are roughly three kinds.

p:find similar pairs
find all pairs of strings with Hamming distance at most "#mismatch",
and output the pairs of string IDs or positions.
m:compute mismatch tolerance
mismatch tolerance of a string s is the Hamming distance to the string
s' which has the minimum Hamming distance to s among all strings.
By giving this command, SACHICA computes, for each string, the
mismatch tolerance or reports that the mismatch tolerance is larger
than "#mismatch", by setting the mismatch tolerance value to
"#mismatch"+1.
r:count the number of similar substrings to each string
for each string s, count the number of strings s' such that the
Hamming distance between s and s' is at most d.

Additionally, SACHICA has a special command P, for print out the substring. By executing

%sachica P input-filename [num] [len]

SACHICA prints substring of the input file, from position [num] with length [len]. If we give Ps command, SACHICA prints [num]th substring. In this case [len] will be ignored, but it has to be specified. You can give any number or letter for [len].

The following commands give some variations for the tasks.

C:multi string mode (with self-comparison)
by giving this option, SACHICA considers the input file as
a set of strings. The separator of strings is two consecutive
newline characters. The details are described in the input file
format section.
c:multi string mode (without self-comparison)
with this command, SACHICA also use multi string comparison but
SACHICA does not compare pairs of substrings coming from the same
string. On contrary, with C command, any pair will be compared.
Z:output maximal continuously-small-mismatch intervals
if substrings starting at positions p and q are similar, i.e., of
Hamming distance at most "#mismatch", the neighboring positions,
p-1 and q-1, and p+1 and q+1 are likely similar, in usual. Given
this command, SACHICA checks such neighboring positions for each
similar pair, and extend the similar pair by checking further
neighbors, until meeting pairs of Hamming distance larger than
"#mismatch". The obtained substring pair, called "maximal similar
pair" is output by their starting positions, and the length. In
this output form, the actual Hamming distance of each similar pair
of a maximal similar pair will not be output.
l:comparison with interleaving positions
when we want to find maximal similar pair of sufficiently long
strings, SACHICA has to find only one of its similar substring
pair (remind that maximal similar pair is composed of
consecutive similar pairs). Using this command, SACHICA reduces
the substrings to be compared, such that any maximal string
pair of length 2l are surely found where l is the substring length.
H,E:use Hamming distance/edit distance for distance measure in the
pair extension
using Z command explained above, SACHICA can find pairs of
long similar substrings in string(s). The method is to first
find a similar pair and extend the pair for both direction
to specified length, or until the error ratio will be larger
than the given threshold. This command specifies the distance
measure used in the extension; H is for Hamming distance and
E for edit distance. If these commands are not given, the
distance measure is CIH (explained below) used in Z command;
go to both direction until meeting a non-similar pair.
Q:output a pair if a chunk of them matches
when the substrings to be compared is long, SACHICA internally
divides each substrings into several chunks (substrings) and
find similar chunks, since similar pair must have at least
one similar chunk. Using this command, SACHICA outputs a pair
of substrings when one pair of their chunks are similar.
The similarity threshold (Hamming distance) is
"# mismatches" / "# chunks".
v:count the similar strings, for each position of each string
given r command, SACHICA counts the number of similar strings
the substring starting at each position. By giving this command,
when SACHICA finds a similar pair, it increases the counter of
each position of the substrings, i.e., increase "length"
counters while one counter in usual. If this is given with "m"
command, the minimum mismatch torrelance among positions
a,...,a+[len]-1 is written to the position a.
S:compute minimum distance to the other strings for each string
this command is used to output the data related to strings
(separated by two newlines). In case of "p" command, it output
only string IDs but not positions. In case of r/m mode,
it counts the number of similar strings, or compute mismatch
tolerance for each string.

The following commands are for execution mode.

q:no message
SACHICA outputs some internal computational parameters and the
property of the input files such as length of the input string
and the number of strings in the multi-string mode (output to
standard error). By giving this command, SACHICA does not output
such information.
V:show progress
by giving this option, SACHICA outputs the progress of computation
in the form of (1/100,...,100/100). This is actually an
approximation, thus the output progress is not always proportional
to the past computational time compared to the whole computation
time.
a:append output to a (existing) file
the output of SACHICA will be appended to output file, so the
output file will not be cleared, if it exists.

The followings are input format commands.

i:ignore newline characters in the input file,
SACHICA reads a text file, with regarding each character as a
letter, thus basically can input any binary string. By giving
this command, all newline-codes ('\n' and '\r') are removed from
the input, and not counted. In the multi-string mode, the
separations are two or more consecutive newlines but they are
not ignored.
G:genome sequence mode
this is the mode of genome sequence mode (fasta file mode). In
precise, any letter aAtTgGcC will be ATGC, nNxX will be N, which
acts as a wildcard, and other characters will be ignored. Newline
codes are ignored (same as i command). Further, a line starting
with '>' is regarded as a comment line, and will be ignored.
J (for genome sequence mode): regard comment as a separator
a comment line is considered as a separator, thus if a string
includes a comment line, the string will be separated at the
comment.
J (for sachica16):read text as EUC text (Japanese text code)
this is the Japanese text mode (EUC mode), which can read both
8 bit characters and 16 bit characters, and external 24 bit characters.
This is valid only for sachica16 program (16 bit mode executable file),
explained later.
J (for sachica32):regard each byte as 1 digit of at most 200
sachica32 reads 4bytes as a character. It actually conflicts to
newline codes, since we can not give character whose first 1 byte is
a newline code. By giving this option, sachica32 regards each byte of
4 bytes as 1 digit of at most 199, such that 32 is 0, 33 is one,...
and 231 is 199. For example, the 4bytes 32 33 35 39 corresponding to
one character will be decoded to (32-32)*200*200*200 + (33-32)*200*200
+ (35-32)*200 + (39-32) = 40607. sachica32 accepts the characters of
less than 10,000,000, and if the decoded number is over 10,000,000
for example (35-32)*200*200*200 = 24,000,000, it will be regarded as 0.
It is common to the usual mode.

The following commands are for output formats.

s:output substring for all positions/pairs found,
given "p:find similar pair" command, a pair of substrings is output
by their positions in the string, and not output the substrings
themselves. By giving this command, SACHICA output the two
substrings following to the positions.
N:output numbers instead of letters (for sachica32)
in sachica32, output a character by its ID. With J command, it
outputs a number given by 4 digit of from 0 to 199. This is used
for outputting strings, such as s command.
I:output intervals of continuous small/large mismatch tolerance
given r or m command, sachica outputs the #similar strings/mismatch
tolerance of positions when it is smaller than the threshold (
the threshold is given by -k option, explained below). By this
command, the positions to be output are grouped in continuous
positions, and the beginning and the ending positions of the
intervals will be output, instead of outputting all the positions
contained in an interval. Note that in this mode the value will not
be output but just positions are output in the form of intervals.
M:output positions with large mismatch tolerance
given r or m command, sachica outputs the positions with #similar
strings/mismatch tolerance no greater than the threshold. Instead,
by this command, sachica outputs the positions with these values
no less than the threshold. This affects to I command above.
n:normalize positions of all strings
even in multi-string mode, the strings are output by the internal
position, which is slightly different from the original position.
This commands normalize the positions; output the position of a
substring in the string it belongs, i.e., if the substring belongs
to i-th string, then the output position will be the the position
of the substring in the i-th string.
d:output string IDs
output the string ID and position for each substring, thus one pair
will be output by four numbers, ID, position, ID, and position.
f:output #substrings in interval
with Z mode, Sachica outputs intervals with their lengths. The length
is the number of letters from the starting position to the ending
position of the interval. This command changes the format of the
length, to the number of substrings in the interval. So, if the
length of an interval is 50, and the length of substring is 30,
the output length will be 21.
O,o:output overlapping strings
two strings are said to be overlapping if the suffix of one string
(a substring to the end) matches to the prefix of the other (a
substring starting at the beginning). Here "match" means that the
distance between the prefix and the suffix is in the threshold
value. The distance measure and the threshold H,E and -e,-E.
If we give O command and -e [num] option, if the overlap (the
prefix and the suffix) has distance of at most [num], the string
pair will be output. In the case of -E [num] option, if the
overlap has distance of at most [num] * |overlap|, the string pair
will be output, where |overlap| is the length of the prefix/suffix.
In the case of o command, the output substring pair satisfies that
one of its substring is a prefix or a suffix of a string.
If we give -x [num] option, the prefix/suffix does not have to
start from/end at a string end. They can start/end at the letter
which has distance at most [num] from the string end.

Following to the first parameter (commands), we can give several options described below. Basically any option accompanies one or more values, and they are in the form "-? [num]" where ? is a letter. We can give no option.

The following options are for file input modes.

-2 [filename]:read additional file, attached to the input file
sachica inputs one file. This option enables to read the second
file. The strings will be attached to the end of the first file.
In multi-string mode, the beginning of the second file is regarded
as a separator.
-s [char]:ignore the intervals including [char]
when a substring taken from a position includes a letter specified
by [char], the substring will be discarded, thereby will be
compared to no other substrings.
-S [char]:ignore the intervals including more than #mismatch of [char]s
similar to "#mismatch" of letters specified by [char].
-R [num] [shift-len]:ignore repeated (tiling) substrings
a repeated (tiling) string is a string obtained by concatenating
a shorter string several times. This option is to ignore substrings
likely being a repeated string. A repeated string satisfies that
one of its prefix is similar to one of its suffix, thus we ignore
the substrings when a prefix and a suffix have Hamming distance at
most [num]. [shift-len] restricts the length of suffixes/prefixes.
Suffixes/prefixes of length greater than "length of substring" -
[shift-len] are considered.
-d [num]:duplicate strings from [num]th to the end
the strings with string ID from [num] to the last string will be
duplicated, and appended to the last string. Thus the first string
of the duplicated strings has one larger string ID than the
last string. When [num] is -1, -2, or -3, it is automatically
changed to a certain number (see (*1)).
-r [num]:reverse strings from [num]th string to the last string
reverse the strings from string ID [num] to the last. In the genome
mode, each letter will be transformed to its complement (A-T, G-C).
When [num] is -1, -2, or -3, it is automatically changed to a
certain number (see (*1)).
-3 [num]:matrix mode for each string up to [num]th
in matrix mode, each string will be considered as a matrix such that
each cell is a letter of the string. A row of the matrix corresponds
to a substring of the string, so that the string is divided into
several substrings of the same length and the i-th substring is
the i-th row. In the matrix mode, SACHICA takes only substrings
which are totally contained in a row. The length of the rows is
given by the length to the first newline character of the string in
the input file. It means that we can input matrices of different
row sizes. When [num] is -1, -2, or -3, it is automatically
changed to a certain number (see (*1)).
-y [num]:set pattern y-length (segments of same length mode if -1)\n\
by giving this option, we can compare submatrices of matrices,
(all strings will be considered as matrices, if -3 is not given).
In matrix mode, a substring to be compared is a submatrix of a
matrix such that the number of rows in the submatrix is one, and
the number of columns is "length of substring". This option
can specify the number of rows, thus the substring is no longer
a substring but a submatrix.
If -1 is given to [num], the matrices are dealt in "segment mode".
In segment mode, each row of each matrix is considered as a string,
thus we can give many strings of the same length. Compared to usual
way with two-consecutive-newlines, memory usage is reduced up to
32*(#strings) bytes, at maximum. Note that the substrings in a matrix
(string) will be considered to belong the same string, even in
segment mode, thus no two substrings taken from the same matrix
will not be compared, with c command. In segment mode, the segment
ID will be output to any position in each pair. If d command is
given, SACHICA does not output segment ID but output string ID.
-D [num]: compute the mismatch with considering each letter as [num]-digit
this option enables to compress texts on small alphabet, such as
01 sequence, to save memory space and fasten the computation.
Given -D option, SACHICA decompose each letter to digits from
0 to [num]. For example, if [num] is 2, 13 is decomposed to
1011, and if [num] is 3, 23 is decomposed to 212. To avoid the
conflict with newline characters, each number has to begin with 32,
thus, actually 45 (32 + 13) will be decomposed to 1011. However,
SACHICA can not take position starting at internal letter of
the decomposed sequence; the substring starting the third letter
of 1101 can not be considered. Thus, this option can be used only
when we want to deal with restricted positions, such as whole
string comparison (not substring comparison).
-T [num]: considering each letter as [num]-digit, and compute the
cost by the difference of the letter.
similar to -D option, giving -T option decomposes each letter
into digits. The difference is that the distance measure changes
slightly; the distance is not the number of mismatches but
the sum of differences between each digits. For example, the
distance between 25412 and 15122 is |2-1| + |5-5| + |4-1| + |1-2|
+ |2-2| = 5.
-C [filename]: read mismatch tolerance/#similar strings from file
read mismatch tolerance/#similar strings of each position from
the file [filename]. It means, in the case of counting similar
strings, the number of read from the file will be added to each
position. In the case of mismatch tolerance, if the value read
from the file is smaller than the mismatch tolerance, the
value from the file is output instead of the mismatch tolerance.
The format of the file is simple; write the score of position i
to the (i+1)th line.

The following options are for output formats.

-p [num]:extend the output interval [num] letters to both directions
given s command, SACHICA outputs substring itself following its
position. By giving -p option, the extended substring will be
output, instead the substring. For example, if we give "-p 10",
when SACHICA outputs a position i and the substring starting at
i-10 and ends at i + "length of substring" + 9.
-o [num]:output what related to strings from 1st to [num]th
for "p" command, output similar pairs only when one of substring
belongs to a string of ID no greater than [num]. For r and m
commands, output mismatch tolerance/#similar substrings only for
substrings of strings of ID no greater than [num]. When [num] is
-1, -2, or -3, it is automatically changed to a certain number
(see (*1)).
-n [num]:normalize positions by [num]th string
the position of substrings of strings from [num]th to the last
will be normalized by the starting position of [num]th string,
i.e., each position of string from [num]th to the last will be
subtracted the position of the first letter of [num]th string.
When [num] is -1, -2, or -3, it is automatically changed to a
certain number (see (*1)).
-b [num]: add [num] to each position
by this option, SACHICA adds [num] to each position when it outputs
a position. The combination of -b, -B and -c options enables
comparison in huge text file, by dividing it into several files,
and compare one-by-one. By -b and -B options, we can get the
positions in the original file.
-B [num]: add [num] to string ID/second position
this option is given with -b option. If d command is given, [num]
is added to all string IDs. If d is not given, [num] is added to
the position of second string in each pair. The positions of the
substrings of a pair is output so that smaller position is first.
-u,-l [num]: upper/lower bounds for intervals to be output
using Z command, SACHICA outputs intervals. These options restrict
the length of the intervals to be output; in the case of -l option,
only the maximal similar pairs of length at least [num] will be
output.
-c [num]:compare strings 1 to [num]-1th, and the other strings
this options divide the strings into two groups, from string ID 1
to [num]-1, and those from [num] to the last. SACHICA outputs
pairs of strings such that one belongs the first group and the
other belongs the latter. If both are included in the same group,
the pair will be not output. When [num] is -1, -2, or -3, it is
automatically changed to a certain number (see (*1)).
-h [filename]: print positions of head/tail of each string to [filename]
SACHICA has internal positions for each letter, and they are
different from the original positions in the file, in multi-string
mode. By this option, SACHICA outputs the position of the first
letter and the position of the end letter +1, of each string,
to the file of name [filename].

The following options control the execution and the output solutions.

-m [num]:execute the program with mismatch number from [num]
giving this option, SACHICA first executes with setting
"#mismatches" to [num]. After the execution, SACHICA increases
the "#mismatches", and executes again. This iteration is done
until "#mismatches" reaches to the original value. Combining
this option with -k/-K option, we can find for example first
10 similar substrings nearest to each position.
-K [num] (with -m): output pairs of distance no more than minimum+[num]
(-m command) by giving -m option, SACHICA increases #mismatches
one by one. For a substring positioning at x, suppose that
the first similar substring to x is found when #mismatches is
k, and denote it by m(x). By giving -K option, SACHICA outputs
a similar pair only when #mismatch is no more than m(x) or m(y).
-k [num]: (p command) output at most [num] similar strings for
each position
this option restricts the number of pairs including a position.
For each position p, after finding [num] pairs of p and another,
SACHICA never output such pair; they are found but discarded.
-k [num]: (r/m command) give threshold to output position
in r/m command mode, the value (#similar strings/mismatch
tolerance) of all positions will be output. This option restricts
the output; output only positions having values no greater than
[num]. In case of giving M command, output only positions with
values no less than [num].
-e [num]:set max absolute error number for similarity extension
in Z command, SACHICA extends the similar pair for both
direction. This option gives the absolute upper bound for the
distance; SACHICA start from the two empty strings, and
appends the letters on the left side of the similar pair one by
one, until the distance between the appended strings will over
[num]. Then, do the same operation for the right side, and finally
get a maximally extended string pair. This option is valid only
when the distance measure is Hamming distance or edit distance,
or the O/o command is given. In the case of O/o command (finding
overlap), SACHICA outputs a string pair only when their overlap
has distance at most [num].
-E [real-num]:set max error ratio for similar pair extension
similar to -e option, it gives upper bound for the distance
but it is given by a ratio; SACHICA extends the pair unless the
distance between the extended strings is no more than the
[real-num] * "length of extended string". This option is also
valid only when the distance measure is Hamming distance or edit
distance, or the O/o command is given. In the case of O/o command
(finding overlap), SACHICA outputs a string pair only when their
overlap has distance at most [real-num] * "overlap length".
-G,-g [num]: constraint for gap of pair, -G is maximum, -g minimum [num]
this option gives a restriction to the distance of the positions to
be output. Here the distance between positions x and y is just
|x-y|. -g and -G give the lower and upper bound for the distance,
thus for -g, only when the distance is no less than [num], the
pair will be output.
-w [char]: specify wildcard
a wildcard is a letter which can match to any letter. This option
specify wildcard, and the letter given by [char] (such as 'X', '*')
will match to every letter. However, the treatment of wildcard is
not complete, and sometimes a wildcard does not match to other
characters. In precise, wildcard is completely considered when
SACHICA extend the letters, and incomplete for finding similar
pair. More precisely, a wildcard is considered to compare blocks,
and is not considered at sorting the blocks, thus when at most
(#mismatch) blocks includes mismatches or wildcard, the wildcard
works.

The following options are for setting parameters for internal computation.

-q,-Q [num]: specify number of blocks and chunks
SACHICA internally divides each substrings to be compared into
several chunks, and each chunks are also divided into blocks.
The number of chunks and blocks affects to the speed of the
computation, and are determined automatically by using some
estimation method. Instead, these options specify the number of
chunks and blocks, so SACHICA never change it even it is far from
the estimated better one.
-I [num]:specify interleave length
this option specifies the accuracy for interleaving mode given by
l command. Maximal similar pairs of length at least [num] -1 +
"length of substrings" will be always found. Giving larger [num],
computational time will be decreased.
-i [num]:specify the first interleaving length
this option specifies the first interleaving length, so that
[num] consecutive positions, and positions x s.t. x mod [num] = 0
will be taken.
-M [num]: for multi-core processors, use [num] processors
SACHICA supports multi-core CPU computation, using pthread. To use
the multi-core, you have to compile with "make multicore". The
names of executable files are the same. By giving this option,
SACHICA makes [num] threads. [num] can be more than the number
of cores in the CPU, but the performance will be less.

(*1) for -d, -r, -c, -n, -3 and -o options, when [num] is -1, the value will be set to the first string in the second file if -2 option is given, and the or the half of strings (except for -d). If [num] is -2, [num] will be changed to the ID of top string generated by the duplication given by -d option. For -3 option, if [num] is -3, it will be changed to the ID of last string +1, which means all strings.

Output Format

According to the commands, the output of SACHICA is in several forms. In p command, every line of the output file corresponds to each similar pair. One line is composed of three numbers; first number and the second number are the positions of the substrings, and the third letter is the Hamming distance between the substrings. The order of the similar pairs is in some sense at random, thus not sorted. Two positions of a similar pair is ordered so that (1) smaller position is first, and larger position follows (2) if all the following three conditions hold, larger position is first, and smaller position follows; (a) mode is multi-string or segment, (b) the smaller position is 0 in its string, and (c) they are in the same group separated by -c option, if -c option is given.

If we give d command, each line will be composed of five numbers such that the first and the third numbers are string IDs of the strings the substrings are belonging to, and the second and the fourth numbers are positions of the substrings. The fifth number is the Hamming distance. If we give S command, each line corresponds to pair of strings (not substrings). The first and second numbers represent their string IDs, and the third letter represent the distance. If we give Z command, each line corresponds to a maximal similar pair, and the last number of each line is not the distance but the length of the maximal similar pair. With Z and d command, each line contains five numbers such that the last number is the length, too.

In the segment mode specified by "-y -1" option, each position is represented by a tuple of three numbers; string ID, position in the segment, and segment ID. string ID is printed only when d command is given. If S command is given, the segment ID is printed but position will not be printed.

In the case of r or m command, SACHICA outputs the mismatch tolerance or the number of similar strings to each substring. They are output to the output file so that each line corresponds to each position. A line is composed of two numbers, the first number is the position, and the second number is the value (mismatch tolerance or #similar strings). They are output in the increasing order of the positions. In default, all positions will be output (in m mode, positions having mismatch tolerance no greater than "#mismatches" will be output). By using -k option, we can give a threshold to the value, and only a part of positions will be output. By giving I command, each consecutive positions will be represented by an interval. In this form, each line corresponds to an interval, and composed of three numbers. The first and the second numbers are starting and ending position of the interval. When we use interleaves by l command or -I option, SACHICA reduces the number of substrings to be compared, and output the values only those compared.

When Z command is given, SACHICA extends the similar pair to a maximal similar pair of substrings [x..x+l] and [y..y+l]. In this case with r command, SACHICA increases the counter of each position from x to x+l-"length of substring"+1, and y to y+l-"length of substring"+1, so that the counter of each similar pair in the maximal similar pair will be increased.

When -h option is given, SACHICA writes the positions of first letters and the position of the last letter plus one of each position, to the given file. In the file, the i-th line is composed of these two numbers of the (i-1)-th string.

Example of Execution

Suppose that input file "test.txt" and execution command are the following.

-----
ABCDE
XYCZE
X

[EOF]

%sachica p test.txt 3 1 out

Here [EOF] means the end of the input file. The substrings to be compared and their positions are then

0:ABC, 1:BCD, 2:CDE, 3:DE#, 4:E#X, 5:#XY, 6:XYC, 7:YCZ, 8:CZE, 9:ZE#, 10:E#X, 11:#X#, 12:X##

Here '#' is the newline character (we consider the case that the newline code is composed of one character, such as '\n' or '\r'). The output file will be then

------
2,8,1
3,9,1
4,10,0
5,11,1
[EOF]

By giving Z command, SACHICA extends each similar pair, thus

%sachica pZ test.txt 3 1 out

by executing this, we have the following output file.

------
2,8,6
[EOF]

here the last number 6 represents the length of the maximal similar pair. By attaching s command, we can print the substrings themselves.

%sachica pZl test.txt 3 1 out

------
2,8,6 CZE
XY CZE
X

[EOF]

SACHICA prints newline characters in the substrings, thus each substring takes two or more lines. Actually, the printed substrings are CZE#XY and CZE#X#. If we give i command, then SACHICA ignores the newline characters, thus the substrings to be compared and their positions will be

%sachica pis test.txt 3 1 out

0:ABC, 1:BCD, 2:CDE, 3:DEX, 4:EXY, 5:XYC, 6:YCZ, 7:CZE, 8:ZEX

The output will be

------
2,7,1 CDE CZE
3,8,1 DEX ZEX
[EOF]

When we give m command instead of p command, we can compute the mismatch tolerance, thus output will be the following.

%sachica mi test.txt 3 1 out

------
0:2
1:2
2:1
3:1
4:2
5:2
6:2
7:1
8:1
[EOF]

Here 2 means that the mismatch tolerance of the position is larger than one. When we give m command, SACHICA counts the number of similar substrings for each substring, thus output will be the following.

%sachica ri test.txt 3 1 out

------
0:0
1:0
2:1
3:1
4:0
5:0
6:0
7:1
8:1
[EOF]

By giving -k option, SACHICA outputs only positions with values no greater than the threshold.

%sachica ri -k 0 test.txt 3 1 out

------
0:0
1:0
4:0
5:0
6:0
[EOF]

If we give I command, the consecutive positions small values will be represented by their start and the end.

%sachica rIi -k 0 test.txt 3 1 out

------
0,1
4,6
[EOF]

By giving M command, positions with values no less than the threshold will be output. Thus, the output will be,

%sachica rIiM -k 1 test.txt 3 1 out

------
2,3
7,8
[EOF]

Advanced Usage

==== 1. Comparison of Very Long Strings ====

The current version of SACHICA can input files of at most about 4GB. When the input file is larger than this, or has memory limitation, we can not input the file. In such case, we can approach by dividing the file into subfiles. Let f1,...,fn be the files obtained by dividing the original file so that each two of fi and fj can be input by SACHICA. We denote the positions of the first letter of file fi in the original file by [pi]. We can get the length of each fi by -h option.

By executing

%sachica pi -b [pi] fi [length] [#mismatch] [file]

for each i, and

%sachica pic -b [pi] -B [pj] -c -1 -2 fj fi [length] [#mismatch] [file]

for each i and j such that i!=j, we can get the result for the original file. For r and m command, we have to use -C option, to read sc file.

For each i, we first execute

%sachica ri -b [pi] fi [length] [#mismatch] __tmp__

where __tmp__ is a temporary file, and for each j!=i, execute

%sachica ric -b [pi] -c -1 -C __tmp__ -2 fj fi [length] [#mismatch] __tmp2__
%mv __tmp2__ __tmp__

mv command is for renaming the output file to give the next execution.

==== 2. Comparison of Strings (not substrings) ====

Suppose that we have a set SS of strings S1,...,Sn of the same length l, and want to compare them in the term of Hamming distance. In this case, there are two ways. The first way is use multi string mode. Make a file by concatenating S1 to Sn, with separating them with two consecutive newline codes, such as

--------
ABCDEFGH

ABCFEGTH

ABFFBFCG

ABCDFFGH

...

ABCGTHGR
[EOF]

and execute SACHICA with the following option.

%sachica picS textfile 8 2 out

where textfile is the name of the generated file, 8 is the length of each string, and 2 is the threshold for the Hamming distance. We then have the output file "out" in which each line is a combination of string IDs and the Hamming distance.

The second way is to use "segment mode". Make file by concatenating all strings with separating "one" newline, and execute SACHICA as follows.

--------
ABCDEFGH
ABCFEGTH
ABFFBFCG
ABCDFFGH
...
ABCGTHGR
[EOF]

%sachica piS -3 -3 -y -1 textfile 8 2 out

Note that if d command is given, the string ID is output instead of the segment ID (segment ID is the line number in a string, thus in this case it corresponds to the original string ID). Since the file is composed of one string, string ID is always one.

==== 3. Comparing Genome Sequences ====

A genome sequence is composed of five kinds of letters ATGC and wildcard N. SACHICA supports genome sequences thus by giving G (and J) command we can deal with genome sequences. For finding similar string similar to complement sequences, we can use r option. Suppose that we are given two genome sequences G1 and G2, and want to find similar pairs from G1 and G2 with considering complement sequences. The execution would then be as follows.

%sachica piGcd -r -2 -d -1 -c -1 -2 G2 G1 30 2 out

where 30 and 2 are length and #mismatches. By "-d -1", G2 is duplicated, and by "-r -2", the duplicated string is reversed, and becomes the complement of G2. The output pairs has string IDs, and if position p of string 1 is followed by the position q of string ID 2, substring starting at p of G1 matches the substring of G2 starting at position q. If the string ID of the latter is 3, the substring starting at p matches the complement of the substring of G2 starting at |G2|-q-30.

==== 4. Approximate Matching of Many Short Strings ====

Suppose that we have many short strings (fragments, denoted by F = {F1,...,Fn}), and a long string R (called reference), and for each fragment Fi, find all positions p such that the substring of R starting at position p has Hamming distance at most d to Fi. Intuitively, the task is to find all positions approximately matching to Fi.
First, make a file (ex. frg.txt) by concatenating F1 to Fn with
double newlines, and execute as follows.

%sachica pidO -2 ref.txt frg.txt 30 2 out

The output file consists of lines such as

1,0,13,23,1

This means that the fragment 1 matches the position 23 of reference string with Hamming distance 1. Since we are giving O command, we can reduce the "length of substring" (denoted by l) shorter than fragments. It also enables to deal with fragments of different lengths at once. Thus, l has to be no more than the length l' of the shortest fragment. In addition, if l is shorter than l', we can use interleave. In the case that the distance measure is CIH (which is default), we can set the interleave length to l'-l+1; we can give option "-I (l'-l+1)". Recall that maximal string pair with CIH is obtained by iteratively extend the similar pair until encountering a non-similar pair. Thus, it is enough to find at least one similar pair from the pair of a fragment and a substring of reference string.

By H/E command, we can switch the distance measure to Hamming/edit distance. In this case, the distance threshold is given by -e/-E option. In both cases, they give upper bounds on the distance of overlapping substrings. In these cases, interleave does not give any accuracy; it may fail to find some overlaps. However, in some case, we can avoid the fails. For example, when the Hamming distance of overlapping substrings is bounded by h, by finding similar pairs with "#mismatch"=h, we will have no fail. If l' is no less than h*l, and the error ratio threshold given by -E is r, by setting "#mismatch" to (h+1)l * r / h and giving l command, we will have no fail.

In the case of genome sequence with
considering complement of the reference, execute

%sachica piGdO -2 -d -1 -r -2 ref.txt frg.txt 30 2 out

If the ID of the substring is the ID of reference string plus one, the fragment matches to the complement.
If the length of fragments are all the same, we can use segment mode,
with constructing the input file such that each line is a string. The execution is

%sachica pincO -c -1 -3 -3 -y -1 -2 ref.txt frg.txt 30 2 out

In this case, the first number of each line is always 0, the second number is segment ID, and the third number is the position of a substring of the reference string (actually, the first number is the position in the segment, thus 0). The fourth number is the distance.

==== 5. Overlap Detection ====

Two strings A and B are said to be overlapping if the suffix (or prefix) of A is similar to the prefix (or suffix) of B. Suppose that we have a set of strings S = {S1,...,Sn}, and detect the overlapping pairs Si and Sj such that a prefix of Si and a suffix of Sj (they are the same length) have short distance. For the purpose, construct a file frg.txt by concatenating all strings with separating consecutive two newline codes, and execute SACHICA as follows.

%sachica picdO -l 40 frg.txt 30 2 out

The number given by -l option is the minimum length of the overlap. The substring length has to be no less than this number. As same as approximate matching, we can change the distance measure, and use interleave mode to reduce the computation time. The usage is the same as the approximate matching. We can also use segment mode, if the length of all strings are the same. The command execution would be

%sachica pincO -3 -3 -y -1 -l 40 frg.txt 30 2 out

In the case of genome sequence, we may want to consider the complement sequences. The command execution would be

%sachica pGncO -c -1 -3 -3 -y -1 -d -1 -r -2 -o -1 -2 frg.txt 30 2 out

Note that each overlap is output twice if a string S1 and the complement of another S2 overlap; S1 and the complement of S2, and the complement of S1 and S2. However, by -o option, the pair of complement and complement will not be output. If one wants to output all, remove -o option.

=== 6. Search for Similar Substring ====

Especially in genome science, people want to find partially similar substring S' of S from a long string R. More precisely, for short string S and long string R, the task is to find all pairs of substring S' of S and R' of R which are similar. The similarity measure can be either CIH, Hamming distance, or edit distance. The command would be

%sachica piGcZl -S 'N' -c -1 -l [min-len] -E [err_ratio] -2 [file2]
[file1] [len] [mis] [outputfile]

for genome sequence, where [min-len] is the minimum length of strings to be output, [err_ratio] is the threshold for the error ratio, for the distance similarity, and [len] [mis] are the parameters for finding the seeds from that the similar string pairs are extended. To use Hamming/edit distance, add H/E command to the first parameter.

Performance

SACHICA is designed for large scale string data with not so many similar pairs. Basically the computation time is linear in (input size)+(#similar pairs). Usually, #similar pairs increases mildly but quadratically, the computation time also increases proportional to it. However, under that #similar pairs is not so large compared to input size, say 100 times larger than the input size, the computation time increases almost linearly. This is a great advantage compared to other algorithms since usually they take quadratic time for some datasets even if the #similar pairs is small.

The number of blocks and chunks affect the computation time. it is automatically determined but sometimes loses efficiency by bad estimation. On the other hand, usually, setting #blocks to #mismatch + 2, and set #chunks so that two blocks consist sufficiently many letters so that the expected number of members in each group classified by the two blocks includes less than one member on average.

The memory usage of SACHICA is linear in the input size, and it does not depend on the output size, i.e., #similar pairs. Basically it takes 1 byte (2 or 4 bytes in sachica16 and sachica32) for each letter, 1 byte for each letter with r/m command or -k option, 24 bytes for each string on average, and 4bytes for each position to be compared. Intuitively, the memory usage is 6 byte for each letter without interleave mode, and 2 or 3 bytes in interleave mode.

Acknowledgments

We gratefully thank to Yasuhiro Ike of his comments for parallelization and coding. We also thank to Go Yamamoto of Fujitsu SSL for his help to this project. We would like to thank Dr. Koji Tsuda and Kana Shimizu of CBRC, Advanced Industrial Science and Technology for their helpful advices and bug reports.

References

- Takeaki Uno, Multi-sorting Algorithm for Finding Pairs of Similar Short Substrings from Large Scale String Data, Knowledge and Information Systems,
(DOI: 10.1007/s10115-009-0271-6), pp., 2009.

- Takeaki Uno, An Efficient Algorithm for Finding Similar Short
Substrings from Large Scale String Data, Lecture Notes in Computer Science 5012, pp.345-356, 2008.

- Takeaki Uno, An Efficient Algorithm for Computing Mismatch Tolerance
of Substrings [in Japanese], IEICE technical report. Theoretical foundations of Computing, 103(622) pp.55-60, 2004.