SHEAP: Similarity/Homology Efficient Analyze Procedure
Coded by Takeaki Uno, e-mail:uno@nii.jp,
homepage: http://research.nii.ac.jp/~uno/index.html
This program is available for only academic use, basically. Anyone can modify this program, but he/she has to write down the change of the modification on the top of the source code. Neither contact nor appointment to Takeaki Uno is needed. If one wants to re-distribute this code, do not forget to refer the newest code, and show the link to homepage of Takeaki Uno, to notify the news about codes for the users. For the commercial use, please make a contact to Takeaki Uno.
The program SHEAP inputs a text file and compare it with itself, or input two text files and compare them, considering one character (one byte) as one letter, and show the result by an image. Actually, SHEAP accepts non-text file, such as binary file, thus can input even image files but of course SHEAP compares them as binary files (byte sequences). SHEAP takes takes substrings from all positions of the input text file(s), and finds pairs of short substrings which are similar (within a given Hamming distance).
The output image, called dot plot, shows the result of comparison such that the x, and y axis correspond to the first and second strings, so each pixel corresponds to positions of the first and second strings, i.e., if the lengths of first and the second files are 10,000 and 20,000, respectively, and the the horizontal and vertical pixel sizes are both 1,000, the pixel (x,y) corresponds to the positions x*10+1 to x*10+10 of the first file, and positions y*20 to y*20+20 of the second file.
The intensity of each pixel shows how many similar pairs are in the pixel, i.e., if the intensity of pixel (x,y) is 4, it means that there are four pairs of substrings such that their start position is from x*10+1 to x*10+10 of the first file or y*20+1 to y*20+20 of the second file. When SHEAP is given only one file, it compares the file to itself, i.e., the second file is equal to the first file.
One more function of SHEAP is to show how many similar substrings are in the input file, for substring taken from each position of the first file. That is, for the substring taken from each position of the first file, SHEAP counts the substrings in the first (and the second) file that are similar to it. The result is shown by a picture. We call this function repeat detection. In the picture, the first file is decomposed, and decomposed fragment is represented by a horizontal line. The picture is vertically ordered horizontal lines, thus is a stripe, and the the first fragment corresponds to the top horizontal line, and the second to the last fragments correspond to the following horizontal lines. The intensity of each pixel indicates how many strings are there in the first (and the second) file, that are similar to the substrings starting at the corresponding positions in the first file.
SHEAP uses SACHICA as a subroutine, and some parameters and functions are deeply depending on the system of SACHICA. Refer the readme.txt of sachica to see the technical details and the definition of similar pairs, continuous interval Hamming distance (CIH), etc.
==== How to Compile ====
Unzip the file into arbitrary directory, and execute "make". You then can see "sheap" (or sheap.exe), and other additional executable files which are components of SHEAP, in the same directory.
==== Command Line Options ====
To execute SHEAP, just type sheap and give some parameters as follows.
% sheap [command] [options] input-filename length #mismatch [output-filename]
"%" is not needed to type. It is a symbol to represent command line. To see a simple explanation, just execute "sheap" without parameters. An example is
%sheap GiB -2 test2.txt test.txt 30 2 out
"input-filename" is the filename of the input file. The 1st letter of input-filename must not be '-'. Otherwise it is regarded as an option. The input file format is written below. SHEAP compares the files by finding all pairs of substrings of the files with short Hamming distance. "length" and "#mismatch" specify the length of the substrings to be found, and the threshold for Hamming distance; SHEAP will find pairs of substrings of length "length" with Hamming distance at most "#mismatch".
The first parameter [command] is composed of some letters, each of them corresponds to a command. This is given to the program to indicate the task and some other execution modes.
t:ASCII text comparison mode
this specifies a simple mode; read the file simply with considering
one byte as one letter.
G:genome sequence comparison mode
This is the mode of genome sequence mode (fasta file mode). In
precise, any letter aAtTgGcC will be ATGC, nNxX will be N, which
acts as a wildcard, and other characters will be ignored. Newline
codes are ignored (same as i command). Further, a line starting
with '>' is regarded as a comment line, and will be ignored.
i:ignore newline code
skip the newline codes in the file.
B:output in bmp (bitmap) format
output a bitmap file. Default format is pgm.
N:negative image
output negative image; in usual, black means non-similar, and white
means similar. This command switch the color to the negative.
R:repeat string detection (maximal similar string mode)
It finds maximal substring pairs in the sense of CIH; any
substrings of length "length" starting at the same position have
Hamming distance at most "#mismatch". It then accumulates the number
of similar pairs for each position, and the intensity of a pixel is
given by the maximum number among the numbers of corresponding
positions
X:do not execute box-filtering
It forces not to execute box-filtering. Box filtering is explained
below.
x:put a representative for each box, for box-filtering
In box-filtering, each consecutive pairs more than the threshold in
a diagonal box will be replaced by a representative pair.
Following to the first parameter (commands), we can give several options described below. Basically any option accompanies one or more values, and they are in the form "-? [num]" where ? is a letter. We can give no option.
-2 [filename]:compare 2 files (input-file and [filename])
SHEAP inputs one file. This option enables to read the second file,
and compare two files.
-Z [num]:emphasize each dot to size of [num]
replace each dot in the picture by a rectangle of [num] by [num].
If several dots will overlap, the intensity of the maximum will be
selected.
-t [num]:change the drawing mode to threshold drawing mode, in which
each pixel will be white when it includes at least [num] pairs,
i.e., put a dot when [num] pairs are in a pixel.
-h [num]:set the number of horizontal pixels to [num].
-p [num]:set the pixel size to [num], i.e., each pixel corresponds to
[num] letters.
-d [real_num]:brightness for dot plot
the default brightness is 1.00, and when there are 255 pairs in a
pixel, the pixel will be white.
-l [num]:specify gap between two horizontal lines
The option gives the gap between each two horizontal lines. The gap
will be [num] pixels, thus a line follows with [num] pixels the
previous horizontal line.
-T [num]:specify threshold num for box-filtering
give a threshold to box-filtering. Intuitively, only if that are
[num] pairs in a diagonal box, the pairs remains, and pairs satisfy
the condition for no diagonal box will be deleted.
-X [num]:specify the factor of box for filtering
The box-filtering uses a diagonal box. The default size is
100*"length" by 10*"length". This option changes this factor to
[num] and [num]/10. Intuitively, giving smaller number decreases
the brightness of the picture.
-r [num]: specify the length of the small box, for the box filtering
this option specifies the length of the small box representing
concentrated pairs, for the box for the box filtering. When the
many pairs are in the box, but all they are contained in a small
box, the pairs will not be deleted. The size will be given by the
ratio, and is set to [num]*[length of the box]. The default value
is the same as the width of the box.
-H [num]:extract pseudo diagonal lines of length at least [num]
Extract a pseudo diagonal lines from the picture of the comparison,
and erase the dots not includes in pseudo diagonal lines.
-W [dir]:specify working directory. All working files created during
the execution are created in the directory, and will be removed
after the execution of the program.
The purpose of the use of SHEAP is actually to look at the partial similarities of a file, or two files. In the picture drawn by SHEAP, a diagonal line shows a continuous and long similarity. However, in the case that we compare very long files, one pixel corresponds to a long interval so that it naturally has strong intensity because of many short similar substrings. The box-filtering used to remove such short similar substrings which are not a part of long similarity. The function is implemented independently as sachicax.
Let us consider a diagonal box of length l and width w. For example, the four corner points of a diagonal box are (100,100), (100,150), (600,600), and (600,650). In this case, the length is 500, and the width is 50. A pair of substrings corresponds to a point (x,y) when their starting positions are x and y, respectively. We here consider that only when t several pairs are in a diagonal box of a certain size, they form a part of diagonal line.
The program SACHICA enumerates all pairs of similar substrings, and outputs the pairs to the file. Sachicax inputs the file and removes all the pairs not included in any diagonal box with t-1 pairs. For the fast computation, for given l, w, and t, sachicax considers diagonal boxes of width 2w, and whose left-upper corner is (x,y), (x-y +M) mod w = c where M is a sufficiently large number and c is a common fixed constant. It means that for a fixed x, the left-upper corners of diagonal boxes can locate at skipping y-positions, such as (x,c), (x,c+w), (x,c+2w),... It enables us to execute box-filtering in almost linear time; first sort all the pairs in the order of x-y, and for each k, collect all pairs (x,y) satisfying (k-1)w+c <= x-y < (k+1)w+c. We then sort them in the x-coordinates and scan the sorted list to check the box-filtering condition.
The box filtering itself can be executed by sachicax, with
%sachicax input-file length threshold-num width output-file
where the input file is composed of lines each of them is composed of two positions of one pair. Sachicax outputs all the pairs satisfying the condition (slightly softer than box-filtering) to the output file. The options of sachicax is as follows.
-X: output representative of pairs in the same area
During the scan of sorted list, when sachicax finds a box including
[threshold-num] pairs, sachicax erases all the pairs in the box and
alternatively output a pair corresponds to the left-upper corner
point to the file.
-Z: ignore pairs in area of size width
by this option, sachicax considers an additional constraint for
box filtering. The constraint is that the horizontal distance between
the leftmost pair and the rightmost pair in a diagonal box has to be
no less than "width". This means that even if a diagonal box includes
many pairs, they will be erased if they are concentrated in a small
area (of length "width").
-r [num]: specify the length of the small box
this option specifies the length of the small box representing
concentrated pairs, for the box for the box filtering. When the
many pairs are in the box, but all they are contained in a small
box, the pairs will not be deleted.
The default value is the same as the width of the box.
Different to sheap, it is given by the absolute value, not ratio.
-n [num]: give the number of pair\n");
sachicax automatically scans the file in the initialization to
get the number of pairs in the file, thereby it scans the input
file twice. Instead of that, by giving the number by this option,
we can omit one scan to fast computation.
Sachicax erase the similar pairs that are not a part of a long similarity. SHEAP can detect longer similarity by detecting pseudo diagonal lines. A pseudo diagonal line is a sequence pixels (p1=(x1,y1),...,pn=(xn,yn)) such that each pi+1 is locating right-down direction from pi, and pi+1 is not distant from pi. Here they are not distant when (xi+1 - xi, yi+1 -yi) is one of the following. (1,0),(0,1),(2,0),(1,1),(0,2),(1,2),(2,1),(3,1),(2,2),(1,3). Sachicam solves a dynamic programing to find pixels. Sachicam is actually an independent implementation for drawing the dot plot picture. It inputs the file composed of the positions of similar pairs, and output an image files and additional file including additional information. The execution command is
%sachicam [options] input-file x-size y-size output-file
where x-size and y-size indicate the horizontal size and the vertical size of the output image. The following options can be given to the command.
-p: output bmp file
same as B command of SHEAP.
-N: negative image
-Z [num]: emphasize each dot to size [num]
-t [threshold]: monotone with [threshold]
same as SHEAP.
-k [real-num]: gray scale with density [real-num]
same as -d option of SHEAP.
-M: superimpose the diagonal flipped image
for pairs (x,y), count (y,x) as well. This is equivalent to
mirroring the image on the central diagonal line.
-H [position file]: read [position file]
read the length of the file (string length) from the position file.
In the case that the original file includes several strings, we can
specify the beginning and ending positions of each string. The file
format is that the ith line is composed of the starting position
and the ending position+1 of the string, separated by any
non-numeric character.
-h [position file]: draw grid lines according to [position file]
draw grid line at the cells corresponding to end-positions of the
strings.
-c [num]: comparison between 1st to [num]-1th, and the rest
show the comparison so that horizontal line corresponds to the
first to the [num]-1th strings, and vertical line corresponds to
the [num]th to the last strings. This option needs to read position
file by -h option.
-r [filename]: overlap x-coordinate reversed string in [filename]
superimpose the image drawn by the [filename] with flipping the
image in horizontal direction.
-R [filename]: overlap y-coordinate reversed string in [filename]
superimpose the image drawn by the [filename] with flipping the
image in vertical direction.
-L [column num] [num]: repeat detection mode with line gap [num]
repeat detection mode. The [column num]th column of each line is
considered to count. If [column num] is -1, consider both 1st and
2nd columns to count. The gap between two horizontal lines will be
set to [num]
-l [num]: repeat detection mode with inputting interval format
with gap [num]
repeat detection mode. It inputs interval representation, in which
each line represents the start and end positions of substring,
similar to other substring of the same length.
-A [num]: read the file as position-score file (has to be with -l)
read a file such that each line represents the position and its
score. The intensity of each cell is the maximum score among the
scores of the corresponding positions.
-C [len] [mis] [filename]: detect dense intervals
in the repeat detection mode, detect the dense intervals from the
picture of repeat detection, and output the start and end positions
of the intervals to the file [filename]. The dense interval is
similar to CIH, all its subintervals of length [len] has at most
[mis] dots of intensity less than the threshold.
-d [filename]: detect pseudo diagonal lines and write to [filename]
detect pseudo diagonal lines and erase points not in any pseudo
diagonal line. Approximate start and end positions of each pseudo
line are written to a line of the file.
-e [filename]: detect pseudo diagonal lines (reverse direction of x)
same as -d, but flip the image in x-direction.
-E [filename]: detect pseudo diagonal lines (reverse direction of y)
same as -d, but flip the image in y-direction.
-T [threshold-num]: threshold for length of x/y directions
threshold for pseudo diagonal line. When the start pixel (x,y) and
end pixel (x',y') of a pseudo line do not satisfy either
x'-x < [threshold-num] or y'-y < [threshold-num], the pseudo line
is not considered as a pseudo diagonal line, and the pixels
included only in such pseudo diagonal lines will be erased.
-S:long skip mode for pseudo diagonal line detection
allows more gap for each consecutive pixels, in pseudo diagonal
lines. Intuitively, more sparse diagonal lines will be allowed.
The increase of the gap is approximately twice.
-b,-B [ratio]: output pseudo diagonal lines whose one (or two if -B)
end is closed to boundary with [ratio]
this option has to be specified with -d, -e or -E options. In the
case of -b, a diagonal pseudo line is output only when one of its
end is close to the border of the image. The distance is evaluated
by the number of pixels to the nearest end, and the threshold is
given by "pixel size" * [ratio]. In the case of -B option, both
ends have to be closed to the (not necessary to the same) end of
the image.
SHEAP merges several images to superimpose the results. unionpgm is an implementation for this task; input two image files and superimpose one to the other, and output an image file. The usage of union pgm is
%unionpgm [options] input-filename1 input-filename2 output-filename
The options are as follows.
-r: reverse 2nd file in x direction
flip the second file in horizontal direction.
-R: reverse 2nd file in y direction
flip the second file in vertical direction.
-C: 1st file is colored (ppm file)
give a color image as the first image. The file format has to be
ppm, and the output image is also ppm format.
-c [num]: use [num]th color for second file
replace the white dots in the second file by colored dot by color of
color-ID c. The output file will be color image and the file format
is ppm.
-p: make bmp file
output the image file as bmp file format.
SHEAP accompanies several useful batch files which enables to compare many files at once. The parameters for the comparison can be set by modifying the batch file; the functions of each parameter are almost the same to SHEAP, sachicax, sachicam, and unionpgm, and also described in the batch files. They are designed for genome sequences, thus consider the reverse strings. For usual text files, we can use the batches by eliminating the lines concerned with the reverse files.
%bat_comp1 output-filename input-filename(s)
concatenate the input files and compare the file to itself.
%bat_comp2 output-filename input-filename1 input-filename(s)
compare the first input file and the concatenation of the rest files.
%bat_comp_1toall output-filename source-filename filename...
compare source and the other one-by-one, and superimpose the image
with coloring each in distinct color.
%bat_comp_pw output-filename 1st-directory-name 2nd-directory-name
compare each file in the first directly and each file in the second
file in pairwise. Actually, the first/second files will be
"1st-directory-name"* and "2nd-directory-name"*, thus we can specify
prefixes of first and second files.
We gratefully thank to Yasuhiro Ike of his comments for parallelization and coding. We also thank to Go Yamamoto of Fujitsu SSL for his help to this project. We would like to thank Dr. Koji Tsuda and Kana Shimizu of CBRC, Advanced Industrial Science and Technology for their helpful advices and bug reports.
- Takeaki Uno, Multi-sorting Algorithm for Finding Pairs of Similar Short Substrings from Large Scale String Data, Knowledge and Information Systems,
(DOI: 10.1007/s10115-009-0271-6), pp., 2009.
- Takeaki Uno, An Efficient Algorithm for Finding Similar Short
Substrings from Large Scale String Data, Lecture Notes in Computer Science 5012, pp.345-356, 2008.
- Takeaki Uno, An Efficient Algorithm for Computing Mismatch Tolerance
of Substrings [in Japanese], IEICE technical report. Theoretical foundations of Computing, 103(622) pp.55-60, 2004.