SHEAP: Similarity/Homology Efficient Analyze Procedure

Coded by Takeaki Uno, e-mail:uno@nii.jp,
homepage: http://research.nii.ac.jp/~uno/index.html


This program is available for only academic use, basically. Anyone can modify this program, but he/she has to write down the change of the modification on the top of the source code. Neither contact nor appointment to Takeaki Uno is needed. If one wants to re-distribute this code, do not forget to refer the newest code, and show the link to homepage of Takeaki Uno, to notify the news about codes for the users. For the commercial use, please make a contact to Takeaki Uno.


Function
Usage
Box-filtering
PseudoDiagonalLineExtraction
Superimposetheimages
BatchFiles
Acknowledgments
References


Function

The program SHEAP inputs a text file and compare it with itself, or input two text files and compare them, considering one character (one byte) as one letter, and show the result by an image. Actually, SHEAP accepts non-text file, such as binary file, thus can input even image files but of course SHEAP compares them as binary files (byte sequences). SHEAP takes takes substrings from all positions of the input text file(s), and finds pairs of short substrings which are similar (within a given Hamming distance).

The output image, called dot plot, shows the result of comparison such that the x, and y axis correspond to the first and second strings, so each pixel corresponds to positions of the first and second strings, i.e., if the lengths of first and the second files are 10,000 and 20,000, respectively, and the the horizontal and vertical pixel sizes are both 1,000, the pixel (x,y) corresponds to the positions x*10+1 to x*10+10 of the first file, and positions y*20 to y*20+20 of the second file.

The intensity of each pixel shows how many similar pairs are in the pixel, i.e., if the intensity of pixel (x,y) is 4, it means that there are four pairs of substrings such that their start position is from x*10+1 to x*10+10 of the first file or y*20+1 to y*20+20 of the second file. When SHEAP is given only one file, it compares the file to itself, i.e., the second file is equal to the first file.

One more function of SHEAP is to show how many similar substrings are in the input file, for substring taken from each position of the first file. That is, for the substring taken from each position of the first file, SHEAP counts the substrings in the first (and the second) file that are similar to it. The result is shown by a picture. We call this function repeat detection. In the picture, the first file is decomposed, and decomposed fragment is represented by a horizontal line. The picture is vertically ordered horizontal lines, thus is a stripe, and the the first fragment corresponds to the top horizontal line, and the second to the last fragments correspond to the following horizontal lines. The intensity of each pixel indicates how many strings are there in the first (and the second) file, that are similar to the substrings starting at the corresponding positions in the first file.

SHEAP uses SACHICA as a subroutine, and some parameters and functions are deeply depending on the system of SACHICA. Refer the readme.txt of sachica to see the technical details and the definition of similar pairs, continuous interval Hamming distance (CIH), etc.


Usage

==== How to Compile ====

Unzip the file into arbitrary directory, and execute "make". You then can see "sheap" (or sheap.exe), and other additional executable files which are components of SHEAP, in the same directory.

==== Command Line Options ====

To execute SHEAP, just type sheap and give some parameters as follows.

% sheap [command] [options] input-filename length #mismatch [output-filename]

"%" is not needed to type. It is a symbol to represent command line. To see a simple explanation, just execute "sheap" without parameters. An example is

%sheap GiB -2 test2.txt test.txt 30 2 out

"input-filename" is the filename of the input file. The 1st letter of input-filename must not be '-'. Otherwise it is regarded as an option. The input file format is written below. SHEAP compares the files by finding all pairs of substrings of the files with short Hamming distance. "length" and "#mismatch" specify the length of the substrings to be found, and the threshold for Hamming distance; SHEAP will find pairs of substrings of length "length" with Hamming distance at most "#mismatch".

The first parameter [command] is composed of some letters, each of them corresponds to a command. This is given to the program to indicate the task and some other execution modes.

   t:ASCII text comparison mode
     this specifies a simple mode; read the file simply with considering
     one byte as one letter.
   G:genome sequence comparison mode
     This is the mode of genome sequence mode (fasta file mode). In
     precise, any letter aAtTgGcC will be ATGC, nNxX will be N, which
     acts as a wildcard, and other characters will be ignored. Newline
     codes are ignored (same as i command). Further, a line starting
     with '>' is regarded as a comment line, and will be ignored.
   i:ignore newline code
     skip the newline codes in the file.
   B:output in bmp (bitmap) format
     output a bitmap file. Default format is pgm.
   N:negative image
     output negative image; in usual, black means non-similar, and white
     means similar. This command switch the color to the negative.
   R:repeat string detection (maximal similar string mode)
     It finds maximal substring pairs in the sense of CIH; any
     substrings of length "length" starting at the same position have
     Hamming distance at most "#mismatch". It then accumulates the number
     of similar pairs for each position, and the intensity of a pixel is
     given by the maximum number among the numbers of corresponding
     positions
   X:do not execute box-filtering
     It forces not to execute box-filtering. Box filtering is explained
     below.
   x:put a representative for each box, for box-filtering
     In box-filtering, each consecutive pairs more than the threshold in
     a diagonal box will be replaced by a representative pair.

Following to the first parameter (commands), we can give several options described below. Basically any option accompanies one or more values, and they are in the form "-? [num]" where ? is a letter. We can give no option.

   -2 [filename]:compare 2 files (input-file and [filename])
     SHEAP inputs one file. This option enables to read the second file,
     and compare two files.
   -Z [num]:emphasize each dot to size of [num]
     replace each dot in the picture by a rectangle of [num] by [num].
     If several dots will overlap, the intensity of the maximum will be
     selected.
   -t [num]:change the drawing mode to threshold drawing mode, in which
     each pixel will be white when it includes at least [num] pairs,
     i.e., put a dot when [num] pairs are in a pixel.
   -h [num]:set the number of horizontal pixels to [num].
   -p [num]:set the pixel size to [num], i.e., each pixel corresponds to
     [num] letters.
   -d [real_num]:brightness for dot plot
     the default brightness is 1.00, and when there are 255 pairs in a
     pixel, the pixel will be white.
   -l [num]:specify gap between two horizontal lines
     The option gives the gap between each two horizontal lines. The gap
     will be [num] pixels, thus a line follows with [num] pixels the
     previous horizontal line.
   -T [num]:specify threshold num for box-filtering
     give a threshold to box-filtering. Intuitively, only if that are
     [num] pairs in a diagonal box, the pairs remains, and pairs satisfy
     the condition for no diagonal box will be deleted.
   -X [num]:specify the factor of box for filtering
     The box-filtering uses a diagonal box. The default size is
     100*"length" by 10*"length". This option changes this factor to
     [num] and [num]/10. Intuitively, giving smaller number decreases
     the brightness of the picture.
   -r [num]: specify the length of the small box, for the box filtering
     this option specifies the length of the small box representing
     concentrated pairs, for the box for the box filtering. When the
     many pairs are in the box, but all they are contained in a small
     box, the pairs will not be deleted. The size will be given by the
     ratio, and is set to [num]*[length of the box]. The default value
     is the same as the width of the box.
   -H [num]:extract pseudo diagonal lines of length at least [num]
     Extract a pseudo diagonal lines from the picture of the comparison,
     and erase the dots not includes in pseudo diagonal lines.
   -W [dir]:specify working directory. All working files created during
     the execution are created in the directory, and will be removed
     after the execution of the program.
    


Box-filtering

The purpose of the use of SHEAP is actually to look at the partial similarities of a file, or two files. In the picture drawn by SHEAP, a diagonal line shows a continuous and long similarity. However, in the case that we compare very long files, one pixel corresponds to a long interval so that it naturally has strong intensity because of many short similar substrings. The box-filtering used to remove such short similar substrings which are not a part of long similarity. The function is implemented independently as sachicax.

Let us consider a diagonal box of length l and width w. For example, the four corner points of a diagonal box are (100,100), (100,150), (600,600), and (600,650). In this case, the length is 500, and the width is 50. A pair of substrings corresponds to a point (x,y) when their starting positions are x and y, respectively. We here consider that only when t several pairs are in a diagonal box of a certain size, they form a part of diagonal line.

The program SACHICA enumerates all pairs of similar substrings, and outputs the pairs to the file. Sachicax inputs the file and removes all the pairs not included in any diagonal box with t-1 pairs. For the fast computation, for given l, w, and t, sachicax considers diagonal boxes of width 2w, and whose left-upper corner is (x,y), (x-y +M) mod w = c where M is a sufficiently large number and c is a common fixed constant. It means that for a fixed x, the left-upper corners of diagonal boxes can locate at skipping y-positions, such as (x,c), (x,c+w), (x,c+2w),... It enables us to execute box-filtering in almost linear time; first sort all the pairs in the order of x-y, and for each k, collect all pairs (x,y) satisfying (k-1)w+c <= x-y < (k+1)w+c. We then sort them in the x-coordinates and scan the sorted list to check the box-filtering condition.

The box filtering itself can be executed by sachicax, with

     %sachicax input-file length threshold-num width output-file

where the input file is composed of lines each of them is composed of two positions of one pair. Sachicax outputs all the pairs satisfying the condition (slightly softer than box-filtering) to the output file. The options of sachicax is as follows.

   -X: output representative of pairs in the same area
     During the scan of sorted list, when sachicax finds a box including
     [threshold-num] pairs, sachicax erases all the pairs in the box and
     alternatively output a pair corresponds to the left-upper corner
     point to the file.
   -Z: ignore pairs in area of size width
     by this option, sachicax considers an additional constraint for
     box filtering. The constraint is that the horizontal distance between
     the leftmost pair and the rightmost pair in a diagonal box has to be
     no less than "width". This means that even if a diagonal box includes
     many pairs, they will be erased if they are concentrated in a small
     area (of length "width").
   -r [num]: specify the length of the small box
     this option specifies the length of the small box representing
     concentrated pairs, for the box for the box filtering. When the
     many pairs are in the box, but all they are contained in a small
     box, the pairs will not be deleted.
     The default value is the same as the width of the box.
     Different to sheap, it is given by the absolute value, not ratio.
   -n [num]: give the number of pair\n");
     sachicax automatically scans the file in the initialization to
     get the number of pairs in the file, thereby it scans the input
     file twice. Instead of that, by giving the number by this option,
     we can omit one scan to fast computation.


Pseudo Diagonal Line Extraction

Sachicax erase the similar pairs that are not a part of a long similarity. SHEAP can detect longer similarity by detecting pseudo diagonal lines. A pseudo diagonal line is a sequence pixels (p1=(x1,y1),...,pn=(xn,yn)) such that each pi+1 is locating right-down direction from pi, and pi+1 is not distant from pi. Here they are not distant when (xi+1 - xi, yi+1 -yi) is one of the following. (1,0),(0,1),(2,0),(1,1),(0,2),(1,2),(2,1),(3,1),(2,2),(1,3). Sachicam solves a dynamic programing to find pixels. Sachicam is actually an independent implementation for drawing the dot plot picture. It inputs the file composed of the positions of similar pairs, and output an image files and additional file including additional information. The execution command is

     %sachicam [options] input-file x-size y-size output-file

where x-size and y-size indicate the horizontal size and the vertical size of the output image. The following options can be given to the command.

   -p: output bmp file
     same as B command of SHEAP.
   -N: negative image
   -Z [num]: emphasize each dot to size [num]
   -t [threshold]: monotone with [threshold]
     same as SHEAP.
   -k [real-num]: gray scale with density [real-num]
     same as -d option of SHEAP.
   -M: superimpose the diagonal flipped image
     for pairs (x,y), count (y,x) as well. This is equivalent to
     mirroring the image on the central diagonal line.
   -H [position file]: read [position file]
     read the length of the file (string length) from the position file.
     In the case that the original file includes several strings, we can
     specify the beginning and ending positions of each string. The file
     format is that the ith line is composed of the starting position
     and the ending position+1 of the string, separated by any
     non-numeric character.
   -h [position file]: draw grid lines according to [position file]
     draw grid line at the cells corresponding to end-positions of the
     strings.
   -c [num]: comparison between 1st to [num]-1th, and the rest
     show the comparison so that horizontal line corresponds to the
     first to the [num]-1th strings, and vertical line corresponds to
     the [num]th to the last strings. This option needs to read position
     file by -h option.
   -r [filename]: overlap x-coordinate reversed string in [filename]
     superimpose the image drawn by the [filename] with flipping the
     image in horizontal direction.
   -R [filename]: overlap y-coordinate reversed string in [filename]
     superimpose the image drawn by the [filename] with flipping the
     image in vertical direction.
   -L [column num] [num]: repeat detection mode with line gap [num]
     repeat detection mode. The [column num]th column of each line is
     considered to count. If [column num] is -1, consider both 1st and
     2nd columns to count. The gap between two horizontal lines will be
     set to [num]
   -l [num]: repeat detection mode with inputting interval format
     with gap [num]
     repeat detection mode. It inputs interval representation, in which
     each line represents the start and end positions of substring,
     similar to other substring of the same length.
   -A [num]: read the file as position-score file (has to be with -l)
     read a file such that each line represents the position and its
     score. The intensity of each cell is the maximum score among the
     scores of the corresponding positions.
   -C [len] [mis] [filename]: detect dense intervals
     in the repeat detection mode, detect the dense intervals from the
     picture of repeat detection, and output the start and end positions
     of the intervals to the file [filename]. The dense interval is
     similar to CIH, all its subintervals of length [len] has at most
     [mis] dots of intensity less than the threshold.
   -d [filename]: detect pseudo diagonal lines and write to [filename]
     detect pseudo diagonal lines and erase points not in any pseudo
     diagonal line. Approximate start and end positions of each pseudo
     line are written to a line of the file.
   -e [filename]: detect pseudo diagonal lines (reverse direction of x)
     same as -d, but flip the image in x-direction.
   -E [filename]: detect pseudo diagonal lines (reverse direction of y)
     same as -d, but flip the image in y-direction.
   -T [threshold-num]: threshold for length of x/y directions
     threshold for pseudo diagonal line. When the start pixel (x,y) and
     end pixel (x',y') of a pseudo line do not satisfy either
     x'-x < [threshold-num] or y'-y < [threshold-num], the pseudo line
     is not considered as a pseudo diagonal line, and the pixels
     included only in such pseudo diagonal lines will be erased.
   -S:long skip mode for pseudo diagonal line detection
     allows more gap for each consecutive pixels, in pseudo diagonal
     lines. Intuitively, more sparse diagonal lines will be allowed.
     The increase of the gap is approximately twice.
   -b,-B [ratio]: output pseudo diagonal lines whose one (or two if -B)
     end is closed to boundary with [ratio]
     this option has to be specified with -d, -e or -E options. In the
     case of -b, a diagonal pseudo line is output only when one of its
     end is close to the border of the image. The distance is evaluated
     by the number of pixels to the nearest end, and the threshold is
     given by "pixel size" * [ratio]. In the case of -B option, both
     ends have to be closed to the (not necessary to the same) end of
     the image.


Superimpose the images

SHEAP merges several images to superimpose the results. unionpgm is an implementation for this task; input two image files and superimpose one to the other, and output an image file. The usage of union pgm is

     %unionpgm [options] input-filename1 input-filename2 output-filename

The options are as follows.

   -r: reverse 2nd file in x direction
     flip the second file in horizontal direction.
   -R: reverse 2nd file in y direction
     flip the second file in vertical direction.
   -C: 1st file is colored (ppm file)
     give a color image as the first image. The file format has to be
     ppm, and the output image is also ppm format.
   -c [num]: use [num]th color for second file
     replace the white dots in the second file by colored dot by color of
     color-ID c. The output file will be color image and the file format
     is ppm.
   -p: make bmp file
     output the image file as bmp file format.


Batch Files

SHEAP accompanies several useful batch files which enables to compare many files at once. The parameters for the comparison can be set by modifying the batch file; the functions of each parameter are almost the same to SHEAP, sachicax, sachicam, and unionpgm, and also described in the batch files. They are designed for genome sequences, thus consider the reverse strings. For usual text files, we can use the batches by eliminating the lines concerned with the reverse files.

%bat_comp1 output-filename input-filename(s)
     concatenate the input files and compare the file to itself.

%bat_comp2 output-filename input-filename1 input-filename(s)
     compare the first input file and the concatenation of the rest files.

%bat_comp_1toall output-filename source-filename filename...
     compare source and the other one-by-one, and superimpose the image
     with coloring each in distinct color.

%bat_comp_pw output-filename 1st-directory-name 2nd-directory-name
     compare each file in the first directly and each file in the second
     file in pairwise. Actually, the first/second files will be
     "1st-directory-name"* and "2nd-directory-name"*, thus we can specify
     prefixes of first and second files.


Acknowledgments

We gratefully thank to Yasuhiro Ike of his comments for parallelization and coding. We also thank to Go Yamamoto of Fujitsu SSL for his help to this project. We would like to thank Dr. Koji Tsuda and Kana Shimizu of CBRC, Advanced Industrial Science and Technology for their helpful advices and bug reports.


References

- Takeaki Uno, Multi-sorting Algorithm for Finding Pairs of Similar Short Substrings from Large Scale String Data, Knowledge and Information Systems,
(DOI: 10.1007/s10115-009-0271-6), pp., 2009.

- Takeaki Uno, An Efficient Algorithm for Finding Similar Short
Substrings from Large Scale String Data, Lecture Notes in Computer Science 5012, pp.345-356, 2008.

- Takeaki Uno, An Efficient Algorithm for Computing Mismatch Tolerance
of Substrings [in Japanese], IEICE technical report. Theoretical foundations of Computing, 103(622) pp.55-60, 2004.