|
A pairwise sequence alignment is a scheme of writing one sequence on top of another, where the residues in one position are deemed to have a common evolutionary origin. If the same letter occurs in both the sequences, then this position has been conserved in evolution. If the letters differ, it is assumed that the two derive from an ancestral letter, which could be one of the two or neither. Homologous sequences may have different lengths, though, which are generally explained through insertions or deletions in sequences. Thus, a letter or a stretch of letters may be paired with dashes in the other sequence to signify such an insertion or deletion. In such a simple evolutionary motivated scheme, an alignment mediates the definition of a distance for the two sequences. Generally, 0 is assigned for a match and some negative number for a mismatch. A distance function for two sequences can be defined by looking for the alignment that yields the minimum score. Using, dynamic programming, this minimization can be effected without explicitly enumerating all possible alignments of two sequences.
The idea of assigning a score to an alignment and then minimizing over all alignments is at the heart of all biological sequence alignment. However, many more considerations have influenced the definition of the scores and made sequence alignment applicable to a wide range of biological settings. First, on may either assign a distance or a similarity function to an alignment. The difference lies more in the interpretation of he values. A distance function will define negative values for mismatches or gaps and then aim at minimizing this distance. A similarity function will give high values to matches and low values to gaps and then maximize the resulting score. The basic structure of the algorithm is the same for both cases. Smith and Waterman showed that for global alignment, i.e. when a score is computed over the entire length of both sequences, the two concepts are in fact equivalent. Thus it is now customary to choose the setting that gives more freedom in appropriately modeling the biological setting than one is interested in. In the similarity framework, one can easily distinguish among the different possible mismatches and also among different kinds of matches. For example, a match between two tryptophanes is usually seen to be more important than a match between two alanines. For amino acids, scoring matrices have been defined to assign a score to each possible pair of amino acids.
Smith-Waterman pairwise Sequence Alignment
The approach to parallelizing the pairwise sequence alignment is to distribute the computation along the diagonals of the similarity matrix, because of the computation of element values along one diagonal is independent of that along other. But there is dependency between neighboring diagonals because the calculation of one element value relies on the values of its left, left upper and upper elements. So the parallel alignment is executed in a wavefront way, computing first the values along the first diagonal in parallel, then along the second diagonal in parallel, and so forth, through the last diagonal in parallel.
While the parallel calculation goes on, selection for the element with the highest score is also connected, so that only the elements in the last diagonals are required to be remembered. This greatly reduces the memory requirement.
The computing granularity can be changed to achieve the best performance according to the size of the similarity matrix and the number of computing nodes available. There will be a trade-off between the granularity and load balancing. Larger granularity will reduce the communication overhead; this will improve the performance but, at the same time, it is more likely to incur load imbalance, which will degrade performance.
|