|
DNA is the basis of heredity. It is a polymer made up of small molecules called nucleotides, which are distinguished by their four bases: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). A DNA sequence is therefore specified completely by a string over the four-letter alphabet A, C, G, and T. DNA usually occurs in double strands, and the bases in the two strands are complementary to each other, i.e., A pairs with T and G pairs with C through hydrogen bonds.
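The pairing rule can be illustrated with a minimal Python sketch. The function names are hypothetical, and the reversal step relies on an assumption not stated above: that the two strands run in opposite directions, so the partner strand is conventionally written reversed.

```python
# Complementary base pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own reading direction.

    Assumption: the strands are antiparallel, so the complement is reversed.
    """
    return complement(strand)[::-1]

print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check on the pairing table.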
The DNA sequence of an organism is determined by a process called sequencing, which establishes the exact order of the four nucleotides A, C, G, and T that make up the DNA. A standard method for sequencing is based on separating DNA fragments by gel electrophoresis. However, the method is extremely labor-intensive and expensive, which prevents its use in large-scale sequencing applications. Capillary electrophoresis is the latest method of choice in large sequencing centers. The sequencing process generates a set of four traces of signal intensities, one for each of the four nucleotide bases. The actual sequence of nucleotides is then determined from the traces by a process known as basecalling.
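As a rough illustration of basecalling, the sketch below picks, at each position, the base whose trace has the highest intensity. This is a deliberate simplification: it assumes the four traces have already been sampled at peak positions and are free of noise, whereas practical basecallers must also detect peaks and estimate call quality. The function name and the sample intensities are made up for the example.

```python
def call_bases(traces: dict) -> str:
    """Call one base per position by choosing the strongest of the four traces.

    `traces` maps each base ("A", "C", "G", "T") to a list of intensities,
    one value per position; all four lists must have the same length.
    """
    bases = "ACGT"
    length = len(traces["A"])
    if any(len(traces[b]) != length for b in bases):
        raise ValueError("all four traces must have the same length")
    called = []
    for i in range(length):
        # Pick the base with the highest signal at position i.
        called.append(max(bases, key=lambda b: traces[b][i]))
    return "".join(called)

# Hypothetical intensities, invented for illustration only.
traces = {
    "A": [0.9, 0.1, 0.0, 0.2],
    "C": [0.0, 0.8, 0.1, 0.1],
    "G": [0.1, 0.0, 0.7, 0.1],
    "T": [0.0, 0.1, 0.2, 0.6],
}
print(call_bases(traces))  # ACGT
```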
The next step after obtaining a new DNA sequence is to study the functional and structural information encoded in the sequence. One way to do this is by comparing the new sequence with sequences that have already been studied and annotated. Sequences that are similar would probably have the same function, be it a functional role, a regulatory role, or structural properties in the case of proteins. Additionally, if two sequences from different organisms are similar, there may be a common ancestor sequence, and the sequences are then said to be homologous. Relationships between homologous sequences have important implications for the study of speciation and for phylogenetic analysis.
One method for sequence comparison is sequence alignment. Sequence alignment is the procedure of comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences. For base-by-base comparison of two sequences, a rigorous alignment of the two sequences using string matching techniques is needed.
The standard pairwise alignment method is based on dynamic programming. The method compares every pair of characters in the two sequences and generates an alignment and a score, which depends on the scoring scheme used (i.e., a scoring matrix for the different base-pair combinations, match and mismatch scores, and a scheme for insertion/deletion (gap) penalties). This alignment will include matched and mismatched characters and gaps in the two sequences, positioned so that the number of matches between identical characters is the maximum possible. Sequence alignment can be either global or local. Global alignment tries to align the entire sequences in such a way as to maximize the degree of similarity between them. However, for most DNA sequence comparisons, one is usually more interested in finding conserved patterns or segments in two sequences by local alignment. In local alignment, the alignment stops at the ends of regions of strong similarity, and a much higher priority is given to finding these local regions than to extending the alignment to include more neighboring pairs. The Smith-Waterman algorithm finds a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity.
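A minimal Python sketch of Smith-Waterman local alignment follows. The scoring values (match = 2, mismatch = -1, gap = -2) and the linear gap penalty are assumptions chosen for illustration, not values prescribed by the text; a real application would substitute its own scoring matrix and gap scheme.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned segment of a, aligned segment of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The floor at 0 is what makes the alignment local.
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

score, seg_a, seg_b = smith_waterman("TTTACGTTT", "GGACGTGG")
print(score, seg_a, seg_b)  # 8 ACGT ACGT
```

The table has one cell per pair of characters, so time and memory both grow with the product of the two sequence lengths. Removing the floor at zero and tracing back from the bottom-right corner instead of the best cell would turn this sketch into a global alignment.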
Analysis of multiple DNA sequences for phylogenetic study is an important area of sequence analysis. A phylogenetic analysis of a family of related DNA or protein sequences is a determination of how the family might have been derived during molecular evolution. Phylogenetic analysis leads to the construction of an evolutionary tree. The evolutionary relationships among the sequences are depicted by placing the sequences as leaves on the tree in such a way that the branching relationships in the tree reflect the degree to which different sequences are related. A phylogenetic study performed on a gene family could also aid in the prediction of genes with equivalent or similar functions. It could also be used to track changes in the genome of a rapidly changing species, such as a virus.
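One simple way to build such a tree, shown here only as a sketch, is average-linkage clustering (UPGMA) over pairwise distances. The text does not name a tree-building method, so UPGMA is an assumption, as are the distance measure (fraction of differing positions between sequences that are already aligned to equal length) and the toy sequences.

```python
from itertools import combinations

def p_distance(s: str, t: str) -> float:
    """Fraction of positions at which two equal-length aligned sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s, t)) / len(s)

def upgma(seqs: dict):
    """Build a tree as nested tuples; leaves are the sequence names."""
    # Each cluster is keyed by its subtree and stores its member names.
    clusters = {name: [name] for name in seqs}

    def cluster_distance(c1, c2) -> float:
        # Average distance over all pairs of members from the two clusters.
        pairs = [(m, n) for m in clusters[c1] for n in clusters[c2]]
        return sum(p_distance(seqs[m], seqs[n]) for m, n in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters into a new internal node.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        members = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = members
    return next(iter(clusters))

# Toy sequences, invented for illustration only.
seqs = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "ACGAACCA",
    "s4": "TCGAACCA",
}
print(upgma(seqs))  # (('s1', 's2'), ('s3', 's4'))
```

In the result, s1 and s2 share a branch, as do s3 and s4, mirroring the fact that each pair differs at only one position. The sketch omits branch lengths, which a fuller implementation would derive from the merge distances.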
|