|
DNA-binding proteins commonly contact multiple bases within their target sequence and will tolerate single or even multiple base changes, albeit with reduced affinity. Sometimes this reduced affinity is compensated for by cooperative binding of another factor to an adjacent site. The recognition site for any one factor is therefore best described not as a single sequence but as a family of motifs to which the factor can bind. The goal of pattern recognition in this context is to identify subsequences in the query sequence that are likely to bind a particular factor. The query sequence can be a gene sequence or any selected genomic sequence against which we want to match our pattern or motif. Generally, motif detection scans the query sequence with a window of a given size and evaluates the subsequence in the current window against the selected motif definition. The window size is set to the length of the motif, and the window slides across the sequence one base at a time until it reaches the end of the query sequence. A match occurs if the subsequence in the current window satisfies the motif definition under the conditions preset by the user. Motif definitions are derived from collections of binding site sequences for selected transcription factors (TFs); these sites are usually determined experimentally and are available in a number of public databases such as TransFac.
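The window scan described above can be sketched in a few lines of Python. This is a minimal illustration, not a published implementation: the function names and the `max_mismatches` parameter are hypothetical, and a simple mismatch count stands in for whatever "preset user condition" a real tool would apply.

```python
def sliding_windows(query, width):
    """Yield (start, subsequence) for every window of `width` bases,
    moving one base at a time until the end of the query is reached."""
    for start in range(len(query) - width + 1):
        yield start, query[start:start + width]


def scan_with_mismatches(query, motif, max_mismatches=1):
    """Report windows that differ from `motif` at no more than
    `max_mismatches` positions (an assumed, illustrative match condition)."""
    query = query.upper()
    motif = motif.upper()
    hits = []
    # The window size is set to the length of the motif.
    for start, window in sliding_windows(query, len(motif)):
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits
```

Calling `scan_with_mismatches(query, motif, 0)` demands an exact match; raising `max_mismatches` admits the weaker, lower-affinity variants of the site at the cost of more chance hits.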
There are two ways of representing a motif: as a consensus pattern or as a probability matrix. A consensus pattern is a string of letters that defines which bases the query sequence may contain at each position of the motif. The approach uses a window the size of the motif to scan the query sequence; at each position, it attempts to match the subsequence in the window to the consensus pattern. The matrix approach instead uses probabilities to define the motif: a matrix representation assigns a probability to each base occurring at each position in the motif. This approach takes account of the fact that certain contact residues in a motif may be absolutely required for significant binding, others may tolerate two or three alternatives, and others still may be non-contact spacers tolerant of any base. In this approach, the query sequence is again scanned from one end to the other, and the subsequence at each window position is assigned a probability score.
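The two representations can be contrasted with a short Python sketch. Several things here are assumptions rather than part of the text: the consensus is written with the standard IUPAC ambiguity codes, the matrix score is taken to be the plain product of per-position probabilities, and `EXAMPLE_MATRIX` is a made-up four-position motif, not an entry from any database.

```python
# Standard IUPAC nucleotide ambiguity codes (assumed consensus alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_consensus(window, consensus):
    """True if every base in the window is allowed by the consensus code
    at the same position."""
    return len(window) == len(consensus) and all(
        base in IUPAC.get(code, "") for base, code in zip(window, consensus)
    )


# Hypothetical motif: position 1 is absolutely required, position 2
# tolerates two alternatives, position 3 is a spacer, position 4 is biased.
EXAMPLE_MATRIX = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.5, "G": 0.0, "T": 0.5},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.2, "C": 0.0, "G": 0.8, "T": 0.0},
]


def matrix_score(window, matrix):
    """Probability of the window under the matrix: the product of the
    per-position base probabilities (unknown bases score zero)."""
    score = 1.0
    for base, column in zip(window, matrix):
        score *= column.get(base, 0.0)
    return score


def scan_matrix(query, matrix, threshold):
    """Slide a window of the matrix width along the query and keep
    windows whose score reaches the user-chosen threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        score = matrix_score(window, matrix)
        if score >= threshold:
            hits.append((start, window, score))
    return hits
```

A consensus match is all-or-nothing, whereas the matrix score ranks candidate sites, so the threshold becomes the user's control over sensitivity. Practical tools often score log-odds against a background model rather than raw probabilities; the product is used here only because it is the simplest reading of "probability score".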
Regardless of whether one takes a simple consensus or a matrix as the basis for searching, the search space presented by a mammalian gene is simply too large. Given the small size and degeneracy of transcription factor binding sites, every site occurs by chance in genomic DNA with significant frequency, and when one is dealing with control regions extending over tens of kilobases, an attempt to identify the functional elements in any single gene by pattern matching alone is futile.
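A rough calculation shows the scale of the problem. The sketch below estimates how many matches a motif would produce by chance alone, under the simplifying assumptions (not made in the text) that the four bases are equally frequent and independent; real genomic base composition is neither. The function name and parameters are hypothetical.

```python
def expected_chance_hits(region_length, allowed_per_position, both_strands=True):
    """Expected number of chance matches in a region of random sequence.

    `allowed_per_position` lists how many of the four bases are accepted
    at each motif position (1 = fixed base, 4 = any base).
    Assumes uniform, independent base frequencies.
    """
    width = len(allowed_per_position)
    windows = max(region_length - width + 1, 0)
    # Probability that one random window satisfies the motif.
    p_match = 1.0
    for allowed in allowed_per_position:
        p_match *= allowed / 4.0
    expected = windows * p_match
    # Doubling for the reverse strand is an approximation; it
    # over-counts motifs that are their own reverse complement.
    return 2 * expected if both_strands else expected
```

Under these assumptions, a fully specified six-base site in a 50 kb control region gives `expected_chance_hits(50000, [1] * 6, both_strands=False)`, roughly 12 chance matches on one strand alone, and any degenerate position raises that number further.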
There have been two major approaches to increasing the statistical power of motif-searching algorithms: “multi-genes, single species” and “single gene, multi-species”. The “multi-genes, single species” approach relies on the conservation of regulatory mechanisms among the members of a cluster of co-regulated genes. The “single gene, multi-species” approach is based upon the assumption that the core regulatory mechanism will be conserved across groups of evolutionarily related species, and that the motifs involved will be in approximately the same place or have approximately the same relative abundance. Neither approach is entirely adequate. The logical extension is to combine the two into “multi-genes, multi-species”.
This combined approach inherits the strengths and weaknesses of both. In general, the “single gene, multi-species” analysis is the less computationally intensive of the two, since one is typically dealing with only a small number of species, and it is usually carried out first. For each member of a cluster of co-regulated genes, we assemble the conserved non-coding regions and carry out an analysis to identify conserved motifs or motifs that fit the predetermined matrices. The predictions are then merged to determine which motifs are over-represented in the cluster. The analysis can potentially be made sufficiently large that false positives become unimportant. In fact, one can decide to include all putative orthologs for any gene within the cluster and examine them individually in retrospect to decide which of them contain motifs common to the co-regulated set.
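The merging step can be sketched as follows. This is an illustrative outline only: it assumes the per-gene predictions are already available as sets of motif names, and that a `background` table gives the fraction of genes expected to carry each motif by chance. Both inputs, the function names, and the `min_ratio` cutoff are hypothetical, and a real analysis would use a proper statistical test rather than a simple ratio.

```python
from collections import Counter


def merge_predictions(predictions):
    """Fraction of genes in the cluster in which each motif was predicted.

    `predictions` maps a gene name to the set of motifs found in its
    conserved non-coding region.
    """
    counts = Counter()
    for motifs in predictions.values():
        counts.update(set(motifs))  # count each motif once per gene
    total = len(predictions)
    return {motif: n / total for motif, n in counts.items()}


def over_represented(predictions, background, min_ratio=2.0):
    """Motifs whose cluster frequency exceeds the assumed background
    frequency by at least `min_ratio`, strongest enrichment first."""
    enriched = []
    for motif, fraction in merge_predictions(predictions).items():
        expected = background.get(motif, 0.0)
        # Motifs with no usable background estimate are skipped.
        if expected > 0 and fraction / expected >= min_ratio:
            enriched.append((motif, fraction, fraction / expected))
    return sorted(enriched, key=lambda item: item[2], reverse=True)
```

Because each motif is counted once per gene, a single gene with many chance copies of a motif cannot dominate the result; a motif rises in the ranking only when it recurs across the co-regulated set.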
|