Motifs reported by Chen et al.
Chen et al. Because sequence-sets are very large, some of the reported motifs became highly degenerate. For these, even allowing for IUPAC symbols during the greedy search results in highly conserved motifs. Hu et al. This is grounded in the belief that motifs are tightly packed near the peak summit, the location inside each peak with the highest sequence coverage depth.
As a result, prior probabilities were set to be proportional to a discretized Student's t-distribution with 3 degrees of freedom, rescaled so that they form a step function with a fixed 25 bp step size. The prior probabilities are symmetric and centered at the peak summits. In fact, the motifs reported with this prior were exactly the same as those reported with the uniform prior (recall that, under the uniform prior, every position in the DNA is equally likely to contain a motif).
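The construction of the peak-based prior described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `scale_bp` parameter (how many base pairs correspond to one unit of the t-distribution) and the choice of evaluating the density at the start of each 25 bp band are assumptions made only for illustration, since the text does not specify them.

```python
import math

def student_t_pdf(x, df=3):
    """Density of Student's t-distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def summit_prior(seq_len, summit, step=25, scale_bp=50.0, df=3):
    """Step-function positional prior centered at `summit`.

    Positions are grouped into bands of `step` bp by their distance to the
    summit; every position in a band gets the same weight. `scale_bp` is a
    hypothetical scaling parameter (assumption, not taken from the paper).
    """
    weights = []
    for pos in range(seq_len):
        band = abs(pos - summit) // step      # 0 for the band around the summit
        distance = band * step                # distance at which the band starts
        weights.append(student_t_pdf(distance / scale_bp, df))
    total = sum(weights)
    return [w / total for w in weights]       # normalize to a probability vector

prior = summit_prior(seq_len=201, summit=100)
```

By construction the resulting vector sums to one, is symmetric around the summit, and is constant within each band, so it decreases in steps as the distance from the summit grows.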
We attribute this to the fact that part of the information contained in the binding peak-based prior is already encoded in the BIS score. Indeed, peak summits indicate an overrepresentation of a motif at a certain locus, and such overrepresentation is already weighted in the BIS score (recall Equations 1 and 4). For longer sequences, the effective resolution of the peak summits seems to provide useful information [22, 23].

Wasserman and Sandelin [41] noticed that the discovery of TFBS's from a nucleotide sequence alone suffers from impractically high false positive rates.
This was termed the futility theorem, as nearly every predicted TFBS has no function in vivo. This problem has been studied and addressed by taking into consideration information in and beyond the TFBS's, such as orthologous conservation [16, 17], nucleosome positioning [11, 42], DNA duplex stability [14] and coverage profiles obtained from ChIP-seq assays [22, 23]. Following this line of research, we have verified in the present study that post-processing the output of RISOTTO with prior knowledge from different sources is beneficial for motif discovery.
RISOTTO is a consensus-based method that exhaustively enumerates all motifs by collecting their occurrences, up to a fixed Hamming distance, from the input sequences. The Hamming distance between two strings of equal length is the minimum number of substitutions required to change one string into the other. As a result, a set of overrepresented motifs is reported and then ordered by biological relevance according to some statistical significance test [24, 26, 27].
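To make the notion of an occurrence "up to a fixed Hamming distance" concrete, the following Python sketch computes the Hamming distance and collects the occurrences of a consensus by a plain sliding-window scan. The function names are hypothetical and the brute-force scan is used only for clarity; it is not how RISOTTO itself collects occurrences.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def occurrences(consensus, sequences, max_dist):
    """Return (sequence index, start) pairs where a window of the sequence
    is within `max_dist` substitutions of `consensus`."""
    k = len(consensus)
    hits = []
    for i, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            if hamming(consensus, seq[start:start + k]) <= max_dist:
                hits.append((i, start))
    return hits

hits = occurrences("ACGT", ["TTACGTTT", "ACCTGG"], max_dist=1)
# hits == [(0, 2), (1, 0)]: an exact match in the first sequence and
# a one-substitution match (ACCT) in the second.
```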
This ordered list is obtained in the classical way, from the nucleotide sequence alone, and, as previously mentioned, it is particularly important to introduce a bias from available priors. Certainly, we would not expect RISOTTO, or any other combinatorial algorithm, to report completely outlandish motifs, as the motif discovery problem is indeed a combinatorial problem that accounts for the overrepresentation of a string in a set of DNA sequences. However, prior information provides valuable guidance on how to describe a motif that goes beyond neighborhoods of the consensus sequence defined by the Hamming distance or any similar distance.
For the sake of simplicity, assume that we are looking for motifs of a fixed size k. Combinatorial algorithms rely on the overrepresentation of motifs to extract them.
This extraction is exhaustive and proceeds by iteratively extending shorter candidate strings. Usually, complex data structures, such as suffix trees, are employed to extend the candidate string. Whenever an extension fails to be overrepresented in the input sequences, that extension is discarded and another one is attempted. Only extensions that reach size k are reported. Conversely, prior information only asserts whether a sub-sequence of a fixed size W at a certain position of the DNA sequences is likely to be a motif. It is not straightforward to use prior information in combinatorial algorithms because they would need to know whether a sub-string of each intermediate size is likely to be a motif. However, on the one hand, it is infeasible in terms of space to keep priors for multiple values of W.
On the other hand, priors for very small or very large values of W carry no information whatsoever, as the corresponding sub-strings are either very common (occurring in all input sequences) or very rare (occurring only once or never).
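The extend-and-prune scheme described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: overrepresentation is modeled as "occurs, within `max_dist` substitutions, in at least `min_support` sequences" (a hypothetical criterion and parameter name), and a brute-force scan replaces the suffix-tree machinery that real combinatorial algorithms such as RISOTTO rely on.

```python
def occurs_in(candidate, seq, max_dist):
    """True if some window of `seq` is within `max_dist` substitutions of `candidate`."""
    m = len(candidate)
    return any(
        sum(a != b for a, b in zip(candidate, seq[i:i + m])) <= max_dist
        for i in range(len(seq) - m + 1)
    )

def extend_motifs(sequences, k, max_dist, min_support):
    """Grow candidates one base at a time, discarding failed extensions.

    Pruning is safe because every prefix of a supported string is itself
    supported: dropping the last base cannot add mismatches.
    """
    candidates = [""]
    for _ in range(k):
        extended = []
        for cand in candidates:
            for base in "ACGT":
                new = cand + base
                support = sum(occurs_in(new, s, max_dist) for s in sequences)
                if support >= min_support:    # keep only overrepresented extensions
                    extended.append(new)
        candidates = extended
    return candidates                         # only strings that reached size k

motifs = extend_motifs(["AACGTA", "TACGTT", "GACGTC"], k=4, max_dist=0, min_support=3)
# motifs == ["ACGT"], the only 4-mer shared exactly by all three sequences.
```

Note that the sketch only ever asks whether a string of the current intermediate size is supported, which is exactly the kind of question a prior defined for a single fixed size W cannot answer.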
Besides this discussion, there are two obvious advantages of using prior information at a post-processing step. Another advantage is that, as new priors are devised, we do not need to recompute previous starting points; it suffices to run the greedy-search procedure of the GRISOTTO algorithm.

In closing, we stress that the BIS score was used throughout the experiments with sequence-sets known to be bound by a TF. Therefore, it was only used to discover the positions in each sequence-set where the motif occurs.
Another possible application of the BIS score would be to detect the fraction of sequences that are likely to have site predictions. There are two possible ways to adapt GRISOTTO to this new problem: (i) derive a threshold on the BIS score contribution of a sequence, above which the sequence is likely to have site predictions; (ii) incorporate an input parameter in the GRISOTTO greedy procedure, usually called quorum, that accounts for the fraction of sequences that have binding site predictions.
Neither approach seems straightforward, and both are beyond the scope of this paper; hence they are left as a future research topic.

In practice, this introduces some extra knowledge, taken from the literature or computed from the sequences, that helps in characterizing motifs.
The algorithm is flexible enough to combine several priors from different sources. Each prior is given a weight reflecting the confidence in the information it contains and its relevance for motif discovery.
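One plausible reading of this weighting is a convex combination of positional priors, sketched below in Python. The function name, the representation of a prior as a list of per-position probabilities, and the normalization of the weights are assumptions made for illustration; the sketch does not reproduce the BIS score itself.

```python
def combine_priors(priors, weights):
    """Convex combination of positional priors.

    `priors` is a list of equal-length probability vectors (one per source)
    and `weights` holds one non-negative confidence value per prior. Weights
    are normalized to sum to one, so the result stays a probability vector
    whenever each input prior is one.
    """
    if not priors or len(priors) != len(weights):
        raise ValueError("need exactly one weight per prior")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    length = len(priors[0])
    if any(len(p) != length for p in priors):
        raise ValueError("all priors must cover the same positions")
    total = sum(weights)
    norm = [w / total for w in weights]
    return [sum(w * p[i] for w, p in zip(norm, priors)) for i in range(length)]

uniform = [0.25, 0.25, 0.25, 0.25]
conservation = [0.70, 0.10, 0.10, 0.10]   # made-up values, for illustration only
combined = combine_priors([uniform, conservation], weights=[1, 3])
```

With this formulation, adding a new source of prior knowledge only requires supplying one more vector and one more weight.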
In this way, priors can be introduced at will giving rise to a scoring criterion based on the convex closure of the information given by each prior. Prior information has previously been shown to be beneficial when used with EM and Gibbs sampler-based motif discoverers. We emphasize that the goal of this paper is not to introduce new priors, but to show that priors can also be advantageous to assist and improve the output of combinatorial algorithms such as RISOTTO.
Moreover, we have shown that combining priors is very promising for further extending the power of motif discovery algorithms. Prior information from different sources was used, including orthologous conservation, nucleosome occupancy, and destabilization energy. In this assessment two priors were used: orthologous conservation and base coverage profiles obtained from the ChIP-seq assays. We concluded that, as for ChIP-chip data, the orthologous conservation-based prior was very useful, being able to unravel 13 motifs strongly similar to the ones reported by other tools and found in the TRANSFAC database.
With respect to the coverage-based prior, its direct use as a positional prior was not favorable, giving results comparable to the uniform prior. We believe this is because the BIS score already accounts for overrepresentation in the input sequences, which we suspect mimics the information contained in this new prior, making the prior redundant.

AMC did the programming and designed and performed the experiments. AMC also wrote the final draft of the paper. ALO proofread the final draft of the paper. Both authors have read and approved the final manuscript.

This additional file presents in detail three topics needed to make the paper self-contained. Finally, it contains relevant information about the evaluation methodology, including parameter settings and running times.

We also thank Timothy Bailey and his co-authors for making available the mouse ChIP-seq data and respective priors used in the experiments. Last but not least, the authors are very thankful for the invaluable comments of the anonymous referees.
Algorithms Mol Biol. Published online Apr. Corresponding author: Alexandra M Carvalho. Received Nov 10; Accepted Apr.
Results: We extend RISOTTO, a combinatorial algorithm for motif discovery, by post-processing its output with a greedy procedure that uses prior information.

Conclusions: The conclusions of this work are twofold.

Background: An important part of gene regulation is mediated by specific proteins, called transcription factors (TF), which influence the transcription of a particular gene by binding to specific sites on DNA sequences, called transcription factor binding sites (TFBS).

Table 1. Definition of terms used in describing the algorithms presented in Methods.

Evolutionary conservation-based priors: Diverse methods for motif discovery make use of orthologous conservation to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional.

Nucleosome occupancy-based priors: Nucleosome occupancy has also been used in motif discovery.

Combining priors: Despite considerable effort to date in developing new potential priors to boost motif discoverers, PSP's from different sources have not yet been combined.
Competing interests: The authors declare that they have no competing interests.
German Conference on Bioinformatics. Conformational and physicochemical DNA features specific for transcription factor binding sites. Macromolecular recognition. Current Opinion in Structural Biology. Non-additivity in protein-DNA binding.