Clustering short DNA sequences: hidden complexity

Rahul Siddharthan, IMSc Chennai

We describe a new approach to efficiently clustering large numbers of transcription factor binding sites, in particular sites identified by high-throughput ChIP-seq experiments as well as bioinformatically predicted sites. Sites that are more similar to one another than to other sites, in the sense of being more likely to be sampled from the same position weight matrix, are clustered together, and there are stringent statistical criteria. Compared to "synthetic" DNA generated from high-order Markov models, we find that binding sites in real DNA exhibit a lot of complexity, not only in their core sequence but in their neighbourhood "context", which in some cases appears biologically significant. (Work done with Leelavati Narlikar and Ankit Agrawal)