In a long-running collaboration with Kaustuv Sanyal's laboratory (JNCASR, Bangalore), we have studied the sequence composition and evolution of centromeres in various yeast species. Baker's yeast, Saccharomyces cerevisiae, has small ``point centromeres'' about 200-300 bp long that are clearly defined by repetitive elements. Fission yeast, Schizosaccharomyces pombe, and many multicellular eukaryotes have large ``regional centromeres'', thousands of base pairs long, that are often characterised by inverted repeats and conserved sequence patterns. Candida albicans was previously observed to have no identifiable sequence pattern in its centromeres. We found that the same is true for its close relative C. dubliniensis and, moreover, that its centromeric regions, though syntenic with those of C. albicans, share no discernible sequence homology with them and appear to be evolving much faster than other intergenic sequence, suggesting a high error rate in replication (PNAS 2008). In C. tropicalis we found a conserved centromere structure reminiscent of higher eukaryotes, suggesting either a gain of such structure in this species or a loss in other Candida species (PLOS Genetics, 2016). We made a purely in silico but convincing determination of centromeric loci in Malassezia sympodialis, as part of a larger project (NAR 2017). The sequence analysis techniques used here may also be relevant to understanding how promoters function, going beyond binding motifs to other sequence features.
Recent work includes further study of centromere and karyotype evolution in Malassezia (eLife 2020), a study of replication origins and centromeric chromatin in Candida albicans (Genome Research 2021), an evolutionary analysis of the multi-drug-resistant pathogen Candida auris (mBio 2021), and a clade identification technique (Microbiol Spectrum 2022).
In 2018, we published a new algorithm for analysis of ChIP-seq data, THiCweed (NAR 2018). Processed ChIP-seq data can consist of tens of thousands of peaks per dataset, posing a challenge to traditional motif-finders, which do not scale to such sizes. Moreover, ChIP-seq pulls down protein-DNA complexes and sequences the bound DNA, but these DNA fragments may be bound to the protein of interest either directly or indirectly via co-factors, so any given motif may appear in only a minority of the peaks. THiCweed reframes the problem as one of clustering by sequence similarity, and looks at much larger length scales than typical motifs. In an extensive analysis of ENCODE ChIP-seq data, we recover the known literature motif in some fraction of the ChIP-seq peaks in most cases, but also find many variants of these motifs, including variability within motifs, occurrence of secondary motifs, variability in spacing, and extended features in surrounding sequence. Some motifs are widespread across multiple ChIP-seq datasets: previous authors (Worsley-Hunt and Wasserman, 2014) observed a few examples of this, but we find several more. The variations in motifs and context sequence appear to be biologically relevant. Importantly, THiCweed is also much faster than competitors, despite currently running on a single processor. The importance of co-factors is well known, and several papers have remarked on the variability of TF motifs and the importance of contextual sequence; THiCweed is a promising and practical tool to explore these issues. Extensions of this algorithm are ongoing. This work builds on a long-standing interest in the complexities of transcriptional regulation. Early work was PhyloGibbs, a Gibbs-sampling motif-finder incorporating phylogeny (PLOS Comp Biol 2005, 2008). In 2010 I proposed a method for including dinucleotide correlations in models for TF binding, and demonstrated superior predictive performance in yeast (PLOS One, 2010).
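To illustrate why dinucleotide correlations can matter for binding-site models, here is a minimal Python sketch. It is not the method of the 2010 paper: it simply compares an independent-position (weight-matrix-style) model with a generic first-order chain in which each base depends on its left neighbour. The function names, the pseudocount of 0.5, and the four toy training sites are all assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET_SIZE = 4  # A, C, G, T


def independent_log_prob(seq, sites, pseudo=0.5):
    """Log-probability of seq when every position is modelled independently."""
    total = 0.0
    for i, base in enumerate(seq):
        column = Counter(site[i] for site in sites)
        total += math.log(
            (column[base] + pseudo) / (len(sites) + ALPHABET_SIZE * pseudo)
        )
    return total


def adjacent_pair_log_prob(seq, sites, pseudo=0.5):
    """Log-probability of seq when each base is conditioned on its left neighbour."""
    first = Counter(site[0] for site in sites)
    total = math.log(
        (first[seq[0]] + pseudo) / (len(sites) + ALPHABET_SIZE * pseudo)
    )
    for i in range(1, len(seq)):
        # Counts of adjacent pairs at positions (i-1, i) in the training sites.
        pairs = Counter((site[i - 1], site[i]) for site in sites)
        # Number of training sites that share the left-hand base.
        prev_count = sum(1 for site in sites if site[i - 1] == seq[i - 1])
        total += math.log(
            (pairs[(seq[i - 1], seq[i])] + pseudo)
            / (prev_count + ALPHABET_SIZE * pseudo)
        )
    return total


# Hypothetical toy data: the two central positions are always AT or TA.
training_sites = ["GATC", "GTAC", "GATC", "GTAC"]

for candidate in ("GATC", "GAAC"):
    print(
        candidate,
        round(independent_log_prob(candidate, training_sites), 3),
        round(adjacent_pair_log_prob(candidate, training_sites), 3),
    )
```

In this toy set, A and T are equally common in each central column, so the independent model scores GATC and GAAC identically. The pair model favours GATC, because the pair AA never occurs in the training sites.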
Chromatin and TF interactions
With IMSc student Rakesh Netha Vadnala (now at NCBS Bengaluru), and in collaboration with Leelavati Narlikar (Pune) and Sridhar Hannenhalli (Bethesda, MD, USA), we studied cooperative interactions between transcription factors (TFs) in the three-dimensional context of the nucleus. Traditionally, proximity along the sequence has been the main metric for studying TF-TF interactions, but the availability of Hi-C chromatin interaction data now enables us to examine interactions between regions that are distant in sequence but close in space. These interactions vary among cell types because of differences in chromatin conformation, and understanding them will be important for in silico TF binding prediction and gene expression studies (bioRxiv, 2022).
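The idea of pairing sequentially distant but spatially proximal binding sites can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the analysis in the preprint: peaks are single coordinates on one chromosome, Hi-C contacts are given as pairs of fixed-size bin indices that have already been thresholded, and the function name, bin size and separation cutoff are hypothetical.

```python
def distal_contact_pairs(peaks_a, peaks_b, contacts,
                         bin_size=10_000, min_separation=100_000):
    """Return (a, b) peak pairs that are far apart in sequence but whose
    Hi-C bins are listed as being in contact.

    peaks_a, peaks_b: peak positions (bp) for two TFs on the same chromosome.
    contacts: iterable of (bin_i, bin_j) pairs, treated as unordered.
    """
    contact_set = {frozenset(pair) for pair in contacts}
    pairs = []
    for a in peaks_a:
        for b in peaks_b:
            # Skip pairs that ordinary sequence proximity would already capture.
            if abs(a - b) < min_separation:
                continue
            if frozenset((a // bin_size, b // bin_size)) in contact_set:
                pairs.append((a, b))
    return pairs


# Hypothetical toy data: peak positions in bp, contacts as bin-index pairs.
tf_a_peaks = [15_000, 520_000]
tf_b_peaks = [18_000, 905_000, 1_230_000]
hic_contacts = [(1, 90), (52, 123), (3, 4)]

print(distal_contact_pairs(tf_a_peaks, tf_b_peaks, hic_contacts))
```

With these toy inputs, the peaks at 15,000 and 18,000 are excluded for being too close in sequence, while the pairs (15,000, 905,000) and (520,000, 1,230,000) are reported because their bins appear in the contact list. Since contact maps differ between cell types, the same peak sets could give different pairs with a different contact list.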
ChIP-seq: inferring cooperativity among transcription factors, and a simulation pipeline
Vishaka Datta (then at NCBS, Bangalore), with Sandeep Krishna (NCBS, Bangalore) and me, developed an algorithm to infer cooperativity among transcription factors based purely on ChIP-seq peak information (Datta, Siddharthan, Krishna; PLOS One, 2018). Vishaka also developed a comprehensive simulation pipeline for ChIP-seq experiments (Datta, Hannenhalli, Siddharthan; PLOS Computational Biology, 2019).