Centromere specification:
In a long-running collaboration with
Kaustuv Sanyal's laboratory (JNCASR, Bangalore), we have studied the
sequence composition and evolution of centromeres in various yeast
species. Baker's yeast, Saccharomyces cerevisiae, has small
``point centromeres'' about 200-300bp long that are clearly defined by
conserved sequence elements. Fission yeast, S. pombe, and many
multicellular eukaryotes have large ``regional centromeres'' thousands
of base pairs long that are often characterised by inverted repeats
and conserved sequence patterns. Candida albicans was
previously observed to have no identifiable sequence pattern in its
centromeres. We found that the same is true for its close relative
C. dubliniensis and, moreover, that the centromeric regions,
though syntenous with C. albicans, have no discernible sequence
homology and appear to be evolving much faster than other intergenic
sequence, suggesting a high error rate in replication (PNAS 2008). In
C. tropicalis we found a conserved centromere structure
reminiscent of higher eukaryotes, suggesting either a gain of such
structure in this species or a loss in other Candida species
(PLOS Genetics, 2016). We made a purely in-silico but convincing
determination of centromeric loci in Malassezia sympodialis, as
part of a larger project (NAR 2017). The sequence analysis techniques
used here may also be relevant to understanding how promoters function,
going beyond binding motifs to other features.
Recent work includes further study of centromere and karyotype evolution in Malassezia (eLife 2020), a study of replication origins and centromeric chromatin in Candida albicans (Genome Research 2021), an evolutionary analysis of the multi-drug-resistant pathogen Candida auris (mBio 2021), and a clade identification technique (Microbiol Spectrum 2022).
Regulatory Genomics:
In 2018, we published a new algorithm for
analysis of ChIP-seq data, THiCweed (NAR 2018). Processed ChIP-seq data can
consist of tens of thousands of peaks per dataset, posing a challenge
to traditional motif-finders which don't scale to such sizes. Also,
ChIP-seq pulls down protein-DNA complexes and sequences the bound DNA,
but these DNA fragments may be directly bound to the protein of
interest or indirectly bound via co-factors, and therefore any motif
may appear in only a minority of these peaks. THiCweed reframes the
problem as one of clustering by sequence similarity, and moreover
looks at much larger lengthscales than typical motifs. In an extensive
analysis of ENCODE ChIP-seq data, we recover the known literature
motif in some fraction of the ChIP-seq peaks in most cases, but also
find many variants in these motifs, including variability within
motifs, occurrence of secondary motifs, variability in spacing, and
extended features in surrounding sequence. Some motifs are widespread
across multiple ChIP-seq datasets: previous authors (Worsley-Hunt and
Wasserman, 2014) observed a few examples of this, but we find several
more. The variations in motifs and context sequence appear to be
biologically relevant. Importantly, THiCweed is also much faster than
competitors, despite currently running on a single processor. The
importance of co-factors is a well-known problem, and several papers
have remarked on the variability of TF motifs and the importance of
contextual sequence; THiCweed is a promising and practical tool to
explore these issues. Extensions of this algorithm are ongoing.
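To make the idea of "clustering by sequence similarity" concrete, here is a toy sketch in Python. It is not THiCweed's algorithm: it simply groups equal-length sequences greedily by Hamming distance to a cluster representative. The function names, the threshold, and the example sequences are all hypothetical and chosen only for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(seqs, max_mismatches):
    """Greedy clustering (illustrative only, not THiCweed).

    Each sequence joins the first cluster whose representative (its first
    member) is within max_mismatches; otherwise it starts a new cluster.
    """
    clusters = []
    for s in seqs:
        for cluster in clusters:
            if hamming(s, cluster[0]) <= max_mismatches:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


# Hypothetical "peak" sequences, far shorter than real ChIP-seq peaks.
peaks = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "ACGAACGT"]
clusters = greedy_cluster(peaks, max_mismatches=2)
for i, cluster in enumerate(clusters):
    print(i, cluster)
```

In this toy input the sequences fall into two groups, one per underlying pattern. A real tool has to handle tens of thousands of longer sequences, unknown alignment and strand, and a principled choice of the number of clusters, none of which this sketch attempts.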
This work builds on a long-standing interest in the complexities of
transcriptional regulation. Early work was PhyloGibbs, a
Gibbs-sampling motif-finder incorporating phylogeny
(PLOS Comp Biol 2005, 2008). In
2010 I proposed a method for including dinucleotide correlations in
models for TF binding, and demonstrated superior predictive
performance in yeast (PLOS One, 2010).
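As a rough illustration of why dinucleotide correlations can matter, the Python sketch below compares an independent-position model with a simple model in which each base is conditioned on the preceding base. This is a minimal sketch under stated assumptions, not the published method: the function names, the pseudocount, and the example binding sites are hypothetical.

```python
import math
from collections import Counter


def pwm_logprob(sites, seq, pseudo=0.5):
    """Log-probability of seq when every position is modelled independently."""
    n = len(sites)
    total = 0.0
    for i, base in enumerate(seq):
        counts = Counter(s[i] for s in sites)
        total += math.log((counts[base] + pseudo) / (n + 4 * pseudo))
    return total


def dinuc_logprob(sites, seq, pseudo=0.5):
    """Log-probability of seq when each base depends on the previous base.

    Illustrative adjacent-position model only; the pseudocount is an
    assumption made for this sketch.
    """
    n = len(sites)
    first = Counter(s[0] for s in sites)
    total = math.log((first[seq[0]] + pseudo) / (n + 4 * pseudo))
    for i in range(1, len(seq)):
        prev = seq[i - 1]
        context = sum(1 for s in sites if s[i - 1] == prev)
        pair = sum(1 for s in sites if s[i - 1] == prev and s[i] == seq[i])
        total += math.log((pair + pseudo) / (context + 4 * pseudo))
    return total


# Hypothetical sites in which positions 2 and 3 are correlated (AT or CG).
sites = ["GATC"] * 5 + ["GCGC"] * 5

for candidate in ("GATC", "GAGC"):
    print(candidate,
          round(pwm_logprob(sites, candidate), 3),
          round(dinuc_logprob(sites, candidate), 3))
```

The independent-position model cannot distinguish the observed site GATC from the unobserved mixture GAGC, because each base is individually common at its position. The model with adjacent-base dependence scores the unobserved combination lower.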
Chromatin and TF interactions:
With IMSc student
Rakesh Netha Vadnala (now at NCBS Bengaluru), and in
collaboration with Leelavati Narlikar (Pune) and Sridhar
Hannenhalli (Bethesda, MD, USA), we studied the cooperative
interaction of transcription factor binding in the
three-dimensional context of the nucleus. Traditionally,
sequence proximity has been the main metric for studying TF-TF
interactions, but the availability of Hi-C chromatin interaction
data now enables us to examine interactions between sequentially
distant but spatially proximal regions. This interaction varies
among cell types because of differences in chromatin
conformation, and an understanding of this will be important in
in silico TF binding prediction and gene expression
studies (bioRxiv, 2022).
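The notion of "sequentially distant but spatially proximal" can be sketched as follows in Python. This is only an illustration of the idea, not the analysis used in the study: the function name, the bin size, the distance threshold, and all coordinates and contacts are hypothetical, and Hi-C contacts are assumed to be already reduced to a set of interacting bin pairs on one chromosome.

```python
def spatially_proximal_pairs(sites_a, sites_b, contacts, bin_size, min_linear_dist):
    """Pairs of binding sites that are far apart in sequence but lie in
    genomic bins reported as being in contact.

    sites_a, sites_b: binding-site coordinates (bp) for two TFs.
    contacts: set of frozensets, each holding two contacting bin indices.
    """
    pairs = []
    for a in sites_a:
        for b in sites_b:
            if abs(a - b) < min_linear_dist:
                continue  # close in sequence: not the case of interest
            bins = frozenset((a // bin_size, b // bin_size))
            if bins in contacts:
                pairs.append((a, b))
    return pairs


# Hypothetical data: 10 kb bins, two contacting bin pairs.
contacts = {frozenset((1, 90)), frozenset((50, 90))}
sites_tf_a = [12_000, 505_000]
sites_tf_b = [15_000, 903_000]

pairs = spatially_proximal_pairs(
    sites_tf_a, sites_tf_b, contacts, bin_size=10_000, min_linear_dist=100_000
)
print(pairs)
```

Because chromatin conformation differs between cell types, the `contacts` set, and therefore the resulting pairs, would differ from one cell type to another even for identical binding-site coordinates.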
Past projects
ChIP-seq: inferring cooperativity among transcription factors,
and a simulation pipeline
Vishaka Datta (then at NCBS, Bangalore), with Sandeep Krishna (NCBS
Bangalore) and me, developed an algorithm to infer cooperativity among
transcription factors purely from ChIP-seq peak information (Datta,
Siddharthan, Krishna; PLOS One, 2018). Vishaka also developed a
comprehensive simulation pipeline for ChIP-seq experiments (Datta,
Hannenhalli, Siddharthan; PLOS Computational Biology, 2019).