THiCweed: fast, sensitive detection of sequence features by clustering big data sets

THiCweed is a new approach to motif finding that models the problem as one of clustering bound regions based on sequence similarity. We take an iterative ``top-down'' approach of repeatedly subdividing an initial single large cluster of all input sequences into smaller and smaller clusters, while also exploring shift and reverse-strand matches of sequences to clusters. Our implementation is significantly faster than any other ChIP-seq-oriented motif-finding program we tested, able to process 5,000 sequences of 100bp length in a few minutes, or 30,000 sequences in 1-2 hours, on a desktop computer using a single CPU core.

"THiCweed: fast, sensitive detection of sequence features by clustering big data sets", Ankit Agrawal, Snehal Sambare, Leelavati Narlikar, Rahul Siddharthan
bioRxiv doi:10.1101/104109
now published in NAR, gkx1251, 2017

(ENCODE data used in paper)

NEW 19/09/2018: updated to work with Julia version 1.0

NEW 14/05/2017: experimental webserver for submitting jobs. If it doesn't quite work yet, be patient and contact me with bugs...

NEW 14/05/2017: Collection of predictions on ENCODE chip-seq data


THiCweed is available under the 2-clause BSD licence: see here for terms.

THiCweed is written in the Julia language, and requires version 1.0 or later. It requires the ArgParse and SpecialFunctions modules. If you don't already have it, install Julia 1.0 for your operating system (following instructions here) for example).

To install the ArgParse and SpecialFunctions modules: start Julia up on the command line (type "julia"), and at the prompt type

using Pkg

Then download and save the following files all in the same directory (for example in $HOME/bin or /usr/local/bin, need not be in current working directory).
thicweed.jl (main program) (UPDATED 2018-09-19 for Julia 1.0)
recluster.jl (optional: reclustering output of thicweed UPDATED 2018-09-19)
thicweed_lib.jl (common routines used by both of the above UPDATED 2018-09-19)
matchseq.jl (optional helper program to use thicweed output to scan for matches in a FASTA file; documentation forthcoming, use -h for quick help UPDADED 2018-09-19)
Then type
julia /path/to/thicweed.jl -h
for help. More detailed help is below.
To make life easier: Add the location of thicweed.jl to your $PATH environment variable, and make it executable (type "chmod +x /path/to/thicweed.jl"). Then you can invoke it simply as "thicweed.jl [arguments]"

THiCweed manual

THiCweed clusters input sequences by sequence similarity and effectively finds multiple motifs in large data sets such as ChIP-seq peaks. This is an expanded version of the brief help you get when typing "thicweed.jl -h".

THiCweed takes as input a single fasta-formatted file containing multiple sequences to be clustered on similarity. The sequences may be of different lengths. The clustering considers similarities within sliding "windows" of a fixed size, by default two-thirds the median length of the sequences. It considers all possible "shifts" of the windows within the sequences as well as partially out of the sequences (up to half the window may slide outside the known sequence; unknown nucleotides are treated as "N") and reverse compleents. Optionally, the input sequences may be pre-aligned by a given motif and shift moves and reverse complements disabled; this would then search known motifs for additional "structure" and is much faster than the normal operation. This is similar to what NPLB does.

It is recommended for efficient running that input sequences be about 100bp in size (for example, the ChIP-seq peak position plus and minus 50bp). THiCweed can cluster 5,000 such sequences in about 5-10 minutes, or 30,000 such sequences in about 1-2 hours on a modern (eg Intel i7 based) desktop machine. Significantly longer input sequences or larger numbers of sequences will increase the running time, which also depends on the nature of the input (i.e. the complexity of the motif structure in the sequences and so on).


thicweed.jl [options] filename
Positional arguments:
filenamesequences to be clustered, in fasta format
Optional arguments:
-l, --length LENGTHlength of window size to be aligned within longer sequence (default: 2/3 median input sequence length) (type: Int64)
-S, --shiftdo NOT consider shifted versions of sequences (default: consider shifts; if disabled also disables revcomp)
-d, --debugprint verbose debugging message
-R, --revcompdo NOT consider reverse complements of input sequences (default: consider them if using shifts, not otherwise)
-r, --randlim RANDLIM rand-index threshold (reject splits that are less consistent than this) (type: Float64, default: 0.66)
-m, --minclustsize MINCLUSTSIZE minimum cluster size (type: Int64, default: 20)
-i, --infthres INFTHRES consider only positions with at least this information content when splitting (0.0 = consider all positions; default / negative values: automatic choice per cluster) (type: Float64, default: -1.0 = automatic). Automatic means 0.005 times the maximum information content at any position in the window. While this was introduced as a speed optimisation, it also appears to improve accuracy.
-N, --maxnclust MAXNCLUST maximum number of output clusters (0 = no limit) (type: Int64, default: 25). On many ChIP-seq datasets, the number of output clusters can be very large, and this seems both statistical significant and biologically relevant: see the manuscript. Recommendation: leave this at default value (25)or even more, and then use the separate program "recluster.jl" to reduce the number of clusters if desired.
-M, --maxsplitclusters MAXSPLITCLUSTERS maximum number of clusters to initially split into (0 = no limit) (type: Int64, default: 500). On some pathological datasets the number of clusters produced proliferates, and in the interest of running time an upper limit is desirable. They are re-merged to MAXNCLUST (above) before output.
-o, --out_prefix OUT_PREFIX prefix of output files (if prefix is "output", files and output_archs.txt will be generated) (default: same as input filename). The file contains weight matrix representations of all clusters, in Transfac format. The output_archs.txt file contains the cluster number, fasta sequence ID, sequence within window, shift and reverse complement information, and score within cluster. The format is based on the output format of NPLB.
-h, --helpshow this help message and exit

Helper scripts

Convert output files to sequence logos in SVG (scalable vector graphics) format. The resulting files can be viewed in Firefox (Google Chrome sometimes shows glitches but firefox seems to work perfectly), or edited/converted in vector graphics programs like Inkscape. Run them with -h for more help. Both are written in python 2.

Back to my homepage
Back to institute homepage