"THiCweed: fast, sensitive motif finding by clustering big data sets", Ankit Agrawal, Leelavati Narlikar, Rahul Siddharthan
(ENCODE data used in paper)
NEW 14/05/2017: experimental webserver for submitting jobs. If it doesn't quite work yet, be patient and contact me with bugs...
NEW 14/05/2017: Collection of predictions on ENCODE chip-seq data
THiCweed is written in the Julia language, and requires version 0.5.0 or later. It requires the ArgParse module. If you don't already have it, install Julia 0.5.0+ for your operating system (following instructions here) for example).
To install the ArgParse module: start Julia up on the command line (type "julia"), and at the prompt type
julia /path/to/thicweed.jl -hfor help. More detailed help is below.
THiCweed takes as input a single fasta-formatted file containing multiple sequences to be clustered on similarity. The sequences may be of different lengths. The clustering considers similarities within sliding "windows" of a fixed size, by default two-thirds the median length of the sequences. It considers all possible "shifts" of the windows within the sequences as well as partially out of the sequences (up to half the window may slide outside the known sequence; unknown nucleotides are treated as "N") and reverse compleents. Optionally, the input sequences may be pre-aligned by a given motif and shift moves and reverse complements disabled; this would then search known motifs for additional "structure" and is much faster than the normal operation. This is similar to what NPLB does.
It is recommended for efficient running that input sequences be about 100bp in size (for example, the ChIP-seq peak position plus and minus 50bp). THiCweed can cluster 5,000 such sequences in about 5-10 minutes, or 30,000 such sequences in about 1-2 hours on a modern (eg Intel i7 based) desktop machine. Significantly longer input sequences or larger numbers of sequences will increase the running time, which also depends on the nature of the input (i.e. the complexity of the motif structure in the sequences and so on).
thicweed.jl [options] filename
|filename||sequences to be clustered, in fasta format|
|-l, --length LENGTH||length of window size to be aligned within longer sequence (default: 2/3 median input sequence length) (type: Int64)|
|-S, --shift||do NOT consider shifted versions of sequences (default: consider shifts; if disabled also disables revcomp)|
|-d, --debug||print verbose debugging message|
|-R, --revcomp||do NOT consider reverse complements of input sequences (default: consider them if using shifts, not otherwise)|
|-r, --randlim RANDLIM||rand-index threshold (reject splits that are less consistent than this) (type: Float64, default: 0.66)|
|-m, --minclustsize MINCLUSTSIZE||minimum cluster size (type: Int64, default: 20)|
|-i, --infthres INFTHRES||consider only positions with at least this information content when splitting (0.0 = consider all positions; default / negative values: automatic choice per cluster) (type: Float64, default: -1.0 = automatic). Automatic means 0.005 times the maximum information content at any position in the window. While this was introduced as a speed optimisation, it also appears to improve accuracy.|
|-N, --maxnclust MAXNCLUST||maximum number of output clusters (0 = no limit) (type: Int64, default: 25). On many ChIP-seq datasets, the number of output clusters can be very large, and this seems both statistical significant and biologically relevant: see the manuscript. Recommendation: leave this at default value (25)or even more, and then use the separate program "recluster.jl" to reduce the number of clusters if desired.|
|-M, --maxsplitclusters MAXSPLITCLUSTERS||maximum number of clusters to initially split into (0 = no limit) (type: Int64, default: 500). On some pathological datasets the number of clusters produced proliferates, and in the interest of running time an upper limit is desirable. They are re-merged to MAXNCLUST (above) before output.|
|-o, --out_prefix OUT_PREFIX||prefix of output files (if prefix is "output", files output_wms.tr and output_archs.txt will be generated) (default: same as input filename). The output_wms.tr file contains weight matrix representations of all clusters, in Transfac format. The output_archs.txt file contains the cluster number, fasta sequence ID, sequence within window, shift and reverse complement information, and score within cluster. The format is based on the output format of NPLB.|
|-h, --help||show this help message and exit|