PHYLOGIBBS

NAME
SYNOPSIS
DESCRIPTION
ESSENTIAL OPTIONS
ADVANCED OPTIONS
BUGS
SEE ALSO
AUTHORS

NAME

PhyloGibbs v1.0 − Gibbs motif sampler incorporating phylogeny and tracking statistics.

SYNOPSIS

phylogibbs [options]

DESCRIPTION

phylogibbs searches for motifs in multiple DNA sequences, each of which may be prealigned with homologous sequence from related species. It uses Gibbs sampling with an anneal+track strategy that finds multiple motifs simultaneously and reports their significance. The user needs to specify the phylogenetic tree of the "proximities" between the species that occur in the input multiple alignmnents (see examples below).

This manual page describes the program’s command-line options. For the internal philosophy and description of the algorithm, see the phylogibbs_algorithm(7) manpage. Below is a quick summary of the most important options, followed by more detailed descriptions of all options in alphabetical order.

Input sequences are read from a fasta file specified by the −f input_file option. Sequences that are not aligned with any other sequences may include the bases A, C, G, T, degenerate IUPAC symbols, and the symbol X (blocking sites from occurring at that position). The case of the symbols in unaligned sequences is ignored. If sequences are phylogenetically related, they must be in multi-fasta format, as output by Dialign or other multiple alignment programs; they may contain uppercase or lowercase nucleotide symbols A, C, G, T, dashes (-) to indicate gaps, and ’X’ or ’x’ to indicate a blocked site. Vertically aligned uppercase letters are treated as aligned and scored taking their phylogenetic relationship into account. Lowercase letters are considered unaligned with other sequences.

To run on multiple sets of multiple alignments, e.g. multiple alignments of orthologous regulatory regions from the same set of species, the different alignments are to be separated by using a double tick ">>" instead of a single tick ">" on the header of the first species in each set. For example:

>>SpeciesA region 1
ACTGACataAGCAGAC...
>SpeciesB region 1
AGTCAC--aATCAGAT...
>>SpeciesA region 2
CATCACATA--agaat...
>SpeciesB region 2
CATCGCAAAaagcata...

Here, the two region 1 sequences from species A and species B are aligned with each other, but are not aligned with the two region 2 sequences from species A and B, because of the separator ">>".

The interpretation of the input sequence is controlled by the −D option; 0 ignores phylogeny and treats all sequences as unaligned, 1 and 2 both assume phylogeny but with different degrees of strictness regarding which alignment blocks can be considered candidate locations for binding sites. The relation of the species to their common ancestor must be specified using the -L or the −H option as described below (see also the advanced −G option).

Per default the best configuration of binding sites (obtained by simulated annealing) is written to the file "output" and the posterior probabilities of binding sites (obtained by tracking) are written to "tracked_output". These filenames can be changed via the −o output_file and −t tracking_file options. The width of the sites is the same for all motifs and is set via −m motifwidth. The total number of sites and the number of different motifs in the initial state is provided by −I -- for example, −I 4,4,4 specifies three motifs ("colours"), with a total of 12 sites initially distributed equally over the motifs, i.e. 4 in each color. By default the total number of motifs and sites remains fixed throughout the run. For most purposes these options will suffice to provide meaningful results. Fine-tuning is provided by a host of other options described in alphabetic order below; for many of these, proper usage will require an understanding of the internal strategy, as described in phylogibbs_algorithm(7).

ESSENTIAL OPTIONS

These options must be understood in order to use Phylogibbs effectively. Advanced options are described in the next section. All options have short and long versions; use either version.

−D, −−dialign num

num specifies the "alignment level" of the input sequence and may be 0, 1, 2. If 0, assume input sequences are not aligned: this is the default. If 1, assume they are aligned (Dialign-style aligned fasta, or multi-fasta), and treat inconsistent windows "flexibly". That is, when gaps occurs within a multi-species aligned window, separate it into smaller windows (that is, windows containing fewer species) that each contain mutually gapless subsets of species. If 2, assume aligned sequences and be "strict" about windows containing gaps: reject them (never place sites at their positions).

−F, −−bgfile filename

File (fasta format) to use for estimating the background distribution. In typical cases, this file would consist of large amounts of intergenic sequence from the species of interest. Default: use the input file to calculate the background distribution.

−f, −−inputfile filename

Name of the input sequence file, in fasta or aligned-fasta format. Default: none (must be supplied)

−I, −−initialocc nums

nums is a comma-separated list (no spaces) of integers specifying the initial number of windows per motif ("colour"): eg, 15,7,8 will put 15 windows in colour 1, 7 in colour 2, 8 in colour 3 initially. Window-shift moves preserve the total number of windows (30 in this case) and the number of colours (3) but may redistribute windows between colours. Colour-change moves preserve neither the total number of windows nor the number of motifs, but these are switched off by default (see −c in the advanced options). When colour-change moves are switched on the values specified through -I will be interpreted as the a priori expected number of motifs and sites.

−L,−−labeltree string

This is required to specify general phylogenies ("star topology" phylogenies can be specified with -H, see below). The supplied string specifies the phylogenetic tree and must be complete and correct. The tree format that we use is the standard Newick format. However, the values that are normally branch lengths are instead "proximities" in our format. A proximity gives the probability that no mutation has taken place along the branch. The names labelling the leafs of the tree should correspond to a substring that occurs in all fasta-headers of those input sequences that derive from this species (and that does not occur in the fasta-headers of sequences from other species).

For example, the following phylogeny

                       0.85
                     -------- human
              0.6   |
          ----------+
         |          |  0.9
         |           -------- chimp
Ancestor +
         |             0.8
         |           -------- mouse
         |    0.7   |
          ----------+
                    |  0.9
                     -------- rat

would be written as
−L "((human:0.85,chimp:0.9):0.6,(mouse:0.8,rat:0.9):0.7)"

(the string is quoted to protect the parentheses from the shell). Note that the proximities are between successive nodes or leaves: eg, 0.9 for chimp indicates that, for a given neutrally evolving base in chimp, the probability of no substitutions in this base since the common ancestor of chimp and human is 0.9. Proximities are multiplicative: for example, if the set of aligned sequences in the input does not actually contain the sequence for human, the human-chimp node will be eliminated and chimp’s proximity to the common ancestor of all species in the tree will be 0.6x0.9 = 0.54. Thus, a single tree may be used that includes species not occurring in the input data.

−m, −−motifwidth num

Width of motifs to search for (integer). Default: 10.

−N, −−ncorrel num

Order of the Markov model used for the background probabilities. -1: use background probabilities of 0.25 for each base. 0: use single base counts. 1 or more: condition the probability of a background base on the num bases immediately preceeding it. Default: 1. See also the −F and −P options.

One can also supply a comma-separated list of four floats instead of a single integer num as an argument. In that case, these four values (which will be automatically normalised) will be used as single-base background probabilities. If this list is ill-formed, a flat 0.25 will be assumed, with a warning on stderr.

−o, −−outputfile filename

Write output of anneal phasse to filename. This file contains the maximal-a-posterior configuration of binding sites that will be used as a reference state during the tracking phase. If filename is "stdout", output is written to standard output instead. The default filename is "output".

−S, −−nsteps num

Total number of steps in the tracking phase. This is the main option controlling the total running time of the algorithm. There are generally four phases to a run of the algorithm: A transient (warmup) phase, an annealing phase, a deep quench phase, and a tracking phase. By default the anneal phase is the same length as the tracking phase, the transient phase is 10% of this length and the deep quench phase is 3% of this length. The length of all these phases can be controlled independently with advanced options described below. Default: 100

−t, −−trackedoutput filename

File to which posterior probabilities for individual sites and the estimated weight matrices are written at the end of the tracking phase (unless this phase is disabled with −X). The default filename is "tracked_output".

ADVANCED OPTIONS

Use these options to fine-tune the working of PhyloGibbs in various ways.

−A , −−trackfile filename

By default PhyloGibbs uses the maximum-a-posterior binding site configuration found through simulated annealing as a reference state for the tracking phase. With option -A the algorithm instead uses the binding site configuration in the file filename as the reference state. Each line in filename specifies a window, i.e. the location of a binding site, by a set of three white-space separated integers: the sequence number in the input file (starting from zero), the start position of the window on that sequence (starting from zero), and the cluster label number (starting from one). In case of aligned sequences (−D), it is sufficient to give the seq/start pair for one of the aligned sequences. For example, a file containing

0  13  1
1  57  1
2  40  1
0  33  2
1  45  2
2  11  2

specifies two motifs to be tracked: one with sites starting at positions 13, 57 and 40 on sequences 0, 1, 2, respectively, and the other with windows starting at 33, 45, 11 on the same sequences. Specifying −A skips the simulated anneal and deep quench phases.

−a, −−nanneal num

Do num steps of the simulated anneal phase (excluding transient and final deep quench). Default: same as number of tracking steps (specified via −S). See also the −u, −g, −S options.

−B, −−blockedfile filename

Block all windows overlapping sequence stretches specified in filename: each line in the file should contain, as white-space-separated integers, a sequence number (starting from zero), the starting position on that sequence (starting from zero), and the number of bases starting at that position to be blocked. All windows overlapping those bases will be ignored, i.e. not sampled by the program. For example, a file containing

0 34 5
2 21 6

specifies that bases 34 through 38 on sequence 0, and bases 21 through 26 on sequence 2, are to be blocked. Additionally, any occurrence of a base "x" or "X" in the input sequence is also regarded as a blocked site.

−b , −−beta value

Sets the initial inverse temperature to value (a positive floating point number). 1 by default. It is unlikely that one would need to change this parameter. See also the −x option.

−C, −−rcsymmetric

Assumes the motif weight matrices are reverse-complement-symmetric. Effectively, the second half of the weight matrix is assumed to be the reverse-complement of the first half. If the specified motif width is of odd length, the middle column is ignored. Default: turned off. This option is not extensively tested.

−c, −−ncolmoves num

Do num "colour change moves" per step (ie, adding, removing or recolouring a window). As a special case, the value -1 means an automatic value will be picked. Default: zero. See also the −w, −s, −k options. Turning on colour-changing moves allows the total number of sites and motifs to change and a prior distribution over the expected number of sites and motifs is then needed. By default the values given in the -I option specify the expected number of sites and motifs. These can be overruled by the options -y and -z (see below).

−E,−−trackingcutoff float

A cut-off (between 0 and 1) for printing sites in the tracked-output file: sites with posterior probability less than this value will not be printed. Default: 0.05.

−G, −−phylohistory value

When running on aligned sequences, and assuming a star topology of the phylogenetic tree, this value is used as a default proximity for all the species. Thus, when neither -L nor -I are used, the program assumes the phylogenetic tree is a star topology and that all species have proximity value to the common ancestor. When -H is used (see below) then proximity value is used whenever there are more sequences in a sequence group than specified in the -H option. Note that this option is ignored when -L is used.

−g, −−ndeepquench num

Do num steps of the deep quench phase. Default: 3% of the number of simulated anneal steps, but at least two. See also the −u, −a, −S options.

−h, −−help

Print a brief summary of options and exit.

−H, −−phylohistlist values

Specifies a phylogenetic tree with star topology for the species from which the input sequences derive. This option only applies when: all groups of aligned sequences have sequences from the same set of species in the same order. values is a comma-separated list (no spaces) of the proximities (probability of no mutation along branch from ancestor to leaf) of each species. That is, with -H 0.3,0.2,0.5,0.7, 0.3 is the proximity of the species from which the first sequence in the input file derives; 0.2, the proximity of the species of the second sequence; and so on. All sequences must be covered, otherwise an error occurs.

The same values will work if the input has multiple groups of aligned orthologous sequences and each group has sequences from the same set of species in the same order. If the input has multiple groups of aligned homologous sequences with different species in different groups, or if the order of the species differs between groups, then the more general −L option needs to be used.

−i, −−initfile filename

Name of file containing a configuration of binding sites to be used as an initial state. This initial configuration will be printed to a file called initial_output. File format is similar to the −A option, except that a fourth column, specifying the "direction" (strand) of the window, is required and must be 0 or 1. For example,

0  13  1  0
1  57  1  1

specifies that windows starting on sequence 0, position 13 and sequence 1, position 57 should initially be given colour 1, with directions 0 and 1 respectively (ie, on opposite strands). See also −I. Per default a random initial configuration is used.

−k, −−nmaskmoves num

Per default all positions in binding sites are assumed to derive from a common weight matrix. When maskmoves are turned on the algorithm will also sample over "masking" positions in the sites. That is, the bases at masked positions of sites are scored according to the background model. num gives the number of maskbit-flip moves per cycle. Default: 0 (no masking). As a special case, the value -1 means the number of mask moves will be automatically chosen. See also the −c, −w, −s options.

−M −−motiffile filename

Using this option PhyloGibbs can search for sites for motifs for which some sites are already known, i.e. an initial weight matrix is available. filename contains a set of weight matrices in the same format as in the output files that PhyloGibbs produces, which is also the format used by TRANSFAC and MEME. The width of the WMs should be at least as long as the motif width given in option −m (for longer WMs the last columns are ignored) and the initial state −I should have at least as many colors as there are WMs in filename. Usage of color-changing moves (see option -c) in combination with -M is currently not well-tested.

−P, −−bgpscount value

Weigh background probabilities with Markov order greater than zero with a "pseudocount" of single-site probabilities, with weight value (useful when the input sequence is not long enough to calculate truly reliable correlated counts but one wants to use them anyway). Default: 0.0. See also the −N and −F options.

−p, −−chempot value

Use the specified value as a "chemical potential" when sampling: for each window this value is subtracted from the logarithm of posterior probability of the configuration and should generally be a non-negative float. It is used to control the number of windows when colour-changing moves are used. Default: 0. Note that this option has been largely superseded by the more practical -x and -y options (see below).

−q, −−quiet

Run quietly (no output to stdout; errors and warnings are still printed on stderr.)

−R, −−reverseprint

Print starting positions of sites not from the start of the input sequences, but as a negative number from the end of input sequences (ie, the last base on the input sequence is numbered -1). For example, this may be the distance from the start of the gene.

−r, −−norevcomp

Do not search for reverse-complement matches (ie, do not search on the "other strand"). Default: Search on both strands.

−s, −−nshiftmoves num

Number of global-shift moves per step. In a global-shift move all windows in a colour are shifted to the left or right by the same amount. Default: twice the number of colours in the initial configuration. See also −c, −w, −k.

−T, −−pseudocount value

Value of pseudocount (of the Dirichlet prior) to be used in the weight-matrix integral (generally between zero and one). Default: 1.0, which corresponds to a uniform prior over the space of WMs. Use smaller values when WMs with significant information scores in each column are expected.

−u, −−ntransient num

Do num step of the transient equilibriation phase, before the simulated anneal and/or before tracking. Default: 10% of the number of steps in the simulated anneal. See also the −a, −g, −S options.

−v, −−verbose

Print verbose information while running (unlike −q which suppresses all output to stdout). Of −v and −q, the last to be specified applies.

−W, −−write-each-cycle

Write the output and tracked-output (see −o and −t) after every step, instead of just at the end of the anneal and tracking phases respectively.

−w, −−nwinmoves num

Number of window-shift moves per step. Default: Equal to the total number of windows in the initial configuration. See also −c, −s, −k.

−X, −−noautotrack

Disable tracking: quit after the simulated-anneal and deep quench phase have finished. The default is to due a tracking phase as well.

−x, −−betaincr value

Amount by which to increase fictitious inverse temperature (beta) at each step in the simulated-annealing phase. Must be positive. Default: chosen automatically.

−y, −−nexpwin num

Specifies the total number of sites expected a priori in the data. Used only when colour-changing moves (see -c option) are turned on. The value num is used to set a maximal entropy prior over the space of binding site configurations. Default: inferred from the values specified with the -I or -i options. NOTE: The value num refers to the total number of windows when running on aligned sequences. That is, a site occurring at the same location in the alignment in multiple species counts as a single site.

−z, −−nexpcol num

Specifies the total number of different motifs (colours) expected a priori in the data. Used only when colour-changing moves (see -c option) are turned on. The value num is used to set a maximal entropy prior over the space of binding site configurations. Default: inferred from the values specified with the -I or -i options.

BUGS

Improperly formatted input fasta files may crash the program or fail silently with mysterious results. In particular, CR-delineated files (Mac-style) will not work; convert them to Unix-style LF-delineated files. DOS/Windows-style CR-LF files will probably work but are not regularly tested.

The program is sprinkled with asserts that (if compiled with -DDEBUG) should catch inconsistencies early and terminate execution rather than fail silently. If you come across such an assert error, or other bugs, or have a feature wishlist, send email to the authors, preferably including the complete commandline, the GSL random number seed (which are both printed in the output file), and the offending input file.

SEE ALSO

The phylogibbs_algorithm(7) manpage for internal details of the algorithm and further references.

AUTHORS

The PhyloGibbs algorithm was developed during 2002-2005 by Rahul Siddharthan <rsidd@remove-this.imsc.res.in>
Erik van Nimwegen <erik.vannimwegen@remove-this.unibas.ch>
Eric D. Siggia <siggia@remove-this.eds1.rockefeller.edu>
The code was written by Rahul Siddharthan and Erik van Nimwegen.