NEW (March 22, 2010): Sigma 2.0-BETA available, for the brave. Go here.
NEW (March 16, 2009): Sigma 1.1.3 released.
This version contains
an improved local-alignment algorithm (linear in space, though quadratic in time),
due to Gayathri Jayaraman. The "pre-fragmentation" option ("-l") in versions
1.1 through 1.1.2 has been removed. There is also a bugfix (an off-by-one error).
See the changelog in the source file for details.
Sigma 1.0 was written in OCaml (why OCaml?).
Sigma 1.1 (April 3, 2007) was rewritten in C (why not OCaml?), is more
correct (some bugs and shortcomings have been fixed) and can handle
much larger datasets. It is also significantly faster: generally many
times faster than version 1.0 and than other programs such as Dialign or
ClustalW. For details, see the README file in the source distribution. Sigma 1.1.1 and 1.1.2 were
minor updates, while 1.1.3 includes significant improvements and a bugfix.
Sigma
Most tools for multiple-sequence alignment are focussed on aligning
protein sequence or protein-coding DNA sequence. Sigma ("Simple greedy
multiple alignment") is an alignment program with a new algorithm and
scoring scheme designed specifically for non-coding DNA sequence.
Aligning such sequence is growing in importance with the increasing number
of fully sequenced species. In particular, studies of gene regulation
seek to take advantage of comparative genomics, and recent algorithms
(such as PhyloGibbs) for finding regulatory sites in phylogenetically
related intergenic sequence require alignment as a preprocessing step.
Tests on synthetic data generated to mimic real data show excellent
performance, with Sigma showing much greater "sensitivity" (more
bases aligned) and fewer "incorrect" alignments. Results on real
data are harder to quantify, but PhyloGibbs performs well on
Sigma-generated alignments.
References:
Rahul Siddharthan, "Sigma: multiple alignment of
weakly-conserved non-coding DNA sequences", BMC
Bioinformatics 7:143 (2006)
The code (Version 1.1.3: March 16, 2009): (Upgrade from
previous versions recommended!)
- Source code, written in C;
compilation requires the "glib" library (including headers), available on most
Linux systems and many other Unix systems. It should also compile on
Mac OS X, or under Cygwin on Windows. See "README"
and "COMPILING" in the source distribution.
- Linux (x86) static binary (should work
on most Linux x86 32-bit systems and on BSD systems with Linux emulation)
- Linux (AMD64) static binary (should
work on most Linux x86 64-bit systems: AMD Opteron, Athlon etc, as
well as Intel systems with the EM64T extensions)
You may need to make the binary executable: type "chmod +x sigma".
Binaries for other platforms may be made available later, or
you are welcome to contribute them.
You may redistribute these binaries and source under the terms of the
GNU General Public License, version 2. (Short inexact summary: you may
use them privately as you like, modify them, distribute them; if you
distribute modified binaries, you must distribute the corresponding source
on request, also under the GNU GPL.)
If you're using Sigma for actual
research, please let me know so that I can alert you to bugfixes or new
releases.
Help with the program
The program has only a few command-line options, and running
it with the "-help" option (or with no option or an invalid option) produces a
help summary. There is also a manpage. Output goes to standard output, which may be redirected
to a file or piped through some other command. Here are some key
points:
- The "significance" parameter (-x) determines whether local alignments
are accepted or rejected. The default at present is 0.002.
Experiments on synthetic
data (described in the paper) suggest that 0.002 is about the
threshold at which Sigma stops aligning phylogenetically unrelated data
that has moderate (yeast-like) dinucleotide correlation.
- Using a "background model" appropriate to the sequences being aligned
greatly reduces spurious alignments on synthetic data (and, one hopes,
on real data too). The simplest way to ensure this is to supply, via
the -b parameter, a FASTA-format file containing large quantities of
similar sequence data (e.g., if one is aligning yeast sequences, supply
a file containing all intergenic yeast sequence).
- Instead of this, if the single-site and dinucleotide frequencies are
known already, they may be supplied in a file via the -B option. The
file format should be: one entry per line, with the mononucleotide or
dinucleotide (case-insensitive) followed by the frequency (e.g., "A
0.3", "AT 0.16", etc. on successive lines). A sample is in the
"Background" subdirectory of the source distribution. A file like
"yeast.nc.3.freq" in the "tests" subdirectory of the MEME
source distribution works fine (trinucleotide counts are ignored).
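The file format for the -B option described above can be produced with a short script. The following is a minimal sketch, not part of Sigma: the function names are hypothetical, and it assumes that mononucleotide and dinucleotide frequencies are each normalised to sum to 1, with dinucleotides counted in overlapping windows. Compare its output against the sample in the "Background" subdirectory before relying on it.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def read_fasta(lines):
    """Yield upper-cased sequences from an iterable of FASTA lines."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks).upper()
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks).upper()


def background_frequencies(sequences):
    """Return (mono, di) frequency dicts, each normalised to sum to 1.

    Dinucleotides are counted with overlap and never across sequence
    boundaries; windows containing a non-ACGT character are skipped.
    Assumption: this is the normalisation that Sigma expects.
    """
    mono, di = Counter(), Counter()
    for seq in sequences:
        mono.update(c for c in seq if c in BASES)
        di.update(a + b for a, b in zip(seq, seq[1:])
                  if a in BASES and b in BASES)
    n_mono = sum(mono.values()) or 1  # avoid division by zero on empty input
    n_di = sum(di.values()) or 1
    mono_freq = {b: mono[b] / n_mono for b in BASES}
    di_freq = {a + b: di[a + b] / n_di for a, b in product(BASES, repeat=2)}
    return mono_freq, di_freq


def format_background(mono_freq, di_freq):
    """One entry per line: the (di)nucleotide, a space, then its frequency."""
    entries = list(mono_freq.items()) + list(di_freq.items())
    return "\n".join(f"{name} {freq:.6f}" for name, freq in entries) + "\n"
```

To use it, pass the lines of a FASTA file to read_fasta, feed the resulting sequences to background_frequencies, and write the string returned by format_background to a file that is then given to Sigma via -B.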
Limitations in version 1.0 (addressed in 1.1, see README in source
distribution):
- When aligning many weakly-conserved sequences, some regions may
not get aligned (thanks to Karin Kiontke for first alerting me about
this). This is because of the crude way that mismatches are handled
in version 1.0 (mentioned in the paper). In version 1.1, mismatches
are handled better, with much improved results in many cases. In some
pathological cases, where many sequences are being aligned and there
are many mismatches that are not correlated in position across
sequences, Sigma 1.1 may still go astray. Further improvements may
come in future versions.
- Aligning very long sequences in Sigma 1.0 requires a lot
of memory, due to the crude implementation of a Smith-Waterman-type
local-alignment algorithm. Sigma 1.1.3 does local alignments in linear
memory.
- There was a bug in updating "sequence limits", which meant some
allowed alignments weren't considered (however, I believe that no
illegitimate alignments were made).
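The memory point in the list above (a full Smith-Waterman-type table versus linear memory) can be illustrated with a minimal sketch. This is not Sigma's implementation: the function name and the scoring values are illustrative assumptions, and Sigma's own scoring scheme differs. The sketch computes only the best local-alignment score, keeping two rows of the dynamic-programming table, so memory grows linearly with sequence length while time remains quadratic.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local-alignment score of sequences a and b.

    Only the previous and current rows of the table are stored, so
    memory is O(len(b)) while time stays O(len(a) * len(b)).
    The scoring parameters are illustrative, not Sigma's.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]  # column 0: alignment against the empty prefix of b
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

Recovering the aligned positions themselves, rather than just the score, in linear memory needs extra bookkeeping (for example a Hirschberg-style divide-and-conquer pass), which is omitted here.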
If you are interested in version 1.0 anyway, the archived source code
is here.
For any further information, contact me.
Rahul Siddharthan
The Institute of Mathematical Sciences, Chennai