Aim: To propose a mechanism for the origin of long-range correlations in a chain consisting of 2 species of monomers, purines (P) and pyrimidines (Y), during its growth, starting from an initially randomly assembled monomer chain of small length.
Motivation: I was motivated by the question of how the genome grew in size as organisms evolved - from bacteria to human, the genome has increased by 3 - 4 orders of magnitude - such that power-law correlations developed in the DNA. I would like to focus on the so-called ``junk'' part of DNA - whose growth was presumably not affected by natural selection. In contrast, the ``coding'' portion of the DNA should be extremely responsive/susceptible/sensitive to natural selection, as any deviation would affect the fitness and hence the survivability of the species in the competition for resources. Note that the non-coding proportion of the DNA seems to have grown roughly with genome size and evolutionary complexity - the bacteriophage has a 100% coding portion, whereas in humans 97% of the DNA is non-coding. So we are not wrong if, in considering the growth in size of DNA, we consider only the non-coding portion.
The importance of power-law correlations (or ``1/f noise'') in DNA: ``. . . increasing (long-range fractal correlations) increases the recovery probability from a DNA error but decreases the information content per unit length . . . A white noise represents the maximum rate of information transfer, but a 1/f noise, with its scale-independent correlations, seems to offer the best compromise between efficient information transfer and immunity to errors on all scales'' (R. F. Voss, Phys. Rev. Lett. 68 (1992) 3805.)
Biological evidence for growth of genome size: Measurement of the nucleic acid content of the genome of viruses, bacteria and lower & higher eukaryotes has led to the following generalization:
Genome size increases roughly with evolutionary complexity.
[Table: genome sizes (in kbp).
Viruses (from about 5 kbp): SV40, PhiX174, M13, Lambda, Herpes Simplex, T2/T4/T6, Smallpox.
Bacteria: Mycoplasma hominis (760), E. coli (4,700).
Eukaryotes (from about 14,000 kbp): Yeast, Nematode, Fruit Fly, Man, Salamander.]
(Source: D. L. Hartl, Basic Genetics, 2nd edition, London: Jones & Bartlett, 1991, p. 122)
Q. What are the different categories of ``junk'' DNA ?
At least nine classes of junk DNA are now recognized: Introns, satellites,
minisatellites, microsatellites, 3' untranslated regions, heterogeneous
nuclear RNA, short interspersed elements, long interspersed elements, and
pseudogenes. (Source: Tim Standish, who argues that ``non-coding'' does not necessarily mean ``junk'' DNA - rather, these segments may have a structural role.)
Q. How much of specific genomes is introns?
T4 has three introns in about 200 kbp. Yeast has 221 introns in a genome
of size 12,067 kbp (with 6,294 open reading frames, ORFs). In contrast, yeast
mitochondria have 14 introns, 33 ORFs, and a size of 79 kbp. A good exercise is
to compare the genome size with the hypothesized number of genes, since
a good way to look at intron content is to count the introns
per gene. In compact genomes there is not much ``junk'' DNA - which
would include spacer sequences, repetitive elements and introns - so these
genomes have relatively low numbers of introns per gene. As the genome
size increases, a lot more junk DNA, including introns, is present. Comparing
C. elegans to humans: C. elegans has 130 genes/Mb, and humans have 27 genes/Mb.
This may be because there is a high proportion of junk DNA in humans,
reflected in a proportionally increased number of introns per gene in the
genome. In summary, as genome size increases with more junk DNA, the more
introns will be seen in genes. (source)
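The comparison suggested above can be sketched directly from the numbers already quoted (a quick computation using only the figures cited in the text, not new data):

```python
# Intron density and gene density from the figures quoted above.
genomes = {
    # name: (introns, ORFs, size in kbp)
    "yeast nucleus":      (221, 6294, 12067),
    "yeast mitochondria": (14,  33,   79),
}

for name, (introns, orfs, kbp) in genomes.items():
    print(f"{name}: {introns / orfs:.3f} introns/ORF in {kbp} kbp")

# Gene densities quoted in the text (genes per Mb):
c_elegans, human = 130, 27
ratio = c_elegans / human
print(f"C. elegans is ~{ratio:.1f}x denser in genes than human")
```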
Q. What is the dynamics of the model ?
Start from a randomly assembled string, apply the mutation operators, and
look at the distribution of lengths of sub-sequences of only Ps (or only Ys) as the string
evolves under repeated application of those operators. See whether a steady-state
distribution is reached. Note under what conditions (i.e., for what set of
values of the mutation operators) a power-law distribution is observed.
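This loop can be sketched minimally as follows, using a single transversion-like flip as a placeholder mutation operator (the operator and the probability q here are illustrative stand-ins, not the model's final choices):

```python
import random
from collections import Counter

def run_lengths(s):
    """Lengths n of maximal sub-sequences of identical symbols (all P or all Y)."""
    runs, count = [], 1
    for a, b in zip(s, s[1:]):
        if a == b:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return runs

random.seed(0)
chain = [random.choice("PY") for _ in range(10000)]

# Evolve under a placeholder operator: each site undergoes a transversion
# (P <-> Y) with probability q per sweep; track the run-length histogram.
q = 0.01
for step in range(100):
    chain = [("Y" if c == "P" else "P") if random.random() < q else c
             for c in chain]

hist = Counter(run_lengths(chain))
for n in sorted(hist)[:5]:
    print(n, hist[n])
```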
Categories of mutation:
Point Mutations:
  - Base substitution
      - Transition (replacement of P by P / Y by Y)
      - Transversion (replacement of P by Y / Y by P)
  - Missense
  - Nonsense
  - Frameshift
Large Mutations:
  - Deletion
  - Insertion
  - Duplication
  - Inversion
(Source: Encyclopaedia of Molecular Biology 3 (Wiley, 1999) 1553.)
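The four ``large'' operators in the table can be sketched as string operations on the P/Y chain (a sketch; positions i and segment lengths n are left to the caller). One modelling choice to flag: a biological inversion reverse-complements the segment, and since complementation swaps purine and pyrimidine, one could additionally flip P <-> Y inside the reversed segment; plain reversal is used here.

```python
def deletion(s, i, n):
    """Remove the segment s[i:i+n]."""
    return s[:i] + s[i + n:]

def insertion(s, i, seg):
    """Insert the string seg before position i."""
    return s[:i] + seg + s[i:]

def duplication(s, i, n):
    """Tandem-duplicate the segment s[i:i+n]."""
    return s[:i + n] + s[i:i + n] + s[i + n:]

def inversion(s, i, n):
    """Reverse the segment s[i:i+n] in place.
    (A biological inversion would also swap P <-> Y; see note above.)"""
    return s[:i] + s[i:i + n][::-1] + s[i + n:]

s = "PPYPYYYP"
print(deletion(s, 2, 3))     # "PPYYP"
print(duplication(s, 2, 3))  # "PPYPYYPYYYP"
print(inversion(s, 3, 3))    # "PPYYYPYP"
```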
Q. Of these mutation operators, which ones shall we include in our
model ?
A. Not transition, missense, nonsense or frameshift. Missense, nonsense
and frameshift mutations affect the functional role of DNA (mainly, they
change a STOP codon to a non-STOP codon, and vice versa, or cause some
codon triplets to be skipped) and not the long-range correlation property -
which is our object of study. Transition is excluded for a different reason:
since we consider only P and Y, a transition (P replaced by P, or Y by Y)
leaves the string unchanged, so our model cannot register it.
Q. Do we consider a lattice model or a single chain model ?
RANDOM THOUGHTS (18/2/2000):
To find how the above mutation operators, acting on a randomly assembled
string of Ps and Ys, make it power-law correlated: from an initial exponential
(Poisson) distribution (which should be observed in the case of a small
genome - say, in bacteria, where all segments are coding) to, finally,
a power-law distribution P(n) of lengths n of
sub-sequences with only Ps or only Ys. This is opposite to the way an RG
transformation works - we start away from the critical point (where the
distribution of fluctuations is exponential, exp(- \lambda n),
with \lambda a function of temperature in the RG case and of time in our case,
going to zero as the critical point is reached) and approach the critical point
(where the distribution P(n) follows a power law with a characteristic
exponent). Further, just as blocking transformations decrease the length
of the spin chain as we go further and further away from the critical point,
here the mutation operators make the genomic chain longer and
longer as the critical point is approached. As the system dynamics here drives
the system towards the critical state (in some sense), this can be looked
upon as an example of Self-Organized Criticality (SOC). But note that we
must have some way of distinguishing P and Y in the way the operators act -
if everything is equivalent, we will not be able to create a non-random
(i.e., power-law correlated) distribution.
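The starting point of this picture - the exponential run-length distribution of an uncorrelated string - is easy to check numerically: for an unbiased random P/Y string, P(n) = 2^(-n), i.e. exp(- \lambda n) with \lambda = ln 2.

```python
import math
import random
from collections import Counter

random.seed(1)
s = [random.choice("PY") for _ in range(200000)]

# Collect lengths of maximal runs of identical symbols.
runs, count = [], 1
for a, b in zip(s, s[1:]):
    if a == b:
        count += 1
    else:
        runs.append(count)
        count = 1
runs.append(count)

hist = Counter(runs)
# Estimate lambda from the slope of log P(n) between n = 1 and n = 2.
lam = math.log(hist[1] / hist[2])
print(f"estimated lambda = {lam:.3f}, ln 2 = {math.log(2):.3f}")
```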
Q. How do we find out exactly what operations are necessary to generate
a scaling probability distribution ?
We look at how fractional Brownian motion (fBM) is generated from Brownian
motion (BM), because in a sense what we are aiming at is how to generate a random
fractal from a random string/walk. We can look upon the random string as
a random walk by identifying P with +1 (step up) and Y with -1 (step down).
If it were an exactly self-similar fractal, we could have used an IFS (maybe
we can use some ``randomized'' version of IFS?). In any case, this problem
of generating fBM from BM should give us an idea about which mathematical
operations are necessary for generating the random fractal - and guide
us in selecting the mutation operators that we essentially need. See
``Fractional Brownian motion as a nonstationary process: An alternative
paradigm for DNA sequences'' by Paolo Allegrini, Marco Buiatti, Paolo Grigolini
and Bruce J. West.
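The P -> +1, Y -> -1 mapping can be probed with a crude Hurst-exponent estimate: the RMS displacement F(l) of the walk over windows of length l scales as l^H, with H = 1/2 for ordinary BM (an uncorrelated string) and H != 1/2 for fBM-like long-range correlations. A sketch (the two-point slope estimate is deliberately crude):

```python
import math
import random

def walk(seq):
    """Cumulative walk: P is a step up (+1), Y a step down (-1)."""
    y, pos = [0], 0
    for c in seq:
        pos += 1 if c == "P" else -1
        y.append(pos)
    return y

def rms_fluct(y, l):
    """RMS displacement over non-overlapping windows of length l."""
    disps = [y[i + l] - y[i] for i in range(0, len(y) - l, l)]
    return math.sqrt(sum(d * d for d in disps) / len(disps))

random.seed(2)
seq = [random.choice("PY") for _ in range(100000)]
y = walk(seq)

# Two-point estimate of H from F(l) ~ l^H.
l1, l2 = 16, 256
H = math.log(rms_fluct(y, l2) / rms_fluct(y, l1)) / math.log(l2 / l1)
print(f"H = {H:.2f} (expect ~0.5 for an uncorrelated string)")
```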
Q. What is the present model ?
It has only two operators: (1) chipping (from any site, with probability
q) and (2) diffusion + aggregation (between any two nucleotides of the receiving
sequence, with probability 1).
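One possible reading of these two operators, sketched below (the notes do not pin down whether the chipped monomer re-enters the same chain or a different one; this sketch keeps it in the same chain, so length and composition are conserved):

```python
import random

def step(chain, q, rng):
    """One update: with probability q, chip a monomer off a random site,
    then let it diffuse and aggregate between two random adjacent
    nucleotides of the receiving sequence (probability 1)."""
    if rng.random() < q and len(chain) > 1:
        i = rng.randrange(len(chain))
        monomer = chain.pop(i)          # chipping
        j = rng.randrange(len(chain) + 1)
        chain.insert(j, monomer)        # diffusion + aggregation
    return chain

rng = random.Random(3)
chain = [rng.choice("PY") for _ in range(1000)]
counts0 = {"P": chain.count("P"), "Y": chain.count("Y")}

for _ in range(10000):
    step(chain, q=0.1, rng=rng)

print("".join(chain[:40]))
```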