Evolution of long-range correlations in DNA

My Collaborator: Apratim Chatterji (Department of Physics, Indian Institute of Science, Bangalore; currently at the University of Mainz)

Aim: To propose a mechanism for the origin of long-range correlations in a chain consisting of 2 species of monomers, purines (P) and pyrimidines (Y), during its growth, starting from an initially randomly assembled monomer chain of small length.

Motivation: I was motivated by the question of how the genome grew in size as organisms evolved - from bacteria to humans, the genome has increased by 3 - 4 orders of magnitude - in such a way that power-law correlations developed in the DNA. I would like to focus on the so-called ``junk'' part of DNA, whose growth was presumably not affected by natural selection. In contrast, the ``coding'' portion of the DNA should be extremely sensitive to natural selection, as any deviation would affect the fitness, and hence the survival, of the species in the competition for resources. Note that the non-coding proportion of the DNA seems to have grown roughly with genome size and evolutionary complexity - the bacteriophage genome is 100% coding, whereas in humans 97% of the DNA is non-coding. So, in considering the growth in size of DNA, it is reasonable to consider only the non-coding portion.

The importance of power-law correlations (or ``1/f noise'') in DNA: ``. . . increasing (long-range fractal correlations) increases the recovery probability from a DNA error but decreases the information content per unit length . . . A white noise represents the maximum rate of information transfer, but a 1/f noise, with its scale-independent correlations, seems to offer the best compromise between efficient information transfer and immunity to errors on all scales'' (R. F. Voss, Phys. Rev. Lett. 68 (1992) 3805.)

Biological evidence for growth of genome size: Measurement of the nucleic acid content of the genomes of viruses, bacteria, and lower and higher eukaryotes has led to the following generalization:

Genome size increases roughly with evolutionary complexity.
 

Virus            Genome size   Bacteria             Genome size   Eukaryotes   Genome size
                 (in kbp)                           (in kbp)                   (in kbp)

SV40                           Mycoplasma hominis   760           Yeast        14,000
ΦX174                          E. coli              4,700         Nematode     100,000
M13                                                               Fruit Fly    165,000
Lambda           50                                               Man          3,000,000
Herpes Simplex   152                                              Salamander   76,500,000
T2, T4, T6       165
Smallpox         267

(Source: D. L. Hartl, Basic Genetics, 2nd edition, London: Jones & Bartlett, 1991, p. 122)

Q. What are the different categories of ``junk'' DNA?
At least nine classes of junk DNA are now recognized: introns, satellites, minisatellites, microsatellites, 3' untranslated regions, heterogeneous nuclear RNA, short interspersed elements, long interspersed elements, and pseudogenes. (Source: Tim Standish, who claims that ``non-coding'' does not necessarily mean ``junk'' DNA - rather, these segments have a structural role.)

Q. How much of specific genomes is introns?
T4 has three introns in about 200 kbp. Yeast has 221 introns in a genome of size 12067 kbp (with 6294 open reading frames, ORFs). In contrast, the yeast mitochondrial genome has 14 introns and 33 ORFs, and is 79 kbp. A good exercise is to compare the genome size with the hypothesized number of genes, since a good way to look at the intron content is to count the introns per gene. In compact genomes there is not much ``junk'' DNA (which would include spacer sequences, repetitive elements and introns), so these genomes have relatively low numbers of introns per gene. As the genome size increases, a lot more junk DNA, including introns, is present. Comparing C. elegans to humans, C. elegans has 130 genes/Mb and humans have 27 genes/Mb. This may be because there is a high proportion of junk DNA in humans, reflected in a proportionally increased number of introns per gene in the genome. In summary, as genome size increases with more junk DNA, more introns will be seen in genes. (source)

Q. What are the dynamics of the model?
Start from a randomly assembled string, apply the mutation operators, and look at the distribution of the lengths of runs of P (or Y) as the string evolves under the application of those operators. See whether a steady-state distribution is reached. Note under what conditions (i.e., for what set of values of the mutation-operator parameters) a power-law distribution is observed.
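The measurement step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the actual code of the model: the function names (random_chain, run_length_counts, run_length_distribution) and the chain length are my own assumptions. It builds a random P/Y chain and tabulates the empirical distribution P(n) of run lengths, which is the quantity to be monitored as the mutation operators are applied.

```python
import random
from collections import Counter
from itertools import groupby


def random_chain(length, rng):
    """Return a randomly assembled chain of purines (P) and pyrimidines (Y)."""
    return [rng.choice("PY") for _ in range(length)]


def run_length_counts(chain):
    """Count how many maximal runs of identical monomers have each length n."""
    return Counter(len(list(group)) for _, group in groupby(chain))


def run_length_distribution(chain):
    """Normalise the run-length counts into an empirical distribution P(n)."""
    counts = run_length_counts(chain)
    total = sum(counts.values())
    return {n: c / total for n, c in sorted(counts.items())}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
    chain = random_chain(20_000, rng)  # chain length is an arbitrary assumption
    for n, p in run_length_distribution(chain).items():
        # For an unbiased, uncorrelated chain P(n) should be close to 2**-n,
        # i.e. an exponential decay with rate ln 2.
        print(n, round(p, 4), round(2.0 ** -n, 4))
```

For the initial random string, the printed P(n) should fall off roughly as 2^(-n); a power law would instead show up as a straight line when log P(n) is plotted against log n.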

Categories of mutation:
 

(X marks the operators not included in the model.)

Point Mutations
    Base Substitution
        X Transition (replacement of P by P / Y by Y)
          Transversion (replacement of P by Y / Y by P)
        X Missense
        X Nonsense
    X Frameshift

Large Mutations
    Deletion
    Insertion
    Duplication
    Inversion

(Source: Encyclopaedia of Molecular Biology 3 (Wiley, 1999) 1553.)

Q. Of these mutation operators, which ones shall we include in our model?
A. Not transition, missense, nonsense or frameshift. These mutations affect the functional role of DNA (mainly, they change a STOP codon to a non-STOP codon and vice versa, or involve skipping some codon triplets) and not the long-range correlation property, which is our object of study. Moreover, since we are considering only P and Y, our model cannot recognize a transition mutation.
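The operators that remain (transversion, deletion, insertion, duplication, inversion) can be written down as simple list operations on a P/Y chain. The sketch below is a hedged illustration only: the function signatures, the helper apply_random_mutation, the equal operator rates and the max_segment parameter are all assumptions of mine and not part of the model as specified. Inversion is taken here to be a plain reversal of the segment, without complementing.

```python
import random


def transversion(chain, i):
    """Replace the monomer at site i by the other species (P <-> Y)."""
    flipped = "Y" if chain[i] == "P" else "P"
    return chain[:i] + [flipped] + chain[i + 1:]


def deletion(chain, i, k):
    """Remove the (up to) k monomers chain[i:i+k]."""
    return chain[:i] + chain[i + k:]


def insertion(chain, i, segment):
    """Insert a segment of monomers before site i."""
    return chain[:i] + list(segment) + chain[i:]


def duplication(chain, i, k):
    """Copy the segment chain[i:i+k] and place the copy right after it."""
    return chain[:i + k] + chain[i:i + k] + chain[i + k:]


def inversion(chain, i, k):
    """Reverse the order of the segment chain[i:i+k] (assumption: no complementing)."""
    return chain[:i] + chain[i:i + k][::-1] + chain[i + k:]


def apply_random_mutation(chain, rng, max_segment=4):
    """Apply one operator chosen uniformly at random (equal rates are an assumption)."""
    i = rng.randrange(len(chain))
    k = rng.randint(1, max_segment)
    op = rng.choice(["transversion", "deletion", "insertion",
                     "duplication", "inversion"])
    if op == "transversion":
        return transversion(chain, i)
    if op == "deletion":
        # Skip the deletion if it could remove the whole chain.
        return deletion(chain, i, k) if len(chain) > k else chain
    if op == "insertion":
        return insertion(chain, i, [rng.choice("PY") for _ in range(k)])
    if op == "duplication":
        return duplication(chain, i, k)
    return inversion(chain, i, k)
```

Note that with equal rates and symmetric treatment of P and Y, there is no reason to expect this particular combination to produce power-law correlations; the sketch only fixes what each operator does to the string.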

Q. Do we consider a lattice model or a single chain model ?

RANDOM THOUGHTS (18/2/2000):
The aim is to find how the above mutation operators, acting on a randomly assembled string of Ps and Ys, make it power-law correlated - taking it from an initial exponential (Poisson) distribution (which should be observed in small genomes - say, in bacteria - where all segments are coding) to a final power-law distribution, P(n), of the lengths (n) of sub-sequences consisting only of Ps or only of Ys. This is opposite to the way an RG transformation works - we are moving from a regime away from the critical point (which has an exponential distribution of fluctuations, exp(- \lambda n), with \lambda a function of temperature in the RG case and of time in our case, going to zero as the critical point is reached) towards the critical point (where the distribution P(n) follows a power law with a characteristic exponent). Further, just as blocking transformations decrease the length of the spin chain as we go further and further away from the critical point, here the mutation operators make the genomic chain longer and longer as the critical point is approached. As the system dynamics here drives the system towards the critical state (in some sense), this can be looked upon as an example of Self-Organized Criticality (SOC). But note that we must have some way of distinguishing P and Y in the way the operators act - if everything is equivalent, we will not be able to create a non-random (i.e., power-law correlated) distribution.

Q. How do we find out exactly what operations are necessary to generate a scaling probability distribution?
We look at how fractional Brownian motion (fBM) is generated from Brownian motion (BM) - because, in a sense, what we are aiming at is how to generate a random fractal from a random string/walk. We can look upon the random string as a random walk by identifying P as +1 (step up) and Y as -1 (step down). If it were an exactly self-similar fractal, we could have used an iterated function system (IFS) (maybe we can use some ``randomized'' version of IFS?). In any case, the problem of generating fBM from BM should give us an idea of which mathematical operations are necessary for generating the random fractal - and guide us in selecting the mutation operators that we essentially need. See ``Fractional Brownian motion as a nonstationary process: An alternative paradigm for DNA sequences'' by Paolo Allegrini, Marco Buiatti, Paolo Grigolini and Bruce J. West.
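The mapping of the string onto a walk (P as +1, Y as -1) is easy to make concrete. The sketch below is a minimal illustration under my own assumptions: the names dna_walk, rms_fluctuation and scaling_exponent and the choice of window lengths are hypothetical. It computes the root-mean-square fluctuation F(l) of the displacement over windows of length l and fits the slope of log F(l) against log l. For an uncorrelated string (ordinary Brownian motion) the slope should be close to 0.5; a slope different from 0.5 that persists over many scales would indicate fBM-like long-range correlations.

```python
import math
import random


def dna_walk(chain):
    """Cumulative displacement of the walk: P is a step up (+1), Y a step down (-1)."""
    position, walk = 0, [0]
    for monomer in chain:
        position += 1 if monomer == "P" else -1
        walk.append(position)
    return walk


def rms_fluctuation(walk, window):
    """Standard deviation of the displacement over all windows of the given length."""
    deltas = [walk[i + window] - walk[i] for i in range(len(walk) - window)]
    mean = sum(deltas) / len(deltas)
    return math.sqrt(sum((d - mean) ** 2 for d in deltas) / len(deltas))


def scaling_exponent(chain, windows):
    """Least-squares slope of log F(l) against log l over the given window lengths."""
    walk = dna_walk(chain)
    xs = [math.log(l) for l in windows]
    ys = [math.log(rms_fluctuation(walk, l)) for l in windows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed for repeatability
    chain = [rng.choice("PY") for _ in range(20_000)]
    # Window lengths are an arbitrary choice for illustration.
    print(scaling_exponent(chain, [4, 8, 16, 32, 64]))
```

The same function can be applied to the string after each round of mutations to see whether the exponent drifts away from the random-walk value.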

Q. What is the present model?
It only has the operators: (1) chipping (from any site, with probability q) and (2) diffusion + aggregation (between any 2 nucleotides of the receiving sequence, with probability 1).
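One possible reading of these two operators is sketched below. It rests on several assumptions of mine that the description above does not fix: that there is a population of chains, that donor and receiver are distinct chains picked uniformly at random, that a donor is never reduced below two monomers, and that the function name chip_and_aggregate is purely illustrative. In each step a monomer is chipped from a random site of the donor with probability q, and it is then inserted, with probability 1, between two adjacent nucleotides of the receiving chain.

```python
import random


def chip_and_aggregate(chains, q, rng):
    """Perform one update step in place; return True if a monomer was transferred.

    chains is a list of lists of 'P'/'Y' monomers, each of length >= 2.
    """
    donor, receiver = rng.sample(range(len(chains)), 2)
    # Assumption: a donor keeps at least two monomers, so every chain
    # always has an interior gap available for aggregation.
    if len(chains[donor]) <= 2 or rng.random() >= q:
        return False  # no chipping in this step
    site = rng.randrange(len(chains[donor]))
    monomer = chains[donor].pop(site)  # chipping, from any site
    # Aggregation with probability 1: choose a gap between two adjacent
    # nucleotides of the receiver (insert positions 1 .. len - 1).
    gap = rng.randint(1, len(chains[receiver]) - 1)
    chains[receiver].insert(gap, monomer)
    return True


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    # Population size, chain length and q are arbitrary illustrative values.
    chains = [[rng.choice("PY") for _ in range(20)] for _ in range(10)]
    moved = sum(chip_and_aggregate(chains, 0.3, rng) for _ in range(1000))
    print(moved, sorted(len(c) for c in chains))
```

Under this reading the total number of P and Y monomers is conserved and only the lengths and compositions of the individual chains change, so any growth of the genome would have to come from transfer between chains rather than from the creation of new monomers.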

Related Links

DNA Folding server, Zuker Lab

RNA Folding by Energy Minimization

Bibliography on correlation structure of DNA (W. Li)