# Abstract

The article reports the frequency distributions of the mutants of Alu elements in the human genome. They are remarkably complex and, yet, almost identical for each chromosome, suggesting a universal mechanism for base substitutions in Alu-elements and, possibly, other retro-transposons as well. Conceivably, these mutations of the Alu-elements effectively reduced or even crippled their proliferation which, otherwise, might have fragmented and destroyed the host genome. The article proposes a simple mathematical model that simulates the observed distributions and offers

(a) a quantitative reconstruction of the evolutionary past of the number of Alu-mutants,

(b) a determination of the times in their past when new copies of the Alu-elements inserted with full capacity to proliferate (‘seedings’), arguably giving rise to new sub-families,

(c) a new and simple determination of the evolutionary age of Alu-mutants and, thus, of a minimal age of the domains of the host chromosome in which they were found.

# Introduction

Alu elements and other retro-transposons should pose a lethal threat for every genome they invade. Their method of amplification via transcripts that reinsert into the host genome through reverse transcription [1, 2, 3] could conceivably lead to an exponential ‘explosion’ of copy numbers that would completely fragment and thus destroy the host genome. In the case of the Alu retro-transposon this catastrophe did not befall the ancestral genomes it invaded, although its copy number in the human genome, for example, exceeds 1 million [4, 5]. Fortunately for us, our ancestral genomes appear to have found an effective defense strategy that limited the proliferation of the Alu-elements to harmless levels and, in the process, may even have created a selective advantage [6].

One of the defense strategies may have been the mutation of the Alu-elements, possibly aimed at crippling their ability to proliferate. Since the entire spectrum of conceivable point mutations is consistent with the interpretation that all point mutations were caused by auto-mutagenic mechanisms [7], it would make sense if the genomes had unleashed this arsenal for their defense. As will be shown in this article, the Alu-mutants in the human genome with 50 or more base substitutions outnumber the ‘original’ Alu-copies by a wide margin. Considering that the Alu-elements are only approximately 280 bases long, such large numbers of base substitutions must have had a substantial impact on their functionality.

One might expect that random Alu-proliferation in the host genome followed by random base substitutions of each Alu sequence results in poorly reproducible, rather chaotic distributions of Alu-mutants. Surprisingly, however, the process created precisely defined frequency distributions of Alu-mutants that were the same for all human chromosomes (and chimpanzee chr.1) and depended only on the specific family to which the Alu-element belonged. In order to explain this finding, the present article offers a simple mathematical model of the dynamics of the proliferation of Alu-elements while their capacity to proliferate is increasingly inhibited by point mutations. If correct, this model will permit the reconstruction of the evolutionary past of the Alu-mutants and also the prediction of their evolutionary future. It may even serve to justify the interpretation of Alu-mutants as time stamps on the host genome.

The present article adopted several simplifying methods and strategies in order to depict Alu-elements in an easily recognizable way, and to minimize computation times.

1. Characterization of Alu-mutant by the number of their base substitutions regardless of their position.

The most significant simplification used here was the focus on the number of mutations in an Alu-sequence regardless of their position. In view of the high level of sophistication of today’s sequence analysis of Alu-elements [4, 8] this approach may seem rather crude. However, similar to aerial photography, the omission of details may sometimes offer a depiction of large-scale features that might otherwise go undetected. I hope to convince the reader that this loss of sequence details nevertheless helps to describe the overall dynamics of Alu-evolution in its past and future.

2. The restriction to the main Alu families

As a further simplification the article focuses only on the 3 major Alu-families, AluY, AluS and AluJ, while ignoring their division into a total of 217 sub-families [8, 9]. Nevertheless, my search program for Alu-mutants treated the members of all sub-families as mutants and, therefore, did not omit them.

# Methods

The sequences of the human genome were obtained from the UCSC site. The analysis program, ‘GA_dnaorg.exe’, and the simulation program, ‘Alu_dnaorg.exe’, were written by G.A.-B. using Visual C++ (Microsoft, Redmond, WA).

The search primers for the Alu families J, S and Y were the following sequences:

AluJ primer :

CCCAGGAGTTCGAGACCAGCCTGGGCAACATAGTGAGACCCCATCTCTACAAAAATTTAAAAAATTAGCCAGGCATGGTG

GCGCATGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGTGGGAGGATCGCTTGAGCCCAGGAGGTCGAGGCTGCAGTG

AGCTATGATCATGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTC.

AluS primer:

TCAGGAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTGGTG

GCACGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCGGAGGTTGCAGTG

AGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTC.

AluY primer:

AGGAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTGGTGG

CGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTG

AGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCT.

# Results

1. The distribution of Alu-mutants in the human genome

A. The genome pixel image (GPxI) of Alu-elements and their mutations

Instead of describing the mutated Alu sequences as strings of letters, we will present them as optical images by the GPxI method described earlier [11]. Briefly, the method assigns to the bases of a DNA sequence the following gray-tone values: A: black, G: white, C: dark gray and T: light gray. This assignment is, of course, arbitrary, but must remain the same throughout. It transforms the consecutive bases of the sequence into a continuous line of pixels with varying gray values. In addition, the method requires the choice of an arbitrary, but also fixed image width W. Whenever the line of pixels reaches W, it wraps around like any other text would, and continues at the beginning of the next line immediately underneath.

It is, of course, also possible to choose the image width W equal to the size of the depicted sequences. In this way, the sequences (e.g. the Alu-sequences) can be written in register, as was done in all GPxIs shown in this article.
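The mapping from bases to pixels can be sketched in a few lines of Python. This is only an illustration of the principle, not the code of the program used for the article: the numeric gray values (0, 85, 170, 255) and the function names `gpxi` and `gpxi_in_register` are assumptions, since the text names only the four tones.

```python
# Assumed numeric gray values; the article only specifies the tones
# A: black, C: dark gray, T: light gray, G: white.
GRAY = {"A": 0, "C": 85, "T": 170, "G": 255}

def gpxi(sequence, width):
    """Return the sequence as rows of gray values, wrapping after `width` bases."""
    pixels = [GRAY[base] for base in sequence.upper()]
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]

def gpxi_in_register(sequences):
    """Stack equally long sequences one per row, i.e. W = sequence length."""
    width = len(sequences[0])
    if any(len(s) != width for s in sequences):
        raise ValueError("all sequences must have the same length")
    return [gpxi(s, width)[0] for s in sequences]

# Six bases at image width W = 4 wrap into one full and one partial row.
rows = gpxi("ACGTAC", 4)
print(rows)
```

Written in register, two sequences that differ by a single base yield two image rows that differ in a single pixel, which is what makes mutants of the same motif recognizable by eye.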

The resulting images permit intuitively clear decisions whether a sequence is or is not a mutated Alu-element, while avoiding rather abstract mathematical homology computations. For example, the GPxIs of the sequences of the 217 Alu subfamilies [8] if placed in register appear as shown in Illustr. 1a. Obviously, no detailed sequence analysis is required to recognize in the GPxI that all these sequences all are variations of essentially the same motif.

In order to find the Alu-mutants in the human genome, the article uses the search program ‘GA_dnaorg.exe’ written by the author, and specific search primers 209 bases in length whose GPxIs are shown in Illustr. 1b. Their sequences are listed in the Methods section.

B. The choice of the search parameters.

The search algorithm used in the search program was a simple base-by-base comparison between a search primer and a genome sequence while the primer was moved along the genome. The success of the search was defined as a match where the number of base substitutions remained below a certain threshold N. Whenever the program found a suitable match it recorded its position, sequence and exact number of base substitutions in a data file.
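The search principle described above can be sketched as follows. The function names are hypothetical, the acceptance rule n ≤ N is an assumption taken from the definition of N in the model section, and the toy ‘genome’ is invented for illustration. The sketch also does not address how overlapping hits were handled by the actual program.

```python
def count_substitutions(primer, window):
    """Number of positions at which two equally long sequences differ."""
    return sum(1 for p, w in zip(primer, window) if p != w)

def find_mutants(genome, primer, max_subs):
    """Slide the primer along the genome and record (position, substitutions)
    for every window with at most max_subs mismatches."""
    hits = []
    size = len(primer)
    for pos in range(len(genome) - size + 1):
        n = count_substitutions(primer, genome[pos:pos + size])
        if n <= max_subs:
            hits.append((pos, n))
    return hits

def mutant_distribution(hits, max_subs):
    """A[n]: number of hits that carry exactly n substitutions."""
    dist = [0] * (max_subs + 1)
    for _, n in hits:
        dist[n] += 1
    return dist

# Toy genome: one exact copy of the primer at position 4 and one copy
# with a single substitution at position 16.
genome = "TTTTACGTACGTTTTTACGAACGTTT"
hits = find_mutants(genome, "ACGTACGT", max_subs=1)
print(hits, mutant_distribution(hits, 1))
```

The list returned by `mutant_distribution` corresponds to the frequency distributions discussed in the following sections.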

The threshold N must not be too small, lest the search miss too many Alu-mutants. It must also not be too large, lest the search accept sequences that can no longer be considered Alu-mutants. As shown by their GPxIs (Illustr. 2), the sequences identified by the search program were, indeed, easily recognizable as Alu-mutants even for values as high as N = 100 base substitutions, provided the size of the search primer was 200 bases or larger.

The same criterion of yielding recognizable Alu-patterns was applied to the selection of a suitable size p of the search primers. While a search primer with p = 200 bases searched with a threshold of N = 100 yielded clear Alu-patterns (Illustr. 3b), a search primer with p = 50 did not yield recognizable Alu-patterns, even if the threshold N was as low as 25 base substitutions (Illustr. 3c).

These and similar criteria led to the choice of a search primer size of p = 209 or 213 bases and a threshold of N = 100 base substitutions throughout the following.

C. The universal frequency distribution of Alu-mutants.

Applying the described search method individually to the human chromosomes 1 – 22 and X yielded 389,956 AluY mutants. If normalized for the same chromosome size, their frequency distributions were remarkably identical for all chromosomes (Illustr. 4a) as evidenced by the very small standard deviations between the values of different chromosomes (bars in Illustr. 4).
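The comparison between chromosomes can be illustrated with a short sketch. The article does not spell out the normalization, so dividing the counts by the chromosome size in megabases is an assumption, as are the function names. The numbers are toy values and not data from the article.

```python
from statistics import mean, stdev

def normalize(counts, chromosome_size_mb):
    """Mutants per megabase for each number n of base substitutions."""
    return [c / chromosome_size_mb for c in counts]

def compare_chromosomes(counts_by_chrom, size_mb_by_chrom):
    """Return (means, standard deviations) across chromosomes for every n."""
    normalized = [
        normalize(counts_by_chrom[name], size_mb_by_chrom[name])
        for name in counts_by_chrom
    ]
    per_n = list(zip(*normalized))  # one tuple of chromosome values per n
    return [mean(v) for v in per_n], [stdev(v) for v in per_n]

# Toy example: chrB is twice as large as chrA and carries twice as many
# mutants in every class, so the normalized distributions coincide.
counts = {"chrA": [10, 20, 40], "chrB": [20, 40, 80]}
sizes = {"chrA": 100.0, "chrB": 200.0}
means, deviations = compare_chromosomes(counts, sizes)
print(means, deviations)
```

Small standard deviations relative to the means, as in this toy case, are what the bars in Illustr. 4 express for the real data.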

Similarly, the search program found 171,066 AluS mutants and 172,240 AluJ mutants in human chromosomes 1 – 7. Although their average distribution curves were characteristically different for the different members of the Alu family, different chromosomes yielded again surprisingly identical distribution curves (Illustr. 4b, c). The distribution curve of the AluJ mutants consisted almost exclusively of heavily mutated elements, confirming that the AluJ–elements are the oldest of the family [8, 9].

Testing the homologies of all recorded AluY mutants against the AluY and AluJ search primers by the Needleman-Wunsch algorithm [12] yielded characteristically different values for mutants with fewer than 30 base substitutions (Illustr. 4d). However, in the case of 40 or more base substitutions both search primers yielded the same cloud of values between 50% and 95% homology. This result is not surprising: the distinction between the more heavily mutated elements of the 3 Alu-families AluY, AluS and AluJ should become blurred, as 40 to 50 or more base substitutions are likely to wipe out many of the relatively small sequence differences between the 2 search primers (see Illustration 1).

A surprising feature was the appearance of multiple peaks in the distributions, suggesting that there had been several waves of increased replication in the evolutionary past of the Alu-elements.

In all cases, the frequency distributions showed a pronounced dominance of Alu-mutants with 50 or more base substitutions over Alu-elements that contained fewer than 50. Equating large numbers of base substitutions with large evolutionary age, this suggests that most Alu-elements in the human genome are quite ‘old’.

Testing chimpanzee chromosome 1 yielded the same distributions as the human chromosomes.

D. The decision, which of 2 Alu-elements is more similar to the ‘original’ based on their mutant distribution.

The search for mutated Alu-elements poses a fundamental question. How can we know whether the sequence AluY that we used as a search primer is the ‘original’ sequence? Why should not another mutant Alu-sequence AluYm with (say) m base substitutions be the ‘original’, while AluY was one of its m-fold mutants?

To be sure, there is clear evidence that Alu sequences are part of the 7SL RNA gene of numerous species [10]. However, among them only certain primates have processed it into a retro-transposon, whereas Xenopus and Drosophila have highly analogous 7SL RNA genes but no Alu-elements. Therefore, early in the evolution of these primates there was a mutation of the primate 7SL RNA gene, or an invasion from the 7SL RNA gene of another species, that laid the foundation of the Alu-elements as we know them today.

Obviously, we can never decide whether a particular Alu-sequence is the ‘true original’ which may no longer exist today. Therefore, in the literature many authors placed the terms ‘original’ or ‘source’ sequences in inverted commas as was done in the present article.

Nevertheless, based on the set {M} of all Alu-mutants known today, it is quite possible to determine which of 2 Alu-mutants is more similar to the ‘original’ than the other. Traditionally, the students of Alu-elements have solved the problem by detailed studies of homologies between domains of different Alu-sequences, which can determine which sequence pre-dates the other and, thus identify the earliest among them as the most ‘original’ [4, 8].

The mutant distributions presented here offer another rather simple way to tell which of two Alu-mutants is more similar to the ‘original’ sequence. Consider the set {M} of mutants that all arose from a common original sequence Alu0 in the human or any other genome. Using Alu0 as a search primer will yield a specific mutant distribution A0[n] similar to the ones in Illustr. 4.

Now select one of the mutant Alu0-sequences X which, unbeknownst to you, differs from Alu0 by m base substitutions. Using X as a search primer will yield its mutant distribution AX[n] from the same set {M} of Alu-mutants.

The comparison between the distributions A0[n] and AX[n] will show quite easily that the search primer Alu0 is more similar to the ‘original’ Alu-sequence than X by the following criteria: Compared to A0[n] the distribution AX[n] will be shifted towards large numbers of base substitutions while lacking mutants of X with 1, 2, 3,… and other small numbers of base substitutions.

For example, Illustr. 5 shows the mutant distributions obtained from certain AluY-mutants X(i) with i = 0, 15, 30, and 63 base substitutions which were used as search primers on human chr.1. Clearly, the more base substitutions the search primer X(i) contained, the fewer mutants could be found that contained fewer than (say) 30 base substitutions.

The finding can be explained by the fact that {M} contains many mutants of Alu0 with 1, 2, 3,… and other small numbers of base substitutions, as the presence of such mutants is the definition of {M}. Hence, A0[n] will contain substantial numbers of mutants with fewer than 30 base substitutions.

On the other hand, it is extremely unlikely to find in {M} a single sequence that could qualify as a mutant of (say) X(30) that contains 1, 2, 3,… or another small number of additional base substitutions. Such sequences would have to be mutants of Alu0 with the exact same 30 base substitutions as X(30) in exactly the same positions, but contain 1, 2, 3,… additional ones in other (or the same) positions. The probability of finding such mutants is very small, since the number of all possible mutants with 30 base substitutions is on the order of $10^{60}$, while {M} contains only a minuscule fraction of them, namely approximately $10^{6}$. Hence, AX(30)[n] will contain almost no mutants with only 1, 2, 3,… base substitutions.
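The size of the sequence space invoked in this argument can be checked with a rough count. The sketch below assumes an element length of 280 bases and counts every choice of 30 positions, each with 3 alternative bases; both the function name and these counting conventions are assumptions. The exponent obtained depends on them, but under any such convention the count exceeds the roughly one million observed mutants by dozens of orders of magnitude, which is all the argument requires.

```python
import math

def possible_mutants(length, n_subs):
    """Number of distinct sequences that differ from a reference of
    `length` bases at exactly n_subs positions (3 alternatives per position)."""
    return math.comb(length, n_subs) * 3 ** n_subs

count = possible_mutants(280, 30)   # assumed Alu length of 280 bases
observed = 10 ** 6                  # approximate size of the set {M}
fraction = observed / count         # share of sequence space actually populated

print(f"possible 30-fold mutants: about 10^{math.log10(count):.0f}")
print(f"fraction represented in {{M}}: {fraction:.1e}")
```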

This reasoning was used earlier, in order to conclude that AluJ was much older than AluY because it contained almost no mutants with fewer than 30 base substitutions (Illustr. 4c). Likewise, one can see immediately that the age of AluS is in between AluY and AluJ but more similar to AluY, as its mutant distribution has fewer such mutants than AluY but many more than AluJ.

2. Mathematical model of the dynamics of Alu mutations

The described invariance of the frequency distributions and the predominance of heavily mutated Alu-elements are quite counter-intuitive results and, therefore, need explanation. The following mathematical model of Alu-mutation will try to provide one.

The model introduces only the most obvious variables and parameters that are needed to discuss quantitatively the evolutionary fate of the Alu-mutants. Furthermore, the equations of the model are kept simple and comply with common sense. For all these reasons I included them in the main body of the text, instead of banning them into an Appendix.

A. The basic variables and equations.

Definitions:

L: size of the ‘original’ Alu-element

R: number of recursions of computation. It plays the role of ‘time’ in the model and will be calibrated in terms of evolutionary time T (see equ. 6)

DR: number of recursions between successive computations (usually, DR = 1)

T: calibrated evolutionary time

DT: calibrated time intervals of computation corresponding to 1 recursion of computation (approx. 250,000 yrs; see equ.6)

P: number of recursions needed to develop the model into the present

n: number of base substitutions in an Alu-mutant

N: maximal number of base substitutions allowed (n ≤ N; N = 100) by the search program

A[n,R]: number of Alu-mutants that contain exactly ‘n’ base substitutions at ‘time’ R

Using these definitions the model considers only the effects of base substitutions on a particular ‘original’ Alu-element. It calculates at successive recursions R the number of mutants A[n,R] that contain exactly n base substitutions at ‘time’ R. The article will describe R interchangeably as ‘evolutionary time’ or as ‘recursions’. The values of R will start with R = 0 (i.e. the first appearance of the Alu-element in the genome) and proceed to R = P (i.e. the present time). The consecutive time points are assumed to be spaced by equal distances of ‘DR’. As mentioned earlier, the number of mutations n must not exceed the threshold ‘N’, i.e. the maximal number of mutations allowed which leave the mutated Alu-element still recognizable as a mutant of the original sequence.

During the time interval DR the number of mutants A[n,R] will change for 3 reasons.

(a) Some of the mutants replicated (called ‘replication[n,R]’),

(b) Some of the A[n-1,R] mutants received a base substitution and added to the number A[n,R] (called ‘gain[n,R]’), and

(c) Some of the A[n,R] mutants received a base substitution and moved up to the next category A[n+1,R] (called ‘loss[n,R]’).

Hence, we can write as change of A[n,R]

{1} $\mathit{DA}[n,R] = \{\, \mathrm{replication}[n,R] + \mathrm{gain}[n,R] - \mathrm{loss}[n,R] \,\} \cdot \mathit{DR}$; $(0 \le n \le N;\ 0 \le R \le P)$

The simplest assumption about the replication[n,R] is that it is proportional to the number of mutants A[n,R] and to a probability z[n] that expresses the capacity of an Alu-mutant with n base substitutions to replicate:

{2} $\mathrm{replication}[n,R] = a \cdot z[n] \cdot A[n,R]$;

Similarly, we assume that gain[n,R] and loss[n,R] are proportional to the numbers of mutants A[n-1,R] and A[n,R] respectively from which they originated. Furthermore, they should be also proportional to certain probabilities v[n] and w[n] that describe how likely an Alu-mutant with n-1 or n base substitutions will receive an additional one. Hence,

{3a} $\mathrm{gain}[n,R] = b \cdot v[n] \cdot A[n-1,R]$;

{3b} $\mathrm{loss}[n,R] = b \cdot w[n] \cdot A[n,R]$;

with the constant b describing the probability of a base substitution.

As to the replication probability z[n], its detailed properties do not matter as long as it vanishes very rapidly with the number n of mutations. Otherwise, every solution of the above equations would be equivalent to an explosive growth of the copy numbers of the original Alu-element and all its mutants. Hence, I chose the simplest inactivation function

{4a} $z[n] = \exp(-n/g)$; with g the inactivation constant

As to the gain and loss probabilities, it stands to reason, that the probability of a mutant to receive one more base substitution is proportional to the fraction of the not yet mutated bases. Hence,

{4b} $v[n] = (L-n-1)/L$;

{4c} $w[n] = (L-n)/L$;

Beginning with R = 0 and n = 0 one needs to compute the values of DA[n,0] for each value of n and subsequently obtain the values of A[n,R] by the recursion

{5} $A[n,R+\mathit{DR}] = A[n,R] + \mathit{DA}[n,R]$; $(0 \le n \le N;\ 0 \le R \le P)$
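One recursion of equ. 1 – 5 can be sketched as a single function. This is an illustration and not the code of the simulation program used for the article. The function name `step` is hypothetical, the products a·DR and b·DR are passed as the single arguments `a_dr` and `b_dr`, and the probabilities are taken exactly as printed in equ. 4a – 4c. The parameter values in the usage lines are those quoted for the single seeding solution.

```python
import math

def step(A, a_dr, b_dr, g, L):
    """Advance the distribution A[0..N] by one recursion (equ. 1 - 5)."""
    N = len(A) - 1
    new_A = []
    for n in range(N + 1):
        z = math.exp(-n / g)                 # equ. 4a: replication probability
        v = (L - n - 1) / L                  # equ. 4b, as printed
        w = (L - n) / L                      # equ. 4c
        replication = a_dr * z * A[n]        # equ. 2
        gain = b_dr * v * A[n - 1] if n > 0 else 0.0   # equ. 3a
        loss = b_dr * w * A[n]               # equ. 3b
        new_A.append(A[n] + replication + gain - loss)  # equ. 1 and 5
    return new_A

# Start from 1.8 'original' elements and no mutants, then take one step.
A = [0.0] * 101          # N = 100
A[0] = 1.8
A = step(A, a_dr=0.435, b_dr=0.51, g=5.0, L=200)
print(A[0], A[1])
```

After this first recursion the class n = 0 has changed by the factor 1 + 0.435 - 0.51 = 0.925, and the class n = 1 has been populated exclusively by the gain term.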

Equ.3 ignores the possibility of more than one simultaneous base substitution. Otherwise, if 2 or 3 of them would occur during the same time interval DR, then one would have to include the contributions from A[n-2,R] and A[n-3,R] to the gain of A[n,R].

However, this scenario is extremely unlikely. It is known that Alu-elements appeared around 60 million years ago [8, 9]. During this time the majority of them accumulated on average some 60 base substitutions (see Illustr. 4), which suggests an approximate frequency of $10^{-6}$ base substitutions per Alu per year. Therefore, the occurrence of 2 or 3 simultaneous substitutions would have probabilities of $10^{-12}$ and $10^{-18}$ per Alu per year, respectively, and may be neglected.

Technical note: The mathematically trained reader will have noticed that the above equations 1 – 4 describe a set of linear difference equations for which there are well established, elegant methods of solution. Unfortunately, it is difficult to convert their parameters, such as the eigenvalues of the coefficient matrix, into quantities that have a readily understandable biological meaning, such as mutation rates, replication rates, etc. Furthermore, as pointed out in the next section, it will be necessary to render this matrix explicitly time dependent. As a result, there will be no explicit solutions of these equations.

Therefore, the present article preferred a more pedestrian way of solving the equations with a computer program called ‘Alu_dnaorg.exe’, which was written by the author and carries out the required recursions of equ.5 one step at a time.

B. The fitting of the observed distributions of Alu mutants.

The mentioned computer program yielded essentially only 2 kinds of non-trivial, realistic solutions. One, which will be called the ‘single seeding solution’, describes the time course of the mutants A[n,R] following a single ‘infection’ episode with a number of ‘original’ or ‘source’ sequences (called a ‘seeding’). The other describes the time course of the mutants A[n,R] following multiple episodes of new seedings at different times, and will be called the ‘multiple seeding solution’.

a. Single Seeding solution.

Assuming a single seeding of ‘original’ Alu-elements, I used the simulation program to determine the simplest, most fundamental solution of equ. 1 – 5. I selected as parameters a size of the Alu-elements of L = 200, an initial fraction of ‘original’ Alu-elements of A[0,0] = 1.8 [%], an inactivation constant of g = 5 [mutations], and values for the (dimensionless) mutation and replication terms of b·DR = 0.51 and a·DR = 0.435. The solutions are shown in Illustr. 6 as a function of the number of recursions. Their actual calibration in units of evolutionary time will be deferred to Section 3.

It appeared that the basic single seeding solution is represented by a single narrow peak, reminiscent of a Gaussian. In contrast to a Gaussian, however, it is not symmetrical about its maximum. As time increases, the amplitude of the peak grows while it migrates to the right, i.e. towards larger numbers of base substitutions. This behavior of the single seeding solution may not be immediately obvious. However, it may be made plausible as follows.

The steady growth in numbers of Alu mutants follows because replication can only increase the numbers of Alu-mutants. Once these numbers are large enough, even a very low probability of replication can still increase them, albeit by relatively small increments. Nevertheless, these increments continue to add up to larger and larger numbers.

As to the ‘migration to the right’, it follows because the number of base substitutions can only increase if one disregards the unlikely possibility that a base substitution accidentally reverts a mutation back to its original base. Therefore, the number of Alu-sequences that still contain only a small number of base substitutions decreases steadily, while the ones with the largest number of such mutations acquire even more.

b. Multiple seeding solution.

In contrast to the single seeding solution, the actual frequency distributions of the Alu-mutants in the human genome showed several peaks (Illustr. 4), suggesting that several seeding episodes of ‘fresh’ (i.e. fully replicative) Alu-elements had occurred in the past [13, 14, 15, 16]. The simulation program provided for the occurrence of several episodes of new Alu-elements appearing and replicating.

As shown in Illustration 7 the mutant distributions of the AluY, AluS, and AluJ mutations could, indeed, be simulated with reasonable accuracy by introducing 5 and 4 individual single seeding episodes at specific recursion numbers Ri. In Section 3 they will be converted into actual times Ti. All parameters used for the simulation are shown in Illustration 10. It appeared that each new wave of Alu-elements required its own set of parameters in order to fit the experimental distributions.

The reader will have noticed that the experimental distributions differed considerably from the simulated ones at levels of 90 or more base substitutions. The reason for this discrepancy is not a failure of the model but an inevitable consequence of the search method.

Above certain high levels of mutations the search program will increasingly misidentify unrelated sequences as heavily mutated Alu-sequences. This becomes obvious in the extreme case, where the search program would accept all of the 280 bases in the Alu-element as mutations. In this case, every contiguous sequence of 280 bases contained in the chromosome would qualify as an Alu-mutant and would be counted. In other words, every experimental distribution inevitably rises to the very high level of (chromosome size)/280 counts as it approaches abscissa values of 280 mutations.

c. The need for initial mutations of newly seeded Alu mutants.

There is no reason to assume that the ‘fresh’ Alu-elements of the various seedings were necessarily identical to the original sequence. It is conceivable that they already contained one or several base substitutions. Indeed, the success of the fitting procedure required several of the most recent seedings to have one or several initial mutations. They are listed in Illustration 10.
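A multiple seeding solution can be sketched as a sum of single seeding solutions, each started at its own recursion Ri with its own parameters and its own number of initial mutations. Treating the waves as independent and adding them is an assumption that rests on the linearity of equ. 1 – 5. The function names are hypothetical, and the seeding values below are placeholders for illustration, not the fitted parameters of Illustration 10.

```python
import math

def single_seeding(copies, n_init, recursions, a_dr, b_dr, g, L=200, N=100):
    """Distribution A[0..N] produced by one seeding after `recursions` steps."""
    A = [0.0] * (N + 1)
    A[n_init] = copies
    for _ in range(recursions):
        A = [
            A[n]
            + a_dr * math.exp(-n / g) * A[n]                          # replication
            + (b_dr * (L - n - 1) / L * A[n - 1] if n > 0 else 0.0)   # gain
            - b_dr * (L - n) / L * A[n]                               # loss
            for n in range(N + 1)
        ]
    return A

def multiple_seeding(seedings, P, L=200, N=100):
    """Sum the waves of all seedings that occurred at or before recursion P."""
    total = [0.0] * (N + 1)
    for s in seedings:
        if s["R"] > P:
            continue  # this seeding has not happened yet
        wave = single_seeding(s["copies"], s["n_init"], P - s["R"],
                              s["a_dr"], s["b_dr"], s["g"], L, N)
        total = [t + x for t, x in zip(total, wave)]
    return total

# Placeholder seedings: the second one starts later and already
# carries 2 initial base substitutions.
seedings = [
    {"R": 0,   "copies": 1.8, "n_init": 0, "a_dr": 0.435, "b_dr": 0.51, "g": 5.0},
    {"R": 120, "copies": 0.5, "n_init": 2, "a_dr": 0.435, "b_dr": 0.51, "g": 5.0},
]
distribution = multiple_seeding(seedings, P=200)
print(max(range(len(distribution)), key=distribution.__getitem__))
```

Because every seeding keeps its own entry in the list, giving each wave a separate parameter set, as the fitting required, needs no change to the functions.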

3. The calibration of the evolutionary age of Alu-elements

A. Conversion between recursions and evolutionary time.

As mentioned above, the number of recursions R was used in the mathematical model of Alu-mutation in place of a time variable. In order to calibrate it in terms of the actual evolutionary time T the article made the same assumption as the field in general [8, 17], namely that the increasing number of base substitutions occurred at a steady rate. (That is NOT to say that the Alu-mutants occurred at a steady rate. On the contrary, their numbers jumped up several times at the times of the different ‘seedings’.)

Therefore, the calibration needs only 2 time points and their corresponding numbers of recursion to determine the conversion factor t between evolutionary time and number of recursions. The obvious choice for the first point was the first appearance of Alu-elements (T0 = -60 million years [6, 8]), corresponding to R0 = 0, and as the second point the present time (Tpres = 0). Its corresponding number of recursions Rpres was derived from the fitting of the AluJ distribution. Consistent with other determinations, it was obviously the oldest Alu-element, because it contained practically no members with less than 30 mutations (Illustr. 4c). To fit it with the mathematical model required 240 recursions (Illustration 10), yielding Rpres = 240. Hence, the conversion factor became t = 60/240 = 1/4 [million years/recursion]. It was the same value for all other Alu-elements, although their value of Rpres was different (see below). In other words the time increment DR = 1 of the mathematical model in real time spans 250,000 years.

{6} Let T = evolutionary time, R = number of recursions, and Rpres = final number of recursions which creates the present day distribution of mutants, then

$T\ [10^{6}\ \mathrm{yrs}] = t \cdot (R - R_{\mathrm{pres}})$, with $t = 0.25\ [10^{6}\ \mathrm{yrs/recursion}]$;

As an immediate application of equ.6, Illustration 8 shows the progression of the simulation of the AluY mutations calibrated in evolutionary time from its beginning into the far future. In this case we used Rpres = 200 (see Illustration 10) for the calibration, because the distribution of the AluY mutants required 200 recursions to reach the present time.
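The conversion of equ. 6 is simple enough to be written as a pair of helper functions. The function names are hypothetical. The values of Rpres are the ones quoted in the text (240 for AluJ, 200 for AluY).

```python
def recursion_to_time(R, R_pres, t=0.25):
    """Equ. 6: evolutionary time in million years; negative values lie in the past."""
    return t * (R - R_pres)

def time_to_recursion(T, R_pres, t=0.25):
    """Inverse of equ. 6: recursion number that corresponds to time T."""
    return T / t + R_pres

print(recursion_to_time(0, R_pres=240))    # first AluJ seeding
print(recursion_to_time(0, R_pres=200))    # first AluY seeding
print(recursion_to_time(240, R_pres=200))  # 40 recursions beyond the present
```

With t = 0.25 the first call returns -60, i.e. the calibration point of 60 million years ago, while recursion numbers larger than Rpres map to positive times, i.e. the future.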

B. The transition from simulated numbers of mutants to actual ones.

Each simulated distribution curve represented a total number of S simulated mutants. Therefore, the actual numbers of Alu-mutants can be calculated by multiplying each value of A[n,T] with the same scale factor l.

{7} Scale factor l = M / S, where

M = the total number of experimentally observed Alu-mutants, and

S = SUM(A[n,P]), the total number of mutants simulated for the present time P.

This approach is legitimate because the solutions of equ. 1 – 5 are obviously scale invariant, i.e. if A[n,T] (0≤n≤N) is a solution, then l·A[n,T] is also one for any arbitrary constant l.

In the case of the human genome there were 389,956 AluY-mutants (see Results 1C) and the fitting program yielded a value of S(AluY) = 5296. Hence, l(AluY) = 389,956/5296 = 73.6. The corresponding values for the 2 other Alu-families were l(AluS) = 72.7 and l(AluJ) = 72.3.

The scale factor l must also be applied to the initial numbers of every seeding. For example, the normalized initial number of A[0,0] = 1.8 for AluY becomes a value of A[0,0] · l = 1.8·73.6 = 133 initial copies of the ‘original’ AluY. Applying the values for the accuracy of the parameters (Illustration 10, last column) the mathematical model claims that some 60 million years ago our ancestral genome was ‘invaded’ by only 133 ± 27 copies of the original AluY element.
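The arithmetic of equ. 7 for AluY can be retraced in a few lines. The function names are hypothetical. The input numbers are the ones given in the text.

```python
def scale_factor(observed_total, simulated_total):
    """Equ. 7: l = M / S."""
    return observed_total / simulated_total

def to_actual_numbers(simulated, l):
    """Convert a simulated distribution (or a single value) to actual copy numbers."""
    return [l * x for x in simulated]

l_aluy = scale_factor(389_956, 5296)           # M and S(AluY) from the text
initial_copies = to_actual_numbers([1.8], l_aluy)[0]

print(round(l_aluy, 1))       # scale factor for AluY
print(round(initial_copies))  # initial copies of the 'original' AluY
```

The same function can be applied to the whole simulated distribution A[n,P], so that its sum equals the observed total M.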

C. Seedings of Alu mutants and the appearance of Alu-subfamilies.

Applying equ. 6 to the times when new seedings appeared in the mathematical simulation (Illustration 11, column A) yielded the values shown in Illustration 11, column B. The appearance of new, major Alu sub-families [8, 9] is listed in column C of Illustration 11. The remarkably close correspondence between the values in columns B and C suggests that the seedings may be interpreted as the appearance of new copies of Alu-elements that have the full (original) proliferation capacity and, thus, gave rise to new sub-families. In other words, the mathematical model expresses the appearance of new Alu sub-families as new seedings. Based on this interpretation, Illustration 9 shows for the 3 main Alu-families AluY, AluS, and AluJ the times and relative numbers of the new seedings and, thus, the appearance of new sub-families.

# Discussion

**1. Implications of the mathematical model**

**A. The basic assumptions of the model.**

The mathematical model proposes that the number of Alu-mutants A[n,T] that contain n base substitutions at any given time T

(a) increases through replication, although the probability of replication diminishes rapidly with n,

(b) increases because some of the (n-1)-fold mutants acquire one new base substitution proportional to the fraction of their un-mutated bases,

(c) decreases because some of the (n)-fold mutants acquire one new base substitution proportional to the fraction of their un-mutated bases, and

(d) increases through occasional ‘seedings’ of Alu-elements with full replicative capacity.

These rather minimal assumptions, which comply with simple common sense, were sufficient to reproduce the details of the actual mutant distributions of AluY, AluS and AluJ to a high degree of accuracy. None of these assumptions expresses any chromosome specificity. Therefore, the mathematical model explains the main finding of this article, namely the remarkable similarity between the Alu-mutant distributions of different chromosomes.

**B. The quantitative reconstruction of the evolutionary past of Alu-mutations.**

A major attraction of the mathematical model is that it allows one to reconstruct the past mutant distributions and to project them into the future (Illustrations 8 and 9). Of course, the model cannot predict whether and when future seedings may occur.

The list of the past seedings may not be complete. It seems possible to obtain a more accurate simulation of the mutant distributions than the one illustrated in Illustration 7 by adding several further ‘minor’ seedings that introduced only very small numbers of ‘fresh’ Alu-elements. These additional seedings may explain the large number of 217 Alu sub-families [8]. However, they were omitted here in order to offer a more transparent presentation of the mathematical model.

**C. The seemingly large number of required parameters.**

The simulation of each observed mutant distribution used as many as 28 parameters. The number may appear large and, thus, may seem to render the model less meaningful.

However, one should keep in mind that the length L = 200 of the Alu-search primers and the time T1 = 0 of the first seeding are not fitting parameters. Furthermore, six parameters are clearly the necessary minimum for each single seeding solution, because the processes of inactivation (which requires 2 parameters), replication and mutation of the Alu-elements, their initial numbers and the time of the seeding are all independent of each other and require separate parameters. Thus, the number of 28 parameters arises as the minimal number of parameters needed to describe the 5 and 4 separate seedings.

**D. The dynamic parameters of Alu-mutation.**

As to the replication of newly seeded Alu-elements, the common value of the inactivation constant g = 5 suggests that, on average, a fraction of 1/e = 37% of all newly seeded Alu-mutants were rendered incapable of replication after only 5 base substitutions. Not even all of the remaining elements replicated, but only 42% of them (equ.2), because the different Alu-families had a rather similar replication factor a·DT = 0.42. However, the reader should be reminded that even very small probabilities of replication may support an impressive increase in the total number of Alu-elements, provided that the copy numbers already present are large enough.

As to the mutation of newly seeded Alu-elements, the parameters of the different Alu-families were similar enough to equate them to a common value b·DT = 0.5 for the purpose of this discussion. This value suggests that half of the un-mutated fraction w = (L - n)/L of each n-fold Alu-mutant received a base substitution every DT = 250,000 years (see equ.6).
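As a worked illustration of this rate, using only the values quoted above (L = 200, b·DT = 0.5), one may write the fraction of n-fold mutants that acquire one further base substitution per interval DT as follows. The symbol $f_n$ is introduced here for convenience and is not the article's notation.

$$
f_n = b\,DT \cdot w = 0.5 \cdot \frac{L - n}{L}, \qquad f_0 = 0.5, \qquad f_{50} = 0.5 \cdot \frac{150}{200} = 0.375
$$

Under this reading, the flow of mutants into the next class slows down as substitutions accumulate, because fewer un-mutated bases remain.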

The mathematical model also introduced the possibility that each newly seeded Alu-element may have started with a certain number of initial base substitutions (called ‘initial # mutations’ in Illustration 10) compared to the ‘original’ Alu-element. These numbers, which are not subject to a scale factor, are surprisingly small. However, they have a substantial impact on the fitted distribution.

**2. Possible use as time stamp.**

Once calibrated in terms of evolutionary time, the mathematical model can be used to determine the age of a particular Alu-mutant AluX. What is more, in this way one would also be able to determine a minimal age for the host chromosome, or even for the part of the host chromosome in which this particular mutant was found. After all, it stands to reason that most of the host genome existed before the Alu-element invaded it.

To this end, one would use the sequence of AluX as a search primer to establish its mutant distribution throughout the entire chromosome. Subsequently, the age of AluX can be determined by assessing how many mutants with small numbers of base substitutions are contained in its mutant distribution, as described in section 1D of the Results.

**3. Applicability to other retro-transposons.**

The present article focused on the Alu retro-transposons in the human genome because they are the most numerous and best studied and, thus, could be used for a more detailed discussion of the mathematical model. However, no part of the mathematical model used any specific property of Alu-elements or of the human genome that would not also apply to other retro-transposons in other species. Therefore, the model should be able to describe the specific mutant distributions of other retro-transposons in other species as well.

# References

1. Mathias SL, Scott AF, Kazazian Jr. HH, Boeke JD, Gabriel A. Reverse transcriptase encoded by a human transposable element. Science (1991) 254: 1808–1810.

2. Rogers J. Retroposons defined. Nature (1983) 301: 460.

3. Dewannieux M, Esnault C, Heidmann T. LINE-mediated retrotransposition of marked Alu sequences. Nat. Genet. (2003) 35: 41–48.

4. Han K, Xin J, Wang H, Hedges DJ, Garber RK, Cordaux R, Batzer MA. Under the genomic radar: The stealth model of Alu amplification. Genome Res. (2005) 15: 655–664.

5. Batzer MA, Deininger PL. Alu repeats and human genomic diversity. Nat. Rev. Genet. (2002) 3: 370–379.

6. Britten RJ. Evolutionary selection against change in many Alu repeat sequences interspersed through primate genomes. Proc. Natl. Acad. Sci. USA (1994) 91: 5992–5996.

7. Albrecht-Buehler G. The spectra of point mutations in vertebrate genomes. BioEssays (2009) 31: 98–106 (http://www3.interscience.wiley.com/cgi-bin/fulltext/121641840/PDFSTART).

8. Price AL, Eskin E, Pevzner PA. Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res. (2004) 14: 2245–2252 (doi: 10.1101/gr.2693004).

9. Britten RJ. Evidence that most human Alu sequences were inserted in a process that ceased about 30 million years ago. Proc. Natl. Acad. Sci. USA (1994) 91: 6148–6150.

10. Ullu E, Tschudi C. Alu sequences are processed 7SL RNA genes. Nature (1984) 312: 171–172 (doi: 10.1038/312171a0).

11. Albrecht-Buehler G. Outline of a Genome Navigation System Based on the Properties of GA-Sequences and Their Flanks. PLoS ONE (2009) 4(3) (doi: 10.1371/journal.pone.0004701).

12. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. (1970) 48(3): 443–453.

13. Shaikh TH, Deininger PL. The role and amplification of the HS Alu subfamily founder gene. J. Mol. Evol. (1996) 42: 15–21.

14. Matera AG, Hellmann U, Schmid CW. A transpositionally and transcriptionally competent Alu subfamily. Mol. Cell Biol. (1990) 10: 5424–5432.

15. Leeflang EP, Liu WM, Hashimoto C, Choudary PV, Schmid CW. Phylogenetic evidence for multiple Alu source genes. J. Mol. Evol. (1992) 35: 7–16.

16. Shen MR, Batzer MA, Deininger PL. Evolution of the master Alu gene(s). J. Mol. Evol. (1991) 33: 311–320.

17. Britten RJ. Rates of DNA sequence evolution differ between taxonomic groups. Science (1986) 231: 1393–1398.

# Source(s) of Funding

The work was supported by the Robert Laughlin Rea endowed chair held by the author.

# Competing Interests

N/A

# Disclaimer

This article has been downloaded from WebmedCentral. With our unique author driven post publication peer
review, contents posted on this web portal do not undergo any prepublication peer or editorial review. It is
completely the responsibility of the authors to ensure not only scientific and ethical standards of the manuscript
but also its grammatical accuracy. Authors must ensure that they obtain all the necessary permissions before
submitting any information that requires obtaining a consent or approval from a third party. Authors should also
ensure not to submit any information which they do not have the copyright of or of which they have transferred
the copyrights to a third party.
