Asymptotic Results in Over- and Under-Representation of Words in DNA


ASYMPTOTIC RESULTS IN OVER- AND UNDER-REPRESENTATION OF WORDS IN DNA

WANG RANRAN (B.Sc., Peking University)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF MATHEMATICS
NATIONAL UNIVERSITY OF SINGAPORE
2005

Acknowledgements

Firstly, I would like to express my sincere thanks to my advisor, Professor Chen, Louis H. Y., for his help and guidance during these two years. Professor Chen suggested to me the research topic of over- and under-representation of words in DNA sequences, which is interesting and inspiring. From this research work I not only gained a grounding in computational biology, but also learned many important modern techniques in probability theory, such as the Chen-Stein method. Professor Chen gave me precious advice on my research work and taught me how to think rigorously. He always encouraged me to keep an open mind and to discover new methods independently. The academic training I acquired during these two years will greatly benefit my future research.

I would also like to thank Professor Shao Qiman for his inspiring suggestion, which led to Remark 2.4 and finally to the proof of Theorem 3.9; and Associate Professor Choi Kwok Pui, who helped me revise the first draft of this thesis and gave me many valuable suggestions. My thanks also go to Mr. Chew Soon Huat David, who guided me in conducting computer simulations; Mr. Lin Honghuang, for helping me generate DNA sequences and compute word counts for the simulations; and Mr. Dong Bin, for his advice in revising this thesis. Finally, I would like to thank Mr. Chew Soon Huat David again for providing this wonderful LaTeX thesis template.

Contents

Acknowledgements
Summary
List of Figures
1 Introduction
2 Extrema of Normal Random Variables
  2.1 Distribution Functions of Extrema
  2.2 Poisson Approximation Approach
3 Asymptotic Results of Words in DNA
  3.1 Tail Probabilities of Extrema of Sums of m-dependent Variables
  3.2 Asymptotic Normality of Markov Chains
4 Simulation Results
  4.1 DNA Sequences under M0
  4.2 DNA Sequences under M1

Summary

Identifying over- and under-represented words is often useful in extracting information from DNA sequences. Since the criteria for defining over- and under-represented words are somewhat ambiguous, we shall focus on the words of maximal and minimal occurrence, which can unambiguously be regarded as over- and under-represented respectively. In this thesis, we study the tail probabilities of the extrema over a finite set of standard normal random variables, using techniques such as Bonferroni's inequalities and Poisson approximation via the Chen-Stein method. Combining similar techniques with moderate deviation results for m-dependent random variables, we derive the asymptotic tail probabilities of the extrema over a set of word occurrences under the M0 model. The statistical distribution of word counts is also studied: we show the asymptotic normality of word counts under both the M0 and M1 models. Finally, we use computer simulations to study the tail probabilities of the most frequently and most rarely occurring DNA words under both the M0 and M1 models. The asymptotic results under the M1 model appear similar to those under the M0 model.
List of Figures

4.1 Normal Q-Q plot of the sums of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M0 model.
4.2 Normal Q-Q plot of the maxima of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M0 model.
4.3 Point-to-point plots of (a) Fmax(x) versus G(x), where Fmax(x) stands for the estimated probability $P(M \ge x)$ and $G(x) = 64(1-\Phi(x))$; (b) Fmin(x) versus G(x), where Fmin(x) stands for the estimated probability $P(m \le x)$ with $G(x) = 64\,\Phi(x)$.
4.4 Normal Q-Q plot of the sums of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M1 model.
4.5 Point-to-point plots of (a) Fmax(x) versus G(x); (b) Fmax(x) versus G(x); (c) Fmin(x) versus G(x).

Chapter 1
Introduction

The analysis of rarity and abundance of words in DNA sequences has long been of interest in biological sequence analysis. A direct way to observe whether a DNA word occurs rarely (or frequently) in a genome is to analyze the number of its occurrences in a DNA sequence. For a DNA sequence $A_1A_2\cdots A_n$ with $A_i \in \mathcal{A} = \{A, C, G, T\}$, we define the word count (e.g. Waterman (1995)) $N_u$ for a k-tuple word u as
$$N_u = \sum_{i=1}^{n-k+1} I_u(i) = \sum_{i=1}^{n-k+1} I(A_i = u_1, A_{i+1} = u_2, \ldots, A_{i+k-1} = u_k),$$
where $I_u(i)$ is the indicator that the word u occurs at starting position i.

To determine whether a DNA word u is rare or abundant in a DNA sequence, one first needs to introduce a probability model. Typical models, such as stationary m-order Markov chains, have been widely considered in the literature (Reinert et al. (2000)). In this thesis, two models for DNA sequences will be considered: the M0 model, under which all letters are independent and identically distributed, and the M1 model, under which $\{A_1, A_2, \ldots\}$ forms a stationary Markov chain of order 1.

To analyze the word count $N_u$, we naturally first study its statistical distribution for a given model of the underlying DNA sequence. We adopt a commonly used standardized score
$$z_u = \frac{N_u - EN_u}{\sqrt{\mathrm{Var}(N_u)}}, \qquad (1.1)$$
where $EN_u$ and $\mathrm{Var}(N_u)$ are the mean and variance of $N_u$ respectively (Leung et al. (1996)). The statistical distribution of the word count $N_u$ has been well studied in the literature. Waterman (1995, Chapter 12) showed that the joint distribution of a finite set of z scores can be well approximated by a multivariate normal distribution under the M1 model.
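To make the definitions above concrete, here is a minimal sketch (Python, not part of the thesis; the sequence and word are illustrative) of the word count $N_u$ computed directly from the indicator definition:

```python
def word_count(seq: str, u: str) -> int:
    """N_u: number of occurrences of u in seq, overlaps counted,
    i.e. the sum of the indicators I_u(i) for i = 1, ..., n - k + 1."""
    k = len(u)
    return sum(seq[i:i + k] == u for i in range(len(seq) - k + 1))

print(word_count("ACGAAAGT", "AA"))  # the two overlapping hits inside "AAA" give 2
```

Note that occurrences are allowed to overlap; this self-overlap is exactly what makes the variance of $N_u$ word-dependent later on.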
Several research works aim at identifying over- and under-represented words or palindromes in DNA. A word is called over- (or under-)represented if it is observed more (or less) frequently than expected under some specified probability model (Phillips et al. (1987)). Leung et al. (1996) identified over- and under-represented short DNA words by ranking their $z^L$ scores (maximum likelihood plug-in z scores) in a specific genome. Chew et al. (2004) studied the over- and under-representation of the accumulative counts of all palindromes of a certain length by identifying their upper and lower 5% z scores of a standard normal distribution. In these studies, the criteria used to identify over- (or under-)representation differ, and indeed, for different purposes in biological studies the criteria will in general differ. There is no single universal way to determine whether a given word is over- (or under-)represented.

However, if we consider the extreme case, i.e. if we take only the words of maximal and minimal occurrence, these two words are surely the over-represented and the under-represented ones respectively; this is exactly what we do in this thesis. We shall apply ξ scores, which are essentially the same as the z score defined in equation (1.1), and analyze the over- and under-representation of a finite set of DNA words as the sequence length goes to infinity, by investigating the behavior of the extrema over their ξ scores.

We shall study the asymptotic behavior of ξ scores. DNA sequences are generally long, so asymptotic results are relevant to the statistical analysis of word counts. For this, we introduce the following notation:

$a_n = O(b_n)$: $|a_n| \le c|b_n|$ for some constant $c$, as $n \to \infty$;
$a_n = o(b_n)$: $a_n/b_n \to 0$ as $n \to \infty$;
$a_n \sim b_n$: $a_n/b_n \to 1$ as $n \to \infty$;
$a_n \asymp b_n$: $c_1b_n \le a_n \le c_2b_n$ for some constants $c_1, c_2$, as $n \to \infty$.

Assuming that the DNA sequence is modelled by M0, we shall show (see Theorem 3.9) that for a finite set of ξ scores $\{\xi_1, \xi_2, \ldots, \xi_d\}$,
$$P\big(\max_i \xi_i \ge x\big) \sim d\big(1-\Phi(x)\big) \qquad \text{and} \qquad P\big(\min_i \xi_i \le -x\big) \sim d\,\Phi(-x),$$
as $n \to \infty$ and $x \to \infty$ with $1 \le x \le c\sqrt{\ln n}$, provided that the covariance matrix of the word counts is non-singular. Here Φ and φ denote the distribution function and the density function of a standard normal random variable respectively.

When the DNA sequence is assumed to be M1, we prove the asymptotic normality of the joint distribution of ξ scores by applying a central limit theorem for random variables under a mixing condition (Billingsley (1995), Section 27). Unfortunately, under the M1 model the convergence of the ratios $P(\max_i \xi_i > x)/\big(d(1-\Phi(x))\big)$ and $P(\min_i \xi_i \le -x)/\big(d\,\Phi(-x)\big)$ to 1 remains unsolved.

This thesis is organized as follows. Chapter 2 shows how the distribution functions of the extrema of a finite set of correlated standard normal random variables behave when these extrema tend to extremely large or small values. In Chapter 3, the asymptotic convergence of the tail probabilities of the extrema is established for word counts under the M0 model; the chapter is also devoted to the asymptotic normality of word counts under the M0 and M1 models. Results of simulations are presented in Chapter 4; they support the asymptotic results given by Theorem 3.8 and suggest that similar results hold under the M1 model.

Chapter 2
Extrema of Normal Random Variables

In this chapter, we investigate the distributions of both the maximum and the minimum of a set of standard normal random variables. More precisely, we find the probability of the maximum being greater than $c$, and the probability of the minimum being less than $c_0$, for $c, c_0 \in \mathbb{R}$. Our main theorem in this chapter shows that, when $c$ is large enough and $c_0$ is small enough, the asymptotic tail distributions of both extrema follow certain expressions in terms of $c$ and $c_0$ respectively. We present two proofs of this theorem, one using Bonferroni's inequalities and the other using Poisson approximation via the Chen-Stein method.

2.1 Distribution Functions of Extrema

To facilitate the proof of Theorem 2.7, we need a few lemmas first. The first lemma was given by Barbour et al. (1992). To make this thesis self-contained, we provide its proof, which is essentially the same as that of Barbour et al. (1992).
Throughout this section, we assume that the correlation $r$ of the two random variables $X$ and $Y$ is strictly between $-1$ and $1$, i.e. $-1 < r = \mathrm{corr}(X, Y) < 1$.

Lemma 2.1. Let $(X, Y)$ be jointly normally distributed with mean vector $\mathbf{0}$ and covariance matrix
$$\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}.$$
(i) If $0 \le r < 1$, then for any positive $a$ and $b$,
$$\big(1-\Phi(a)\big)\Big(1-\Phi\Big(\frac{b-ra}{\sqrt{1-r^2}}\Big)\Big) \le P(X > a, Y > b) \le \big(1-\Phi(a)\big)\Big[1-\Phi\Big(\frac{b-ra}{\sqrt{1-r^2}}\Big) + r\,\frac{\varphi(b)}{\varphi(a)}\Big(1-\Phi\Big(\frac{a-rb}{\sqrt{1-r^2}}\Big)\Big)\Big].$$
If $-1 < r \le 0$, the inequalities are reversed.
(ii) If $0 \le r < 1$, then for any nonpositive $a$ and $b$,
$$\Phi(a)\,\Phi\Big(\frac{b-ra}{\sqrt{1-r^2}}\Big) \le P(X \le a, Y \le b) \le \Phi(a)\Big[\Phi\Big(\frac{b-ra}{\sqrt{1-r^2}}\Big) + r\,\frac{\varphi(b)}{\varphi(a)}\,\Phi\Big(\frac{a-rb}{\sqrt{1-r^2}}\Big)\Big].$$
If $-1 < r \le 0$, the inequalities are reversed.

Proof. For part (i),
$$P(X > a, Y > b) = \int_a^\infty\!\int_b^\infty \frac{1}{2\pi\sqrt{1-r^2}}\, e^{-\frac{x^2+y^2-2rxy}{2(1-r^2)}}\, dy\, dx = \int_a^\infty \frac{e^{-x^2/2}}{\sqrt{2\pi}}\int_b^\infty \frac{1}{\sqrt{2\pi(1-r^2)}}\, e^{-\frac{(y-rx)^2}{2(1-r^2)}}\, dy\, dx$$
$$= \int_a^\infty \varphi(x)\int_{\frac{b-rx}{\sqrt{1-r^2}}}^\infty \varphi(y)\, dy\, dx = \int_a^\infty \varphi(x)\Big(1-\Phi\Big(\frac{b-rx}{\sqrt{1-r^2}}\Big)\Big)\, dx.$$
Integrating by parts, we get
$$P(X > a, Y > b) = -\int_a^\infty \Big(1-\Phi\Big(\frac{b-rx}{\sqrt{1-r^2}}\Big)\Big)\, d\big(1-\Phi(x)\big) = \big(1-\Phi(a)\big)\Big(1-\Phi\Big(\frac{b-ra}{\sqrt{1-r^2}}\Big)\Big) + r\int_a^\infty \big(1-\Phi(x)\big)\,\varphi\Big(\frac{b-rx}{\sqrt{1-r^2}}\Big)\,\frac{dx}{\sqrt{1-r^2}}. \qquad (2.1)$$
If $0 \le r < 1$, the lower bound follows immediately. Next, we show that the function $f(x) = \big(1-\Phi(x)\big)/\varphi(x)$ is decreasing for $x > 0$. We have $f'(x) = -1 + x\big(1-\Phi(x)\big)/\varphi(x)$, and when $x > 0$,
$$1-\Phi(x) = \int_x^\infty \varphi(y)\, dy \le \frac{1}{x}\int_x^\infty y\,\varphi(y)\, dy = \frac{\varphi(x)}{x},$$
so that $f'(x) \le 0$. It follows that
$$\int_a^\infty \big(1-\Phi(x)\big)\,\varphi\Big(\frac{b-rx}{\sqrt{1-r^2}}\Big)\,\frac{dx}{\sqrt{1-r^2}} \le \frac{1-\Phi(a)}{\varphi(a)}\int_a^\infty \varphi(x)\,\varphi\Big(\frac{b-rx}{\sqrt{1-r^2}}\Big)\,\frac{dx}{\sqrt{1-r^2}} = \frac{1-\Phi(a)}{\varphi(a)}\,\varphi(b)\int_a^\infty \varphi\Big(\frac{x-rb}{\sqrt{1-r^2}}\Big)\,\frac{dx}{\sqrt{1-r^2}} = \big(1-\Phi(a)\big)\,\frac{\varphi(b)}{\varphi(a)}\Big(1-\Phi\Big(\frac{a-rb}{\sqrt{1-r^2}}\Big)\Big),$$
using the identity $\varphi(x)\,\varphi\big(\frac{b-rx}{\sqrt{1-r^2}}\big) = \varphi(b)\,\varphi\big(\frac{x-rb}{\sqrt{1-r^2}}\big)$. This gives the upper bound. By equation (2.1), the factor $r$ in the last term is nonpositive when $-1 < r \le 0$, so the lower and upper bounds are interchanged, and the same argument yields the reversed inequalities.

For part (ii), since
$$P(X \le a, Y \le b) = P(-X \ge -a, -Y \ge -b) = P(X > -a, Y > -b)$$
(the vector $(-X, -Y)$ having the same distribution as $(X, Y)$), the same argument applies when $a$ and $b$ are nonpositive, and the inequalities become
$$P(X \le a, Y \le b) = P(X > -a, Y > -b) \le \big(1-\Phi(-a)\big)\Big[1-\Phi\Big(\frac{-b+ra}{\sqrt{1-r^2}}\Big) + r\,\frac{\varphi(b)}{\varphi(a)}\Big(1-\Phi\Big(\frac{-a+rb}{\sqrt{1-r^2}}\Big)\Big)\Big] = \Phi(a)\Big[\Phi\Big(\frac{b-ra}{\sqrt{1-r^2}}\Big) + r\,\frac{\varphi(b)}{\varphi(a)}\,\Phi\Big(\frac{a-rb}{\sqrt{1-r^2}}\Big)\Big].$$
Thus the inequalities are established for $0 \le r < 1$ and nonpositive $a$ and $b$; the same argument works for $-1 < r \le 0$, with the inequalities reversed. $\square$

Lemma 2.2. Let $(X, Y)$ be jointly normally distributed with mean vector $\mathbf{0}$ and covariance matrix $\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}$.
(i) If $0 \le r < 1$, then for any positive $a$,
$$\big(1-\Phi(a)\big)\Big(1-\Phi\Big(a\sqrt{\tfrac{1-r}{1+r}}\Big)\Big) \le P(X > a, Y > a) \le (1+r)\big(1-\Phi(a)\big)\Big(1-\Phi\Big(a\sqrt{\tfrac{1-r}{1+r}}\Big)\Big).$$
If $-1 < r \le 0$, the inequalities are reversed.
(ii) If $0 \le r < 1$, then for any nonpositive $a$,
$$\Phi(a)\,\Phi\Big(a\sqrt{\tfrac{1-r}{1+r}}\Big) \le P(X \le a, Y \le a) \le (1+r)\,\Phi(a)\,\Phi\Big(a\sqrt{\tfrac{1-r}{1+r}}\Big).$$
If $-1 < r \le 0$, the inequalities are reversed.

Proof. This lemma is a direct consequence of Lemma 2.1, obtained by substituting $a$ for $b$. $\square$

The above two lemmas give exact expressions for the lower and upper bounds of the probability $P(X > a, Y > a)$. Next, we determine the asymptotic behavior of $P(X > a, Y > a)$ as $a$ tends to infinity; the rate of convergence is given in the following lemma.
Lemma 2.3. Let $(X, Y)$ be jointly normally distributed with mean vector $\mathbf{0}$ and covariance matrix $\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}$. We have
$$P(X > a, Y > a) = o\big(1-\Phi(a)\big) \quad \text{as } a \to \infty, \qquad (2.2)$$
and
$$P(X \le -a, Y \le -a) = o\big(\Phi(-a)\big) \quad \text{as } a \to \infty. \qquad (2.3)$$

Proof. When $a \to \infty$, $1-\Phi\big(a\sqrt{\tfrac{1-r}{1+r}}\big) \to 0$. Applying the squeeze theorem, Lemma 2.2 immediately yields equations (2.2) and (2.3). $\square$

Remark 2.4. The upper and lower bounds obtained in Lemma 2.1 are very tight and can refine the error bounds in normal approximation problems. However, such tight bounds are not necessary to prove Lemma 2.3, as can be seen as follows. Since $X$ and $Y$ are jointly normal, $X + Y$ is normal with $E(X+Y) = 0$ and $\mathrm{Var}(X+Y) = 2(1+r)$. Hence $(X+Y)/\sqrt{2(1+r)}$ is standard normal. Therefore,
$$P(X > a, Y > a) \le P(X + Y > 2a) = P\Big(\frac{X+Y}{\sqrt{2(1+r)}} > \frac{2a}{\sqrt{2(1+r)}}\Big) = 1-\Phi\Big(a\sqrt{\tfrac{2}{1+r}}\Big). \qquad (2.4)$$
Let $\lambda = \sqrt{2/(1+r)}$; obviously $\lambda > 1$. It suffices to prove that, when $\lambda > 1$,
$$1-\Phi(\lambda a) = o\big(1-\Phi(a)\big) \quad \text{as } a \to \infty.$$
When $a > 0$,
$$1-\Phi(a) = \int_a^\infty \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\, dx \le \frac{1}{a}\int_a^\infty \frac{x}{\sqrt{2\pi}}\,e^{-x^2/2}\, dx = \frac{1}{a\sqrt{2\pi}}\,e^{-a^2/2} = \frac{\varphi(a)}{a}.$$
Also,
$$1-\Phi(a) = -\int_a^\infty \frac{1}{\sqrt{2\pi}\,x}\, de^{-x^2/2} = \frac{1}{a\sqrt{2\pi}}\,e^{-a^2/2} + \int_a^\infty \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\, d(1/x) \ge \frac{\varphi(a)}{a} - \frac{1}{a^2}\int_a^\infty \varphi(x)\, dx = \frac{\varphi(a)}{a} - \frac{1-\Phi(a)}{a^2}.$$
Therefore, when $a > 0$, we obtain
$$\frac{a}{1+a^2}\,\varphi(a) \le 1-\Phi(a) \le \frac{\varphi(a)}{a}.$$
When $\lambda > 1$,
$$\frac{1-\Phi(\lambda a)}{1-\Phi(a)} \le \frac{\varphi(\lambda a)/(\lambda a)}{a\,\varphi(a)/(1+a^2)} = \frac{1+a^2}{\lambda a^2}\, e^{-\frac{(\lambda^2-1)a^2}{2}} \longrightarrow 0 \quad \text{as } a \to \infty. \qquad (2.5)$$
Equation (2.3) is a direct consequence of equation (2.2), since
$$P(X \le -a, Y \le -a) = P(-X > a, -Y > a) = P(X > a, Y > a).$$

Since we only need the asymptotic behavior of the ratio $P(X > a, Y > a)/\big(1-\Phi(a)\big)$, tight bounds for $P(X > a, Y > a)$ are unnecessary. Furthermore, Lemma 2.1 applies to standard normal random variables, while the method in the above proof can also be applied to random variables which converge weakly to standard normals. For example, if we have random variables $X_n \Rightarrow N(0,1)$ and $Y_n \Rightarrow N(0,1)$, we may obtain asymptotic results similar to Lemma 2.3. We will discuss this in later chapters.

Note that to derive the asymptotic convergence of the ratio $P(X > a, Y > a)/\big(1-\Phi(a)\big)$, the correlation of $X$ and $Y$ must be strictly between $-1$ and $1$. If we have a sequence of correlated random variables $\{Z_1, Z_2, \ldots, Z_d\}$, it is impractical to check the correlations of every pair of distinct random variables one by one. In Proposition 2.6 below, we show that a non-singular covariance matrix of $\{Z_1, Z_2, \ldots, Z_d\}$ implies that the correlation of every pair of distinct random variables equals neither $1$ nor $-1$. To prove this, we recall a well-known fact.

Theorem 2.5. Let $X$ and $Y$ be two random variables and $r$ their correlation. Then $|r| = 1$ if and only if there exist constants $a, b$ such that $Y = aX + b$ with probability 1.

We now state our proposition.

Proposition 2.6. Let $Z_1, Z_2, \ldots, Z_d$ be random variables with mean 0 and variance 1, and let $\Sigma$ and $R$ be their covariance matrix and correlation matrix respectively.
If $\Sigma$ is non-singular, then all off-diagonal entries of $R$ are strictly between $-1$ and $1$, i.e. $-1 < r_{ij} < 1$ for $i \ne j$, where $r_{ij}$ is the $(i,j)$-entry of $R$.

Proof. If there exist $Z_i$ and $Z_j$ such that $r_{ij} = \mathrm{corr}(Z_i, Z_j) = 1$, Theorem 2.5 implies that there exist constants $a$ and $b$ such that $Z_j = aZ_i + b$ with probability 1. Together with the conditions $EZ_i = EZ_j = 0$ and $\mathrm{Var}(Z_i) = \mathrm{Var}(Z_j) = 1$, this forces $a = 1$ and $b = 0$, so $Z_i = Z_j$. Consequently, the $i$th and $j$th rows of
$$\Sigma = \begin{pmatrix} \mathrm{cov}(Z_1, Z_1) & \mathrm{cov}(Z_1, Z_2) & \cdots & \mathrm{cov}(Z_1, Z_d) \\ \vdots & \vdots & & \vdots \\ \mathrm{cov}(Z_d, Z_1) & \mathrm{cov}(Z_d, Z_2) & \cdots & \mathrm{cov}(Z_d, Z_d) \end{pmatrix}$$
are identical, and $|\Sigma| = 0$ follows. If $\mathrm{corr}(Z_i, Z_j) = -1$, the sum of the $i$th and $j$th rows of $\Sigma$ is the zero vector, which also yields $|\Sigma| = 0$. Either case contradicts the assumption that $\Sigma$ is non-singular. $\square$

With Lemma 2.3 and Proposition 2.6 above, we introduce the main result of this chapter. This theorem presents the asymptotic tail distributions of both the maximum and the minimum of a sequence of normal random variables.

Theorem 2.7. Let $(Z_1, \ldots, Z_d)$ be a random vector with a multivariate normal distribution with mean vector $\mathbf{0}$ and non-singular covariance matrix $\Sigma$. Assume further that $\mathrm{Var}(Z_i) = 1$ for $1 \le i \le d$. Then
$$P\big(\max_{1\le i\le d} Z_i > c\big) \sim d\big(1-\Phi(c)\big) \quad \text{as } c \to +\infty,$$
and
$$P\big(\min_{1\le i\le d} Z_i \le c_0\big) \sim d\,\Phi(c_0) \quad \text{as } c_0 \to -\infty.$$

Proof. We shall give two proofs of this theorem.

(Proof I) Recall Bonferroni's inequalities: for any events $A_1, A_2, \ldots, A_d$,
$$\sum_{i=1}^d P(A_i) - \sum_{1\le i<j\le d} P(A_i \cap A_j) \le P\Big(\bigcup_{i=1}^d A_i\Big) \le \sum_{i=1}^d P(A_i).$$
Let $A_i = \{Z_i > c\}$, so that
$$P\big(\max_{1\le i\le d} Z_i > c\big) = P\Big(\bigcup_{i=1}^d \{Z_i > c\}\Big).$$
It follows that
$$\sum_{i=1}^d P(Z_i > c) - \sum_{1\le i<j\le d} P(Z_i > c, Z_j > c) \le P\big(\max_i Z_i > c\big) \le \sum_{i=1}^d P(Z_i > c),$$
which is equivalent to
$$d\big(1-\Phi(c)\big) - \sum_{1\le i<j\le d} P(Z_i > c, Z_j > c) \le P\big(\max_i Z_i > c\big) \le d\big(1-\Phi(c)\big),$$
that is,
$$d - \sum_{1\le i<j\le d} \frac{P(Z_i > c, Z_j > c)}{1-\Phi(c)} \le \frac{P(\max_i Z_i > c)}{1-\Phi(c)} \le d. \qquad (2.7)$$
Recall that $Z_i$ and $Z_j$, $1 \le i, j \le d$, are standard normal random variables with mean zero, and by Proposition 2.6 their correlations are strictly between $-1$ and $1$. Lemma 2.3 gives
$$P(Z_i > c, Z_j > c) = o\big(1-\Phi(c)\big) \quad \text{as } c \to \infty.$$
Due to the finiteness of the index set,
$$\sum_{1\le i<j\le d} \frac{P(Z_i > c, Z_j > c)}{1-\Phi(c)} \longrightarrow 0 \quad \text{as } c \to \infty.$$
Applying the squeeze theorem to equation (2.7), we have
$$P\big(\max_{1\le i\le d} Z_i > c\big) \sim d\big(1-\Phi(c)\big) \quad \text{as } c \to \infty.$$
Since $(Z_1, Z_2, \ldots, Z_d)$ has mean zero, $(-Z_1, -Z_2, \ldots, -Z_d)$ has the same distribution as $(Z_1, Z_2, \ldots, Z_d)$. Hence,
$$P\big(\min_i Z_i \le c_0\big) = P\big(\min_i(-Z_i) \le c_0\big) = P\big(\max_i Z_i \ge -c_0\big) \sim d\big(1-\Phi(-c_0)\big) = d\,\Phi(c_0).$$
Therefore, Theorem 2.7 is obtained.
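The statement of Theorem 2.7 is easy to check numerically. The following sketch (an illustration, not part of the thesis; the dimension, correlation and sample size are arbitrary choices) estimates $P(\max_i Z_i > c)$ by Monte Carlo and compares it with $d(1-\Phi(c))$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, r = 4, 0.3                                      # illustrative dimension and correlation
Sigma = (1 - r) * np.eye(d) + r * np.ones((d, d))  # exchangeable, non-singular
Z = rng.multivariate_normal(np.zeros(d), Sigma, size=2_000_000)
for c in (2.0, 2.5, 3.0):
    est = (Z.max(axis=1) > c).mean()               # Monte Carlo P(max_i Z_i > c)
    asym = d * (1 - norm.cdf(c))                   # d(1 - Phi(c))
    print(f"c={c}: estimate={est:.3e}, d(1-Phi(c))={asym:.3e}, ratio={est/asym:.3f}")
```

Consistent with inequality (2.7), the estimated ratio stays below 1 and creeps toward 1 as $c$ grows.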
2.2 Poisson Approximation Approach

For a set of independent events, if the probabilities of the events occurring are very small, we call them rare events. Suppose there are $n$ independent events, the $i$th occurring with probability $p_i$, $1 \le i \le n$, with the $p_i$ tending to zero. Then, for $k = 0, 1, \ldots$, the probability that exactly $k$ of these events occur is approximately $e^{-\lambda}\lambda^k/k!$, where $\lambda = \sum_i p_i$. This is known as the Poisson limit theorem. It leads to an important fact: the probability that at least one event occurs is approximately $1 - e^{-\lambda}$. It is therefore quite natural to think of using Poisson approximation here.

In this section, we provide another proof of Theorem 2.7, which employs the technique of Poisson approximation associated with the Chen-Stein method. In 1975, Chen first applied Stein's method (Stein (1972)) to Poisson approximation problems, and obtained error bounds when approximating sums of dependent Bernoulli random variables by a Poisson distribution. The Chen-Stein method has been developed successfully over the past 30 years and has produced many interesting applications (see e.g. Barbour and Chen (2005a, b)).

In Poisson approximation problems, we use the total variation distance to quantify how well one random variable approximates another. The total variation distance between two distributions is defined as
$$\|\mathcal{L}(X) - \mathcal{L}(Y)\| = \sup_A |P(X \in A) - P(Y \in A)|,$$
where $X$ and $Y$ are random variables. Suppose $\{X_\alpha : \alpha \in J\}$ are dependent Bernoulli random variables with index set $J$. Denote the probabilities of occurrence by $p_\alpha = P(X_\alpha = 1) = 1 - P(X_\alpha = 0)$. Let $W = \sum_{\alpha\in J} X_\alpha$ be the number of occurrences of the dependent events, $\lambda = EW = \sum_{\alpha\in J} p_\alpha$, and for every $\alpha \in J$ let $A_\alpha \subset J$ be a neighborhood of $\alpha$ with $\alpha \in A_\alpha$. To prove Theorem 2.7 using the Chen-Stein method, we shall apply one main result of Poisson approximation (Arratia et al. (1990)), given below.

Theorem 2.8. The total variation distance between the distribution $\mathcal{L}(W)$ of $W$ and the Poisson distribution $\mathrm{Po}(\lambda)$ with mean $\lambda$ satisfies
$$\|\mathcal{L}(W) - \mathrm{Po}(\lambda)\| \le \frac{1-e^{-\lambda}}{\lambda}\,(b_1 + b_2) + \min(1,\ 1.4\lambda^{-1/2})\,b_3 \le 2(b_1 + b_2 + b_3),$$
and
$$|P(W = 0) - e^{-\lambda}| \le (b_1 + b_2 + b_3)\,\frac{1-e^{-\lambda}}{\lambda} < \Big(1 \wedge \frac{1}{\lambda}\Big)(b_1 + b_2 + b_3),$$
where
$$b_1 = \sum_{\alpha\in J}\sum_{\beta\in A_\alpha} p_\alpha p_\beta, \qquad b_2 = \sum_{\alpha\in J}\sum_{\alpha\ne\beta\in A_\alpha} E(X_\alpha X_\beta), \qquad b_3 = \sum_{\alpha\in J} E\big|E\big(X_\alpha - p_\alpha \mid X_\beta : \beta \in A_\alpha^c\big)\big|. \qquad (2.8)$$

With Theorem 2.8, we now give the second proof of Theorem 2.7.

Proof. (Proof II) Let the finite state space be $J = \{1, \ldots, d\}$, and let $X_i = I(Z_i > c)$ be the indicator of the event $\{Z_i > c\}$. Put $p_i = P(Z_i > c) = P(X_i = 1)$, $W = \sum_{i=1}^d X_i$ and $\lambda = \sum_{i=1}^d p_i$; then the event $\{\max_{1\le i\le d} Z_i > c\} = \{W \ge 1\}$.

Next, we apply Theorem 2.8. Take $A_i$, the neighborhood of $X_i$, to be the whole index set. Then $b_3$, given by equation (2.8), vanishes, and it follows that
$$\big|P(W \ge 1) - (1-e^{-\lambda})\big| = \big|P(W = 0) - e^{-\lambda}\big| \le \Big(1 \wedge \frac{1}{\lambda}\Big)\Big(\sum_{i=1}^d\sum_{j=1}^d p_ip_j + \sum_{i=1}^d\sum_{j=1, j\ne i}^d E(X_iX_j)\Big).$$
Obviously, as $c$ tends to infinity, each $p_i$ becomes very small, and so does $\lambda$; thus $\lambda \to 0$ as $c \to \infty$. Consequently $\lambda/(1-e^{-\lambda}) \to 1$ and $1 \wedge \tfrac{1}{\lambda} = 1$. We rewrite the inequality above as
$$\Big|\frac{P(W \ge 1)}{1-e^{-\lambda}} - 1\Big| \le \frac{1}{1-e^{-\lambda}}\Big(\sum_{i=1}^d\sum_{j=1}^d p_ip_j + \sum_{i=1}^d\sum_{j=1, j\ne i}^d E(X_iX_j)\Big). \qquad (2.9)$$
Consider the second term of the upper bound in (2.9). From Lemma 2.3 and Proposition 2.6 we obtain
$$E(X_iX_j) = P(Z_i > c, Z_j > c) = o\big(1-\Phi(c)\big).$$
Since $d$ is finite, it follows that
$$\sum_{i=1}^d\sum_{j=1, j\ne i}^d E(X_iX_j) = o\big(1-\Phi(c)\big) \quad \text{as } c \to \infty.$$
Therefore, (2.9) becomes
$$\Big|\frac{P(W \ge 1)}{1-e^{-\lambda}} - 1\Big| \le \frac{1}{1-e^{-\lambda}}\big(\lambda^2 + o(\lambda)\big) \longrightarrow 0 \quad \text{as } \lambda \to 0.$$
Finally, we get
$$P\big(\max_{1\le i\le d} Z_i > c\big) \sim 1-e^{-\lambda} \sim \lambda = d\big(1-\Phi(c)\big) \quad \text{as } c \to +\infty. \qquad (2.10)$$
Similarly to the arguments for $\max Z_i$, define the indicators $Y_i = I(Z_i \le c_0)$, $q_i = P(Y_i = 1)$ and $U = \sum_{i=1}^d Y_i$. Then the event $\{\min_i Z_i \le c_0\}$ becomes $\{U \ge 1\}$. Since $\sum_i q_i$ tends to zero, it follows that
$$P\big(\min_i Z_i \le c_0\big) = P(U \ge 1) \sim \sum_i q_i = d\,\Phi(c_0) \quad \text{as } c_0 \to -\infty. \qquad \square$$

Remark 2.9. The proof using Bonferroni's inequalities is more approachable and easier to understand. We nevertheless keep the second proof in this section, because it is an interesting application of Poisson approximation associated with the Chen-Stein method.
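As an illustration of Theorem 2.8 in the setting of Proof II (a sketch with arbitrary parameters, not thesis code), one can estimate $b_1$ and $b_2$ for the exceedance indicators $X_i = I(Z_i > c)$, where $b_3 = 0$ because each neighborhood is the whole index set, and check the bound on $|P(W = 0) - e^{-\lambda}|$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, c = 4, 0.3, 2.5
Sigma = (1 - r) * np.eye(d) + r * np.ones((d, d))
Z = rng.multivariate_normal(np.zeros(d), Sigma, size=1_000_000)
X = Z > c                                    # X_i = I(Z_i > c)
p = X.mean(axis=0)                           # p_i = P(X_i = 1)
lam = p.sum()                                # lambda = sum_i p_i
b1 = np.outer(p, p).sum()                    # b1: all pairs (i, j), since A_i = J
b2 = sum((X[:, i] & X[:, j]).mean()          # b2: E(X_i X_j) over i != j
         for i in range(d) for j in range(d) if i != j)
W = X.sum(axis=1)
lhs = abs((W == 0).mean() - np.exp(-lam))
print(lhs, (b1 + b2) * (1 - np.exp(-lam)) / lam)  # lhs below the bound, up to MC error
```

Both sides here are Monte Carlo estimates, so the comparison holds up to sampling error; the point of the sketch is the shape of $b_1$ and $b_2$, not a rigorous check.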
Chapter 3
Asymptotic Results of Words in DNA

3.1 Tail Probabilities of Extrema of Sums of m-dependent Variables

Let $X_1, X_2, \ldots$ be a sequence of m-dependent random variables with $EX_k = 0$ and $EX_k^2 < \infty$, $k = 1, 2, \ldots$ We put $S_n = X_1 + \cdots + X_n$, $B_n^2 = ES_n^2$, and
$$M_{n,p}(h) = \max_{1\le k\le n} E\big(|X_k|^p\, e^{p|h||X_k|}\big), \qquad M_{n,p} = M_{n,p}(0), \qquad L_{n,p} = nM_{n,p}B_n^{-p}.$$
Denote the distribution function of $S_n/B_n$ by $F_n$, that is, $F_n(x) = P(S_n < xB_n)$.

Here, we present an important result of Heinrich (1985) on the behavior of the ratios $\big(1-F_n(x)\big)/\big(1-\Phi(x)\big)$ and $F_n(-x)/\Phi(-x)$ for $x \in [1, c\sqrt{\ln B_n^2}]$, $c > 0$, as $n \to \infty$ (so-called moderate deviations).

Theorem 3.1. Let $X_1, X_2, \ldots$ be a sequence of m-dependent random variables with $EX_k = 0$ and $E|X_k|^p < \infty$, $p = 2 + c_0^2$, for some $c_0 > 0$, and let $q = \min(p, 3)$. Take $S_n = X_1 + \cdots + X_n$, $B_n^2 = ES_n^2$. Then in the interval $1 \le x \le c\sqrt{\ln B_n^2}$, $0 < c \le c_0$, we have
$$\frac{1-F_n(x)}{1-\Phi(x)} = \frac{F_n(-x)}{\Phi(-x)} = 1 + O\bigg(\frac{m^{p+1}}{x^{p-1}B_n^{p-c^2}}\sum_{k=1}^n E\Big[|X_k|^p\, I\Big(|X_k| > \frac{B_n}{2(2m+1)x}\Big)\Big]\big(1 + m^{q-1}x^{2p-q}L_{n,q}\big) + m^{q-1}x^qL_{n,q} + m^{p-1}x^pL_{n,p}\bigg), \qquad (3.1)$$
provided $m^{q-1}x^qL_{n,q} + m^{p-1}x^pL_{n,p} \to 0$ as $n \to \infty$.

The proof of the above theorem can be found in Heinrich (1985); it is technical and is omitted here. Interested readers may consult the original paper for details.

With additional assumptions, the ratio $\big(1-F_n(x)\big)/\big(1-\Phi(x)\big)$ is asymptotically equal to 1, as shown in the following theorem.

Theorem 3.2. Let $X_1, X_2, \ldots$ be a sequence of m-dependent random variables with $EX_k = 0$ and $E|X_k|^{2p} \le C_p < \infty$, $p = 2 + c_0^2$, for some $c_0 > 0$. Then, for $B_n^2 \asymp n$, we have
$$\frac{1-F_n(x)}{1-\Phi(x)} = \frac{F_n(-x)}{\Phi(-x)} = 1 + o(1)$$
for $1 \le x \le c\sqrt{\ln n}$ and $0 < c \le c_0$, as $n \to \infty$.

Proof. From $E|X_k|^{2p} \le C_p < \infty$, we know that $M_{n,2p} = \max_k E|X_k|^{2p} \le C_p$ uniformly. In view of Theorem 3.1, what we must control is the expression inside $O(\cdot)$ on the right-hand side of equation (3.1). It can be written as $R_1 + R_2 + R_3$, where
$$R_1 = \frac{m^{p+1}}{x^{p-1}B_n^{p-c^2}}\sum_{k=1}^n E\Big[|X_k|^p\, I\Big(|X_k| > \frac{B_n}{2(2m+1)x}\Big)\Big], \qquad (3.2)$$
$$R_2 = m^{q-1}x^{2p-q}L_{n,q}\,R_1, \qquad R_3 = m^{q-1}x^qL_{n,q} + m^{p-1}x^pL_{n,p}.$$
We first prove that $R_3 \to 0$ as $n \to \infty$.
Note that
$$x^pL_{n,p} = x^p\,\frac{nM_{n,p}}{B_n^p} \le C_p\,\frac{n(\ln B_n^2)^{p/2}}{B_n^p} \asymp \frac{(\ln n)^{p/2}}{n^{p/2-1}} = \Big(\frac{\ln n}{n^{1-2/p}}\Big)^{p/2}.$$
Since $1 - 2/p > 0$, the ratio $(\ln n)/n^{1-2/p}$ converges to 0 as $n \to \infty$. Similarly, whether $q$ equals $p$ or $3$, the statement $x^qL_{n,q} \to 0$ holds. Therefore,
$$m^{q-1}x^qL_{n,q} + m^{p-1}x^pL_{n,p} \to 0 \quad \text{as } n \to \infty.$$

Next, we show that $R_1 \to 0$ as $n \to \infty$. By the Cauchy-Schwarz and Chebyshev inequalities,
$$R_1 \le \frac{m^{p+1}B_n^{c^2}}{x^{p-1}B_n^p}\sum_{k=1}^n \big(E|X_k|^{2p}\big)^{1/2}\Big[P\Big(|X_k| > \frac{B_n}{2(2m+1)x}\Big)\Big]^{1/2} \le \frac{m^{p+1}B_n^{c^2}}{x^{p-1}B_n^p}\cdot\frac{2(2m+1)x}{B_n}\cdot n\,(M_{n,2p})^{1/2}\,\max_k\big(E|X_k|^2\big)^{1/2} \le C\,x^{2-p}\,n\,B_n^{c^2-p-1} \asymp C\,x^{2-p}\,n^{-(1+c_0^2-c^2)/2}.$$
Since $x \ge 1$ and $p > 2$, and $1 + c_0^2 - c^2 > 0$, we conclude that $R_1 \to 0$ as $n \to \infty$.

Finally, we consider $R_2$. From the above discussion, we only need to examine the factor $m^{q-1}x^{2p-q}L_{n,q}$. When $q = p$, $x^{2p-q}L_{n,q} = x^pL_{n,p} \to 0$, as proved earlier. When $q = 3$, a similar argument gives
$$x^{2p-3}L_{n,3} = x^{2p-3}\,\frac{nM_{n,3}}{B_n^3} \asymp \frac{(\ln n)^{p-3/2}}{n^{1/2}} \longrightarrow 0 \quad \text{as } n \to \infty,$$
since $p > 3$ in that case. Therefore, under the additional assumption $B_n^2 \asymp n$,
$$\frac{1-F_n(x)}{1-\Phi(x)} = 1 + o(1)$$
as $n \to \infty$ with $1 \le x \le c\sqrt{\ln n}$. $\square$

In the above theorems, moderate deviations for m-dependent random variables are studied. We know that, when $B_n^2 \asymp n$ is satisfied, the tail probability of $S_n/B_n$ is approximated by the tail probability of a standard normal distribution as $x$ and $n$ tend to infinity with $x \le c\sqrt{\ln n}$. To obtain the asymptotic results on over- and under-representation of DNA word counts, we shall study the distribution functions of word counts in DNA sequences and the normality of these word counts; Theorem 3.2 will be applied. To begin with, however, it is crucial to investigate the properties of the means and covariance matrix of a set of word counts. From these we easily obtain a set of centered and standardized scores of the word counts.

Now, consider a DNA sequence $A = A_1A_2\cdots A_n$ with independent and identically distributed letters from the state space $\mathcal{A} = \{A, C, G, T\}$, the letters being picked from $\mathcal{A}$ with probabilities $(p_A, p_C, p_G, p_T)$. Let $u = u_1u_2\cdots u_k$ and $v = v_1v_2\cdots v_k$ be two k-tuple words with letters $u_i, v_i \in \mathcal{A}$. The indicator that the word u occurs at starting position i in the sequence $A_1\cdots A_n$ is denoted by
$$I_u(i) = I(A_iA_{i+1}\cdots A_{i+k-1} = u).$$
Then
$$N_u = N_u(n) = \sum_{i=1}^{n-k+1} I_u(i)$$
denotes the number of occurrences of the word u in the sequence A. Since A is i.i.d., it is easy to derive that $E\big(I_u(i)\big) = p_{u_1}p_{u_2}\cdots p_{u_k}$. Let $\pi_u = p_{u_1}p_{u_2}\cdots p_{u_k}$; obviously $\pi_u$ does not depend on the position i.

In the following two theorems, we not only prove the existence of the limiting means and covariances of word counts, but also give their exact expressions. These expressions facilitate the computation of the score functions.

Theorem 3.3. For an i.i.d. DNA sequence $A_1A_2\cdots A_n$ and a word u, let $N_u(n)$ be its word count. Then
$$\lim_{n\to\infty}\frac{1}{n}E\big(N_u(n)\big) = \pi_u.$$

To calculate the covariance of $N_u$ and $N_v$, we must account for the dependence between u and v when they overlap. Define the overlap bit
$$\beta_{u,v}(j) = I(u_{j+1} = v_1, \ldots, u_k = v_{k-j}).$$
When $0 \le j - i < k$, we have
$$E\big(I_u(i)I_v(j)\big) = p_{u_1}\cdots p_{u_k}\,\beta_{u,v}(j-i)\,p_{v_{k-j+i+1}}\cdots p_{v_k} = \pi_u\,\beta_{u,v}(j-i)\,p_{v_{k-j+i+1}}\cdots p_{v_k}.$$
When $j - i \ge k$, u and v do not overlap, and
$$E\big(I_u(i)I_v(j)\big) = \pi_u\pi_v.$$
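The overlap bit has a one-line implementation; a small sketch (illustrative, matching the definition just given for equal-length words):

```python
def overlap_bit(u: str, v: str, j: int) -> int:
    """beta_{u,v}(j) = I(u_{j+1} = v_1, ..., u_k = v_{k-j}):
    does v, shifted j positions to the right, match the tail of u?"""
    k = len(u)
    return int(u[j:] == v[:k - j])

# self-overlaps of AAA at every shift, none for ACT:
print([overlap_bit("AAA", "AAA", j) for j in (1, 2)])  # [1, 1]
print([overlap_bit("ACT", "ACT", j) for j in (1, 2)])  # [0, 0]
```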
Theorem 3.4. For an i.i.d. sequence $A_1A_2\cdots A_n$ and words u and v of length k, we have
$$\lim_{n\to\infty}\frac{1}{n}\,\mathrm{cov}\big(N_u(n), N_v(n)\big) = \pi_u\sum_{j=0}^{k-1}\beta_{u,v}(j)P_v(j) - k\pi_u\pi_v + \pi_v\sum_{j=0}^{k-1}\beta_{v,u}(j)P_u(j) - k\pi_v\pi_u - \pi_u\big(\beta_{u,v}(0) - \pi_v\big),$$
with
$$P_u(j) = \begin{cases} 1 & j = 0,\\ p_{u_{k-j+1}}\cdots p_{u_k} & j = 1, \ldots, k-1,\end{cases} \qquad P_v(j) = \begin{cases} 1 & j = 0,\\ p_{v_{k-j+1}}\cdots p_{v_k} & j = 1, \ldots, k-1.\end{cases} \qquad (3.3)$$

Proof. From the definition of covariance, we have
$$\mathrm{cov}\big(N_u(n), N_v(n)\big) = \sum_{i=1}^{n-k+1}\sum_{j=1}^{n-k+1}\mathrm{cov}\big(I_u(i), I_v(j)\big) = \sum_{j=0}^{k-1}\sum_i\Big(E\big(I_u(i)I_v(i+j)\big) - \pi_u\pi_v\Big) + \sum_{j\ge k}\sum_i\Big(E\big(I_u(i)I_v(i+j)\big) - \pi_u\pi_v\Big) + \sum_{j=0}^{k-1}\sum_i\Big(E\big(I_v(i)I_u(i+j)\big) - \pi_v\pi_u\Big) + \sum_{j\ge k}\sum_i\Big(E\big(I_v(i)I_u(i+j)\big) - \pi_v\pi_u\Big) - \sum_{i=1}^{n-k+1}\mathrm{cov}\big(I_u(i), I_v(i)\big).$$
Here, no matter where u and v appear in the sequence, the first four terms on the right-hand side cover all possible overlap and non-overlap cases. However, when u and v overlap completely (offset $j = 0$), their covariance is counted twice, once in the first term and once in the third; the fifth term subtracts these redundant terms. Since we already have the expressions for $E\big(I_u(i)I_v(j)\big)$ in both the overlap and non-overlap cases, and under M0 the non-overlap terms vanish, we can rewrite the covariance as
$$\mathrm{cov}\big(N_u(n), N_v(n)\big) = \sum_{j=0}^{k-1}\sum_i\big(\pi_u\beta_{u,v}(j)P_v(j) - \pi_u\pi_v\big) + \sum_{j=0}^{k-1}\sum_i\big(\pi_v\beta_{v,u}(j)P_u(j) - \pi_v\pi_u\big) - (n-k+1)\,\pi_u\big(\beta_{u,v}(0) - \pi_v\big).$$
Dividing by n and taking the limit term by term, we obtain the theorem. $\square$

This theorem also leads to the following corollary, which gives the limiting variances of all the word counts. These variances are used to standardize the word counts, as we shall see in a moment.

Corollary 3.5. For an i.i.d. DNA sequence $A_1A_2\cdots A_n$ and a word u, let $N_u(n)$ be its word count. Then
$$\lim_{n\to\infty} n^{-1}\mathrm{Var}\big(N_u(n)\big) = \Big(2\sum_{j=1}^{k-1}\beta_{u,u}(j)P_u(j) + 1\Big)\pi_u - (2k-1)\,\pi_u^2. \qquad (3.4)$$

Proof. This is immediate from Theorem 3.4, noting that $\mathrm{Var}\big(N_u(n)\big) = \mathrm{cov}\big(N_u(n), N_u(n)\big)$. $\square$

Example 3.6. Take the 3-tuple words $u = AAA$, $v = ACA$, $w = ACT$ in a DNA sequence $A_1A_2\cdots$ with i.i.d. letters. Then
$$\frac{1}{n}\mathrm{Var}\big(N_u(n)\big) \approx (2p_A + 2p_A^2 + 1)\,p_A^3 - 5p_A^6,$$
$$\frac{1}{n}\mathrm{Var}\big(N_v(n)\big) \approx (2p_Cp_A + 1)\,p_Ap_Cp_A - 5(p_Ap_Cp_A)^2,$$
$$\frac{1}{n}\mathrm{Var}\big(N_w(n)\big) \approx p_Ap_Cp_T - 5(p_Ap_Cp_T)^2,$$
where n is the sequence length, which is usually very large.

Theorems 3.3 and 3.4 show the existence of the limiting mean and covariance matrix for a set of word counts. We shall derive the asymptotic normality of word counts under the M0 model. However, since the M0 model is a special case of the M1 model, the asymptotic normality of word counts under the M1 model, which we prove in Theorem 3.19, immediately yields the normality under M0. Indeed:

Theorem 3.7. Let $A_1A_2\cdots$ be a sequence of independent and identically distributed random variables. Let $\{u_1, u_2, \ldots, u_d\}$ be a set of words and $\mathbf{N} = \big(N_1(n), N_2(n), \ldots, N_d(n)\big)$ be the count vector. Then $n^{-1}\mathbf{N}$ is asymptotically normal with mean $\pi$ and non-singular covariance matrix $n^{-1}\Sigma$, where $\pi = (\pi_1, \ldots, \pi_d)$ and $\Sigma = (\sigma_{ij})_{d\times d}$ is the limiting covariance matrix with $\sigma_{ij} = \lim_{n\to\infty} n^{-1}\mathrm{cov}(N_i, N_j)$.

Now, consider the set of all DNA words of length k; take the index set $\Delta = \{1, 2, \ldots, d\}$. In an i.i.d. sequence $A_1A_2\cdots A_n$, denote by $I_i(j)$ the indicator of word i occurring at starting position j, for $i \in \Delta$. Then $\pi_i$ is the equilibrium probability of word i, and $N_i(n) = \sum_{j=1}^{n-k+1} I_i(j)$ is the total number of occurrences of word i, for all $i \in \Delta$. When $|m - l| \ge k$, $I_i(l)$ is independent of $I_i(m)$; thus, for fixed i, $\{I_i(1), I_i(2), \ldots\}$ is a sequence of m-dependent random variables. As we are interested in asymptotic behavior, that is, $n \to \infty$, we write n for $n - k + 1$ in the rest of this section.

Take
$$S_i(n) = \sum_{j=1}^n \big(I_i(j) - \pi_i\big) = N_i(n) - EN_i(n), \qquad i \in \Delta.$$
Clearly $B_i^2(n) = ES_i(n)^2 = \mathrm{Var}\big(N_i(n)\big)$. From Corollary 3.5, the limiting variance of $N_i$ exists; thus $B_i(n)^2 \asymp n$ for all $i \in \Delta$. Take the standardized score
$$\xi_i = \xi_i(n) = \frac{N_i(n) - EN_i(n)}{\sqrt{\mathrm{Var}\big(N_i(n)\big)}} = \frac{S_i(n)}{B_i(n)}, \qquad i \in \Delta. \qquad (3.5)$$
The asymptotic behavior of the ratio $P\big(\xi_i(n) \ge x\big)/\big(1-\Phi(x)\big)$ is described in the following theorem.

Theorem 3.8. Suppose $\Delta = \{1, 2, \ldots, d\}$ indexes the set of all k-tuple DNA words. For any $i \in \Delta$,
$$\frac{P\big(\xi_i(n) \ge x\big)}{1-\Phi(x)} \longrightarrow 1 \qquad (3.6)$$
as $n \to \infty$ with $1 \le x \le c\sqrt{\ln n}$.

Proof. Let $X_j = I_i(j) - \pi_i$ in Theorem 3.2. Then $E|X_j|^{2p} \le C_p < \infty$ for $1 \le p < \infty$. Since $B_i(n)^2 = \mathrm{Var}\big(N_i(n)\big) \asymp n$, Theorem 3.2 implies this theorem. $\square$

Theorem 3.8 tells us that the probability $P(\xi_i \ge x)$ can be approximated by the tail probability of a standard normal distribution provided that x lies in a suitable range.
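Formula (3.4) is straightforward to evaluate; the following sketch (illustrative code, with the overlap bit inlined and uniform letter probabilities as a stand-in) reproduces the first expression in Example 3.6:

```python
from math import prod

def limit_var_M0(u: str, p: dict) -> float:
    """Limit of Var(N_u(n))/n under M0, per Corollary 3.5 / equation (3.4)."""
    k = len(u)
    pi_u = prod(p[a] for a in u)
    s = sum((u[j:] == u[:k - j]) * prod(p[a] for a in u[k - j:])  # beta * P_u(j)
            for j in range(1, k))
    return (2 * s + 1) * pi_u - (2 * k - 1) * pi_u ** 2

p = dict(A=0.25, C=0.25, G=0.25, T=0.25)  # illustrative letter probabilities
pa = p["A"]
print(limit_var_M0("AAA", p))
print((2*pa + 2*pa**2 + 1)*pa**3 - 5*pa**6)  # Example 3.6, same value
```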
Using Theorem 3.8, we shall study the asymptotic tail probabilities of the extrema of $\{\xi_i : 1 \le i \le d\}$ in the following theorem.

Theorem 3.9. Let $A_1A_2\cdots$ be independent and identically distributed random variables over $\mathcal{A}$. Let Δ be a set of words, $N_1(n), \ldots, N_d(n)$ their word counts and $\xi_1(n), \ldots, \xi_d(n)$ the standardized scores. Assume that the correlation matrix of the DNA word counts is non-singular. Then
$$P\big(\max_{1\le i\le d}\xi_i(n) \ge x\big) \sim d\big(1-\Phi(x)\big) \qquad (3.7)$$
and
$$P\big(\min_{1\le i\le d}\xi_i(n) \le -x\big) \sim d\,\Phi(-x) \qquad (3.8)$$
as $n \to \infty$ and $x \to \infty$ with $1 \le x \le c\sqrt{\ln n}$.

Proof. By Bonferroni's inequalities, as in the proof of Theorem 2.7,
$$\sum_{i=1}^d P\big(\xi_i \ge x\big) - \sum_{1\le i<j\le d} P\big(\xi_i \ge x, \xi_j \ge x\big) \le P\big(\max_i\xi_i \ge x\big) \le \sum_{i=1}^d P\big(\xi_i \ge x\big). \qquad (3.9)$$
For the pairwise terms, let $r_{ij} = r_{ij}(n)$ denote the correlation of $\xi_i(n)$ and $\xi_j(n)$; since the correlation matrix of the word counts is non-singular, Proposition 2.6 gives $-1 < r_{ij} < 1$. Note that $\xi_i + \xi_j$ is again a standardized sum of m-dependent random variables with $\mathrm{Var}(\xi_i + \xi_j) = 2(1 + r_{ij})$, so Theorem 3.2 applies to $(\xi_i + \xi_j)/\sqrt{2(1+r_{ij})}$ as well. Put $\lambda_n = \sqrt{2/(1+r_{ij})} > 1$. Analogous to equation (2.5), we have
$$\frac{1-\Phi(\lambda_nx)}{1-\Phi(x)} \le \frac{\varphi(\lambda_nx)/(\lambda_nx)}{x\,\varphi(x)/(1+x^2)} < \frac{1+x^2}{\lambda_nx^2}\, e^{-\frac{(\lambda_n^2-1)x^2}{2}} \longrightarrow 0,$$
as $x \to \infty$ and $n \to \infty$. Then
$$P\big(\xi_i \ge x, \xi_j \ge x\big) \le P\big(\xi_i + \xi_j \ge 2x\big) = P\Big(\frac{\xi_i+\xi_j}{\sqrt{2(1+r_{ij})}} \ge \frac{2x}{\sqrt{2(1+r_{ij})}}\Big) \sim 1-\Phi\Big(x\sqrt{\tfrac{2}{1+r_{ij}}}\Big) = o\big(1-\Phi(x)\big)$$
as $x \to \infty$. We rewrite equation (3.9) as
$$\sum_{i=1}^d \frac{P(\xi_i \ge x)}{1-\Phi(x)} - \sum_{1\le i<j\le d} \frac{P(\xi_i \ge x, \xi_j \ge x)}{1-\Phi(x)} \le \frac{P(\max_i\xi_i \ge x)}{1-\Phi(x)} \le \sum_{i=1}^d \frac{P(\xi_i \ge x)}{1-\Phi(x)}.$$
By Theorem 3.8, each term of the outer sums tends to 1, while each pairwise term tends to 0; hence $P(\max_i\xi_i \ge x)/\big(1-\Phi(x)\big) \to d$, which is (3.7). Applying the lower-tail part of Theorem 3.2, $F_n(-x)/\Phi(-x) \to 1$, in the same way yields (3.8). $\square$

In Chapter 4 we shall see that results consistent with equations (3.7) and (3.8) are obtained when we simulate DNA sequences of finite length.

3.2 Asymptotic Normality of Markov Chains

In studies of DNA sequences, we usually assume that the base pairs in DNA sequences are generated by Markov chains, i.e. the letters in every DNA sequence form a Markov chain. To reduce the complexity of computation, we shall throughout assume that the Markov chains are of first order. Recall that $I_u(i)$ denotes the indicator of a finite word u occurring at position i in the DNA sequence. The goal of this section is to derive the asymptotic normality of the joint distribution of a finite set of word counts under M1-modelled sequences. We begin by deducing a central limit theorem for the sums of the word counts.

The asymptotic normality of sums of both independent and dependent random variables has been extensively studied in the literature. For instance, Billingsley (1995) deduced the central limit theorem for dependent random variables under the assumption that the sequence of random variables is stationary and α-mixing. The definition of an α-mixing sequence is given below.

Definition 3.11. Denote $\mathcal{M}_{a,b} = \sigma(X_l : a \le l \le b)$. The sequence $\{X_n\}$ is said to be α-mixing if
$$\alpha_n = \sup_{k\ge 1}\,\sup\big\{|P(A \cap B) - P(A)P(B)| : A \in \mathcal{M}_{1,k},\ B \in \mathcal{M}_{k+n,\infty}\big\} \longrightarrow 0 \quad \text{as } n \to \infty.$$

Theorem 3.12. Suppose that $X_1, X_2, \ldots$ is stationary and α-mixing with $\alpha_n = O(n^{-5})$, and that $EX_n = 0$ and $EX_n^{12} < \infty$. If $S_n = X_1 + \cdots + X_n$, then the series $\sigma^2 = EX_1^2 + 2\sum_{k=1}^\infty EX_1X_{1+k}$ converges absolutely, and if $\sigma^2 > 0$, then $S_n/(\sigma\sqrt{n}) \Rightarrow N(0, 1)$.

To derive the central limit theorem for word counts under the M1 model, we shall first prove the existence of the limiting variances and covariance matrix of the word counts. We now introduce some necessary notation. Consider a DNA sequence $A_1A_2\cdots$ generated by a stationary, irreducible and aperiodic Markov chain with state space $\mathcal{A} = \{A, C, G, T\}$. The probability of having letter b follow a is
$$p_{a,b} = P(A_m = b \mid A_1 = a_1, \ldots, A_{m-2} = a_{m-2}, A_{m-1} = a) = P(A_m = b \mid A_{m-1} = a),$$
and $p_{a,b}^{(k)} = P(A_{m+k} = b \mid A_m = a)$ denotes the probability of moving from a to b in k steps.

Lemma 3.13. Assume the Markov chain $\{A_i\}$ is irreducible and aperiodic. Then there exists a unique stationary distribution π, independent of the initial distribution, and constants $K \ge 0$ and $0 \le \rho < 1$, such that
$$|p_{a,b}^{(n)} - \pi_b| \le K\rho^n. \qquad (3.12)$$

Let u and v be two words of lengths k and l; write $u = u_1\cdots u_k$ and $v = v_1\cdots v_l$. We recall the notation of Section 3.1: $I_u(i) = I(A_iA_{i+1}\cdots A_{i+k-1} = u)$ is the indicator that the word u occurs at starting position i, and
$$N_u = N_u(n) = \sum_{i=1}^{n-k+1} I_u(i)$$
denotes the number of occurrences of u in the sequence $A_1\cdots A_n$.

To derive the central limit theorem for the word counts, we need to calculate their means. The first moments of $I_u$ are easy to derive:
$$E\big(I_u(i)\big) = \pi_{u_1}P_u(k-1), \qquad (3.13)$$
where $P_u(k-1) = p_{u_1,u_2}p_{u_2,u_3}\cdots p_{u_{k-1},u_k}$. More generally, we define $P_u(l)$ as the probability of the occurrence of $u_{k-l+1}\cdots u_k$ given that $u_1\cdots u_{k-l}$ occurs:
$$P_u(l) = p_{u_{k-l},u_{k-l+1}}\,p_{u_{k-l+1},u_{k-l+2}}\cdots p_{u_{k-1},u_k}.$$

Theorem 3.14. For a sequence $A_1A_2\cdots$ under the M1 model, we have
$$\lim_{n\to\infty}\frac{1}{n}E\big(N_u(n)\big) = \pi_u, \quad \text{where } \pi_u = \pi_{u_1}P_u(k-1).$$

Proof. Since equation (3.13) gives $E\big(I_u(i)\big) = \pi_{u_1}P_u(k-1) = \pi_u$, we have
$$\frac{1}{n}E\big(N_u(n)\big) = \frac{n-k+1}{n}\,\pi_u \longrightarrow \pi_u \quad \text{as } n \to \infty. \qquad \square$$

The next step toward a central limit theorem is to calculate the covariance of $\big(N_u(n), N_v(n)\big)$. First, recall the overlap bit $\beta_{u,v}(j) = I(u_{j+1} = v_1, \ldots, u_k = v_{k-j})$. When $i \le j$ and $j - i < k$,
$$E\big(I_u(i)I_v(j)\big) = E\big(I_u(i)\big)\,\beta_{u,v}(j-i)\,P_v(j-i+l-k). \qquad (3.14)$$
When $j - i \ge k$, the words u and v do not overlap, and u is followed by $j - i - k$ letters before v occurs, so
$$E\big(I_u(i)I_v(j)\big) = E\big(I_u(i)\big)\,p_{u_k,v_1}^{(j-i-k+1)}\,P_v(l-1). \qquad (3.15)$$
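The quantities $p^{(n)}_{a,b}$ appearing in (3.15), and the geometric convergence (3.12) that controls them, are easy to compute numerically. A sketch (with an arbitrary illustrative transition matrix, not the HCMV estimates used later):

```python
import numpy as np

P = np.array([[0.25, 0.30, 0.25, 0.20],   # illustrative transition matrix,
              [0.20, 0.25, 0.35, 0.20],   # rows and columns indexed by A, C, G, T
              [0.20, 0.30, 0.25, 0.25],
              [0.15, 0.30, 0.30, 0.25]])
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])  # left eigenvector for eigenvalue 1
pi /= pi.sum()                                        # stationary distribution
Pn = np.eye(4)
for n in range(1, 8):
    Pn = Pn @ P                                       # P^n holds the p^{(n)}_{a,b}
    print(n, np.abs(Pn - pi).max())                   # geometric decay, cf. (3.12)
```

The printed maxima of $|p^{(n)}_{a,b} - \pi_b|$ shrink by a roughly constant factor per step, which is exactly the $K\rho^n$ behavior of Lemma 3.13.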
The following proposition gives an explicit expression for the covariance of $(N_u, N_v)$.

Proposition 3.15. For a sequence $A_1A_2\cdots$ under the M1 model and words u and v of lengths k and l respectively, with $|u| = k \le |v| = l$, we have
$$\mathrm{cov}\big(N_u(n), N_v(n)\big) = \sum_{j=0}^{k-1}\sum_{i=1}^{n-l-j+1}\pi_u\big(\beta_{u,v}(j)P_v(j+l-k) - \pi_v\big) + \sum_{j=0}^{l-1}\sum_{i=1}^{n-l-j+1}\pi_v\big(\beta_{v,u}(j)P_u(j+k-l) - \pi_u\big) + \sum_{j=0}^{n-k-l}\sum_{i=1}^{n-l-j-k+1}\pi_u\big(p_{u_k,v_1}^{(j+1)}P_v(l-1) - \pi_v\big) + \sum_{j=0}^{n-k-l}\sum_{i=1}^{n-l-j-k+1}\pi_v\big(p_{v_l,u_1}^{(j+1)}P_u(k-1) - \pi_u\big) - \sum_{i=1}^{n-l+1}\mathrm{cov}\big(I_u(i), I_v(i)\big) + \sum_{j=1}^{n-l+1}\sum_{i=n-l+2}^{n-k+1}\mathrm{cov}\big(I_u(i), I_v(j)\big).$$

Proof. When $0 \le j - i < k$, equation (3.14) implies
$$\mathrm{cov}\big(I_u(i), I_v(j)\big) = E\big(I_u(i)I_v(j)\big) - E\big(I_u(i)\big)E\big(I_v(j)\big) = E\big(I_u(i)\big)\Big(\beta_{u,v}(j-i)P_v(j-i+l-k) - E\big(I_v(j)\big)\Big).$$
Similarly, for the case $0 \le i - j < l$,
$$\mathrm{cov}\big(I_u(i), I_v(j)\big) = E\big(I_v(j)\big)\Big(\beta_{v,u}(i-j)P_u(i-j+k-l) - E\big(I_u(i)\big)\Big).$$
By (3.15), for $j - i \ge k$,
$$\mathrm{cov}\big(I_u(i), I_v(j)\big) = E\big(I_u(i)\big)\Big(p_{u_k,v_1}^{(j-i-k+1)}P_v(l-1) - E\big(I_v(j)\big)\Big),$$
and similarly, for $i - j \ge l$,
$$\mathrm{cov}\big(I_u(i), I_v(j)\big) = E\big(I_v(j)\big)\Big(p_{v_l,u_1}^{(i-j-l+1)}P_u(k-1) - E\big(I_u(i)\big)\Big).$$
Combining these four identities gives the decomposition displayed in the proposition, with $E(I_u(i)) = \pi_u$ and $E(I_v(j)) = \pi_v$ for all i, j. In the first term, the word u starts at position i and v at $i + j$, so $0 \le j - i = j < k$ for $j = 0, \ldots, k-1$. In the second term, u starts at $i + j$ and v at i, so $0 \le i - j = j < l$ for $j = 0, \ldots, l-1$. In the third term, u starts at i and v at $i + j + k$, so $j - i = j + k \ge k$. Similarly, one verifies $i - j \ge l$ in the fourth term. It is also easy to observe that the $j = 0$ cases of the first two terms coincide: these are the totally overlapped terms, where the two words start at the same position. Being counted twice, these redundant terms must be subtracted, as in the fifth term. Because of our assumption $|u| = k \le |v| = l$, the terms where the starting point i lies between $n - l + 2$ and $n - k + 1$ must also be included in our sum; the last term is added to make the decomposition complete. $\square$
Using the decomposition in the above proposition, we obtain the asymptotic covariance of each pair of DNA words, as shown in the following theorem.

Theorem 3.16. For a sequence $A_1A_2\cdots$ under the M1 model and words u and v of lengths k and l respectively with $|u| = k \le |v| = l$,
$$\lim_{n\to\infty}\frac{1}{n}\,\mathrm{cov}\big(N_u(n), N_v(n)\big) = \pi_u\sum_{i=0}^{k-1}\big(\beta_{u,v}(i)P_v(i+l-k) - \pi_v\big) + \pi_v\sum_{i=0}^{l-1}\big(\beta_{v,u}(i)P_u(i+k-l) - \pi_u\big) + \pi_uP_v(l-1)\sum_{j=0}^{\infty}\big(p_{u_k,v_1}^{(j+1)} - \pi_{v_1}\big) + \pi_vP_u(k-1)\sum_{j=0}^{\infty}\big(p_{v_l,u_1}^{(j+1)} - \pi_{u_1}\big) - \pi_u\big(\beta_{u,v}(0)P_v(l-k) - \pi_v\big).$$

Proof. We derive the limit term by term from Proposition 3.15. For the first term,
$$\lim_{n\to\infty} n^{-1}\sum_{j=0}^{k-1}\sum_{i=1}^{n-l-j+1}\pi_u\big(\beta_{u,v}(j)P_v(j+l-k) - \pi_v\big) = \sum_{j=0}^{k-1}\lim_{n\to\infty}\frac{n-l-j+1}{n}\,\pi_u\big(\beta_{u,v}(j)P_v(j+l-k) - \pi_v\big) = \pi_u\sum_{j=0}^{k-1}\big(\beta_{u,v}(j)P_v(j+l-k) - \pi_v\big).$$
The second term has a similar limit, with the roles of u and v interchanged. Using the dominated convergence theorem (the summands are dominated by a summable geometric sequence, by Lemma 3.13), we obtain
$$\lim_{n\to\infty} n^{-1}\sum_{j=0}^{n-k-l}\sum_{i=1}^{n-l-j-k+1}\pi_u\big(p_{u_k,v_1}^{(j+1)}P_v(l-1) - \pi_v\big) = \sum_{j=0}^{\infty}\pi_u\big(p_{u_k,v_1}^{(j+1)} - \pi_{v_1}\big)P_v(l-1) = \pi_uP_v(l-1)\sum_{j=0}^{\infty}\big(p_{u_k,v_1}^{(j+1)} - \pi_{v_1}\big),$$
since $\pi_v = \pi_{v_1}P_v(l-1)$. The fourth term has a similar form. Taking $j = 0$ in the first term, we obtain the fifth term:
$$\lim_{n\to\infty} n^{-1}\sum_{i=1}^{n-l+1}\mathrm{cov}\big(I_u(i), I_v(i)\big) = \pi_u\big(\beta_{u,v}(0)P_v(l-k) - \pi_v\big).$$
Finally, for the last term: for each fixed i, $\sum_{j=1}^\infty\big|\mathrm{cov}\big(I_u(i), I_v(j)\big)\big|$ is bounded by a constant, since at most $k + l$ of the pairs overlap and the non-overlapping terms decay geometrically by Lemma 3.13; as the outer sum runs over at most $l - k$ values of i, the whole term is bounded by a constant K, and
$$\lim_{n\to\infty} n^{-1}\Big|\sum_{j}\sum_{i=n-l+2}^{n-k+1}\mathrm{cov}\big(I_u(i), I_v(j)\big)\Big| \le \lim_{n\to\infty}\frac{K}{n} = 0.$$
Together with all the discussions above, the theorem is proved. $\square$

The following corollary is a direct consequence of Theorem 3.16; it gives the exact expression for the limiting variance of each word count in DNA sequences under the M1 model.

Corollary 3.17. For a sequence $A_1A_2\cdots$ under the M1 model and a DNA word u of length k, we have
$$\lim_{n\to\infty}\frac{1}{n}\,\mathrm{Var}\big(N_u(n)\big) = \Big(2\sum_{j=1}^{k-1}\beta_{u,u}(j)P_u(j) + 1\Big)\pi_u - (2k-1)\,\pi_u^2 + 2\pi_uP_u(k-1)\sum_{j=1}^{\infty}\big(p_{u_k,u_1}^{(j)} - \pi_{u_1}\big). \qquad (3.16)$$
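For later computations (cf. Section 4.2), the limiting variance (3.16) can be evaluated by truncating the series over j, which Lemma 3.13 justifies through its geometric decay. A sketch (illustrative, not thesis code; the truncation point J = 50 anticipates the choice made in Chapter 4):

```python
import numpy as np

def limit_var_M1(u, P, pi, letters="ACGT", J=50):
    """Limit of Var(N_u(n))/n under M1, per Corollary 3.17 / equation (3.16),
    with the series over j truncated at J (geometric decay by Lemma 3.13)."""
    ix = {a: i for i, a in enumerate(letters)}
    k = len(u)
    def P_u(j):  # P_u(j): probability of the last j transitions of u
        return np.prod([P[ix[u[i]], ix[u[i + 1]]] for i in range(k - 1 - j, k - 1)])
    pi_u = pi[ix[u[0]]] * P_u(k - 1)
    s = sum((u[j:] == u[:k - j]) * P_u(j) for j in range(1, k))  # beta * P_u(j)
    Pj, tail = np.eye(len(letters)), 0.0
    for _ in range(J):                       # sum of (p^{(j)}_{u_k,u_1} - pi_{u_1})
        Pj = Pj @ P
        tail += Pj[ix[u[-1]], ix[u[0]]] - pi[ix[u[0]]]
    return (2 * s + 1) * pi_u - (2 * k - 1) * pi_u ** 2 + 2 * pi_u * P_u(k - 1) * tail
```

With the transition matrix P of a reference genome and its stationary distribution pi, a call such as `limit_var_M1("AAA", P, pi)` gives the per-letter variance used to standardize the count of AAA.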
If the DNA sequence $A_1A_2\cdots$ is generated by a first-order stationary Markov chain, then the transition from the word $A_iA_{i+1}\cdots A_{i+k-1} = u_1u_2\cdots u_k$ to the word $A_{i+1}\cdots A_{i+k} = v_1v_2\cdots v_k$ is itself a Markov chain. Therefore, for $i \in \Lambda$, the indicators $\{I_i(j) : j = 1, 2, \ldots\}$ are driven by a Markov chain. Theorem 3.12, however, gives the central limit theorem for random variables in α-mixing sequences. We therefore study the relation between α-mixing sequences and Markov chains, in order to derive the asymptotic normality of random variables driven by Markov chains.

Suppose $\{Y_n : n = 1, 2, \ldots\}$ is a stationary, irreducible and aperiodic Markov chain with finite state space and positive transition probabilities $p_{ij}$. Lemma 3.13 implies that, with stationary initial probabilities $\pi_i$, $|p_{ij}^{(n)} - \pi_j| \le K\rho^n$ with $\rho < 1$. Moreover, it is obvious that
$$P(Y_1 = i_1, \ldots, Y_k = i_k, Y_{k+n} = j_0, \ldots, Y_{k+n+l} = j_l) = \pi_{i_1}p_{i_1i_2}\cdots p_{i_{k-1}i_k}\,p_{i_kj_0}^{(n)}\,p_{j_0j_1}\cdots p_{j_{l-1}j_l},$$
and
$$P(Y_1 = i_1, \ldots, Y_k = i_k)\,P(Y_{k+n} = j_0, \ldots, Y_{k+n+l} = j_l) = \pi_{i_1}p_{i_1i_2}\cdots p_{i_{k-1}i_k}\,\pi_{j_0}\,p_{j_0j_1}\cdots p_{j_{l-1}j_l}.$$
Taking the difference of these two formulas, we obtain
$$\big|P(Y_1, \ldots, Y_k, Y_{k+n}, \ldots, Y_{k+n+l}) - P(Y_1, \ldots, Y_k)\,P(Y_{k+n}, \ldots, Y_{k+n+l})\big| \le \pi_{i_1}p_{i_1i_2}\cdots p_{i_{k-1}i_k}\,K\rho^n\,p_{j_0j_1}\cdots p_{j_{l-1}j_l}.$$
Since the number of states is finite, for sets $A = \{(Y_1, \ldots, Y_k) \in H\}$ and $B = \{(Y_{k+n}, \ldots, Y_{k+n+l}) \in H'\}$, the α-mixing condition
$$|P(AB) - P(A)P(B)| \le \alpha_n = s\rho^n \qquad (3.17)$$
holds with constants $s > 0$ and $0 \le \rho < 1$. The fields formed by such sets A and B generate the σ-fields containing $\sigma(Y_1, \ldots, Y_k)$ and $\sigma(Y_{k+n}, Y_{k+n+1}, \ldots)$. From the arguments above, such a Markov chain is an α-mixing sequence. Therefore, a central limit theorem for α-mixing random variables automatically yields the asymptotic normality for Markov chains; for this, Theorem 3.12 will be applied.

We recall a theorem of Billingsley (1995) (Theorem 29.4), shown below, which allows us to obtain a multivariate central limit theorem by deriving the normality of sums of random variables in the one-dimensional case.

Theorem 3.18. For k-dimensional random vectors $\mathbf{X}_n = (X_{n1}, X_{n2}, \ldots, X_{nk})$ and $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_k)$ in $\mathbb{R}^k$, a necessary and sufficient condition for $\mathbf{X}_n \Rightarrow \mathbf{Y}$ is that $\mathbf{X}_n\mathbf{t}^T \Rightarrow \mathbf{Y}\mathbf{t}^T$ for every $\mathbf{t} = (t_1, t_2, \ldots, t_k)$ in $\mathbb{R}^k$.

Here, let $\Lambda = \{u_1, u_2, \ldots, u_d\}$ be a finite set of DNA words and $N_i$ the word count of the i-th word. By Theorems 3.14 and 3.16, the limiting mean and covariance matrix exist for the count vector $\mathbf{N} = \big(N_1(n), N_2(n), \ldots, N_d(n)\big)$. Define the limiting mean
$$\pi = (\pi_1, \ldots, \pi_d) = \lim_{n\to\infty} n^{-1}(EN_1, \ldots, EN_d)$$
and the covariance matrix $\Sigma = (\sigma_{ij})$, where $\sigma_{ij} = \lim_{n\to\infty} n^{-1}\mathrm{cov}(N_i, N_j)$. Moreover, let $D = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$, where $\sigma_i = \lim_{n\to\infty} n^{-1}\mathrm{Var}(N_i)$. We obtain the following theorem, which establishes the multivariate normality of the set of word frequencies.

Theorem 3.19. Let $A = A_1A_2\cdots$ be a stationary, irreducible, aperiodic first-order Markov chain over a finite alphabet. Let Λ be a set of d words and $\mathbf{N} = \big(N_1(n), \ldots, N_d(n)\big)$ the count vector. Then $n^{-1}\mathbf{N}$ is asymptotically normal with mean π and covariance matrix $n^{-1}\Sigma$. Furthermore,
$$\begin{pmatrix} \dfrac{N_1/n - \pi_1}{\sqrt{\sigma_1/n}} \\ \vdots \\ \dfrac{N_d/n - \pi_d}{\sqrt{\sigma_d/n}} \end{pmatrix} \Longrightarrow N\big(\mathbf{0},\ D^{-1/2}\Sigma D^{-1/2}\big). \qquad (3.18)$$

Proof. We apply Theorem 3.18. Let $\mathbf{t}$ be a vector in $\mathbb{R}^d$. We have
$$\mathbf{N}\mathbf{t}^T = \sum_{j=1}^d t_jN_j = \sum_{i=1}^d t_i\sum_{j=1}^n I_i(j) = \sum_{j=1}^n\sum_{i=1}^d t_iI_i(j) = \sum_{j=1}^n Z_j,$$
where $Z_j = \sum_{i=1}^d t_iI_i(j)$ is a linear combination of the indicators $I_i(j)$, $1 \le i \le d$. Observe that the sequence $Z_1, Z_2, \ldots$ is driven by a stationary Markov chain and is an α-mixing sequence. From equation (3.17), it is obvious that $\alpha_n = s\rho^n = o(n^{-5})$, and $EZ_j = \sum_{i=1}^d t_iEI_i(j) = \sum_i t_i\pi_i = \pi\mathbf{t}^T$. Since $Z_j$ is bounded, it has all finite moments; in particular $EZ_j^{12} < \infty$. Thus the conditions of Theorem 3.12 are satisfied for the centered sequence. Let $S_n = Z_1 + \cdots + Z_n = \mathbf{N}\mathbf{t}^T$. It is easy to see that
$$\lim_{n\to\infty} n^{-1}ES_n = \pi\mathbf{t}^T, \qquad \lim_{n\to\infty} n^{-1}\mathrm{Var}(S_n) = \lim_{n\to\infty} n^{-1}\mathrm{Var}(\mathbf{N}\mathbf{t}^T) = \mathbf{t}\Sigma\mathbf{t}^T.$$
Therefore, Theorem 3.12 implies that
$$\frac{S_n - n\pi\mathbf{t}^T}{\sqrt{n\,\mathbf{t}\Sigma\mathbf{t}^T}} \Longrightarrow N(0, 1).$$
As a direct result, $\mathbf{N}\mathbf{t}^T/n$ is asymptotically $N(\pi\mathbf{t}^T, \mathbf{t}\Sigma\mathbf{t}^T/n)$ for any $\mathbf{t} \in \mathbb{R}^d$. Applying Theorem 3.18, the vector $n^{-1}\mathbf{N}$ has asymptotically a multivariate normal distribution with mean vector π and covariance matrix $\Sigma/n$. Standardizing every word count $N_i$, $1 \le i \le d$, we have
$$\Big(\frac{N_i/n - \pi_i}{\sqrt{\sigma_i/n}}\Big)_{1\le i\le d} = \sqrt{n}\,D^{-1/2}\big(\mathbf{N}/n - \pi\big) \Longrightarrow N\big(\mathbf{0},\ D^{-1/2}\Sigma D^{-1/2}\big),$$
and equation (3.18) follows. $\square$

Chapter 4
Simulation Results

In this chapter, we consider the set of all 64 DNA words of length 3 and count their occurrences in DNA sequences under the M0 and M1 models respectively. We investigate the normality of the set of word counts and study the distributions of the extrema over all 64 word counts by simulation. We take the herpesvirus HCMV (Human Cytomegalovirus) as our reference set. The length of HCMV, about 230,000, is long compared to the word length 3, so the central limit theorem can be applied to check the normality of all 64 word counts.

4.1 DNA Sequences under M0

In this section, we generate DNA sequences under the M0 model. Each position of the sequence $A_1A_2\cdots$ follows the base composition of HCMV: for every single position in the sequence, we pick a character from the state space $\mathcal{A} = \{A, C, G, T\}$ with the distribution
$$(p_A, p_C, p_G, p_T) = (0.2156092,\ 0.2831467,\ 0.2887962,\ 0.2124479).$$
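A sketch of this simulation step (illustrative; the actual study used 20,000 replicates of length 230,000, which the reader can scale up):

```python
import numpy as np

rng = np.random.default_rng(2)
letters = np.array(list("ACGT"))
p = np.array([0.2156092, 0.2831467, 0.2887962, 0.2124479])  # HCMV base composition

def simulate_M0(n: int) -> str:
    """One i.i.d. (M0) sequence of length n with the HCMV composition."""
    return "".join(rng.choice(letters, size=n, p=p))

def counts_3tuples(seq: str) -> dict:
    """Overlapping counts N_u for all 3-tuple words occurring in seq."""
    c = {}
    for i in range(len(seq) - 2):
        w = seq[i:i + 3]
        c[w] = c.get(w, 0) + 1
    return c

seq = simulate_M0(230_000)
print(sorted(counts_3tuples(seq).items())[:3])
```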
Under this rule, we generate 20,000 sequences, each of length 230,000, under the M0 model and compute all 64 word counts for every sequence.

[Figure 4.1: Normal Q-Q plot of the sums of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M0 model.]

To calculate the standardized frequency score ξ of equation (3.5) for every word count, it is crucial to compute the means and standard deviations of the word counts; these are calculated from the expressions in Theorem 3.3 and Corollary 3.5. For short words of length $k = 3$ and i.i.d. sequences of length $n = 230{,}000$, the distributions of the ξ scores of all word counts under the M0 model are expected to be approximately standard normal. This is verified by the Q-Q plot (Venables and Ripley, 2001) in Figure 4.1, which indicates the normality of the sum of the ξ scores of all 64 words of length 3 in the 20,000 randomly generated sequences under the M0 model. In the light of Theorem 3.18, this is consistent with the joint distribution of the set of ξ scores being approximately multivariate normal.

[Figure 4.2: Normal Q-Q plot of the maxima of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M0 model.]

The asymptotic normality of all word count scores motivates us to further investigate the over- and under-representation of the corresponding words. We therefore study the distribution of the extrema of these frequency scores; the result is given in Figure 4.2. From Figure 4.2, it appears that the maxima of the ξ scores are not normally distributed. This leads us to consider the tail distributions of the extrema of the 64 word counts and to see whether Theorem 3.9 can be verified.

We explore the properties of the most over-represented and most under-represented words as follows. For every sequence, let $M = \max_{1\le i\le 64}\xi_i$ and $m = \min_{1\le i\le 64}\xi_i$. Then 20,000 maxima and 20,000 minima are obtained, denoted by $\{M(k) : 1 \le k \le 20{,}000\}$ and $\{m(k) : 1 \le k \le 20{,}000\}$. To estimate the distribution of the event $\{M \ge x\}$ for x in a certain range, we estimate the probability of its occurrence by
$$\hat{P}(M \ge x) = \frac{1}{20{,}000}\sum_{1\le k\le 20000} I\big(M(k) \ge x\big).$$
Theorem 3.9 states that $P(\max_i\xi_i \ge x) \sim d\big(1-\Phi(x)\big)$ when $1 \le x \le c\sqrt{\ln n}$. In the point-to-point plots shown in Figure 4.3, as x approaches $\sqrt{\ln 230{,}000}$, the estimated probability $\hat{P}(M \ge x)$ approximates $64\big(1-\Phi(x)\big)$, as indicated by the near-straight-line appearance of the plot. Similarly, we estimate the probability $P(m \le x_0)$ by
$$\hat{P}(m \le x_0) = \frac{1}{20{,}000}\sum_{1\le k\le 20000} I\big(m(k) \le x_0\big).$$
Figure 4.3(b) shows that the estimated probability $\hat{P}(m \le x_0)$ approximates $64\,\Phi(x_0)$ as $x_0$ approaches $-\sqrt{\ln 230{,}000}$.

[Figure 4.3: Point-to-point plots of (a) Fmax(x) versus G(x), where Fmax(x) stands for the estimated probability $P(M \ge x)$ and $G(x) = 64\big(1-\Phi(x)\big)$; (b) Fmin(x) versus G(x), where Fmin(x) stands for the estimated probability $P(m \le x)$ with $G(x) = 64\,\Phi(x)$.]
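The estimate $\hat P(M \ge x)$ and the comparison with $64(1-\Phi(x))$ take only a few lines; a sketch, assuming the array M of simulated maxima from the runs just described is available:

```python
import numpy as np
from scipy.stats import norm

def tail_compare(M: np.ndarray, xs, d: int = 64) -> None:
    """Compare the empirical tail of the maxima with d(1 - Phi(x))."""
    for x in xs:
        f_max = (M >= x).mean()              # estimated P(M >= x)
        print(f"x={x:.2f}  Fmax={f_max:.4g}  G={d * (1 - norm.cdf(x)):.4g}")

# e.g. tail_compare(M, np.linspace(2.5, np.sqrt(np.log(230_000)), 6))
```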
4.2 DNA Sequences under M1

In genome studies, Markov chains of order 1 have been used extensively to model DNA sequences (see e.g. Blaisdell (1985), Avery (1987), Burge et al. (1992), Prum et al. (1995)). In this section, we generate DNA sequences under the M1 model and investigate the behavior of the ξ scores of DNA words. We take the herpesvirus HCMV as our reference set again. From the HCMV sequence, we estimated the stationary distribution as
$$(0.2156092,\ 0.2831467,\ 0.2887962,\ 0.2124479),$$
and the matrix of transition probabilities as
$$\begin{pmatrix} 0.2439530 & 0.3021771 & 0.2604475 & 0.1934224 \\ 0.2278353 & 0.2439997 & 0.3433479 & 0.1848171 \\ 0.2137734 & 0.3051500 & 0.2491542 & 0.2319224 \\ 0.1730439 & 0.2860968 & 0.2987491 & 0.2421102 \end{pmatrix}.$$
We randomly generated 20,000 first-order Markov chains of length 230,000, using this stationary distribution as the initial distribution together with the transition probability matrix, and computed all 64 word counts for every sequence.
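A sketch of the M1 generation step, using the estimated stationary distribution and transition matrix above (illustrative code; sampling letter by letter in pure Python is slow, but it shows the mechanism):

```python
import numpy as np

rng = np.random.default_rng(3)
letters = "ACGT"
pi0 = np.array([0.2156092, 0.2831467, 0.2887962, 0.2124479])
P = np.array([[0.2439530, 0.3021771, 0.2604475, 0.1934224],
              [0.2278353, 0.2439997, 0.3433479, 0.1848171],
              [0.2137734, 0.3051500, 0.2491542, 0.2319224],
              [0.1730439, 0.2860968, 0.2987491, 0.2421102]])

def simulate_M1(n: int) -> str:
    """A stationary first-order chain: start from pi0, then step with P."""
    s = [rng.choice(4, p=pi0)]
    for _ in range(n - 1):
        s.append(rng.choice(4, p=P[s[-1]]))
    return "".join(letters[i] for i in s)

print(simulate_M1(20))
```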
The frequency scores ξ under the M1 model are defined similarly to those under the M0 model. Recalling the definition of the ξ score in equation (3.5), we write it as
$$\xi_i = \frac{N_i - EN_i}{\sqrt{\mathrm{Var}(N_i)}} = \frac{N_i/n - \pi_i}{\sqrt{\sigma_i/n}}, \qquad (4.1)$$
where $N_i$ is the count of word i, and $\pi_i$ and $\sigma_i$ denote its limiting mean and variance per letter respectively, for $1 \le i \le 64$. Applying Theorem 3.14, the expected word frequencies $\pi_i$ can be calculated from the parameter settings given above for all 64 DNA words. However, by Corollary 3.17, computing the variance of every word frequency requires the k-step probabilities $p_{a,b}^{(j)}$ of two letters separated by j steps, for $1 \le j \le n$. Note from Lemma 3.13 that $|p_{a,b}^{(n)} - \pi_b| \to 0$ geometrically, so it suffices to truncate the series at $n = 50$: this reduces the complexity of the computation without materially affecting accuracy. Finally, we calculate the standard deviations and obtain the standardized ξ scores of equation (4.1) for all 3-tuple words.

[Figure 4.4: Normal Q-Q plot of the sums of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M1 model.]

Analogously to the M0 model, the straight-line appearance of the Q-Q plot suggests that the normal approximation of the sum of the 64 ξ scores is good. Further, the multivariate normality of the random vector $(\xi_i)$ follows from Billingsley (1995). We have shown in Theorem 3.9, and also demonstrated in the simulation results, that for ξ scores under the M0 model,
$$P\big(\max_i\xi_i \ge x\big) \sim d\big(1-\Phi(x)\big) \qquad (4.2)$$
and
$$P\big(\min_i\xi_i \le -x\big) \sim d\,\Phi(-x) \qquad (4.3)$$
as x tends to infinity from some starting point. A natural question arises about the tail probabilities of the extrema of the ξ scores under the M1 model: do similar relations hold for the ξ scores defined in equation (4.1)? Although we have not proved any asymptotic convergence of the tail probabilities of the extrema of ξ scores under the M1 model, we point out, by simulation, the possibility that asymptotic relations similar to those in Theorem 3.9 are true.

We continue using the estimated probabilities $\hat{P}(M \ge x)$ and $\hat{P}(m \le -x)$ to denote the tail probabilities. Figure 4.5(a) is the point-to-point plot of $\hat{P}(M \ge x)$ versus $64\big(1-\Phi(x)\big)$ as x varies in the interval $[2.5,\ 1.5\sqrt{\ln 230{,}000}]$, while Figure 4.5(b) is the corresponding plot as x varies in $[2.8,\ 1.5\sqrt{\ln 230{,}000}]$. From both figures, we can see that the ratio $\hat{P}(\max_i\xi_i \ge x)/\big(64(1-\Phi(x))\big)$ approaches 1 as x approaches $c\sqrt{\ln 230{,}000}$. However, we notice in Figure 4.5(a) a tendency for the ratio to grow once the value of $\hat{P}(\max_i\xi_i \ge x)$ falls below 0.2: the slope of the graph grows and becomes greater than 1.

By Bonferroni's inequalities, the ratio $P(\max_i\xi_i \ge x)/\big(64(1-\Phi(x))\big)$ should always be less than 1 and converge to 1 asymptotically. In the simulations, however, the estimated tail probability $\hat{P}(\max_i\xi_i \ge x)$ is not very sensitive to the growth of x, especially when x is large, while the function $64\big(1-\Phi(x)\big)$ goes to zero very fast for large x. These different convergence rates between the estimate $\hat{P}$ and the function $64\big(1-\Phi(x)\big)$ introduce errors when we calculate their ratio; this inaccuracy explains the convex appearance of the graph shown in Figure 4.5(a).

Despite the inaccuracy in computation, the near-straight line shown in Figure 4.5(b) may still indicate that, for ξ scores under the M1 model, $P(\max_i\xi_i \ge x) \sim 64\big(1-\Phi(x)\big)$ as x approaches $c\sqrt{\ln 230{,}000}$. Likewise, Figure 4.5(c) may indicate that $P(\min_i\xi_i \le -x) \sim 64\,\Phi(-x)$ as x approaches $c\sqrt{\ln 230{,}000}$. As a result, this may lead to further research on the asymptotic behavior of the ratios $P(\max_i\xi_i \ge x)/\big(d(1-\Phi(x))\big)$ and $P(\min_i\xi_i \le -x)/\big(d\,\Phi(-x)\big)$ as x grows like $\sqrt{\ln n}$, for ξ scores taken under the M1 model.

[Figure 4.5: Point-to-point plots of (a) Fmax(x) versus G(x) on $[2.5,\ 1.5\sqrt{\ln 230{,}000}]$; (b) Fmax(x) versus G(x) on $[2.8,\ 1.5\sqrt{\ln 230{,}000}]$; (c) Fmin(x) versus G(x).]

In biological applications, the expected values and variances of word frequencies are unknown and have to be estimated. As a result, the plug-in ξ score is applied, defined as
$$\tilde\xi_u = \frac{N_u - \hat{E}(N_u)}{\sqrt{\widehat{\mathrm{Var}}(N_u)}}. \qquad (4.4)$$
Although other methods can be applied to estimate the means and variances of word counts, such as techniques using conditional expectation and a martingale approach, the above version of the estimated score is preferred in biological studies (see Prum et al. (1995), Schbath et al. (1995), Leung et al. (1996)). The drawback of this plug-in score is that $N_u - \hat{E}(N_u)$ is not properly standardized by $\widehat{\mathrm{Var}}(N_u)$, so the $\tilde\xi_u$ score does not have an asymptotic standard normal distribution (see Prum et al. (1995), Leung et al. (1996)). However, relations similar to (4.2) and (4.3) are expected:
$$P\big(\max_i\tilde\xi_i \ge x\big) \sim \sum_i\big(1-\Phi_i(x)\big), \qquad P\big(\min_i\tilde\xi_i < -x\big) \sim \sum_i\Phi_i(-x),$$
as x tends to infinity from some starting point. Here $\Phi_i(x)$ denotes the distribution function of a normal random variable with standard deviation $\tau_i$. Computer simulations to test whether these two relations hold for plug-in scores under the M1 model are also possible; however, since the computation involved is quite time-consuming, we shall not pursue it in this thesis.
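A sketch of the plug-in step for the M1 model (illustrative; the estimators shown, pair frequencies for the transition matrix and marginal letter frequencies for the stationary law, are the natural maximum-likelihood choices, and they would then be fed into the limiting-moment formulas of Chapter 3):

```python
import numpy as np

def estimate_M1(seq: str, letters: str = "ACGT"):
    """Plug-in estimates (pi_hat, P_hat) from a single observed sequence."""
    ix = {a: i for i, a in enumerate(letters)}
    C = np.zeros((4, 4))
    for a, b in zip(seq, seq[1:]):
        C[ix[a], ix[b]] += 1                 # transition pair counts
    P_hat = C / C.sum(axis=1, keepdims=True)
    pi_hat = C.sum(axis=1) / C.sum()         # marginal letter frequencies
    return pi_hat, P_hat

# pi_hat and P_hat can then be passed to a routine such as the limit_var_M1
# sketch of Section 3.2 to form the plug-in score (4.4) for each word.
```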
Bibliography

[1] Arratia, R., Goldstein, L. and Gordon, L. (1990). Poisson approximation and the Chen-Stein method. Statist. Sci. 5 403–434.

[2] Avery, P. J. (1987). The analysis of intron data and their use in the detection of short signals. J. Mol. Evol. 26 335–340.

[3] Barbour, A. D., Holst, L. and Janson, S. (1992). Poisson Approximation. Oxford Studies in Probability 2. Clarendon Press, Oxford.

[4] Barbour, A. D. and Chen, L. H. Y. (editors) (2005a). An Introduction to Stein's Method. Lecture Notes Series 4. Singapore University Press and World Scientific, Singapore.

[5] Barbour, A. D. and Chen, L. H. Y. (editors) (2005b). Stein's Method and Applications. Lecture Notes Series 5. Singapore University Press and World Scientific, Singapore.

[6] Billingsley, P. (1995). Probability and Measure. Wiley, New York.

[7] Blaisdell, B. E. (1985). Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding. J. Mol. Evol. 21 278–288.

[8] Burge, C., Campbell, A. and Karlin, S. (1992). Over- and underrepresentation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. USA 89 1358–1362.

[9] Chen, L. H. Y. (1975). Poisson approximation for dependent trials. Ann. Probab. 3 534–545.

[10] Chew, D. S. H., Choi, K. P., Heidner, H. and Leung, M. Y. (2004). Palindromes in SARS and other coronaviruses. INFORMS J. Comput. 16(4) 331–340.

[11] Heinrich, L. (1985). Non-uniform estimates, moderate and large deviations in the central limit theorem for m-dependent random variables. Math. Nachr. 121 107–121.

[12] Leung, M. Y., Marsh, G. M. and Speed, T. P. (1996). Over- and underrepresentation of short DNA words in herpesvirus genomes. J. Comp. Biol. 3 345–360.

[13] Phillips, G. J., Arnold, J. and Ivarie, R. (1987). The effect of codon usage on the oligonucleotide composition of the E. coli genome and identification of over- and underrepresented sequences by Markov chain analysis. Nucl. Acids Res. 15 2627–2638.

[14] Prum, B., Rodolphe, F. and Turckheim, E. D. (1995). Finding words with unexpected frequencies in DNA sequences. J. R. Statist. Soc. B 57(1) 205–220.

[15] Reinert, G., Schbath, S. and Waterman, M. S. (2000). Probabilistic and statistical properties of words: an overview. J. Comp. Biol. 7(1/2) 1–46.

[16] Schbath, S., Prum, B. and Turckheim, E. D. (1995). Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences. J. Comp. Biol. 2(3) 417–437.

[17] Stein, C. (1972). A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Proc. Sixth Berkeley Symp. Math. Statist. Probab. 2 583–602. Univ. California Press, Berkeley.

[18] Venables, W. N. and Ripley, B. D. (2001). Modern Applied Statistics with S-PLUS. Springer-Verlag, New York.

[19] Waterman, M. S. (1995). Introduction to Computational Biology. Chapman and Hall, New York.