15 Estimation Theory and Algorithms: From Gauss to Wiener to Kalman

Mendel, J.M. "Estimation Theory and Algorithms: From Gauss to Wiener to Kalman," in Digital Signal Processing Handbook, Ed. Vijay K. Madisetti and Douglas B. Williams, Boca Raton: CRC Press LLC, 1999. © 1999 by CRC Press LLC.

Jerry M. Mendel, University of Southern California

15.1 Introduction
15.2 Least-Squares Estimation
15.3 Properties of Estimators
15.4 Best Linear Unbiased Estimation
15.5 Maximum-Likelihood Estimation
15.6 Mean-Squared Estimation of Random Parameters
15.7 Maximum A Posteriori Estimation of Random Parameters
15.8 The Basic State-Variable Model
15.9 State Estimation for the Basic State-Variable Model: Prediction • Filtering (the Kalman Filter) • Smoothing
15.10 Digital Wiener Filtering
15.11 Linear Prediction in DSP, and Kalman Filtering
15.12 Iterated Least Squares
15.13 Extended Kalman Filter
Acknowledgment
References
Further Information

15.1 Introduction

Estimation is one of four modeling problems. The other three are representation (how something should be modeled), measurement (which physical quantities should be measured and how they should be measured), and validation (demonstrating confidence in the model). Estimation, which fits in between the problems of measurement and validation, deals with the determination of those physical quantities that cannot be measured from those that can be measured. We shall cover a wide range of estimation techniques, including weighted least squares, best linear unbiased, maximum-likelihood, mean-squared, and maximum a posteriori. These techniques are for parameter or state estimation, or a combination of the two, as applied to either linear or nonlinear models.

The discrete-time viewpoint is emphasized in this chapter because: (1) much real data is collected in a digitized manner, so it is in a form ready to be processed by discrete-time estimation algorithms; and (2) the mathematics associated with discrete-time estimation theory is simpler than with continuous-time estimation theory. We view (discrete-time) estimation theory as the extension of classical signal processing to the design of discrete-time (digital) filters that process uncertain data in an optimal manner. Estimation theory can, therefore, be viewed as a natural adjunct to digital signal processing theory. Mendel [12] is the primary reference for all the material in this chapter.

Estimation algorithms process data and, as such, must be implemented on a digital computer. Our computation philosophy is, whenever possible, to leave it to the experts. Many of our chapter's algorithms can be used with MATLAB™ and appropriate toolboxes (MATLAB is a registered trademark of The MathWorks, Inc.).
See [12] for specific connections between MATLAB™ and toolbox M-files and the algorithms of this chapter.

The main model that we shall direct our attention to is linear in the unknown parameters, namely

$$Z(k) = H(k)\,\theta + V(k) \tag{15.1}$$

In this model, which we refer to as a "generic linear model": Z(k) = col(z(k), z(k−1), ..., z(k−N+1)), which is N × 1, is called the measurement vector, and its elements are z(j) = h′(j)θ + v(j); θ, which is n × 1, is called the parameter vector, and contains the unknown deterministic or random parameters that will be estimated using one or more of this chapter's techniques; H(k), which is N × n, is called the observation matrix; and V(k), which is N × 1, is called the measurement noise vector. By convention, the argument "k" of Z(k), H(k), and V(k) denotes the fact that the last measurement used to construct (15.1) is the kth.

Examples of problems that can be cast into the form of the generic linear model are: identifying the impulse response coefficients in the convolutional summation model for a linear time-invariant system from noisy output measurements; identifying the coefficients of a linear time-invariant finite-difference equation model for a dynamical system from noisy output measurements; function approximation; state estimation; estimating parameters of a nonlinear model using a linearized version of that model; deconvolution; and identifying the coefficients in a discretized Volterra series representation of a nonlinear system.

The following estimation notation is used throughout this chapter: θ̂(k) denotes an estimate of θ, and θ̃(k) denotes the error in estimation, i.e., θ̃(k) = θ − θ̂(k). The generic linear model is the starting point for the derivation of many classical parameter estimation techniques, and the estimation model for Z(k) is Ẑ(k) = H(k)θ̂(k). In the rest of this chapter we develop specific structures for θ̂(k). These structures are referred to as estimators. Estimates are obtained whenever data are processed by an estimator.

15.2 Least-Squares Estimation

The method of least squares dates back to Carl Friedrich Gauss around 1795 and is the cornerstone for most estimation theory. The weighted least-squares estimator (WLSE), θ̂_WLS(k), is obtained by minimizing the objective function J[θ̂(k)] = Z̃′(k)W(k)Z̃(k), where [using (15.1)] Z̃(k) = Z(k) − Ẑ(k) = H(k)θ̃(k) + V(k), and weighting matrix W(k) must be symmetric and positive definite. This weighting matrix can be used to weight recent measurements more (or less) heavily than past measurements. If W(k) = cI, so that all measurements are weighted the same, then weighted least squares reduces to least squares, in which case we obtain θ̂_LS(k). Setting dJ[θ̂(k)]/dθ̂(k) = 0, we find that

$$\hat{\theta}_{\rm WLS}(k) = \left[ H'(k)\, W(k)\, H(k) \right]^{-1} H'(k)\, W(k)\, Z(k) \tag{15.2}$$

and, consequently,

$$\hat{\theta}_{\rm LS}(k) = \left[ H'(k)\, H(k) \right]^{-1} H'(k)\, Z(k) \tag{15.3}$$

Note, also, that J[θ̂_WLS(k)] = Z′(k)W(k)Z(k) − θ̂′_WLS(k)H′(k)W(k)H(k)θ̂_WLS(k). Matrix H′(k)W(k)H(k) must be nonsingular for its inverse in (15.2) to exist; this is true if W(k) is positive definite, as assumed, and H(k) is of maximum rank. We know that θ̂_WLS(k) minimizes J[θ̂(k)] because d²J[θ̂(k)]/dθ̂²(k) = 2H′(k)W(k)H(k) > 0, since H′(k)W(k)H(k) is invertible. Estimator θ̂_WLS(k) processes the measurements Z(k) linearly; hence, it is referred to as a linear estimator. In practice, we do not compute θ̂_WLS(k) using (15.2), because computing the inverse of H′(k)W(k)H(k) is fraught with numerical difficulties. Instead, the so-called normal equations [H′(k)W(k)H(k)]θ̂_WLS(k) = H′(k)W(k)Z(k) are solved using stable algorithms from numerical linear algebra (e.g., [3]); one approach is to convert the original least-squares problem into an equivalent, easy-to-solve problem using orthogonal transformations, such as Householder or Givens transformations. Note, also, that (15.2) and (15.3) apply to the estimation of either deterministic or random parameters, because nowhere in the derivation of θ̂_WLS(k) did we have to assume that θ was or was not random. Finally, note that WLSEs may not be invariant under changes of scale; one way to circumvent this difficulty is to use normalized data.
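The batch estimates (15.2) and (15.3) are easy to prototype. The following sketch (Python/NumPy; the function name and the synthetic test data are illustrative assumptions, not from this chapter) follows the advice above: rather than forming the inverse in (15.2), it whitens the problem with a Cholesky factor of W(k) and solves the resulting ordinary least-squares problem with an orthogonal-transformation-based routine.

```python
import numpy as np

def wlse_batch(Z, H, W):
    """Batch weighted least-squares estimate (15.2), computed without
    forming the matrix inverse explicitly.  Z is N x 1, H is N x n, and
    W is an N x N symmetric positive-definite weighting matrix."""
    # With W = L L' (Cholesky), minimizing (Z - H theta)' W (Z - H theta)
    # is the ordinary LS problem || L'Z - L'H theta ||^2, which lstsq
    # solves by orthogonal (QR/SVD-based) transformations.
    L = np.linalg.cholesky(W)
    theta_hat, *_ = np.linalg.lstsq(L.T @ H, L.T @ Z, rcond=None)
    return theta_hat

# Example: identify two parameters from noisy measurements.
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -0.5])
H = rng.normal(size=(50, 2))
Z = H @ theta_true + 0.1 * rng.normal(size=50)
W = np.eye(50)                      # W = cI reduces WLS to ordinary LS
print(wlse_batch(Z, H, W))          # close to [1.0, -0.5]
```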
Least-squares estimates can also be computed using the singular-value decomposition (SVD) of matrix H(k). This computation is valid for both the overdetermined (N > n) and underdetermined (N < n) situations, and for the situation when H(k) may or may not be of full rank. The SVD of a K × M matrix A is

$$U' A V = \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} \tag{15.4}$$

where U and V are unitary matrices, Σ = diag(σ₁, σ₂, ..., σ_r), and σ₁ ≥ σ₂ ≥ ... ≥ σ_r > 0. The σᵢ's are the singular values of A, and r is the rank of A. Let the SVD of H(k) be given by (15.4). Even if H(k) is not of maximum rank, then

$$\hat{\theta}_{\rm LS}(k) = V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} U'\, Z(k) \tag{15.5}$$

where Σ⁻¹ = diag(σ₁⁻¹, σ₂⁻¹, ..., σ_r⁻¹) and r is the rank of H(k). Additionally, in the overdetermined case,

$$\hat{\theta}_{\rm LS}(k) = \sum_{i=1}^{r} \frac{v_i(k)\, v_i'(k)}{\sigma_i^2(k)}\, H'(k)\, Z(k) \tag{15.6}$$

Similar formulas exist for computing θ̂_WLS(k).

Equations (15.2) and (15.3) are batch equations, because they process all of the measurements at one time. These formulas can be made recursive in time by using simple vector and matrix partitioning techniques. The information form of the recursive WLSE is:

$$\hat{\theta}_{\rm WLS}(k+1) = \hat{\theta}_{\rm WLS}(k) + K_w(k+1)\left[ z(k+1) - h'(k+1)\,\hat{\theta}_{\rm WLS}(k) \right] \tag{15.7}$$

$$K_w(k+1) = P(k+1)\, h(k+1)\, w(k+1) \tag{15.8}$$

$$P^{-1}(k+1) = P^{-1}(k) + h(k+1)\, w(k+1)\, h'(k+1) \tag{15.9}$$

Equations (15.8) and (15.9) require the inversion of the n × n matrix P; if n is large, then this will be a costly computation. Applying a matrix inversion lemma to (15.9), one obtains the following alternative covariance form of the recursive WLSE: Equation (15.7), and

$$K_w(k+1) = P(k)\, h(k+1) \left[ h'(k+1)\, P(k)\, h(k+1) + w^{-1}(k+1) \right]^{-1} \tag{15.10}$$

$$P(k+1) = \left[ I - K_w(k+1)\, h'(k+1) \right] P(k) \tag{15.11}$$

Equations (15.7)–(15.9), or (15.7), (15.10), and (15.11), are initialized by θ̂_WLS(n) and P⁻¹(n), where P(n) = [H′(n)W(n)H(n)]⁻¹, and are used for k = n, n+1, ..., N−1. Equation (15.7) can be expressed as

$$\hat{\theta}_{\rm WLS}(k+1) = \left[ I - K_w(k+1)\, h'(k+1) \right] \hat{\theta}_{\rm WLS}(k) + K_w(k+1)\, z(k+1) \tag{15.12}$$

which demonstrates that the recursive WLSE is a time-varying digital filter that is excited by random inputs (i.e., the measurements), one whose plant matrix [I − K_w(k+1)h′(k+1)] may itself be random, because K_w(k+1) and h(k+1) may be random, depending upon the specific application. The random natures of these matrices make the analysis of this filter exceedingly difficult.

Two recursions are present in the recursive WLSEs. The first is the vector recursion for θ̂_WLS given by (15.7); clearly, θ̂_WLS(k+1) cannot be computed from this expression until measurement z(k+1) is available. The second is the matrix recursion for either P⁻¹ given by (15.9) or P given by (15.11); observe that values for these matrices can be precomputed before measurements are made. A digital computer implementation of (15.7)–(15.9) is P⁻¹(k+1) → P(k+1) → K_w(k+1) → θ̂_WLS(k+1), whereas for (15.7), (15.10), and (15.11), it is P(k) → K_w(k+1) → θ̂_WLS(k+1) → P(k+1).
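A minimal sketch of the covariance form (15.7), (15.10), and (15.11) for scalar measurements is given below (Python/NumPy; function and variable names are illustrative assumptions). It uses the large-diagonal P(0) initialization discussed next in the text.

```python
import numpy as np

def rwlse_step(theta, P, z_new, h_new, w_new):
    """One step of the covariance form of the recursive WLSE,
    Equations (15.7), (15.10), and (15.11), for a scalar measurement."""
    # (15.10): gain (the bracketed term is scalar here)
    denom = h_new @ P @ h_new + 1.0 / w_new
    K = (P @ h_new) / denom
    # (15.7): update the estimate with the new measurement
    theta = theta + K * (z_new - h_new @ theta)
    # (15.11): update the matrix P
    P = (np.eye(len(theta)) - np.outer(K, h_new)) @ P
    return theta, P

# Illustrative use, with all measurements weighted equally.
rng = np.random.default_rng(1)
theta_true = np.array([2.0, -1.0, 0.5])
theta, P = np.zeros(3), 1e6 * np.eye(3)     # large diagonal P(0)
for k in range(200):
    h = rng.normal(size=3)
    z = h @ theta_true + 0.05 * rng.normal()
    theta, P = rwlse_step(theta, P, z, h, w_new=1.0)
print(theta)   # approaches theta_true
```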
Finally, the recursive WLSEs can even be used for k = 0, 1, ..., N−1. Often z(0) = 0, or there is no measurement made at k = 0, so that we can set z(0) = 0. In this case we can set w(0) = 0, and the recursive WLSEs can be initialized by setting θ̂_WLS(0) = 0 and P(0) to a diagonal matrix of very large numbers. This is very commonly done in practice. Fast fixed-order recursive least-squares algorithms that are based on the Givens rotation [3] and can be implemented using systolic arrays are described in [5] and the references therein.

15.3 Properties of Estimators

How do we know whether or not the results obtained from the WLSE, or for that matter any estimator, are good? To answer this question, we must make use of the fact that all estimators represent transformations of random data; hence, θ̂(k) is itself random, so that its properties must be studied from a statistical viewpoint. This fact, and its consequences, which seem so obvious to us today, are due to the eminent statistician R. A. Fisher.

It is common to distinguish between small-sample and large-sample properties of estimators. The term "sample" refers to the number of measurements used to obtain θ̂, i.e., the dimension of Z. The phrase "small sample" means any number of measurements (e.g., 1, 2, 100, 10⁴, or even an infinite number), whereas the phrase "large sample" means "an infinite number of measurements." Large-sample properties are also referred to as asymptotic properties. If an estimator possesses a small-sample property, it also possesses the associated large-sample property; but the converse is not always true. Although large sample means an infinite number of measurements, estimators begin to enjoy large-sample properties for much fewer than an infinite number of measurements. How few usually depends on the dimension of θ, n, the memory of the estimators, and, in general, on the underlying, albeit unknown, probability density function.

A thorough study into θ̂ would mean determining its probability density function p(θ̂). Usually, it is too difficult to obtain p(θ̂) for most estimators (unless θ̂ is multivariate Gaussian); thus, it is customary to emphasize the first- and second-order statistics of θ̂ (or its associated error θ̃ = θ − θ̂), the mean and the covariance.

Small-sample properties of an estimator are unbiasedness and efficiency. An estimator is unbiased if its mean value tracks the unknown parameter at every value of time, i.e., the mean value of the estimation error is zero at every value of time. Dispersion about the mean is measured by error variance. Efficiency is related to how small the error variance will be. Associated with efficiency is the very famous Cramer-Rao inequality (Fisher information matrix, in the case of a vector of parameters), which places a lower bound on the error variance, a bound that does not depend on a particular estimator.

Large-sample properties of an estimator are asymptotic unbiasedness, consistency, asymptotic normality, and asymptotic efficiency. Asymptotic unbiasedness and efficiency are limiting forms of their small-sample counterparts, unbiasedness and efficiency. The importance of an estimator being asymptotically normal (Gaussian) is that its entire probabilistic description is then known, and it can be entirely characterized just by its asymptotic first- and second-order statistics. Consistency is a form of convergence of θ̂(k) to θ; it is synonymous with convergence in probability. One of the reasons for the importance of
consistency in estimation theory is that any continuous function of a consistent estimator is itself a consistent estimator, i.e., “consistency carries over.” It is also possible to examine other types of stochastic convergence for estimators, such as mean-squared convergence and convergence with probability A general carry-over property does not exist for these two types of convergence; it must be established case-by case (e.g., [11]) Generally speaking, it is very difficult to establish small sample or large sample properties for leastsquares estimators, except in the very special case when H(k) and V(k) are statistically independent While this condition is satisfied in the application of identifying an impulse response, it is violated in the important application of identifying the coefficients in a finite difference equation, as well as in many other important engineering applications Many large sample properties of LSEs are determined by establishing that the LSE is equivalent to another estimator for which it is known that the large sample property holds true We pursue this below Least-squares estimators require no assumptions about the statistical nature of the generic model Consequently, the formula for the WLSE is easy to derive The price paid for not making assumptions about the statistical nature of the generic linear model is great difficulty in establishing small or large sample properties of the resulting estimator 15.4 Best Linear Unbiased Estimation Our second estimator is both unbiased and efficient by design, and is a linear function of measurements Z(k) It is called a best linear unbiased estimator (BLUE), θˆBLU (k) As in the derivation of the WLSE, we begin with our generic linear model; but, now we make two assumptions about this model, namely: (1) H(k) must be deterministic, and (2) V(k) must be zero mean with positive definite known covariance matrix R(k) The derivation of the BLUE is more complicated than the derivation of the WLSE because of the design constraints; however, its performance analysis is much easier because we build good performance into its design We begin by assuming the following linear structure for θˆBLU (k), θˆBLU (k) = F(k)Z(k) Matrix F(k) is designed such that: (1) θˆBLU (k) is an unbiased estimator of θ , and (2) the error variance for each of the n parameters is minimized In this way, θˆBLU (k) will be unbiased and efficient (within the class of linear estimators) by design The resulting BLUE estimator is: θˆBLU (k) = [H0 (k)R −1 (k)H(k)]H0 (k)R −1 (k)Z(k) (15.13) A very remarkable connection exists between the BLUE and WLSE, namely, the BLUE of θ is the special case of the WLSE of θ when W(k) = R −1 (k) Consequently, all results obtained in our section above for θˆWLS (k) can be applied to θˆBLU (k) by setting W(k) = R −1 (k) Matrix R −1 (k) weights the contributions of precise measurements heavily and deemphasizes the contributions of imprecise measurements The best linear unbiased estimation design technique has led to a weighting matrix that is quite sensible If H(k) is deterministic and R(k) = σν2 I, then θˆBLU (k) = θˆLS (k) This result, known as the Gauss-Markov theorem, is important because we have connected two seemingly different estimators, one of which, θˆBLU (k), has the properties of unbiasedness and minimum variance by design; hence, in this case θˆLS (k) inherits these properties In a recursive WLSE, matrix P(k) has no special meaning In a recursive BLUE [which is obtained by substituting W(k) = R −1 (k) into (15.7)–(15.9), 
or (15.7), (15.10) and (15.11)], matrix P(k) is the covariance matrix for the error between θ and θˆBLU (k), i.e., P(k) = [H0 (k)R −1 (k)H(k)]−1 = cov [θ˜BLU (k)] Hence, every time P(k) is calculated in the recursive BLUE, we obtain a quantitative measure of how well we are estimating θ 1999 by CRC Press LLC c Recall that we stated that WLSEs may change in numerical value under changes in scale BLUEs are invariant under changes in scale This is accomplished automatically by setting W(k) = R −1 (k) in the WLSE The fact that H(k) must be deterministic severely limits the applicability of BLUEs in engineering applications 15.5 Maximum-Likelihood Estimation Probability is associated with a forward experiment in which the probability model, p(Z(k)|θ ), is specified, including values for the parameters, θ , in that model (e.g., mean and variance in a Gaussian density function), and data (i.e., realizations) are generated using this model Likelihood, l(θ |Z(k)), is proportional to probability In likelihood, the data is given as well as the nature of the probability model;but the parameters of the probability model are not specified They must be determined from the given data Likelihood is, therefore, associated with an inverse experiment The maximum-likelihood method is based on the relatively simple idea that different (statistical) populations generate different samples and that any given sample (i.e., set of data) is more likely to have come from some populations than from others In order to determine the maximum-likelihood estimate (MLE) of deterministic θ, θˆML , we need to determine a formula for the likelihood function and then maximize that function Because likelihood is proportional to probability, we need to know the entire joint probability density function of the measurements in order to determine a formula for the likelihood function This, of course, is much more information about Z(k) than was required in the derivation of the BLUE In fact, it is the most information that we can ever expect to know about the measurements The price we pay for knowing so much information about Z(k) is complexity in maximizing the likelihood function Generally, mathematical programming must be used in order to determine θˆML Maximum-likelihood estimates are very popular and widely used because they enjoy very good large sample properties They are consistent, asymptotically Gaussian with mean θ and covariance matrix N1 J−1 , in which J is the Fisher information matrix, and are asymptotically efficient Functions of maximum-likelihood estimates are themselves maximum-likelihood estimates, i.e., if g(θ ) is a vector function mapping θ into an interval in r-dimensional Euclidean space, then g(θˆML ) is a MLE of g(θ) This “invariance” property is usually not enjoyed by WLSEs or BLUEs In one special case it is very easy to compute θˆML , i.e., for our generic linear model in which H(k) is deterministic and V(k) is Gaussian In this case θˆML = θˆBLU These estimators are: unbiased, because θˆBLU is unbiased; efficient (within the class of linear estimators), because θˆBLU is efficient; consistent, because θˆML is consistent; and, Gaussian, because they depend linearly on Z(k), which is Gaussian If, in addition, R(k) = σν2 I, then θˆML (k) = θˆBLU (k) = θˆLS (k), and these estimators are unbiased, efficient (within the class of linear estimators), consistent, and Gaussian The method of maximum-likelihood is limited to deterministic parameters In the case of random parameters, we can still use the WLSE or the 
BLUE, or, if additional information is available, we can use either a mean-squared or maximum-a posteriori estimator, as described below The former does not use statistical information about the random parameters, whereas the latter does 15.6 Mean-Squared Estimation of Random Parameters Given measurements z(1), z(2), , z(k), the mean-squared estimator (MSE) of random θ, θˆMS (k) = (k)θ˜ φ[z(i), i = 1, 2, , k], minimizes the mean-squared error J [θ˜MS (k)] = E{θ˜MS MS (k)} [where ˜θMS (k) = θ − θˆMS (k)] The function φ[z(i), i = 1, 2, , k] may be nonlinear or linear Its exact structure is determined by minimizing J [θ˜MS (k)] 1999 by CRC Press LLC c The solution to this mean-squared estimation problem, which is known as the fundamental theorem of estimation theory is: (15.14) θˆMS (k) = E {θ |Z(k)} As it stands, (15.14) is not terribly useful for computing θˆMS (k) In general, we must first compute p[θ |Z(k)] and then perform the requisite number of integrations of θp[θ |Z(k)] to obtain θˆMS (k) It is useful to separate this computation into two major cases; (1) θ and Z(k) are jointly Gaussian — the Gaussian case, and (2) θ and Z(k) are not jointly Gaussian — the non-Gaussian case When θ and Z(k) are jointly Gaussian, the estimator that minimizes the mean-squared error is   (15.15) θˆMS (k) = mθ + Pθ z (k)Pz−1 (k) Z(k) − mz (k) where mθ is the mean of θ, mz (k) is the mean of Z(k), Pz (k) is the covariance matrix of Z(k), and Pθz (k) is the cross-covariance between θ and Z(k) Of course, to compute θˆMS (k) using (15.15), we must somehow know all of these statistics, and we must be sure that θ and Z(k) are jointly Gaussian For the generic linear model, Z(k) = H(k)θ + V(k), in which H(k) is deterministic, V(k) is Gaussian noise with known invertible covariance matrix R(k), θ is Gaussian with mean mθ and covariance matrix Pθ , and, θ and V(k) are statistically independent, then θ and Z(k) are jointly Gaussian, and, (15.15) becomes  −1 (15.16) θˆMS (k) = mθ + Pθ H0 (k) H(k)Pθ H0 (k) + R(k) [Z(k) − H(k)mθ ] where error-covariance matrix PMS (k), which is associated with θˆMS (k), is  −1 H(k)Pθ PMS (k) = Pθ − Pθ H0 (k) H(k)Pθ H0 (k) + R(k) h i−1 = Pθ−1 + H0 (k)R −1 (k)H(k) (15.17) Using (15.17) in (15.16), θˆMS (k) can be reexpressed as θˆMS (k) = mθ + PMS (k)H0 (k)R −1 (k) [Z(k) − H(k)mθ ] (15.18) Suppose θ and Z(k) are not jointly Gaussian and that we know mθ , mz (k), Pz (k), and Pθ z (k) In this case, the estimator that is constrained to be an affine transformation of Z(k) and that minimizes the mean-squared error is also given by (15.15) We now know the answer to the following important question: When is the linear (affine) meansquared estimator the same as the mean-squared estimator? 
The answer is when θ and Z(k) are jointly Gaussian If θ and Z(k) are not jointly Gaussian, then θˆMS (k) = E{θ|Z(k)}, which, in general, is a nonlinear function of measurements Z(k), i.e., it is a nonlinear estimator Associated with mean-squared estimation theory is the orthogonality principle: Suppose f [Z(k)] is any function of the data Z(k); then the error in the mean-squared estimator is orthogonal to f [Z(k)] in the sense that E{[θ − θˆMS (k)]f [Z(k)]} = A frequently encountered special case of this occurs (k)} = when f [Z(k)] = θˆMS (k), in which case E{θ˜MS (k)θ˜MS When θ and Z(k) are jointly Gaussian, θˆMS (k) in (15.15) has the following properties: (1) it is unbiased; (2) each of its components has the smallest error variance; (3) it is a “linear” (affine) estimator; (4) it is unique; and, (5) both θˆMS (k) and θ˜MS (k) are multivariate Gaussian, which means that these quantities are completely characterized by their first- and second-order statistics Tremendous simplifications occur when θ and Z(k) are jointly Gaussian! Many of the results presented in this section are applicable to objective functions other than the mean-squared objective function See the supplementary material at the end of Lesson 13 in [12] for discussions on a wide number of objective functions that lead to E{θ |Z(k)} as the optimal estimator of θ, as well as discussions on a full-blown nonlinear estimator of θ 1999 by CRC Press LLC c There is a connection between the BLUE and the MSE The connection requires a slightly different BLUE, one that incorporates the a priori statistical information about random θ To this, we treat mθ as an additional measurement that is augmented to Z(k) The additional measurement equation is obtained by adding and subtracting θ in the identity mθ = mθ , i.e., mθ = θ + (mθ − θ ) Quantity (mθ − θ) is now treated as zero-mean measurement noise with covariance Pθ The augmented linear model is       V(k) H(k) Z(k) (15.19) θ+ = mθ I mθ − θ a (k) Then it is always true that Let the BLUE estimator for this augmented model be denoted θˆBLU a (k) Note that the weighted least-squares objective function that is associated with θˆMS (k) = θˆBLU a ˜ θˆ (k) is Ja [θˆ a (k)] = [mθ − θˆ a (k)]0 P−1 [mθ − θˆ a (k)] + Z˜ (k)R −1 (k)Z(k) θ BLU 15.7 Maximum A Posteriori Estimation of Random Parameters Maximum a posteriori (MAP) estimation is also known as Bayesian estimation Recall Bayes’s rule: p(θ |Z(k)) = p(Z(k)|θ)p(θ)/p(Z(k)) in which density function p(θ |Z(k)) is known as the a posteriori (or posterior) conditional density function, and p(θ ) is the prior density function for θ Observe that p(θ|Z(k)) is related to likelihood function l{θ |Z(k)}, because l{θ |Z(k)} ∝ p(Z(k)|θ ) Additionally, because p(Z(k)) does not depend on θ, p(θ|Z(k)) ∝ p(Z(k)|θ )p(θ ) In MAP estimation, values of θ are found that maximize p(Z(k)|θ )p(θ ) Obtaining a MAP estimate involves specifying both p(Z(k)|θ) and p(θ) and finding the value of θ that maximizes p(θ |Z(k)) It is the knowledge of the a priori probability model for θ , p(θ ), that distinguishes the problem formulation for MAP estimation from MS estimation If θ1 , θ2 , , θn are uniformly distributed, then p(θ |Z(k)) ∝ p(Z(k)|θ ), and the MAP estimator of θ equals the ML estimator of θ Generally, MAP estimates are quite different from ML estimates For example, the invariance property of MLEs usually does not carry over to MAP estimates One reason for this can be seen from the formula p(θ|Z(k)) ∝ p(Z(k)|θ )p(θ ) Suppose, for example, that φ = g(θ) and we want 
to determine φ̂_MAP by first computing θ̂_MAP. Because p(θ) depends on the Jacobian matrix of g⁻¹(φ), φ̂_MAP ≠ g(θ̂_MAP). Usually θ̂_MAP and θ̂_ML(k) are asymptotically identical to one another, since in the large-sample case the knowledge of the observations tends to swamp the knowledge of the prior distribution [10]. Generally speaking, optimization must be used to compute θ̂_MAP(k).

In the special but important case when Z(k) and θ are jointly Gaussian, θ̂_MAP(k) = θ̂_MS(k). This result is true regardless of the nature of the model relating θ to Z(k). Of course, in order to use it, we must first establish that Z(k) and θ are jointly Gaussian; except for the generic linear model, this is very difficult to do. When H(k) is deterministic, V(k) is white Gaussian noise with known covariance matrix R(k), and θ is multivariate Gaussian with known mean m_θ and covariance P_θ, then θ̂_MAP(k) = θ̂ᵃ_BLU(k); hence, for the generic linear Gaussian model, MS, MAP, and BLUE estimates of θ are all the same, i.e., θ̂_MS(k) = θ̂ᵃ_BLU(k) = θ̂_MAP(k).

15.8 The Basic State-Variable Model

In the rest of this chapter we shall describe a variety of mean-squared state estimators for a linear, (possibly) time-varying, discrete-time dynamical system, which we refer to as the basic state-variable model. This system is characterized by an n × 1 state vector x(k) and an m × 1 measurement vector z(k), and is:

$$x(k+1) = \Phi(k+1,k)\, x(k) + \Gamma(k+1,k)\, w(k) + \Psi(k+1,k)\, u(k) \tag{15.20}$$

$$z(k+1) = H(k+1)\, x(k+1) + v(k+1) \tag{15.21}$$

where k = 0, 1, .... In this model w(k) and v(k) are p × 1 and m × 1 mutually uncorrelated (possibly nonstationary) jointly Gaussian white noise sequences; i.e., E{w(i)w′(j)} = Q(i)δ_ij, E{v(i)v′(j)} = R(i)δ_ij, and E{w(i)v′(j)} ≡ S = 0, for all i and j. Covariance matrix Q(i) is positive semidefinite and R(i) is positive definite [so that R⁻¹(i) exists]. Additionally, u(k) is an l × 1 vector of known system inputs, and initial state vector x(0) is multivariate Gaussian, with mean m_x(0) and covariance P_x(0), and x(0) is not correlated with w(k) and v(k). The dimensions of matrices Φ, Γ, Ψ, H, Q, and R are n × n, n × p, n × l, m × n, p × p, and m × m, respectively. The double arguments in matrices Φ, Γ, and Ψ may not always be necessary, in which case we replace (k+1, k) by k.

Disturbance w(k) is often used to model disturbance forces acting on the system, errors in modeling the system, or errors due to actuators in the translation of the known input, u(k), into physical signals. Vector v(k) is often used to model errors in measurements made by sensing instruments, or unavoidable disturbances that act directly on the sensors.

Not all systems are described by this basic model. In general, w(k) and v(k) may be correlated, some measurements may be made so accurately that, for all practical purposes, they are "perfect" (i.e., no measurement noise is associated with them), and either w(k) or v(k), or both, may be nonzero-mean or colored noise processes. How to handle these situations is described in Lesson 22 of [12].

When x(0) and {w(k), k = 0, 1, ...} are jointly Gaussian, then {x(k), k = 0, 1, ...} is a Gauss-Markov sequence. Note that if x(0) and w(k) are individually Gaussian and statistically independent, they will be jointly Gaussian. Consequently, the mean and covariance of the state vector completely characterize it.

Let m_x(k) denote the mean of x(k). For our basic state-variable model, m_x(k) can be computed from the vector recursive equation

$$m_x(k+1) = \Phi(k+1,k)\, m_x(k) + \Psi(k+1,k)\, u(k) \tag{15.22}$$

where k = 0, 1, ..., and m_x(0) initializes (15.22). Let P_x(k) denote the covariance matrix of x(k). For our basic state-variable model, P_x(k) can be computed from the matrix recursive equation

$$P_x(k+1) = \Phi(k+1,k)\, P_x(k)\, \Phi'(k+1,k) + \Gamma(k+1,k)\, Q(k)\, \Gamma'(k+1,k) \tag{15.23}$$

where k = 0, 1, ..., and P_x(0) initializes (15.23). Equations (15.22) and (15.23) are easily programmed for a digital computer.
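As a concrete illustration of how easily (15.22) and (15.23) are programmed, here is a minimal sketch (Python/NumPy) for the time-invariant special case; the example matrices are assumptions chosen only for the demonstration.

```python
import numpy as np

def propagate_moments(Phi, Gamma, Psi, Q, m0, P0, u_seq):
    """Propagate the state mean and covariance of the basic state-variable
    model using (15.22) and (15.23); time-invariant matrices for brevity."""
    m, P = m0.copy(), P0.copy()
    means, covs = [m0.copy()], [P0.copy()]
    for u in u_seq:
        m = Phi @ m + Psi @ u                       # (15.22)
        P = Phi @ P @ Phi.T + Gamma @ Q @ Gamma.T   # (15.23)
        means.append(m.copy())
        covs.append(P.copy())
    return means, covs

# Illustrative second-order system with a scalar disturbance and input.
Phi = np.array([[0.9, 0.1], [0.0, 0.8]])
Gamma = np.array([[0.0], [1.0]])
Psi = np.array([[0.0], [0.5]])
Q = np.array([[0.01]])
means, covs = propagate_moments(Phi, Gamma, Psi, Q,
                                m0=np.zeros(2), P0=np.eye(2),
                                u_seq=[np.array([1.0])] * 20)
print(means[-1], covs[-1])
```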
For our basic state-variable model, when x(0), w(k), and v(k) are jointly Gaussian, then {z(k), k = 1, 2, ...} is Gaussian, and

$$m_z(k+1) = H(k+1)\, m_x(k+1) \tag{15.24}$$

and

$$P_z(k+1) = H(k+1)\, P_x(k+1)\, H'(k+1) + R(k+1) \tag{15.25}$$

where m_x(k+1) and P_x(k+1) are computed from (15.22) and (15.23), respectively.

For our basic state-variable model to be stationary, it must be time-invariant, and the probability density functions of w(k) and v(k) must be the same for all values of time. Because w(k) and v(k) are zero-mean and Gaussian, this means that Q(k) must equal the constant matrix Q and R(k) must equal the constant matrix R. Additionally, either x(0) = 0, or Φ(k, 0)x(0) ≈ 0 when k > k₀; in both cases x(k) will be in its steady-state regime, so stationarity is possible. If the basic state-variable model is time-invariant and stationary, and if Φ is associated with an asymptotically stable system (i.e., one whose poles all lie within the unit circle), then [1] matrix P_x(k) reaches a limiting (steady-state) solution P̄_x, and P̄_x is the solution of the following steady-state version of (15.23): P̄_x = Φ P̄_x Φ′ + Γ Q Γ′. This equation is called a discrete-time Lyapunov equation.
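For a time-invariant, stationary, asymptotically stable model, P̄_x can be computed numerically by simply iterating (15.23) until it converges; the sketch below (Python/NumPy, with illustrative example matrices) does exactly that. A dedicated discrete-Lyapunov solver could be used instead.

```python
import numpy as np

def steady_state_covariance(Phi, Gamma, Q, tol=1e-12, max_iter=10_000):
    """Solve the discrete-time Lyapunov equation P = Phi P Phi' + Gamma Q Gamma'
    by iterating (15.23) to convergence (valid when Phi is asymptotically
    stable, i.e., all eigenvalues inside the unit circle)."""
    P = np.zeros((Phi.shape[0], Phi.shape[0]))
    GQG = Gamma @ Q @ Gamma.T
    for _ in range(max_iter):
        P_next = Phi @ P @ Phi.T + GQG
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    raise RuntimeError("Lyapunov iteration did not converge; is Phi stable?")

Phi = np.array([[0.9, 0.1], [0.0, 0.8]])
Gamma = np.array([[0.0], [1.0]])
Q = np.array([[0.01]])
P_bar = steady_state_covariance(Phi, Gamma, Q)
# Check: P_bar satisfies the steady-state version of (15.23)
print(np.allclose(P_bar, Phi @ P_bar @ Phi.T + Gamma @ Q @ Gamma.T))
```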
15.9 State Estimation for the Basic State-Variable Model

Prediction, filtering, and smoothing are three types of mean-squared state estimation that have been developed since 1959. A predicted estimate of a state vector x(k) uses measurements which occur earlier than t_k and a model to make the transition from the last time point, say t_j, at which a measurement is available, to t_k. The success of prediction depends on the quality of the model; in state estimation we use the state equation model. Without a model, prediction is dubious at best.

A recursive mean-squared state filter is called a Kalman filter, because it was developed by Kalman around 1959 [9]. Although it was originally developed within a community of control theorists, and is regarded as the most widely used result of so-called "modern control theory," it is no longer viewed as a control theory result. It is a result within estimation theory; consequently, we now prefer to view it as a signal processing result. A filtered estimate of state vector x(k) uses all of the measurements up to and including the one made at time t_k.

A smoothed estimate of state vector x(k) not only uses measurements which occur earlier than t_k plus the one at t_k, but also uses measurements to the right of t_k. Consequently, smoothing can never be carried out in real time, because we have to collect "future" measurements before we can compute a smoothed estimate. If we don't look too far into the future, then smoothing can be performed subject to a delay of LT seconds, where T is our data sampling time and L is a fixed positive integer that describes how many sample points to the right of t_k are to be used in smoothing. Depending upon how many future measurements are used and how they are used, it is possible to create three types of smoother: (1) the fixed-interval smoother, x̂(k|N), k = 0, 1, ..., N−1, where N is a fixed positive integer; (2) the fixed-point smoother, x̂(k|j), j = k+1, k+2, ..., where k is a fixed positive integer; and (3) the fixed-lag smoother, x̂(k|k+L), k = 0, 1, ..., where L is a fixed positive integer.

15.9.1 Prediction

A single-stage predicted estimate of x(k) is denoted x̂(k|k−1). It is the mean-squared estimate of x(k) that uses all the measurements up to and including the one made at time t_{k−1}; hence, a single-stage predicted estimate looks exactly one time point into the future. This estimate is needed by the Kalman filter. From the fundamental theorem of estimation theory, we know that x̂(k|k−1) = E{x(k)|Z(k−1)}, where Z(k−1) = col(z(1), z(2), ..., z(k−1)), from which it follows that

$$\hat{x}(k|k-1) = \Phi(k,k-1)\, \hat{x}(k-1|k-1) + \Psi(k,k-1)\, u(k-1) \tag{15.26}$$

where k = 1, 2, .... Observe that x̂(k|k−1) depends on the filtered estimate x̂(k−1|k−1) of the preceding state vector x(k−1); therefore, Equation (15.26) cannot be used until we provide the Kalman filter.

Let P(k|k−1) denote the error-covariance matrix that is associated with x̂(k|k−1), i.e.,

$$P(k|k-1) = E\left\{ \left[ \tilde{x}(k|k-1) - m_{\tilde{x}}(k|k-1) \right]\left[ \tilde{x}(k|k-1) - m_{\tilde{x}}(k|k-1) \right]' \right\}$$

where x̃(k|k−1) = x(k) − x̂(k|k−1). Additionally, let P(k−1|k−1) denote the error-covariance matrix that is associated with x̂(k−1|k−1), i.e.,

$$P(k-1|k-1) = E\left\{ \left[ \tilde{x}(k-1|k-1) - m_{\tilde{x}}(k-1|k-1) \right]\left[ \tilde{x}(k-1|k-1) - m_{\tilde{x}}(k-1|k-1) \right]' \right\}$$

where x̃(k−1|k−1) = x(k−1) − x̂(k−1|k−1). Then

$$P(k|k-1) = \Phi(k,k-1)\, P(k-1|k-1)\, \Phi'(k,k-1) + \Gamma(k,k-1)\, Q(k-1)\, \Gamma'(k,k-1) \tag{15.27}$$

where k = 1, 2, .... Observe, from (15.26) and (15.27), that x̂(0|0) and P(0|0) initialize the single-stage predictor and its error covariance, where x̂(0|0) = m_x(0) and P(0|0) = P_x(0). A more general state predictor is possible, one that looks further than just one step; see ([12], Lesson 16) for its details.

The single-stage predicted estimate of z(k+1), ẑ(k+1|k), is given by ẑ(k+1|k) = H(k+1)x̂(k+1|k). The error between z(k+1) and ẑ(k+1|k) is z̃(k+1|k); z̃(k+1|k) is called the innovations process (or prediction error process, or measurement residual process), and this process plays a very important role in mean-squared filtering and smoothing. The following representations of the innovations process z̃(k+1|k) are equivalent:

$$\tilde{z}(k+1|k) = z(k+1) - \hat{z}(k+1|k) = z(k+1) - H(k+1)\,\hat{x}(k+1|k) = H(k+1)\,\tilde{x}(k+1|k) + v(k+1) \tag{15.28}$$

The innovations process is a zero-mean Gaussian white noise sequence, with

$$E\left\{ \tilde{z}(k+1|k)\, \tilde{z}'(k+1|k) \right\} = H(k+1)\, P(k+1|k)\, H'(k+1) + R(k+1) \tag{15.29}$$

The paper by Kailath [7] gives an excellent historical perspective of estimation theory and includes a very good historical account of the innovations process.

15.9.2 Filtering (the Kalman Filter)

The Kalman filter (KF) and its later extensions to nonlinear problems represent the most widely applied by-product of modern control theory. We begin by presenting the KF, which is the mean-squared filtered estimator of x(k+1), x̂(k+1|k+1), in predictor-corrector format:

$$\hat{x}(k+1|k+1) = \hat{x}(k+1|k) + K(k+1)\, \tilde{z}(k+1|k) \tag{15.30}$$

for k = 0, 1, ..., where x̂(0|0) = m_x(0) and z̃(k+1|k) is the innovations sequence in (15.28) (use the second equality to implement the KF). Kalman gain matrix K(k+1) is n × m, and is specified by the set of relations:

$$K(k+1) = P(k+1|k)\, H'(k+1) \left[ H(k+1)\, P(k+1|k)\, H'(k+1) + R(k+1) \right]^{-1} \tag{15.31}$$

$$P(k+1|k) = \Phi(k+1,k)\, P(k|k)\, \Phi'(k+1,k) + \Gamma(k+1,k)\, Q(k)\, \Gamma'(k+1,k) \tag{15.32}$$

and

$$P(k+1|k+1) = \left[ I - K(k+1)\, H(k+1) \right] P(k+1|k) \tag{15.33}$$

for k = 0, 1, ..., where I is the n × n identity matrix and P(0|0) = P_x(0).
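The following sketch (Python/NumPy) implements one predictor-corrector cycle of the KF, i.e., (15.26) and (15.32) for prediction followed by (15.31), (15.30), and (15.33) for correction, and runs it on data simulated from the basic state-variable model. The constant-velocity example matrices are assumptions made only for the demonstration.

```python
import numpy as np

def kalman_step(x, P, z, Phi, Gamma, Psi, H, Q, R, u):
    """One predictor-corrector cycle of the Kalman filter,
    Equations (15.26), (15.32) and (15.30), (15.31), (15.33)."""
    # Prediction: uses information only from the state equation
    x_pred = Phi @ x + Psi @ u                          # (15.26)
    P_pred = Phi @ P @ Phi.T + Gamma @ Q @ Gamma.T      # (15.32)
    # Correction: uses information only from the measurement equation
    S = H @ P_pred @ H.T + R                            # innovations covariance (15.29)
    K = P_pred @ H.T @ np.linalg.inv(S)                 # (15.31)
    x_filt = x_pred + K @ (z - H @ x_pred)              # (15.30), second equality of (15.28)
    P_filt = (np.eye(len(x)) - K @ H) @ P_pred          # (15.33)
    return x_filt, P_filt

# Illustrative run on simulated data.
rng = np.random.default_rng(2)
Phi = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity model (illustrative)
Gamma = np.array([[0.5], [1.0]])
Psi = np.zeros((2, 1))
H = np.array([[1.0, 0.0]])
Q, R = np.array([[0.01]]), np.array([[1.0]])
x_true = np.array([0.0, 1.0])
x_hat, P = np.zeros(2), 10.0 * np.eye(2)
for k in range(50):
    x_true = Phi @ x_true + Gamma @ (np.sqrt(Q[0, 0]) * rng.normal(size=1))
    z = H @ x_true + np.sqrt(R[0, 0]) * rng.normal(size=1)
    x_hat, P = kalman_step(x_hat, P, z, Phi, Gamma, Psi, H, Q, R, u=np.zeros(1))
print(x_hat, x_true)
```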
The KF involves feedback and contains within its structure a model of the plant. The feedback nature of the KF manifests itself in two different ways: in the calculation of x̂(k+1|k+1) and also in the calculation of the matrix of gains, K(k+1). Observe, also, from (15.26) and (15.32), that the predictor equations, which compute x̂(k+1|k) and P(k+1|k), use information only from the state equation, whereas the corrector equations, which compute K(k+1), x̂(k+1|k+1), and P(k+1|k+1), use information only from the measurement equation.

Once the gain is computed, then (15.30) represents a time-varying recursive digital filter. This is seen more clearly when (15.26) and (15.28) are substituted into (15.30); the resulting equation can be rewritten as

$$\hat{x}(k+1|k+1) = \left[ I - K(k+1) H(k+1) \right] \Phi(k+1,k)\, \hat{x}(k|k) + K(k+1)\, z(k+1) + \left[ I - K(k+1) H(k+1) \right] \Psi(k+1,k)\, u(k) \tag{15.34}$$

for k = 0, 1, .... This is a state equation for state vector x̂, whose time-varying plant matrix is [I − K(k+1)H(k+1)]Φ(k+1, k). Equation (15.34) is time-varying even if our basic state-variable model is time-invariant and stationary, because gain matrix K(k+1) is still time-varying in that case. It is possible, however, for K(k+1) to reach a limiting value (i.e., steady-state value, K̄), in which case (15.34) reduces to a recursive constant-coefficient filter.

Equation (15.34) is in recursive filter form, in that it relates the filtered estimate of x(k+1), x̂(k+1|k+1), to the filtered estimate of x(k), x̂(k|k). Using substitutions similar to those in the derivation of (15.34), we can also obtain the following recursive predictor form of the KF:

$$\hat{x}(k+1|k) = \Phi(k+1,k)\left[ I - K(k) H(k) \right] \hat{x}(k|k-1) + \Phi(k+1,k)\, K(k)\, z(k) + \Psi(k+1,k)\, u(k) \tag{15.35}$$

Observe that in (15.35) the predicted estimate of x(k+1), x̂(k+1|k), is related to the predicted estimate of x(k), x̂(k|k−1), and that the time-varying plant matrix in (15.35) is different from the time-varying plant matrix in (15.34).

Embedded within the recursive KF is another set of recursive equations, (15.31) to (15.33). Because P(0|0) initializes these calculations, these equations must be ordered as follows: P(k|k) → P(k+1|k) → K(k+1) → P(k+1|k+1), etc. By combining these equations, it is possible to get a matrix equation for P(k+1|k) as a function of P(k|k−1), or a similar equation for P(k+1|k+1) as a function of P(k|k). These equations are nonlinear and are known as matrix Riccati equations.

A measure of recursive predictor performance is provided by matrix P(k+1|k), and a measure of recursive filter performance is provided by matrix P(k+1|k+1). These covariances can be calculated prior to any processing of real data, using (15.31) to (15.33). These calculations are often referred to as a performance analysis, and P(k+1|k+1) ≠ P(k+1|k). It is indeed interesting that the KF utilizes a measure of its mean-squared error during its real-time operation.

Because of the equivalence between mean-squared, BLUE, and WLS filtered estimates of our state vector x(k) in the Gaussian case, we must realize that the KF equations are just a recursive solution to a system of normal equations. Other implementations of the KF that solve the normal equations using stable algorithms from numerical linear algebra (see, e.g., [2]) and involve orthogonal transformations have better numerical properties than (15.30) to (15.33) (see, e.g., [4]). A recursive BLUE of a
random parameter vector θ can be obtained from the KF equations by setting x(k) = θ, 8(k + 1, k) = I, 0(k + 1, k) = 0, 9(k + 1, k) = and Q(k) = Under these conditions we see that w(k) = for all k, and x(k + 1) = x(k), which means, of course, that x(k) is a vector of constants, θ The KF equations reduce to: θˆ (k + 1|k + 1) = θˆ (k|k) + K(k + 1)[z(k + 1) − H(k + 1)θˆ (k|k)], P(k+1|k) = P(k|k), K(k+1) = P(k|k)H0 (k+1)[H(k+1)P(k|k)H0 (k+1)+R(k+1)]−1 , and P(k + 1|k + 1) = [I − K(k + 1)H(k + 1)]P(k|k) Note that it is no longer necessary to distinguish between filtered and predicted quantities, because θˆ (k + 1|k) = θˆ (k|k) and P(k + 1|k) = P(k|k); ˆ hence, the notation θ(k|k) can be simplified to θˆ (k), for example, which is consistent with our earlier notation for the estimate of a vector of constant parameters A divergence phenomenon may occur when either the process noise or measurement noise or both are too small In these cases the Kalman filter may lock onto wrong values for the state, but believes them to the true values; i.e., it “learns” the wrong state too well A number of different remedies have been proposed for controlling divergence effects, including: (1) adding fictitious process noise, (2) finite-memory filtering, and (3) fading memory filtering Fading memory filtering seems to be the most successful and popular way to control divergence effects See [6] or [12] for discussions about these remedies For time-invariant and stationary systems, if limk→∞ P(k+1|k) = Pp exists, then limk→∞ K(k) = ¯ and the Kalman filter becomes a constant coefficient filter Because P(k + 1|k) and P(k|k) are K intimately related, then if Pp exists, limk→∞ P(k|k) = Pf also exists If the basic state-variable model is time-invariant, stationary, and asymptotically stable, then: (a) for any nonnegative symmetric 1999 by CRC Press LLC c initial condition P(0| − 1), we have limk→∞ P(k + 1|k) = Pp with Pp independent of P(0| − 1) and satisfying the following steady-state algebraic matrix Riccati equation, h i −1 HPp 80 + 0Q0 (15.36) Pp = 8Pp I − H0 HPp H0 + R ¯ (b) The eigenvalues of the steady-state KF, λ[8 − KH8], all lie within the unit circle, so that the filter ¯ is asymptotically stable, i.e., |λ[8 − KH8]| < If the basic state-variable model is time-invariant and stationary, but is not necessarily asymptotically stable (e.g., it may have a pole on the unit circle), the points (a) and (b) still hold as long as the basic state-variable model is completely stabilizable and detectable (e.g., [8]) To design a steady-state KF: (1) Given (8, 0, 9, H, Q, R), compute Pp , the ¯ in ¯ as K ¯ = Pp H0 (HPp H0 + R)−1 ; and (3) use K positive definite solution of (15.36); (2) compute K, x(k ˆ + 1|k + 1) = = ¯ z(k + 1|k) 8x(k|k) ˆ + 9u(k) + K˜   ¯ ¯ ¯ I − KH 8x(k|k) ˆ + Kz(k + 1) + I − KH 9u(k) (15.37) Equation (15.37) is a steady-state filter state equation The main advantage of the steady-state filter is a drastic reduction in on-line computations 15.9.3 Smoothing Although there are three types of smoothers, the most useful one for digital signal processing is the fixed-interval smoother, hence, we only discuss it here The fixed-interval smoother is x(k|N ˆ ), k = 0, 1, , N − 1, where N is a fixed positive integer The situation here is as follows: with an experiment completed, we have measurements available over the fixed interval ≤ k ≤ N For each time point within this interval we wish to obtain the optimal estimate of the state vector x(k), which is based on all the available measurement data {z(j ), j = 1, 2, , 
N} Fixed-interval smoothing is very useful in signal processing situations, where the processing is done after all the data are collected It cannot be carried out on-line during an experiment like filtering can Because all the available data are used, we cannot hope to better (by other forms of smoothing) than by fixed-interval smoothing A mean-squared fixed-interval smoothed estimate of x(k), x(k|N ˆ ), is x(k|N) ˆ = x(k|k ˆ − 1) + P(k|k − 1)r(k|N ) (15.38) where k = N − 1, N − 2, , 1, and n × vector r satisfies the backward-recursive equation  −1 z˜ (j |j − 1) (15.39) r(j |N) = 80p (j + 1, j )r(j + 1|N) + H0 (j ) H(j )P(j |j − 1)H0 (j ) + R(j ) where 8p (k + 1, k) = 8(k + 1, k)[I − K(k)H(k)] and j = N, N − 1, , 1, and, r(N + 1|N ) = The smoothing error-covariance matrix P(k|N ), is P(k|N) = P(k|k − 1) − P(k|k − 1)S(k|N )P(k|k − 1) (15.40) where k = N − 1, N − 2, , 1, and n × n matrix S(j |N ), which is the covariance matrix of r(j |N ), satisfies the backward-recursive equation S(j |N) = 80p (j + 1, j )S(j + 1|N )8p (j + 1, j )  −1 H(j ) + H0 (j ) H(j )P(j |j − 1)H0 (j ) + R(j ) (15.41) where j = N, N − 1, , 1, and S(N + 1|N ) = Observe that fixed-interval smoothing involves a forward pass over the data, using a KF, and then a backward pass over the innovations, using (15.39) 1999 by CRC Press LLC c The smoothing error-covariance matrix, P(k|N ), can be precomputed; but, it is not used during the computation of x(k|N) ˆ This is quite different than the active use of the filtering error-covariance matrix in the KF An important application for fixed-interval smoothing is deconvolution Consider the single-input single-output system z(k) = k X µ(i)h(k − i) + ν(k) k = 1, 2, , N (15.42) i=1 where µ(j ) is the system’s input, which is assumed to be white, and not necessarily Gaussian, and h(j ) is the system’s impulse response Deconvolution is the signal-processing procedure for removing the effects of h(j ) and ν(j ) from the measurements so that we are left with an estimate of µ(j ) In order to obtain a fixed-interval smoothed estimate of µ(j ), we must first convert (15.42) into an equivalent state-variable model The single-channel state-variable model x(k +1) = 8x(k)+γ µ(k) and z(k) = h0 x(k) + ν(k) is equivalent to (15.42) when x(0) = 0, µ(0) = 0, h(0) = 0, and h(l) = ˆ ) = q(k)γ r(k +1|N ) h0 8l−i γ (l = 1, 2, ) A two-pass fixed-interval smoother for µ(k) is µ(k|N 2 where k = N −1, N −2, , The smoothing error variance, σµ (k|N ), is σµ (k|N ) = q(k)−q(k)γ S(k + 1|N)γ q(k) In these formulas r(k|N ) and S(k|N ) are computed using (15.39) and (15.41), respectively, and E{µ2 (k)} = q(k) 15.10 Digital Wiener Filtering The steady-state KF is a recursive digital filter with filter coefficients equal to hf (j ), j = 0, 1, Quite often hf (j ) ≈ for j ≥ J , so that the transfer function of this filter, Hf (z), can be truncated, i.e., Hf (z) ≈ hf (0) + hf (1)z−1 + + hf (J )z−J The truncated steady-state, KF can then be implemented as a finite-impulse response (FIR) digital filter There is, however, a more direct way for designing a FIR minimum mean-squared error filter, i.e., a digital Wiener filter (WF) Consider the scalar measurement case, in which measurement z(k) is to be processed by a digital filter F (z), whose coefficients, f (0), f (1), , f (η), are obtained by minimizingPthe mean-squared n error I (f ) = E{[d(k) − y(k)]2 } = E{e2 (k)}, where y(k) = f (k) ∗ z(k) = i=0 f (i)z(k − i) and d(k) is a desired filter output signal Using calculus, it is straightforward to show that the filter 
coefficients that minimize I(f) satisfy the following discrete-time Wiener-Hopf equations:

$$\sum_{i=0}^{\eta} f(i)\, \phi_{zz}(i-j) = \phi_{zd}(j), \qquad j = 0, 1, \ldots, \eta \tag{15.43}$$

where φ_zd(i) = E{d(k)z(k−i)} and φ_zz(i−m) = E{z(k−i)z(k−m)}. Observe that (15.43) is a system of normal equations and can be solved in many different ways, including the Levinson algorithm. The minimum mean-squared error, I*(f), in general approaches a nonzero limiting value, which is often reached for modest values of filter length η.

To relate this FIR WF to the truncated steady-state KF, we must first assume a signal-plus-noise model for z(k), because a KF uses a system model, i.e., z(k) = s(k) + ν(k) = h(k) * w(k) + ν(k), where h(k) is the impulse response of a linear time-invariant system and, as in our basic state-variable model, w(k) and ν(k) are mutually uncorrelated (stationary) white noise sequences with variances q and r, respectively. We must also specify an explicit form for the "desired signal" d(k). We shall require that d(k) = s(k) = h(k) * w(k), which means that we want the output of the FIR digital WF to be as close as possible to the signal s(k). The resulting Wiener-Hopf equations are

$$\sum_{i=0}^{\eta} f(i) \left[ \frac{q}{r}\, \phi_{hh}(j-i) + \delta(j-i) \right] = \frac{q}{r}\, \phi_{hh}(j), \qquad j = 0, 1, \ldots, \eta \tag{15.44}$$

where φ_hh(i) = Σ_{l=0}^{∞} h(l)h(l+i). The truncated steady-state KF is a FIR digital WF. For a detailed comparison of Kalman and Wiener filters, see ([12], Lesson 19).
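A small sketch of the FIR Wiener filter design (15.43) is given below (Python/NumPy). It estimates the required correlations from finite data records, which is an assumption made only for the demonstration (the text treats them as known), and solves the resulting Toeplitz system with a general linear solver; the Levinson algorithm could exploit the Toeplitz structure instead.

```python
import numpy as np

def fir_wiener(z, d, eta):
    """Design FIR Wiener filter coefficients f(0), ..., f(eta) by solving
    the discrete-time Wiener-Hopf equations (15.43), with the correlations
    estimated from the records z(k) and d(k)."""
    N = len(z)
    # Sample correlations (biased estimates), phi_zz(m) and phi_zd(m), m = 0..eta
    phi_zz = np.array([np.dot(z[m:], z[:N - m]) / N for m in range(eta + 1)])
    phi_zd = np.array([np.dot(d[m:], z[:N - m]) / N for m in range(eta + 1)])
    # (15.43) is a symmetric Toeplitz system in the filter coefficients.
    A = np.array([[phi_zz[abs(i - j)] for i in range(eta + 1)]
                  for j in range(eta + 1)])
    return np.linalg.solve(A, phi_zd)

# Illustrative use: design a smoothing filter for a noisy signal.
rng = np.random.default_rng(3)
w = rng.normal(size=2000)
h = np.array([1.0, 0.8, 0.4])                  # illustrative impulse response
s = np.convolve(w, h, mode="full")[:2000]      # desired signal d(k) = s(k)
z = s + 0.5 * rng.normal(size=2000)            # measurements
f = fir_wiener(z, s, eta=8)
print(f)
```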
measurements in state prediction, filtering, and smoothing The latter are based on an expanding memory, whereas the former is based on a fixed memory Digital signal-processing specialists have invented a related type of linear prediction named backward linear prediction in which the objective is to predict a past value of a stationary discrete-time random sequence using a set of future values of the sequence Of course, backward linear prediction is not prediction at all; it is smoothing But the term backward linear prediction is firmly entrenched in the DSP literature Both forward and backward PEFs have a filter architecture associated with them that is known as a tapped delay line Remarkably, when the two filter design problems are considered simultaneously, their solutions can be shown to be coupled, and the resulting architecture is called a lattice The lattice filter is doubly recursive in both time, k, and filter order, M The tapped delay line is only recursive in time Changing its filter length leads to a completely new set of filter coefficients Adding another stage to the lattice filter does not affect the earlier filter coefficients 1999 by CRC Press LLC c Consequently, the lattice filter is a very powerful architecture No such lattice architecture is known for mean-squared state estimators In a second approach to the design of the FPE coefficients, the constraint that the FPE coefficients are constant is transformed into the state equations: aM,1 (k + 1) = aM,1 (k), aM,2 (k + 1) = aM,2 (k), , aM,M (k + 1) = aM,M (k) Equation (15.47) then plays the role of the observation equation in our basic state-variable model, and is one in which the observation matrix is time-varying The resulting mean-squared error design is then referred to as the Kalman filter solution for the PEF coefficients Of course, we saw above that this solution is a very special case of the KF, the BLUE In yet a third approach, the PEF coefficients are modeled as: aM,1 (k + 1) = aM,1 (k) + w1 (k), aM,2 (k + 1) = aM,2 (k) + w2 (k), , aM,M (k + 1) = aM,M (k) + wM (k) where wi (k) are white noises with variances qi Equation (15.47) again plays the role of the measurement equation in our basic state-variable model and is one in which the observation matrix is time-varying The resulting mean-squared error design is now a full-blown KF 15.12 Iterated Least Squares Iterated least squares (ILS) is a procedure for estimating parameters in a nonlinear model Because it can be viewed as the basis for the extended KF, which is described in the next section, we describe ILS briefly here To keep things simple, we describe ILS for the scalar parameter model z(k) = f (θ, k) + ν(k) where k = 1, 2, , N ILS is basically a four-step procedure: (1) Linearize f (θ, k) about a nominal value of θ, θ ∗ Doing this, we obtain the perturbation measurement equation δz(k) = Fθ (k; θ ∗ )δθ + ν(k) k = 1, 2, , N (15.48) where δz(k) = z(k) − z∗ (k) = z(k) − f (θ ∗ , k), δθ = θ − θ ∗ , and Fθ (k; θ ∗ ) = ∂f (θ, k)/∂θ|θ =θ ∗ ; ˆ WLS (N ) using (15.2); (3) Solve the (2) Concatenate (15.48) for the N values of k and compute δθ ˆ WLS (N ); (4) Replace θ ∗ ˆ WLS (N) = θˆWLS (N) − θ ∗ for θˆWLS (N ), i.e., θˆWLS (N ) = θ ∗ + δθ equation δθ i with θˆWLS (N) and return to step Iterate through these steps until convergence occurs Let θˆWLS (N ) i+1 ˆ and θWLS (N) denote estimates of θ obtained at iterations i and i +1, respectively Convergence of the i+1 i (N) − θˆWLS (N )| < ε where ε is a prespecified small positive number ILS method occurs when |θˆWLS 
Observe from this four-step procedure that ILS uses the estimate obtained from the linearized model to generate the nominal value of θ about which the nonlinear model is relinearized Additionally, in each complete cycle of this procedure, we use both the nonlinear and linearized models The nonlinear model is used to compute z∗ (k) and subsequently δz(k) The notions of relinearizing about a filter output and using both the nonlinear and linearized models are also at the very heart of the extended KF 15.13 Extended Kalman Filter Many real-world systems are continuous-time in nature and are also nonlinear The extended Kalman filter (EKF) is the heuristic, but very widely used, application of the KF to estimation of the state vector for the following nonlinear dynamical system: x(t) ˙ = f [x(t), u(t), t] + G(t)w(t) z(t) = h [x(t), u(t), t] + v(t) t = ti , 1999 by CRC Press LLC c (15.49) i = 1, 2, (15.50) In this model measurement equation (15.50) is treated as a discrete-time equation, whereas state equation (15.49) is treated as a continuous-time equation; x(t) ˙ is short for dx(t)/dt; both f and h are continuous and continuously differentiable with respect to all elements of x and u; w(t) is a zero-mean continuous-time white noise process, with E{w(t)w (τ )} = Q(t)δ(t − τ ); v(ti ) is a discrete-time zero-mean white noise sequence, with E{v(ti )v (tj )} = R(ti )δij ; and, w(t) and v(ti ) are mutually uncorrelated at all t = ti , i.e., E{w(t)v (ti )} = for t = ti , i = 1, 2, In order to apply the KF to (15.49) and (15.50) we must linearize and discretize these equations Linearization is done about a nominal input u∗ (t) and nominal trajectory x ∗ (t), whose choices we discuss below If we are given a nominal input u∗ (t), then x ∗ (t) satisfies the nonlinear differential equation   (15.51) x˙ ∗ (t) = f x ∗ (t), u∗ (t), t and associated with x ∗ (t) and u∗ (t) is the following nominal measurement, z∗ (t), where   t = ti , i = 1, 2, z∗ (t) = h x ∗ (t), u∗ (t), t (15.52) Equations (15.51) and (15.52) are referred to as the nominal system model Letting δx(t) = x(t) − x ∗ (t), δu(t) = u(t) − u∗ (t), and δz(t) = z(t) − z∗ (t), we have the following linear perturbation state-variable model:     (15.53) δ x(t) ˙ = Fx x ∗ (t), u∗ (t), t δx(t) + Fu x ∗ (t), u∗ (t), t δu(t) + G(t)w(t)    ∗  δz(t) = Hx x (t), u∗ (t), t δx(t) + Hu x ∗ (t), u∗ (t), t δu(t) + v(t), i = 1, 2, (15.54) t = ti , Where Fx [x ∗ (t), u∗ (t), t], for example, is the following time-varying Jacobian matrix,   ∂f1 /∂x1∗ · · · ∂f1 /∂xn∗     Fx x ∗ (t), u∗ (t), t =   ∗ ∗ ∂fn /∂x1 · · · ∂fn /∂xn (15.55) in which ∂fi /∂xj∗ = ∂fi [x(t), u(t), t]/∂xj (t)|x(t)=x ∗ (t),u(t)=u∗ (t) Starting with (15.53) and (15.54), we obtain the following discretized perturbation state variable model:   (15.56) δx(k + 1) = k + 1, k;∗ δx(k) + k + 1, k;∗ δu(k) + wd (k)   ∗ ∗ δz(k + 1) = Hx k + 1; δx(k + 1) + Hu k + 1; δu(k + 1) + v(k + 1) (15.57) where the notation 8(k + 1, k;∗ ), for example, denotes the fact that this matrix depends on x ∗ (t) and u∗ (t) In (15.56), 8(k + 1, k;∗ ) = 8(tk+1 , tk ;∗ ), where      ˙ t, τ ;∗ = Fx x ∗ (t), u∗ (t), t t, τ ;∗ , t, t;∗ = I (15.58) Additionally,  k + 1, k;∗ = Z tk+1 tk    tk+1 , τ ;∗ Fu x ∗ (τ ), u∗ (τ ), τ dτ (15.59) Rt and wd (k) is a zero-mean noise sequence that is statistically equivalent to tkk+1 8(tk+1 , τ )G(τ )w(τ )dτ ; hence, its covariance matrix, Qd (k + 1, k), is Z tk+1  (tk+1 , τ ) G(τ )Q(τ )G0 (τ )80 (tk+1 , τ ) dτ (15.60) E wd (k)wd (k) = Qd (k + 1, k) = tk 1999 by CRC Press LLC c Great 
Great simplifications of the calculations in (15.58), (15.59), and (15.60) occur if F(t), B(t), G(t), and Q(t) are approximately constant during the time interval t ∈ [t_k, t_{k+1}], i.e., if F(t) ≈ F_k, B(t) ≈ B_k, G(t) ≈ G_k, and Q(t) ≈ Q_k for t ∈ [t_k, t_{k+1}]. In this case Φ(k+1, k) = e^{F_k T}, Ψ(k+1, k) ≈ B_k T = Ψ(k), and Q_d(k+1, k) ≈ G_k Q_k G_k' T = Q_d(k), where T = t_{k+1} − t_k.

Suppose x*(t) is given a priori; then we can compute predicted, filtered, or smoothed estimates of δx(k) by applying all of our previously derived state estimators to the discretized perturbation state-variable model in (15.56) and (15.57). We can precompute x*(t) by solving the nominal differential equation (15.51). The KF associated with a precomputed x*(t) is known as a relinearized KF. A relinearized KF usually gives poor results, because it relies on an open-loop strategy for choosing x*(t). When x*(t) is precomputed, there is no way of forcing x*(t) to remain close to x(t), and this must be done or else the perturbation state-variable model is invalid. The relinearized KF is based only on the discretized perturbation state-variable model; it does not use the nonlinear nature of the original system in an active manner.

The EKF relinearizes the nonlinear system about each new estimate as it becomes available; i.e., at k = 0 the system is linearized about x̂(0|0). Once z(1) is processed by the EKF so that x̂(1|1) is obtained, the system is linearized about x̂(1|1). By "linearize about x̂(1|1)," we mean that x̂(1|1) is used to calculate all the quantities needed to make the transition from x̂(1|1) to x̂(2|1) and, subsequently, to x̂(2|2). The purpose of relinearizing about the filter's output is to use a better reference trajectory for x*(t). Doing this, δx = x − x̂ is held as small as possible, so that our linearization assumptions are less likely to be violated than in the case of the relinearized KF.

The EKF is available only in predictor-corrector format [6]. Its prediction equation is obtained by integrating the nominal differential equation for x*(t) from t_k to t_{k+1}. Its correction equation is obtained by applying the KF to the discretized perturbation state-variable model. The equations for the EKF are:

\hat{x}(k+1|k) = \hat{x}(k|k) + \int_{t_k}^{t_{k+1}} f[\hat{x}(t|t_k), u^*(t), t]\,dt    (15.61)

which must be evaluated by numerical integration formulas that are initialized by f[x̂(t_k|t_k), u*(t_k), t_k];

\hat{x}(k+1|k+1) = \hat{x}(k+1|k) + K(k+1; *)\{z(k+1) - h[\hat{x}(k+1|k), u^*(k+1), k+1] - H_u(k+1; *)\,\delta u(k+1)\}    (15.62)

K(k+1; *) = P(k+1|k; *)\,H_x'(k+1; *)\,[H_x(k+1; *)\,P(k+1|k; *)\,H_x'(k+1; *) + R(k+1)]^{-1}    (15.63)

P(k+1|k; *) = \Phi(k+1, k; *)\,P(k|k; *)\,\Phi'(k+1, k; *) + Q_d(k+1, k; *)    (15.64)

P(k+1|k+1; *) = [I - K(k+1; *)\,H_x(k+1; *)]\,P(k+1|k; *)    (15.65)

In these equations, K(k+1; ∗), P(k+1|k; ∗), and P(k+1|k+1; ∗) depend on the nominal x*(t) that results from prediction, x̂(k+1|k). For a complete flowchart of the EKF, see Figure 24-2 in [12].

The EKF is very widely used; however, it does not provide an optimal estimate of x(k). The optimal mean-squared estimate of x(k) is still E{x(k)|Z(k)}, regardless of the linear or nonlinear nature of the system's model. The EKF is a first-order approximation of E{x(k)|Z(k)} that sometimes works quite well, but it cannot be guaranteed to always work well. No convergence results are known for the EKF; hence, the EKF must be viewed as an ad hoc filter. Alternatives to the EKF, which are based on nonlinear filtering, are quite complicated and are rarely used.
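The following Python sketch shows one predict-correct cycle in the spirit of (15.61)–(15.65). It is our own simplified rendering, not the chapter's flowchart: it assumes a time-invariant model with no deterministic input (so the H_u δu term vanishes), a single Euler step for the prediction integral over a fixed sampling interval T, and the constant-matrix approximations for Φ and Q_d given above.

```python
import numpy as np
from scipy.linalg import expm

def ekf_cycle(x_filt, P_filt, z_next, f, h, Fx, Hx, G, Q, R, T):
    """One EKF cycle for x_dot = f(x) + G w, z = h(x) + v (no deterministic input).

    f, h    : nonlinear state and measurement functions of x
    Fx, Hx  : callables returning the Jacobians of f and h at a given x
    G, Q, R : noise input matrix, process- and measurement-noise covariances
    """
    # Prediction (15.61), via one Euler step from the filtered estimate
    x_pred = x_filt + T * f(x_filt)
    # Relinearize about x_filt, discretize, and propagate the covariance (15.64)
    Phi = expm(Fx(x_filt) * T)
    Qd = G @ Q @ G.T * T
    P_pred = Phi @ P_filt @ Phi.T + Qd
    # Correction (15.62), (15.63), (15.65), relinearized about x_pred
    H = Hx(x_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z_next - h(x_pred))
    P_new = (np.eye(x_filt.size) - K @ H) @ P_pred
    return x_new, P_new
```

Repeating the corrector portion with the relinearization point replaced by the latest corrected estimate gives the iterated EKF described next.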
The EKF is designed to work well as long as δx(k) is "small." The iterated EKF [6] is designed to keep δx(k) as small as possible. The iterated EKF differs from the EKF in that it iterates the correction equation L times, until ‖x̂^L(k+1|k+1) − x̂^{L−1}(k+1|k+1)‖ ≤ ε. Corrector 1 computes K(k+1; ∗), P(k+1|k; ∗), and P(k+1|k+1; ∗) using x* = x̂(k+1|k); corrector 2 computes these quantities using x* = x̂^1(k+1|k+1); corrector 3 computes these quantities using x* = x̂^2(k+1|k+1); etc. Often, just adding one additional corrector (i.e., L = 2) leads to substantially better results for x̂(k+1|k+1) than are obtained using the EKF.

Acknowledgment

The author gratefully acknowledges Prentice-Hall for extending permission to include summaries of material that appeared originally in Lessons in Estimation Theory for Signal Processing, Communications, and Control [12].

References

[1] Anderson, B.D.O. and Moore, J.B., Optimal Filtering, Prentice-Hall, Englewood Cliffs, NJ, 1979.
[2] Bierman, G.J., Factorization Methods for Discrete Sequential Estimation, Academic Press, New York, 1977.
[3] Golub, G.H. and Van Loan, C.F., Matrix Computations, 2nd ed., Johns Hopkins Univ. Press, Baltimore, MD, 1989.
[4] Grewal, M.S. and Andrews, A.P., Kalman Filtering: Theory and Practice, Prentice-Hall, Englewood Cliffs, NJ, 1993.
[5] Haykin, S., Adaptive Filter Theory, 2nd ed., Prentice-Hall, Englewood Cliffs, NJ, 1991.
[6] Jazwinski, A.H., Stochastic Processes and Filtering Theory, Academic Press, New York, 1970.
[7] Kailath, T., A view of three decades of linear filtering theory, IEEE Trans. Inf. Theory, IT-20: 146–181, 1974.
[8] Kailath, T., Linear Systems, Prentice-Hall, Englewood Cliffs, NJ, 1980.
[9] Kalman, R.E., A new approach to linear filtering and prediction problems, Trans. ASME J. Basic Eng., Series D, 82: 35–46, 1960.
[10] Kashyap, R.L. and Rao, A.R., Dynamic Stochastic Models from Empirical Data, Academic Press, New York, 1976.
[11] Ljung, L., System Identification: Theory for the User, Prentice-Hall, Englewood Cliffs, NJ, 1987.
[12] Mendel, J.M., Lessons in Estimation Theory for Signal Processing, Communications, and Control, Prentice-Hall PTR, Englewood Cliffs, NJ, 1995.

Further Information

Recent articles about estimation theory appear in many journals, including the following engineering journals: AIAA J., Automatica, IEEE Trans. on Aerospace and Electronic Systems, IEEE Trans. on Automatic Control, IEEE Trans. on Information Theory, IEEE Trans. on Signal Processing, Int. J. Adaptive Control and Signal Processing, Int. J. Control, and Signal Processing. Nonengineering journals that also publish articles about estimation theory include: Annals Inst. Statistical Math., Ann. Math. Statistics, Ann. Statistics, Bull. Inst. Internat. Stat., and Sankhya. Some engineering conferences that continue to have sessions devoted to aspects of estimation theory include: American Automatic Control Conference, IEEE Conference on Decision and Control, IEEE International Conference on Acoustics, Speech and Signal Processing, IFAC International Congress, and some IFAC Workshops.