EE 5510 Random Processes: Estimation Theory Basics

What is Estimation?
Make a best determination of some quantity given measurements (information) that are related in a known, or partially known, way to that quantity, and possibly also given some prior information. The prior information may be
• a random variable (this is the Bayesian approach),
• non-random, or
• a set of admissible values.

Example: Estimation of the mean of a random variable:
$$\hat m_x = \frac{1}{n}\sum_{i=1}^{n} X_i$$
is an estimator for $m_x = E(X)$, the mean of the random variable $X$. We know that $\hat m_x$ is a "good" estimator, since $P(|\hat m_x - m_x| < \epsilon) \to 1$ by the law of large numbers.

Parameter Estimation
We are given a number of observations $x_i$, i.e., $\mathbf{x} = [x_1, x_2, \cdots, x_n]$. This sample vector of "data" is called a statistic. From this statistic we are to estimate the desired parameter, i.e.,
$$\hat\theta = g(\mathbf{x})$$

Unbiased Estimator
An estimator is unbiased if
$$E(\hat\theta) = \theta$$

Example: The estimator $\hat m_x = \frac{1}{n}\sum_{i=1}^n x_i$ is an unbiased estimator, since
$$E(\hat m_x) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = m_x$$

Example: Estimation of an unknown variance (with unknown mean):
$$\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \hat m_x)^2$$
This estimator is unbiased, since for each $i$
$$E\big((x_i - \hat m_x)^2\big) = E(x_i^2) - 2E\Big(x_i \frac{1}{n}\sum_{j=1}^n x_j\Big) + E\Big(\frac{1}{n^2}\sum_{j=1}^n\sum_{k=1}^n x_j x_k\Big)$$
$$= E(x_i^2) - \frac{2}{n}E(x_i^2) - \frac{2(n-1)}{n}m_x^2 + \frac{1}{n}E(x_i^2) + \frac{n-1}{n}m_x^2$$
$$= E(x_i^2)\Big(1 - \frac{1}{n}\Big) - \frac{n-1}{n}m_x^2 = \frac{n-1}{n}\sigma^2,$$
and therefore $E(\hat\sigma^2) = \frac{1}{n-1}\sum_{i=1}^n E\big((x_i-\hat m_x)^2\big) = \sigma^2$.

Biased Coin Toss
Assume that a coin is biased and shows head (= 1) with probability $\theta \neq 0.5$, and tail with probability $1-\theta$. We observe $n$ tosses of this coin and are to estimate the probability $\theta$ of a head. Let us form the statistic
$$z = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
The mean of this estimator is
$$E(Z) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \theta$$
and hence this estimator is unbiased. The variance of this estimator is
$$\sigma_z^2 = E\big((Z-\theta)^2\big) = E\Big(\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n X_iX_j\Big) - 2E\Big(\frac{1}{n}\sum_{i=1}^n X_i\Big)\theta + \theta^2$$
$$= \frac{1}{n^2}\big(n(n-1)\theta^2 + n\theta\big) - \theta^2 = \frac{\theta(1-\theta)}{n}$$
The question arises whether this is the "best" estimator.

Minimum Variance Estimators (MVE)
Let us find the estimator which minimizes
$$E\big((\theta - \hat\theta)^2\big)$$
However, the MVE cannot, in general, be determined, since
$$E\big((\theta-\hat\theta)^2\big) = E\Big(\big((\theta - E(\hat\theta)) - (\hat\theta - E(\hat\theta))\big)^2\Big) = \big(\theta - E(\hat\theta)\big)^2 + \mathrm{var}(\hat\theta)$$
and the term $(\theta - E(\hat\theta))^2$, the squared bias, depends on the unknown quantity $\theta$.

Cramér-Rao Lower Bound

Assumptions: We assume that we have knowledge of the parametric distribution $f(\mathbf{x},\theta)$, and that this distribution is differentiable with respect to $\theta$. We also assume that the domain of the statistic $\mathbf{x}$ does not depend on the parameter $\theta$.

Derivation: We now concentrate on unbiased estimators, for which
$$E(\hat\theta - \theta) = \int (\hat\theta - \theta) f(\mathbf{x},\theta)\,d\mathbf{x} = 0.$$
Differentiating with respect to $\theta$,
$$\frac{d}{d\theta}\int(\hat\theta-\theta)f(\mathbf{x},\theta)\,d\mathbf{x} = -\int f(\mathbf{x},\theta)\,d\mathbf{x} + \int(\hat\theta-\theta)\frac{d f(\mathbf{x},\theta)}{d\theta}\,d\mathbf{x} = 0$$
Using the identity
$$\frac{d f(\mathbf{x},\theta)}{d\theta} = \frac{d\ln f(\mathbf{x},\theta)}{d\theta}\, f(\mathbf{x},\theta)$$
we develop
$$1 = \int(\hat\theta-\theta)\,\frac{d\ln f(\mathbf{x},\theta)}{d\theta}\, f(\mathbf{x},\theta)\,d\mathbf{x}$$
and, squaring and applying the Schwarz inequality,
$$1 \le \int(\hat\theta-\theta)^2 f(\mathbf{x},\theta)\,d\mathbf{x}\;\int\Big(\frac{d\ln f(\mathbf{x},\theta)}{d\theta}\Big)^2 f(\mathbf{x},\theta)\,d\mathbf{x}$$
from which we find the Cramér-Rao lower bound (CRLB):
$$\mathrm{var}(\hat\theta) \ge \frac{1}{E\Big(\big(\frac{d\ln f(\mathbf{x},\theta)}{d\theta}\big)^2\Big)}$$
Equality in the CRLB is achieved if equality is achieved in the Schwarz inequality used to prove it, i.e., if
$$\frac{d\ln f(\mathbf{x},\theta)}{d\theta} = I(\theta)\,(\hat\theta - \theta)$$
where
$$I(\theta) = \int\Big(\frac{d\ln f(\mathbf{x},\theta)}{d\theta}\Big)^2 f(\mathbf{x},\theta)\,d\mathbf{x} = -\int\frac{d^2\ln f(\mathbf{x},\theta)}{d\theta^2}\, f(\mathbf{x},\theta)\,d\mathbf{x}$$
is the Fisher information about $\theta$ contained in the observation $\mathbf{x}$.
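The coin-toss estimator above in fact attains this bound: for $n$ Bernoulli($\theta$) observations the Fisher information is $I(\theta) = n/(\theta(1-\theta))$, so the CRLB coincides with the variance $\theta(1-\theta)/n$ computed earlier. The following short Monte Carlo sketch (not part of the original notes; the parameter values are arbitrary) checks this numerically.

```python
# Monte Carlo check (illustrative sketch, not from the notes): the relative-
# frequency estimator z = (x1+...+xn)/n for a biased coin is unbiased and its
# variance matches the CRLB theta*(1-theta)/n.
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 0.3, 50, 200_000      # arbitrary example values

tosses = rng.random((trials, n)) < theta # Bernoulli(theta) samples
z = tosses.mean(axis=1)                  # estimator, one value per trial

print("mean of estimator    :", z.mean(), " (theta =", theta, ")")
print("empirical variance   :", z.var())
print("CRLB theta(1-theta)/n:", theta * (1 - theta) / n)
```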
Example: Consider $n$ observations of a constant $A$ in Gaussian noise, i.e.,
$$x_i = A + n_i, \qquad 1 \le i \le n$$
The PDF of the observation vector $\mathbf{x} = (x_1,\cdots,x_n)$ is given by
$$f(\mathbf{x},A) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-A)^2\Big)$$
We calculate
$$\frac{d\ln f(\mathbf{x},A)}{dA} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - A) = \frac{n}{\sigma^2}\Big(\frac{1}{n}\sum_{i=1}^n x_i - A\Big)$$
from which we can identify the optimal unbiased estimator as
$$\hat A = \frac{1}{n}\sum_{i=1}^n x_i$$
Furthermore,
$$\frac{d^2\ln f(\mathbf{x},A)}{dA^2} = -\frac{n}{\sigma^2} = -I(A)$$
and the variance of the estimator is given by
$$\mathrm{var}(\hat A) = \frac{1}{I(A)} = \frac{\sigma^2}{n}$$

Signals in Gaussian Noise
As an example, consider the application of a radar range estimator: we transmit $s(t)$ and receive $r(t) = s(t-\tau) + n_w(t)$, and wish to estimate the delay $\tau$. [Figure: the received signal is passed through a receive filter $H(f)$ whose output is $x(t)$.]

We first treat the more general problem of a parameterized signal in Gaussian noise:
$$x_i = s[i;\theta] + n_i, \qquad 1\le i\le n$$
We calculate
$$\frac{d\ln f(\mathbf{x},\theta)}{d\theta} = \frac{1}{\sigma^2}\sum_{i=1}^n \big(x_i - s[i;\theta]\big)\frac{d s[i;\theta]}{d\theta}$$
and the second derivative
$$\frac{d^2\ln f(\mathbf{x},\theta)}{d\theta^2} = -\frac{1}{\sigma^2}\sum_{i=1}^n\Bigg(\Big(\frac{ds[i;\theta]}{d\theta}\Big)^2 - \big(x_i - s[i;\theta]\big)\frac{d^2 s[i;\theta]}{d\theta^2}\Bigg)$$
Since $E\big(x_i - s[i;\theta]\big) = 0$, the Fisher information is
$$I(\theta) = -E\Big(\frac{d^2\ln f(\mathbf{x},\theta)}{d\theta^2}\Big) = \frac{1}{\sigma^2}\sum_{i=1}^n\Big(\frac{ds[i;\theta]}{d\theta}\Big)^2$$
and the CRLB is given by
$$\mathrm{var}(\hat\theta) \ge \frac{\sigma^2}{\sum_{i=1}^n\big(\frac{ds[i;\theta]}{d\theta}\big)^2}$$

Application to the Radar Problem
In the radar application the receive filter is a brickwall lowpass filter with cutoff frequency $W$. The output of this filter is sampled at the Nyquist rate, and we have a discrete-time system
$$x(t) \;\rightarrow\; x\big(i/(2W)\big) = x_i = s\big(i/(2W) - \tau\big) + n_i$$
where $n_i$ is sampled Gaussian noise with variance $\sigma^2 = N_0 W$. By the CRLB, and approximating the sum over the samples (taken at rate $2W$) by an integral, the variance of any estimator is bounded by
$$\mathrm{var}(\hat\tau) \ge \frac{N_0 W}{\sum_{i=1}^n\Big(\frac{ds(i/(2W)-\tau)}{d\tau}\Big)^2} \approx \frac{N_0 W}{2W\int_{-\infty}^{\infty}\Big(\frac{ds(t-\tau)}{d\tau}\Big)^2 dt} = \frac{N_0/2}{\int_{-\infty}^{\infty}(2\pi f)^2 |S(f)|^2\,df} = \frac{N_0/2}{E_s F^2}$$
where
$$F^2 = \frac{\int_{-\infty}^{\infty}(2\pi f)^2 |S(f)|^2\,df}{\int_{-\infty}^{\infty}|S(f)|^2\,df}$$
is the mean-square bandwidth of the signal and $E_s = \int_{-\infty}^{\infty}|S(f)|^2\,df$ is the signal energy.
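As a numerical illustration (not part of the original notes), the sketch below evaluates the mean-square bandwidth $F^2$ and the resulting delay CRLB for a hypothetical Gaussian pulse; the pulse width and noise level are arbitrary assumptions.

```python
# Illustrative sketch (assumed Gaussian pulse and noise level, not from the
# notes): compute the mean-square bandwidth F^2 and the delay-estimation CRLB
#   var(tau_hat) >= (N0/2) / (Es * F^2)
import numpy as np

T = 1e-6                      # assumed pulse width parameter (seconds)
N0 = 1e-9                     # assumed noise spectral density level

# sample the pulse s(t) = exp(-t^2 / (2 T^2)) finely over a long window
dt = T / 100
t = np.arange(-10 * T, 10 * T, dt)
s = np.exp(-t**2 / (2 * T**2))

# spectrum via FFT (the dt factor approximates the continuous-time transform)
S = np.fft.fft(s) * dt
f = np.fft.fftfreq(len(s), d=dt)
df = f[1] - f[0]

Es = np.sum(np.abs(S)**2) * df                              # signal energy
F2 = np.sum((2 * np.pi * f)**2 * np.abs(S)**2) * df / Es    # mean-sq. bandwidth

crlb = (N0 / 2) / (Es * F2)
print("mean-square bandwidth F^2:", F2)
print("CRLB on var(tau_hat)     :", crlb)
```

Note how the bound shrinks as the pulse bandwidth (and hence $F^2$) grows, which is why wideband probing signals are preferred for ranging.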
Bayesian Estimation
[Figure: in the classical setting an unknown constant $\theta$ generates the data $\mathbf{x}$, from which $\hat\theta = g(\mathbf{x})$ is formed; in the Bayesian setting $\theta$ is itself drawn from a PDF.]

While in classical estimation nothing is known about the parameter $\theta$ other than its range, in the Bayesian approach $\theta$ is considered to be a random variable whose PDF is known or can be estimated. We hence work with the joint and conditional PDFs
$$f(\mathbf{x},\theta) = f(\theta|\mathbf{x})\, f(\mathbf{x})$$

Minimum Mean Square Error (MMSE) Estimator
What was impossible in classical estimation theory is now doable:
$$E\big((g(\mathbf{x})-\theta)^2\big) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}(g(\mathbf{x})-\theta)^2 f(\mathbf{x})\, f(\theta|\mathbf{x})\,d\theta\,d\mathbf{x} = \int_{-\infty}^{\infty} f(\mathbf{x})\int_{-\infty}^{\infty}(g(\mathbf{x})-\theta)^2 f(\theta|\mathbf{x})\,d\theta\,d\mathbf{x}$$
Note that we can minimize the inner integral with respect to $g(\cdot)$ for each $\mathbf{x}$, and thus minimize the entire expression:
$$\frac{d}{dg}\int_{-\infty}^{\infty}(g(\mathbf{x})-\theta)^2 f(\theta|\mathbf{x})\,d\theta = 2\int_{-\infty}^{\infty}(g(\mathbf{x})-\theta)\, f(\theta|\mathbf{x})\,d\theta = 0$$
This leads to the solution
$$g(\mathbf{x}) = \int_{-\infty}^{\infty}\theta\, f(\theta|\mathbf{x})\,d\theta$$
The MMSE estimator is the conditional expectation given $\mathbf{x}$:
$$g_{\mathrm{MMSE}}(\mathbf{x}) = E(\theta\,|\,\mathbf{X}=\mathbf{x})$$

Example: Consider again $n$ observations of an amplitude $A$ in Gaussian noise, i.e.,
$$x_i = A + n_i, \qquad 1\le i\le n$$
The PDF of the observation vector $\mathbf{x} = (x_1,\cdots,x_n)$ is now interpreted as a conditional probability density
$$f_{\mathbf{X}|A}(\mathbf{x}|a) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-a)^2\Big)$$
The critical assumption is now the PDF of $A$. We assume
$$f_A(a) = \frac{1}{(2\pi\sigma_A^2)^{1/2}}\exp\Big(-\frac{1}{2\sigma_A^2}(a-\mu_A)^2\Big)$$
We now need to calculate $\hat A = E(A|\mathbf{x})$. We use the following result: if $\mathbf{x}$ and $\mathbf{y}$ are jointly Gaussian distributed, then
$$E(\mathbf{Y}|\mathbf{x}) = E(\mathbf{Y}) + C_{yx}C_{xx}^{-1}\big(\mathbf{x} - E(\mathbf{X})\big), \qquad C_{y|x} = C_{yy} - C_{yx}C_{xx}^{-1}C_{xy}$$
where
$$C_{xx} = E(\mathbf{X}\mathbf{X}^T) - \mu_x\mu_x^T,\quad C_{yy} = E(\mathbf{Y}\mathbf{Y}^T) - \mu_y\mu_y^T,\quad C_{yx} = E(\mathbf{Y}\mathbf{X}^T) - \mu_y\mu_x^T,\quad C_{xy} = C_{yx}^T$$
Applying these formulas to our case with $\mathbf{y} = A$ we obtain
$$C_{xx} = \sigma_A^2\mathbf{1}\mathbf{1}^T + \sigma^2 I,\qquad C_{yy} = \sigma_A^2,\qquad C_{yx} = \sigma_A^2\mathbf{1}^T$$
Our estimator is now given by
$$\hat A = E(A|\mathbf{x}) = \mu_A + \sigma_A^2\,\mathbf{1}^T\big(\sigma^2 I + \sigma_A^2\mathbf{1}\mathbf{1}^T\big)^{-1}(\mathbf{x}-\mu_A\mathbf{1})$$
Using Woodbury's identity,
$$\big(I + c\,\mathbf{1}\mathbf{1}^T\big)^{-1} = I - \frac{c}{1+nc}\,\mathbf{1}\mathbf{1}^T,$$
allows us to rewrite the conditional expectation as
$$\hat A = E(A|\mathbf{x}) = \mu_A + \frac{\sigma_A^2}{\sigma^2}\,\mathbf{1}^T\Big(I - \frac{\sigma_A^2/\sigma^2}{1+n\sigma_A^2/\sigma^2}\,\mathbf{1}\mathbf{1}^T\Big)(\mathbf{x}-\mu_A\mathbf{1})$$
and after some manipulations we obtain
$$\hat A = \underbrace{\alpha\,\bar x}_{\text{data part}} + \underbrace{(1-\alpha)\,\mu_A}_{\text{prior knowledge}},\qquad \alpha = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2/n},\qquad \bar x = \frac{1}{n}\sum_{i=1}^n x_i$$

Mean Square Estimator Error
Let us calculate the MMSE of our Bayesian estimator:
$$E\big((\hat A - A)^2\big) = E\Bigg(\Big(\frac{\alpha}{n}\sum_{i=1}^n (A+n_i) + (1-\alpha)\mu_A - A\Big)^2\Bigg) = (1-\alpha)^2(A-\mu_A)^2 + \alpha^2\frac{\sigma^2}{n}$$
and the Bayesian estimator can give a substantially lower error than the unbiased classical estimator, whose MSE equals $\sigma^2/n$.

Linear Estimators
A linear estimator is one of the form
$$\hat\theta = \sum_{i=1}^n a_i x_i + b$$
where the coefficients $a_i$ and $b$ are constants. To find the optimal value for $b$ we minimize the MSE, i.e.,
$$\frac{\partial}{\partial b}E\Bigg(\Big(\theta - \sum_{j=1}^n a_j x_j - b\Big)^2\Bigg) = -2E\Big(\theta - \sum_{j=1}^n a_j x_j - b\Big) = 0 \;\Longrightarrow\; b = E(\theta) - \sum_{j=1}^n a_j\mu_j,\qquad \mu_j = E(x_j)$$
This is the analogue of the unbiasedness condition in the classical case. Furthermore,
$$\frac{\partial}{\partial a_i}E\Bigg(\Big(\theta - \sum_{j=1}^n a_j x_j - b\Big)^2\Bigg) = -2E\Bigg(\underbrace{\Big(\theta - \sum_{j=1}^n a_j x_j - b\Big)}_{\text{error}}\underbrace{x_i}_{\text{data}}\Bigg) = 0$$
This leads to the famous orthogonality principle: the data is orthogonal to the estimation error. Hence
$$E\Big(x_i\Big(\theta - \sum_{j=1}^n a_j x_j - b\Big)\Big) = 0 \quad\Longrightarrow\quad E(x_i\theta) = \sum_{j=1}^n a_j\big(E(x_i x_j) - \mu_i\mu_j\big) + \mu_i E(\theta)$$
Letting $E(x_i\theta) = R_{\theta i}$, $E(x_i x_j) = R_{ij}$ and $E(\theta) = \mu_\theta$, we obtain
$$R_{\theta i} = \sum_{j=1}^n a_j\big(R_{ij} - \mu_i\mu_j\big) + \mu_i\mu_\theta,\qquad \forall i$$
or, in matrix notation,
$$\big(R - \mu_x\mu_x^T\big)\mathbf{a} = \mathbf{r}_\theta - \mu_\theta\mu_x$$
where $R = \{R_{ij}\} = \{E(x_i x_j)\}$ is the $n\times n$ correlation matrix of the data, $\mu_x\mu_x^T = \{E(x_i)E(x_j)\}$ is the $n\times n$ matrix of products of the means of the data, $\mathbf{a} = [a_1,\cdots,a_n]^T$ is the $n\times 1$ vector of linear estimation coefficients, and $\mathbf{r}_\theta = [R_{\theta 1},\cdots,R_{\theta n}]^T$. The optimal linear solution is given by the filter coefficients
$$\mathbf{a} = \big(R - \mu_x\mu_x^T\big)^{-1}\big(\mathbf{r}_\theta - \mu_\theta\mu_x\big)$$

Linear Prediction
Consider a stationary discrete random process $\mathbf{x} = (x_1, x_2, x_3,\cdots)$. We want to predict the value of $x_{n+j}$ given the values up to $x_n$, and let us assume that $E(x_i) = 0$ for simplicity. We will use a linear estimator, i.e., we let $\theta = x_{n+j}$ and calculate
$$\hat x_{n+j} = \sum_{i=0}^{N-1} a_i x_{n-i}$$
The parameter $N$ is the predictor order, i.e., the number of previous samples which are used to predict the new sample value. The solution is given by
$$\mathbf{a} = R^{-1}\mathbf{r}$$
where, since the process is stationary and $E(x_{n-i}x_{n-k}) = R_{|i-k|}$,
$$\mathbf{r} = [R_j, R_{j+1}, \cdots, R_{j+N-1}]^T$$
and
$$R = \begin{bmatrix} R_0 & R_1 & \cdots & R_{N-1}\\ R_1 & R_0 & \cdots & R_{N-2}\\ \vdots & & \ddots & \vdots\\ R_{N-1} & R_{N-2} & \cdots & R_0\end{bmatrix}$$
These equations are known as the Yule-Walker prediction equations, or also as the Wiener-Hopf equations.
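A minimal numerical sketch (not from the notes; the correlation model is an assumption) of solving the Yule-Walker equations: for a zero-mean process with exponentially decaying correlation $R_k = \rho^{|k|}$, the one-step ($j=1$) predictor of order $N$ is obtained by solving $R\mathbf{a} = \mathbf{r}$.

```python
# Sketch: one-step linear prediction via the Yule-Walker equations R a = r.
# Assumed correlation model R_k = rho^|k| (AR(1)-like); all values arbitrary.
import numpy as np

rho, N, j = 0.8, 4, 1                      # correlation decay, order, step

lags = np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
R = rho ** lags                            # Toeplitz correlation matrix {R_|i-k|}
r = rho ** (j + np.arange(N))              # [R_j, ..., R_{j+N-1}]

a = np.linalg.solve(R, r)
print("predictor coefficients a:", a)
# For this correlation model the solution is a = [rho, 0, 0, 0] (up to numerical
# precision): only the most recent sample is needed for one-step prediction.
```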
Linear Prediction: Examples
The following two examples illustrate a linear predictor for a random process with sinusoidal and with exponentially decaying data correlation. The examples are for a 3-step and a 1-step prediction.
[Figure: sample path and linear prediction of a random process with power 1.2805.]
[Figure: sample path and linear prediction of a random process with power 0.99726.]

Mean Square Error
We now compute $E\big((x_{n+1}-\hat x_{n+1})^2\big)$. By orthogonality, $E\big((x_{n+1}-\hat x_{n+1})\hat x_{n+1}\big) = 0$, i.e., $E(x_{n+1}\hat x_{n+1}) = E(\hat x_{n+1}^2)$. Therefore
$$E\big((x_{n+1}-\hat x_{n+1})^2\big) = E(x_{n+1}^2) - 2E(x_{n+1}\hat x_{n+1}) + E(\hat x_{n+1}^2) = \sigma_x^2 - E(x_{n+1}\hat x_{n+1})$$
$$= \sigma_x^2 - \sum_{i=0}^{N-1} a_i E(x_{n+1}x_{n-i}) = \sigma_x^2 - \sum_{i=0}^{N-1} a_i R_{i+1} = \sigma_x^2 - \mathbf{a}^T\mathbf{r} = \sigma_x^2 - \mathbf{r}^T R^{-1}\mathbf{r}$$
These principles can be used for filtering, prediction, or smoothing.
[Figure: position of the estimated sample relative to the available data window for filtering, prediction, and smoothing.]

Linear Estimation of Many Parameters
Let us assume that we want to estimate $L$ parameters $\theta_l$, where the $l$-th parameter is estimated via the linear equation
$$\hat\theta_l = \sum_{i=1}^n a_{li}x_i + b_l = \mathbf{a}_l^T\mathbf{x} + b_l$$
• First we find the constant term $b_l$ via the MMSE principle:
$$\frac{\partial}{\partial b_l}E\Big(\big(\Theta_l - \mathbf{a}_l^T\mathbf{X} - b_l\big)^2\Big) = -2E\big(\Theta_l - \mathbf{a}_l^T\mathbf{X} - b_l\big) = 0$$
and hence $b_l = E(\Theta_l) - \mathbf{a}_l^T E(\mathbf{X})$.
• Second, we proceed to evaluate the coefficients $\mathbf{a}_l$:
$$\frac{\partial}{\partial a_{li}}E\Big(\big(\Theta_l - \mathbf{a}_l^T(\mathbf{X}-E(\mathbf{X})) - E(\Theta_l)\big)^2\Big) = -2E\Big(\big(\Theta_l - \mathbf{a}_l^T(\mathbf{X}-E(\mathbf{X})) - E(\Theta_l)\big)\big(X_i - E(X_i)\big)\Big)$$
$$= -2E\big((\Theta_l - E(\Theta_l))(X_i - E(X_i))\big) + 2E\big(\mathbf{a}_l^T(\mathbf{X}-E(\mathbf{X}))(X_i - E(X_i))\big) = -2C_{\theta_l x_i} + 2\,\mathbf{a}_l^T C_{x x_i} = 0,\qquad \forall l,i$$
where
$$C_{x x_i} = \big[E\big((X_1 - E(X_1))(X_i - E(X_i))\big), \cdots, E\big((X_n - E(X_n))(X_i - E(X_i))\big)\big]^T$$
Collecting all $i$ terms into a vector equation we obtain
$$\big[C_{\theta_l x_1},\cdots,C_{\theta_l x_n}\big] = \mathbf{a}_l^T\big[C_{x x_1},\cdots,C_{x x_n}\big] \quad\Longleftrightarrow\quad C_{\theta_l x}^T = \mathbf{a}_l^T C_{xx}$$
From this we obtain the optimal estimator coefficients
$$\mathbf{a}_l^T = C_{\theta_l x}^T C_{xx}^{-1}$$
and we derive the estimator for $\theta_l$ as
$$\hat\theta_l = \mathbf{a}_l^T\big(\mathbf{x}-E(\mathbf{X})\big) + E(\Theta_l) = C_{\theta_l x}^T C_{xx}^{-1}\big(\mathbf{x}-E(\mathbf{X})\big) + E(\Theta_l),\qquad\forall l$$
Summarizing the $L$ equations above into a single matrix equation,
$$\hat{\boldsymbol\theta} = A\big(\mathbf{x}-E(\mathbf{X})\big) + \mathbf{b} = C_{\theta x}C_{xx}^{-1}\big(\mathbf{x}-E(\mathbf{X})\big) + E(\boldsymbol\Theta)$$
Note: This has exactly the same form as the conditional expectation for jointly Gaussian random vectors (see the Bayesian estimation example above), and hence $\hat{\boldsymbol\theta}$ is the optimal estimator if $\boldsymbol\theta$ and $\mathbf{x}$ are jointly Gaussian distributed.
Note: This linear estimator uses only second-order statistics, i.e., $C_{\theta x}$, $C_{xx}$, $E(\mathbf{X})$, $E(\boldsymbol\Theta)$.

Mean Square Error (MSE)
The mean square error of the linear estimator can be calculated as
$$\mathrm{MSE}_{\hat\theta} = E\Big(\big(\boldsymbol\Theta-\hat{\boldsymbol\Theta}\big)\big(\boldsymbol\Theta-\hat{\boldsymbol\Theta}\big)^T\Big) = E\Big(\big(\boldsymbol\Theta - E(\boldsymbol\Theta) - A(\mathbf{X}-E(\mathbf{X}))\big)\big(\boldsymbol\Theta - E(\boldsymbol\Theta) - A(\mathbf{X}-E(\mathbf{X}))\big)^T\Big)$$
$$= C_{\theta\theta} + A C_{xx}A^T - A C_{x\theta} - C_{\theta x}A^T = C_{\theta\theta} - C_{\theta x}C_{xx}^{-1}C_{x\theta}$$

Channel Estimation
Consider an unknown discrete channel. [Figure: tapped delay line with input $u_n$, taps $h_0, h_1, \ldots, h_{p-1}$, additive noise $w_n$, and output $x_n$.] The output data is given by
$$\mathbf{x} = \begin{bmatrix} u_0 & 0 & \cdots & 0\\ u_1 & u_0 & \cdots & 0\\ \vdots & & \ddots & \vdots\\ u_{N-1} & u_{N-2} & \cdots & u_{N-p}\end{bmatrix}\begin{bmatrix} h_0\\ h_1\\ \vdots\\ h_{p-1}\end{bmatrix} + \mathbf{w} = U\mathbf{h} + \mathbf{w}$$
The data covariance matrix is given by
$$C_{xx} = E\big((\mathbf{X}-E(\mathbf{X}))(\mathbf{X}-E(\mathbf{X}))^T\big) = E\big((U(\mathbf{H}-\mu_h))(U(\mathbf{H}-\mu_h))^T\big) + E(\mathbf{w}\mathbf{w}^T) = U C_{hh}U^T + C_{ww}$$
The optimal Bayesian estimator for this problem is given by
$$\hat{\mathbf{h}} = C_{hh}U^T\big(U C_{hh}U^T + C_{ww}\big)^{-1}\big(\mathbf{x} - U\mu_h\big) + \mu_h$$
If the taps are uncorrelated with zero mean, $C_{hh} = I$ and $\mu_h = 0$, then
$$\hat{\mathbf{h}} = U^T\big(U U^T + C_{ww}\big)^{-1}\mathbf{x}$$
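The following sketch (illustrative only; the probing sequence, channel length, and noise level are assumptions) implements this linear MMSE channel estimator for uncorrelated, zero-mean taps.

```python
# Sketch of the LMMSE channel estimator h_hat = U^T (U U^T + Cww)^(-1) x
# for uncorrelated, zero-mean taps (Chh = I). Parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma2 = 200, 4, 0.1            # observations, channel length, noise var

u = rng.choice([-1.0, 1.0], size=N)   # assumed +/-1 probing sequence
h = rng.normal(size=p)                # true channel taps (Chh = I, zero mean)

# Build the N x p convolution matrix U with rows [u_i, u_{i-1}, ..., u_{i-p+1}]
U = np.zeros((N, p))
for k in range(p):
    U[k:, k] = u[:N - k]

x = U @ h + np.sqrt(sigma2) * rng.normal(size=N)    # received data

Cww = sigma2 * np.eye(N)
h_hat = U.T @ np.linalg.solve(U @ U.T + Cww, x)     # LMMSE estimate

print("true h     :", np.round(h, 3))
print("estimated h:", np.round(h_hat, 3))
```

Note that this form requires an $N\times N$ solve; the matrix inversion lemma in the next section reduces it to a $p\times p$ one.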
MMSE Channel Estimator
The matrix $UU^T$ is $N\times N$ but has rank at most $p$, and hence $\big(UU^T + C_{ww}\big)$ is only invertible if $C_{ww}$ is not (close to) zero. That is, the noise drives the estimator. In this case we apply the matrix inversion lemma
$$\big(A + BCD\big)^{-1} = A^{-1} - A^{-1}B\big(DA^{-1}B + C^{-1}\big)^{-1}DA^{-1}$$
to the required inverse with the identifications
$$A = C_{ww},\qquad B = U,\qquad C = I,\qquad D = U^T$$
This gives us, after some manipulations, and with $C_{ww} = \sigma^2 I$,
$$\big(UU^T + C_{ww}\big)^{-1} = \frac{1}{\sigma^2}\Big(I - U\big(U^TU + \sigma^2 I\big)^{-1}U^T\Big)$$
and, finally, a simpler formula for the optimal channel estimates:
$$\hat{\mathbf{h}} = \big(U^TU + \sigma^2 I\big)^{-1}U^T\mathbf{x}$$
Remarks:
• The size of the inverse is now only $p\times p$, a significant gain in simplicity.
• Signal Design: If the probing signals are designed such that $U^TU \approx N I$, the inverse becomes trivial, and
$$\hat{\mathbf{h}} = \frac{1}{N+\sigma^2}\,U^T\mathbf{x}$$
In this case the error is also best, given by
$$\mathrm{MSE}_{\hat h} = C_{hh} - C_{hx}C_{xx}^{-1}C_{xh} = \frac{\sigma^2/N}{1+\sigma^2/N}\,I$$

Wiener Filtering
Consider a system where a signal is observed in noise, i.e.,
$$\mathbf{x} = \boldsymbol\theta + \mathbf{w} = \mathbf{s} + \mathbf{w}$$
• the parameters evolve continuously,
• there are as many parameters as measurements.
Using an optimal linear estimator we obtain the Wiener filter for this process:
$$\hat{\mathbf{s}} = C_{ss}\big(C_{ss}+C_{ww}\big)^{-1}\mathbf{x}$$
Note that the filter makes use of prior information in the form of the signal correlation $C_{ss}$.

Filter Interpretation
$$\hat{\mathbf{s}} = C_{ss}\big(C_{ss}+C_{ww}\big)^{-1}\mathbf{x} = F\mathbf{x}$$
This implies the time-varying filter
$$\hat s_n = \sum_{k=0}^{n} f^{(n)}_{n-k}\,x_k = \mathbf{f}^{(n)T}\mathbf{x} = \mathbf{r}_{ss}^T\big(C_{ss}+C_{ww}\big)^{-1}\mathbf{x}$$
where
• $\mathbf{r}_{ss}^T$ is the last row of $C_{ss}$,
• $\mathbf{f}^{(n)} = [f^{(n)}_n,\cdots,f^{(n)}_0]$ is the last row of $F$ (note the inverse ordering of the subscripts).
The equations
$$\big(C_{ss}+C_{ww}\big)\,\mathbf{f}^{(n)} = \mathbf{r}_{ss}$$
are the well-known Wiener-Hopf equations (compare the Yule-Walker equations in the Linear Prediction section):
$$\begin{bmatrix} r_0+\sigma^2 & r_1 & \cdots & r_n\\ r_1 & r_0+\sigma^2 & \cdots & r_{n-1}\\ \vdots & & \ddots & \vdots\\ r_n & r_{n-1} & \cdots & r_0+\sigma^2\end{bmatrix}\begin{bmatrix} f^{(n)}_n\\ f^{(n)}_{n-1}\\ \vdots\\ f^{(n)}_0\end{bmatrix} = \begin{bmatrix} r_n\\ r_{n-1}\\ \vdots\\ r_0\end{bmatrix}$$
where $r_{|i-j|}$ is the $(i,j)$-th entry of $C_{ss}$, and $C_{ww} = \sigma^2 I$ is diagonal.

Note: For large values of $n$ the filter taps become constant, since $r_n \to 0$ as $n\to\infty$. If $r_k \approx 0$ for $k \ge p$, the right-hand side has only $p$ significant entries, the taps multiplying old samples are approximately zero, and the $p$ taps applied to the most recent samples approximately satisfy the fixed $p\times p$ system
$$\begin{bmatrix} r_0+\sigma^2 & \cdots & r_{p-1}\\ \vdots & \ddots & \vdots\\ r_{p-1} & \cdots & r_0+\sigma^2\end{bmatrix}\begin{bmatrix} f_{p-1}\\ \vdots\\ f_0\end{bmatrix} = \begin{bmatrix} r_{p-1}\\ \vdots\\ r_0\end{bmatrix}$$
This leads to stationary filter taps, and we obtain the time-invariant discrete Wiener filter.
[Figure: time-invariant transversal (tapped delay line) Wiener filter with input $x_n$, taps $f_0,\ldots,f_{p-1}$, and output $\hat s_n$.]
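A short numerical sketch (not from the notes; the signal correlation model and noise level are assumptions) of the block Wiener filter $\hat{\mathbf s} = C_{ss}(C_{ss}+C_{ww})^{-1}\mathbf x$ applied to a signal with exponentially decaying correlation observed in white noise.

```python
# Sketch of the (block) Wiener filter s_hat = Css (Css + Cww)^(-1) x for a
# zero-mean Gaussian signal with assumed correlation r_k = rho^|k| observed in
# white noise of variance sigma2. All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, rho, sigma2 = 400, 0.95, 0.5

lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Css = rho ** lags                      # signal covariance (Toeplitz)
Cww = sigma2 * np.eye(n)               # white observation noise

# draw one realization of the signal and its noisy observation
s = rng.multivariate_normal(np.zeros(n), Css)
x = s + np.sqrt(sigma2) * rng.normal(size=n)

s_hat = Css @ np.linalg.solve(Css + Cww, x)   # Wiener estimate

print("MSE of raw observation:", np.mean((x - s) ** 2))
print("MSE of Wiener estimate:", np.mean((s_hat - s) ** 2))
```

The last row of $C_{ss}(C_{ss}+C_{ww})^{-1}$ in this computation is exactly the tap vector $\mathbf f^{(n)}$ discussed above.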