Estimation Theory

\(\def\RR{\mathbb{R}}\) \(\def\CC{\mathbb{C}}\) \(\def\TR{\text{T}}\) \(\def\CO{\text{conv}}\) \(\def\bmat{\left[\begin{matrix}}\) \(\def\emat{\end{matrix}\right]}\) \(\def\bsmat{\left[\begin{smallmatrix}}\) \(\def\esmat{\end{smallmatrix}\right]}\)

  • An observation is defined as: \[y=h(x)+w\] where \(x\in\RR^n\) and \(y\in\RR^m\) denote the unknown vector and the measurement vector. \(h:\RR^n\rightarrow\RR^m\) is a known function of \(x\) and \(w\) is the observation noise with the probability density \(f\_w(w)\).
  • It is assumed that \(x\) is a random variable with an a priori probability density \(f\_x(x)\) before the observation.
  • The goal is to compute the “best” estimation of \(x\) using the observation.

Optimal Estimation

  • The optimal estimation \(\hat x\) is defined based on a cost function \(J\): \[\hat x\_{opt}=\underset{\hat x}{\arg\min} E[J(x-\hat x)]\]
  • Some typical cost functions:
    • Minimum Mean Square Error (\(\hat x\_{MMSE}\)): \[J(x-\hat x)=(x-\hat x)^\TR W(x-\hat x), W>0\]
    • Absolute Value (\(\hat x\_{ABS}\)): \[J(x-\hat x)=|x-\hat x|\]
    • Maximum a Posteriori (\(\hat x\_{MAP}\)): \[J(x-\hat x)=\begin{cases} 0 & \text {if }|x-\hat x|\leq \epsilon \\\\\\ 1 & \text {if }|x-\hat x| \gt \epsilon\end{cases}, \epsilon\rightarrow 0\]
  • It can be shown that: \[\hat x\_{MMSE}(y)=E[x|y]\] \[\int\_{-\infty}^{\hat x\_{ABS}(y)}f\_{x|y}(x|y)dx=\int\_{\hat x\_{ABS}(y)}^{+\infty}f\_{x|y}(x|y)dx\] \[\hat x\_{MAP}=\arg\max\_{x} f\_{x|y}(x|y)\]
  • If the a posteriori density function \(f_{x|y}(x|y)\) has only one maximum and it is symmetric with respect to \(E[x|y]\) then all the above estimates are equal to \(E[x|y]\).
  • In fact, assuming these conditions for \(f_{x|y}(x|y)\), \(E[x|y]\) is the optimal estimation for any cost function \(J\) such that \(J(0)=0\) and \(J(x-\hat x)\) is nondecreasing with distance (Sherman’s Theorem).
  • Maximum Likelihood Estimation: \(\hat x\_{ML}\) is the value of \(x\) that maximizes the probability of observing \(y\): \[\hat x_{ML}(y)=\underset{x}{\arg\max} f\_{y|x}(y|x)\]
  • It can be shown that \(\hat x\_{ML}=\hat x\_{MAP}\) if there is no a priori information about \(x\).
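The estimators above can be compared numerically. The following is a minimal sketch, not from the notes, assuming a scalar observation \(y=x+w\) with Gaussian prior \(x\sim\mathcal{N}(m\_x,p\_x)\) and noise \(w\sim\mathcal{N}(0,r)\); the numbers and the grid discretization of the posterior are arbitrary choices for illustration.

```python
import numpy as np

# Sketch (not from the notes): scalar observation y = x + w with
# x ~ N(m_x, p_x) and w ~ N(0, r); all numbers are arbitrary examples.
m_x, p_x, r = 1.0, 4.0, 1.0
y = 2.5

xs = np.linspace(-10.0, 10.0, 200001)
dx = xs[1] - xs[0]
prior = np.exp(-(xs - m_x) ** 2 / (2 * p_x))
likelihood = np.exp(-(y - xs) ** 2 / (2 * r))
post = prior * likelihood
post /= post.sum() * dx                   # normalized posterior f_{x|y}(x|y)

x_mmse = (xs * post).sum() * dx           # posterior mean   (MMSE)
x_map = xs[np.argmax(post)]               # posterior mode   (MAP)
x_abs = xs[np.searchsorted(np.cumsum(post) * dx, 0.5)]  # median (ABS)

# The posterior here is Gaussian (unimodal and symmetric), so all three
# estimates coincide, in line with Sherman's theorem.
print(x_mmse, x_map, x_abs)
```

For these values the common answer is \((p\_xy+rm\_x)/(p\_x+r)=2.2\), the posterior mean of a Gaussian prior updated by a Gaussian likelihood.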

Linear Gaussian Observation

  • Consider the following observation: \(y=Ax+Bw\) where \(w\sim\mathcal{N}(0,I\_r)\) is a Gaussian random vector and matrices \(A\_{m\times n}\) and \(B\_{m\times r}\) are known.
  • In this observation, \(x\) is estimable if \(A\) has full column rank; otherwise there will be infinitely many solutions for the problem.
  • If \(BB^\TR\) is invertible, then: \[f\_{y|x}(y|x)=\frac{1}{(2\pi)^{m / 2}|BB^\TR|^{1 / 2}}\exp\left(-\frac{1}{2}\left[(y-Ax)^\TR (BB^\TR)^{-1}(y-Ax)\right]\right)\]
  • The maximum likelihood estimation can be computed as: \[\begin{align}\hat x\_{ML}=&\arg\min_x (y-Ax)^\TR(BB^\TR)^{-1}(y-Ax)\\\\=&(A^\TR(BB^\TR)^{-1}A)^{-1}A^\TR(BB^\TR)^{-1}y\end{align}\]
  • It is very interesting that \(\hat x\_{ML}\) is the Weighted Least Square (WLS) solution to the equation \(y=Ax\) with the weight matrix \(W=(BB^\TR)^{-1}\), i.e. \[\hat x\_{WLS}=\arg\min\_x (y-Ax)^\TR W(y-Ax)\]
  • \(\hat x\_{ML}\) is an unbiased estimation: \[b(x)=E[x-\hat x\_{ML}|x]=E[(A^\TR(BB^\TR)^{-1}A)^{-1}A^\TR(BB^\TR)^{-1}y-x|x]=0\]
  • The covariance of the estimation error is: \[P\_e(x)=E[(x-\hat x\_{ML})(x-\hat x\_{ML})^\TR|x]=(A^\TR(BB^\TR)^{-1}A)^{-1}\]
  • \(\hat x\_{ML}\) is efficient in the sense of the Cramér-Rao bound.
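The equivalence between \(\hat x\_{ML}\) and the WLS solution can be checked numerically. A minimal sketch, assuming example matrices \(A\) and \(B\) of our own choosing, with \(A\) full column rank and \(BB^\TR\) invertible; whitening the observation by \(B^{-1}\) reduces WLS to ordinary least squares.

```python
import numpy as np

# Example matrices (assumptions for illustration, not from the notes).
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])  # full column rank
B = np.diag([0.5, 0.5, 0.2])                        # BB^T invertible
x_true = np.array([1.0, -2.0])
y = A @ x_true + B @ rng.standard_normal(3)

S = B @ B.T                        # noise covariance BB^T
Si = np.linalg.inv(S)
P_e = np.linalg.inv(A.T @ Si @ A)  # estimation error covariance
x_ml = P_e @ A.T @ Si @ y          # closed-form ML / WLS estimate

# Equivalent route: whiten by B^{-1}, then ordinary least squares,
# since ||B^{-1}(y - Ax)||^2 = (y - Ax)^T (BB^T)^{-1} (y - Ax).
Bi = np.linalg.inv(B)
x_ols, *_ = np.linalg.lstsq(Bi @ A, Bi @ y, rcond=None)
print(np.allclose(x_ml, x_ols))    # the two solutions coincide
```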
  • Example: Consider the following linear Gaussian observation: \(y=ax+w\) where \(a\) is a nonzero real number and \(w\sim\mathcal{N}(0,r)\) is the observation noise.
  • Maximum a Posteriori Estimation: To compute \(\hat x\_{MAP}\), it is assumed that the a priori density of \(x\) is Gaussian with mean \(m\_x\) and variance \(p\_x\): \[x\sim\mathcal{N}(m\_x,p\_x)\]
  • The conditions of Sherman’s Theorem are satisfied and therefore: \[\begin{align} \hat x\_{MAP}=&E[x|y]\\\\\\ =&m\_x+\frac{p\_{xy}}{p\_y}(y-m\_y)\\\\\\ =&m\_x+\frac{ap\_{x}}{a^2p\_x+r}(y-am\_x)\\\\\\ =&\frac{ap\_xy+m\_xr}{a^2p\_x+r} \end{align}\]
  • Estimation bias: \[b\_{MAP}=E[x-\hat x\_{MAP}]=m\_x-\frac{ap\_xE[y]+rm\_x}{a^2p\_x+r}=m\_x-\frac{a^2p\_xm\_x+rm\_x}{a^2p\_x+r}=0\]
  • Estimation error covariance: \[p\_{MAP}=E[(x-\hat x\_{MAP})^2]=p\_x-\frac{a^2p\_x^2}{a^2p\_x+r}=\frac{p\_xr}{a^2p\_x+r}\]
  • Maximum Likelihood Estimation: For this example, we have: \[f\_{y|x}(y|x)=f\_w(y-ax)=\frac{1}{\sqrt{2\pi r}}\exp(-\frac{(y-ax)^2}{2r})\]
  • With this information: \[\hat x\_{ML}=\arg\max\_x f\_{y|x}(y|x)=\frac{y}{a}\]
  • Estimation bias: \[b\_{ML}=E[x-\hat x\_{ML}|x]=x-\frac{E[y|x]}{a}=x-\frac{ax}{a}=0\]
  • Estimation error covariance: \[p\_{ML}=E[(x-\hat x\_{ML})^2|x]=E[(x-\frac{ax+w}{a})^2]=\frac{r}{a^2}\]
  • Comparing \(\hat x\_{MAP}\) and \(\hat x\_{ML}\), we have: \[\lim\_{p\_x\rightarrow +\infty}\hat x\_{MAP}=\hat x\_{ML}\] It means that if there is no a priori information about \(x\), the two estimations are equal.
  • For the error covariance, we have: \[\frac{1}{p\_{MAP}}=\frac{a^2p\_x+r}{p\_xr}=\frac{a^2}{r}+\frac{1}{p\_x}=\frac{1}{p\_{ML}}+\frac{1}{p\_x}\]
  • In other words, information after observation is the sum of information of the observation and information before the observation.
  • Similarly, in the limit of no a priori information: \[\lim\_{p\_x\rightarrow +\infty}p\_{MAP}=p\_{ML}\]
  • It is possible to include a priori information in maximum likelihood estimation.
  • A priori distribution of \(x\), \(\mathcal{N}(m\_x,p\_x)\), can be rewritten as the following observation: \(m\_x=x+v\) where \(v\sim\mathcal{N}(0,p\_x)\) is the observation noise.
  • Combined observation: \(z=Ax+u\) where: \[z=\bmat m\_x\\\\y\emat, A=\bmat 1\\\\a\emat, u=\bmat v\\\\ w\emat\]
  • The assumption is that \(v\) and \(w\) are independent. Therefore: \[u\sim\mathcal{N}\left(0,\bmat p\_x& 0\\\\0 & r\emat\right)\]
  • Maximum likelihood estimation: \[\begin{align}\hat x\_{MLp}(z)=&\arg\max\_x f\_{z|x}(z|x)\\\\\\ =&\arg\min\_x \left(\frac{(m\_x-x)^2}{p\_x}+\frac{(y-ax)^2}{r}\right)\\\\\\ =&\frac{ap\_xy+m\_xr}{a^2p\_x+r}=\hat x\_{MAP}\end{align}\]
  • \(\hat x\_{MLp}\) is unbiased and has the same error covariance as \(\hat x\_{MAP}\).
  • Therefore \(\hat x\_{MLp}\) and \(\hat x\_{MAP}\) are equivalent.
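The scalar example can be verified numerically. A minimal sketch; the values of \(a\), \(r\), \(m\_x\), \(p\_x\) and the measurement \(y\) are arbitrary choices, not from the notes.

```python
import numpy as np

# Scalar observation y = a x + w, w ~ N(0, r), prior x ~ N(m_x, p_x).
a, r, m_x, p_x = 2.0, 0.5, 1.0, 4.0
y = 3.0                                   # an example measurement

x_map = (a * p_x * y + m_x * r) / (a**2 * p_x + r)
x_ml = y / a
p_map = p_x * r / (a**2 * p_x + r)
p_ml = r / a**2

# Information is additive: 1/p_MAP = 1/p_ML + 1/p_x.
print(abs(1 / p_map - (1 / p_ml + 1 / p_x)))   # ~0

# As p_x grows (no a priori information), x_MAP approaches x_ML.
x_map_flat = (a * 1e9 * y + m_x * r) / (a**2 * 1e9 + r)
print(abs(x_map_flat - x_ml))                  # ~0
```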

Standard Kalman Filter

  • Consider the following linear system: \[\begin{cases}x(k+1)=A(k)x(k)+w(k) \\\\ y(k)=C(k)x(k)+v(k)\end{cases}\] where \(x(k)\in\RR^n\), \(y(k)\in\RR^m\) denote the state vector and measurement vector at time \(t\_k\).
  • \(w(k)\sim\mathcal{N}(0,Q(k))\) and \(v(k)\sim\mathcal{N}(0,R(k))\) are independent Gaussian white noise processes where \(R(k)\) is invertible.
  • It is assumed that there is an a priori estimation of \(x(k)\), denoted by \(\hat x^-(k)\), which is assumed to be unbiased with a Gaussian estimation error, independent of \(w\) and \(v\): \[e^-(k)\sim\mathcal{N}(0,P^-(k))\] where \(P^-(k)\) is invertible.
  • Kalman filter is a recursive algorithm to compute the state estimation.
  • Output Measurement: Information in \(\hat x^-(k)\) and \(y(k)\) can be written as the following observation: \[\bmat\hat x^-(k)\\\\ y(k) \emat=\bmat I\\\\C(k)\emat x(k)+\bmat e^-(k)\\\\v(k)\emat\] Considering the independence of \(e^-(k)\) and \(v(k)\), we have: \[\bmat e^-(k)\\\\v(k) \emat\sim\mathcal{N}\left(0,\bmat P^-(k) & 0\\\\0& R(k) \emat\right)\]
  • Using the Weighted Least Square (WLS) solution and the matrix inversion lemma: \[(A+BD^{-1}C)^{-1}=A^{-1}-A^{-1}B(D+CA^{-1}B)^{-1}CA^{-1}\]
  • Assuming: \[K(k)=P^-(k)C^\TR(k)[C(k)P^-(k)C^\TR(k)+R(k)]^{-1}\]
  • We have: \[\hat x(k)=\hat x^-(k)+K(k)(y(k)-C(k)\hat x^-(k))\]
  • The state estimation is the sum of the a priori estimation and a gain times the output prediction error, since the predicted output is: \[\hat y^-(k)=C(k)\hat x^-(k)\]
  • \(K(k)\) is the Kalman filter gain.
  • Estimation error covariance: \[P(k)=(I-K(k)C(k))P^-(k)\]
  • Information: \[\hat x(k)=x(k)+e(k)\] where \(e(k)\sim\mathcal{N}(0,P(k))\)
  • State Update: To complete a recursive algorithm, we need to compute \(\hat x^-(k+1)\) and \(P^-(k+1)\).
  • Information: \[\begin{align}\hat x(k)=&x(k)+e(k)\\\\ 0=&\bmat -I& A(k)\emat\bmat x(k+1)\\\\x(k) \emat+w(k)\end{align}\]
  • By eliminating \(x(k)\) from the above observations, we have: \[A(k)\hat x(k)=x(k+1)+A(k)e(k)-w(k)\]
  • It is easy to see: \[\hat x^-(k+1)=A(k)\hat x(k)\]
  • Estimation error: \[e^-(k+1)=A(k)e(k)-w(k)\]
  • Estimation covariance: \[P^-(k+1)=A(k)P(k)A^\TR(k)+Q(k)\]


  • Initial Conditions: \(\hat x^-(k)\) and its error covariance \(P^-(k)\).
  • Gain Calculation: \[K(k)=P^-(k)C^\TR(k)[C(k)P^-(k)C^\TR(k)+R(k)]^{-1}\]
  • \(\hat x(k)\): \[\begin{align}\hat x(k)=&\hat x^-(k)+K(k)(y(k)-C(k)\hat x^-(k))\\\\\\ P(k)=&(I-K(k)C(k))P^-(k)\end{align}\]
  • \(\hat x^-(k+1)\): \[\begin{align}\hat x^-(k+1)=&A(k)\hat x(k)\\\\ P^-(k+1)=&A(k)P(k)A^\TR(k)+Q(k)\end{align}\]
  • Go to gain calculation and continue the loop for \(k+1\).
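The recursion above can be sketched in a few lines. This is a minimal illustration, assuming a constant-velocity model with a noisy position measurement; the matrices \(A\), \(C\), \(Q\), \(R\) and the initial conditions are our own example, not from the notes.

```python
import numpy as np

# Example system (an assumption for illustration): constant-velocity
# model, position measured with noise.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.25]])

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0])        # true state (position, velocity)
x_pred = np.zeros(2)            # initial a priori estimate  \hat x^-(0)
P_pred = np.eye(2)              # initial a priori covariance P^-(0)

for k in range(200):
    y = C @ x + np.sqrt(R[0, 0]) * rng.standard_normal(1)
    # Gain calculation
    K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
    # Measurement update: \hat x(k), P(k)
    x_hat = x_pred + K @ (y - C @ x_pred)
    P = (np.eye(2) - K @ C) @ P_pred
    err = x_hat - x             # a posteriori estimation error
    # Time update: \hat x^-(k+1), P^-(k+1)
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)

print(np.diag(P))               # error variances after 200 steps
```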


  • Estimation residue: \[\gamma(k)=y(k)-C(k)\hat x^-(k)\]
  • Residue covariance: \[P\_\gamma(k)=C(k)P^-(k)C^\TR(k)+R(k)\]
  • The residue signal is used for monitoring the performance of Kalman filter.
  • Modeling errors, round-off errors, disturbances, correlation between the input and measurement noise, and other factors might cause a biased and colored residue.
  • The residue signal can be used in Fault Detection and Isolation (FDI).
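The residue test can be illustrated on a scalar random walk. A minimal sketch, assuming \(x(k+1)=x(k)+w(k)\), \(y(k)=x(k)+v(k)\) with variances \(q\) and \(r\) of our own choosing; under a correct model the normalized residues \(\gamma(k)/\sqrt{P\_\gamma(k)}\) are zero-mean, unit-variance and white, so a persistent bias or color signals a fault or modeling error.

```python
import numpy as np

# Scalar random walk (an assumed example): x(k+1) = x(k) + w(k),
# y(k) = x(k) + v(k), with w ~ N(0, q) and v ~ N(0, r).
q, r = 0.01, 0.25
rng = np.random.default_rng(1)

x = rng.standard_normal()        # true initial state, drawn from the prior
x_pred, P_pred = 0.0, 1.0        # matching a priori estimate and covariance
z = []                           # normalized residues
for k in range(5000):
    y = x + np.sqrt(r) * rng.standard_normal()
    gamma = y - x_pred           # residue gamma(k)
    P_gamma = P_pred + r         # residue covariance P_gamma(k)
    z.append(gamma / np.sqrt(P_gamma))
    K = P_pred / P_gamma         # scalar Kalman gain
    x_pred = x_pred + K * gamma  # measurement update (A = 1 here)
    P_pred = (1 - K) * P_pred + q
    x = x + np.sqrt(q) * rng.standard_normal()

z = np.array(z)
print(z.mean(), z.var())         # should be close to 0 and 1
```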
  • The standard Kalman filter is not numerically robust because it involves matrix inversion. For example, the computed error covariance matrix might lose positive definiteness because of round-off errors.
  • There are different implementations of Kalman filter to improve the standard Kalman filter in the following aspects:
    • Computational efficiency
    • Dealing with disturbance or unknown inputs
    • Dealing with singular systems (difference algebraic equations)
Behzad Samadi