# Estimation Theory

\(\def\RR{\mathbb{R}}\) \(\def\CC{\mathbb{C}}\) \(\def\TR{\text{T}}\) \(\def\CO{\text{conv}}\) \(\def\bmat{\left[\begin{matrix}}\) \(\def\emat{\end{matrix}\right]}\) \(\def\bsmat{\left[\begin{smallmatrix}}\) \(\def\esmat{\end{smallmatrix}\right]}\)

- An **observation** is defined as: \[y=h(x)+w\] where \(x\in\RR^n\) and \(y\in\RR^m\) denote the unknown vector and the measurement vector, \(h:\RR^n\rightarrow\RR^m\) is a known function of \(x\), and \(w\) is the observation noise with probability density \(f\_w(w)\).
- It is assumed that \(x\) is a random variable with an a priori probability density \(f\_x(x)\) before the observation.
- The goal is to compute the “best” estimation of \(x\) using the observation.

## Optimal Estimation

- The optimal estimation \(\hat x\) is defined based on a cost function \(J\): \[\hat x\_{opt}=\underset{\hat x}{\arg\min} E[J(x-\hat x)]\]
- Some typical cost functions:
  - Minimum Mean Square Error (\(\hat x\_{MMSE}\)): \[J(x-\hat x)=(x-\hat x)^\TR W(x-\hat x),\quad W>0\]
  - Absolute Value (\(\hat x\_{ABS}\)): \[J(x-\hat x)=|x-\hat x|\]
  - Maximum a Posteriori (\(\hat x\_{MAP}\)): \[J(x-\hat x)=\begin{cases} 0 & \text {if }|x-\hat x|\leq \epsilon \\\\ 1 & \text {if }|x-\hat x| \gt \epsilon\end{cases},\quad \epsilon\rightarrow 0\]

- It can be shown that: \[\hat x\_{MMSE}(y)=E[x|y]\] \[\int\_{-\infty}^{\hat x\_{ABS}(y)}f\_{x|y}(x|y)dx=\int\_{\hat x\_{ABS}(y)}^{+\infty}f\_{x|y}(x|y)dx\] \[\hat x\_{MAP}=\arg\max\_{x} f\_{x|y}(x|y)\]
- If the a posteriori density function \(f_{x|y}(x|y)\) has only one maximum and it is symmetric with respect to \(E[x|y]\) then all the above estimates are equal to \(E[x|y]\).
- In fact, under these conditions on \(f_{x|y}(x|y)\), \(E[x|y]\) is the optimal estimation for any cost function \(J\) with \(J(0)=0\) that is nondecreasing with distance (**Sherman’s Theorem**).
- **Maximum Likelihood Estimation:** \(\hat x\_{ML}\) is the value of \(x\) that maximizes the likelihood of observing \(y\): \[\hat x\_{ML}(y)=\underset{x}{\arg\max} f\_{y|x}(y|x)\]
- It can be shown that \(\hat x\_{ML}=\hat x\_{MAP}\) if there is no a priori information about \(x\).

## Linear Gaussian Observation

- Consider the following observation: \(y=Ax+Bw\) where \(w\sim\mathcal{N}(0,I\_r)\) is a Gaussian random vector and matrices \(A\_{m\times n}\) and \(B\_{m\times r}\) are known.
- In this observation, \(x\) is estimable if \(A\) has full column rank; otherwise there are infinitely many solutions to the problem.
- If \(BB^\TR\) is invertible, then: \[f\_{y|x}(y|x)=\frac{1}{(2\pi)^{m / 2}|BB^\TR|^{1 / 2}}\exp\left(-\frac{1}{2}\left[(y-Ax)^\TR (BB^\TR)^{-1}(y-Ax)\right]\right)\]
- The maximum likelihood estimation can be computed as: \[\begin{align}\hat x\_{ML}=&\arg\min_x (y-Ax)^\TR(BB^\TR)^{-1}(y-Ax)\\\\=&(A^\TR(BB^\TR)^{-1}A)^{-1}A^\TR(BB^\TR)^{-1}y\end{align}\]
- It is very interesting that \(\hat x\_{ML}\) is the Weighted Least Squares (WLS) solution to the equation \(y=Ax\) with the weight matrix \(W=(BB^\TR)^{-1}\), i.e. \[\hat x\_{WLS}=\arg\min\_x (y-Ax)^\TR W(y-Ax)\]
- \(\hat x\_{ML}\) is an unbiased estimation: \[b(x)=E[x-\hat x\_{ML}|x]=E[x-(A^\TR(BB^\TR)^{-1}A)^{-1}A^\TR(BB^\TR)^{-1}y|x]=0\]
- The covariance of the estimation error is: \[P\_e(x)=E[(x-\hat x\_{ML})(x-\hat x\_{ML})^\TR|x]=(A^\TR(BB^\TR)^{-1}A)^{-1}\]
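These formulas can be exercised numerically. The following sketch (all matrix and vector values are illustrative assumptions, not from the text) builds \(\hat x\_{ML}\) as the WLS solution and checks unbiasedness and the error covariance by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative observation y = A x + B w with w ~ N(0, I_3)
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])                # full column rank
B = np.array([[0.5, 0.0, 0.0], [0.0, 0.3, 0.0], [0.1, 0.2, 0.4]])
x_true = np.array([1.0, -2.0])

S = B @ B.T                       # noise covariance BB^T (invertible here)
Si = np.linalg.inv(S)

def x_ml(y):
    # hat x_ML = (A^T S^-1 A)^-1 A^T S^-1 y, the WLS solution with W = S^-1
    return np.linalg.solve(A.T @ Si @ A, A.T @ Si @ y)

# Monte Carlo check of unbiasedness and of P_e = (A^T S^-1 A)^-1
est = np.array([x_ml(A @ x_true + B @ rng.standard_normal(3))
                for _ in range(20000)])
P_e = np.linalg.inv(A.T @ Si @ A)
```

The sample mean of `est` approaches `x_true` and its sample covariance approaches `P_e` as the number of trials grows.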
- \(\hat x\_{ML}\) is *efficient* in the sense of the Cramér–Rao bound.

**Example:** Consider the following linear Gaussian observation: \(y=ax+w\) where \(a\) is a nonzero real number and \(w\sim\mathcal{N}(0,r)\) is the observation noise.

**Maximum a Posteriori Estimation:** To compute \(\hat x\_{MAP}\), it is assumed that the a priori density of \(x\) is Gaussian with mean \(m\_x\) and variance \(p\_x\): \[x\sim\mathcal{N}(m\_x,p\_x)\]

- The conditions of Sherman’s Theorem are satisfied and therefore: \[\begin{align} \hat x\_{MAP}=&E[x|y]\\\\ =&m\_x+\frac{p\_{xy}}{p\_y}(y-m\_y)\\\\ =&m\_x+\frac{ap\_{x}}{a^2p\_x+r}(y-am\_x)\\\\ =&\frac{ap\_xy+m\_xr}{a^2p\_x+r} \end{align}\]
- Estimation bias: \[b\_{MAP}=E[x-\hat x\_{MAP}]=m\_x-\frac{ap\_xE[y]+rm\_x}{a^2p\_x+r}=m\_x-\frac{a^2p\_xm\_x+rm\_x}{a^2p\_x+r}=0\]
- Estimation error covariance: \[p\_{MAP}=E[(x-\hat x\_{MAP})^2]=p\_x-\frac{a^2p\_x^2}{a^2p\_x+r}=\frac{p\_xr}{a^2p\_x+r}\]
**Maximum Likelihood Estimation:** For this example, we have: \[f\_{y|x}(y|x)=f\_w(y-ax)=\frac{1}{\sqrt{2\pi r}}\exp\left(-\frac{(y-ax)^2}{2r}\right)\]

- With this information: \[\hat x\_{ML}=\arg\max\_x f\_{y|x}(y|x)=\frac{y}{a}\]
- Estimation bias: \[b\_{ML}=E[x-\hat x\_{ML}|x]=x-\frac{ax}{a}=0\]
- Estimation error covariance: \[p\_{ML}=E[(x-\hat x\_{ML})^2|x]=E[(x-\frac{ax+w}{a})^2]=\frac{r}{a^2}\]
- Comparing \(\hat x\_{MAP}\) and \(\hat x\_{ML}\), we have: \[\lim\_{p\_x\rightarrow +\infty}\hat x\_{MAP}=\hat x\_{ML}\] This means that if there is no a priori information about \(x\), the two estimations are equal.
- For the error covariance, we have: \[\frac{1}{p\_{MAP}}=\frac{1}{p\_{ML}}+\frac{1}{p\_x}\]
- In other words, information after observation is the sum of information of the observation and information before the observation.
- Similarly, for the error covariance: \[\lim\_{p\_x\rightarrow +\infty}p\_{MAP}=p\_{ML}\]
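A quick numeric check of the scalar example (the values of \(a\), \(r\), \(m\_x\), \(p\_x\), and \(y\) below are arbitrary illustrative choices):

```python
# Scalar observation y = a x + w, w ~ N(0, r), prior x ~ N(m_x, p_x)
a, r, m_x, p_x = 2.0, 0.5, 1.0, 4.0
y = 3.0

x_map = (a * p_x * y + m_x * r) / (a**2 * p_x + r)   # MAP estimate
x_ml = y / a                                         # ML estimate

p_map = p_x * r / (a**2 * p_x + r)   # MAP error variance
p_ml = r / a**2                      # ML error variance

# Information additivity: 1/p_MAP = 1/p_ML + 1/p_x
info_gap = 1 / p_map - (1 / p_ml + 1 / p_x)

# With a nearly flat prior (large p_x), MAP approaches ML
x_map_flat = (a * 1e12 * y + m_x * r) / (a**2 * 1e12 + r)
```

Here `info_gap` is zero up to rounding, and `x_map_flat` is numerically indistinguishable from `x_ml`.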
- It is possible to include a priori information in maximum likelihood estimation.
- A priori distribution of \(x\), \(\mathcal{N}(m\_x,p\_x)\), can be rewritten as the following observation: \(m\_x=x+v\) where \(v\sim\mathcal{N}(0,p\_x)\) is the observation noise.
- **Combined observation**: \(z=Ax+u\) where: \[z=\bmat m\_x\\\\y\emat, A=\bmat 1\\\\a\emat, u=\bmat v\\\\ w\emat\]
- The assumption is that \(v\) and \(w\) are *independent*. Therefore: \[u\sim\mathcal{N}\left(0,\bmat p\_x& 0\\\\0 & r\emat\right)\]
- Maximum likelihood estimation: \[\begin{align}\hat x\_{MLp}(z)=&\arg\max\_x f\_{z|x}(z|x)\\\\ =&\arg\min\_x \left(\frac{(m\_x-x)^2}{p\_x}+\frac{(y-ax)^2}{r}\right)\\\\ =&\frac{ap\_xy+m\_xr}{a^2p\_x+r}=\hat x\_{MAP}\end{align}\]
- \(\hat x\_{MLp}\) is unbiased and has the same error covariance as \(\hat x\_{MAP}\).
- Therefore \(\hat x\_{MLp}\) and \(\hat x\_{MAP}\) are equivalent.
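The equivalence \(\hat x\_{MLp}=\hat x\_{MAP}\) can be confirmed numerically by solving the WLS problem on the stacked observation; the scalar values below are illustrative assumptions:

```python
import numpy as np

# Combined observation z = A x + u stacking m_x = x + v and y = a x + w
a, r, m_x, p_x, y = 2.0, 0.5, 1.0, 4.0, 3.0

z = np.array([m_x, y])
A = np.array([[1.0], [a]])
W = np.diag([1 / p_x, 1 / r])        # WLS weight = inverse covariance of u

# hat x_MLp = (A^T W A)^-1 A^T W z
x_mlp = np.linalg.solve(A.T @ W @ A, A.T @ W @ z)[0]

# Closed-form MAP estimate from the text
x_map = (a * p_x * y + m_x * r) / (a**2 * p_x + r)
```

Both expressions give the same value, as the derivation predicts.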

## Standard Kalman Filter

- Consider the following linear system: \[\begin{cases}x(k+1)&=&A(k)x(k)+w(k) \\\\ y(k)&=&C(k)x(k)+v(k)\end{cases}\] where \(x(k)\in\RR^n\), \(y(k)\in\RR^m\) denote the state vector and measurement vector at time \(t\_k\).
- \(w(k)\sim\mathcal{N}(0,Q(k))\) and \(v(k)\sim\mathcal{N}(0,R(k))\) are independent Gaussian white noise processes where \(R(k)\) is invertible.
- It is assumed that there is an a priori estimation of \(x\), denoted by \(\hat x^-(k)\), which is assumed to be unbiased with a Gaussian estimation error, independent of \(w\) and \(v\): \[e^-(k)\sim\mathcal{N}(0,P^-(k))\] where \(P^-(k)\) is invertible.
- Kalman filter is a recursive algorithm to compute the state estimation.
**Output Measurement:** The information in \(\hat x^-(k)\) and \(y(k)\) can be written as the following observation: \[\bmat\hat x^-(k)\\\\ y(k) \emat=\bmat I\\\\C(k)\emat x(k)+\bmat e^-(k)\\\\v(k)\emat\] Considering the independence of \(e^-(k)\) and \(v(k)\), we have: \[\bmat e^-(k)\\\\v(k) \emat\sim\mathcal{N}\left(0,\bmat P^-(k) & 0\\\\0& R(k) \emat\right)\]

- Using the Weighted Least Squares (WLS) solution and the matrix inversion formula: \[(A+BD^{-1}C)^{-1}=A^{-1}-A^{-1}B(D+CA^{-1}B)^{-1}CA^{-1}\]
- Assuming: \[K(k)=P^-(k)C^\TR(k)[C(k)P^-(k)C^\TR(k)+R(k)]^{-1}\]
- We have: \[\hat x(k)=\hat x^-(k)+K(k)(y(k)-C(k)\hat x^-(k))\]
- The state estimation is the sum of the a priori estimation and the gain times the output prediction error, since the predicted output is: \[\hat y^-(k)=C(k)\hat x^-(k)\]
- \(K(k)\) is the Kalman filter gain.
- Estimation error covariance: \[P(k)=(I-K(k)C(k))P^-(k)\]
- Information: \[\hat x(k)=x(k)+e(k)\] where \(e(k)\sim\mathcal{N}(0,P(k))\)
**State Update:** To complete the recursive algorithm, we need to compute \(\hat x^-(k+1)\) and \(P^-(k+1)\).

- Information: \[\begin{align}\hat x(k)=&x(k)+e(k)\\\\ 0=&\bmat -I& A(k)\emat\bmat x(k+1)\\\\x(k) \emat+w(k)\end{align}\]
- By *eliminating* \(x(k)\) from the above observation, we have: \[A(k)\hat x(k)=x(k+1)+A(k)e(k)-w(k)\]
- It is easy to see: \[\hat x^-(k+1)=A(k)\hat x(k)\]
- Estimation error: \[e^-(k+1)=A(k)e(k)-w(k)\]
- Estimation covariance: \[P^-(k+1)=A(k)P(k)A^\TR(k)+Q(k)\]

**Summary:**

- Initial Conditions: \(\hat x^-(k)\) and its error covariance \(P^-(k)\).
- Gain Calculation: \[K(k)=P^-(k)C^\TR(k)[C(k)P^-(k)C^\TR(k)+R(k)]^{-1}\]
- \(\hat x(k)\): \[\begin{align}\hat x(k)=&\hat x^-(k)+K(k)(y(k)-C(k)\hat x^-(k))\\\\\\ P(k)=&(I-K(k)C(k))P^-(k)\end{align}\]
- \(\hat x^-(k+1)\): \[\begin{align}\hat x^-(k+1)=&A(k)\hat x(k)\\\\ P^-(k+1)=&A(k)P(k)A^\TR(k)+Q(k)\end{align}\]
- Go to gain calculation and continue the loop for \(k+1\).
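The loop above can be sketched in a few lines; the system matrices, noise covariances, and initial conditions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 2-state system: x(k+1) = A x(k) + w(k), y(k) = C x(k) + v(k)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)                  # covariance of w(k)
R = np.array([[0.04]])                # covariance of v(k)

x = np.array([0.0, 1.0])              # true state (unknown to the filter)
x_hat = np.zeros(2)                   # a priori estimate hat x^-(0)
P = np.eye(2)                         # a priori covariance P^-(0)

for k in range(200):
    y = C @ x + rng.multivariate_normal([0.0], R)
    # gain calculation
    K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
    # measurement update: hat x(k), P(k)
    x_hat = x_hat + K @ (y - C @ x_hat)
    P = (np.eye(2) - K @ C) @ P
    # time update: hat x^-(k+1), P^-(k+1)
    x_hat = A @ x_hat
    P = A @ P @ A.T + Q
    # simulate the true system
    x = A @ x + rng.multivariate_normal([0.0, 0.0], Q)
```

In practice the covariance update is often done in the Joseph form \(P=(I-KC)P^-(I-KC)^\TR+KRK^\TR\), which is more robust to the numerical issues noted in the remarks.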

**Remarks:**

- Estimation residue: \[\gamma(k)=y(k)-C(k)\hat x^-(k)\]
- Residue covariance: \[P\_\gamma(k)=C(k)P^-(k)C^\TR(k)+R(k)\]
- The residue signal is used for monitoring the performance of Kalman filter.
- Modeling errors, round-off errors, disturbances, correlation between the input and measurement noise, and other factors might cause a biased and colored residue.
- The residue signal can be used in Fault Detection and Isolation (FDI).
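As a sketch of such monitoring (a scalar system with illustrative values), the normalized residue \(\gamma(k)/\sqrt{P\_\gamma(k)}\) of a correctly tuned filter should be approximately white with zero mean and unit variance:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative scalar system: x(k+1) = a x(k) + w(k), y(k) = c x(k) + v(k)
a, c, q, r = 0.9, 1.0, 0.01, 0.04
x_hat, P = 0.0, 1.0                   # a priori estimate and covariance
x = rng.normal(x_hat, np.sqrt(P))     # true state, consistent with P^-(0)
z = []                                # normalized residues

for k in range(2000):
    y = c * x + rng.normal(0.0, np.sqrt(r))
    gamma = y - c * x_hat             # residue gamma(k)
    P_gamma = c * P * c + r           # residue covariance
    z.append(gamma / np.sqrt(P_gamma))
    K = P * c / P_gamma               # Kalman gain
    x_hat, P = x_hat + K * gamma, (1 - K * c) * P
    x_hat, P = a * x_hat, a * P * a + q
    x = a * x + rng.normal(0.0, np.sqrt(q))

z = np.array(z)
```

A persistent bias or inflated variance in `z` signals model mismatch or a fault.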
- The standard Kalman filter is not numerically robust because it contains matrix inversion. For example, the calculated error covariance matrix might not be positive definite because of computational errors.
- There are different implementations of the Kalman filter that improve on the standard algorithm in the following aspects:
  - Computational efficiency
  - Dealing with disturbances or unknown inputs
  - Dealing with singular systems (difference-algebraic equations)