# Estimation Theory

$$\def\RR{\mathbb{R}}$$ $$\def\CC{\mathbb{C}}$$ $$\def\TR{\text{T}}$$ $$\def\CO{\text{conv}}$$ $$\def\bmat{\left[\begin{matrix}}$$ $$\def\emat{\end{matrix}\right]}$$ $$\def\bsmat{\left[\begin{smallmatrix}}$$ $$\def\esmat{\end{smallmatrix}\right]}$$

• An observation is defined as: $y=h(x)+w$ where $$x\in\RR^n$$ and $$y\in\RR^m$$ denote the unknown vector and the measurement vector, respectively. $$h:\RR^n\rightarrow\RR^m$$ is a function of $$x$$ and $$w$$ is the observation noise with the probability density $$f\_w(w)$$.
• It is assumed that $$x$$ is a random variable with an a priori probability density $$f\_x(x)$$ before the observation.
• The goal is to compute the “best” estimation of $$x$$ using the observation.

## Optimal Estimation

• The optimal estimation $$\hat x$$ is defined based on a cost function $$J$$: $\hat x\_{opt}=\underset{\hat x}{\arg\min} E[J(x-\hat x)]$
• Some typical cost functions:
• Minimum Mean Square Error ($$\hat x\_{MMSE}$$): $J(x-\hat x)=(x-\hat x)^\TR W(x-\hat x), W>0$
• Absolute Value ($$\hat x\_{ABS}$$): $J(x-\hat x)=|x-\hat x|$
• Maximum a Posteriori ($$\hat x\_{MAP}$$): $J(x-\hat x)=\begin{cases} 0 & \text {if }|x-\hat x|\leq \epsilon \\\\\\ 1 & \text {if }|x-\hat x| \gt \epsilon\end{cases}, \epsilon\rightarrow 0$
• It can be shown that: $\hat x\_{MMSE}(y)=E[x|y]$ $\int\_{-\infty}^{\hat x\_{ABS}(y)}f\_{x|y}(x|y)dx=\int\_{\hat x\_{ABS}(y)}^{+\infty}f\_{x|y}(x|y)dx$ $\hat x\_{MAP}=\arg\max\_{x} f\_{x|y}(x|y)$
• If the a posteriori density function $$f_{x|y}(x|y)$$ has only one maximum and it is symmetric with respect to $$E[x|y]$$ then all the above estimates are equal to $$E[x|y]$$.
• In fact, assuming these conditions for $$f_{x|y}(x|y)$$, $$E[x|y]$$ is the optimal estimation for any cost function $$J$$ such that $$J(0)=0$$ and $$J(x-\hat x)$$ is nondecreasing with distance (Sherman’s Theorem).
• Maximum Likelihood Estimation: $$\hat x\_{ML}$$ is the value of $$x$$ that maximizes the probability of observing $$y$$: $\hat x_{ML}(y)=\underset{x}{\arg\max} f\_{y|x}(y|x)$
• It can be shown that $$\hat x\_{ML}=\hat x\_{MAP}$$ if there is no a priori information about $$x$$.
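As a numerical illustration of why these estimates coincide for a symmetric, unimodal posterior, the following sketch evaluates the posterior mean (MMSE), median (ABS), and mode (MAP) on a grid. The Gaussian posterior and its parameters are illustrative assumptions, not taken from the notes.

```python
import numpy as np

# Assumed scalar Gaussian posterior f_{x|y}(x|y) with mean 1.5 and std 0.5,
# evaluated on a dense grid (illustrative numbers only).
x = np.linspace(-3.0, 6.0, 20001)
dx = x[1] - x[0]
post = np.exp(-0.5 * ((x - 1.5) / 0.5) ** 2)
post /= post.sum() * dx                    # normalize so it integrates to 1

x_mmse = (x * post).sum() * dx             # posterior mean   -> MMSE estimate
x_abs = x[np.searchsorted(np.cumsum(post) * dx, 0.5)]   # median -> ABS
x_map = x[np.argmax(post)]                 # posterior mode   -> MAP estimate
print(x_mmse, x_abs, x_map)                # all three coincide near 1.5
```

For an asymmetric or multimodal posterior the three values would differ, which is exactly the content of the symmetry condition above.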

## Linear Gaussian Observation

• Consider the following observation: $$y=Ax+Bw$$ where $$w\sim\mathcal{N}(0,I\_r)$$ is a Gaussian random vector and matrices $$A\_{m\times n}$$ and $$B\_{m\times r}$$ are known.
• In this observation, $$x$$ is estimable only if $$A$$ has full column rank; otherwise there are infinitely many solutions to the problem.
• If $$BB^\TR$$ is invertible, then: $f\_{y|x}(y|x)=\frac{1}{(2\pi)^{m / 2}|BB^\TR|^{1 / 2}}\exp\left(-\frac{1}{2}\left[(y-Ax)^\TR (BB^\TR)^{-1}(y-Ax)\right]\right)$
• The maximum likelihood estimation can be computed as: \begin{align}\hat x\_{ML}=&\arg\min_x (y-Ax)^\TR(BB^\TR)^{-1}(y-Ax)\\\\=&(A^\TR(BB^\TR)^{-1}A)^{-1}A^\TR(BB^\TR)^{-1}y\end{align}
• It is very interesting that $$\hat x\_{ML}$$ is the Weighted Least Square (WLS) solution to the equation $$y=Ax$$ with the weight matrix $$W=(BB^\TR)^{-1}$$, i.e. $\hat x\_{WLS}=\arg\min\_x (y-Ax)^\TR W(y-Ax)$
• $$\hat x\_{ML}$$ is an unbiased estimation: $b(x)=E[x-\hat x\_{ML}|x]=E[(A^\TR(BB^\TR)^{-1}A)^{-1}A^\TR(BB^\TR)^{-1}y-x|x]=0$
• The covariance of the estimation error is: $P\_e(x)=E[(x-\hat x\_{ML})(x-\hat x\_{ML})^\TR|x]=(A^\TR(BB^\TR)^{-1}A)^{-1}$
• $$\hat x\_{ML}$$ is efficient in the sense that it achieves the Cramér-Rao lower bound.
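The ML/WLS estimate and its error covariance can be sketched numerically as follows. The matrices $$A$$, $$B$$, the true $$x$$, and the simulated measurement are illustrative assumptions.

```python
import numpy as np

# Sketch of the ML / WLS estimate for y = A x + B w, w ~ N(0, I_r).
# A, B, x_true, and the noise draw below are illustrative assumptions.
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])    # full column rank
B = np.diag([0.5, 0.5, 1.0])
x_true = np.array([1.0, -2.0])
y = A @ x_true + B @ rng.standard_normal(3)

S_inv = np.linalg.inv(B @ B.T)                        # (B B^T)^{-1}
P_e = np.linalg.inv(A.T @ S_inv @ A)                  # error covariance
x_ml = P_e @ A.T @ S_inv @ y                          # x_ML = WLS solution
```

Note that `P_e` depends only on `A` and `B`, not on the measurement, matching the error-covariance formula above.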
• Example: Consider the following linear Gaussian observation: $$y=ax+w$$ where $$a$$ is a nonzero real number and $$w\sim\mathcal{N}(0,r)$$ is the observation noise.
• Maximum a Posteriori Estimation: To compute $$\hat x\_{MAP}$$, it is assumed that the a priori density of $$x$$ is Gaussian with mean $$m\_x$$ and variance $$p\_x$$: $x\sim\mathcal{N}(m\_x,p\_x)$
• The conditions of Sherman’s Theorem are satisfied and therefore: \begin{align} \hat x\_{MAP}=&E[x|y]\\\\\\ =&m\_x+\frac{p\_{xy}}{p\_y}(y-m\_y)\\\\\\ =&m\_x+\frac{ap\_{x}}{a^2p\_x+r}(y-am\_x)\\\\\\ =&\frac{ap\_xy+m\_xr}{a^2p\_x+r} \end{align}
• Estimation bias: $b\_{MAP}=E[x-\hat x\_{MAP}]=m\_x-\frac{ap\_xE[y]+rm\_x}{a^2p\_x+r}=m\_x-\frac{a^2p\_xm\_x+rm\_x}{a^2p\_x+r}=0$
• Estimation error covariance: $p\_{MAP}=E[(x-\hat x\_{MAP})^2]=p\_x-\frac{a^2p\_x^2}{a^2p\_x+r}=\frac{p\_xr}{a^2p\_x+r}$
• Maximum Likelihood Estimation: For this example, we have: $f\_{y|x}(y|x)=f\_w(y-ax)=\frac{1}{\sqrt{2\pi r}}\exp(-\frac{(y-ax)^2}{2r})$
• With this information: $\hat x\_{ML}=\arg\max\_x f\_{y|x}(y|x)=\frac{y}{a}$
• Estimation bias: $b\_{ML}=E[x-\hat x\_{ML}|x]=x-\frac{E[y|x]}{a}=x-\frac{ax}{a}=0$
• Estimation error covariance: $p\_{ML}=E[(x-\hat x\_{ML})^2|x]=E[(x-\frac{ax+w}{a})^2]=\frac{r}{a^2}$
• Comparing $$\hat x\_{MAP}$$ and $$\hat x\_{ML}$$, we have: $\lim\_{p\_x\rightarrow +\infty}\hat x\_{MAP}=\hat x\_{ML}$ This means that if there is no a priori information about $$x$$, the two estimations are equal.
• For the error covariance, we have: $\frac{1}{p\_{MAP}}=\frac{1}{p\_{ML}}+\frac{1}{p\_x}$
• In other words, information after observation is the sum of information of the observation and information before the observation.
• Estimation error covariance: $\lim\_{p\_x\rightarrow +\infty} p\_{MAP}=p\_{ML}$
• It is possible to include a priori information in maximum likelihood estimation.
• A priori distribution of $$x$$, $$\mathcal{N}(m\_x,p\_x)$$, can be rewritten as the following observation: $$m\_x=x+v$$ where $$v\sim\mathcal{N}(0,p\_x)$$ is the observation noise.
• Combined observation: $$z=Ax+u$$ where: $z=\bmat m\_x\\\\y\emat, A=\bmat 1\\\\a\emat, u=\bmat v\\\\ w\emat$
• The assumption is that $$v$$ and $$w$$ are independent. Therefore: $u\sim\mathcal{N}\left(0,\bmat p\_x& 0\\\\0 & r\emat\right)$
• Maximum likelihood estimation: \begin{align}\hat x\_{MLp}(z)=&\arg\max\_x f\_{z|x}(z|x)\\\\\\ =&\arg\min\_x \left(\frac{(m\_x-x)^2}{p\_x}+\frac{(y-ax)^2}{r}\right)\\\\\\ =&\frac{ap\_xy+m\_xr}{a^2p\_x+r}=\hat x\_{MAP}\end{align}
• $$\hat x\_{MLp}$$ is unbiased and has the same error covariance as $$\hat x\_{MAP}$$.
• Therefore $$\hat x\_{MLp}$$ and $$\hat x\_{MAP}$$ are equivalent.
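The closed-form expressions of this scalar example can be checked numerically. The values of $$a$$, $$r$$, $$m\_x$$, $$p\_x$$ and $$y$$ below are illustrative assumptions.

```python
# Sketch for the scalar example y = a x + w, w ~ N(0, r), prior x ~ N(m_x, p_x).
# All numbers below are illustrative assumptions.
a, r = 2.0, 0.5
m_x, p_x = 1.0, 4.0
y = 3.0

x_map = (a * p_x * y + m_x * r) / (a ** 2 * p_x + r)  # MAP = MMSE estimate
x_ml = y / a                                          # ML ignores the prior
p_map = p_x * r / (a ** 2 * p_x + r)                  # posterior variance
p_ml = r / a ** 2                                     # ML error variance

# Information (inverse variance) after the observation is the sum of the
# a priori information and the observation information:
assert abs(1 / p_map - (1 / p_ml + 1 / p_x)) < 1e-12
```

Letting `p_x` grow large in this snippet drives `x_map` toward `x_ml` and `p_map` toward `p_ml`, in line with the limits above.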

## Standard Kalman Filter

• Consider the following linear system: $\begin{cases}x(k+1)&=&A(k)x(k)+w(k) \\\\ y(k)&=&C(k)x(k)+v(k)\end{cases}$ where $$x(k)\in\RR^n$$, $$y(k)\in\RR^m$$ denote the state vector and measurement vector at time $$t\_k$$.
• $$w(k)\sim\mathcal{N}(0,Q(k))$$ and $$v(k)\sim\mathcal{N}(0,R(k))$$ are independent Gaussian white noise processes where $$R(k)$$ is invertible.
• It is assumed that there is an a priori estimation of $$x$$, denoted by $$\hat x^-(k)$$, which is unbiased with a Gaussian estimation error, independent of $$w$$ and $$v$$: $e^-(k)\sim\mathcal{N}(0,P^-(k))$ where $$P^-(k)$$ is invertible.
• Kalman filter is a recursive algorithm to compute the state estimation.
• Output Measurement: Information in $$\hat x^-(k)$$ and $$y(k)$$ can be written as the following observation: $\bmat\hat x^-(k)\\\\ y(k) \emat=\bmat I\\\\C(k)\emat x(k)+\bmat e^-(k)\\\\v(k)\emat$ Considering the independence of $$e^-(k)$$ and $$v(k)$$, we have: $\bmat e^-(k)\\\\v(k) \emat\sim\mathcal{N}\left(0,\bmat P^-(k) & 0\\\\0& R(k) \emat\right)$
• Using the Weighted Least Square (WLS) solution together with the matrix inversion lemma: $(A+BD^{-1}C)^{-1}=A^{-1}-A^{-1}B(D+CA^{-1}B)^{-1}CA^{-1}$
• Assuming: $K(k)=P^-(k)C^\TR(k)[C(k)P^-(k)C^\TR(k)+R(k)]^{-1}$
• We have: $\hat x(k)=\hat x^-(k)+K(k)(y(k)-C(k)\hat x^-(k))$
• The state estimation is the sum of the a priori estimation and a gain times the output prediction error, since the predicted output is: $\hat y^-(k)=C(k)\hat x^-(k)$
• $$K(k)$$ is the Kalman filter gain.
• Estimation error covariance: $P(k)=(I-K(k)C(k))P^-(k)$
• Information: $\hat x(k)=x(k)+e(k)$ where $$e(k)\sim\mathcal{N}(0,P(k))$$
• State Update: To complete a recursive algorithm, we need to compute $$\hat x^-(k+1)$$ and $$P^-(k+1)$$.
• Information: \begin{align}\hat x(k)=&x(k)+e(k)\\\\\\ 0=&\bmat -I& A(k)\emat\bmat x(k+1)\\\\x(k) \emat+w(k)\end{align}
• By eliminating $$x(k)$$ from the above observations, we have: $A(k)\hat x(k)=x(k+1)+A(k)e(k)-w(k)$
• It is easy to see: $\hat x^-(k+1)=A(k)\hat x(k)$
• Estimation error: $e^-(k+1)=A(k)e(k)-w(k)$
• Estimation covariance: $P^-(k+1)=A(k)P(k)A^\TR(k)+Q(k)$

Summary:

• Initial Conditions: $$\hat x^-(k)$$ and its error covariance $$P^-(k)$$.
• Gain Calculation: $K(k)=P^-(k)C^\TR(k)[C(k)P^-(k)C^\TR(k)+R(k)]^{-1}$
• $$\hat x(k)$$: \begin{align}\hat x(k)=&\hat x^-(k)+K(k)(y(k)-C(k)\hat x^-(k))\\\\\\ P(k)=&(I-K(k)C(k))P^-(k)\end{align}
• $$\hat x^-(k+1)$$: \begin{align}\hat x^-(k+1)=&A(k)\hat x(k)\\\\ P^-(k+1)=&A(k)P(k)A^\TR(k)+Q(k)\end{align}
• Go to gain calculation and continue the loop for $$k+1$$.
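The summary above can be sketched as a short recursion. The constant-velocity model, noise levels, and initial conditions below are illustrative assumptions, not part of the notes.

```python
import numpy as np

# One standard Kalman filter recursion, following the summary steps:
# gain calculation, measurement update, then state (time) update.
def kf_step(x_prior, P_prior, y, A, C, Q, R):
    S = C @ P_prior @ C.T + R                      # residue covariance
    K = P_prior @ C.T @ np.linalg.inv(S)           # Kalman gain
    x_post = x_prior + K @ (y - C @ x_prior)       # measurement update
    P_post = (np.eye(len(x_prior)) - K @ C) @ P_prior
    x_next = A @ x_post                            # a priori for step k+1
    P_next = A @ P_post @ A.T + Q
    return x_post, P_post, x_next, P_next

# Toy constant-velocity model (assumed for illustration)
dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.25]])

rng = np.random.default_rng(1)
x = np.array([0.0, 1.0])                           # true state
x_prior, P_prior = np.zeros(2), 10.0 * np.eye(2)   # initial conditions
for _ in range(50):
    y = C @ x + rng.standard_normal(1) * 0.5       # std 0.5 -> variance 0.25
    x_post, P_post, x_prior, P_prior = kf_step(x_prior, P_prior, y, A, C, Q, R)
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)
```

After a few iterations the error covariance settles well below the initial prior covariance, reflecting the information gained from the measurements.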

Remarks:

• Estimation residue: $\gamma(k)=y(k)-C(k)\hat x^-(k)$
• Residue covariance: $P\_\gamma(k)=C(k)P^-(k)C^\TR(k)+R(k)$
• The residue signal is used for monitoring the performance of the Kalman filter.
• Modeling errors, round-off errors, disturbances, correlation between process and measurement noise, and other factors might cause a biased and colored residue.
• The residue signal can be used in Fault Detection and Isolation (FDI).
• The standard Kalman filter is not numerically robust because it involves explicit matrix inversion. For example, the computed error covariance matrix might lose positive definiteness because of round-off errors.
• There are different implementations of Kalman filter to improve the standard Kalman filter in the following aspects:
• Computational efficiency
• Dealing with disturbance or unknown inputs
• Dealing with singular systems (difference-algebraic equations)
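As an illustration of residue-based monitoring, one common check (a sketch under the assumption of a healthy, correctly modeled filter, not a full FDI scheme) is that the normalized residue $$\gamma^\TR(k) P\_\gamma^{-1}(k)\gamma(k)$$ behaves like a chi-square variable with $$m$$ degrees of freedom, so its sample mean should stay near $$m$$. The residue covariance below is an assumed value.

```python
import numpy as np

# Sketch: simulate residues of a healthy filter and check that the
# normalized residue gamma^T P_gamma^{-1} gamma has sample mean near m.
rng = np.random.default_rng(2)
m = 2
P_gamma = np.array([[2.0, 0.3], [0.3, 1.0]])   # assumed residue covariance
gammas = rng.multivariate_normal(np.zeros(m), P_gamma, size=5000)
d = np.einsum('ki,ij,kj->k', gammas, np.linalg.inv(P_gamma), gammas)
print(d.mean())   # should be close to m = 2 for a healthy filter
```

A persistent drift of this statistic away from $$m$$ indicates a biased or colored residue, i.e. one of the problems listed in the remarks above.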