Estimation Theory
\(\def\RR{\mathbb{R}}\) \(\def\CC{\mathbb{C}}\) \(\def\TR{\text{T}}\) \(\def\CO{\text{conv}}\) \(\def\bmat{\left[\begin{matrix}}\) \(\def\emat{\end{matrix}\right]}\) \(\def\bsmat{\left[\begin{smallmatrix}}\) \(\def\esmat{\end{smallmatrix}\right]}\)
- An observation is defined as: \[y=h(x)+w\] where \(x\in\RR^n\) and \(y\in\RR^m\) denote the unknown vector and the measurement vector, respectively. \(h:\RR^n\rightarrow\RR^m\) is a function of \(x\) and \(w\) is the observation noise with probability density \(f_w(w)\).
- It is assumed that \(x\) is a random variable with an a priori probability density \(f_x(x)\) before the observation.
- The goal is to compute the “best” estimate of \(x\) from the observation.
Optimal Estimation
- The optimal estimate \(\hat x_{opt}\) is defined through a cost function \(J\): \[\hat x_{opt}=\underset{\hat x}{\arg\min} E[J(x-\hat x)]\]
- Some typical cost functions:
- Minimum Mean Square Error (\(\hat x_{MMSE}\)): \[J(x-\hat x)=(x-\hat x)^\TR W(x-\hat x), W>0\]
- Absolute Value (\(\hat x_{ABS}\)): \[J(x-\hat x)=|x-\hat x|\]
- Maximum a Posteriori (\(\hat x_{MAP}\)): \[J(x-\hat x)=\begin{cases} 0 & \text {if }|x-\hat x|\leq \epsilon \\\\\\ 1 & \text {if }|x-\hat x| \gt \epsilon\end{cases}, \epsilon\rightarrow 0\]
- It can be shown that: \[\hat x_{MMSE}(y)=E[x|y]\] \[\int_{-\infty}^{\hat x_{ABS}(y)}f_{x|y}(x|y)dx=\int_{\hat x_{ABS}(y)}^{+\infty}f_{x|y}(x|y)dx\] \[\hat x_{MAP}=\arg\max_{x} f_{x|y}(x|y)\]
- If the a posteriori density function \(f_{x|y}(x|y)\) has a single maximum and is symmetric with respect to \(E[x|y]\), then all of the above estimates are equal to \(E[x|y]\).
- In fact, under these conditions on \(f_{x|y}(x|y)\), \(E[x|y]\) is the optimal estimate for any cost function \(J\) that satisfies \(J(0)=0\) and is symmetric and nondecreasing with distance (Sherman’s Theorem).
- Maximum Likelihood Estimation: \(\hat x_{ML}\) is the value of \(x\) that maximizes the probability of observing \(y\): \[\hat x_{ML}(y)=\underset{x}{\arg\max} f_{y|x}(y|x)\]
- It can be shown that \(\hat x_{ML}=\hat x_{MAP}\) when there is no a priori information about \(x\), i.e., when the prior \(f_x(x)\) is flat (non-informative).
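As a concrete illustration of these estimates (not part of the original notes), the short Python sketch below discretizes an arbitrary, asymmetric posterior density on a grid and reads off \(\hat x_{MMSE}\), \(\hat x_{ABS}\), and \(\hat x_{MAP}\) as its mean, median, and mode. Because the assumed posterior is not symmetric, the three values differ, which is exactly the situation where the choice of cost function matters.

```python
# A minimal numerical sketch: the three optimal estimates computed from a
# discretized posterior f_{x|y}(x|y). The asymmetric, bimodal posterior below
# is an arbitrary illustrative choice, not taken from the notes.
import numpy as np

x = np.linspace(-5.0, 5.0, 20001)          # grid for the scalar unknown x
dx = x[1] - x[0]
post = np.exp(-0.5 * (x - 1.0) ** 2) + 0.3 * np.exp(-2.0 * (x + 2.0) ** 2)
post /= post.sum() * dx                    # normalize so it integrates to 1

x_mmse = (x * post).sum() * dx             # conditional mean  E[x|y]
cdf = np.cumsum(post) * dx                 # cumulative distribution on the grid
x_abs = x[np.searchsorted(cdf, 0.5)]       # conditional median (ABS estimate)
x_map = x[np.argmax(post)]                 # posterior mode   (MAP estimate)

print(f"MMSE (mean)  : {x_mmse:+.3f}")
print(f"ABS  (median): {x_abs:+.3f}")
print(f"MAP  (mode)  : {x_map:+.3f}")
```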
Linear Gaussian Observation
- Consider the following observation: \(y=Ax+Bw\) where \(w\sim\mathcal{N}(0,I_r)\) is a Gaussian random vector and matrices \(A_{m\times n}\) and \(B_{m\times r}\) are known.
- In this observation, \(x\) is estimable only if \(A\) has full column rank; otherwise the problem has infinitely many solutions.
- If \(BB^\TR\) is invertible, then: \[\begin{align*} f_{y|x}(y|x) &= \frac{1}{(2\pi)^{m / 2}|BB^\TR|^{1 / 2}} \\\\\\ &\quad \times \exp\left( -\frac{1}{2} \left[(y-Ax)^\TR (BB^\TR)^{-1} (y-Ax)\right] \right) \end{align*} \]
- The maximum likelihood estimation can be computed as: \[\begin{align}\hat x_{ML}=&\arg\min_x (y-Ax)^\TR(BB^\TR)^{-1}(y-Ax)\\\\=&(A^\TR(BB^\TR)^{-1}A)^{-1}A^\TR(BB^\TR)^{-1}y\end{align}\]
- It is very interesting that \(\hat x_{ML}\) is the Weighted Least Squares (WLS) solution of the equation \(y=Ax\) with the weight matrix \(W=(BB^\TR)^{-1}\), i.e. \[\hat x_{WLS}=\arg\min_x (y-Ax)^\TR W(y-Ax)\]
- \(\hat x_{ML}\) is an unbiased estimate: \[\begin{align*} b(x) &= E[\hat{x}_{ML} - x \mid x] \\ &= E\left[(A^\TR (BB^\TR)^{-1} A)^{-1} A^\TR (BB^\TR)^{-1} y - x \mid x\right] = 0 \end{align*} \]
- The covariance of the estimation error is: \[P_e(x)=E[(x-\hat x_{ML})(x-\hat x_{ML})^\TR|x]=(A^\TR(BB^\TR)^{-1}A)^{-1}\]
- \(\hat x_{ML}\) is efficient in the sense of the Cramér-Rao bound.
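The sketch below (a minimal illustration; the dimensions, random \(A\), \(B\), and the true \(x\) are assumed, not taken from the notes) forms the linear Gaussian observation \(y=Ax+Bw\), computes \(\hat x_{ML}\) as the WLS solution with weight \((BB^\TR)^{-1}\), and evaluates the error covariance \(P_e\) from the formula above.

```python
# A minimal sketch (assumed setup): ML / WLS estimation for the linear
# Gaussian observation y = A x + B w with w ~ N(0, I_r).
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 2, 4, 4
A = rng.standard_normal((m, n))          # full column rank with probability 1
B = rng.standard_normal((m, r))
x_true = np.array([1.0, -2.0])

y = A @ x_true + B @ rng.standard_normal(r)

S = B @ B.T                              # noise covariance of B w
W = np.linalg.inv(S)                     # WLS weight matrix (BB^T)^{-1}
P_e = np.linalg.inv(A.T @ W @ A)         # error covariance (A^T (BB^T)^{-1} A)^{-1}
x_ml = P_e @ (A.T @ W @ y)               # x_ML = P_e A^T (BB^T)^{-1} y

print("x_ML =", x_ml)
print("P_e  =", P_e)
```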
- Example: Consider the following linear Gaussian observation: \(y=ax+w\) where \(a\) is a nonzero real number and \(w\sim\mathcal{N}(0,r)\) is the observation noise.
- Maximum a Posteriori Estimation: To compute \(\hat x_{MAP}\), it is assumed that the a priori density of \(x\) is Gaussian with mean \(m_x\) and variance \(p_x\): \[x\sim\mathcal{N}(m_x,p_x)\]
- The conditions of Sherman’s Theorem are satisfied and therefore: \[\begin{align} \hat x_{MAP}=&E[x|y]\\\\\\ =&m_x+\frac{p_{xy}}{p_y}(y-m_y)\\\\\\ =&m_x+\frac{ap_{x}}{a^2p_x+r}(y-am_x)\\\\\\ =&\frac{ap_xy+m_xr}{a^2p_x+r} \end{align}\] where \(m_y=am_x\), \(p_{xy}=ap_x\), and \(p_y=a^2p_x+r\).
- Estimation bias: \[b_{MAP}=E[x-\hat x_{MAP}]=m_x-\frac{ap_xE[y]+rm_x}{a^2p_x+r}=m_x-\frac{a^2p_xm_x+rm_x}{a^2p_x+r}=0\]
- Estimation error covariance: \[p_{MAP}=E[(x-\hat x_{MAP})^2]=p_x-\frac{a^2p_x^2}{a^2p_x+r}=\frac{p_xr}{a^2p_x+r}\]
- Maximum Likelihood Estimation: For this example, we have: \[f_{y|x}(y|x)=f_w(y-ax)=\frac{1}{\sqrt{2\pi r}}\exp(-\frac{(y-ax)^2}{2r})\]
- With this information: \[\hat x_{ML}=\arg\max_x f_{y|x}(y|x)=\frac{y}{a}\]
- Estimation bias: \[b_{ML}=E[x-\hat x_{ML}|x]=x-\frac{ax}{a}=0\]
- Estimation error covariance: \[p_{ML}=E[(x-\hat x_{ML})^2|x]=E[(x-\frac{ax+w}{a})^2]=\frac{r}{a^2}\]
- Comparing \(\hat x_{MAP}\) and \(\hat x_{ML}\), we have: \[\lim_{p_x\rightarrow +\infty}\hat x_{MAP}=\hat x_{ML}\] This means that when there is no a priori information about \(x\), the two estimates coincide.
- For the error covariance, we have: \[\frac{1}{p_{MAP}}=\frac{1}{p_{ML}}+\frac{1}{p_x}\]
- In other words, the information (inverse error covariance) after the observation is the sum of the information provided by the observation and the a priori information.
- Similarly, for the error covariance: \[\lim_{p_x\rightarrow +\infty} p_{MAP}= p_{ML}\]
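To check these closed-form expressions numerically, the sketch below (the values of \(a\), \(r\), \(m_x\), and \(p_x\) are assumed purely for illustration) draws Monte Carlo samples from the scalar model, compares the empirical error variances of \(\hat x_{MAP}\) and \(\hat x_{ML}\) with the formulas above, and verifies the information identity \(1/p_{MAP}=1/p_{ML}+1/p_x\).

```python
# A minimal sketch of the scalar example y = a x + w (assumed parameter values):
# closed-form MAP and ML estimates and a Monte Carlo check of p_MAP, p_ML and
# the information identity 1/p_MAP = 1/p_ML + 1/p_x.
import numpy as np

rng = np.random.default_rng(0)
a, r = 2.0, 0.5                 # observation gain and noise variance
m_x, p_x = 1.0, 4.0             # prior mean and variance of x
N = 200_000

x = m_x + np.sqrt(p_x) * rng.standard_normal(N)
y = a * x + np.sqrt(r) * rng.standard_normal(N)

x_map = (a * p_x * y + m_x * r) / (a**2 * p_x + r)   # MAP estimate
x_ml = y / a                                         # ML estimate

print("empirical p_MAP:", np.var(x - x_map), " theory:", p_x * r / (a**2 * p_x + r))
print("empirical p_ML :", np.var(x - x_ml), " theory:", r / a**2)
print("1/p_MAP:", (a**2 * p_x + r) / (p_x * r), " 1/p_ML + 1/p_x:", a**2 / r + 1 / p_x)
```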
- It is possible to include a priori information in maximum likelihood estimation.
- A priori distribution of \(x\), \(\mathcal{N}(m_x,p_x)\), can be rewritten as the following observation: \(m_x=x+v\) where \(v\sim\mathcal{N}(0,p_x)\) is the observation noise.
- Combined observation: \(z=Ax+u\) where: \[z=\bmat m_x\\\\y\emat, A=\bmat 1\\\\a\emat, u=\bmat v\\\\ w\emat\]
- The assumption is that \(v\) and \(w\) are independent. Therefore: \[u\sim\mathcal{N}\left(0,\bmat p_x& 0\\\\0 & r\emat\right)\]
- Maximum likelihood estimation with the prior included: \[\begin{align}\hat x_{MLp}(z)=&\arg\max_x f_{z|x}(z|x)\\\\\\ =&\arg\min_x \left(\frac{(m_x-x)^2}{p_x}+\frac{(y-ax)^2}{r}\right)\\\\\\ =&\frac{ap_xy+m_xr}{a^2p_x+r}=\hat x_{MAP}\end{align}\]
- \(\hat x_{MLp}\) is unbiased and has the same error covariance as \(\hat x_{MAP}\).
- Therefore \(\hat x_{MLp}\) and \(\hat x_{MAP}\) are equivalent.
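A small sketch of the same equivalence (continuing the scalar example; the parameter values and the single measured value of \(y\) are assumptions for illustration): the prior is written as the extra observation \(m_x=x+v\), stacked with \(y=ax+w\), and solved by WLS. The result matches the closed-form \(\hat x_{MAP}\) up to round-off.

```python
# A small sketch: the prior treated as an extra observation and solved by WLS;
# all numeric values below are assumed for illustration.
import numpy as np

a, r = 2.0, 0.5                 # observation gain and noise variance
m_x, p_x = 1.0, 4.0             # prior mean and variance
y = 3.1                         # a single measured value

A = np.array([[1.0], [a]])                     # combined observation matrix
z = np.array([m_x, y])                         # stacked "measurements"
W = np.diag([1.0 / p_x, 1.0 / r])              # weights = inverse noise covariance

x_mlp = np.linalg.solve(A.T @ W @ A, A.T @ W @ z)[0]
x_map = (a * p_x * y + m_x * r) / (a**2 * p_x + r)
print(x_mlp, x_map)                            # identical up to round-off
```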
Standard Kalman Filter
- Consider the following linear system: \[\begin{cases}x(k+1)=A(k)x(k)+w(k) \\\\ y(k)=C(k)x(k)+v(k)\end{cases}\] where \(x(k)\in\RR^n\) and \(y(k)\in\RR^m\) denote the state vector and measurement vector at time \(t_k\).
- \(w(k)\sim\mathcal{N}(0,Q(k))\) and \(v(k)\sim\mathcal{N}(0,R(k))\) are independent Gaussian white noise processes where \(R(k)\) is invertible.
- It is assumed that there is an a priori estimate of \(x(k)\), denoted by \(\hat x^-(k)\), which is unbiased with a Gaussian estimation error \(e^-(k)=\hat x^-(k)-x(k)\), independent of \(w\) and \(v\): \[e^-(k)\sim\mathcal{N}(0,P^-(k))\] where \(P^-(k)\) is invertible.
- The Kalman filter is a recursive algorithm for computing the state estimate.
- Output Measurement: Information in \(\hat x^-(k)\) and \(y(k)\) can be written as the following observation: \[\bmat\hat x^-(k)\\\\ y(k) \emat=\bmat I\\\\C(k)\emat x(k)+\bmat e^-(k)\\\\v(k)\emat\] Considering the independence of \(e^-(k)\) and \(v(k)\), we have: \[\bmat e^-(k)\\\\v(k) \emat\sim\mathcal{N}\left(0,\bmat P^-(k) & 0\\\\0& R(k) \emat\right)\]
- Applying the Weighted Least Squares (WLS) solution to this observation and simplifying with the matrix inversion lemma: \[(A+BD^{-1}C)^{-1}=A^{-1}-A^{-1}B(D+CA^{-1}B)^{-1}CA^{-1}\]
- and defining: \[K(k)=P^-(k)C^\TR(k)[C(k)P^-(k)C^\TR(k)+R(k)]^{-1}\]
- we obtain: \[\hat x(k)=\hat x^-(k)+K(k)(y(k)-C(k)\hat x^-(k))\]
- The updated state estimate is the a priori estimate plus a gain times the output prediction error, since the predicted output is: \[\hat y^-(k)=C(k)\hat x^-(k)\]
- \(K(k)\) is the Kalman filter gain.
- Estimation error covariance: \[P(k)=(I-K(k)C(k))P^-(k)\]
- Information: \[\hat x(k)=x(k)+e(k)\] where \(e(k)\sim\mathcal{N}(0,P(k))\)
- State Update: To complete a recursive algorithm, we need to compute \(\hat x^-(k+1)\) and \(P^-(k+1)\).
- Information: \[\begin{align}\hat x(k)=&x(k)+e(k)\\\\ 0=&\bmat -I& A(k)\emat\bmat x(k+1)\\\\x(k) \emat+w(k)\end{align}\]
- Eliminating \(x(k)\) from the above observation gives: \[A(k)\hat x(k)=x(k+1)+A(k)e(k)-w(k)\]
- It is easy to see: \[\hat x^-(k+1)=A(k)\hat x(k)\]
- Estimation error: \[e^-(k+1)=A(k)e(k)-w(k)\]
- Estimation covariance: \[P^-(k+1)=A(k)P(k)A^\TR(k)+Q(k)\]
Summary:
- Initial Conditions: \(\hat x^-(k)\) and its error covariance \(P^-(k)\).
- Gain Calculation: \[K(k)=P^-(k)C^\TR(k)[C(k)P^-(k)C^\TR(k)+R(k)]^{-1}\]
- Measurement update (\(\hat x(k)\), \(P(k)\)): \[\begin{align}\hat x(k)=&\hat x^-(k)+K(k)(y(k)-C(k)\hat x^-(k))\\\\\\ P(k)=&(I-K(k)C(k))P^-(k)\end{align}\]
- Time update (\(\hat x^-(k+1)\), \(P^-(k+1)\)): \[\begin{align}\hat x^-(k+1)=&A(k)\hat x(k)\\\\ P^-(k+1)=&A(k)P(k)A^\TR(k)+Q(k)\end{align}\]
- Return to the gain calculation and repeat the loop for \(k+1\).
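The summary above maps directly to code. The sketch below is a minimal Python implementation of one recursion (gain calculation, measurement update, time update); the `kalman_step` helper, the constant-velocity system, the noise covariances, and the initial prior are assumptions chosen purely for illustration, not part of the notes.

```python
# A minimal sketch of one Kalman filter recursion, following the summary above.
# The system matrices and noise levels below are an assumed illustrative example.
import numpy as np

def kalman_step(x_prior, P_prior, y, A, C, Q, R):
    """One measurement update followed by one time update."""
    # Gain calculation
    S = C @ P_prior @ C.T + R                          # residual covariance P_gamma(k)
    K = P_prior @ C.T @ np.linalg.inv(S)               # Kalman gain K(k)
    # Measurement update
    x_post = x_prior + K @ (y - C @ x_prior)           # \hat x(k)
    P_post = (np.eye(len(x_prior)) - K @ C) @ P_prior  # P(k)
    # Time update
    x_next = A @ x_post                                # \hat x^-(k+1)
    P_next = A @ P_post @ A.T + Q                      # P^-(k+1)
    return x_post, P_post, x_next, P_next

# Assumed example: constant-velocity model with position measurements.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])

rng = np.random.default_rng(0)
x_true = np.array([0.0, 1.0])
x_prior, P_prior = np.zeros(2), np.eye(2)
for k in range(50):
    y = C @ x_true + np.sqrt(R[0, 0]) * rng.standard_normal(1)
    x_post, P_post, x_prior, P_prior = kalman_step(x_prior, P_prior, y, A, C, Q, R)
    x_true = A @ x_true + np.sqrt(0.01) * rng.standard_normal(2)  # cov = Q

print("final a posteriori estimate:", x_post)
```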
Remarks:
- Estimation residual (innovation): \[\gamma(k)=y(k)-C(k)\hat x^-(k)\]
- Residual covariance: \[P_\gamma(k)=C(k)P^-(k)C^\TR(k)+R(k)\]
- The residual signal is used for monitoring the performance of the Kalman filter.
- Modeling errors, round-off errors, disturbances, correlation between the process and measurement noise, and other factors might cause a biased and colored residual.
- The residual signal can also be used for Fault Detection and Isolation (FDI).
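As a sketch of how such monitoring might look in practice (the `residual_test` helper and the threshold are assumptions, not from the notes): when the filter model is correct, the normalized statistic \(\gamma^\TR P_\gamma^{-1}\gamma\) is chi-square distributed with \(m\) degrees of freedom, so persistently large values flag a potential modeling error or fault.

```python
# A small sketch of residual-based monitoring, assuming the quantities from one
# Kalman filter step are available. The default threshold is an assumption
# (roughly the 99% chi-square quantile for m = 1).
import numpy as np

def residual_test(y, y_pred, P_gamma, threshold=6.63):
    """Return the residual, its normalized squared value, and an alarm flag."""
    gamma = np.atleast_1d(y - y_pred)                     # residual gamma(k)
    d2 = float(gamma @ np.linalg.solve(P_gamma, gamma))   # gamma^T P_gamma^{-1} gamma
    return gamma, d2, d2 > threshold                      # alarm if statistic is too large

# Hypothetical usage with quantities from one filter step:
#   gamma, d2, alarm = residual_test(y, C @ x_prior, C @ P_prior @ C.T + R)
```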
- The standard Kalman filter is not numerically robust because it involves explicit matrix inversion; for example, the computed error covariance matrix may lose positive definiteness because of round-off errors.
- There are different implementations of the Kalman filter that improve on the standard form in the following aspects:
- Computational efficiency
- Dealing with disturbance or unknown inputs
- Dealing with singular systems (difference-algebraic equations)