The Kalman Filter is a beautiful algorithm, probably among the top math algorithms of the 20th century. It gives a very fast solution to a specific optimization problem (optimal estimation of hidden/indirectly observed states in a linear dynamic system). But for practitioners, the computations and derivations are far less relevant than recognising the circumstances where state space / Kalman model are relevant/appropriate, and how to set up such a model (defining the transition and observation matrices of the model).
In this sense, the Kalman Filter in time-dynamic prediction/forecasting problems is like gradient descent in ML prediction problems, or the solution to the Normal equations in linear regression or FFT for discrete Fourier transforms. The algorithms are mathematical pieces of art, but the useful practical skills are how to set up your problems so that they're amendable to the solutions.
In case of Kalman / state space models, the main things that are estimated are positions of things that move in time, but not (directly) observed. These "things" can be concrete (drones, rockets in space, robots on Mars) or abstract (an idealized compuer cursor/pointer, the water level in a river, inflation or unemployment in macro-economics). When concrete "things" are estimated by traditional engineers and the dynamics are governed by physics, the tool of choice is usually (employer-paid) Matlab. For fuzzier economics and forecasting problems, nowadays there's a very good state space / Kalman Filter implementation in Python statsmodels by Chad Fulton[1], which closely follows the academic works from Harvey, Durbin and Koopman. The challenge in these problems is usually setting up a plausible underlying dynamics model, more so than the estimation/prediction part. Just like in linear regression, finding appropriate features and functional forms are the main practical challenge, not calculating the parameters and actual predictions.
It's worth noting that unlike KF and particle filters, none of the deep learning stuff that's making headlines these days is actually capable of properly modeling / simulating / approximating a dynamic system with a continuous time variable.
Most striking in deep learning, is that when all is said and done (i.e. unwrapping the so-called "recurrent" part of a deep net) all you're left with is a dumb function from Rn to Rp.
In other words, while the term "recurrent" might fool you into believing it, there isn't any actual feedback loop in deep nets, and they therefore can't properly model something whose output evolves over time.
I think there's something to be explored in adding actual feedback into deep nets.
The price is of course that the output is not a vector anymore: you must compute feed forward passes until the net "settles" to a fixed point or rather until the desired behavior is observed at the outputs.
Backprop also becomes a whole new game in that setting.
[EDIT]: but the prize is: your deep net, instead of being an approximation of a function from Rn to Rp, becomes an approximation of something that can transform a function into another function.
You are confusing that recurrent networks are trained unrolled, i.e. with finite history, with them beeing evaluated with finite history, which they generally arent.
The key advantage a kalman filter has is in its long term properties being predictable, things like observability, controllability, and bibo stability.
The key downside of kalman filters is that their nice properties only apply to linear systems and linear observations.
The EKF works by linearizing the process, and suffers on even moderately nonlinear models as a result. It also suffers from the Gaussian assumption. So people switched to unscented (tries to model the underlying probability better) and particle filters (same idea), each giving better accuracy for most all problems. Now plenty of problems are switching to full Bayesian models as the ultimate in accuracy.
So if you like KF, EKF, UKF, be sure to look into this entire chain of accuracy vs computation tradeoff algorithms.
Regarding dynamical systems, Neural ODEs hold promise as they are more analogous to the numerical solvers. I think with any dynamical system, purely data driven approach can be problematic because, as you say, the standard NN architectures are not great for it. You can however add physical constraints, e.g. minimize the Lagrangian as the objective function, to bring stability to these NN emulators.
> but the prize is: your deep net, instead of being an approximation of a function from Rn to Rp, becomes an approximation of something that can transform a function into another function.
Won't it be more like an approximation of a process instead of an approximation of a function? (I mean better described as that)
Thats not true if your formulate you dl system in the same "two steps" ways of the kalman filter. Stuff like DynaNet or "backprop kalman" are really good deep learning extensions of the kalman filter for non linear / perceptual inputs.
How do you explain results like https://arxiv.org/abs/1708.01885, which uses LSTMs to do exactly what a Kalman filter would do except with less manual work?
In many circumstances, particles filters are superior. Two major advantages off the top of my head are the ability to track multi-modal distributions (Kalman filters only track one peak) and apply non-Guassian likelihoods (Kalman filters assume Gaussian noise, while you can weight particles in a particle filter using any kind of distribution).
Particle filters do come at a cost of increased memory and processing usage. You need to propagate N particles at each epoch, each of which could have M dimensions of state (position, velocity, orientation, etc). So Kalman filters are likely preferred where memory and compute are at a premium.
Kalman filter are an optimal solution to the linear problem (PF are not). If you do not have a linear problem, of course there may be better alternatives. PF and KF solves different problems.
When applied to a linear system with Gaussian noise, the particle filter will (approximately) recover the solution of the Kalman filter. So yes, they solve different problems, but the particle filter is more of a generalization of the Kalman filter.
What does 'linear' mean in this case exactly? Does it mean, that if the measurements are in space (i.e. location), then Kalman filter can only be applied when location is changing linearly, i.e. speed is constant?
Linear problems are ones in the form of sum(a_i * f_i(x)). So all Newton ballistics are linear because acceleration is a*x^2. Sometimes people who deal in probably refer to systems where the noise is Brownian as linear as well. So, both descriptions of linearity apply to a regular Kalman filter.
There are Extended Kalman Filter and Unscented Kalman Filters that are designated to relax both of these constraints, but they aren’t perfect.
Actually the biggest challenge of particle filters is to avoid sample degeneracy : when significant parts of the posterior density are no longer covered by particles.
The particle filter is insanely expensive, relative to the Kalman filter. I had a project that I had to give up on because the particle filter just made it too time-consuming to do.
As you stated particle filters are computationally expensive. Also I found them more fidgety to configure for my application cases (in activity recognition).
The Extended Kalman Filter (EKF) relies on local linear linearizations to handle nonlinearity. This works for mildly nonlinear systems, but doesn’t do so well when the system is highly nonlinear (divergence happens).
The Unscented Kalman Filter (UKF) is a kind of particle-like filter but instead of propagating a large number of points, it only propagates carefully chosen “sigma points”. It’s much more economical and works relatively well with nonlinear systems.
To me, the gold standard in estimation is the Moving Horizon Estimator (MHE) [1] which solves an explicit optimization problem to arrive at the state estimate, which allows it to accommodate constraints in the system. It is however computationally much heavier than the Kalman family of filters.
I’ve implemented UKFs in the past and think they’re a good trade off between computational speed and accuracy in the presence of nonlinearities. They cannot handle constraints however, which is a major weakness that MHEs don’t succumb to. (Imagine you are estimating a quantity x in the interval [0, 1] - UKFs can return estimates like 1.32 or -0.3 whereas MHEs accommodate bound constraints like 0 <= x <= 1 as well as more complex constraints to prevent weird outputting impossible or weird state estimates)
I mayve read the presentation wrong but it says 3s/iteration on the MHE. That’s insane. I’ve run close to real time UKF for nav and control applications so it’s an apples vs oranges comparison.
For real time mechanical and electrical applications, the MHE is generally not competitive (at present) However, there many types of dynamic systems that are much slower like chemical plants (minute to hour level control) that can accommodate MHE’s execution time scales.
If you need to simultaneously handle nonlinearities and constraints, it may be a superior choice.
Also there are potential ways to make MHE run faster. For certain forms of the optimization problem (like a linear or quadratic program), it’s possible to convert it to a lookup table (search for “explicit MPC”) which can then be implemented in an embedded system for fast lookups.
That said, I agree with you in practice. An nonlinear optimization problem tends to be a numerical black box and can be hard to diagnose when things go wrong. The Kalman family of algorithms is much more straightforward (only involves function evaluations), faster and have failure modes that are much more easily understood. They just don’t do so well with constraints.
Right now, yes. There was a time where doppler filtering was too expensive in programmable COTS HW, or digital beamforming, or SVD... What's missing here is the gain vs optimisation effort. If there is an order of magnitude of gain here, we'll reduce the latency one way or the other. I see impossible-20-years-ago heavy-duty solvers in RT sensors more and more.
Also depending on your refresh rate 3s might still be worth it IF the filter gain is so much better.
Have you heard of the "Unscented Kalman Filter"? It goes by UKF. It uses a different approximation scheme than the EKF, which creates a linear approximation of the dynamics, in order to propagate the mean and variance. The UKF computation using "sigma points" resembles a specific part of the particle filter.
And why the name? Well the EKF stinks in many applications (it diverges). The UKF is more robust, especially when there is more significant uncertainty.
We upgraded from an EKF to a UKF for our drone. I feel like our code is a good reference for using an Unscented Kalman Filter to estimate a drone’s rotational state, later this year I expect to add an optical flow and ToF sensor and update the filter for better velocity and displacement estimation.
Another extension is the ensemble Kalman filter (https://en.wikipedia.org/wiki/Ensemble_Kalman_filter), which can work well for certain kinds of nonlinear systems, and when it does it can be more robust than particle filters. (Particle filters can "collapse"; this limits their usefulness.)
There's a good description of how to build a trading strategy using a Kalman Filter along with some code examples in Ernest P. Chan's book on algorithmic trading. I think it might be available for free online, although that might have just been something I was able to access through my university.
I feel like I owe it a shoutout because that strategy was the foundation for an algorithm that got me and a couple friends a top finish at UChicago's algo trading competition a few years back.
"These simulations used 36-bit floating point arithmetic, which was adequate for trajectory simulations but marginal for implementing the Riccati equation solution for the measurement updates in the Kalman filter. Performance was not reliable in 36-bit floating point arithmetic, and the Apollo flight computer would have to implement the Kalman filter in 15-bit fixed-point arithmetic. Microprocessors were still a long way off. J.E. Potter was at that time a graduate student in mathematics at MIT, working part time at the Instrumentation Laboratory on the Apollo Project. He took the problem home with him one Friday afternoon and arrived back on Monday with a solution. The trick was to use a Cholesky factor of the covariance matrix as the dependent variable in an equivalent Riccati equation. The solution was brilliant, and the improvement was profound. The approach came to be called square-root filtering, and alternative implementations of square-root filters with better numerical stability were soon discovered."
Yes, the KF was developt for the Apollo project by Kalman. As input in the update steps they used the IMU and the astronauts had to manually measure angles to specific stars.
One thing I wonder (maybe not a sensible question, I haven't really thought it through): A Kalman filter update is superficially like a HMM Forward Algorithm step, right? You have some belief, you have a new observation, and you have some new belief after.
Are there similar equivalences with other HMM algorithms for different questions? A Viterbi-like algorithm for the most likely path, for example, distinct from the series of Kalman-update argmaxes?
i've always just thought of kalman filters as continuous time/space equivalents of hmms, replacing multinomials with gaussians.
so yeah, you can do things like e-m to learn the parameters of the observation and state update models from data (just like fitting an hmm in the fully observed case), or given state and observation models, and past observations you can predict future state given current state. (what it was made for, where state is the state of your rocket given the physics of motion and all the noisy sensors or whatever, which if you think about it is like a viterbi search for the best fitting path through the continuous state space)
These are all unified as different modes of message passing in probabilistic graphical models. Kalman filters are indeed structurally similar to an HMM without the backward message passing step if viewed as a PGM.
Agreed. And, Bar Shalom's forward/backward alg for maneuvering, nonlinear models from Maneuvering Target Tracking is essentially one full execution PGO.
The connection you are drawing is not incorrect in my opinion. But perhaps I can clarify.
The dynamics model in a KF is a linear system with an input and and output (often denoted u and y respectively).
A Hidden Markov Model typically does not have an input. There is an initial state, and then the system transitions "autonomously" (that's a term I'm borrowing from differential equations).
An "Markov model" with input is called a Markov Decision Process (MDP). And if the state of the system is not fully observable, then it's a POMDP.
So Kalman filters are most analogous to "belief updates" in POMDPs. The KF takes into account the known input to the system.
I made a tiny paper for a masters thesis on KF for recurrent neural network training in 2002. 15 or 20 Neurons (a recurrent layer of a handful of neurons) was massive in terms of the compute time needed to train it. To even get it off the ground we had to rewrite all the matlab into C++. Recurrent networks for modeling dynamic systems where all the rage they but I haven’t really kept track of where the field went. Did the deep learning revolution completely drop recurrent connections?
A simple way to ease into understanding Kalman filters through experimentation is to implement a simpler alpha-beta prediction filter. Here is the link to the Wikipedia article to get started: https://en.m.wikipedia.org/wiki/Alpha_beta_filter
The Kalman Filter has been one of the toughest concepts for me to grasp over the years. I have looked at many examples related to target tracking, which logically makes sense to me given physical behavior of flying targets.
But can anyone provide more applications that make learning the Kalman Filter easy?
I recommend the following two articles that helped me get a better working understanding of the Kalman Filter:
- you have observations that are normally distributed around a linear function of an unknown state
- you have some state update function where the next state is normally distributed around a linear function of the unknown current state
- your beliefs about your initial state are normally distributed
then a kalman filter will update your beliefs about the state of your system and do as good a job as could be hoped for. There are also simple generalizations that relax some of those assumptions.
This is one of those algorithms that anyone working with data should know. It answers the dynamical question of wtf is going on now, given what I’ve seen over time.
- because linearity, your observations can come from more than one distribution. (multiple sensors of different measurements with linear relationships between them)
I've always wanted to deeply, fundamentally understand a kalman filter, but I usually get lost in the math notation that comes up without the plain language explanation of what that math means.
For an intuitive understanding, try to build one in one dimension. Imagine you have a position that's drifting at a roughly constant, but noisy, rate, modeled by a normal distribution. You also have an occasional 1D GPS reading about that position. As building blocks, check out conjugate priors, specifically normal-normal [1] and the algebra of random variables, specifically the sum of normally distributed variables [2].
Briefly:
If you have a normal distribution as a prior and a normal distribution as new evidence, then applying Bayes Theorem on them results in a new normal distribution (times a constant we'll ignore). The calculation can actually be pretty simple - the mean is a weighted average, weighted by precision (precision is 1/variance), and the precision is the sum of the two precisions.
If you have two normal distributions modeling unknown variables a and b, and you know x=a+b, then you can model your knowledge of x as another normal distribution. The calculation for this one is also simple - its mean is the sum of the two means (intuitively has to be) and its variance is the sum of the two variances (I've seen several math books say it's surprising. It's very convenient though).
So you have all the pieces you need to do (new state guess = old state knowledge + noise from passage of time) and (new state knowledge = new state guess | new observation), where all the variables are normal distributions, and the new knowledge ends up being a normal distribution too, so you can use it as the old knowledge next time you get information.
This is a great suggestion. For any topics like these, starting in very low dimensional space and working through an example is a fantastic way to understand it.
I'd also suggest not only trying to imagine this in your head: implement it in code. That's how I truly learn these types of things.
For extra fun, you can make it into a game. After you've implemented the filter, create a way for a human to input a guess at the filtered output. Score it depending on how close it is. This way you get to practise your intuition for it too.
At the risk of sounding pompous and self-serving, try my book, I tried really hard to give the plain language explanation, and to build knowledge through intuition and thought experiments.
I think this is what I've always been looking for, the notation used to describe kalman filters is dense, this looks like my kind of read, thank you for sharing!
The KF maximizes the likelihood of the observations..
You can easily form the optimization problem, "what state x maximizes the likelihood of my observations z". I had a writeup to this effect in my class notes from Prof Roumeliotis' class. I'll see if I can post it, but you could probably derive it following the KF wiki page.
Ask me a question or two about particular math notations and I will try to answer it.
I'll just go ahead and explain indices and summation in the meantime already because these are so super-common that you basically cannot read math things without knowing them. I will denote things LaTeX-like, that is subscript with _ and superscript with ^, if multiple things go in there, they will be enclosed in {}.
---
First: indices, usually written as sub-script, are super-common. They might be separated by a comma if there are more than one (but they do not have to be. In that case, you need to check which variables make sense as indices.) Usually variables are just single letters and not superWordyDescriptionsOfThingsBecauseCodeCompletionIsDifficultWithPenAndPaper.
Examples:
a_i is like a[i]
m_{ij} is element at position i, j in matrix M, i.e. ~ M[i][j]
---
Summation is also very common, denoted via symbol Σ . That's the upper-case greek letter "sigma". This one will have a few things accompanying it: below it is a variable name and most likely an assignment of a value to it. That's the first value that the variable will take. Above, the last value the variable should have. The expression behind the symbol is evaluated for each of these values and all integers in between and summed up. (It might also appear with just the varaible name below it, in which case it means: sum over all values of the variable).
Example: Summing up the squares of all integers from 3 to 6
Multiplication: there is also a similiar one for multiplication that I will include because it is so easily explained when having discussed summation already. It's denoted by the greek upper-case letter Π (Pi)
Π works like Σ but the results of the expressions are multiplied instead.
Example: Π_{i=3}^6 i^2 = 3^2 * 4^2 * 5^2 * 6^2
Pseudocode:
for(i = 3; i <= 6; i++)
product *= i^2
This is only meant to help you translate a few formulas into forms you might easier work with. Following a university math course for a single semester or maybe two might get you a long way as well.
Why is this low effort post so high? There have been lots of interesting articles, visualizations and tutorials on Kalman Filters posted on here over the years. This guy just posts a Wikipedia link with no comments and you all think it is great?
I sense some bitterness in the comment. What makes or doesn't make the front page depends on a lot of things and it will not be the same every time. It will, for example, depend on other competing stories. There is a steady influx of new members. So stories that are old hat may not be so for a large section of HN visitors.
Yes, this was a Wikipedia link and Kalman filters has been discussed here many times, that said I found the discussion and commentary quite delightful.
Learning about Kalman Filters' deduced-reckoning-like combination of external observation with internal model gave me an idea about the origin of consciousness:
https://news.ycombinator.com/item?id=23475069
Interesting thoughts. When looking at your link which links to Shannon-Mind-Reading machine I immediately think of several things. One: branch prediction. The mind-reading machine seems a lot like a branch predictor. Two: I used this type of algorithm for a baseball game I wrote when writing the pitching /batting 'AI'. - In fact I stole it from the Intel hardware manual. Three: Rock Paper Scissors - people who are good at Rock Paper Scissors , I feel, use some type of prediction based on your tendencies and tells. Like the dog you mention... There's more in this deep well but I'll stop there.
Kaiman filtering is frequently used in high energy physics particle tracking. For inhomogeneous magnetic fields, the transport isn’t linear though, so we pair it with 4th order Runge-Kutta for state and covariant propagation. Works pretty well and is typically faster than the alternatives.
Kalman filters require that observed variables have Gaussian distributions.
Frequently this isn't the case. Lots of real world systems have more complex distributions, like for example "which lane of a highway am I in given this GPS position" (it's likely you are in one of the lanes, but possible although unlikely I am between lanes).
Particle filters use more computation than kalman filters, but they can deal with arbitrary (and unknown) distributions of variables. Considering very few applications of this kind of system are limited by compute power, a particle filter is almost always better suited. (especially when you have a GPU available, which is a perfect fit for particle filters).
For predictions, the Kalman Filter has it's drawbacks. They are clearly visible when you try to use it in conjunction with other algorithms based on state prediction such as Model Predictive Control. The main problem is that the KF does not have integral action.
One quick fix is to add a disturbance model to the KF, for example an integating disturbance. But then you have one more tuning parameter (the disturbance model). Which along the other KF tuning parameters increase complexity and becomes more difficult to implement and understand.
Personally, I have had more success with other types of filters, which have a similar structure as the KF, but include mechanisms for improved predictions, for example this one:
I’m pretty sure you can see this in action if you open the maps app on your phone and walk around outside. GPS measurements with uncertainty are periodically combined with accelerometer data to produce a better real time position estimate.
How a Kalman filter works, in pictures https://www.bzarg.com/p/how-a-kalman-filter-works-in-picture...
Derive yourself a Kalman filter https://ngr.yt/blog/kalman/
Kalman and Bayesian Filters in Python https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Pyt...