> The continuous equivalent in a sort of unification is Rao-Blackwellisation (done automagically by deep-learning from its training experience) which allows to pick the right associations efficiently kind of the same way that a "most general unification algorithm" allows to pick the right variable to unify the terms.
I don't know how to reconcile this statement about deep learning with my understanding of Rao-Blackwell. Can you explain:
- what is the value being estimated?
- what is the sufficient statistic?
- what is the crude estimator? what is the improved estimator?
Roughly, I think sufficient statistics don't really do anything useful in deep learning. If they did, they would give a recipe for embarassingly parallel training that would be assured to reach exactly the same value a fully sequential training. And from an information geometry perspective, because sufficient statistics are geodesics, the exploratory (hand-waving) and slow nature of SGD could be skipped.
Once you view prolog goal reaching as a game. You can apply Reinforcement Learning methodologies. The goal is writing a valid proof, aka a sequence of picking valid rules and variables assignment.
Value being estimated : The expected discounted reward of reaching the goal. The shorted the proof the better.
The sufficient statistic : The embedding representation of the current solving state (the inner state of your LLM (or any other model) that you use to make your choices). You make sure it's sufficient by being able to regenerate the state from the representation (using an auto-encoder or vae does the trick). You build this statistic across various instances of problems. This tells you what is a judicious choice of variable based on experience. Similar problems yield similar choices.
The crude estimator : All choices of have the same value therefore a random choice, The improved estimator : The choice value is conditioned on the current embedding representation of the state using a Neural Network.
You can apply Rao-Blackwell once again, based by also conditioning one-step look-ahead. (Or at the limit applying it infinitely many times by solving the bellman equation.)
(You can alternatively view each update step of your model, as an application of Rao-Blackwell theorem on your previous estimator. You have to make sure though that there is no mode collapse.)
You don't have to do it explicitly, it happens under the hood by your choice of modelisation in how to pick the decision.
I don't know how to reconcile this statement about deep learning with my understanding of Rao-Blackwell. Can you explain:
- what is the value being estimated?
- what is the sufficient statistic?
- what is the crude estimator? what is the improved estimator?
Roughly, I think sufficient statistics don't really do anything useful in deep learning. If they did, they would give a recipe for embarassingly parallel training that would be assured to reach exactly the same value a fully sequential training. And from an information geometry perspective, because sufficient statistics are geodesics, the exploratory (hand-waving) and slow nature of SGD could be skipped.