Imitation Learning (2023) (geohot.github.io)
86 points by surprisetalk 6 months ago | 37 comments



As someone who has been doing robotics and behavior cloning for a while, it’s incredibly funny to see the self-taught genius rediscover what’s been known in the field for 30 years and can be picked up by walking into the first two lectures of any decision-making/control-theory class.

In practice, deep-learning-based behavior cloning works when there is enough data for the model to interpolate to the next “frame”, the proof of which is GPT and other LLMs, whose primary training objective is practically behavior cloning. However, adding a neural network means there are some hallucinations, which can range from silly (for LLMs) to murderous (for self-driving). But this also presents a huge opportunity, in that anyone who can solve hallucinations can make headway into not one but two billion-dollar problems at once.
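To make that analogy concrete (my own notation, not from the comment above): vanilla behavior cloning and next-token prediction are the same maximum-likelihood objective, just with different names for "state" and "action":

```latex
% Behavior cloning: maximize the likelihood of expert actions given states
\mathcal{L}_{\text{BC}}(\theta) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}_{\text{expert}}}\big[\log \pi_\theta(a \mid s)\big]

% Next-token prediction: the same form, with the token prefix as the "state"
% and the next token as the "action"
\mathcal{L}_{\text{LM}}(\theta) = -\,\mathbb{E}_{x\sim\mathcal{D}}\Big[\textstyle\sum_t \log p_\theta(x_t \mid x_{<t})\Big]
```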


I have a similar reaction when somebody rediscovers something that specialists all know by default

However, the older I get, the more I appreciate these folks, because they are usually way more excited about the problem than people who know the systems by rote

In some cases you get a charlatan trying to recycle something, not open to feedback that they aren’t Kepler or whatever, but in my experience most of these folks are trying to figure the world out and are excited when something seems to fit

Now I try (not always successfully) to “yes, and” them into figuring out how their method compares to existing methods, with the idea that, hey, maybe they figured out an edge case by accident that we can all learn from

That said, this is geohot and he definitely isn’t new to BC/IL


Yea, I don't see the problem with someone rediscovering knowledge.

The question I would ask is, would geohot have been better to have gone to school to learn this and deferred all the work he has done?

Who is farther ahead, the people who already knew this from school, or the guy who is building and learning this as he goes?


I think both are good; people thrive on diversity. If everyone was taught exactly the same, they would think the same. Without some people with different experiences we would certainly miss out on some creations.


100% the person building and learning as they go

A PhD in ML is worth less to me than day to day operational product engineering experience utilizing the fast changing ML tooling landscape


> A PhD in ML is worth less to me than day to day operational product engineering experience utilizing the fast changing ML tooling landscape

This is how you get a four-week transition from "I can fix twitter search in an internship!" to "twitter search is unfixable." Ideally, a PhD teaches you to bust your behind doing a deep dive on a seemingly unsolvable problem for three to five years.

If only we could recalibrate ourselves to stop the search for easy money and remind ourselves that some problems are just genuinely hard.


I mean I’m talking generally not about him specifically

The best senior engineers on the planet have a BS and 30 years experience


Unfortunately, 'senior engineers' with 30+ years of experience are considered past their prime as engineers by most top tech companies in SV and elsewhere.

They are expected to be in management roles.


Obviously school puts you further ahead. Human development is based on writing, teaching, and then expanding from there.

The entire point of having a structured approach to teaching (and this includes apprenticeships and the like), is to accelerate the process of learning.


> would geohot have been better to have gone to school to learn this and deferred all the work he has done?

Could you remind me what he has done since his jailbreaking day, i.e. in the last ten years, that can't be described as "setting VC money on fire?" Because my opinion would be, that yes, the world would have been better off if he did go to school.


We have a product for sale: https://comma.ai

We raised $18.1M and have made $28M in revenue to date.

Where are you getting your narrative? Are you confusing comma with someone else?


This is a problem in tech in general, where people haven’t studied math beyond discrete math and counting stuff.

Some people think that SGD was invented 10 years ago by ML “scientists”.


He does seem open to asking. And not acting above anyone.

Everyone is a beginner at something, even OP.

Exploring curiosity, whether it's new to someone or to everyone, is a good quality.

I wonder how many of the students who have attended these classes have applied all that math that's existed for so long to a self-driving solution


[flagged]


I am talking about the accumulated error part near the end of the article, so your (uncharitable) assumption is not correct.


This is also a huge problem in offline RL (learning a policy using only a dataset). If done naively, the learned policy will keep accumulating errors because it enters areas that are not well covered. So the trick is to avoid these areas. In offline RL this is done by measuring epistemic uncertainty and using it as a regularization term in the loss function, so that the model learns to avoid those areas. This is a good blog post that explains it way better: https://jacobbuckman.com/2020-11-30-conceptual-fundamentals-...
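A minimal sketch of that regularization idea (my own illustration with made-up dimensions, not code from the linked post): train the policy against an ensemble of Q-functions and penalize actions where the ensemble disagrees, a common proxy for epistemic uncertainty.

```python
# Sketch: penalize the policy for state-action pairs where a Q-ensemble
# disagrees, i.e. where epistemic uncertainty is high.
import torch
import torch.nn as nn

def make_q(state_dim, action_dim):
    return nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                         nn.Linear(256, 1))

state_dim, action_dim, n_ensemble = 17, 6, 5          # made-up sizes
q_ensemble = [make_q(state_dim, action_dim) for _ in range(n_ensemble)]
policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                       nn.Linear(256, action_dim), nn.Tanh())

def actor_loss(states, uncertainty_weight=1.0):
    actions = policy(states)
    sa = torch.cat([states, actions], dim=-1)
    qs = torch.stack([q(sa) for q in q_ensemble], dim=0)  # (n_ensemble, batch, 1)
    q_mean = qs.mean(dim=0)
    q_std = qs.std(dim=0)      # ensemble disagreement ~ epistemic uncertainty
    # Maximize value, but penalize actions that leave the well-covered region.
    return (-q_mean + uncertainty_weight * q_std).mean()

loss = actor_loss(torch.randn(32, state_dim))
loss.backward()
```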


The bigger issue with offline RL in the real world (i.e. not Atari video games) has been the assumption of reward labeling. Who’s giving you reward labels at scale? In my opinion that’s why we haven’t seen any large-scale real-world success stories using offline RL.


I fully agree with you that instrumentation is one of the biggest barriers to state, action, trajectory and reward feedback

However, instrumentation assumes that there’s a control regime that could actually control whatever the system is mechanically, and that’s generally not true.

So it’s almost a chicken-and-egg problem: you can instrument non-autonomous-control systems in order to get state-action-reward data, but because you don’t actually have an actuated control system that you can specify and build mechanically, your targets for the state-action-reward tuples aren’t the same

That is to say, unless you’re actively collecting data from an autonomous system that’s being used non-autonomously, you’re not going to be able to transition from a non-autonomous control regime to an autonomous one


This sounds like a problem of a lack of scale. Intuitively, I do not think there can ever be a theoretical solution for this with no assumptions about the real world.

Essentially, say the world produces data from a distribution X, and you have a dataset drawn from a distribution Y. Your problem boils down to the fact that when you run a model trained on Y, it regularly ends up in states in X − Y for which it does not know what to imitate, which leads it further into out-of-distribution land and so on until your car crashes. Unless you limit the scope of X, you cannot come up with any real-world guarantees, but in the real world there is no limit on X. In the real world, tomorrow a road could cave in, which would likely be so OOD that no model (or human driver) in the world can save you.

So really your only solution, intuitively, is if Y is so large and expansive that it models X for all the cases we care about (reaching human-driver performance), or if your model is so big and generalizable that it generalizes to a good fraction of X while being trained on Y. With deep learning we have seen generalization from both more data and larger models, with more data showing much greater improvements, and that might be the only solution to this dilemma. Also, a little bird told me that in Tesla’s latest end-to-end approach, they train the model, run it in a car with a driver they hired, the driver intervenes at the last moment, they take the driver data and add it back to the dataset, and they repeat this ad infinitum with the hope of solving self-driving this way.

The problem with robotics is that even if you assume these problems are solved with data scale, in robots you are strictly limited in the model sizes you can use. Too big a model will not run in real time, making it effectively useless. But having a smaller model goes against all the trends we’ve seen in deep learning. This just makes robotics exponentially harder to solve than language or vision with our current paradigm of deep learning, though I won’t put it outside the realm of the possible. A few years back I strongly thought language would not be solved this way, and I could not have been more wrong.


I'm not 100% sure if I'm summarizing correctly, but essentially the current method Comma uses is: (1) collect tonnes of videos of "good" drivers plus their actions (steering etc.); (2) using an autoregressive model, predict action_t based on scene_t; (3) the issue is that because we rely on teacher forcing, the error accumulates, causing the model to go out of whack after 10 seconds.

I'm just making stuff up, but I'm assuming this is because of teacher forcing itself? LLMs, for example, model sentences by predicting the next word but retain the past state via the KV cache. Unsure if these systems have a KV-cache equivalent? I'm assuming LLMs are in a different regime due to the sheer amount of data as well.

Also, I guess a simulation-based system could work, i.e. by going around teacher forcing and instead doing generation every, say, 30 steps, and making the dataset itself larger by training on the errors themselves. Sadly this means one has to build the simulation system. But maybe Comma uses some encoder CNN plus some LSTM- (or attention-) based past-state thingo? tbh I'm just making stuff up - ignore me - brain dump.
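To show what I mean by "going around teacher forcing" (a toy sketch of my own, with made-up shapes and a hypothetical N_FREE_RUN constant): periodically feed the model its own prediction instead of the ground truth, so training sees some of its own drift.

```python
# Toy scheduled-sampling-style training step: mostly teacher forcing, but every
# N_FREE_RUN steps the model's own output is fed back in as the next input.
import torch
import torch.nn as nn

N_FREE_RUN = 30                   # assumed: free-run every 30 steps
model = nn.GRU(input_size=8, hidden_size=64, batch_first=True)
head = nn.Linear(64, 8)

def rollout_loss(frames):                        # frames: (batch, T, 8)
    h = None
    inp = frames[:, 0:1]                         # start from the true first frame
    losses = []
    for t in range(1, frames.shape[1]):
        out, h = model(inp, h)
        pred = head(out)                         # predicted next frame/action
        losses.append(((pred - frames[:, t:t+1]) ** 2).mean())
        # teacher forcing most of the time, free-running every N_FREE_RUN steps
        inp = pred.detach() if t % N_FREE_RUN == 0 else frames[:, t:t+1]
    return torch.stack(losses).mean()

loss = rollout_loss(torch.randn(4, 64, 8))
loss.backward()
```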


What he described is essentially lane keep assist, which has been in production cars and working great since at least 2012 (e.g. the Audi C7).


The imitation learning field has produced learning and inference algorithms that yield regret-minimizing policies whose error does not depend on the length of the sequence of decisions. Naive training will likely lead to “label bias”. NNs manage to delay it significantly due to very good learned representations of the data, but without proof that the learning algorithm has bounded regret, or that inference/learning is done over the joint loss (of the sequence of decisions), the error will most likely accumulate.
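For anyone who wants the quantitative version, the classic Ross & Bagnell / DAgger results (my paraphrase, stated loosely; not part of the comment above) bound how the per-step error compounds over a horizon of T decisions:

```latex
% If the learned policy disagrees with the expert with probability at most
% \varepsilon on its training distribution, then over T sequential decisions:
J(\hat{\pi}_{\text{BC}}) \;\le\; J(\pi^{*}) + O(\varepsilon T^{2})
\qquad\text{vs.}\qquad
J(\hat{\pi}_{\text{DAgger}}) \;\le\; J(\pi^{*}) + O(\varepsilon T)
% i.e. naive behavior cloning compounds quadratically in the horizon, while
% interactive methods keep the degradation linear in T.
```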


You may enjoy this paper: https://arxiv.org/abs/2307.14619


It feels like this issue has been known (at least in academia) for a long time. When I try to explain the problem I usually use the slides from the ICML 2018 talk by Yisong Yue [1]. And there are already methods (hard to use in practice) such as DAgger to correct for that accumulating error.

Out of the more recent advancements, I think ACT's [2] approach seems like the more interesting direction. Similar to MPC, it predicts (calculates) not only the next action but a sequence of actions, then discards all the actions except one, at each loop.

Mapping the input to only the current action seems to be prone to error, while mapping it to multiple steps is more robust.
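A rough, self-contained sketch of that receding-horizon pattern (all names and sizes here are made up; this is not the ACT reference code): predict a chunk of future actions at every step, but only execute the first one before replanning.

```python
import numpy as np

CHUNK, ACT_DIM, OBS_DIM = 8, 7, 16   # made-up sizes for illustration

def policy(obs):
    """Stand-in for a learned chunked policy (e.g. ACT-style): maps the
    current observation to CHUNK future actions."""
    return np.random.standard_normal((CHUNK, ACT_DIM))

class DummyEnv:
    """Stand-in environment with a minimal reset/step interface."""
    def reset(self):
        return np.zeros(OBS_DIM)
    def step(self, action):
        return np.random.standard_normal(OBS_DIM)   # next observation

def control_loop(policy, env, steps=100):
    obs = env.reset()
    for _ in range(steps):
        chunk = policy(obs)          # plan CHUNK actions ahead...
        obs = env.step(chunk[0])     # ...but commit to only the first, then replan
    return obs

control_loop(policy, DummyEnv())
```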

My hunch is that there are probably some pretty fundamental limitations to our current methods (Transformers etc.), which make applying them fine for things such as language, where error accumulation is essentially negligible (relatively speaking), but maybe not for cars and robots.

[1] https://youtu.be/WjFdD7PDGw0?si=auftqglrNNTxel-Y

[2] https://arxiv.org/abs/2304.13705


The problem with ACT is very similar to the problem with MPC: if your forward model of the world is not great, it will come back to bite you. So much so that afaik ACT policies can have difficulty transferring between two instances of the same robot.


I don't think ACT (unlike MPC) has an explicit use case for different embodiments. Or what do you mean by two different instances? Two different tasks/arms/robots? Their recent work (Octo policy or something) wasn't even that convincing in that evaluation.

But as you're a fellow Robot Learning (BC/LfD, I'm assuming) practitioner: in the field of BC (not IRL), what has been the more robust method in your case?


> I don't think ACT (unlike MPC) has an explicit use case for different embodiments. Or what do you mean by two different instances?

I mean two different Aloha setups (2 x 2x WidowX) in the same lab. This is the very minimal level of "generalization" you may expect, barely enough to be called generalization, but still, the learned policy seems to overfit to the individual robot's joint space quirks.

> in the field of BC (not IRL), what has been the more robust method in your case?

Different things for different definitions of robust. What's your definition?


You're absolutely right about ACT. What are your thoughts on diffusion policy line of work? In my personal experience, I found it way more robust than ACT. Have you had a chance to try it?


> What are your thoughts on diffusion policy line of work?

It's been interesting, for sure! More robust than ACT as long as you are not using action chunking (has the same problem as ACT). Downside is, it can be too slow/require too much data to train. I have a sneaking suspicion that at that scale of data even visual nearest neighbor stuff would be similarly robust.

Of course, no one has figured out an iota of useful generalization, which is sad across the field.


Whether imitation works as expected depends on your mental model of what the person/thing you are imitating is doing; otherwise you won't actually be copying.

If you wrongly assume that the person is deriving an absolute steering angle from the image then you will fail.

If you correctly assume that the person is deriving a relative/correction steering angle from the image then you will succeed.
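A toy illustration of that point (entirely my own, with hypothetical function names): the same expert demonstration can be labeled with either an absolute target or a correction, and only the latter keeps the learned policy in a feedback loop with what it actually observes.

```python
# Hypothetical illustration: two ways to label the same expert steering data.
def absolute_label(expert_angle):
    # "Copy the expert's wheel angle": replays the expert's state, and drifts
    # as soon as the learner's state differs from the expert's.
    return expert_angle

def correction_label(expert_angle, current_angle):
    # "Copy the expert's correction": a relative command that reacts to where
    # the learner actually is, closing the loop on accumulated error.
    return expert_angle - current_angle
```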


That is not the only problem, however, because if it were, self-driving would have been solved by 2018.

The problem is central to using ML tricks for anything: when you use maximum-likelihood estimates for training, you get no guarantees for unlikely situations. That’s why you get self-driving cars dragging someone stuck beneath them, or LLMs just straight up spewing fiction.


Well, yeah.

If you want to drive in a straight line, then some Kalman filtering would help.

If you don't want to kill too many people, then AGI would help.


So how is it that tree-search algorithms like AlphaZero or MuZero don't drift? I understand that the searching part helps reduce the epsilon errors, but with tree search of bounded depth, drifting should still occur, right?


One primary difference is between a discrete control space (think chess/Go moves) and a continuous one (think steering-wheel input). Even theoretically, the error in one is bounded linearly, while the other can have error that grows exponentially with the control horizon.


Is that the important difference? The article is talking about sensor error if I read that right, so there's not much opportunity for that in a game environment. It's not like one needs sensors to know the state of a board game, or even an atari game. The world is really its own _perfect_ model.

That's basically why all the RL approaches for game playing don't work nearly as well in the real world: because in the real world the policy is a function of state, action, and error.


Looking at the lane problem, wouldn't adjusted, less ambiguous road markings solve some of the self-driving car issues?


I’ve been having this hunch lately that using LVMs (GPT-4V, LLaVa, etc.) could be a solution for error correction, assuming the LVM has a sufficiently good world model


I thought he had left comma.ai, but the tone of the article is not one of reminiscence but of what is currently at hand.



