The only selling point of FSD (Supervised) is that it (can) work "everywhere." This is because it only relies on navigation information and what the car can see.
Waymo and similar companies all use HD Mapping. Ignoring the specifics, it can be thought of as a centimeter-level perfect reconstruction of the environment, including additional metadata such as slopes, exact lane positions, road markings, barriers, traffic signs, and much more.
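To make that kind of metadata concrete, here is a minimal sketch of what a single lane-level record in a hypothetical HD map tile might contain. The field names and structure are purely illustrative, not any vendor's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class LaneRecord:
    """One lane-level entry in a hypothetical HD map tile (illustrative fields only)."""
    lane_id: str
    centerline: list[tuple[float, float, float]]   # (x, y, z) points, ~cm-level accuracy
    left_boundary_type: str                         # e.g. "solid_white", "curb", "barrier"
    right_boundary_type: str
    speed_limit_mps: float
    slope_percent: float                            # longitudinal grade along the lane
    successors: list[str] = field(default_factory=list)  # lane_ids you can legally flow into
    signs: list[str] = field(default_factory=list)        # e.g. ["stop", "no_right_on_red"]

# A planner can query this prior instead of inferring everything from sensors in real time:
lane = LaneRecord(
    lane_id="tile_042/lane_7",
    centerline=[(0.0, 0.0, 0.0), (5.0, 0.02, 0.01)],
    left_boundary_type="solid_white",
    right_boundary_type="curb",
    speed_limit_mps=11.2,
    slope_percent=1.5,
    successors=["tile_042/lane_8"],
    signs=["stop"],
)
```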
HD Mapping is great when it's accurate and available. But it requires a ton of data and constant updating, or the car will get "lost," and realistically will never be implemented in general, at best in certain cities.
Reliance on HD Mapping gets you to "robotaxis" quicker and easier, but it doesn't and likely cannot scale.
It remains to be seen if Tesla can generalize FSD enough to reach the same level as HD Mapping everywhere. Still, they have shown that the current limiting factor is not what the car sees or knows but what it does with that information. It is unclear how or why HD mapping would help them at that point.
> HD Mapping is great when it's accurate and available. But it requires a ton of data and constant updating, or the car will get "lost," and realistically will never be implemented in general, at best in certain cities.
Waymo have said time and again they don’t rely on maps being 100% accurate to be able to drive. It's one of the key assumptions of the system. They use it as prior knowledge to aid in decision making. If they got "lost" whenever there was a road change, they wouldn't be successfully navigating construction zones in San Francisco as we've seen in many videos.
> They can also do constant updates because the cars themselves are able to detect road changes, self update maps and rollout changes to the entire fleet.
Which leads to mapping failures being unchecked, as the system that generated the data is the one checking the data by driving it. See bullet point 1 in their recent recall for an example.
> Prior to the Waymo ADS receiving the remedy described in this report, a collision could occur if the Waymo ADS encountered a pole or pole-like permanent object and all of the following were true:
> 1) the object was within the boundaries of the road and the map did not include a hard road edge between the object and the driveable surface;
> 2) the Waymo ADS’s perception system assigned a low damage score to the object;
> 3) the object was located within the Waymo ADS’s intended path (e.g. when executing a pullover near the object); and
> 4) there were no other objects near the pole that the ADS would react to and avoid.
> the Waymo ADS’s perception system assigned a low damage score to the object;
and Tesla would do better how in this case? It also routinely crashes into stationary objects, presumably because the system assumes it wouldn't cause damage.
> and Tesla would do better how in this case? It also routinely crashes into stationary objects, presumably because the system assumes it wouldn't cause damage.
Are the Teslas in the room with you right now?
Please point out in my comment where I mentioned Tesla. I can wait.
The changes can additionally be checked by humans, although not always.
> We’ve automated most of that process to ensure it’s efficient and scalable. Every time our cars detect changes on the road, they automatically upload the data, which gets shared with the rest of the fleet after, in some cases, being additionally checked by our mapping team.
Doesn’t mean it’s foolproof. But the benefits far outweigh the drawbacks.
Waymo doesn't serve any snowy locales yet. But sure, years and years ago mapping was worse than it is today? The mapping used today is working quite well in warm weather locales.
> Reliance on HD Mapping gets you to "robotaxis" quicker and easier, but it doesn't and likely cannot scale.
If you can make the unit economics work for a large quantity of individual cars, mapping is a small fixed cost.
I agree that it's not economical to map every city and road in the US, since you need to generate revenue from every mapped road and city. So you can think of building HD maps like building roads: they will be built in lucrative places. Cruise and Waymo won't make money from putting taxis in nowhere Arkansas, so they don't need to map it.
> the current limiting factor is not what the car sees or knows but what it does with that information. It is unclear how or why HD mapping would help them at that point.
That's simply untrue. All the hard stuff continues to be reliability and sensor gated. Cruise and Waymo have amazing sensors and even they struggle with sensor range, sensor reliability, model performance on tail cases, etc. For example, at night these cars typically do not have IR or Thermal sensing. They are relying on the limited dynamic range of their cameras + active illumination + hoping laser gets enough points / your object is reflective enough. Laser perception also hits limits when lasers shine on small objects (think: skinny railroad arm). Cars also have limits with regard to interpreting written signs, which is a big part of driving.
Occlusions are still public enemy #1. Waymo killed a dog. Cruise crashed into a fire truck coming out of a blind intersection even though their sensors saw the truck within 100ms.
LiDAR and HD mapping together are supremely useful, even if you don't drive with them, for enabling you to simulate accurately. You cannot simulate reliably while guessing at distances and locations. HD maps let you use visual odometry to localize, and distance measurements grounded in physics backstop the realism of your simulation, at least in terms of the world's shape.
Tesla lacks the ability to resim counterfactuals with confidence since they don't have HD ground truth. There are believers at the company that maybe you could make "good enough" ground truth from imagery alone but that in and of itself is a huge risk, and it's what skipping steps looks like. Most in the industry agree that barring a major change in strategy they just have no way to regression test their software to the level of reliability required for L4 / no human supervision.
The obvious thing to do is to just have every Waymo robotaxi or car with licensed Waymo tech report in its daily mapping/obstacle data to the mothership, so you can get new changes almost immediately.
I dunno if said data would be as high quality as dedicated HD mapping cars, but it's probably at least decent, given the variety of cameras and lidars every Waymo car has.
Further, it seems to me that if you brake hard to avoid a dog, your car should warn me as I’m approaching. I’m not sure why we are trying to teach each car to drive when we could be teaching all the cars and the road to drive.
> Further, it seems to me that if you brake hard to avoid a dog, your car should warn me as I’m approaching.
What does this mean? Electric cars are already required to emit a sound as they drive.
I guess if it has to brake hard for something, honking might be a good idea, but I wouldn't want cars to constantly be beeping at everything in their vicinity if there's no imminent crash.
> I’m not sure why we are trying to teach each car to drive when we could be teaching all the cars and the road to drive.
I'm not sure what you mean. Presumably Waymo's software is the same across its fleet. They're not training one car's model at a time.
Well, if your car brakes hard to avoid a dog, your car should warn me. I’m not sure how to make this concept simpler so I can only repeat it.
> Electric cars are already required to emit a sound as they drive.
I know.
> I guess if it has to brake hard for something, honking might be a good idea, but I wouldn't want cars to constantly be beeping at everything in their vicinity if there's no imminent crash.
If you think that, in a discussion about robot cars that drive themselves being conducted on a hacker website, I'm suggesting that cars communicate their sensor data to each other by honking their horns, I'm really not sure what to tell you other than yes, this would be profoundly dim-witted.
> I'm not sure what you mean.
I believe it.
> Presumably Waymo's software is the same across its fleet. They're not training one car's model at a time.
I believe it. I also believe you’re deeply missing the point, perhaps intentionally.
Agreed, but having the raw data is still useful, especially for less-used routes where it's not economically feasible to send out dedicated mapping cars all the time.
I'm just speculating here, but I can envision a few ways of dealing with the cost problem in scaling an HD mapping-based robotaxi fleet:
1. Robotaxi companies might simply stand to make enough money to cover the cost of routine HD mapping. Anywhere the revenue from putting taxi services in a new city sufficiently outweighs the cost of implementing the necessary updates, won't companies do it? We could think of these companies as having similar economics to Uber, but replacing the cost of paying drivers with the cost of routine HD mapping updates.
2. Smaller towns have less frequent construction, so the update costs might be lower as you target less dense areas.
3. I could see a single company that specializes in providing routinely updated maps to a variety of fleet-operating companies. This could potentially be a utility or somehow subsidized by the government. It would also be possible for government to coordinate construction with HD mapping updates. After all, by lowering the rate of accidents and decreasing square footage devoted to cars, governments have a vested interest in seeing robotaxis replace human-owned and driven cars.
> Tesla lacks the ability to resim counterfactuals with confidence since they don't have HD ground truth.
Tesla does have HD ground truth data for verification generated by their own LIDAR-equipped vehicles. However, according to a recent tweet by Elon Musk [1], they don't need LIDAR for that anymore.
> That's simply untrue. All the hard stuff continues to be reliability and sensor gated.
IR and thermal sensing are unnecessary if the bar is human level, and lidar isn't needed either. The point is overused, but humans rely on two eyes in the driver seat. I don't see any evidence to suggest the modern model that Tesla has developed for their vision system is their limiting factor in the slightest to reach L4/L5.
Dogs jump into the road in front of cars all the time and get killed, and kids get endangered at school bus crossings. That's a reality of life that robotaxis do not need to solve.
That vision-only argument is marketing spin from Tesla. The biggest thing it leaves out is that humans process their vision input with a human brain, which Tesla vehicles very much do not have. If and when we create true AGI they will have a good argument, but a world where that exists will be wildly different from our current one and who knows if Tesla's tech will even be relevant anymore.
Why are you so confident that AGI, or a human brain, is necessary to be able to drive a car with only cameras?
I get annoyed with statements like this because technology changes and advances so quickly, and Tesla has made substantial technical leaps in this field of machine learning. They have the state-of-the-art vision -> voxels/depth models and are only improving.
Tesla, who use cameras only, have not demonstrated full self driving, despite trying for a decade. Elon Musk has stated "It is increasingly clear that all roads lead to AGI. Tesla is building an extremely compute-efficient mini AGI for FSD" [1]
Waymo, who use additional sensors like lidar, have a driverless taxi service which needs no safety drivers.
Waymo does have safety drivers, they're just driving the vehicle remotely when it's in certain areas instead of being in the vehicle. So it isn't "full" self driving either.
> Tesla, who use cameras only, have not demonstrated full self driving
There are entire youtube channels with hours of continuous video showing Teslas driving around SF, but also other parts of California, with no human intervention.
No, Waymo is not driving remotely. Remote operators can only answer simple questions. They're at the point of commercialization so it's all about unit economics. There's no point in driving remotely especially since it does not scale cost-wise.
Waymo is geofenced, but within its geofence it requires zero human intervention. Tesla on the other hand is famous for mistaking the moon for a traffic light. Saying "Tesla has so many miles on YouTube" is hilarious because first of all there are channels with lots of Cruise & Waymo footage too, and more importantly it's not the # of miles that matters, but the # of non-trivial scenarios you can handle.
I don't see why Tesla can't handle those scenarios if they also use remote operators. I wouldn't be surprised if they do.
Btw Waymo is nowhere near achieving unit economics. Their cars cost like 5 times what Teslas cost, and the sensors require a lot of upkeep and maintenance.
Who’s to say what unit economics they’ve achieved, but I’d hazard a guess that their investors wouldn’t support expanding their fleet and service unless the unit economics are at least close to break even. Cost for sensors and overall BOM keeps going down as more suppliers enter the market.
Are you saying that there are times when a Waymo car's ability to respond to events is at the mercy of a random Internet connection? What happens if the safety driver is steering remotely, from another town, and there's packet loss for a couple of seconds in the middle of a curve?
Again, they don’t drive or steer remotely. What sometimes happens is a multiple choice question is presented to the operator in an ambiguous situation:
<photo of construction zone>
Can I drive through here?
[Yes] [No]
When this is happening, the car is stopped and lets the passenger know that it’s reaching out for remote help to figure out what to do. For me this has happened two times across my 125 Waymo rides (571 miles) so far, and was resolved in under 20 seconds. Though I must say, 20 seconds feels like ages when you’re in the car and blocking traffic!
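A minimal sketch of that flow as described above, assuming hypothetical `send_to_ops` / `poll_answer` hooks (nothing here reflects Waymo's actual interfaces): the car holds position, asks a constrained question, and takes the conservative path if no answer arrives in time.

```python
import time

def request_remote_assist(prompt: str, send_to_ops, poll_answer,
                          timeout_s: float = 60.0) -> bool:
    """Toy sketch of constrained remote assistance (names are hypothetical).
    The vehicle sends a yes/no question to a remote operator and waits;
    the operator never receives direct steering control in this model."""
    send_to_ops(prompt)                 # e.g. "Can I drive through here?" plus a camera snapshot
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        answer = poll_answer()          # returns True / False / None (no reply yet)
        if answer is not None:
            return answer
        time.sleep(0.5)
    return False                        # conservative default: stay put / reroute
```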
Eyes which are orders of magnitude more capable than the best cameras, and Teslas come with mediocre cameras, not the best. Eyes which are connected to a brain, and ML is a looooong ways from rivaling that.
> That's a reality of life that robotaxis do not need to solve.
Robotaxis do not need to account for things jumping out unexpectedly in front of them?
I am not sure that the vision in Teslas is adequate with -any- amount of processing to drive a car. Spatial resolution is limited, as is seeing distant vehicles during merges, etc.
Secondarily, there is no guarantee that the amount of processing is enough, because the extant human systems use much more.
“Cheating” by using more sensors to simplify out complexities and to cover for the shortcomings of other sensors in the suite seems wise.
Orders of magnitude more capable than the best cameras? I wish. I need corrective lenses for my eyes to even work at all. With that fixed they feed my brain an image that's upside down, black and white except in the centre, which is covered in blood vessels and which has a blind spot. They also take a long time to adjust to sudden changes in lighting conditions, don't do any true depth sensing, suffer frequent frame drops and can't run for more than about 20 hours at a time before they basically stop working.
My brain tries to hide all this from me, and makes me think that I see the world in glorious 3D technicolor all the time, but that's a lie, as revealed by the many amusing optical illusions that have been discovered over the years.
Meanwhile, today I used ML that knows more than me, can think and type faster than me, which is a much better artist than me and which can read and react far faster than me to visual stimuli. Oh, it can also easily look in every direction simultaneously without pausing or ever getting distracted or bored.
Somehow it doesn't feel like I have a big advantage over computers when it comes to driving.
Are we talking about Tesla's cameras or the "best" cameras? There are smartphone cameras that do depth sensing and HDR, and cameras are cheaper than eyeballs so composing them to get more angular resolution seems OK.
ToF/structured illumination cameras are honestly not that capable.
The maximum dynamic range of the eye is ~130dB. It's very difficult to push an imaging system to work well at the dark end of what the eye will do with any decent frame rate.
It's not as different as it used to be, but even so: the Mk. I eyeball does pretty damn well compared to quite fancy cameras.
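For scale, converting those dynamic-range figures to linear contrast ratios, using the 20*log10 convention common for image sensors; the ~70 dB figure for a plain non-HDR sensor is an assumed ballpark, not a measurement:

```python
def db_to_contrast_ratio(db: float) -> float:
    """Convert a dynamic-range figure in dB to a linear contrast ratio,
    using the 20*log10 convention common for image sensors."""
    return 10 ** (db / 20)

print(f"eye, ~130 dB:                   {db_to_contrast_ratio(130):>12,.0f} : 1")  # ~3,162,278 : 1
print(f"plain sensor, ~70 dB (assumed): {db_to_contrast_ratio(70):>12,.0f} : 1")   # ~3,162 : 1
```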
> There are smartphone cameras that do depth sensing and HDR
Depth sensing is, again, either estimated or done with time-of-flight sensors, which are pretty much short-range lidar. HDR is used already in AV perception, but still loses to your eyeballs in dynamic range and processing time.
Eyeballs have high dynamic range but with high mode switching times. Walk from a bright area to a dark area and it'll take seconds for your eyes to adjust. Cameras are so cheap you can just have a regular day camera and a dedicated night vision camera together, switching between feeds can be done in milliseconds.
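A toy sketch of that hand-off, with a made-up lux threshold standing in for whatever a real system would actually key the switch on:

```python
import numpy as np

def pick_feed(day_frame: np.ndarray, night_frame: np.ndarray,
              lux_estimate: float, threshold_lux: float = 10.0) -> np.ndarray:
    """Toy feed selector: below an (assumed) ambient-light threshold, hand off to the
    dedicated low-light camera; otherwise use the day camera. A real system would
    blend and apply hysteresis rather than hard-switch."""
    return night_frame if lux_estimate < threshold_lux else day_frame

# Example: two synthetic 1080p frames and a dusk-level light reading
day = np.zeros((1080, 1920, 3), dtype=np.uint8)
night = np.ones((1080, 1920, 3), dtype=np.uint8)
frame = pick_feed(day, night, lux_estimate=4.2)   # selects the night feed
```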
Robots aren't humans. You need accurate depth perception to maneuver a robot precisely, and you need ground truth depth measurements to train learned depth perceivers as well as to understand their overall performance. Humans learn it by combining their other senses and integrating over a very long time using very powerful compute hardware (the brain). To date, robots learn it best when you just get the raw supervision signal directly using LiDAR.
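A minimal sketch of what "raw supervision signal directly using LiDAR" can look like, assuming a dense predicted depth map supervised only at pixels where a LiDAR return exists (shapes and names are illustrative, not any particular stack):

```python
import numpy as np

def sparse_lidar_depth_loss(pred_depth: np.ndarray,
                            lidar_depth: np.ndarray,
                            lidar_mask: np.ndarray) -> float:
    """L1 loss between a predicted dense depth map and sparse LiDAR returns,
    evaluated only at pixels where a return exists."""
    valid = lidar_mask.astype(bool)
    return float(np.mean(np.abs(pred_depth[valid] - lidar_depth[valid])))

# Toy example: a 4x4 depth map with LiDAR returns on 5 of the 16 pixels.
pred = np.full((4, 4), 10.0)
lidar = np.zeros((4, 4))
mask = np.zeros((4, 4))
lidar[0, 0], mask[0, 0] = 9.5, 1
lidar[1, 2], mask[1, 2] = 10.4, 1
lidar[2, 3], mask[2, 3] = 10.0, 1
lidar[3, 0], mask[3, 0] = 9.8, 1
lidar[3, 3], mask[3, 3] = 10.2, 1
print(sparse_lidar_depth_loss(pred, lidar, mask))  # mean |error| over the 5 returns
```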
> Walk from a bright area to a dark area and it'll take seconds for your eyes to adjust
You do realize cameras have the same issue, and that HDR isn't free / is very computationally intensive?
Your brain is _really really_ good at surmounting challenges including many that you did not mention. We don't know how to get close to this in terms of reliability when using cameras and ML alone. Cameras and ML alone can go very far, but every roboticist understands the problem of compounding errors and catastrophic failure. Every ML person understands how slow our learning loops are.
Consider that ML models used in the field have to get by with a fixed amount of power and RAM. If you want to process a temporal context of, say, 5 seconds at 10 Hz and 1080p resolution, how much data bandwidth are you looking at? Comparing what you see with your eyes with a series of 1080p photos, which is better? Up it to 4K: how long does it take to even run object detection and tracking with a limited temporal context?
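Rough numbers for that question, under the deliberately naive assumptions of raw, uncompressed 8-bit RGB frames and an assumed 8-camera rig (real pipelines compress and downsample):

```python
# Back-of-the-envelope bandwidth for a 5 s temporal context at 10 Hz, 1080p RGB, 8-bit.
width, height, channels = 1920, 1080, 3
bytes_per_frame = width * height * channels           # ~6.2 MB raw per frame
frames_in_context = 5 * 10                             # 5 seconds at 10 Hz
per_camera = bytes_per_frame * frames_in_context       # ~311 MB of context per camera
full_rig = per_camera * 8                              # ~2.5 GB across an assumed 8-camera rig

print(f"{bytes_per_frame / 1e6:.1f} MB/frame, "
      f"{per_camera / 1e9:.2f} GB per camera of context, "
      f"{full_rig / 1e9:.2f} GB across 8 cameras")
```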
Your brain is working with more temporal context, more world context, and has a much more robust active learning loop than the artificial systems we're composing today. It's really impressive what we can achieve, but to those who've worked on the problem it feels laughable to say you can solve it with just cameras and compute.
There are plenty of well respected researchers who think only data and active learning loops are the bottlenecks. In my experience they're focused on treating the self driving task as a canned research problem and not a robotics problem. There are as many if not more respected researchers who've worked on the self-driving problem and see deeper seated issues -- ones that cannot be surmounted without technologies like high fidelity sensors grounded in physics and HD maps.
Even if breadth of data is the problem and Tesla's approach is supposedly yielding more data -- there is also the question of the fidelity of said data (e.g. the distances and velocities from camera-only systems are estimated and have noisier Gaussians than ones generated with LiDAR). If you make what you measure, and your measurements are noisy, how can you convince yourself, or your loss function for that matter, that it's doing a good job of learning?
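A toy illustration of that last point: if the "ground truth" labels are themselves noisy, even a perfect model cannot drive the measured error below the label noise, so the loss floor hides how well it is really learning. The noise levels below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_depth = rng.uniform(5.0, 80.0, size=100_000)          # metres, synthetic scene

# Two hypothetical labelling pipelines (noise levels are illustrative, not measured):
lidar_labels  = true_depth + rng.normal(0.0, 0.05, true_depth.shape)   # ~5 cm label noise
camera_labels = true_depth + rng.normal(0.0, 1.50, true_depth.shape)   # ~1.5 m label noise

# Even a *perfect* depth model (predicting true_depth exactly) cannot beat the label noise:
perfect_pred = true_depth
rmse_vs_lidar  = np.sqrt(np.mean((perfect_pred - lidar_labels) ** 2))   # ~0.05 m
rmse_vs_camera = np.sqrt(np.mean((perfect_pred - camera_labels) ** 2))  # ~1.5 m
print(rmse_vs_lidar, rmse_vs_camera)
```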
It's relatively straightforward to build toy systems where subsystems have something on the order of 95% reliability. But robotics requires you to cut the tail much further. https://wheretheroadmapends.com/game-of-9s.html
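To make the "game of 9s" concrete, a quick compounding-reliability calculation under the simplistic assumption that subsystem failures are independent:

```python
# Chance that a pipeline of independent subsystems all work on a given decision.
def pipeline_reliability(per_stage: float, stages: int) -> float:
    return per_stage ** stages

for per_stage in (0.95, 0.999, 0.999999):
    r = pipeline_reliability(per_stage, stages=5)
    print(f"per-stage {per_stage}: 5-stage pipeline works {r:.6f} of the time")
# 0.95^5  ≈ 0.774 -> the whole pipeline fails roughly 1 decision in 4
# 0.999^5 ≈ 0.995 -> still roughly 1 failure in 200 decisions
# Adding 9s per stage is what moves the whole system toward L4-grade reliability.
```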
Agree 100%. And IMO it is worth remembering that a really significant share of collisions are caused by well known risk factors. For those of us who avoid being in those situations to begin with, the robotaxi would need to be a good bit safer than our average.
> I don't see any evidence to suggest the modern model that Tesla has developed for their vision system is their limiting factor in the slightest to reach L4/L5
For one, frame rate and processing rate on human eyes is way higher than cameras. Dynamic range is another. Also, Cruise and Waymo are some of the only companies that have hard internal data / ability to simulate how well their safety drivers do, and in the very same scenario what their software driver will do. Without LiDAR you can't build that simulation, and once you have that data if you continue to use HD Maps and LiDAR there's probably a good reason.
> Dogs jump into the road in front of cars all the time and get killed, and kids get endangered at school bus crossings. That's a reality of life that robotaxis do not need to solve.
Robotaxis need to avoid any accident that a human would be able to avoid.
> IR and thermal sensing are unnecessary if the bar is human level
See, you could say this if you had some data that showed that incidents per X miles (when the vehicle is driving at night) is sufficiently low, + if the software passes some contrived scenarios to gut-check its ability to see in the dark with the necessary reliability. But you don't have that data, do you? Someone has it though :) and I'd argue regulators should have it too.
> For one, frame rate and processing rate on human eyes is way higher than cameras.
I don't think it's exciting to say that you must have theoretical parity with something to use it for this use case. Tesla's solution monitors ~6? cameras at once with accurate depth in each. That's 6x more views than a human can see. I wish people would stop comparing apples to oranges.
> Robotaxis need to avoid any accident that a human would be able to avoid.
I never said anything to the contrary. Animals get hit all the time, not just because a human wasn't paying attention.
> Tesla's solution monitors ~6? cameras at once with accurate depth in each
No, the depth is estimated. It's not accurate, at least not in the way you need for L4.
> I never said anything to the contrary. Animals get hit all the time, not just because a human wasn't paying attention.
I was just clarifying what the bar is. The bar is that avoidable accidents need to be avoided. Nobody will get mad if a plane crashes due to unavoidable circumstances (freak accident where two engines go out due to bird strikes or something). People will stop flying in the plane when it becomes clear that the airline is not doing everything it can to avoid fatalities.
> The only selling point of FSD (Supervised) is that it (can) work "everywhere."
I seem to recall Musk saying in the last couple years that "full self driving will basically require AGI." This appeared to me to be extremely honest and accurate, though I believe that in the moment he was trying to promote the idea that Tesla was an AGI company.
I guess the cars can and will update the mapping in real time?
> at best in certain cities
If mapping a city is possible, so is mapping a highway, which is even easier.
If cars do update the maps themselves, they might require just a couple of human-driven passes of standard Waymo cars on a highway to generate the maps.
The obvious question here is "why not both". Use mapping data where you can, LIDAR and other sensors where you can, and visual cameras when you must. There's no reason to limit yourself to just one input type. Elon claims that, sure, but it doesn't seem like a given at all.