Congrats to the Open Robotics team and everyone who contributed to this release!
Calling out a couple changes I'm excited about that we (Foxglove) helped contribute to:
- MCAP is now the default logging format in Rosbag2[0]. This is a much more performant and configurable format than the previous default (SQLite-based). SQLite is still a fully supported alternative.
- Message type definitions (schemas) can now be exchanged at runtime[1]. This means that tools such as rosbag2, or visualization tools such as Foxglove Studio[2] can now communicate with a ROS system over the network without needing a copy of the source code or complete ROS workspace.
Robotics is really really really really hard. So let's turn the entire thing into a system of distributed microservices, using CORBA/DDS for pubsub...
It'd be like if you designed a video game so that the physics engine (motion planning) happened in one process, while rendering (perception) happened in another process and the world state needed to be exchanged via pubsub. Why do we keep doing this to ourselves?
I have been doing robotics for 20 years, and the "trend" to adopt a distributed architecture was already popular 15 years ago. ROS just adopted and continued that trend.
I think you are missing some important points:
- a large amount of data can be transferred asynchronously. Pub/sub is a great pattern for achieving that; the cases where you need a synchronous interface are the minority.
- robotics software has MANY components; you need an approach that incentivizes decoupling and loosely coupled interfaces.
- this loosely coupled architecture has been a catalyst for innovation and cooperation, because it makes it easier for people to share their code as "building blocks" of a larger system.
Is there a considerable overhead? Sure! But it is much more complicated to write thread-safe code in a huge, monolithic application.
I worked on a robotics project with a large team using ROS1. The loose coupling is pernicious: It's easy for everyone to work on their own ROS node in isolation and avoid testing the integrated system. There's no compiler to help you find all the clients of a node, etc.
Loose coupling is good if you mainly use open-source packages and modify one component for your research paper. I'm not sure it's a net positive when building most components yourself.
A big multithreaded program can still use queues instead of locks to share data between threads.
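For what it's worth, here's a minimal sketch (plain Python, all names made up) of that queues-instead-of-locks pattern inside a single process:

    import queue
    import threading

    # One producer thread reads a (pretend) sensor; the consumer owns the world
    # model and needs no mutex because all data arrives through the queue.
    readings = queue.Queue(maxsize=100)

    def sensor_thread():
        for i in range(1000):
            readings.put({"seq": i, "range_m": 1.0})  # stand-in for a driver read
        readings.put(None)                            # sentinel: we're done

    def model_thread():
        while True:
            msg = readings.get()                      # blocks until data arrives
            if msg is None:
                break
            # ... update the world model with msg ...

    t1 = threading.Thread(target=sensor_thread)
    t2 = threading.Thread(target=model_thread)
    t1.start(); t2.start()
    t1.join(); t2.join()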
So many of these comments sound like engineering process flaws.
Everyone's working on their own ROS nodes in isolation... with no thought to eventual integration testing? It doesn't seem like that is a fundamental shortcoming specific to ROS.
Pub/Sub is a very natural fit for robotics. Companies do write ROS out of their production stacks when they get the resources. But they don’t replace pub/sub architecture.
ROS provides a plug-in backplane that allows you to innovate in one area while leveraging existing components for parts that are not your differentiator.
pub/sub is NOT a natural fit for robotics - you want bounded timing, and for error handling you generally want to know what happened as a result of a message being published. ROS introduced "services" and "actions" to attempt to work around this, but it's all just shit piled on shit.
Just make a multi-threaded app and call functions. If you need to distribute over multiple CPUs, then go ahead and do some IPC. But pub/sub is NOT an architecture. It's a soup of tightly coupled fragments of functionality. It falls apart very quickly in the real world.
Actually I think it's more complicated than that. Pub/sub is a natural fit for robotics in a prototyping sense, but it's a poor fit for real-time systems which makes it harder to productionise something based on ROS. Especially safety critical systems.
"Just" a multi-threaded app and calling functions isn't really a good replacement. For something really basic, sure, but the modularity that ROS introduces, and the core principle of having serialisable recordable messages delivers a lot of value once you start building large systems - particularly those that can span multiple physical machines. But pub-sub is _too_ general as a concurrency model, if ROS had adopted something more constrained that was provably deterministic and was amenable to real time analysis then a lot of its concerns would go away.
It would also help to have compile-time guarantees about which nodes are running.
But I always like to say, why spend 5ns calling a function when you could spend 1ms waiting for a context switch or 50ms for the next event loop to tick over? Even modern microcontrollers are blindingly fast, and the majority of the time they spend these days is because of our brain-dead architecture decisions, not due to the underlying problem that actually needs solving.
> It would also help to have compile-time guarantees about which nodes are running.
I haven't had a chance to try it yet, but in theory I like the idea of something like Zenoh Flow to describe the larger data flow graph. https://zenoh.io/blog/2023-02-10-zenoh-flow/
> But I always like to say, why spend 5ns calling a function when you could spend 1ms waiting for a context switch or 50ms for the next event loop to tick over?
I think your context switch timing is off by several orders of magnitude, but regardless these things aren't one extreme or the other. For sharing data across threads, to an external system, or logging and visualization I still like pub/sub (and I've seen more than my share of horrible abuse), but it definitely shouldn't be treated as one size fits all.
In a best case, yes I've exaggerated wildly. But in a latency sensitive system, waiting for the message to get to the front of the queue could really take that time. Different queue priorities help, but really, why not just do a function call? To me, anything that isn't just a function call should really need to be justified. But I guess it's all design decisions.
Are you aware that ros service calls are RPCs (not based on pub/sub like actions)? Furthermore, if you use nodelets (http://wiki.ros.org/nodelet) you get zero copy communication between your algorithms. So I actually think that ROS has the facilities to address the needs you describe.
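To make that concrete, the client side of a ROS1 service call is just a blocking RPC. A rough sketch using the AddTwoInts type from the standard rospy tutorials (assumes a matching server is running):

    import rospy
    from rospy_tutorials.srv import AddTwoInts

    rospy.init_node("adder_client")
    rospy.wait_for_service("add_two_ints")                      # block until the server exists
    add_two_ints = rospy.ServiceProxy("add_two_ints", AddTwoInts)
    response = add_two_ints(2, 3)                               # synchronous request/response
    rospy.loginfo("sum = %d", response.sum)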
I'd be interested, though, in your suggestions on what a real architecture for robotics looks like in your mind? I still remember the time before ROS, 20 years ago, when each robotics team had to designate a sub-team just for building and maintaining the middleware. That was a waste of time and effort. But you seem to suggest that we go back to that? ROS might not be perfect but it's so much better than anything else that exists. It's also open source and we can all work together to make it better rather than reinventing the wheel each time.
The async nature of pubsub makes it great for isolating your part of the system, but moves the complexity to the system integration instead. Actually deploying a ROS based system is about as difficult as rewriting the whole thing from scratch as a monolith. Every time something goes over a pubsub it's like using a GOTO, except your debugger can't actually follow it, and you don't even know who was listening (or who wasn't that should have been) or what the downstream effects were. It makes it impossible to properly debug a system, because it isn't deterministic so you can never be sure if you've actually handled all of the edge cases, since there is a temporal component to the state that can't be reproduced.
A better system would take ideas from game engine design and realtime system execution budgets, with cascaded controllers on separate threads with dedicated compute resources for components that need higher update rates.
The reason ROS has traction is because of university labs, who just need something to work once to be able to publish their paper or write their dissertation. In industry the reliability requirements are much higher, and despite the intensive efforts from the ROS community to "industrialize" ROS via adoption of DDS, there seemed to be little understanding that the message protocol wasn't the reason industry uptake was so low.
> A better system would take ideas from game engine design and realtime system execution budgets, with cascaded controllers on separate threads with dedicated compute resources for components that need higher update rates.
This is how I've structured a control system, but using a pub/sub system to share data across those boundaries (and log/inspect in general). "Nodes" that share some resource or fundamentally run back to back based on the data flow can live in the same thread. Higher rate components (eg inner loop controller) live in a thread with a much higher priority. All of this is event driven, deterministic, and testable.
If you have more details about the system you're imagining I'd love to learn more, because so far I don't see what's incompatible with what you've described.
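In case it helps the discussion, here's a minimal sketch of that structure (plain Python, hypothetical names; a real system would set thread priorities and CPU affinity through the OS, which isn't shown): a slow planner and a fast inner loop in separate threads, sharing only a latest-value mailbox.

    import threading
    import time

    class Latest:
        """Single-slot mailbox: the fast loop always reads the newest setpoint."""
        def __init__(self, initial):
            self._lock = threading.Lock()
            self._value = initial
        def put(self, value):
            with self._lock:
                self._value = value
        def get(self):
            with self._lock:
                return self._value

    def compute_plan():        # stand-in for a real planner
        return 1.0

    def apply_control(sp):     # stand-in for writing to actuators
        pass

    setpoint = Latest(0.0)
    stop = threading.Event()

    def planner_50hz():
        while not stop.is_set():
            setpoint.put(compute_plan())
            time.sleep(0.02)

    def inner_loop_1khz():
        while not stop.is_set():
            apply_control(setpoint.get())
            time.sleep(0.001)

    threading.Thread(target=planner_50hz, daemon=True).start()
    threading.Thread(target=inner_loop_1khz, daemon=True).start()
    time.sleep(1.0); stop.set()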
In general PX4 is better than ROS in this respect, but IMO still leans too heavily on queues. A bit of feedback there:
- more should be done by setting up constrained, functionality-specific data, and then simply calling a function with just that data (a rough sketch follows below). Right now a lot of things are passed state they don't need just because it is part of the message they receive. This makes the code dependencies way harder to separate, because you effectively share function signatures (messages) between different modules. Of secondary concern is the extra memory bandwidth from the extra data passed around, and not being able to pass by const& due to the async.
- lots of things don't need updating at all until they tick over, but if you don't update them frequently they're working on old data. You can try to make sure they work with the latest data by updating them frequently, but that of course has big overhead. I don't see any decent way of making this work unless you either 1) set up global threadsafe state so everything can access it, which is bad from a dependency and locking perspective, or 2) just call functions synchronously with exactly the data they need.
- The issue with message queues, beyond the need for messages to generalize as mentioned above, is that often components need multiple different messages from different sources to perform their tasks. This means every component needs to keep a local copy of the data they need, to translate this async data back into a synchronous paradigm when it runs. In fact, beyond the message passing itself, pretty much everything needs to do its work in a synchronous paradigm, so why even add the async stuff in the middle to begin with? Once the messiness of different sensors' reporting rates is consolidated into the EKF state, and external commands and directives are brought in, after that point pretty much everything could be synchronous. No overhead, no timing issues, no "which message has the data I need", no "why do we have 3 different variants of the same message with slightly different data and which one should I use", etc.
As long as you log the initial input data it should still be possible to do replays for reproduction of realtime behaviour, but a) better testing can be done because you can actually tell what every subsystem needs to run correctly just by looking at its function signature, and b) it's much easier to refactor (and understand) because of the same properties of the function signature describing all of the inputs and outputs.
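To make the first point concrete, a hypothetical sketch (not PX4 code; names are made up) of passing a module exactly the data it needs and calling it synchronously, instead of handing it a whole generic state message:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class AltitudeInput:                 # only what this controller needs
        altitude_m: float
        vertical_speed_mps: float

    def altitude_controller(inp: AltitudeInput, setpoint_m: float) -> float:
        """Pure function: inputs and outputs are fully visible in the signature."""
        kp, kd = 1.5, 0.4
        return kp * (setpoint_m - inp.altitude_m) - kd * inp.vertical_speed_mps

    # The caller pulls just these fields from wherever state is consolidated
    # (e.g. the EKF output) and calls the function directly - no queue in between.
    thrust = altitude_controller(AltitudeInput(9.2, -0.1), setpoint_m=10.0)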
Services and Actions are built on top of pubsub (with separate Request/Response vs Goal/Feedback/Result topics respectively). At least w/ ROS1 - I'm not sure if ROS2 improved things here...
Nodelets are also a disaster, which is why ROS2 kinda fixed this by decoupling nodes and processes.
When you're just starting, ROS can be nice for prototyping - you get a batteries-included platform that can do some SLAM and simple motion planning. But as you start adding new features, you need to figure out how to add those features over multiple nodes. This coordination overhead can quickly bring your system to its knees, or at least make it extremely difficult to debug and troubleshoot when things go wrong.
No one should be building or maintaining middleware. Build robots. Read your sensor data, build a model of the world, decide what to do, then send commands to your control systems. This is the hard part of robotics.
ROS solves the easiest part of robotics (plumbing and process management) in the shittiest possible way.
Industrial robot arm motor control loop? 1000Hz, single-threaded, dedicated CPU core or even an FPGA.
Robot vision system? 30Hz, tolerance for occasional dropped frames, and some lower priority tasks like saving audit images which are nice to have but not critical. A small monolith with several threads would be a fine choice.
Assigning robot-sorted widgets to customer orders? 1-5Hz, go ahead and use the web coding practices where 200ms is considered a fast response time.
All of these are serious robotics - but they could reasonably be coded in very different ways.
Sure it does. I've seen teams of >100 engineers working on largely monolithic robotic codebases across several teams/companies (volumes in the 10s of thousands).
Beyond that, robotics isn't all that special. There are other domains which have soft-realtime requirements and huge scale. AAA video games are a multi-$100-billion industry, which largely ship single-process systems with extreme performance on par with state-of-the-art perception and motion planning use-cases.
It'll be interesting to see how/when wasm shows up in vr & robotics.
Being able to have some shared memory you pass around between safely sandboxed worklets still has many of the same coordination problems. But it at least has much much much less multiprocess jank.
It lets you safely run lots of different code in a single process, without paying the context switching cost.
Quest's jank problem is probably in part that a multiprocess architecture has to spawn new processes, change priorities of what's running, and handle expensive context switches between processes elegantly, which is hard. Not only is non-monolithic code sometimes harder to reason about, it has all kinds of performance boundaries that are very painful to cut through.
Read through the PipeWire changelogs and see how hard they've worked to make the best use of kernel primitives with less and less overhead, and to get near JACK levels of efficiency. It's a great example of the slow march of multi-process optimization.
Sandboxes let you colocate different processes inside the same process. You still have many of the same architectural challenges, but the ability to spawn new code and run it has much, much lower overhead. It creates the possibility to explore far more multi-process-like architectures than we can now.
In robotics you generally don't want untrusted code anyway. It's much easier for a bad actor to do dangerous things in the real world than to break out of some sandbox. Kind of an analogue-hole situation.
And at that point why bother with an OS? What benefits is an OS providing then anyway? Just deploy with a unikernel.
I don't think that affects the broad usefulness of sandboxes at all. That regurgitates the idea that sandboxes are just security measures. That's a naively small use case.
Elsewhere in the thread here is a discussion of Nodelets, which are often used to load & run trusted code on ROS inside of host processes. This is like a janky, special, homebrew version of a sandbox, which - as other commenters point out - brings a lot of pain. With wasm you can quickly spin up many cheap, fast sub-runtimes and connect them ad nauseam, with the zero-copy benefits & more. http://wiki.ros.org/nodelet
That's really the key. Sandboxes are really about having many (typically different) runtimes in process. Often these will be processes working with processes. As I highlighted already extensively, the benefit is performance & overhead versus native processes & native IPC. Sandboxes are just a known term for a runtime within the process, for a subprocess.
I used to use ROS 1 for work. It is incredibly over-engineered. Somewhere in the code base there is a three-level hierarchy wrapping a shared pointer to a double. The actual address is configured using an XML file. The purpose? To "abstract" the commanded torque ultimately sent to a motor.
I've been using ROS for 10+ years, in both university and professional settings.
The hard parts were never in the middleware (ROS), but stuff like perception, world modelling, decision making. ROS gives you an ecosystem of decent options to choose from. And improve on if you need to at all.
The standardized interfaces and IPC make it very easy to plug, at runtime, eg. a different localization algorithm into the system.
I've also used ROS a lot (but not in the last 2 years), and +1 to your comment. The main gripe with ROS is poor documentation, and people's over-reliance on coding in the simulator. Not the architecture. The biggest challenge IMO is actually getting the perception, world modeling, and decision making to work reliably in the real world, and this is where AI should provide a big boost.
Heh, yes. At some RoboCup@Home competition (IIRC, Mexico City in 2012 so a while back), a team came in super confident, saying everything worked and they were going to win. Most of it worked ok in the lab, and well in the simulator too, they said. Didn't make it to the 2nd stage I think.
And RoboCup is still not exactly 'the wild' for robots.
Multiple reasons, here are my two cents:
* ROS nodes usually do (or did, back when we used it) pretty resource-intensive tasks, so they are generally distributed over multiple machines
* Individual nodes can have quite diverse library requirements (specific versions of obscure libraries) that don't go together, so you want to decouple them
Only in academia. In industrial settings sending stuff over a network between machines is too unreliable for realtime operation, and oftentimes also too expensive. Also in industrial environments if you can't install things together, they're not of a standard that you're happy to deploy to production.
I actually talked to someone in industrial robotics, and message passing / message buses are really common there. All of their controllers are loosely coupled, because each unit might be from a different vendor, upgraded at a different time, etc... It also had the nice property that their systems all had multiple supervisors, and if the supervisor detected a bad state by monitoring the bus passively, it could stop all of controllers.
while(true) {
read sensors
update world model
decide what to do
act
}
You should only deviate from this when you have a specific reason (concurrency, libraries, IPC, etc). You can attach a debugger. You can deterministically play sensor data through and get great reproducibility for end-to-end testing. Starting with a distributed system is a handicap 80% of the time.
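As a sketch of what I mean (plain Python, stand-in names), the same loop can run against live drivers or a recorded log, which is where the reproducibility comes from:

    import json

    def decide(world):                      # stand-in policy
        return {"stop": world.get("obstacle_m", 99.0) < 0.5}

    def run(sensor_source, act):
        world = {}
        for reading in sensor_source:       # live driver or replayed log, same code path
            world.update(reading)           # update world model
            act(decide(world))              # decide, then act

    def replay(path):                       # hypothetical log: one JSON reading per line
        with open(path) as f:
            for line in f:
                yield json.loads(line)

    # End-to-end test: feed a recorded log, capture the commands, assert on them.
    commands = []
    run(replay("sensors.jsonl"), commands.append)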
Design a system to efficiently partition the state between the two? I'm not saying that message passing is bad, just that it shouldn't be the default choice...
But that’s the thing, robots are supposed to interact with the world by default. They are supposed to integrate into society. At some level, there is a necessary distributed processing boundary, and in fact there are many - from the need to communicate with multiple internal heterogeneous processing units, multiple sensors running at different frequencies, external databases and cloud compute, remote operators or telemetry, ground stations, and even other robots. If you want them to be useful at all that is. How in the world do you integrate that system into a synchronous while loop?
This is a non-falsifiable argument. Of course there need to be abstraction layers between various systems. The question is whether pubsub, and all the baggage and difficulty that brings with it, is the correct abstraction mechanism. Tossing your data to the wind and hoping the next system picks it up correctly and runs with it is not how I envision building reliable, deterministic systems.
My argument is that robotics systems are naturally distributed. Pub/sub works okay there, but the actor model is better in my opinion. Either way, I don't see how it's possible to argue that a while loop is the main abstraction roboticists need.
Maybe we're talking about different kinds of systems. I work with robot teams, human-robot interaction, and long-term autonomy.
It really depends on the degree of granularity. ROS encourages the use of the actor model multiple times inside of the same machine. This is complete overkill, and actually reduces reliability and safety.
For example, how do you write unit tests for an actor-model system? Without unit tests, how do you properly characterize the code's behaviour? When I last did ROS work, I built the whole thing outside of ROS, tested and validated it worked with tests, and then put some small ROS wrappers on top, and it basically worked first time. But this isn't how ROS-native systems are developed, instead people use Gazebo/Rviz to tweak and add things, and you end up with a system that is grown organically, at the single algorithm level, with all the problems that entails.
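A hypothetical illustration of that approach (ROS1 rospy with standard message types; the logic class and node names are made up): the core has no ROS dependency and is trivially unit-testable, and the wrapper only translates messages.

    import rospy
    from sensor_msgs.msg import LaserScan
    from geometry_msgs.msg import Twist

    class Avoider:                           # pure logic - unit test this directly
        def __init__(self, stop_dist_m=0.5):
            self.stop_dist_m = stop_dist_m
        def forward_speed(self, ranges):
            return 0.0 if min(ranges) < self.stop_dist_m else 0.3

    class AvoiderNode:                       # thin ROS wrapper, no logic of its own
        def __init__(self):
            self.core = Avoider()
            self.pub = rospy.Publisher("cmd_vel", Twist, queue_size=1)
            rospy.Subscriber("scan", LaserScan, self.on_scan)
        def on_scan(self, msg):
            cmd = Twist()
            cmd.linear.x = self.core.forward_speed(msg.ranges)
            self.pub.publish(cmd)

    if __name__ == "__main__":
        rospy.init_node("avoider")
        AvoiderNode()
        rospy.spin()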
As I posted cross-thread, in the actor model, with queues and threads, you inherently encode additional state via the temporal spacing of the messages. Trying to predict what all of these could be so that you can test for edge cases and make sure things are safe is basically impossible. The modularity of ROS lets you set up a giant system pretty quickly, but in order to iron out the edge cases takes about as much time as just rewriting the whole thing as a monolith, because you haven't actually been able to test the system properly and the long tail of hidden state and bugs is impossible to avoid, and also impossible to predict and test for.
From what I've seen of the ROS community, the concept of testing is severely lacking. It usually entails running simulations in lots of different scenarios, which in a testing hierarchy is only really your final integration tests. It doesn't tell you about degradations in various subsystems, e.g. control or navigational inefficiencies. It doesn't tell you about regressions based on earlier behaviour. It isn't deterministic, so you get random failures, reducing trust in the testing infrastructure. It takes tons of compute, so your devs wait hours for something they should be able to know in seconds. And because it's slow, devs won't add tests to the same granularity they would otherwise.
In a high reliability environment deterministic code is really important. The actor model doesn't give you that, each and every time you cross its interface. It also makes abstractions for granular testing much more difficult. It isn't a silver bullet, and ROS leans so heavily on it that all of the downsides are effectively unmitigated and impossible to avoid.
It sounds like we're working in a similar space, for me it is drone obstacle avoidance and navigation systems, and I found ROS to be entirely unsuitable for anything more granular than inter-drone coordination.
> For example, how do you write unit tests for an actor-model system?
In an actor model, the units would be the actors. Test that they are deterministic and behave correctly given a message. You can test them for robustness by fuzzing messages and throw them at the actor. Then you use integration tests to test the whole system's performance.
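For instance, a toy sketch (plain Python, made-up actor) of what those unit and fuzz tests can look like, with no middleware involved:

    import random

    class BatteryMonitor:                    # the actor: state plus a handle() method
        def __init__(self, low_threshold=0.2):
            self.low_threshold = low_threshold
        def handle(self, msg):
            if msg["type"] == "battery" and msg["level"] < self.low_threshold:
                return {"type": "alarm", "reason": "battery_low"}
            return None

    def test_alarm_below_threshold():
        actor = BatteryMonitor()
        assert actor.handle({"type": "battery", "level": 0.5}) is None
        assert actor.handle({"type": "battery", "level": 0.1}) == {
            "type": "alarm", "reason": "battery_low"}

    def test_fuzzed_messages_never_crash():
        rng = random.Random(0)               # seeded, so any failure reproduces
        actor = BatteryMonitor()
        for _ in range(1000):
            actor.handle({"type": "battery", "level": rng.uniform(-1, 2)})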
> But this isn't how ROS-native systems are developed
Note that I haven't been arguing for ROS, but for loosely decoupled architectures for distributed systems like robots. I agree that ROS has many shortcomings. Although I would say this is not a shortcoming of ROS, but of ROS developers. Maybe ROS can be blamed for guiding people to work in such a way.
> As I posted cross-thread, in the actor model, with queues and threads, you inherently encode additional state via the temporal spacing of the messages.
Systems other than ROS do it better, but the point I've been trying to get across is that the actor model is great for distributed systems because it makes explicit the inextricable asynchronous, distributed nature of the system. As I've been arguing, you need to pass messages at some point if you want the robot to be a robot -- it has to interact with the world and society at some level, likely many levels. Your obstacle avoiding drone I assume is communicating with a base station, maybe remote compute, and a remote human operator. If we want to properly test this kind of system, we're going to have to make explicit the fact that the network is not reliable, latency is not zero, etc.
In this light, temporal spacing of messages, rather than being an encumbrance, becomes a necessity. It's a means to test and ensure that the system can handle all sorts of timings and orders of messages, just as it would need to do in the real world. By designing and conducting our tests to incorporate this, we can effectively simulate and anticipate the conditions our system will face.
Also, time-deterministic messaging protocols can be used to better manage this temporal aspect.
> you haven't actually been able to test the system properly and the long tail of hidden state and bugs is impossible to avoid, and also impossible to predict and test for.
But does the monolith avoid the edge cases or does it just fall for the fallacies of distributed computing?
> From what I've seen of the ROS community, the concept of testing is severely lacking.
Again, this seems like a shortcoming of the ROS community, and not the actor model.
For the drone example, the actor model works fine because each subsystem is safe. However if you have multiple components on the drone and want them to be managed by an actor model, as ROS would encourage, you introduce a world of uncertainty on an individual subsystem since that subsystem isn't actually autonomous on its own. Having more actors than strictly necessary due to the underlying physics of the problem is a huge issue.
> In this light, temporal spacing of messages, rather than being an encumbrance, becomes a necessity.
And this is the crux of where we disagree. This is a messy part of reality which should be, as far as possible, abstracted away from the algorithms which need to operate on the data presented to them. If I'm running a Kalman filter I don't want to have to design in my filter around frequent gyroscope dropouts because image captures are happening, I want my system to have guaranteed behaviour that this won't happen. Actor model makes this harder by not giving me a way to have explicit guarantees, in fact it moves in the opposite direction by embracing flexibility.
While in general I agree that different components should be independently operable, as a system they will more than likely, in the real world, share various resources and you will need to deal with contention.
Any system which drastically increases overhead via serialisation, context switches, possibly network traffic, and finally deserialisation, in place of a function call that takes a few instructions, is a design which should be used very sparingly.
Actor model makes testing harder, and this results (again in the real world) in testing less. It also makes system level tests nondeterministic. Time deterministic protocols in place of function calls is just a nonstarter IMO. It's giving up control margin, increasing system load, and doesn't leave you any better with regard to system stability in case of failure.
Yes the actor model has its place, but at a very large granularity. Overuse, as in ROS, leads to horrible design constraints, opaque dependencies, difficult or impossible testing, and frankly impossible debugging.
Since you seem to be an actor model evangelist, how would you go about tracing execution flow in a debugger, for example? The data that gets passed into the actor interface is basically runtime-defined GOTOs. Similarly, how would you prove (from a certification perspective) that in certain scenarios the system as a whole behaves in a certain way, and fails in a safe way? Each subsystem can be proved to be safe, but the moment it goes through an async interface all bets are off.
> And this is the crux of where we disagree. This is a messy part of reality which should be, as far as possible, abstracted away from the algorithms which need to operate on the data presented to them. If I'm running a Kalman filter I don't want to have to design in my filter around frequent gyroscope dropouts because image captures are happening, I want my system to have guaranteed behaviour that this won't happen.
Annoyingly in a lot of real world setups you can't have these guarantees. Your gyroscope, camera, etc are all producing data asynchronously often with slightly different clocks, and they all have different little idiosyncrasies and failure modes.
For example the Mars helicopter almost crashed because it missed a single frame. https://mars.nasa.gov/technology/helicopter/status/305/survi...
If possible you absolutely want to fix the frame drop in the first place, but your algorithm should also be able to handle the drop out (or at least reset/recover).
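A toy sketch of that "handle the dropout" point (a scalar Kalman-style filter, not flight code): predict every tick, but only run the measurement update when a frame actually arrived.

    class ScalarFilter:
        def __init__(self, x0=0.0, p0=1.0, q=0.01, r=0.1):
            self.x, self.p, self.q, self.r = x0, p0, q, r
        def step(self, z):
            self.p += self.q                 # predict: uncertainty grows every tick
            if z is None:                    # dropped frame: skip the update, keep estimating
                return self.x
            k = self.p / (self.p + self.r)   # update: standard Kalman gain
            self.x += k * (z - self.x)
            self.p *= (1.0 - k)
            return self.x

    f = ScalarFilter()
    for z in [1.0, 1.1, None, None, 1.3]:    # two dropouts in the middle
        print(round(f.step(z), 3))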
> "The data that gets passed into the actor interface is basically runtime-defined GOTOs." ... "Any system which drastically increases overheads via serialisation, context changes, possibly network traffic and finally deserialisation in the place of a few instructions function call is a design which should be used very sparingly."
I think that your opinion of the actor model has been particularly colored by ROS, as these constraints aren't necessarily part of the actor model. It's an abstraction, and a formalism built around the idea of message passing, but that doesn't mean the actual implementation has to involve literal message passing. If a function call will really do the trick, a sufficiently smart compiler can produce equivalent code.
But the question is... is the function call really synchronous. For instance, you give the example of a gyroscope attached to a kalman filter, but what about a GPS? What happens when the GPS becomes unavailable, and the kalman filter doesn't get any more updates? Indeed many (a majority actually) of my sensors have Ethernet interfaces, and we communicate with the sensors over networks that include routers. Some of the robot's sensors are external to the robot itself, and we communicate with them over a wireless network. So when you say this:
> Each subsystem can be proved to be safe, but the moment it goes through an async interface all bets are off.
I find myself in full agreement! But you cross that async interface as soon as you want data from your sensors, because the sensor interface is asynchronous. So you might as well deal with the asynchrony explicitly.
> Since you seem to be an actor model evangelist, how would you go about, just as an example, tracing execution flow in a debugger, for example?
Typically what I look at are message traces. What's nice about actor model is it lends itself to new ways of debugging, like time travel debugging. It's also a formalism, so we can leverage that formalism to prove properties of the program.
> how would you prove (in a certification perspective) that in certain scenarios the system as a whole behaves in a certain way, and fails in a safe way?
I guess it would depend on what system you're trying to certify and to what standard. If you have something in mind, how would you imagine going about it ideally? Then maybe I can try to respond as to how my mind would wrap around it.
Can you give another popular implementation of the actor model as a counterexample? Happy to learn, but I care more about what can practically be done for actual industrial use cases than what can be done on paper, or behind closed doors somewhere with critical details missing for 3rd parties.
In terms of general actor model implementations, there's the BEAM VM, Akka for Java, and in Rust there are actor frameworks like Actix built on top of the Tokio async runtime.
Your point is taken about industrial robotics lacking proper tools here - it's often the case that industry is about 10-15 years behind research in the field of robotics. But ROS is already being used in industry, and I imagine in 15 years industry will enjoy improved tooling currently being used in research labs to support better testing and reliability in robotic systems.
The people I've worked with in the drone space who actually deliver working products won't touch ROS with a 10-foot barge pole. There are plenty of people using 10x the resources to deliver 10% of the product who are embracing ROS though, and 90% of the work is getting all the different ROS components to play nicely with each other at the same time. Never again.
You could have a lidar coming in at 15Hz, a camera at 30Hz, odometry at 60 or 100Hz - but typically you'll want to plan within that same range, at least for navigation (20-50Hz). "Vastly different" is a bit of a stretch.
Also - we have used queues to deal with different time scales for a really long time. It works fine here too.
For higher-level behaviors around grasping or manipulation, your point is super valid though. I suppose I'm mostly focusing on navigation-type tasks.
You aren’t thinking broad enough. Algorithms can run at megahertz, sensors can run at 10s of kilohertz to 10s of Hz, control loops can run at 5Hz. Remote database calls can run of course much longer than that, and then you have very long range planning tasks that can cycle days or weeks depending on deployment. I’d say that’s quite the range.
And you mention queues, yes exactly. Abstract a little more and you get pub sub. Abstract a little more and you have the actor model, which is a lovely way of building resilient, reliable, fault tolerant systems — exactly what we want out of robots.
Control loops also need to run at kilohertz and if you can't schedule them to run without jitter the whole system is useless. Realtime systems need to have an understanding of time budgets otherwise they will never be reliable enough for actually running in places where if they work suboptimally money is lost.
ROS (1/2) is just damn handy. There's a plethora of libraries available: navigation, localization, perception, motion planning, visualisation, record and replay of events for debugging, high-level behavior definition with state machines and behavior trees, motion control, sensors, orchestration - anything you need in a robotics system.
I've seen various half-assed versions of something akin to ROS (IPC for process isolation and distribution, with some processes running on an RTOS) built over the years. All of them sucked in different ways. Especially a tool like RViz is always missing.
And in many, many robotics videos I see (of a moderately complex robot), there's ROS's RViz on some screen.
> Especially a tool like RViz is always missing. And in many many robotics video I see (of a moderately complex robot), there's ROS's RViz on some screen.
I would love the future robotics development stack to be more modular, so that (for example) future middleware solutions don't need to also bundle their own visualization software. This was direct inspiration for creating Foxglove Studio[0] for visualization and MCAP[1] for logging - both work great with ROS, or equally well without it.
Yep. A couple of years ago I worked in a startup making a Laser Direct Imaging PCB photomasking machine - basically using lasers to expose photomasks.
When I came in, there was a custom IPC thing made, sending essentially Python dicts over ZeroMQ (IIRC). It worked to get the machine running and doing its thing.
For calibration of the cameras (needed to see how warped the PCB was and adjust the pattern) I needed to keep track of transforms etc. A perfect use case for something like ROS's TF, in some incarnation.
The machine was not a 'robot' per se, but there were many sensors, decision making, and actuation, so kind of like a robot.
For debugging the images and calibration transforms, we needed to write custom stuff. The whole thing was akin to ROS; with a couple of days' work it could have been made to work with it. But alas.
It is wild to me that there's this whole huge message-bus/RPC system, Data Distribution Service (DDS), that so much of the rest of the computing industry has never heard of, but which is totally core to ROS. https://en.wikipedia.org/wiki/Data_Distribution_Service
> message-bus/RPC system, Data Distribution Service (DDS), that so much of the rest of the computing industry has never heard of, but which is totally core to ROS.
From my understanding ROS2 chose it because it had decent mindshare in automotive and some other industrial fields.
The core data serialization isn’t too terrible. I’ve been working toward writing an embedded version in Nim. The spec is simpler than it seems.
The biggest downside to ROS2 is the multiple layers of abstraction combined with a bespoke meta-build system. To be fair, it’s still simpler than diving into the grpc c++ library IMHO.
Both, sorta. There’s a serialization format, and the wire message format with QoS on top of UDP. The message format seems similar to MQTT semantics, but with a few extras.
I think the wire format is fixed, but it’s pretty simple. The serialization layer (CDR) is swappable, but mostly all the same, though one of the most popular implementations uses a slightly non-standard FastCDR. Though from my testing it produces the same bytes and the CDR spec was updated. Maybe they match now.
I wish there was a better implementers guide. DDS might be a decent alternative to MQTT for more generic embedded if so.
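For a flavour of how simple the serialization layer is, here's my rough understanding (in Python, for illustration) of plain little-endian CDR for a `{ uint32 id; double value; }` struct, including the 4-byte RTPS encapsulation header (0x0001 = CDR_LE plus two option bytes). Primitives are naturally aligned relative to the start of the payload, so 4 padding bytes follow the uint32. Treat the constants and alignment rules as my reading of the spec, not a reference implementation.

    import struct

    def serialize(msg_id: int, value: float) -> bytes:
        header = bytes([0x00, 0x01, 0x00, 0x00])       # representation id CDR_LE + options
        body = struct.pack("<I", msg_id)               # uint32, little-endian
        body += b"\x00" * ((8 - len(body) % 8) % 8)    # pad so the double is 8-byte aligned
        body += struct.pack("<d", value)               # double, little-endian
        return header + body

    payload = serialize(42, 3.14)
    assert len(payload) == 4 + 4 + 4 + 8               # header + uint32 + padding + double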
The wireshark wiki has a surprisingly good write up on the underlying RTPS protocol:
> The Real-Time Publish-Subscribe (RTPS) Wire Protocol provides two main communication models: the publish-subscribe protocol, which transfers data from publishers to subscribers; and the Composite State Transfer (CST) protocol, which transfers state.
https://wiki.wireshark.org/Protocols/rtps
ROS docs talk quite a lot about various vendor implementations of RTPS middleware, but there's little documentation of the protocol itself. Did a little searching but haven't found many truly great in-depth resources.
(This post wasn't to say there are great implementers' resources! I searched for a while for more specific implementation or Wireshark-centric things, to really explain what was what. Would absolutely adore more technical coverage!)
I like ROS - the library of things you can use to get prototypes running quickly is nice - but I wish they hadn't rolled their own build system (Catkin) on top of CMake, with a concept of packages while not being a true package manager. Whenever your middleware dictates your build system, things start to get pretty messy. My dream for ROS 2 (or maybe 3) is one where ROS does less instead of more - sometimes simpler is better.
How is package compatibility with the new version? When I looked at ROS before it seemed to have issues with certain packages only being available for older versions.
ROS is a good research platform, but when it comes down to brass tacks, people stick with smaller, simpler systems. I've used NASA CoreFlight professionally and while it doesn't have all the nice creature comforts, I'm much happier in a small, clean, C-only codebase that runs on Linux, VxWorks, and RTEMS, with two layers of abstraction at most.
How can I get into robotics? I absolutely love the field from the outside but have zero idea how to get into it. Currently a software engineer & studying mechanical engineering part time with a goal of going into robotics eventually, but it doesn’t seem to have a natural starting point for hobbyists.
My 2 cents, as someone who has been mostly a hobbyist here.
(1) Find a buddy who is also interested in it. Things are a lot better if you have a friend.
(2) Then try to build something based on turtlebot or similar platform. Start with just something you can make roam around your house and do stuff.
(3) If you can afford to budget something like $5k+ for it overall, it will be a lot easier. I have seen people try to do everything on a shoestring and it makes things a lot harder. The reason is hardware is hard, and you can move a lot faster if you can just throw things at it. E.g. instead of getting 1 Raspberry Pi or Jetson, get 2-3. Get extra cameras, get 2 extra bases.
A buddy and I got a fully working robot roaming around the house doing basic stuff (chasing dogs, etc.) in about 3 months, but we both would just show up with new parts. Having a 3D printer is helpful to make things look good, but not necessary.
> Currently a software engineer & studying mechanical engineering part time with a goal of going into robotics eventually
Well, it sounds like you're a bit beyond being just a hobbyist if you're pursuing formal education. Sorry if this is just me being ignorant, but does your master's program have any career resources you could look into?
Yeah they do, however I was more interested in tinkering on projects on the weekends much like how I learnt to code. So many things seem to be simulation related with Gazebo or whatever, I’d be keen to build stuff but following a guide to help cover the EE parts.
For my part-time bachelor's in electronics we made an autonomous drone that could fly a preplanned pattern and recognize people through machine learning. Software: ArduPilot and ROS. Hardware: home-made 3D-printed drone, PX4, Nvidia Jetson, some Intel camera. Very fun, and absolutely something you could tinker with as a hobbyist (it was basically a thesis on a hobbyist project).
We first tried with a raspberry pi based autopilot, it crashed and we rebuilt it again and again until we realized the GPS was broken.
[0] https://github.com/ros2/rosbag2/pull/1160
[1] https://github.com/ros2/ros2/issues/1159
[2] https://github.com/foxglove/studio