This was one of the first major components to use Rust at our edge and we've been really happy with it.
The flexibility this approach (to matching traffic) has given us... whilst we also get the speed, memory safety... is just great. The speed at which we're iterating on the firewall and other systems that use this is a joy to behold. A lot of that speed derives from the confidence in this component.
We're also really pleased that whilst we have several proposals for optimisations, none have yet needed to be prioritised as the performance is great.
> The speed at which we're iterating on the firewall and other systems that use this is a joy to behold.
How much of this do you attribute directly to Rust versus other factors like seniority of engineering talent, clear communicated requirements, organizational empowerment, etc? If you swapped out Rust for C, what would be the impact on velocity? Is it a 1-4x multiplier, or larger?
Engineering resource is finite: the less time spent working on the engine, the more time we can spend on new features and other systems.
Before making this in Rust we experimented with a Go and also a Lua implementation as we investigated the approach. The Rust code took the same time to produce initially, was production-ready very early, and has been rock solid, requiring virtually no maintenance since it was put into production.
That frees that engineer up to work on other things, and because the ongoing maintenance we anticipated hasn't been needed, those engineers are also working on other things.
I'd attribute a chunk of that to Rust... though it helps if you have great engineers who are familiar with the concepts of parsers, etc working on these things. The same may not be true if someone new to Rust was also new to the concepts needed.
Yes. So if one is going to do anything (optimize, write a new server, whatever) one should focus effort where it gets the most bang for the buck.
Before making this in Rust we experimented with a Go and also a Lua implementation as we investigated the approach.
This is excellent! More companies need to be willing to spike & experiment and change directions on effort up front. In general, the sooner you know, the more money you save.
Amen! But I know what my company would say. "Training devs and SRE on a new language is too expensive. And there are too few Rust devs in the job market, and the ones there are are too expensive to hire."
On the other hand, I have experienced a situation where a fairly young group experienced tech burnout from having to learn too much at once. It's a really tricky balance. I was definitely on the side of "We're programmers! Learning programming languages is like drinking water for us" originally, but now I step back a bit every time we think of introducing new tech. Still, spiking a solution in a couple of different languages seems awesome and I can think of a few projects I've worked on that would definitely have benefited from that.
Cloudflare still seems to operate pretty similar to a startup despite now being huge. These days, it seems like a lot more companies that begin as startups are preserving more of that experimental, innovative culture into their later stages. Y Combinator, and others, definitely deserve credit for positively influencing the industry.
1. Cloudflare has an unusual structure in that we have three engineering groups: the core engineering group that builds things that the product management team specify, a totally separate disruption group that works on riskier bets, and a crypto research group.
2. I strongly believe that letting people work in languages they love has a huge advantage. Engineers want to learn new skills (and those that don't we don't want to hire) and so letting them do that means they are happier and do better work.
3. Small teams do more than large ones. The teams that work on Cloudflare products are very small and agile.
Given that I wrote the Go version and RReverser wrote the Rust version, I'm going to go with: A better engineer wrote the Rust version.
But actually I was excited by Rust too. The memory safety, performance (the Rust was faster) and the degree of control over how we could present the FFI to the languages we needed to integrate with, in addition to how readable the Rust was by comparison to the Go code (readable, but so much of it), and then avoiding the GC... the Rust implementation was a convincing winner.
You can kind of "fake" ADTs in Go with interface{} and switches, but it isn't all that nice.
Rust on the other hand with first-class ADT and pattern matching makes it a pleasure! You also have all the nice advantages of the compiler checking that all branches have been satisfied which catches a lot of bugs at compile time.
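As a tiny illustration of that exhaustiveness point (hypothetical node types, nothing to do with the actual engine):

    // A small expression ADT; the compiler knows every possible variant.
    enum Expr {
        Literal(bool),
        Not(Box<Expr>),
        And(Box<Expr>, Box<Expr>),
    }

    fn eval(expr: &Expr) -> bool {
        // If a new variant is added and not handled here, this fails to compile.
        match expr {
            Expr::Literal(value) => *value,
            Expr::Not(inner) => !eval(inner),
            Expr::And(lhs, rhs) => eval(lhs) && eval(rhs),
        }
    }

    fn main() {
        let expr = Expr::Not(Box::new(Expr::Literal(false)));
        assert!(eval(&expr));
    }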
Go has its place, and works great in those places, but I don't think this is one of them. I'm pleased we chose Rust and this feels like a great use for it!
It's not a big issue when you're working on the code continuously, thanks to the incremental compilation, but in general I'd always take slower compilation with more analysis (== lower risk of crashes and faster runtime) over faster compilation any time.
It's important to keep in mind that there is always going to be a bottleneck on productivity. Sometimes that bottleneck is compilation speed, sometimes it's the language allowing unbounded growth in complexity, sometimes it's ecosystem immaturity, sometimes it's fixing bugs another language would have avoided, and sometimes it's a language's lack of expressivity. But there's always something.
It's worth keeping in mind that in some other language or with some other library or using some other toolkit, you'll still wind up with a bottleneck. It'll just be a different one (and maybe a worse one).
Right now Rust compile times are still slower than C++, when doing "make all", as everything is always compiled from scratch and crate binaries aren't shared across projects.
Delphi/Ada/D/.NET Native-like compile times are still on the horizon.
In seriousness, maybe part of the power of C++ is how it forces the programmer to slow down, be more methodical and concentrate more. A language that is faster and looser could also burn out the programmer faster and encourage shallower thought patterns.
I know that when I am trying to find a bug (any bug, it could be logical or data) that I am slowing down a simulation of the interaction of components in my head. This helps one understand all of the abstractions and evolutions of processes within the program. Maybe that very act is the power of C++.
After looking into it, I think I actually did come across it a long time ago. As you've noted, the approaches are different, but it's indeed inspiring in the sense of moving to safer languages for parsing in general.
In C, a common way to get good performance without JIT is with bytecode and a giant switch statement. It would be interesting to see if that's viable in Rust. How does performance compare to an implementation using closures?
Inko (https://inko-lang.org/) uses this approach for its bytecode interpreter. I can't speak about how well it performs compared to an AST interpreter since I never wrote one for Inko, but generally bytecode interpreters should be faster. I also think bytecode interpreters would be easier to optimise, as you're no longer directly tied to the AST of the input language.
It was even slightly worse than AST interpretation actually - as mentioned in the post, this is a much larger topic that deserves its own attention, but I did include the bytecode case in the linked slides. https://www.slideshare.net/RReverser/building-fast-interpret...
My understanding is that you're doing subroutine threaded code[0] instead of bytecode, which isn't a common choice. Maybe this has something to do with Rust's particular performance profile.
Did you profile your bytecode implementation to see where the time is going?
If you mean in a comparison against AST interpretation, then I'd say that Rust tagged enums (which C doesn't have) are already essentially a bytecode, except properly aligned, so it makes sense that they have the same or slightly better performance than you would get with a manual bytecode implementation.
I am not sure if I understand this argument. The C "tagged union pattern" is very similar memory-wise to Rust tagged enums, it just is less pleasant to use due to no pattern matching and so on.
I would expect that the advantage of the closure version of your AST interpreter is coming from somewhere else (at the end of the day it is still an AST interpreter). One possibility is that you are passing around the operands to your filters as Rust lexically-closed values instead of using a custom stack for your interpreter, which makes things a bit more "statically typed".
> at the end of the day it is still an AST interpreter
In a pretty remote sense, I guess.
> One possibility is that you are passing around the operands to your filters as Rust lexically-closed values instead of using a custom stack for your interpreter, which makes things a bit more "statically typed".
Yes, that and using native dynamic dispatch instead of walking a tree structure with branching are making this technique much closer to template JITs than AST interpretation, with corresponding performance wins.
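For anyone wondering what that looks like in practice, the shape of the technique is roughly this (a simplified sketch with made-up types, not the actual Cloudflare code): each AST node is turned into a boxed closure once, so evaluating a filter is just nested calls instead of re-branching over the tree.

    // Made-up, simplified filter AST.
    enum Expr {
        FieldEquals { field: usize, value: u64 },
        And(Box<Expr>, Box<Expr>),
        Or(Box<Expr>, Box<Expr>),
    }

    // A "compiled" filter: a closure over the packet's field values.
    type Filter = Box<dyn Fn(&[u64]) -> bool>;

    fn compile(expr: Expr) -> Filter {
        match expr {
            Expr::FieldEquals { field, value } => {
                Box::new(move |fields: &[u64]| fields[field] == value)
            }
            Expr::And(lhs, rhs) => {
                let (lhs, rhs) = (compile(*lhs), compile(*rhs));
                Box::new(move |fields: &[u64]| lhs(fields) && rhs(fields))
            }
            Expr::Or(lhs, rhs) => {
                let (lhs, rhs) = (compile(*lhs), compile(*rhs));
                Box::new(move |fields: &[u64]| lhs(fields) || rhs(fields))
            }
        }
    }

    fn main() {
        // All the branching on the AST happens once, here.
        let filter = compile(Expr::And(
            Box::new(Expr::FieldEquals { field: 0, value: 443 }),
            Box::new(Expr::FieldEquals { field: 1, value: 6 }),
        ));
        // Each evaluation is now just dynamic dispatch through the closures.
        assert!(filter(&[443, 6]));
    }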
The reason I think it is fair to call it an AST interpreter is that there is roughly one function call for each AST node. This isn't necessarily a bad thing though -- AST interpreters are super simple and maintainable (as you have demonstrated) so if you got good performance out of one then it is great!
C certainly has this via unions. It's more manual but provides more control, including choice of alignment.
It's CPU dependent but in practice the cache benefits of packed instructions will probably dominate any misalignment penalty. It's not like x86 instructions are aligned after all.
What's the best way to implement a variable-width bytecode interpreter in Rust?
More control, as usual, means less room for automatic optimisations. For example, Rust uses knowledge of tag + data combinations for niche-filling enums where possible (and more similar optimisations to come), while LLVM can use this information to more easily deduce which branches are unreachable.
As for alignment - you can control it in Rust enums too, if you wish, by putting custom aligned structures as variant payloads.
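Both points are easy to check with a toy example (nothing project-specific here):

    use std::mem::{align_of, size_of};

    // A variant payload with a custom alignment requirement.
    #[repr(align(16))]
    #[allow(dead_code)]
    struct Wide([u8; 16]);

    #[allow(dead_code)]
    enum Instr {
        Nop,
        Wide(Wide),
    }

    fn main() {
        // Niche filling: Option<&u8> reuses the never-null niche of the
        // reference, so the Option adds no extra space.
        assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());

        // The enum inherits the 16-byte alignment of its payload.
        assert_eq!(align_of::<Instr>(), 16);
    }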
In general, I'm sure it would be possible to achieve same (or, who knows, maybe even slightly better) performance than AST interpretation with manual tweaking, but the closure-based approach proved to give significant wins out of the box, while being more elegant to maintain and extend over time.
In a bytecode interpreter's loop, you want your language to be little more than a fancy assembler. Exotic techniques become viable and even critical for performance: computed goto, misaligned reads, pointer arithmetic, unchecked type conversions, etc. This is one place where you want to tightly control your machine code.
Oh, sure, if you're willing to go to unsafe techniques, you can go straight to a template JIT as well. But, for us, the balance is important, and safety is probably even more important than further wins in performance (which we can still easily get with algorithm-level optimisations before doing anything unsafe).
C++ is itself a “cute” name. It seems to be doing just fine.
Beyond that, “crate” specifically has utility. The closest existing name is “compilation unit,” but with things like incremental compilation, that’s a bit of a misnomer. We need some kind of name.
That's not the right way to talk about it. The error was in code that used Ragel but it was all on us. But as part of improving things we have moved away from non-memory-safe languages, and the Ragel-generated code we were using was C.
In a way it was, but it wasn't Ragel's fault. We made an error that ended up with generated C code that contained the bug. Bottom line for us was... use memory safe languages.
As far as I remember, Ragel wasn't the issue, but the code around it was.
If you use a Rust parser from inside broken C code which leaks memory, the same thing can happen.
A better way to get good performance is to thread your switch statement, which is hard to do explicitly in Rust last time I tried (maybe you could do this if you mark functions as inlinable?).
The "big switch statement" approach is for each bytecode instruction to complete by jumping to a centralized dispatch location (i.e. the switch statement).
The "threaded" approach is for each bytecode instruction to complete by decoding and jumping to the handler for the next instruction.
Basically instead of "break" you have `goto handlers[nextIp->opcode]`.
The advantages of threading are fewer jumps and better branch prediction (since branch prediction is tied to IP). The disadvantages are slightly larger code and compilers struggle to optimize it, since the control flow is not structured.
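For comparison, the "big switch" style is perfectly expressible in Rust as a loop around a match on toy opcodes (a minimal sketch, not benchmarked); the threaded variant, where each handler ends by jumping straight to the next handler, is the part that safe Rust can't write directly:

    // Toy opcodes for a stack machine, purely illustrative.
    enum Op {
        Push(i64),
        Add,
        Halt,
    }

    fn run(code: &[Op]) -> i64 {
        let mut stack = Vec::new();
        let mut ip = 0;
        loop {
            // Single centralised dispatch point: every handler falls back here,
            // which is exactly the "break back to the switch" shape.
            match &code[ip] {
                Op::Push(value) => stack.push(*value),
                Op::Add => {
                    let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                    stack.push(a + b);
                }
                Op::Halt => return stack.pop().unwrap(),
            }
            ip += 1;
        }
    }

    fn main() {
        assert_eq!(run(&[Op::Push(2), Op::Push(3), Op::Add, Op::Halt]), 5);
    }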
I think C++ does closures a bit better than Rust, particularly when you have lots of Arcs and Rcs. The clone dance gets old fast. However, Rust does everything else so much better, especially Cargo (here's hoping Buckaroo, vcpkg, etc. gain adoption).
In general I agree that Rc could've been nice to copy instead of cloning explicitly, but on the other hand I'm a pretty big opponent of having "lots of Arcs and Rcs" since they usually hide the ownership and data flow, and are rather hard to refactor out once added.
I'm not very familiar with C++ closures, but I agree that Rust closures have room for improvement for a different reason: lack of fine-grained control over copy/move.
    let settings = thing_it_wants_to_take_by_reference;
    let channel = thing_it_needs_to_take_ownership_of;

    let closure_that_takes_everything_by_reference =
        || { some_code(settings, channel) };

    let closure_that_takes_everything_by_move =
        move || { some_code(settings, channel) };

    let closure_that_does_what_I_want = {
        let ref_settings = &settings;
        move || { some_code(ref_settings, channel) }
    };

    let closure_with_hypothetical_syntax =
        || [move channel] { some_code(settings, channel) };
This was a fairly explicit design decision, because these cases come up so rarely. It's our opinion that it's not common enough to be worth increasing the complexity of the language.
You don't even need it in this case; this will compile, for example:
    fn main() {
        let settings = String::new();
        let channel = String::new();

        let closure_with_hypothetical_syntax =
            || { some_code(&settings, channel) };
    }

    fn some_code(settings: &String, channel: String) {}
The only difference is the & on settings in the closure. This moves channel but not settings.
It’s cool! In general, the default is “Rust tries to capture each thing the way it needs to be” and you can use move to override that. Sometimes, you do have to do what you wrote, at least in theory. I’ve never actually had to.
If your closures need to be 'static, it’s common in my experience to need to use `move`, and thus to need to clone types like Rc<T>. It depends on whether you’re dealing with Fn, FnMut or FnOnce, too.
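A generic sketch of that dance (not project code): clone the handle first, then move the clone into the closure.

    use std::rc::Rc;

    // Stand-in for any API that insists on owning a 'static callback.
    fn register(callback: Box<dyn Fn() + 'static>) {
        callback();
    }

    fn main() {
        let shared = Rc::new(vec![1, 2, 3]);

        // Clone the Rc so the closure owns its own handle...
        let for_closure = Rc::clone(&shared);
        register(Box::new(move || {
            println!("closure sees {} items", for_closure.len());
        }));

        // ...while the caller keeps using the original.
        println!("caller still sees {} items", shared.len());
    }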
C++ hides the clones from you for sure, but it means it's a lot easier to figure out in Rust where you're copying the closure (because the copies have to be explicit) and thus it's easier to reason about the performance of the Rust code.
Good point, I was thinking of cloning the closure itself, not capturing Arcs and having to clone them to avoid moves. Though in this case it would be C++'s implicit copies rather than capture clauses.
In terms of capture clauses, what I really like are Swift's, as they allow you to bind identifiers to expressions in the capture clause itself (as opposed to just doing capture-by-value vs capture-by-reference). Having capture clauses in Rust that have the same capability would be great.
Rust used to have capture clauses, but they were removed in order to reduce the complexity of the language. It might make sense to add them back later, but given that Rust is still constantly criticized for being too complex I think it's probably still not yet time.
That makes sense to me, I subscribe to the view that software "design patterns" are userland patches for things missing from the language or misuses thereof. The question is then whether the feature is missing enough (the pattern is common) that it makes sense to feed it back into built-in support or not.
I don't know how common the knowledge of nikomatsakis's pattern is though, and thus if crates could be automatically searched for it.
I am surprised eBPF wasn't mentioned. It's made for exactly this, and you get bonus features like a lightweight JIT and safety, given that it supports a strict subset of asm instructions that are deemed safe.
The arguments against JIT aren't backed by any prototypes or proofs. It felt like the author has a bias towards dynamic dispatch and walked backwards towards it.
There is no mention of how frequently these filters change, nor of the possible perf advantage you'd get from running compiled expressions, which could compensate for the accumulated compile time of these filters over time.
Often you need both approaches together, and you switch from interpretation to JIT after a condition (in the DB case, after the same query shows up multiple times, for example). In their case it could be something as simple as: if a filter hasn't been updated in x time, consider it stable and compile it.
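Sketched out, that kind of tiering is just a counter and a promotion point (hand-wavy; the "compiled" form here is a stand-in closure rather than real codegen):

    // Hand-wavy tiering sketch: interpret until a filter looks stable/hot,
    // then flip to a pre-built "compiled" representation.
    struct TieredFilter {
        hits: u32,
        threshold: u32,
        compiled: Option<Box<dyn Fn(u64) -> bool>>,
    }

    impl TieredFilter {
        fn matches(&mut self, packet: u64) -> bool {
            if let Some(fast) = &self.compiled {
                return fast(packet); // fast path after promotion
            }
            self.hits += 1;
            if self.hits >= self.threshold {
                // Promotion point: in a real system this is where JIT
                // compilation would happen.
                self.compiled = Some(Box::new(|packet| packet % 2 == 0));
            }
            // Slow path: interpret the filter directly.
            packet % 2 == 0
        }
    }

    fn main() {
        let mut filter = TieredFilter { hits: 0, threshold: 3, compiled: None };
        for packet in 0..10u64 {
            let _ = filter.matches(packet);
        }
        assert!(filter.compiled.is_some());
    }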
On the other hand, PG's dynamic dispatch is a fun read, just like any other code in PG's repo.
In the other blog post (linked from the top of this blog post) I wrote about the motivation for using the Wireshark-like syntax and how we also have a few customers with a lot of rules (tens of thousands and greater).
That said, it seems obvious to us that now we have a Rust library that can parse a Wireshark-like syntax into an AST... that we don't have to just perform the matching in Rust. i.e. that we can ask the library to produce translations of the expression as SQL (for our ClickHouse), GraphQL (for our analytics API), or even eBPF.
We can't run everything in eBPF, but we could check the list of the fields within an expression to see whether it could be run in eBPF, and then look at heavy hitter rules and promote the ones doing the most work inside L7 to be eBPF and to run in XDP.
Even if we don't do this for customer configured rules, this might be something we do for handling denial of service attacks using the same Wireshark-like expression syntax throughout our system.
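As a rough illustration of what "producing translations" from the same AST means (a toy sketch, not the actual library's API):

    // Toy, loosely Wireshark-like expression AST; not the real types.
    enum Expr {
        Eq { field: String, value: String },
        And(Box<Expr>, Box<Expr>),
    }

    // Walk the same AST, but emit SQL text instead of matching traffic.
    // (A real translator would quote and escape values properly.)
    fn to_sql(expr: &Expr) -> String {
        match expr {
            Expr::Eq { field, value } => format!("{} = '{}'", field, value),
            Expr::And(lhs, rhs) => format!("({} AND {})", to_sql(lhs), to_sql(rhs)),
        }
    }

    fn main() {
        let expr = Expr::And(
            Box::new(Expr::Eq { field: "http.host".into(), value: "example.com".into() }),
            Box::new(Expr::Eq { field: "ip.geoip.country".into(), value: "GB".into() }),
        );
        // Prints: (http.host = 'example.com' AND ip.geoip.country = 'GB')
        println!("{}", to_sql(&expr));
    }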
Got it. Just to clarify, I didn't mean to implement the filters using eBPF inside the kernel, but rather to use eBPF's engine as a library at the application layer.
Instead of a JIT being the next step, did you consider vectorizing the execution?
Amortizing the cost of dispatch over multiple packets means that under heavy load, the performance of the system should be fairly close to what you could JIT, but the system would be much simpler.
Of course, this only helps with the worst case where a backlog is starting to build, but it at least reduces the worst case.
I bet a JIT in LuaJIT would have been easy and fast: compile to Lua code in strings, call "eval" (actually load() in Lua, but the name eval is better known).
I'm guessing because an off-by-one or an extra skip might mean you miss the end of the string and go off into la-la land feeding whatever garbage happens to be in memory to your parser? That would mostly be a C issue (as it has no string abstraction at all).
Hence my noting this is more of a C issue, or more specifically more of an issue with C strings, such as any string you don't have ownership of because it comes from a buffer you were just handed, because I don't quite know how much of a thing string_view is yet.
And of course there's a very fine line in C++ between "ensure boundary safety" (s.at) and "here comes a buffer over-read" (s[]).
In Rust by contrast, the latter is spelt `unsafe { s.get_unchecked() }` so it's a bit harder to miss / fat finger it in.
Boundary safety is only one of many possible issues. For example, Rust makes it easier to work with arbitrary slices (not copied substrings) while statically verifying ownership of the original string.
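A small sketch of what that buys you in practice:

    fn main() {
        let input = String::from("http.host == \"example.com\"");

        // A token is just a view into the original buffer: no copy, no null
        // terminator, and the length travels with the slice.
        let field: &str = &input[0..9];
        assert_eq!(field, "http.host");

        // Slicing out of range (or off a UTF-8 boundary) panics rather than
        // reading past the end, and the borrow checker keeps `field` from
        // outliving the buffer it points into:
        // drop(input); // <- would not compile while `field` is still used below
        println!("{}", field);
    }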
The true negative in the post has been fixed. Rust also doesn't prevent all dangling references as long as you depend on some unsafe third-party library/external library/libc calls and use them in an improper way.
> which provides more flexibility than Rust even with NLL
Not detecting issues being more flexible than detecting issues surprises no one. Raw pointers are also "more flexible" than smart pointers let alone borrow-checked references.
Unfortunately at() is not idiomatic C++. If you don't have tools to automatically prevent idiomatic unsafe C++ features ([], pointer arithmetic) from being used, developers are surely going to use them.
We use -D_GLIBCXX_DEBUG which automatically checks subscript access to vectors. Raw pointer usages are restricted to local (within a class/function) non-owning cases.
Fine, but neither of those practices are idiomatic C++. The first is also going to really hurt performance since C++ libraries and tools don't optimize for that case.
The top level program can use str.at(), but does every string function everywhere in all your libraries use str.at()?
This is likely to be a particular problem for strings (as opposed to any other structure), due to the long and complicated history of strings in C and C++. There are likely to be many string functions across the codebase, and null termination in some contexts confuses the issue even more.
I am surprised so few interpreters are written in garbage collected languages, which would give the interpreted language a GC for free. The main interpreter loop (or jitted code, for that matter) shouldn't allocate, so there is no inherent cost to the approach. And a tuned, mature GC is likely better than yet another MVP reimplementation.
> I am surprised so few interpreters are written in garbage collected languages, which would give the interpreted language a GC for free.
1. not exactly, you're now doubling your allocations as you're allocating a native object which has to create sub-structures representing the guest object, unless you're using the RPython technique of the interpreter being fed into a magical thing of magic which spits out a "native" runtime
2. the GC of the host language doesn't necessarily match what you want for the guest
3. and not being able to manipulate and interact with the GC can make efficiently implementing the guest difficult
> 1. not exactly, you're now doubling your allocations as you're allocating a native object which has to create sub-structures representing the guest object
Those sub structures can live within the same allocation. And in many cases, the object passed to the interpreter can be the undecorated native object.
> 2. the GC of the host language doesn't necessarily match what you want for the guest
It's almost certainly better than refcounting or naive mark and sweep, which is what you get when someone reimplements a GC for an interpreter.
Doesn't Graal/Truffle address that pretty directly? It's a way to build language interpreters directly on top of the JVM that can JIT the interpreter as well as the code running on top of it.
Unfortunately not. Graal/Truffle solves the problem of making it easier to implement a language on the JVM. However, there is still some friction if your language has different GC semantics than the JVM. For example, you can make Python on the JVM, but it is really hard to make the C interface for Python work on the JVM. This is mainly due to the Python memory model which, for better or worse, assumes that data in memory isn't moved by the GC. It allows you to directly manipulate Python objects in C code without fear of that data moving to a new location.
JavaScript is much easier to do on the JVM because the language makes no assumptions about how memory is stored (AFAIK).
What languages have materially different GC semantics, other than Python on the CPython VM? (PyPy loses the fast collection at the end of scope for efficiency reasons)
Python is the key one you already mentioned - it has deterministic semantics and most other language do not.
Ruby also exposes parts of its GC, and allows it to be hooked by C extensions, so when I implemented it on top of the JVM I had to basically reimplement half of it in Java in order to implement those parts of the language.
Erlang's allocation pattern is markedly different than e.g. Java's, in no small part because it's completely impossible for an object to refer to a more recent allocation than itself. More generally, immutability-heavy language would have quite different needs than mutation-heavy ones.
Finalizers and resurrection is a big way that GCs can differ. There's also exotic reference types: Java has WeakReference, SoftReference, PhantomReference...
For fun, I tried porting the Wren interpreter from C to Go and it was about 3x slower. There are a couple of issues:
- Giant switch statements are not optimized as well as in C.
- In C you can do tricks like NaN encoding to create a union between floats and pointers. In Go, this doesn't work and you have to use an interface.
There might be a garbage-collected language that lets you write efficient interpreters (perhaps Julia?) but I don't think you can take this for granted.
Very true, but that's also not a pure interpreter anymore. In languages that support eval() or generating and loading bytecode, it's leveraging the JIT compiler and effectively generating native code.
> I am surprised so few interpreters are written in garbage collected languages, which would give the interpreted language a GC for free.
There have been Smalltalk implementations written in other Smalltalk implementations which have used this approach. In those cases, the GC is very appropriate for the language implemented.
It also comes with the downside of removing some of the sandboxing abilities from that interpreter. Now the interpreted program's memory is the native program's memory, and the ability to monitor and limit the interpreted program's memory might be reduced.