This right here is why I think Rust really stands out from other languages. You can call it from other languages just as you can call native code written in C. It's not married to a complex runtime system.
Most other languages are "dead ends", because code written in them is not easily reusable from other languages.
You can kinda coerce D and C++¹ to do the same thing, but this is definitely something Rust and Ada got really right.
--
¹Okay I lied, C++ does this pretty easily with extern "C". However, C++ APIs frequently do not convert well to C at all and you often have to write wrappers for them. C++ classes in particular just can't cross ABI boundaries without a lot of pain, while Ada records and Rust structs are usually just a pragma away from being accessible from C (their “methods” don't use a special calling convention; method calls are just syntactic sugar for passing a struct as the first argument).
The x86 C++ ABI convention of passing this in ECX probably caused a lot of C interop headaches over the years, but really, C should have been passing the first argument in a register anyway.
I wouldn't quite say that's an RFC, it's just an issue in the RFCs repo. Typically people use the issue tracker in order to represent RFCs that they would like to see written, sort of a wishlist.
Ah I misunderstood! Thinking about it a bit more and given how heavily rust uses generics it seems like having an ABI that is at all usable by another higher level language might be challenging. I guess there is always compiler plugins though :)
Microservices don't help polyglots make regexes faster. If you're sticking a 100ns operation behind an endpoint with JSON serialization/deserialization, network round-trip, etc., then I'm not sure what to say to that.
Actually this microservices approach has existed for decades in erlang/otp and can maintain quite excellent speed. You're not going to beat raw c speed-wise, but typically you will beat comparable dynamic languages -- and you don't have to worry about decoding/encoding json or any of that bs. I'd personally be absolutely thrilled if something like genserver and some other otp protocols were ported to other dynamic languages -- but alas real processes aren't nearly efficient enough.
A wink does not necessarily mean sarcasm? I agree that in this case it probably does, but I don't see how that use of sarcasm adds anything substantive to the conversation (aside from a not-relevant-to-this-post dig on microservices, perhaps).
On a serious note, implementing libraries as standalone applications that can be accessed via IPC is actually a viable approach. Plan 9 does this with almost all of its services. Not exactly a fair comparison though, since the IPC mechanism in Plan 9 is just the file system and much easier to learn.
> This right here is why I think Rust really stands out from other languages. You can call it from other languages just as you can call native code written in C
Which is only true, when we forget about the history of computing before C was widespread outside AT&T and C ABI became a synonym for OS ABI.
Rust is great, and a very good step forward in terms of safe code, but lets not forget the ones that came before.
So that people learn from the past and prepare for a better future.
For example, many millennial have this strange idea that C and C++ are the only option for native AOT compiled code, with language features for cache control, RAII or system programming in general.
The aren't the only option, they became the only option given the decisions that the OS vendors took regarding which languages to sell to their customers on the OS SDKs.
Many of those decisions weren't taken in terms of language quality, rather due to money as UNIX was a free OS, whereas other OSes with better Rust like safety were of course more expensive than free.
So young developers get taught this myth that C was the first system programming language, when in the mid-60's we already had mainframe systems being programmed in type safe systems programming languages, across the whole stack.
Using languages like NEWP (https://en.wikipedia.org/wiki/NEWP), just to give one example. A language where one already had the distinction between safe and unsafe modules, with unsafe modules requiring administrator privileges to run.
Did any of those provide safety by means of compile-time checks (as opposed to sophisticated runtime systems, which are by necessity more expensive), though? AFAICT, Lisp didn't, Smalltalk didn't and Oberon didn't.
If you need unsafe code to implement the garbage collector and other core system facilities, then “selectively” disabling safety doesn't really mean much, does it?
Also, determining whether code is safe to run requires proof, not trust. Since when are system administrators particularly skilled at coming up with proofs?
Somehow I'm doubtful that, using 60's-era technology, many non-trivial data structures and algorithms could be proven safe in a mechanized way. As for trusting a programmer's word, it's probably better than trusting a non-programmer's, but enabling unsafe code is still fundamentally a social hack. While you can work around technical problems by social means, this will never constitute a ultimately satisfactory solution.
I am not saying they were full proof, I am saying that systems programming languages from 60's and 70's were much safer than C.
Many of memory corruption issues with C aren't a problem with the stronger type system of Algol and PL/* family of languages.
However OSes written in those languages were commercial, whereas C was bundled with UNIX, distributed for free, because AT&T was forbidden to charge for it.
If it wasn't for that, most likely Rust wouldn't be needed, at least not in the way it looks like.
> Many of memory corruption issues with C aren't a problem with the stronger type system of Algol and PL/* family of languages.
This is probably just my lack of imagination, but, how would you get the same degree of control C offers over the memory layout of data structures (among other things), without giving up on safety, and using only 60's and 70's technology? Even (safe) Rust doesn't help in some cases, say, if you want a single dynamically allocated memory block containing an array of “m” integers followed by an array of “n” floating-points, where “m” and “n” are determined at runtime. Or are you just going to say this is never desirable?
> If it wasn't for that, most likely Rust wouldn't be needed, at least not in the way it looks like.
Because we would still be using tagged architectures, and every machine instruction would incur in the cost of checking these tags? Or because non-determinism is never too big a price to pay for safe memory management? (Never mind the fact that programs also need to manipulate other resources that aren't as plentiful as memory, and thus require eager reclamation.) No, thanks.
Some of the languages that come to mind without GC, Algol and its variants, NEWP, Mesa, Modula-2, Ada.
With GC, Lisp, Smalltalk, Modula-2+, Modula-3, Eiffel, Oberon, Oberon-2, Active Oberon, Component Pascal,....
Keeping the focus on the ones without GC, all of them were use to build full stack mainframes, so of course they supported dynamic structures. Just the memory was measured in KB not GB.
And, yes they did not prevent the programmer of doing a double free or calling free in memory that wasn't yet allocated.
But they prevented:
- implicit conversions between pointers and arrays
- using bad pointers for output parameters in function calls
- bad, implicit, conversions between numeric types
- implicit conversions between enumerated values and integers
- using non-existent enumerated values
- strings with missing terminator
- Using assignment operator when a comparison was intended
- Doing pointer manipulation all over the place instead of when it is really needed
For your m * n example, that is something you can easily do in Ada, even if a bit verbose.
Hence why Rust would be less needed, because we would be in a computing world, where the majority of the exploits C has brought into mainstream computing would be reduced to a very tiny subset, also with unsafe red signs that one has to explicitly opt-in.
And just like with Assembly we could probably manage it, if served in very tiny doses as infrastructure code for other languages.
As for tagged architectures, why not?
Specially since that is what Intel and other manufactures are now trying to push as a workaround tame C corruption errors.
But all of that is history now, and we really do need languages like Rust to make the lower layers of our OSes and language runtimes safe.
> And, yes they did not prevent the programmer of doing a double free or calling free in memory that wasn't yet allocated.
Well, in 2016, or even in 1995, dynamic resource (not just memory!) allocation is the norm, not the exception, so, effectively, “unsafe dynamic resource management” means “unsafe” without further qualification.
> (long list)
These are indeed problems with C. Let's not beat a dead horse.
> For your m * n example, that is something you can easily do in Ada, even if a bit verbose.
Rather than “m * n”, it's more like “m + n”, or even “mx + ny”, or even “mx + padding + ny”, but okay.
> Hence why Rust would be less needed, because we would be in a computing world, where the majority of the exploits C has brought into mainstream computing would be reduced to a very tiny subset, also with unsafe red signs that one has to explicitly opt-in.
I would accept this claim if unsafe code would only be used to implement the occasional non-critical system component. But, can you implement a safe language runtime without “opt-in” unsafe language features? Clearly, static checks at some level are the only way of you really want safety. If you don't, that's okay - just be honest about it.
> As for tagged architectures, why not?
Because it's expensive! The user shouldn't have to use his machine to compute something that any responsible programmer would've known in advance. And runtime is too late to fix any mistakes anyway.
> Specially since that is what Intel and other manufactures are now trying to push as a workaround tame C corruption errors.
> But, can you implement a safe language runtime without “opt-in” unsafe language features? Clearly, static checks at some level are the only way of you really want safety. If you don't, that's okay - just be honest about it.
No, because at systems programming level, specially when interfacing with hardware is not possible to have 100% safety.
Not even SPARK or formal methods allow for that, there is always a very tiny portion of unsafety at some level.
But the point is not to remove it completely, rather to contain it to the point that is a tractable problem and easy to review, even manually.
Which is easier to achieve when languages are strong typed and require explicit escape hatches for doing unsafe things.
> Because it's expensive! The user shouldn't have to use his machine to compute something that any responsible programmer would've known in advance. And runtime is too late to fix any mistakes anyway.
Expensive is relative. I rather have the computer do stuff for me, and focus being productive.
Also just like everything else in computing, if there would be more research in that area, most likely the performance would increase.
Given that the manufacturers are turning to tagged architectures ideas to try to sort out the C mess, apparently they see a path forward there.
> > Specially since that is what Intel and other manufactures are now trying to push as a workaround tame C corruption errors.
> Well, that's sad.
Intel has added the MPX extensions to their processors, already available in gcc, clang and msvc++
> Which is easier to achieve when languages are strong typed and require explicit escape hatches for doing unsafe things.
No disagreement here. What I'm saying is that, in today's environment, requiring escape hatches for dynamic resource management is effectively requiring escape hatches for everything. The worst part is that, in most languages, you don't even need to enable escape hatches to preform nonsensical operations like mutating the same object in two threads, without explicit concurrency control. Literally everything is unsafe!
To the best of my knowledge, the only practical language today that attempts to mitigate these issues is Rust. Even if we somehow revived those old "safer than C even if this doesn't mean much" languages, they would probably not help much.
> Expensive is relative. I rather have the computer do stuff for me, and focus being productive.
With runtime checks, your computer isn't really doing anything for you. Your user's computer is alleviating a problem created by you.
> Also, as much as I like Rust, they still need to sort out some usability issues like non-lexical ownership.
I'd rather take the inconvenience of declaring the occasional use-once temporary variable, which takes at most 10 seconds of my time, than the inconvenience of tracking object lifetime bugs, which can take days or weeks to fix. Correctness is non-negotiable.
It might be desirable now, but those '60s/'70s OSes probably weren't dynamically allocating memory ever. They'd make a buffer in [their equivalent to] the .BSS section that was M×sizeof(int), and another that was N×sizeof(float), and then they'd process records through those buffers blockwise.
Exactly. In the '60s and '70s malloc was a luxury that most programs didn't have, and so Rust's facilities weren't needed to ensure safety. That doesn't describe the state of play in 2016. Those safer languages from the '60s and '70s would also be inadequate to ensure safety today.
> It might be desirable now, but those '60s/'70s OSes probably weren't dynamically allocating memory ever.
Those systems suddenly seem a lot less impressive now. While the lack of dynamic resource management poses other problems (e.g. determining up-front if the system has enough resources to complete the given task), I can't imagine this being more challenging than implementing a modern runtime system (JIT compiler, garbage collector, green thread scheduler, etc.).
Stack, bounds, and code vs data checks like Burrough's can be encoded with almost no gates in a processor. They can run in parallel to main CPU's activity with CPU results only written if checks succeed. One aspect of SAFE architecture at crash-safe.org did that along with some others. So, no, it's pretty easy and not done since modern processors are optimized to run unsafe, C code. ;)
So, in other words, a tagged architecture, except the possible values of the tags are "usable" and "not usable"? Sounds horrible - it's the hardware equivalent of checking at runtime whether a variable is in scope in, say, Python or JavaScript. I don't see any point to checking at runtime what can be known statically.
The tags are more commonly a few types like scaler, array, pointer w/ bounds, and so on. It checks basic stuff rather than full heap safety. Too tired to remember if there's one on that. The point was that runtime checking of critical aspects of safety isn't necessarily some high-overhead operation. The basics that go the furthest cost little to nothing in many designs.
Then, you can go from there in your language or compiler design.
"The runtime check doesn't make the operation any less unsafe - it only makes the consequences of an error less disastrous."
I think we have a different definition of unsafe. If you think it's potential to cause problems, then I agree it doesn't make it less unsafe. Alternatively, it does make it less unsafe if that means it reduces the ability to invisibly corrupt data, hack your system, or crash your system. That's what I'm claiming. Even Rust or Ada might have a hard time meeting that definition once their pristine codes are converted into assembly.
> I think we have a different definition of unsafe.
“Potentially without useful meaning”, e.g., indexing an array without knowing whether the the index is within bounds.
> Alternatively, it does make it less unsafe if that means it reduces the ability to invisibly corrupt data, hack your system, or crash your system.
Your kid (your data) attends the same school as a bully (your program). One day, the bully threatens to punch your kid in the face (raises an exception). “But it's less wrong because he never punched your kid! He only threatened it!”
> Even Rust or Ada might have a hard time meeting that definition once their pristine codes are converted into assembly.
As long as the assembly code has the same meaning as the safe program from which it was generated, it's safe.
"Your kid (your data) attends the same school as a bully (your program). One day, the bully threatens to punch your kid in the face (raises an exception). “But it's less wrong because he never punched your kid! He only threatened it!”"
You're going out on a limb here. I'm not even countering that example as it's rigged to prevent that. Instead, I'll point out these unsafe things don't happen in a vacuum: there's something in the language that sets the risk in motion and then there's something else that determines what happens. The compile-time safety stops it from being set in motion. The run-time safety makes things set in motion meaningless as they'll just be exceptions or alerts for admins.
Back to your analogy, it would be more like a prison learning environment where people learned in cells with bulletproof glass while one shouted he was "gonna cut someone" for not sharing answers. He can try every unsafe thing he wants but the glass makes it meaningless. The prisoner that's being targeted can literally just pretend all that unsafety doesn't even exist with no effect.
Ideally, we have prevention and detection. Reason for prevention is obvious. Reason for detection, even in a perfect language scheme, is because compiler or hardware errors (esp SEU's) will make something act badly eventually with runtime measures being last-ditch effort. If there going to be there, though, then might as well lean on them some more if there's no performance hit, eh? ;)
If my analogy is is flawed (which it almost certainly is), yours is even more so: the computing equivalent of your analogy is preventing runtime errors by simply not allowing software components to interact - yes, of course, without interaction there are no errors because there is no computation!
I was thinking about that as I submitted it but was multitasking. You called it so I gotta correct it. So, let's drop the analogy and go back to what I originally claimed: CPU's modified to protect pointers, arrays, stacks, and so on. The primitives forced to be used in acceptable ways. The programmer does the rest expressing the problem in the type-safe HLL.
Now, almost every hack on a system that I can think of requires forcing a pointer to go out-of-bounds or something like that. Further, the rest send in data that becomes code. One check, from Burroughs, is CPU looking for code tag bit before executing which only can be set by OS or isolated service on microkernel. So, that wouldn't work.
What remains, with little performance hit & no static checking, is a system where hackers (a) have to con the admin into installing malware, (b) break the minimal, trusted loader/installer w/out abusing above components, or (c) get a denial of service attack. Forget analogies: the reality is much more impressive given there's almost no high-severity CVE's left. You can also do great static checks and such as I always recommend. Yet, you either don't or rarely need them in practice if goal is integrity or confidentiality rather than availability.
"Check and incidentally... mate" (Sherlock Holmes, Game of Shadows)
> Yet, you either don't or rarely need them in practice if goal is integrity or confidentiality rather than availability.
My goal is correctness. A program that fails with a runtime exception is just as wrong as another that silently corrupts your data. Dijkstra put it very nicely:
“We could, for instance, begin with cleaning up our language by no longer calling a bug a bug but by calling it an error. (...) The nice thing of this simple change of vocabulary is that it has such a profound effect: while, before, a program with only one bug used to be "almost correct", afterwards a program with an error is just "wrong" (because in error).”
Cute quote by Dijkstra that shows he hadn't quite figured out reality yet. So, correctness is your goal. That means you'll have to specify the behavior, safety/security policy, and prove the two equivalent. The implementation, both source and binary, will have to be shown equivalent along with proven free of language-level issues. Finally, you have to do that on triple, modular hardware [1] that's rad-hard w/ similar rigor in its lifecycle. Or use run-time checks [2] for each algorithm that can correct errors probably also on TMR or rad-hard board. Let's not forget the custom circuitry for RAM that works [3].
Darn, Dijkstra or not, you still need some kind of runtime checks and protection for correctness. :)
I guess money, if you look at the history of computers the majority of interesting stuff failed because it didn't managed to pay for the investments that were done.
C and UNIX got widely adopted, because AT&T wasn't allowed to sell it, so they gave the source code for free to universities. Which then used the code to replace the expensive OSes that they were running on their mainframes.
When AT&T was allowed to charge for it afterwards, they tried to make people pay for it, but failed to do so.
Also around this time UNIX vendors started selling the developer tools separately instead of having them for free. This motivated many to contribute to GNU, which was largely ignored up to that point.
Sending you this link as you might have missed what actually happened on Burroughs end. I thought they petered out and got bought by Unisys. I didn't realize they were Unisys plus were selling billions in MCP systems.
That's probably true. Most material on the mainframe market says IBM has around 90% of it. Their revenues tend to suggest that, too. Probably the common effect with the First Mover advantage. I'm sure IBM bribing government officials in exchange for contracts helped, too. Let's not forget that angle. The NSA business alone would've been enough to found several, computer companies. ;)
"In September 1986 Unisys was formed through the merger of the mainframe corporations Sperry and Burroughs, with Burroughs buying Sperry for $4.8 billion. "
"The merger was the largest in the computer industry at the time and made Unisys the second largest computer company with annual revenue of $10.5 billion"
As to why IBM dominated, I have a hypothesis that's consistent with what happened to other companies. It's basically Gabriel's Worse is Better effect where certain characteristics improve marketing & uptake. The IBM mainframe, IIRC, were backward compatible or integrated with stuff IBM people already bought. They were optimized for Fortran etc that existing apps were written in where Burroughs was optimized for Algol, the language of the future. Further, in a trend that still exists, people focused on raw performance per dollar instead of reliability, maintenance, security, technical debt reduction, and so on that Burroughs architecture could provide. Similar effect with Intel i432, i960, and later Itanium w/ its excellent reliability/security improvements. Burroughs started taking out hardware security & recently got ported to Intel CPU's IIRC. Intel just stopped taking chances. Both in legacy mode since it's all people buy.
Sad story of how computer market works. The Burroughs systems are still around, though, with Unisys making plenty of money. MCP is on release 17 with Unisys also releasing a tool that lets you run MCP in a virtual machine on a PC. It's called MCPExpress I believe. Here's the page.
Some did (SPARK for example), but they sacrificed dynamic allocation (i.e. malloc). Rust's real innovation (in an industry sense) is safe dynamic allocation without GC.
ATS has had this (via linear types like Rust) for a while, but it's a massive pain to use. Rust's main innovation is gluing together existing ideas with a powerful community backing.
This only works for simple algorithms that don't touch the system. As soon as you try to bind something that handles a file on disk, a socket, a thread, it's wishful thinking anyway, as any language have different ways and paradigms to approach threading, asynchronous I/O, signals and other runtime-level issues, and getting code from a different language / library / paradigm is probably not going to fly, or even introduce subtle bugs.
For instance, if I bind a Rust library that opens a file on disk and use it from Go, I lose all the benefits of the Go runtime like the fact that the FD is not inherited by sub processes (as Go always uses O_CLOEXEC, while glibc doesn't), or the fact that blocking calls never return EINTR to application code in case of signals (the Go runtime handles this for the application).
So you might think that the library works, while its subtly violating your language's runtime conventions in a way that might cause hard to debug bugs in production.
One of the benefit of writing code in single "modern" language/runtime is that you don't have to worry about these hairy issues.
It feels like you're throwing the baby out with the bath water, and I don't think a good regex library could be classified as "simple." I specifically wrote Go bindings to Rust's regex lib in tandem with the C API to make sure I wasn't baking any assumptions into the C API that would make it difficult for another language to bind to them.
This process was lucrative. My first draft of the C API defined an iterator that held on to a pointer to the text being searched for the lifetime of the iterator. It turns out that this is inconvenient to use from a language like Go (or perhaps, any language with a moving GC) because C holding on to pointers to memory allocated in the host language can be bad juju. Because I did this process in tandem, I was able to correct the issue at the C API level.
As a result, if you need fast regexes in Go, you now have an option, and that option is built on a reasonably mature regex library with thousands of tests and a benchmark harness that sports PCRE1, PCRE2, Tcl, RE2 and Oniguruma, of which, Rust's regex engine compares quite favorably.
My comment wasn't meant to disrespect your work, or make it look like a simple task to achieve.
The GP comment said that Rust is nice because you can write libraries that can be bound to other languages with more runtime. I was highlighting that this property sounds nice in theory, but has a small scope of applicability in practice. A regex library fits the small scope.
> but has a small scope of applicability in practice
I disagree. It's a critical tool for adopting Rust libraries in projects that are otherwise C or C++ or Python or Go or Java or ..., for example. The scope is pretty vast.
>>Most other languages are "dead ends", because code written in them is not easily reusable from other languages.
FFI isn't the only way to be reusable from another language though. Many interpreted languages are easy to embed into C, C++, and often other languages.
Yes, but the conversion between domain types is the killer there. Have you ever tried embedding ruby into python (or vice versa?) The string/unicode/bytes conversions alone demand a full project's worth of work.
Basically, having base types convert toll-free is HUGE.
You can call Go from C, but it's extremely slow and requires special handling of GC'd references (among other things). It's not really a suitable replacement for C bindings in many circumstances.
How about message-passing without serialization? Imagine that what we got in the 1970s instead of Unix was something like an Erlang VM, in the sense of each OS "process" having an inbox ring-buffer to which other processes could write; and every process having a set of basic agreed-upon tagged types, such that processes could stick raw-memory structs into other processes' inboxes, as long as they consisted only of those basic tagged types.
Anything going beyond ASCII/UTF-8 or C types will be very hard to agree upon. And I think there is no way around serialization ( == chasing pointers and converting to a contiguous chunk of memory). Except if you restrict to a single language ecosystem like Erlang or Haskell, which enable data sharing between (often userspace-) threads in a shared address space. But these need a much more complex runtime.
Almost every "language ecosystem" has a runtime which itself—at least on ⋆nix—relies on libc. This is why every modern language exposes e.g. a FILE type with the same semantics; they're all just wrapping the libc FILE struct + API.
What I'm imagining is a world where "libc", from way-back-when in the 1970s, exposed declarations for some more interesting ADT types + APIs than just register-sized scalars and ASCIZ strings. Not types that require heap allocation or anything, mind you; just some fancier on-stack value types.
Imagine the C runtime being just "batteries-included" enough that if you were writing, say, a JSON parser in C, there would be obvious "native C types" to decode each of JSON's container types into.
And now imagine a world where every modern "language ecosystem" had been built up on top of this libc, instead of the one we have. A libc one can rely on to represent almost all the basic types one needs in a well-known (or at least, consistent per machine) manner in memory.
Imagine how easy message-passing IPC would become in such a world.
That's the idea behind zero-copy serialization protocols like Cap'n Proto [1] and FlatBuffers [2]. The idea is that they define a particular mapping from memory layout to data model, and then provide bindings to that mapping from multiple languages. If you write them into, say, shared memory or a memory-mapped file, then multiple programs in multiple languages can access them, effectively for free.
There still remain a lot of real-world gotchas with this. There are compromises in versioning & upgradability; while both these protocols are designed so you can add new fields in a backwards-compatible way, they also bloat the memory requirements when you modify a schema enough. Code complexity is very high, which means the number of implementations is fairly limited. And many managed languages don't do well with direct memory addressing (Java, V8, and C# are particular offenders), so oftentimes the speed benefits are lost in translation.
But people have certainly been thinking about this issue.
> ... because code written in them is not easily reusable from other languages.
Which is also technically true for C as well. The ABI [EDIT: Not API] is an implementation detail which can vary from compiler to compiler.
So long as you're just using LLVM-based compilers for both Rust and your other languages, you should be OK, but don't be surprised if you run into problems.
> So long as you're just using LLVM-based compilers for both Rust and your other languages, you should be OK, but don't be surprised if you run into problems.
No. You can use the Go bindings to Rust's regex library where the regex library is compiled with `rustc` (llvm based) and your host program, written in Go, is compiled with the standard Go compiler, which is not llvm based.
This works because the Go toolchain knows how to link with objects exposing a C ABI and the Rust toolchain knows how to provide said objects. There is no particular dependency on llvm here.
> Which is also technically true for C as well. The API is an implementation detail which can vary from compiler to compiler.
The C ABI (API is less relevant there) is really the platform/OS's calling convention, which — aside from Microsoft (and embedded systems I guess, because embedded) — is generally the architecture's calling conventions. Which are usually well documented.
> So long as you're just using LLVM-based compilers for both Rust and your other languages, you should be OK, but don't be surprised if you run into problems.
As long as all the languages involved use the same ABI you're in the clear, that's got nothing to do with LLVM. That's what cgo does on the Go side[0]. And as pcwalton notes in an other comment, in theory you could even have the Rust FFI follow the Go/Plan9 ABI.
[0] at a significant cost for the default toolchain since `gc` doesn't use the platform ABI. I believe the gccgo-based alternative toolchain has no cgo overhead.
We're interested in how the Hyperscan library (full disclosure: I am an Intel employee and Hyperscan designer) stacks up in this comparison. From what I understand there are Go and Rust bindings for Hyperscan out there as 3rd party projects (although I don't know much about them).
We find Rust intriguing and the fine-grained control would be a good match for what our run-time needs; it would be a interesting project to re-implement parts of Hyperscan in Rust.
I'm interested too. I'm looking into adding hyperscan to Rust's benchmark harness (which already has PCRE1, PCRE2, RE2, Tcl and Oniguruma). I'm trying to figure out how best to wrap the API. The callback function in `hs_scan` is something I haven't seen in a regex library before (although I bet I can understand why it's used), but for example, Rust's wrapper cripples the callback in comparison to C by dropping the `void *context` parameter: http://flier.github.io/rust-hyperscan/doc/hyperscan/type.Mat...
I'm happy to wrap the C directly (in fact, that's what I've done for most of the other regex engines I benchmark against). Before I try to wrangle the callback-style API, is there an alternate API with a more explicit iterator-like interface?
Thanks for looking into this; we like to see benchmarks run by other people to avoid the prospect that we're just benchmarking the things we already know about and like doing (which are often also the things which we've optimized).
There are no other interfaces outside of the callback API. It's an interesting prospect to have an iterator-like API but a naive implementation of an iterator API would involve us having to save out all of our state somehow (in an explicit fashion, as opposed to a bunch of stack frames). Our system is pretty complex and so it's not just a case of saving out a couple DFA states...
I've done a tiny bit of hacking around on the command line. I'm essentially comparing your `simplegrep` utility with my own utility that operates similarly using `time`.
The first thing I noticed was that hyperscan wasn't lining up with match counts from the rest of the regex engines. I dug into your docs a bit more, and indeed, your engine reports all matches, whereas traditionally, most regex engines offer an iterator over successive non-overlapping leftmost-first matches (including start+end). Of course, that's not to say you're wrong or anything---it's just going to be a real bear to accurately benchmark them.
All in all, hyperscan looks pretty fast. Some of your docs suggest you're doing even smarter literal detection than I am. Prefix literals and literals "near the front" are easy enough to figure out (the latter is on my list of things to do), but literals at the end or middle of a regex seem much trickier. In particular, it seems very easy to fall into worst case quadratic behavior. If you felt like expanding on some implementation details there, that would be awesome. :-)
In any case, I'll share some timings. I ran all of the following regexes on a ~2GB file containing a concatenation of some large sample of Project Gutenberg. It's my goto for "feeling" out regexes and collecting large samples for profiling. I modified simplegrep.c to not print every match and instead increment a single static `count`. I also modified it to mmap the file, so that it is comparable to my utility in Rust which also mmaps the file. The count of total matches is printed at the end. I tried to limit the regexes to ones that had the same number of matches in all regex engines, but on some of them, hyperscan reported slightly more. I then modified simplegrep.c to use the HS_FLAG_SOM_LEFTMOST flag. I did this because that's what Rust's regex engine does (and so do most others). Thus, the benchmark is flawed. But knowing this...
Benchmarks are reported in seconds. I tried to take the best of 3 where appropriate, but overall, I saw very little variation run to run.
Your observation about match counts and "all-matches" vs "leftmost-longest" or "pcre-locally-maximizing" semantics is accurate and Hyperscan is meant to be that way. All-matches is the only semantics we can implement efficiently when you take into account streaming and multiple regex matching. Block mode scans of single regexes are something we like to do well at but they are not our primary use case; more typical is hundreds of regexes in streaming mode.
Similarly, accurate Start of Match (SoM) tracking is a pain. This is a work in progress. You can see the hit we are taking for it (perhaps unnecessarily given the SoM cases don't look like the hardest SoM calculations we have to do). This will improve but SoM is typically hard for us (given the streaming and multi-pattern requirements).
As for our secrets - they are not really secrets; anyone who wants to learn them can simply read our code. I will outline the literal discovery approach, though, as I'm inordinately proud of it (possibly for the wrong reasons):
We find 'factor' literals in a NFA Graph (a Glushkov automata embedded in a BGL graph; Glushkov automata are both our internal intermediate representation of regexes and an occasional implementation strategy) as follows: we explore the local neighborhood of a graph node and calculate 'if I wanted to cut the graph here, which literal strings would I need to see'. For example, if we have paths "abc" and "def" to a node, we would give this a score corresponding to "2 3-character literals" (pretty good). On the other hand, we could have a single path "abcdefghij" to the node (which would be a really good score - a low score) or we could have multiple short literals (say, corresponding to the expansion of \d - 10 1-character literals - really bad - a high score). If we can't find anything reasonable at all the score is effectively infinite. This allows us to score each edge in the graph. We then wire up our graph to a source and sink (at the start and end of the pattern) and do network flow and grab a min-cut. More or less - there are some wrinkles, but that's basically it.
This may or may not be the world's best idea, but I had hankered to find some weird use of netflow ever since a Graph Theory class in 1991, where a professor of mathematics showed us an algorithm to solve the stable marriages problem by reducing it to network flow. This was made more memorable by the fact that he, probably unintentionally, posited a slightly obscene formulation: all the boys were wired to a mysterious 'source', the girls to a mysterious 'sink', and some unknown substance flowed from the former group to the latter (!).
The min-cut trick has its problems, as it does things like scoring a graph where we cut with 5 different copies of the same literal "abc" as if they were distinct literals (having thrown away the way the scores were found when we do min-cut, which just treats the scores as a magic floating point number). But it's been fairly practical. Someone with a better grasp of abstract algebra and graph theory would be welcome to come along and tell us whether the thing that implements our score in min-cut could be more complex (some sort of class that cleverly tracks some or all of the 'sums of literals' and implements operations like + and min) and still allow the min-cut algorithms to work.
Cutting is also more complex for us given the fact that a 'late' cut isn't nearly as good for us as an 'early' cut if we are streaming, as the leftover bits of an NFA graph on the left of such a cut need to be evaluated all the time in streaming mode, while this isn't true to of the parts on the right of a cut. So in practice we don't really do the netflow algorithm in the pure form above (at least, not since version 2.something) but it still gets used. The fact that we are actually decomposing our problem also means that some cuts are better than others to leave a 'clean' set of engines also messes things up.
We'd love help working on this problem (hey, we're open source) but Building Your Own regex engine seems to be the thing everyone must do. It's probably a lot more fun than trying to grab a tractable bit to work on in a 100+Kloc source base.
Very interesting. That is much more sophisticated than what I'm doing. My literal extraction is a simple greedy approach on the AST that only finds prefixes. My "scoring" is just a silly heuristic at this point. I had been doing it on the NFA itself, but found the AST easier to work with.
I think my problem is that I don't know how to take interior literals or suffix literals and apply them to search without provoking worst case quadratic behavior. For example, take the regex `\w+\s+Holmes`. It's easy enough to find `Holmes` and do a fast literal search for it. Then you can match `\w+\s+` in reverse from the start of the literal hit. But whoops, `Holmes` is a substring in the set of all strings described by `\w+\s+`, which means your automaton can start re-scanning input you've already seen. In pathological cases, you get quadratic behavior. (Similar to a poorly written Boyer-Moore implementation.)
I wonder if this gets easier using hyperscan's "all-matches" semantics. I haven't given alternative matching semantics much thought yet.
> We'd love help working on this problem (hey, we're open source) but Building Your Own regex engine seems to be the thing everyone must do. It's probably a lot more fun than trying to grab a tractable bit to work on in a 100+Kloc source base.
Yeah, I started on Rust's regex engine almost 2 years ago. It's been a labor of love.
Do you perhaps happen to have a link to the results of these benchmarks? I'm desperately looking for some, and http://lh3lh3.users.sourceforge.net/reb.shtml seems to be quite date.
The benchmarks aren't quite ready to be widely disseminated yet. With that said, I do record benchmarks from time to time in the repo to track progress, which you can see here: https://github.com/rust-lang-nursery/regex/tree/master/bench... --- You'll have to do the comparison yourself though.
My plan at the moment is to publish something more comprehensive, but it will take time.
And yes, the benchmarks you linked are not only outdated, but misleading, and parts of the conclusion I do not agree with. It also doesn't include PCRE2, which is very fast.
Something I've wondered for a while... in the long term (that is, I'm more asking about architecture than what exists today), is it possible to integrate Rust more deeply into the calling convention of another language, so that rather than generating "C" functions, it might generate true "Go" functions?
I'm also asking beyond Go, so if the answer is "M:N really screws up Go w.r.t. Rust", I'm also curious about whether, for instance, you could write native Rust extensions to Perl or Python or $WHATEVER by changing out the runtime or something. If it works better for some languages than others, which ones might they be?
It's nice to be able to generate true C functions, and it is of course perfectly appropriate and correct to start there, given the real-world state of the ecosystem. But a lot of languages take various forms of penalties to call into C, in order to ensure guarantees that are not necessarily required by languages that aren't C, and, simultaneously, miss out on guarantees that Rust may be able to provide that C can't express and thus the C bindings can't count on.
Possibly. You'd have to add support for the nonstandard Plan 9 calling conventions, stack growth checks, and GC stack map information. LLVM broadly has support for all of these features (contrary to what some Go team members have said), but the specific binary formats for these features are incompatible with the Go runtime, so it'd be necessary to contribute upstream changes to LLVM or fork it.
Did they say segmented stack / stack maps is not available in llvm or it was not available when they started working on Go around 2007? Because I notice that segmented support was added around 2012 and stack maps in 2013.
May I ask why? From my understanding Rust just have revamped their regex[0] using the same (similar) scheme as in RE2, which is now the Go's implementation, am I misinformed?
> I wrote these bindings primarily as a test case for the C API of Rust's regex library. In particular, I wanted to be sure that it was feasible to write low overhead bindings to another language. For the most part, I think that was a success.
The Rust regex library does use various techniques that RE2 and Go's library also use (and has from the start, that patch is just adding another feature that was missing), but it is generally faster than both of those libraries: take a look at the benchmarks in the link.
@jerf - In response to your deleted comment, the README explains some of the numbers. The really crazy numbers are silly, and I say so in the README.
However, there are plenty of benchmarks well over 10x that are not silly at all. Look at the second block. The BeforeHolmes benchmark in particular is a good litmus test, since it defeats most optimizations.
Those particular speed ups are due to algorithms used. RE2 would have a similar speedup when compared with Go's regexp package.
Some of the other benchmarks in the 100x-400x range are likely due to smarter use of literal scanning.
Unfortunately, no, that is not the usual way any more. Two primary reasons:
1. Compiling regexes at compile time requires a compiler plugin, which only works on nightly.
2. The compile-time regexes are dramatically slower (think orders of magnitude) than the normal Regex::new method. The normal Regex::new method is what's shown here in these bindings.
The compiler plugin was code I wrote two years ago, and really hasn't changed since then. Regex::new on the other hand has gotten a lot of love.
The plugin's performance may one day get better, but it won't be soon unless someone else works on it.
Most other languages are "dead ends", because code written in them is not easily reusable from other languages.