I've done a bunch more optimization since I wrote this post last year, and I'm in the process of writing a follow-up.
Spoilers:
- Since last year I've made this algorithm run another ~2x faster. The performance improvement comes largely from replacing ropey with my own jumprope[1] library. (~60ms -> ~30ms)
- I've written a new CRDT algorithm which is another 10x faster (time drops to 4ms). But the algorithm is only faster when the editing history is largely linear. So it's a better approach for something like Google Docs (where most edits are made by one user or online), but it's much worse for something like Git. I think it's possible to make it better in every case, but I haven't figured out how to do that yet.
I re-tested automerge a couple of months ago. Automerge-rs has adopted a lot of the tricks I listed in this post, which is delightful to see. Performance has gotten a lot better (5 minutes in 2021 dropped to 400ms in 2022), but it's still 100x slower than diamond types and uses 100x more RAM (200MB for automerge vs 2MB for diamond types). The main optimization automerge is missing is the internal run-length encoding trick that Yjs pioneered.
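To sketch what that run-length encoding trick looks like (this is a simplified illustration of the general idea, not Yjs's or automerge's actual data structures): when a user types a run of characters, each character gets a sequential ID and is inserted immediately after the previous one, so the whole run can be collapsed into a single stored item with a length.

```rust
// Sketch of the internal run-length encoding idea. These are hypothetical
// types, not the actual Yjs / automerge / diamond-types internals.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Id {
    agent: u32, // which user made the insert
    seq: u32,   // per-agent sequence number
}

#[derive(Debug, Clone)]
struct Run {
    id: Id,     // ID of the first character in the run
    origin: Id, // the character this run was inserted after
    len: u32,   // how many consecutive characters the run covers
}

impl Run {
    fn last_id(&self) -> Id {
        Id { agent: self.id.agent, seq: self.id.seq + self.len - 1 }
    }

    // `next` can be appended when it's by the same agent, its sequence number
    // follows on directly, and it was inserted right after the end of this run.
    fn try_append(&mut self, next: &Run) -> bool {
        if next.id.agent == self.id.agent
            && next.id.seq == self.id.seq + self.len
            && next.origin == self.last_id()
        {
            self.len += next.len;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Typing "hi" arrives as two single-character inserts...
    let mut h = Run { id: Id { agent: 1, seq: 0 }, origin: Id { agent: 0, seq: 0 }, len: 1 };
    let i = Run { id: Id { agent: 1, seq: 1 }, origin: Id { agent: 1, seq: 0 }, len: 1 };

    // ...but collapses into one stored run of length 2. Most of the memory
    // savings in real documents come from exactly this collapsing.
    assert!(h.try_append(&i));
    assert_eq!(h.len, 2);
}
```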
> I think it's possible to make it better in every case, but I haven't figured out how to do that yet.
There are some processes for which there is no one-algorithm-fits-all solution. Rather than trying to find a general solution that is not terrible for either case but also not optimal for either, could it be more efficient overall to detect which case you have and apply the most suitable method?
Of course you'd still have to implement & maintain both methods, there's the cost of the check on each use, and a higher cost whenever the check's heuristic picks the less optimal method. That might all add up to mean that a single method which is good enough for both cases, but ideal for neither, is still the best option.
Unless there is a method to be found that is ideal (or reasonably close to it) for all cases.
I think it's ridiculous that we still use binary files in native software. I want a world where the operating system stores semantically rich data (imagine if all files on your computer were JSON or something). Instead of an app "saving" by overwriting the entire file, an app could send a patch (diff) to the OS, which could store a transactional change log. Immediately we'd get faster saving and no chance of data corruption when things get saved. And we could have optional change tracking for everything.
Then rather than loading the data from disk at startup, apps could subscribe to a change feed from the OS. I want to be able to open the same file in 2 apps. As I edit the file in one app, I should see my changes immediately appear in the other app. CLI tools like wc could update the word count of a file live, as it's being edited. Compilers could consume code changes live, and update binaries based on super fine-grained incremental compilation, without needing to diff.
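To make that concrete, here's roughly the kind of API I'm imagining. Everything below is hypothetical; none of these types or calls exist in any real operating system today.

```rust
// A rough sketch of a patch-based document store with a change feed.
// Entirely hypothetical - this is the shape of the idea, not a real API.
use std::collections::HashMap;

// Apps describe *what changed* instead of overwriting the whole file.
#[derive(Debug, Clone)]
enum Patch {
    Insert { pos: usize, content: String },
    Delete { pos: usize, len: usize },
}

type Version = u64;

#[derive(Default)]
struct DocumentStore {
    // Per-document transactional change log. "Saving" appends a patch, so
    // there's never a half-written file to corrupt.
    logs: HashMap<String, Vec<Patch>>,
}

impl DocumentStore {
    fn apply(&mut self, doc: &str, patch: Patch) -> Version {
        let log = self.logs.entry(doc.to_string()).or_default();
        log.push(patch);
        log.len() as Version
    }

    // Instead of re-reading the file from disk, another app (an editor, wc,
    // a compiler...) asks for everything after the version it last saw.
    // A real system would push these as a live feed rather than polling.
    fn changes_since(&self, doc: &str, since: Version) -> &[Patch] {
        self.logs
            .get(doc)
            .and_then(|log| log.get(since as usize..))
            .unwrap_or(&[])
    }
}

fn main() {
    let mut store = DocumentStore::default();
    store.apply("notes.md", Patch::Insert { pos: 0, content: "hello".into() });
    store.apply("notes.md", Patch::Delete { pos: 0, len: 1 });

    // A second app catches up from version 0 and sees both edits.
    println!("{:?}", store.changes_since("notes.md", 0));
}
```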
And this model could be extended to the cloud. Right now you can't really store a Google Doc on your local computer and have it sync in an intelligent way with actual Google Docs. And if you use Dropbox and two users edit the same file, the file ends up in an unresolvable conflict state. With this model, and using CRDTs, we could seamlessly replicate changes between devices and users. So I could edit a file locally (on a plane), it could replicate over wifi to my phone, and I could keep editing it there. And then I could share it with another user, who can see my changes, edit it with me and so on. And the same data could be exposed through a collaborative editor in the browser, or with native apps.
That's the model I want for computers. None of this "native vs web app" distinction. Just a user's work, transparently replicated.
I'm working on building a platform for this, but I'm not ready to demo it yet.
> The infuriating part was that several people sent me links to the paper and (pointedly) asked me what I think about it. Written up as a Published Science Paper, these speed comparisons seemed like a Fact About The Universe. And not what they really were - implementation details of some java code, written by a probably overstretched grad student. One of a whole bunch of implementations that they needed to code up.
> "Nooo! The peer reviewed science isn't right everybody! Please believe me!". But I didn't have a published paper justifying my claims. I had working code but it felt like none of the smart computer science people cared about that. Who was I? I was nobody.
Although I understand to some degree /why/ this is the way it is, it's quite frustrating how much ~unfounded belief people place in "published studies". I'm not advocating that we denigrate science, but I think folks should still be skeptical of results (esp. low-N or non-reproduced ones) when they're challenged, and not treat them as Facts About the Universe.
c.f. "The code was peer-reviewed, so the application must be bug-free and 100% correct, right?" /s
In 2012, I was doing NLP stuff with the Stanford NLP POS tagger. They claimed it was highly optimized, and hey, who was I to doubt them? But man, it sure _felt_ slow.
So I did some _very minimal_ profiling and, as a complete outsider, found small, simple changes that increased throughput by 4x without affecting maintainability.
I think this fits your "low N" example -- I'm much more likely to believe something is optimal if you can tell me that multiple (experienced!) people have tried to optimize something and failed. If it's just 1 person, that 1 person better have amazing domain expertise for me to believe it.
I'm interested to see that the Rust code run as WASM takes 193ms, but runs natively in 56ms. That's some serious overhead from the runtime! I would have hoped running WASM was closer to native speed than that.
You've added a runtime, and code that can run on any platform the runtime supports (not just the one you compiled for, including via remote network delivery such as a web browser), and it's ONLY 3-4x slower!?
I don't think you fully appreciate just how mindbogglingly impressive it is that it's SO CLOSE to native speed!
I don't agree at all. The JVM gets much much better performance than that, and it runs on a runtime that supports ~all platforms. Idiomatic Java and C code will be within ~25% of each other. [1]
Being 4x slower is pretty abysmal, honestly. I would expect a small penalty for running the JIT, but after that the jitted WASM code and native code should be identical. If the hype is to be believed, the jitted code should actually be better than native code.
Maybe running the JIT is just really slow? If the native program ran for a minute, would the WASM version run in 4 minutes, or 1.15 minutes?
The JVM has a ton of different things going on, it's not an apples-to-apples comparison.
For one, the JVM isn't sandboxed by default. WASM is. That adds additional overhead.
Another consideration is that the JVM uses garbage collection. This doesn't just apply to the code itself, it includes things like the hot code cache and stuff. The JVM trades throughput for latency. WASM doesn't have a GC model.
Lastly, WASM doesn't define the runtime. There is no JIT for "WASM"; that's just a bytecode spec. I'm actually not even sure whether the browser JITs the WASM bytecode when you run it there.
I would also expect the GC situation to be an advantage for WASM here, since the code originated from a language without a GC. Having to run a GC is slower than not having to run one, and the WASM code from the article doesn't need garbage collection.
Sandboxing is a plausible explanation. I can't imagine seccomp filters hurting performance that badly, but who knows how V8 does its sandboxing.
Are there any comprehensive benchmarks of WASM runtimes?
I've been using V8 to benchmark WASM code because it's convenient, and because WASM in V8 is the obvious comparison to make when benchmarking against JavaScript code. But if one of the other WASM runtimes does a better job optimizing, it might be interesting to see that.
I wonder how the performance would change if you used SQLite instead. You would get those range trees and indexes for free (hence a much simpler implementation). Plus persistent storage, some guarantees, etc.
It'd be really fun to see if this approach can be adapted to sqlite's b-tree. But it might not work as-is. The b-tree implementation I have is quite a bit different from most b-trees because I'm storing offsets rather than "keys", and because I'm storing compacted runs of items rather than individual rows.
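For the curious, the nodes look roughly like this. This is a simplified sketch of the idea rather than the actual diamond types code:

```rust
// Simplified sketch of an offset-indexed b-tree. Instead of mapping keys to
// rows, internal nodes store the total length of the text under each child,
// and leaves hold compacted runs rather than one entry per character.
struct Internal {
    // Each entry: (number of characters contained in this subtree, child).
    children: Vec<(usize, Node)>,
}

struct Leaf {
    // Compacted runs of consecutively typed characters.
    runs: Vec<Run>,
}

struct Run {
    len: usize,
    // ...ID / origin / deleted-flag fields omitted from this sketch...
}

enum Node {
    Internal(Box<Internal>),
    Leaf(Leaf),
}

// Descend by character offset: no keys involved, just cumulative lengths.
// Returns the run containing the offset, plus the offset within that run.
fn find(node: &Node, mut offset: usize) -> (&Run, usize) {
    match node {
        Node::Internal(int) => {
            for (size, child) in &int.children {
                if offset < *size {
                    return find(child, offset);
                }
                offset -= *size;
            }
            panic!("offset out of range");
        }
        Node::Leaf(leaf) => {
            for run in &leaf.runs {
                if offset < run.len {
                    return (run, offset);
                }
                offset -= run.len;
            }
            panic!("offset out of range");
        }
    }
}

fn main() {
    let leaf = Node::Leaf(Leaf { runs: vec![Run { len: 5 }, Run { len: 3 }] });
    // Character 6 lands 1 character into the second run.
    let (run, offset_in_run) = find(&leaf, 6);
    assert_eq!((run.len, offset_in_run), (3, 1));
}
```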
Which parts of SQLite have range trees? How would you see a CRDT implemented using SQLite? I've been thinking a lot about fitting SQLite into the world of CRDT, but haven't come up with a good-enough generalized approach.
> Rows are currently modeled as LWW maps. I.e., each column in a row is a LWW Register.
Multi-value registers are also planned, though I'm not sure how that'll be implemented.
(One also has the opportunity to see failing writes to the LWW register, so for my app I can (in theory) potentially surface those somewhere to the user as a resolvable conflict.)
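Conceptually it works something like this. The code below is a toy sketch of an LWW register that reports the losing write; it's not the project's actual implementation:

```rust
// Toy last-writer-wins (LWW) register that returns the losing write from a
// merge, so an app could surface it to the user as a resolvable conflict
// instead of silently discarding it. A sketch, not any project's real code.
#[derive(Debug, Clone, PartialEq)]
struct Write<T> {
    value: T,
    timestamp: u64, // e.g. a Lamport / hybrid logical clock
    site_id: u64,   // tie-breaker so concurrent writes resolve deterministically
}

struct LwwRegister<T> {
    current: Write<T>,
}

impl<T> LwwRegister<T> {
    // Apply an incoming write. Returns whichever write lost.
    fn merge(&mut self, incoming: Write<T>) -> Write<T> {
        let incoming_wins = (incoming.timestamp, incoming.site_id)
            > (self.current.timestamp, self.current.site_id);
        if incoming_wins {
            std::mem::replace(&mut self.current, incoming)
        } else {
            incoming
        }
    }
}

fn main() {
    let mut title = LwwRegister {
        current: Write { value: "Draft", timestamp: 1, site_id: 1 },
    };
    // A concurrent write with a higher timestamp wins; the old value comes
    // back out, so the app can show it as a conflict if it wants to.
    let lost = title.merge(Write { value: "Final", timestamp: 2, site_id: 2 });
    assert_eq!(title.current.value, "Final");
    assert_eq!(lost.value, "Draft");
}
```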
Yeah, I've seen this project. Looks pretty interesting!
Although probably the most challenging data type to get right would be RGA, especially for text. Not to mention rich text :)
Sets and registers in the end are pretty straightforward as soon as one understands the idea of causality, partial order, and arbitrary total order. In terms of the implementation there's also not a lot to optimize, I believe; at worst you could simply store a counter along with the value.
But RGA is a whole different story. Especially in the context of SQLite, which doesn't have default page compression, or prefix compression for keys.
Still, one could probably put together something similar to how Yjs deals with that, and store chunks of data as rows in a table.
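Very roughly, something like the sketch below (using rusqlite). The schema is hypothetical; it's not how Yjs or any existing SQLite CRDT layer actually lays out its data.

```rust
// Hypothetical schema for storing Yjs-style compacted chunks as SQLite rows:
// one row per run of consecutively inserted characters, not one row per
// character. This is a sketch of the idea only.
use rusqlite::{params, Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;

    // (agent, seq) identifies the first character of the run; origin_* points
    // at the item the run was inserted after. NULL origin means "document start".
    conn.execute_batch(
        "CREATE TABLE chunks (
            doc_id        TEXT NOT NULL,
            agent         INTEGER NOT NULL,
            seq           INTEGER NOT NULL,
            origin_agent  INTEGER,
            origin_seq    INTEGER,
            content       TEXT NOT NULL,   -- the run's characters
            deleted       INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (doc_id, agent, seq)
        );",
    )?;

    // Typing 'hello' in one burst becomes a single row.
    conn.execute(
        "INSERT INTO chunks (doc_id, agent, seq, origin_agent, origin_seq, content)
         VALUES (?1, ?2, ?3, ?4, ?5, ?6)",
        params!["doc-1", 1, 0, Option::<i64>::None, Option::<i64>::None, "hello"],
    )?;

    Ok(())
}
```

Without page or prefix compression the per-row overhead is real, but chunking at least keeps the row count proportional to the number of editing runs rather than the number of characters.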
It's posts like this that make me appreciate that video game programmers aren't the only ones out there keeping it real. Good work Joseph G, but also keep the ego in check...
> Good work Joseph G but also keep the ego in check...
I'm not convinced that a more humble write-up would be better. I think it might just be more boring. But I hear you that it can be a bit of a turn-off for some. I'm not sure of the best way to thread that needle.
For what it's worth, I didn't find the tone of the article unpleasant. As you said, anything less might be a bit boring or dry. The way it's written now, at least to me, comes off as sincere/authentic (in addition to being interesting & informative).
[1] https://crates.io/crates/jumprope