Serialized transactions over a broad distribution of keys aren't a huge problem, but you're right: this bodes poorly for hot data. I'm more concerned about their CAP semantics: I'm sorry, but claiming multi-DC availability and ACID transactions is not gonna work.
It's a bit like hating numbers, and saying "What do you mean by 'three'; what is it - three volts, three amps, three metres? Clearly 'three' is meaningless, and we should stop using it and all the other numbers besides."
Decibels are simply a dimensionless ratio, used as a multiplier for some known value of some known quantity.
In every context where decibels are used, either the unit they qualify is explicitly specified, or the unit is implicitly known from the context. For instance, in the case of loudness of noise to human ears in air, the unit can be taken to be dBA (in all but rare cases, which will be specified) measured with an appropriate A-weighted sensor, relative to the standard reference power level.
And similar (but different) principles apply to every other thing measured in dB; either there's an implicit convention, or the 0 dB point and measurement basis are specified.
People who assume that everyone is an idiot but themselves are rarely correct.
I look forward to the author discovering (for example) the measurement of light, or colorimetry, and the many and various subtleties involved. The apparent excessive complexity is necessary, not invented to create confusion.
BoltDB author here. Yes, it is a bad design. The project was never intended to go to production but rather it was a port of LMDB so I could understand the internals. I simplified the freelist handling since it was a toy project. At Shopify, we had some serious issues at the time (~2014) with either LMDB or the Go driver that we couldn't resolve after several months so we swapped out for Bolt. And alas, my poor design stuck around.
LMDB uses a regular bucket for the freelist whereas Bolt simply saved the list as an array. It simplified the logic quite a bit and generally didn't cause a problem for most use cases. It only became an issue when someone wrote a ton of data and then deleted it and never used it again. Roblox reported having 4GB of free pages which translated into a giant array of 4-byte page numbers.
> This can lead to scalability issues in large clusters, as the number of connections that each node needs to maintain grows quadratically with the number of nodes in the cluster.
No, the total number of dist connections grows quadratically with the number of nodes, but the number of dist connections each node makes grows linearly.
> Not only that, in order to keep the cluster connected, each node periodically sends heartbeat messages to every other node in the cluster.
IIRC, heartbeats are once every 30 seconds by default.
> This can lead to a lot of network traffic in large clusters, which can put a strain on the network.
Let's say I'm right about 30 seconds between heartbeats, and you've got 1000 nodes. Every 30 seconds each node sends out 999 heartbeats (which almost certainly fit in a single TCP packet each, maybe fewer if they're piggybacking on real data exchanges). That's 999,000 packets every 30 seconds, or about 33k pps across your whole cluster. For reference, GigE line rate with full 1500-byte MTU packets is about 80k pps. If you actually have 1000 nodes' worth of work, the heartbeats are not at all a big deal.
> Historically, a "large" cluster in Erlang was considered to be around 50-100 nodes. This may have changed in recent years, but it's still something to be aware of when designing distributed Erlang systems.
I don't have recent numbers, but Rick Reed's presentation at Erlang Factory in 2014 shows a dist cluster with 400 nodes. I'm pretty sure I saw 1000+ node clusters too. I left WhatsApp in 2019, and any public presentations from WA are less about raw scale, because it's passé.
Really, 1000 dist connections is nothing when you're managing 500k client connections. Dist connections weren't even a big deal when we went to smaller nodes in FB.
It's good to have a solid backend network, and to try to bias towards fewer, larger nodes rather than more, smaller nodes. If you want to play with large-scale dist and spin up 1000 low-CPU, low-memory VMs, you might have some trouble. It makes sense to start with small nodes and whatever number makes you comfortable for availability, and then, when you run into limits, reach for bigger nodes until you get to the point where adding nodes is more cost effective. WA ran dual Xeon 2690 servers before the move to FB infra; Facebook had better economics with smaller single Xeon D nodes; I dunno what makes sense today, maybe a single-socket Epyc?
This could be a big deal in terms of raising the bar for deployment practices.
Right now "nobody ever got fired for" setting up deployment via rsync and some ad-hoc shell scripts. That works for a single host, although it's not great for reproducibility. But as soon as you go to multiple hosts you need some degree of orchestration, monitoring, and integration with your load balancer to avoid downtime.
CodeDeploy offers those benefits, so if it turns out to be even slightly good, it could become the "nobody ever got fired for" choice, for any non-trivial app running on AWS.
The way to think about Hyperlight is as a security substrate intended to host application runtimes. You’re right that the Hyperlight API only supports C and Rust today — but you can use that to, for example, load Python or JS runtimes which can then execute those languages natively.
But we might be able to do even better than that by leveraging Wasm Components [1] and WASI 0.2 [2]. Using a VM guest based on Wasmtime, suddenly it becomes possible to run functions written in any language that can compile to Wasm Components — all using standard tooling and interfaces.
I believe the team has a prototype VM guest based on Wasmtime working, but they still need a little more time before it’s ready to be published. Stay tuned for future announcements?
Lovely. In about 1974 I was paid to write a function, in IBM 360 assembly language, to compute square roots. I was asked to make it as efficient as possible. I was in my last year as an undergraduate student. I used a Chebyshev approximation for the initial guess (after scaling the input to lie between 0 and 1), and then used two (or was it three) unrolled iterations of Newton's method to get the solution. First money I ever received for writing code!
Spacing is a challenge. And you lose some legibility giving up proportional fonts. I think kerning in proportional fonts makes a big difference, letting your eyes recognize the shape of different letter groupings.
Monospace text is fine if you avoid long-form text, like when it's structured and highlighted in a code editor.
But it sure is pretty! Especially with Unicode charts and ASCII art.
First, the RFC is pointlessly complex and optimized for files, not for streaming. If you want to play with it, manage blocks yourself and ignore the asinine interleaving and block-size management.
Second, the algorithm is actually split in two parts, and while the second (generation of repair blocks) is linear, the first is cubic in the number of messages that you put together in a block (roughly Gaussian elimination on a matrix).
And while parts of both encoding and decoding can be cached, I think that "linear time" encoding for RaptorQ is actually just false marketing speak.
Anyway, since you express disappointment in the ~1000-cycle cost: that's about right. The latency between cores is actually quite high and there's not much you can do about it, especially on a system like x86 which has extremely strong cache coherency by default. One thing that is really important to understand is that the ownership of cache lines dramatically affects the cost of memory.
For IPC, this effectively requires one thread writing to memory (thus, making that cache line modified on that core, and evicted from all other cores). Then, when the polling thread checks in on the line, it will have to demote that cache line from modified to shared by flushing it out to memory (usually L2 or L3, but also writing out to memory). This causes some memory traffic, and constantly means that the cores are fighting over the same cache line. Since x86 is strongly ordered and caches are coherent, this traffic is extremely expensive. Think of a write as "tell all other cores that you modified this memory, so they have to evict/invalidate their cache lines". And a read as "tell all cores to flush their cache lines if they're modified, then wait for them to tell me they're done, then I can read the memory". This effectively is a massively blocking operation. The simple act of reading the mailbox/ticket/whatever from another core to check if a message is ready will actually dramatically affect the speed the other core can write to it (as now that write is effectively full latency).
There are some tricks you can do to get extremely low latency between cores. One of them, is making sure you're on cores that are physically near each other (eg. on the same processor socket). This is only really relevant on servers, but it's a big thing. You can actually map out the physical processor layout, including on a single die, based on the latency between these cores. It's quite subtle and requires low noise, but it's really cool to map out the grid of cores on the actual silicon due to timing.
Another trick that you can do, is have both threads on the same core, thus, using hyperthreads. Hyperthreads share the same core and thus a lot of resources, and are able to actually skip some of the more expensive coherency traffic, as they share the same L1 cache (since L1 is per-core). The lowest latency you will be able to observe for IPC will be on the same core with hyperthreads, but that's often not really useful for _processing_ the data, since performance will not be great on two busy hyperthreads. But in theory, you can signal a hyperthread, the hyperthread can then go and raise some other signal, while the original hyperthread still continues doing some relevant work. As long as one of them is blocking/halted, the other won't really be affected by sharing the core.
Finally, the most reasonable trick, is making sure your tickets/buffers/mailboxes/whatever are _not_ sharing the same cache lines (unless they contain data which is passed at the same time). Once again, the CPU keeps things in sync at cache line levels. So having two pieces of data being hammered by two cores on the same cache line is asking for hundreds of cycles per trivial data access. This can be observed in an extreme case with many core systems, with multiple sockets, fighting over locks. I've done this on my 96C/192T system and I've been able to get single `lock inc [mem]` instructions to take over 15,000 cycles to complete. Which is unreal for a single instruction. But that's what happens when there's 200-500 cycles of overhead every single time that cache line is "stolen" back from other cores. So, effectively, keep in your head which state cache lines will be in. If they're going to be modified on one core, make sure they're not going to be read on another core while still being written. These transitions are expensive, you're only going to get your 3-4 cycle "random L1 data hit performance" if the cache line is being read, and it's in the exclusive, modified, or shared state, and if it's being written, it has to be exclusive or modified. Anything else and you're probably paying hundreds of cycles for the access, and thus, also probably hurting the other side.
Ultimately, what you're asking from the CPU is actually extremely complex. Think about how hard it would be for you to manage keeping a copy of a database in sync between hundreds of writers and reads (cores). The CPU is doing this automatically for you under the hood, on every single memory access. It is _not_ free. Thus, you really have to engineer around this problem, batch your operations, find a design that doesn't require as intense of IPC, etc. On more weakly ordered systems you can use some more tricks in page tables to get a bit more control over how cache coherency should be applied for various chunks of memory to get more explicit control.
This is a cool series of posts, thanks for writing it!
We've released a bit about how the AWS Lambda scheduler works (a distributed, but stateful, sticky load balancer). There are a couple of reasons why Lambda doesn't use this broadcast approach to solve a similar problem to the one these posts are solving.
One is that this 'broadcast' approach introduces a tricky tradeoff decision about how long to wait for somebody to take the work before you create more capacity for that resource. The longer you wait, the higher your latency variance is. The shorter you wait, the more likely you are to 'strand' good capacity that just hasn't had a chance to respond yet. That's a tunable tradeoff, but the truly tough problem is that it creates a kind of metastable behavior under load: excess load delays responses, which makes 'stranding' more frequent, which reduces resource usage efficiency, which makes load problems worse. Again, that's a solvable problem, but solving it adds significant complexity to what was a rather simple protocol.
Another issue is dealing with failures of capacity (say a few racks lose power). The central system doesn't know what resources it lost (because that knowledge is only distributed in the workers), and so needs to discover that information from the flow of user requests. That can be OK, but again means modal latency behavior in the face of failures.
Third, the broadcast behavior requires O(N^2) messages for N requests processed (on the assumption that the fleet size is O(N) too). This truly isn't a big deal at smaller scales (packets are cheap) but can become expensive at larger scales (N^2 gets steep). The related problem is that the protocol also introduces another round-trip for discovery, increasing latency. That could be as low as a few hundred microseconds, but it's not nothing (and, again, the need to optimize for happy-case latency against bad-case efficiency makes tuning awkward).
Fourth, the dynamic behavior under load is tricky to reason about because of the race between "I can do this" and getting the work. You can be optimistic (not reserving capacity), at the cost of having to re-run the protocol (potentially an unbounded number of times!) if you lose the race to another source of work. Or, you can be pessimistic (reserving capacity and explicitly releasing what you don't need), at the cost of making the failure cases tricky (see the classic problem with 2PC coordinator failure), and reducing efficiency for popular resources (in proportion to the latency and popularity of the resource you're looking for). Slow coordinators can also cause significant resource wastage, so you're back to tuning timeouts and inventing heuristics. It's a game you can win, but a tough one.
This needle-in-a-haystack placement problem really is an interesting one, and it's super cool to see people writing about it and approaching the trade-offs in designs in a different way.
Stringref is an extremely thoughtful proposal for strings in WebAssembly. It’s surprising, in a way, how thoughtful one need be about strings.
Here is an aside, I promise it’ll be relevant. I once visited Gerry Sussman in his office, he was very busy preparing for a class and I was surprised to see that he was preparing his slides on oldschool overhead projector transparencies. “It’s because I hate computers” he said, and complained about how he could design a computer from top to bottom and all its operating system components but found any program that wasn’t emacs or a terminal frustrating and difficult and unintuitive to use (picking up and dropping his mouse to dramatic effect).
And he said another thing, with a sigh, which has stuck with me: “Strings aren’t strings anymore.”
If you lived through the Python 2 to Python 3 transition, and especially if you lived through the world of using Python 2 where most of the applications you worked with were (with an anglophone-centric bias) probably just using ascii to suddenly having unicode errors all the time as you built internationally-viable applications, you’ll also recognize the motivation to redesign strings as a very thoughtful and separate thing from “bytestrings”, as Python 3 did. Python 2 to Python 3 may have been a painful transition, but dealing with text in Python 3 is mountains better than beforehand.
The WebAssembly world has not, as a whole, learned this lesson yet. This will probably start to change soon as more and more higher level languages start to enter the world thanks to WASM GC landing, but for right now the thinking about strings for most of the world is very C-brained, very Python 2. Stringref recognizes that if WASM is going to be the universal VM it hopes to be, strings are one of the things that need to be designed very thoughtfully, both for the future we want and for the present we have to live in (ugh, all that UTF-16 surrogate pair pain!). Perhaps it is too early or too beautiful for this world. I hope it gets a good chance.
> Not necessarily. The energy generated may be used for purposes such as heating people’s homes, or industrial processes essential for creating medicine, operating schools and hospitals etc.
... or securing a global, permissionless, decentralized, cryptography-based monetary system which cannot be debased or otherwise corrupted by the government du jour, including those who oppress their people.
> You can see how this is very different from using the energy simply to operate an online gambling operation or to perpetrate large scale securities fraud.
Indeed, I can see that. Fortunately, miners aren't in either of those businesses.
But since we're deciding who gets or doesn't get to use energy, let's also ban casinos, online gambling (like you said), heck.. anything that might waste energy or use it for things that we don't like (even though other people might like it or even need it).
Instead of, you know, letting the market decide who can or cannot use energy based on what provides the most value for their use of it.
One of the most surprisingly clear code bases is LLVM. It’s an old complex beast, yet it’s organized beautifully.
I mean, it has a lot of essential complexity but little accidental complexity.
That’s usually what I strive for when coding. Complexity is sometimes unavoidable, that’s fine, that’s why it’s essential. However, avoidable complexity should be… well… avoided.
One thing I noticed in the middle, when concatenating Rust files for a demonstration:
> Note also that the files are sorted before concatenating, so that the result is guaranteed to be deterministic.
No locale was defined and the example sort command used cannot be considered deterministic. The results could vary wildly on different systems just through the locale alone!
Two solutions: define the locale ("LC_ALL=C" before the command should be sufficient), or use the -V flag on sort.
Any dates on supporting a range of ports to run my sqlite/db instance on? e.g. 5999-8999
edit: not sure why this is being downvoted? Not being able to define a range of ports seems like a huge oversight. The forums there are not very active, so I am asking here.
Parent is right, with some assumptions to be made explicit. Considering the reactions, this deserves more explanation ;)
Let's say that the heat energy of the warm thing is "E" and that the outside and the fridge are at the same temp for simplicity. Let's also assume that the house is not heated using a heat pump.
Consider the two cases:
1) The warm thing is put into a fridge. It's a heat pump, with coefficient of performance C. To cool the thing the fridge will consume E/C, and E will be released as heat inside your room. So you paid E/C of electricity to get E Joules of warming in your home;
2) The warm thing is put outside. The energy E is wasted outside, and the item then goes into the fridge, which does no extra cooling. No energy spent by the fridge, but no warming of the house either. To do a fair comparison with the same final state, we need to add E joules of heat to the house, which requires E joules of electricity or primary energy (because there's no heat pump heating the house).
So to reach the same point, in case (1) we spent E/C of heating energy, and E in the second case. It is indeed more efficient to put the warm thing directly into the fridge, as this will contribute to heating the house with the efficiency of a heat pump.
And if you heat your house with a heat pump of the same efficiency C, you don't care either way: it's equivalent.
Interesting design. I made a lock a couple years ago that is quite similar in principle (though this design is different and has a couple nice improvements).
I hate fonts that have textural imperfections that are repeated perfectly on every instance of the same character, destroying the illusion.
When I wrote my book "Crafting Interpreters", I hand-lettered every single word in every illustration separately so that it would be as imperfect as it appeared.
This is a great list of influences on the design (from the article comments where the prototype author Sam Gross responded to someone wishing for more cross pollination across language communities):
—————
"… but I'll give a few more examples specific to this project of ideas (or code) taken from other communities:
- Biased reference counting (originally implemented for Swift)
- mimalloc (originally developed for Koka and Lean)
- The interpreter took ideas from LuaJIT and V8's ignition interpreter (the register-accumulator model from ignition, fast function calls and other perf ideas from LuaJIT)"
Have nothing but praise for FoundationDB. It has been by far the most rock solid distributed database I have ever had the pleasure of using. I used to manage HBase clusters, and the fact that I have never once had to worry about manually splitting "regions" is such a boon for administration...let alone JVM GC tuning.
We run several FDB clusters using 3-DC replication and have never once lost data. I remember when we wanted to replace all of the FDB hardware (one cluster) in AWS, and so we just doubled the cluster size, waited for data shuffling to calm down, and just started axing the original hardware. We did this all while performing over 100K production TPS.
One thing that makes the above seamless for all existing connections is that clients automatically update their "cluster file" in the event that new coordinators join or are reassigned. That alone is amazing...as you don't have to track down every single client and change / re-roll with new connection parameters.
Anyway, I talk this database up every chance I get. Keep up the awesome work.
Founders are building a distributed systems simulation product now called Antithesis. My data fabric startup, Stardog, is a happy Antithesis early adopter customer. It’s helping us reproduce and fix non-deterministic bugs deterministically. Good stuff.
Ehhhh, doesn't align with my experience. I think FDB is actually really poorly tested. When I was evaluating it for replacement of the metadata key-value store at a major, public web services company, we found that injecting faults into virtual NVMe devices on individual replicas would cause corrupt results to be returned to clients. We also found that it would just crash-loop on Linux systems with huge pages, because although someone from the project had written a huge-page-aware C++ allocator "for performance", evidently nobody had ever actually tried to use it, including the author.
It's also really, really weird that their non-scalable architecture hits a brick wall at 25 machines. Ignoring the correctness flaws, it only works if you can either design around that limit by sharding, and never need cross-shard transactions, or if you can assure yourself that your use case will never outgrow half a rack of equipment.
I was Chief Architect at Slack from 2016 to 2020, and was privileged to work with the engineers who were doing the work of migrating to Vitess in that timeframe.
The assumption that tenants are perfectly isolated is actually the original sin of early Slack infrastructure that we adopted Vitess to migrate away from. From some earlier features in the Enterprise product (which joins lots of "little Slacks" into a corporate-wide entity) to more post-modern features like Slack Connect (https://slack.com/help/articles/1500001422062-Start-a-direct...) or Network Shared Channels (https://slack.com/blog/news/shared-channels-growth-innovatio...), the idea that each tenant is fully isolated was increasingly false.
Vitess is a meta-layer on top of MySQL shards that asks, per table, which key to shard on. It then uses that information to maintain some distributed indexes of its own, and to plan the occasional scatter/gather query appropriately. In practice, simply migrating code from our application-sharded, per-tenant old way into the differently-sharded Vitess storage system was not a simple matter of pointing to a new database; we had to change data access patterns to avoid large fan-out reads and writes. The team did a great write-up about it here: https://slack.engineering/scaling-datastores-at-slack-with-v...
I absolutely love reading Justine's code. It comes up from time to time here and it just makes me happy. It reminds me of the passage from "Programming Sucks" by Still Drinking [1]
Every programmer occasionally, when nobody’s home, turns off the lights, pours a glass of scotch, puts on some light German electronica, and opens up a file on their computer. It’s a different file for every programmer. Sometimes they wrote it, sometimes they found it and knew they had to save it. They read over the lines, and weep at their beauty, then the tears turn bitter as they remember the rest of the files and the inevitable collapse of all that is good and true in the world.
This file is Good Code. It has sensible and consistent names for functions and variables. It’s concise. It doesn’t do anything obviously stupid. It has never had to live in the wild, or answer to a sales team. It does exactly one, mundane, specific thing, and it does it well. It was written by a single person, and never touched by another. It reads like poetry written by someone over thirty.
Her work is always just a little bit trippy in a good way haha.
That's why I miss QNX. Tiny microkernel, yet the system can do almost all of POSIX. Drivers are in user space. File systems are in user space. Networking is in user space. It's used widely for embedded applications, especially automotive, but since Blackberry took over and locked it down, access is expensive. Blackberry also killed off the QNX desktop environment and self-hosting. You used to build QNX on QNX.
L4 is even smaller, but too small. You have to run another OS on top of L4 to get anything done. QNX can run ordinary application programs. At one point there was a version of Firefox for QNX, and all of the GNU command line tools worked.
I personally suspect the engineering challenges might pale next to the challenges of political organization.
Quite simply, the continued human expertise and organization necessary to manage and sustain such a system is far, far beyond anything we have today.
We can't even manage to globally reduce CO2, and the recent government responses to coronavirus have left a lot to be desired, to say the least.
Just because you put people up in space habitats doesn't mean they become any less power-hungry, any more cooperative, or any more peaceful.
You say we'll have the engineering in 1,000 years, and I could buy that. But read Aristotle's Politics from 2,400 years ago, which is concerned mainly with political stability and revolution, and he might as well be describing people today.
I'd like to see us manage "spaceship earth" a helluva lot better before I have even the remotest faith we could manage space habitats politically. Heck, we couldn't even manage Biosphere 2, remember?