More importantly, if you have an index on purely random IDs, each insert goes to some random position in the tree, whereas IDs that increase with time make all new IDs land at the end of the tree, which reduces index fragmentation.
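To make that concrete, here's a toy sketch (illustrative only: a plain sorted list standing in for a B-tree, and a simple counter standing in for the millisecond timestamp):

    import bisect, os, uuid

    def near_tail_fraction(keys, window=16):
        """Fraction of inserts that land within `window` slots of the tail."""
        index, hits = [], 0
        for k in keys:
            pos = bisect.bisect(index, k)
            hits += pos >= len(index) - window
            index.insert(pos, k)
        return hits / len(keys)

    random_keys  = [uuid.uuid4().bytes for _ in range(5000)]   # fully random IDs
    ordered_keys = [i.to_bytes(6, "big") + os.urandom(10)      # time-ordered prefix + random tail
                    for i in range(5000)]

    print(near_tail_fraction(random_keys))   # small: inserts scattered all over the index
    print(near_tail_fraction(ordered_keys))  # 1.0: every insert is effectively an append

Random keys force page splits throughout the tree; time-prefixed keys keep all the churn at the tail.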
depending on scale and architecture, either behavior can be better. it’s easier to shard when writes occur randomly over the overall space. it’s easier to coalesce when writes all happen in a given place (head or tail)
Only when writing all at once and when you know what the shard boundaries are and the number of shards (and boundaries) are stable. If they're changing, growing, etc., you can't predict where the boundaries will be; random is the least likely to cause problems and allows sub-sharding dynamically.
Very large real-world datasets are unlikely to be static long enough, and the equipment stable enough, for this effect to be ignorable.
> If they're changing, growing, etc., you can't predict where the boundaries will be; random is the least likely to cause problems and allows sub-sharding dynamically.
I'm confused by your reply, because I never suggested not to use random bits for sharding.
I'm just saying that 60+ random bits should be enough to shard, change, grow, and sub-shard with. You don't need 122.
People were talking about the value of time+random UUIDs versus all-random UUIDs, and how those behave.
You said that sometimes the random behavior is preferable.
In response to that, I was saying that even if you want to sort randomly at some particular step, you should use the time+random format, because other steps might not want to sort randomly. You should directly choose to use the random part, instead of indirectly forcing it by making the entire UUID random.
Then you said "Only when writing all at once and when you know what the shard boundaries are and the number of shards (and boundaries) are stable."
I can't figure out how that relates to my post. I thought you were worried about insufficient random bits to use for sharding, but apparently that wasn't your concern. So I have no idea what your concern is. If you have a use case for randomness, use the random half of the UUID.
There's some flexibility in how you fill in a UUIDv7, but let's go ahead and say that the ones we're worried about have the first 32 bits filled with timestamp and the last 32 bits filled with random.
If you want pure sort-by-time, then use it the normal way. If you want pure sort-by-random, then it's slightly awkward but you can prioritize the random part.
But the additional power is that you can shard by the last 32 bits, then sort by the first 32 bits within a shard. And you don't need weird workarounds like hashing the UUID.
You said "it’s easier to shard when writes occur randomly over the overall space. it’s easier to coalesce when writes all happen in a given place (head or tail)". But you can have both at the same time. You can have easy sharding and easy coalescing.
Except you literally can't do random distribution AND be compliant with UUIDv7 if you use any sort of normal lexical sorting/indexing, as those use the start of the key as the most significant bits. UUIDv7 is literally designed to have a stable lexical sort order, have the time as the most significant bits, and have the most significant bits of the time as the most significant bits of the key! Those are its primary design criteria!
You can't 'prioritize' random parts of a key for sorting without writing a bunch of custom sorting (and key parsing) logic, which is generally undesirable for a number of reasons - and frankly completely unnecessary in these cases. You just wouldn't use UUIDv7 (or probably a UUID in general), and the benefits would pay for themselves very quickly anyway.
To quote the UUIDv7 RFC:
"This document presents new time-based UUID formats which are suited for use as a database key." (as the first line of the abstract)
"Due to the shortcomings of UUIDv1 and UUIDv4 details so far, many widely distributed database applications and large application vendors have sought to solve the problem of creating a better time-based, sortable unique identifier for use as a database key."
"- Timestamps MUST be k-sortable. That is, values within or close to the same timestamp are ordered properly by sorting algorithms.
- Timestamps SHOULD be big-endian with the most-significant bits of the time embedded as-is without reordering.
- Timestamps SHOULD utilize millisecond precision and Unix Epoch as timestamp source. Although, there is some variation to this among implementations depending on the application requirements.
- The ID format SHOULD be Lexicographically sortable while in the textual representation.
- IDs MUST ensure proper embedded sequencing to facilitate sorting when multiple UUIDs are created during a given timestamp.
- IDs MUST NOT require unique network identifiers as part of achieving uniqueness.
- Distributed nodes MUST be able to create collision resistant Unique IDs without consulting a centralized resource."
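For concreteness, a minimal sketch of a UUIDv7-style generator following that layout (48-bit millisecond timestamp, version, then random bits); this is illustrative, not a vetted library implementation:

    import secrets, time, uuid

    def uuid7() -> uuid.UUID:
        unix_ts_ms = time.time_ns() // 1_000_000      # millisecond Unix timestamp
        rand_a = secrets.randbits(12)
        rand_b = secrets.randbits(62)
        value  = (unix_ts_ms & ((1 << 48) - 1)) << 80  # timestamp in the most significant bits
        value |= 0x7 << 76                             # version = 7
        value |= rand_a << 64
        value |= 0b10 << 62                            # variant bits
        value |= rand_b
        return uuid.UUID(int=value)

    ids = []
    for _ in range(3):
        ids.append(uuid7())
        time.sleep(0.002)          # distinct millisecond timestamps for the demo
    print(ids == sorted(ids))      # True: later IDs sort later, byte-wise and as hex strings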
I'm pointing out that for some systems, that makes UUIDv7 unsuitable because you WANT the keys to be randomly distributed to avoid hotspots. Using UUIDv7 in these situations will result in a single node receiving all writes (and all reads for a given time range), which in the dataset sizes I'm referring to is usually impossible to handle. No single node can handle that kind of load, regardless of how efficient it may be.
For other types of systems (such as single machine databases or 'tight' clusters of databases without extreme write loads), UUIDv7 and similar is great, as it allows easy/cheap write combining when that is actually possible for a machine to handle the load.
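To make the hotspot concrete, here's a toy model (illustrative only, with hypothetical range boundaries, not any particular database):

    import bisect, collections, os, time

    # hypothetical split points: 8 equal ranges over the 128-bit key space
    boundaries = [(i * (1 << 128)) // 8 for i in range(1, 8)]

    def owning_shard(key: bytes) -> int:
        return bisect.bisect(boundaries, int.from_bytes(key, "big"))

    v7_like = [int(time.time() * 1000).to_bytes(6, "big") + os.urandom(10) for _ in range(10_000)]
    v4_like = [os.urandom(16) for _ in range(10_000)]

    print(collections.Counter(owning_shard(k) for k in v7_like))  # one range owns every write
    print(collections.Counter(owning_shard(k) for k in v4_like))  # spread roughly evenly

Under range partitioning on the raw key, the time-prefixed IDs all land in whichever range covers "now", while the fully random ones spread out.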
> Except you literally can't do random distribution AND be compliant with UUIDv7 if you use any sort of normal lexical sorting/indexing, as those use the start of the key as the most significant bits. UUIDv7 is literally designed to have a stable lexical sort order, have the time as the most significant bits, and have the most significant bits of the time as the most significant bits of the key! Those are its primary design criteria!
> You can't 'prioritize' random parts of a key for sorting without writing a bunch of custom sorting (and key parsing) logic, which is generally undesirable for a number of reasons - and frankly completely unnecessary in these cases. You just wouldn't use UUIDv7 (or probably a UUID in general), and the benefits would pay for themselves very quickly anyway.
Forget prioritizing, that was about going fully random. Seriously, let's pretend I never said that specific sentence.
Let's focus on just the sharding scenario. None of what you said there conflicts with what I said about sharding.
Unless these database engines are so incompetent that you can't shard on something as simple as id[12:16]?
> I'm pointing out that for some systems, that makes UUIDv7 unsuitable because you WANT the keys to be randomly distributed to avoid hotspots. Using UUIDv7 in these situations will result in a single node receiving all writes (and all reads for a given time range), which in the dataset sizes I'm referring to is usually impossible to handle. No single node can handle that kind of load, regardless of how efficient it may be.
You only want the keys to be randomly distributed at the sharding layer. Once it reaches its home node, you don't want random distribution within that node. At best you begrudgingly accept it.
It's within a node that things like "normal lexical sorting" matter the most, so UUIDv7 does a great job of making that smooth.
You don't need lexical sorting between shards, especially when you're randomizing the shard.
The point I'm making is all these shenanigans are completely unnecessary, don't really help, and make everything extremely hard to manage, reason about, and get performance from - all to try to force usage of a specific key format (UUID) in a situation which it is not designed for, and for which it is not suited.
It's square peg, round hole.
And folks working on exabyte-sized indexed datasets generally already get this. So I'm not sure why I'm even having this discussion? I'm not even getting paid for this!
"it allows easy/cheap write combining" is not "completely unnecessary". What the heck, at least be consistent.
And it's not shenanigans! You could shard based on the first bytes of a key, or you could shard based on the last bytes of the key. Neither one should be harder. Neither one is shenanigans.
Wow a long thread of back and forth and confusion :)
Fwiw I’m with Dylan on this one!
I have direct experience with absolutely humongous data processing that uses random bits for shard selection, where each shard uses sorted storage and benefits from the sortability of the time bits, so with just the smallest amount of buffering all inserts are basically super-fast appends.
This is super normal in my experience. And I can’t wait for the new UUID formats to land and get widely supported in libs to simplify discussions with event producers :)
Because sometimes you want some data to be co-located, while the rest is sharded.
For instance, you might use a random object ID as a prefix value in the index, followed by an attribute ID which isn't random. Or a modified time, so you can have a history of values which can be read out linearly.
Used directly, that means objects and their data are sharded randomly across nodes, but when looking up an object's attributes (or an attribute by time), their index entries are always co-located and you can read them out linearly with good performance.
If you blindly hash keys to distribute them, you can't do that. Also, you can't really do a linear read at all, since no data will be 'associatable' with any other: the index value is randomized, and what is stored in the index has no relation to the key provided by the user.
You can only do a straight get, not a read. That is very limiting, and expensive with large data sets as most algorithms benefit greatly from having ordered data. (Well, you could do a read, but you’d get back entries in completely random order)
Needless to say, this is ‘advanced’ usage and requires pretty deep understanding of your data and indexing/write/read patterns, which is why random hashing is the most common hash map behavior.
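A rough sketch of that key layout (names and field widths are made up for illustration; this isn't any specific store's format):

    import os, time

    def attr_key(object_id: bytes, attribute_id: int, modified_ms: int) -> bytes:
        # random object ID prefix scatters objects across shards,
        # but every key for one object is contiguous in sorted order
        return object_id + attribute_id.to_bytes(4, "big") + modified_ms.to_bytes(6, "big")

    object_id = os.urandom(16)
    now = int(time.time() * 1000)
    keys = sorted(attr_key(object_id, attr, now + i) for i, attr in enumerate([3, 1, 2]))

    # a prefix range scan on object_id returns the object's attribute history
    # in attribute-then-time order, readable linearly
    assert all(k.startswith(object_id) for k in keys)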
I’ve never seen that kind of optimization on a dataset that would fit on a database server of any kind. Tens of PB or EB usually, but sometimes only several hundred TB if it’s high load/in-memory only.
Or, you could use a graph database and stop having frustrating relational impedance mismatch, non-locality, etc. You can have O(1) lookups instead of O(log N) for almost everything.
That will depend on which graph database you use, as a graph database might just store the graph in an underlying relational database. It will also depend on what kind of data you have and what kind of queries you want to perform. For a graph database it might be faster to navigate along links in the graph, but I would guess you will pay a big performance penalty if you have to operate orthogonally to your links, like aggregating across all instances of some entity.
Graph databases don't solve that. All databases (document, graph, relational) implement indexes to find specific things in exactly the same way: very well known tree, hash, and other techniques.
The representation (outside of indexing) has properties that make your USE CASE better or worse. EGreg would not be someone to hire to architect a solution. He'll just put your 1-trillion-rows-per-month use case in a graph DB like Neo4j and you'll just watch it fall over when you run billing.
When you're talking about data sets so large they dictate what hardware you use, and introduce terms like "cluster", then O(1) is really O(√n).
Which is why we need a version 2 of complexity theory, one that doesn't treat memory access or arithmetic on arbitrary-precision numbers (i.e., as n actually goes to infinity) as O(1) operations. They aren't, which every large-systems engineer knows but few will talk about.
And that complexity theory already exists. Typical whiteboard engineering uses transdichotomous models to gloss over some polylogarithmic factors (as does much of the literature), but more accurate models exist.
The difference isn't usually very relevant when comparing multiple solutions that all use the same model of computation, though, since the extra terms don't tend to bump one complexity class above another when switching models; and if you cared about actual runtime characteristics you wouldn't be relying on log factors in such a crude tool anyway.
Imagine a data center containing exabytes of data. How long does it take to access an arbitrary bit of that data?
We use clusters because computers cannot contain an infinite amount of memory, storage, or CPUs, because of physics. You see the same thing play out at smaller scales, but it's more obvious at the macro scale. More addresses take log n time to sort out, but time to access is measured in radii, not gate depth.
In a world where clusters are rare, Knuth made decent approximations. In a world where clusters are not only de rigueur but hosted on multitenant hardware spread out over upwards of 100 miles, those approximations are bullshit and need to change.
Integer arithmetic is really of quantized logarithmic complexity. If your hardware has a bucket your number fits into, you're calculating n+1 or n×n in constant time. But if your data set doubles in size (especially for multiplication) you may find yourself in a bucket that doesn't exist, or a more expensive one. Contemporary code is more likely to reach for bignums, which are log n, but again stair-stepped by the number of machine words it takes to represent the entire number. This is a bigger problem when your data set is sparse, so that values grow faster than the population (e.g., UUIDs).
But on the topic of hash tables, you can only reach 'O(1)' if you can avoid collisions. And to avoid collisions you need a key of length m, which grows as n grows. You cannot put 51 pigeons in 50 pigeonholes. So the key length of your hash keys is m >= log n, which means the time to compare keys is also log n, which means hash tables are never actually O(1) access or insert time. Actual access time for a hash table on non-imaginary hardware is √n log n, not O(1).
If you consider that we have applications that are just a single hash table occupying the entire RAM of a single machine, then this is not picking nits. It's capacity planning.
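The arithmetic behind the pigeonhole point, for a few sizes (illustrative only):

    import math

    for n in (10**6, 10**9, 10**12, 2**128):
        min_bits = math.ceil(math.log2(n))     # fewer bits than this and keys must collide
        words = math.ceil(min_bits / 64)       # 64-bit machine words touched per key comparison
        print(f"n={n:.0e}  min key bits={min_bits}  words per compare={words}")

The key length, and so the comparison cost, grows with log n, in steps of a machine word.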
> You cannot put 51 pigeons in 50 pigeonholes. So the key length of your hash keys is m >= log n, which means the time to compare keys is also log n, which means hash tables are never actually O(1) access or insert time.
I am not sure I am following this argument. You are not going to have more than 2^64 pigeons and pigeonholes on any system soon, and I almost dare to say you will never get to 2^128. And for 64- or 128-bit keys, comparisons and many other operations are for all practical purposes constant time. I guess you could argue that this is sweeping a factor of log(n) under the rug because of things like carry chains, which could be faster for smaller bit sizes, but I am not sure this really matters on common hardware: an addition takes one clock cycle independent of the operand values.
Sure, keys can be and often are longer, but not because you were forced to use long keys; it just happens that the thing you want to store in a hash table is a long string. The way you worded it made it sound to me like you were saying that one has to use long keys, not that in practice one often has to deal with long keys. Even then, I am not convinced that this should add a factor to the complexity analysis; I think I would argue, at least in some situations, that calculating hashes of long keys should still be considered constant time, even for the longest keys. But I can also imagine taking this into account if the key length is not only large but also highly variable.
Look, if you have N items related to X, at insert time, you store them in an array and have X point to that array, instead of foreign keys.
For example, when a user has 7 articles, do you want to just point to where the articles are stored, or do you want to do an O(log n) lookup for each article?
And if you have a many-to-many relationship, do you want to join through an intermediate table for even more processing, or just follow a pointer to a range of an intermediate node directly and traverse it?
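Roughly, the idea looks like this (hypothetical structures, not any real graph database's storage format):

    from dataclasses import dataclass, field

    @dataclass
    class Article:
        title: str

    @dataclass
    class User:
        name: str
        articles: list = field(default_factory=list)   # direct references, no foreign keys

    u = User("alice")
    u.articles.extend(Article(f"post {i}") for i in range(7))

    # fetching the user's articles is pointer-chasing, not an O(log n) index lookup per row
    titles = [a.title for a in u.articles]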
What about when you delete rows? Do you just leave the space unused now? Or if you update a row to be larger? Rewrite the whole array (so possibly O(n) updates)?
How do you deal with data that gets accessed in different orders based on relationships from multiple other entities? Duplicate the data? If so, updates now get amplified and you can fit less data in RAM so you're more likely to require disk IO. If not, you need a layer of indirection (so you have an array of pointers instead of an array of data).
Even with a layer of indirection, updates that grow a row and require a reallocation will cause you to have to go update all pointers (also, those pointers need to be indexed to find who you need to update). To avoid write amplification, you can use an array of ids instead of an array of pointers. Now you want an id <-> pointer lookup, which could be done with a hash map in O(1), or with a tree in O(logN) if you want to also allow efficient range queries.
I'm not exactly sure on the implementation details, but for ballpark numbers: with an 8 kB page size, an 80% fill factor, and 16 B entries (8 B key + 8 B pointer), with 10^9 rows, log(N) ought to be ~4 (page IOs). Not ideal, but the point is that for a lot of real-world engineering, O(log N) is effectively O(1) (with a small enough constant factor that it's fine for general use).
So if your data is not write-once, trees are a well-rounded default data structure.
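For what it's worth, here's the arithmetic behind that ballpark, with the assumptions exactly as stated above (8 kB pages, 80% fill, 16 B entries, 10^9 rows):

    import math

    page_bytes, fill, entry_bytes, rows = 8 * 1024, 0.80, 16, 10**9

    fanout = int(page_bytes * fill) // entry_bytes   # ~409 entries per page
    depth = math.ceil(math.log(rows, fanout))        # ~4 page reads from root to leaf
    print(fanout, depth)                             # 409 4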