Hacker News new | past | comments | ask | show | jobs | submit login

UUIDv7 has a specific format that doesn’t support that.

For the case I’m describing, you can’t use it.

For situations you want write coalescing it’s fine though.

Not sure it we’re agreeing here?




What do you mean it doesn't support that?

There's some flexibility in how you fill in a UUIDv7, but let's go ahead and say that the ones we're worried about have the first 32 bits filled with timestamp and the last 32 bits filled with random.

If you want pure sort-by-time, then use it the normal way. If you want pure sort-by-random, then it's slightly awkward but you can prioritize the random part.

But the additional power is that you can shard by the last 32 bits, then sort by the first 32 bits within a shard. And you don't need weird workarounds like hashing the UUID.

You said "it’s easier to shard when writes occur randomly over the overall space. it’s easier to coalesce when writes all happen in a given place (head or tail)". But you can have both at the same time. You can have easy sharding and easy coalescing.


Except you literally can't do random distribution AND be compliant with UUIDv7 if you use any sort of normal lexical sorting/indexing, as they use the start of the key as the most significant bits. UUIDv7 is literally designed to have stable lexical sorting orders, have the time as the most significant bits, and have the most significant bits of the time as the most significant bits of the key! It's their primary design criteria!

You can't 'prioritize' random parts of a key for sorting without writing a bunch of custom sorting (and key parsing) logic, which is generally undesirable for a number of reasons - and frankly completely unnecessary in these cases. You just wouldn't use UUIDv7 (or probably a UUID in general), and the benefits would pay for themselves very quickly anyway.

To quote the UUIDv7 RFC:

"This document presents new time-based UUID formats which are suited for use as a database key." (as the first line of the abstract)

"Due to the shortcomings of UUIDv1 and UUIDv4 details so far, many widely distributed database applications and large application vendors have sought to solve the problem of creating a better time-based, sortable unique identifier for use as a database key."

"- Timestamps MUST be k-sortable. That is, values within or close to the same timestamp are ordered properly by sorting algorithms.

- Timestamps SHOULD be big-endian with the most-significant bits of the time embedded as-is without reordering.

- Timestamps SHOULD utilize millisecond precision and Unix Epoch as timestamp source. Although, there is some variation to this among implementations depending on the application requirements.

- The ID format SHOULD be Lexicographically sortable while in the textual representation.

- IDs MUST ensure proper embedded sequencing to facilitate sorting when multiple UUIDs are created during a given timestamp.

- IDs MUST NOT require unique network identifiers as part of achieving uniqueness.

- Distributed nodes MUST be able to create collision resistant Unique IDs without a consulting a centralized resource."

[https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...]

I'm pointing out that for some systems, that makes UUIDv7 unsuitable because you WANT the keys to be randomly distributed to avoid hotspots. Using UUIDv7 in these situations will result in a single node receiving all writes (and all reads for a given time range), which in the dataset sizes I'm referring to is usually impossible to handle. No single node can handle that kind of load, regardless of how efficient it may be.

For other types of systems (such as single machine databases or 'tight' clusters of databases without extreme write loads), UUIDv7 and similar is great, as it allows easy/cheap write combining when that is actually possible for a machine to handle the load.


> Except you literally can't do random distribution AND be compliant with UUIDv7 if you use any sort of normal lexical sorting/indexing, as they use the start of the key as the most significant bits. UUIDv7 is literally designed to have stable lexical sorting orders, have the time as the most significant bits, and have the most significant bits of the time as the most significant bits of the key! It's their primary design criteria!

> You can't 'prioritize' random parts of a key for sorting without writing a bunch of custom sorting (and key parsing) logic, which is generally undesirable for a number of reasons - and frankly completely unnecessary in these cases. You just wouldn't use UUIDv7 (or probably a UUID in general), and the benefits would pay for themselves very quickly anyway.

Forget prioritizing, that was about going fully random. Seriously, let's pretend I never said that specific sentence.

Let's focus on just the sharding scenario. None of what you said there conflicts with what I said about sharding.

Unless these database engines are so incompetent that you can't shard on something as simple as id[12:16]?

> I'm pointing out that for some systems, that makes UUIDv7 unsuitable because you WANT the keys to be randomly distributed to avoid hotspots. Using UUIDv7 in these situations will result in a single node receiving all writes (and all reads for a given time range), which in the dataset sizes I'm referring to is usually impossible to handle. No single node can handle that kind of load, regardless of how efficient it may be.

You only want the keys to be randomly distributed at the sharding layer. Once it reaches its home node, you don't want random distribution within that node. At best you begrudgingly accept it.

It's within a node that things like "normal lexical sorting" matter the most, so UUIDv7 does a great job of making that smooth.

You don't need lexical sorting between shards, especially when you're randomizing the shard.


The point I'm making is all these shenanigans are completely unnecessary, don't really help, and make everything extremely hard to manage, reason about, and get performance from - all to try to force usage of a specific key format (UUID) in a situation which it is not designed for, and for which it is not suited.

It's square peg, round hole.

And folks working on Exabyte sized indexed datasets generally already get this. So I'm not sure why i'm even having this discussion? I'm not even getting paid for this!

Ciao!


"it allows easy/cheap write combining" is not "completely unnecessary". What the heck, at least be consistent.

And it's not shenanigans! You could shard based on the first bytes of a key, or you could shard based on the last bytes of the key. Neither one should be harder. Neither one is shenanigans.

> It's square peg, round hole.

Going entirely random is an even worse peg.


Wow a long thread of back and forth and confusion :)

Fwiw I’m with Dylan on this one!

I have direct experience of absolutely humongous data processing using random bits for shard selection where each shard uses sorted storage and benefits from the sortability of the time bits so, with just the smallest buffering, all inserts are basically super fast appends.

This is super normal in my experience. And I can’t wait for the new UUID formats to land and get widely supported in libs to simplify discussions with event producers :)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: