Hacker News

Is this specific to MySQL or does it apply to Postgres too?



As far as I can gather from this post and looking at the data type documentation, MySQL does not have a specific UUID type, but Postgres does.[0] I'll assume that Postgres has some internal optimisations to UUID that MySQL thus lacks.

Addendum: I also realise this is anecdotal, but someone on Stack Overflow mentions a significant speed-up from changing `text` to `uuid` in Postgres.[1] This also fits with what I've been told on #postgresql on libera.chat. That said, integers would still outperform UUIDs.

[0] https://www.postgresql.org/docs/14/datatype-uuid.html [1] https://stackoverflow.com/questions/29880083/postgresql-uuid...


The UUID type index in Postgres is optimized: https://brandur.org/sortsupport


The Postgres UUID type works just like BINARY(16) covered in the article.


A lot of the problems listed in the post are physical issues with larger data types whose values are somewhat random - e.g. the size, and how clustered indexes work - and you will have the same problems with them in SQL Server.


If you order the data based on the uuid and your uuid is randomly distributed, then you will almost always be writing the data into the middle of your table, physically. You can cut the impact somewhat by keeping pages sparse (leaving lots of empty space), but eventually you'll be re-writing the data.

SQL Server has a sequential uuid type which avoids exactly this problem.
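The "writing into the middle of the table" effect is easy to model: insert keys one by one into a sorted sequence and count how often the insertion point is not at the end (a crude stand-in for a mid-page write in a clustered index). A toy sketch in Python, with a made-up helper name, purely for intuition:

```python
import bisect
import uuid

def mid_insert_fraction(keys):
    """Insert keys into a sorted list, counting how often the
    insertion point falls before the end (a "mid-table" write)."""
    table, mid = [], 0
    for k in keys:
        pos = bisect.bisect(table, k)
        if pos != len(table):
            mid += 1
        table.insert(pos, k)
    return mid / len(table)

# Random v4-style keys scatter inserts all over the key space;
# monotonically increasing keys always append at the tail.
random_keys = [uuid.uuid4().bytes for _ in range(1_000)]
sequential_keys = [i.to_bytes(8, "big") for i in range(1_000)]
```

With sequential keys every insert is an append (fraction 0); with random keys nearly every insert lands somewhere in the middle.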


If the underlying issue is sequentiality... just use UUIDv6 or v7. They are time-based and approximately sortable (unlike UUIDv1).

https://datatracker.ietf.org/doc/html/draft-peabody-dispatch...

Disclaimer: I have no data to back up that this solves the performance problems described. It's just likely to solve the "writing in the middle of the table" part.
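For intuition, here is a minimal hand-rolled sketch of the draft UUIDv7 bit layout (48-bit Unix-millisecond timestamp in the most significant bits, then version/variant bits, then random bits). Recent Python versions ship a native generator, but spelling the layout out shows why values sort by creation time:

```python
import os
import time
import uuid

def uuid7_sketch() -> uuid.UUID:
    """Hand-rolled UUIDv7-style value: 48-bit Unix-ms timestamp first,
    then version/variant bits, then 74 random bits."""
    ts_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF         # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
    value = (ts_ms & 0xFFFFFFFFFFFF) << 80   # timestamp in the top 48 bits
    value |= 0x7 << 76                       # version nibble: 7
    value |= rand_a << 64
    value |= 0x2 << 62                       # RFC 4122 variant bits '10'
    value |= rand_b
    return uuid.UUID(int=value)
```

Because the timestamp leads, freshly generated values land at the right-hand edge of a B-tree rather than at random positions, while the 74 random bits keep the IDs unguessable.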


Why use a sequential UUID over an integer as the id, associating it with a random UUID?

Integers are a superior way to record sequential data.


Because you don't want your record IDs to be perfectly sequential data.

You only care about them being sequential enough that your database engine will write them down quickly and efficiently (for engines like InnoDB or MSSQL which have such behaviour). But as a developer you typically want them to be random so that you can generate them in a decentralised manner, prevent guessing or accidental joins, safely merge records from multiple tables, etc.

Sequential UUIDs usually preserve enough randomness for the latter (as long as you don't do stuff like generating a trillion IDs exactly at midnight), while providing enough sequentiality for the former.


But now the ID has a random concatenation of data in it, like the MAC address of the node that inserted it into the database and a timestamp.

It'd be much more orderly for each node to have a sequential ID, the ID of the node that created it, a timestamp of when it was created, and a real UUIDv4 in case someone wants to prevent collisions. Same data, but the primary key on an individual node is a lot smaller (half the size) and the metadata is available for use if someone wants it. There is natural room to build up namespaces and such. It is better to keep these observations separate instead of munging them all into the record ID.


I 100% agree that metadata like timestamps and machine IDs, if you care about them, should have their own columns.

I only disagree with:

> It'd be much more orderly for each node to have a sequential ID

I've outlined some reasons above why a random ID has advantages over a sequential one.

I run Postgres in prod so I can use purely random UUIDv4, but for those running mysql or mssql, UUIDv6 and the like are a useful compromise between sequentiality and randomness. The fact that they happen to use timestamps as the source of sequentiality is only an implementation detail.


> SQL Server has a sequential uuid type which avoids exactly this problem.

you refer to the uuid generated by sql server?


Yes. When generating you can use NEWSEQUENTIALID(), which generates a sequential GUID to avoid needing to reorganise indexes.



There are several problems, one of which is also ergonomics.

I've only read it partially (it's a very interesting read nonetheless), however, this is a key point (italic mine):

> Let’s assume a table of 1B rows having UUID values as primary key and five secondary indexes. If you read the previous paragraph, you know the primary key values are stored six times for each row. That means a total of 6B char(36) values representing 216 GB.

It assumes that MySQL users store UUIDs as CHAR(36), which is very wasteful, since a UUID actually requires only 16 bytes (128 bits).

Now, one can store UUIDs as binary blobs in MySQL; however, they are not human-readable, so one tends to store them as CHAR(36) instead, wasting 20 bytes per entry (in total, requiring 2.25 times the strictly necessary space).
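The 36-vs-16-byte difference is easy to verify from the application side (Python here purely as illustration):

```python
import uuid

u = uuid.uuid4()
text_form = str(u)      # what CHAR(36) stores: 32 hex digits plus 4 dashes
binary_form = u.bytes   # what BINARY(16) stores

assert len(text_form) == 36
assert len(binary_form) == 16              # 36/16 = 2.25x smaller
assert uuid.UUID(bytes=binary_form) == u   # lossless round-trip
```

For what it's worth, MySQL 8.0 also provides UUID_TO_BIN()/BIN_TO_UUID() to do this conversion server-side, recovering readability in queries while storing only 16 bytes.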

By supporting UUID as a native data type, the storage can use the strictly necessary number of bytes yet still maintain readability, because the RDBMS will convert the data to a human-readable form.

Additionally, MySQL's clustered indexes are subject to write amplification, which makes things worse.

I haven't read the rest of the article, which likely includes other considerations about the locality problems due to randomness; things get even more complex due to the relationship with the underlying data structures.


> they are not human-readable, so one tends to store them as CHAR(36) instead

one is an idiot


One does tend to be an idiot at times...


Postgres is somewhat different mainly because it doesn't use clustered index primary keys. So the row's position on disk is not related to the primary key index entry's position on disk.

Additionally using the less cryptographically secure uuid v1 can be a performance optimization since it has implicit time based sorting.


> Additionally using the less cryptographically secure uuid v1 can be a performance optimization since it has implicit time based sorting.

Except the way the fields are laid out basically defeats the point: UUIDv1 lays out its 60-bit timestamp starting from the lower 32 bits, so it only sorts within a ~7-minute (2^32 * 100 ns) bucket.

Hence the proposal for UUIDv6, which lays out the exact same timestamp in reverse order (most significant 32 bits first), making it naturally sortable.
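The v1-to-v6 rearrangement is purely mechanical - same 60-bit timestamp, emitted most-significant bits first, with clock sequence and node untouched. A sketch (function name is mine, Python for illustration):

```python
import uuid

def uuid1_to_uuid6(u: uuid.UUID) -> uuid.UUID:
    """Reorder a UUIDv1's split timestamp into UUIDv6's
    big-endian layout; clock_seq and node are kept as-is."""
    assert u.version == 1
    ts = u.time                               # 60-bit count of 100 ns ticks
    value = (ts >> 28) << 96                  # top 32 timestamp bits first
    value |= ((ts >> 12) & 0xFFFF) << 80      # middle 16 bits
    value |= 0x6 << 76                        # version nibble: 6
    value |= (ts & 0x0FFF) << 64              # low 12 bits
    value |= u.int & ((1 << 64) - 1)          # clock_seq + node unchanged
    return uuid.UUID(int=value)
```

Converted values compare in timestamp order as plain 128-bit integers (or as bytes), which is exactly the property a clustered index wants.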


Huh, you're right. I guess I gotta message an old coworker and tell them their "perf optimization" didn't work.

Lesson learned, thanks!


Do check that they're not using a "cheating" implementation; I think some DBs have "sequential UUIDs" which are either UUIDv1 the right way up (aka UUIDv6) internally, or another scheme which yields a sequence (e.g. MSSQL's NEWSEQUENTIALID).

Alternatively, it's possible that they created pseudo-UUIDv1s by hand, putting the data in UUIDv6 order.


Some years back (maybe 5-10?) I remember doing a test with UUIDs on Postgres, and found no speed difference between UUIDs and integer PKs.

I don't remember the parameters of the test, however.


In my testing (years ago, don't have the data) there was no detectable difference between PKs of the uuid and bigint types in my application, but there was a difference between uuid and int (int was faster). uuid is 16 bytes, bigint is 8 bytes, int is 4 bytes, so at some scale there will be a performance difference.


I've experienced some performance differences with UUIDs vs integers, but it has been minor enough to never cause any issues, even in tables with billions of records.



