This is why you should use ids that combine both a time component and a sequence.
E.g. UUIDv7 has a millisecond time component, then a field that increments for each event within the same millisecond, and then enough random bits to make collisions between IDs generated on different machines astronomically unlikely.
Of course there are only so many bits, so you might generate too many events in the same time slice and overflow the sequence, you might still get collisions between machines, and you limit your event generation speed by forcing your CPUs to synchronize on the increment, etc.
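For concreteness, here's a minimal Go sketch of that general layout (48-bit millisecond timestamp up front, version/variant bits, random fill for the rest). It is not a spec-compliant UUIDv7 generator, and it omits the per-millisecond counter discussed above:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// newV7ish builds a 128-bit ID in the general shape of UUIDv7:
// a 48-bit Unix-millisecond timestamp up front, version/variant bits,
// and random fill for the rest. A real generator would also keep a
// per-millisecond counter so IDs created in the same millisecond
// still sort in creation order.
func newV7ish() [16]byte {
	var id [16]byte
	ms := uint64(time.Now().UnixMilli())
	// 48-bit millisecond timestamp in the high bytes.
	for i := 0; i < 6; i++ {
		id[i] = byte(ms >> (40 - 8*i))
	}
	// Random fill for the remaining 10 bytes.
	if _, err := rand.Read(id[6:]); err != nil {
		panic(err)
	}
	// Version (7) and variant (10xx) bits.
	id[6] = (id[6] & 0x0F) | 0x70
	id[8] = (id[8] & 0x3F) | 0x80
	return id
}

func main() {
	a, b := newV7ish(), newV7ish()
	fmt.Println(hex.EncodeToString(a[:]), hex.EncodeToString(b[:]))
}
```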
I’m feeling a bit like an accidental time traveler, because I can recall a conversation at a tech meetup that had to have been at least ten years ago where someone was struggling with unique UUIDs because they were bursting above 1000 UUIDs per millisecond and not happy with the available options.
How old is UUID7? I can’t get the internet to tell me.
Perhaps one of many on this list[1]? ULID is listed as having 1.21e+24 unique IDs per millisecond. I have no idea, though, which of them might match up with your timeline(s).
UUIDv7 was first proposed just a few years ago, but the RFC contains a good list of well-known ULID systems that predate it.
Putting the timestamp in the high bits of random IDs is a "trick" I learned from DBAs (that used to be a thing!) in the 90s. Often the drive was to improve DB insert performance, as it was for the other uses of an ID you can order and reason about.
Yeah they had a problem where the amortized ID rate was less than the number of valid UUIDs you could generate in a given time interval, but with a peak rate well above that, so UUIDs weren't quite going to function for naming/numbering them. You gotta be pushing a lot of messages through a system to hit that, but it's possible. And adding more layers of coordination on top of something meant to reduce coordination overhead tends to... make things messy.
I've half-heartedly tried to look up his problem every time I've had to introduce UUIDs to something (most recently fixing agents not generating correlation IDs), and I have not figured out which version of UUID he was talking about. I now suspect it was some proprietary or industry-specific quote-unquote UUID spec rather than an industry wide one. I may look through the UUID7 spec at some point to see if they mention prior art. Much more plausible than him being a time traveler.
To sort or filter the records by time. Sure, you can just add an extra column if you need this, but there are cases where that is not convenient, e.g. when you want to export the records into files, name the files with the IDs, and still be able to sort them.
More importantly if you have an index on purely random IDs, then each insert will go to some random position into the tree whereas having IDs that increase with time will make all new IDs end up at the end of the tree which reduces index fragmentation.
depending on scale and architecture, either behavior can be better. it’s easier to shard when writes occur randomly over the overall space. it’s easier to coalesce when writes all happen in a given place (head or tail)
Only when writing all at once, and when you know what the shard boundaries are and the number of shards (and boundaries) is stable. If they're changing, growing, etc., you can't tell where they are predictably; random is the least likely to cause problems and allows sub-sharding dynamically.
Very large real world datasets are unlikely to be static long enough, and equipment stable enough, to not consider this effect.
> If they're changing, growing, etc., you can't tell where they are predictably; random is the least likely to cause problems and allows sub-sharding dynamically.
I'm confused by your reply, because I never suggested not to use random bits for sharding.
I'm just saying that 60+ random bits should be enough to shard, change, grow, and sub-shard with. You don't need 122.
People were talking about the value of time+random UUIDs versus all-random UUIDs, and how those behave.
You said that sometimes the random behavior is preferable.
In response to that, I was saying that even if you want to sort randomly at some particular step, you should use the time+random format, because other steps might not want to sort randomly. You should directly choose to use the random part, instead of indirectly forcing it by making the entire UUID random.
Then you said "Only when writing all at once and when you know what the shard boundaries are and the number of shards (and boundaries) are stable."
I can't figure out how that relates to my post. I thought you were worried about insufficient random bits to use for sharding, but apparently that wasn't your concern. So I have no idea what your concern is. If you have a use case for randomness, use the random half of the UUID.
There's some flexibility in how you fill in a UUIDv7, but let's go ahead and say that the ones we're worried about have the first 32 bits filled with timestamp and the last 32 bits filled with random.
If you want pure sort-by-time, then use it the normal way. If you want pure sort-by-random, then it's slightly awkward but you can prioritize the random part.
But the additional power is that you can shard by the last 32 bits, then sort by the first 32 bits within a shard. And you don't need weird workarounds like hashing the UUID.
You said "it’s easier to shard when writes occur randomly over the overall space. it’s easier to coalesce when writes all happen in a given place (head or tail)". But you can have both at the same time. You can have easy sharding and easy coalescing.
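As a rough illustration of what's being argued here (route on the random tail, keep the time-ordered prefix for in-shard ordering), a hedged Go sketch follows. The byte range and `numShards` value are example choices, not anything the format mandates:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shardFor routes an ID by its random tail (the last 4 bytes here), while
// the time-ordered prefix still sorts nicely inside each shard. numShards
// and the byte range are example choices.
func shardFor(id [16]byte, numShards uint32) uint32 {
	return binary.BigEndian.Uint32(id[12:16]) % numShards
}

func main() {
	id := [16]byte{
		0x01, 0x8f, 0x3a, 0x00, 0x00, 0x00, // time prefix
		0x70, 0x12, 0x80, 0x34, 0x56, 0x78, // version/variant + random
		0x9a, 0xbc, 0xde, 0xf0, // random tail used for routing
	}
	fmt.Println("shard:", shardFor(id, 16))
}
```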
Except you literally can't do random distribution AND be compliant with UUIDv7 if you use any sort of normal lexical sorting/indexing, as they use the start of the key as the most significant bits. UUIDv7 is literally designed to have stable lexical sorting orders, have the time as the most significant bits, and have the most significant bits of the time as the most significant bits of the key! It's their primary design criteria!
You can't 'prioritize' random parts of a key for sorting without writing a bunch of custom sorting (and key parsing) logic, which is generally undesirable for a number of reasons - and frankly completely unnecessary in these cases. You just wouldn't use UUIDv7 (or probably a UUID in general), and the benefits would pay for themselves very quickly anyway.
To quote the UUIDv7 RFC:
"This document presents new time-based UUID formats which are suited for use as a database key." (as the first line of the abstract)
"Due to the shortcomings of UUIDv1 and UUIDv4 details so far, many widely distributed database applications and large application vendors have sought to solve the problem of creating a better time-based, sortable unique identifier for use as a database key."
"- Timestamps MUST be k-sortable. That is, values within or close to the same timestamp are ordered properly by sorting algorithms.
- Timestamps SHOULD be big-endian with the most-significant bits of the time embedded as-is without reordering.
- Timestamps SHOULD utilize millisecond precision and Unix Epoch as timestamp source. Although, there is some variation to this among implementations depending on the application requirements.
- The ID format SHOULD be Lexicographically sortable while in the textual representation.
- IDs MUST ensure proper embedded sequencing to facilitate sorting when multiple UUIDs are created during a given timestamp.
- IDs MUST NOT require unique network identifiers as part of achieving uniqueness.
- Distributed nodes MUST be able to create collision resistant Unique IDs without consulting a centralized resource."
I'm pointing out that for some systems, that makes UUIDv7 unsuitable because you WANT the keys to be randomly distributed to avoid hotspots. Using UUIDv7 in these situations will result in a single node receiving all writes (and all reads for a given time range), which in the dataset sizes I'm referring to is usually impossible to handle. No single node can handle that kind of load, regardless of how efficient it may be.
For other types of systems (such as single machine databases or 'tight' clusters of databases without extreme write loads), UUIDv7 and similar is great, as it allows easy/cheap write combining when that is actually possible for a machine to handle the load.
> Except you literally can't do random distribution AND be compliant with UUIDv7 if you use any sort of normal lexical sorting/indexing, as they use the start of the key as the most significant bits. UUIDv7 is literally designed to have stable lexical sorting orders, have the time as the most significant bits, and have the most significant bits of the time as the most significant bits of the key! It's their primary design criteria!
> You can't 'prioritize' random parts of a key for sorting without writing a bunch of custom sorting (and key parsing) logic, which is generally undesirable for a number of reasons - and frankly completely unnecessary in these cases. You just wouldn't use UUIDv7 (or probably a UUID in general), and the benefits would pay for themselves very quickly anyway.
Forget prioritizing, that was about going fully random. Seriously, let's pretend I never said that specific sentence.
Let's focus on just the sharding scenario. None of what you said there conflicts with what I said about sharding.
Unless these database engines are so incompetent that you can't shard on something as simple as id[12:16]?
> I'm pointing out that for some systems, that makes UUIDv7 unsuitable because you WANT the keys to be randomly distributed to avoid hotspots. Using UUIDv7 in these situations will result in a single node receiving all writes (and all reads for a given time range), which in the dataset sizes I'm referring to is usually impossible to handle. No single node can handle that kind of load, regardless of how efficient it may be.
You only want the keys to be randomly distributed at the sharding layer. Once it reaches its home node, you don't want random distribution within that node. At best you begrudgingly accept it.
It's within a node that things like "normal lexical sorting" matter the most, so UUIDv7 does a great job of making that smooth.
You don't need lexical sorting between shards, especially when you're randomizing the shard.
The point I'm making is all these shenanigans are completely unnecessary, don't really help, and make everything extremely hard to manage, reason about, and get performance from - all to try to force usage of a specific key format (UUID) in a situation which it is not designed for, and for which it is not suited.
It's square peg, round hole.
And folks working on Exabyte sized indexed datasets generally already get this. So I'm not sure why i'm even having this discussion? I'm not even getting paid for this!
"it allows easy/cheap write combining" is not "completely unnecessary". What the heck, at least be consistent.
And it's not shenanigans! You could shard based on the first bytes of a key, or you could shard based on the last bytes of the key. Neither one should be harder. Neither one is shenanigans.
Wow a long thread of back and forth and confusion :)
Fwiw I’m with Dylan on this one!
I have direct experience of absolutely humongous data processing using random bits for shard selection where each shard uses sorted storage and benefits from the sortability of the time bits so, with just the smallest buffering, all inserts are basically super fast appends.
This is super normal in my experience. And I can’t wait for the new UUID formats to land and get widely supported in libs to simplify discussions with event producers :)
Because sometimes you want some data to be collocated, while the rest sharded.
For instance, you might use a random object ID as a prefix value in the index, followed by attribute ID which isn’t. Or a modified time, so you can have a history of values which can be read out linearly.
If using it directly, that means Objects and their data are sharded randomly across, but when looking for an objects attributes (or attribute by time), their index entries are always co-located and you can read them out linearly with good performance.
If blindly hashing keys to distribute them, you can't do that. Also, you can't really do a linear read at all, since no data will be 'associatable' with any other: the index value is randomized, and what is stored in the index has no relation to the key provided by the user.
You can only do a straight get, not a read. That is very limiting, and expensive with large data sets as most algorithms benefit greatly from having ordered data. (Well, you could do a read, but you’d get back entries in completely random order)
Needless to say, this is ‘advanced’ usage and requires pretty deep understanding of your data and indexing/write/read patterns, which is why random hashing is the most common hash map behavior.
I’ve never seen that kind of optimization on a dataset that would fit on a database server of any kind. Tens of PB or EB usually, but sometimes only several hundred TB if it’s high load/in-memory only.
Or, you could use a graph database and stop having frustrating relational impedance mismatch, nonlocality etc. You can have O(1) lookups instead of O(log N) for almost everything
That will depend on which graph database you use as a graph database might just store the graph in an underlying relational database. And it will also depend on what kind of data you have and what kind of queries you want to perform. For a graph database it might be faster to navigate along links in the graph but I would guess you will have to pay a big performance penalty if you have to operate orthogonally to your links, like aggregate across all instances of some entity.
Graph databases don't solve that. All databases (document, graph, relational) implement indexes to find specific things in exactly the same way: very well known tree, hash, and other techniques.
The representation (outside of indexing) has properties that make your USE CASE better or worse. EGreg would not be someone to hire to architect a solution. He'll just put your 1Trillion row per month use-case in a graph DB like Neo4J and you'll just watch it fall over when you run billing.
When you’re talking about data sets so large they dictate what hardware you use, and introduce terms like “cluster”, then 1 = √n
Which is why we need a version 2 of complexity theory, that doesn’t treat memory access or arithmetic on arbitrary precision numbers (aka as n actually goes to infinity) as O(1) operations. They aren’t. Which every large system engineer knows but few will talk about.
And that complexity theory already exists. Typical whiteboard engineering uses transdichotomous models to gloss over some polylogarithmic factors (as does much of the literature), but more accurate models exist.
The difference isn't usually super relevant when comparing multiple solutions all using the same model of computation though since the extra terms don't tend to bump one complexity class above another when switching models, and if you cared about actual runtime characteristics you wouldn't be relying on log factors in such a crude tool anyway.
Imagine a data center containing exabytes of data. How long does it take to access an arbitrary bit of that data?
We use clusters because computers cannot contain an infinite amount of memory, storage, or CPUs, because of physics. You see this same thing play out at smaller scales but it's more obvious at the macro scale. More addresses take logn time to sort out, but time to access is measured in radii, not gate depth.
In a world where clusters are rare, Knuth made decent approximations. In a world where clusters are not only de rigueur but hosted on multitenant hardware spread out over upwards of 100 miles, those approximations are bullshit and need to change.
Integer arithmetic is really quantized logarithmic complexity. If your hardware has a bucket your number fits into, you're calculating n+1 or n×n in constant time. But if your data set doubles in size (especially for multiplication) you may find yourself in a bucket that doesn't exist, or a more expensive one. Contemporary code is more likely to reach for bignum, which is log n, but again stairstepped by the number of machine words it takes to represent the entire number. A bigger problem when your data set is sparse, so that values grow faster than population (e.g., UUID).
But on the topic of hash tables: you can only reach 'O(1)' if you can avoid collisions. And to avoid collisions you need a key of length m, which grows as n grows. You cannot put 51 pigeons in 50 pigeonholes. So the key length of your hash keys is m >= log n, which means the time to compare keys is also log n, which means hash tables are never actually O(1) access or insert time. Actual access time for a hash table on non-imaginary hardware is √n·log n, not O(1).
If you consider that we have applications that are just a single hash table occupying the entire RAM of a single machine, then this is not picking nits. It's capacity planning.
> You cannot put 51 pigeons in 50 pigeonholes. So the key length of your hash keys is m >= log n, which means the time to compare keys is also log n, which means hash tables are never actually O(1) access or insert time.
I am not sure I am following this argument. You are not going to have more than 2^64 pigeons and pigeonholes on any system soon, and I almost dare say you will never get to 2^128. For 64- or 128-bit keys, comparisons and many other operations are for all practical purposes constant time. I guess you could argue that this sweeps a factor of log(n) under the rug because of things like carry chains, which could be faster for smaller bit sizes, but I am not sure that matters on common hardware: an addition takes one clock cycle independent of the operand values.
Sure, keys can be and often are longer, but not because you were forced to use long keys; it just happens that the thing you want to store in a hash table is a long string. The way you worded it made it sound like one has to use long keys, not that in practice one often has to deal with them. But even then I am not convinced this should add a factor to the complexity analysis; I think I would argue, at least in some situations, that hashing long keys should still be considered constant time, even for the longest keys. But I can also imagine taking it into account if the key length is not only large but also highly variable.
Look, if you have N items related to X, at insert time, you store them in an array and have X point to that array, instead of foreign keys.
For example, when a user has 7 articles. Do you want to just point to where the articles are stored? Or do you want to do O(log n) lookup for each article?
And if you have many-to-many, do you want to join an Intermediate Table for even more processing, or just follow a pointer to a range of an intermediate node directly and traverse?
What about when you delete rows? Do you just leave the space unused now? Or if you update a row to be larger? Rewrite the whole array (so possibly O(n) updates)?
How do you deal with data that gets accessed in different orders based on relationships from multiple other entities? Duplicate the data? If so, updates now get amplified and you can fit less data in RAM so you're more likely to require disk IO. If not, you need a layer of indirection (so you have an array of pointers instead of an array of data).
Even with a layer of indirection, updates that grow a row and require a reallocation will cause you to have to go update all pointers (also, those pointers need to be indexed to find who you need to update). To avoid write amplification, you can use an array of ids instead of an array of pointers. Now you want an id <-> pointer lookup, which could be done with a hash map in O(1), or with a tree in O(logN) if you want to also allow efficient range queries.
I'm not exactly sure on the implementation details, but for ballpark numbers, for an 8 kB page size with 80% fill factor and 16 B entries (8 B key + 8 B pointer), with 10E9 rows, log(N) ought to be ~4 (page IOs). Not ideal, but the point is for a lot of real-world engineering, O(logN) is effectively O(1) (with a small enough constant factor that it's fine for general use).
So if your data is not write-once, trees are a well-rounded default data structure.
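A quick back-of-envelope in Go for the numbers above (8 kB pages, 80% fill factor, 16 B entries, on the order of ten billion rows); the figures are the parent comment's assumptions, not measurements:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	pageSize := 8 * 1024.0 // 8 kB pages
	fill := 0.8            // 80% fill factor
	entry := 16.0          // 8 B key + 8 B pointer
	rows := 10e9           // ten billion rows

	fanout := pageSize * fill / entry // entries per page
	depth := math.Log(rows) / math.Log(fanout)
	fmt.Printf("fanout ≈ %.0f, depth ≈ %.1f levels (page IOs)\n", fanout, depth)
}
```

With a fanout of roughly 400, the depth works out to about 4 levels, matching the "~4 page IOs" estimate.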
One benefit of keeping it separate is that you can choose the precision of the timestamp necessary. "Millisecond precision" is arbitrary and commonly insufficient.
Lucky for you, they also define UUIDv8 as a free-for-all where you can do whatever you want, and nanosecond precision is one of the examples given in the RFC.
Otherwise, failure recovery requires robust storage. Upon startup, you just wait until your timestamp ticks over, and then you know you're not re-issuing any UUIDs.
With a pure counter-based system, you need robust distributed storage and you need the machine to reserve batches of IDs via committing writes to the robust distributed storage. Otherwise, a disk failure may cause you re-issue UUIDs.
Though, seeding a thread-local AES-256-CTR or ChaCha20 instance via getrandom()/getentropy() is probably a better fit for most situations. If you need sequential IDs for database performance, seed a simple 128-bit counter using getrandom()/getentropy(). If you're going to need more than 2^40 unique IDs, then use IDs larger than 128 bits.
Assuming getrandom()/getentropy() is getting you a high-entropy seed, it's easy to size the IDs to bring the probability of collision much lower than the probability of background radiation flipping a bit and breaking your equality test between IDs. If you're not running redundant calculations and/or on radiation-hardened hardware, it's difficult to argue against randomized IDs.
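For illustration, a minimal Go sketch of the "just ask the OS CSPRNG" approach; Go's crypto/rand reads from the kernel (getrandom, getentropy, or an equivalent on most platforms), and the 128-bit size is the assumption from the comment above:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newRandomID returns a 16-byte (128-bit) random identifier. Go's
// crypto/rand reads from the operating system's CSPRNG.
func newRandomID() [16]byte {
	var id [16]byte
	if _, err := rand.Read(id[:]); err != nil {
		panic(err) // no usable entropy source
	}
	return id
}

func main() {
	id := newRandomID()
	fmt.Println(hex.EncodeToString(id[:]))
}
```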
> With a pure counter-based system, you need robust distributed storage and you need the machine to reserve batches of IDs via committing writes to the robust distributed storage. Otherwise, a disk failure may cause you re-issue UUIDs.
Yes, a pure counter based system is probably worse than one that uses time. Don't use counters, especially not in a distributed setting.
I do feel 74 bits per second is enough though. That would require 2^(74/2) = 137 billion UUIDs generated in a single second for a 50% chance of a single collision.
The logistics of combining fields in indexes and identifiers is relatively complex, while the logistics of indexing a single field is comparatively trivial. This is also why you don't ship timestamps using separate fields for second/minute/hour/day/month/year, but a single ISO-string or UNIX timestamp as representation: it makes querying and interpreting the value more consistent.
It is not just the DDL of the primary table that needs to care about it, but also all foreign keys, DML, queries, APIs accessing the data, etc. Storing that UUIDv7 is likely going to be cheaper than pushing the cost of keeping a composite identity onto other components and systems that work with that data.
Well, that depends on how you manage it. An ORM will find it trivial. Custom SQL, not always so much.
That said, PostgreSQL offers https://www.postgresql.org/docs/current/rowtypes.html which gives you the best of both worlds. A single field which contains 2 fields that can later be extracted. So everywhere you can write a single field for joins, etc. But then when you need it broken out...
Any pure relational database design will eschew surrogate keys - most real-world systems will (should) add them back - because a surprising number of good natural keys end up changing (names change, phone numbers change, emails change, twitter handles become irrelevant/disappear, location of birth may change subject to geographic regions changing size...).
And on top of all that, there are efficiency concerns.
That said, at least for join tables, AFAIK there isn't often a need for row IDs beyond the involved foreign keys, unless you need to add metadata from other relations/tables...
My British passport says I was born in Birr; my Irish passport says I was born in Galway. These are both correct, because they are answering different questions. (I was born in the town of Ballinasloe in County Galway, but my mother's place of residence at the time of my birth was the town of Birr in County Offaly.)
That's my memory of it. My British passport is expired, and is probably somewhere in my parents' house. I don't actually know what it says on it. I do clearly remember that at least one British form of ID I had at one time recorded my mother's place of residence rather than my actual birth place. I thought it was the passport. Maybe I was wrong.
I don't know why people use relational databases other than that they were first and "that's the way it's always been done".
Why not use a graph database? O(1) lookups instead of O(N). Why need indices if you can just point to the data. Why use JOINs when map-reduce querying is far more flexible?
ISAM and VSAM come to mind. Yes it says "files" in there a lot but it got used like a database with programming interfaces like finding records in a database. If you will this method is more like NoSQL databases than a relational DB. The I in ISAM was a step towards not having to know the key (true "NoSQL"). Kind of like today's NoSQL databases all also give the ability to add indexes now.
I've been interested in learning more about them and how to best utilize them in my company. What graph database and query language would you recommend (regardless of stack)?
> My intuition let's me know that you can not get below O(n log n) (Lower limit for comparison based ordering)
That intuition should point at O(log n), shouldn't it?
In any case, it totally depends how your data is stored and how/what you want to look up.
If you already have some kind of id of your node in the graph you want to look up, you can get O(1) lookup and still call it a graph database. If you have to traverse the graph, then it depends on the structure of the graph and where your entry point is, etc.
I'm rather skeptical of graph databases. Whatever they can do, you can do with a relation database and datalog. (Think of datalog as SQL plus recursion, perhaps.) See https://en.wikipedia.org/wiki/Datalog
When you sort them they will be ordered by generation time. This is useful in UIs, and greatly improves performance in databases (how much depends on DB and use case).
For our “session” records, it’s a UUIDv7. This sorts beautifully, and if I wanted to, I could look at a log and easily see all the entries in a particular session.
For a larger db, we just need unique entries and at least in Dynamo, it is an advantage to equally distribute them as much as possible. UUIDv4 there.
IIRC you don't want to spend entropy that you don't need. The advantage of using the time is that you need to spend fewer bits on expensive randomness, and thus generation is cheaper.
I only work in embedded, so I’m not sure about fancy computers - but when we make UUIDv7s, I am certain the random is faster than pulling the time, formatting for milliseconds, then storing in position.
And they play nicely with sort ordering in popular databases like postgres!
They’re not in tree yet but there are a bunch of awesome pg extensions that provide access to uuidv7 (outside of just using it at the application level)
The problem with UUIDs is they are completely unreadable. Not just you can't understand them, it is prohibitively hard to even distinguish between them visually. This is why identifiers which would not include any bit of information (or noise) beyond the necessary minimum are useful in some cases.
Don't even need the "same millisecond" part to save a few cycles, use case depending. An overflowing increment with a counter of any sort plus a few random bits is usually enough.
Yes, but only on single machines, UUID and co are intended for distributed systems.
Although now I wonder if / how UUID v7 can do sequential keys on distributed systems. Mind you, on those systems "close enough" will probably be good enough, and sorting will be done by date instead of incremental ID.
Should you, though? BLE itself uses UUIDs, but the ones it uses are in a shortened format just because the BLE frames are small and the protocol is not too fast.
Granted I've even used JSON over GATT so maybe I should keep my mouth shut :).
> This is why you should use ids that combine both a time component and a sequence.
Computers should run like clockwork. In this example of using all the cores: in Windows, and likely some other OSes, threads are assigned to cores when they are started, and you can have many threads per core; ergo, on multi-core systems the time component should also be combined with the thread and core number.
It's possible to write the code in such a way that some variables are bounced out of the CPU cache back to memory to avoid caching issues, because I think the caches on some CPUs are per core and on others per CPU.
> threads are assigned to cores when they are started
Do they? I thought the normal behaviour was for cores to pick any available thread to run, so core migration is quite normal.
> ergo the time component should also have the thread and the core number combined with it with multi core systems.
Sorry, how exactly does it follow from the previous? You seem to have omitted the other half of your syllogism. After all, clockworks do not have thread nor core numbers so I don't quite see how having those in UUIDs will make computers run like clockwork.
I used to be the program manager owner of the security event log in windows.
When things happen simultaneously or very closely in time on multi-core systems, thread scheduling can significantly affect observations of those things. For example, your thread quantum might expire before you get to the syscall to get a time stamp, or before you can pass a buffer to queue your event for subsequent time stamping.
In fact, on multiprocessing systems it was very common to see out-of-order event log entries on Windows back in the day (the 2000s). You also could not count on the log timestamp accuracy too precisely; 1 s was pretty much the safe lower bound (some components truncated or rounded timestamps, IIRC).
If you want unique identifiers, use version 4 (random) UUIDs. Problem solved.
The probability of a collision is roughly the same as the probability of a fully grown dinosaur spontaneously manifesting in your bedroom due to quantum fluctuations.
More seriously, if you can use them, good old increments are probably best. They are fast and cheap, especially in a database. They can have privacy/security issues (you could guess things from the values of IDs). UUIDs are better in those cases, or when you deal with a distributed system.
If you're going to do that then you might as well just use a UUID, since you effectively reintroduce its negative aspects (an infinitesimally small chance of collisions, the computation involved in generating it, etc.)
The difference is that you can still use sequential IDs internally, while exposing hashed IDs to the outside. This protects your database from collisions under all circumstances, while in the absolute worst case, a single user might experience bugs because two external IDs collide.
This is a weird proposal. If you're using non-hashed IDs internally and exposing hashed IDs externally, you are going to need to map those (securely hashed) ids back to internal ids when the client hands them to you.
I guess you could do this with complete table scans, hashing the ids and looking for matches, but that would be horribly inefficient. You could maintain your own internal reverse index of hash -> id but now I have to ask what's the point? You aren't saving any storage and you're adding a lot of complexity.
Seems like if you want random unguessable external ids, you're always better off just generating them and using them as primary keys.
Also, you aren't protecting your database "from collisions under all circumstances" - there's no guarantee your hash won't collide even if the input is small.
Yes, it is more reasonable to use encrypted IDs externally from structured/sequential IDs internally, not hashed IDs. Recovering the internal ID from the external ID is computationally trivial since it will fit in a single AES block and you don't have to worry about collisions.
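A rough Go sketch of that idea, assuming a 64-bit internal sequence and a placeholder AES-128 key (in practice the key would come from a secret store): the internal ID is padded into one AES block and encrypted, so the mapping is collision-free and reversible with the key.

```go
package main

import (
	"crypto/aes"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// Placeholder 128-bit key for illustration only.
var key = []byte("0123456789abcdef")

// externalID encrypts a sequential internal ID into one AES block.
// Because AES is a permutation on 128-bit blocks, distinct internal IDs
// always map to distinct external IDs, and the mapping is reversible
// with the key.
func externalID(internal uint64) string {
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	var in, out [16]byte
	binary.BigEndian.PutUint64(in[8:], internal)
	block.Encrypt(out[:], in[:])
	return hex.EncodeToString(out[:])
}

// internalID reverses externalID.
func internalID(external string) uint64 {
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	raw, err := hex.DecodeString(external)
	if err != nil || len(raw) != 16 {
		panic("bad external id")
	}
	var out [16]byte
	block.Decrypt(out[:], raw)
	return binary.BigEndian.Uint64(out[8:])
}

func main() {
	ext := externalID(42)
	fmt.Println(ext, internalID(ext))
}
```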
Yes, I tend to like this philosophy in database design, of internal sequential ids which are used for joins between tables etc. and an exposed "external reference". But I typically would use a UUID for my external reference rather than a hash of the internal id.
Doesn't that just add a whole lot of unnecessary complexity? If elements have multiple IDs, one of which should not be leaked to the outside, that's just asking for trouble in my opinion.
Is generating UUIDv4 or UUIDv7 really too much effort? I'd assume that writing the row to the database takes longer than generating the UUID.
It also means once your hash function leaks for whatever reason or gets brute forced because of whatever weird weakness in your system, it's game over and everybody will forever be able to predict any future ids, guess neighboring ids, etc., unless you're willing to change the hash and invalidate all links to any content on your site.
If I'm in a scenario where I think I need consecutive ids internally and random ones externally, I'll just have two fields in my tables.
Store just the sequential id, compute the hash on the edge.
This keeps your database simple and performant, and pushes complexity and work to the backend servers. This can be nice because developers are typically more at home at that layer, and scaling the backend can be a lot easier than scaling your database. But it also comes with the downsides listed in this thread.
Good point. Back when we did that we just used a reversible hash function (some would call it encryption). There are some simple algorithms meant for encrypting single integers with a reasonable key.
I might be misremembering, but didn't YouTube do this in the early days? So yeah, that was what I was thinking of when replying, not a traditional hash function.
Hum, no. You can easily hash numbers from 1 up to the value you see and guess the next value.
If you want a secure identifier, make a random 64 or 128 bits number (a UUID type 4). And do not use this number as an internal identifier, because identifiers performance is all about predictability and low entropy.
Unless the RNG itself is the problem, there's no reason not to. All kinds of civilization-ending catastrophes are vastly more likely than a collision in a space of size 2^122.
> The probability of a collision is roughly the same as the probability of a fully grown dinosaur spontaneously manifesting in your bedroom due to quantum fluctuations.
I'd love to see the math for the probability of the second option.
> The probability of a collision is roughly the same as the probability of a fully grown dinosaur spontaneously manifesting in your bedroom due to quantum fluctuations.
But now the probability of bad things happening increased by about a factor of two, which is not acceptable.
Knowing nothing about UUID v4 generation, I have likely a stupid question. What makes you so confident that all implementations and their entropy sources are flawless enough to make actual collision probability close enough to theory?
We are not confident in any of that. Which is why we mitigate it with checksums, ECC etc.
So depending on the consequences you might opt to reduce the risk or help mitigate the consequences.
To just state that it is about as likely as my coffee maker being a portal to the future isn't very helpful. Poor entropy sources or bugs are not uncommon.
Despite the resolution being nanoseconds, what is the actual precision of computer clocks? I can't imagine it is actually nanoseconds. Takes me back to teaching physics labs where I had to hound students to remember that the accuracy of their measuring device is not identical to the smallest number it displays...
For devices clocked above 1 GHz it's perfectly possible for the clock to increment every ns, although that doesn't make it accurate to that level, and multi-core systems may have clocks that are not synchronised to that level.
ARMv8 guarantees that its clock increments at at least 1 GHz; for Intel and earlier ARM it's more complicated.
Cycle counting is one thing, but it gets tricky when you have frequency scaling in play. Another problem that even without freq scaling, cpu clocks are not designed to be super accurate/exact, and the true frequency might vary significantly even if the nominal freq is fixed
I think on newer x86 chips the ‘cycle’ counter increments at a ~constant rate, rather than changing with variable cpu frequencies. There’s a cpuid flag for this and I think Linux exposes it in procfs somewhere too. Older chips do have the problem you describe, as well as problems with cycle counts diverging (or never being close) between cores/sockets. It still isn’t exact in the way you describe. The most reasonable thing you can do is regularly calibrate based on an actual clock (whose rate also gets calibrated based on ntp…) to have a cycles<->ns conversion factor.
It doesn't matter much that they're not super accurate or exact; they do in fact count ticks and tocks just fine. I wonder if they are counted equally (interesting question, if I do say so myself about my own question). I know a rising edge is steady with respect to the next rising edge, and a falling edge with respect to the next falling edge, but I think their balance with respect to each other is a specification figure of merit. And Intel, despite its many failures (long forecast and long overdue, given the sheer difficulty of their business model and how long they kept it going; they were already saying it was going to end soon in the late seventies), you know what, I'm fine with that. They don't make old chips rot the way other software companies I could mention do; bits rot, but glass is timeless. Which brings me back to the point: in my analysis the problem is not the clocks being inaccurate but rather the jitter, which means a single run will not suffice to describe a repeatable, exact clock time for, e.g., an inner loop, which is what is worth measuring. The minimum jitter attainable currently is 1 cycle, so you run the same code repeatedly and take the minimum, with more repeatability as a consequence of that low jitter.
In the early nineties it was not so: you'd get the same number of clock cycles again and again and again.
And then it gets tricky, because cycle counts are thresholds set within which, if voltages and frequency are proper, the operation will complete deterministically with a soft error rate no greater than that of the system as a whole, about one per 30 years of hammering on the chip at sea level. Which is not enough for my tastes, and the jitter is a fucking mess.
I much prefer the GA144: least jitter of any platform, bar none. That makes sense because it is fully asynchronous logic; there is no clock anywhere in sight until you connect it to an oscillator, and even then the vibration doesn't rock the system the way one grandfather clock synchs with another whose pendulum swing is aligned with its own. On the GA144 it's pretty easy to tell the average-case complexity of a function down to tens of picoseconds, at which point you have to check that there's no open window letting a draft in. In fact the time trial will tell you such a draft is coming into the house; it happened to me. While not from a parallel universe in the spatial dimensions, it is by all means from another universe in the time dimension.
> In a strictly monotonically increasing sequence of values, all values that have a predecessor are larger than its predecessor.
Strictly monotonic values imply some synchronization / coordination, with an associated performance impact when there are many concurrent processes. This functionality is available via the https://www.erlang.org/doc/man/erlang.html#unique_integer-1 function. With a warning associated with it:
> Strictly monotonically increasing values are inherently quite expensive to generate and scales poorly. This is because the values need to be synchronized between CPU cores. That is, do not pass the monotonic modifier unless you really need strictly monotonically increasing values.
Relatedly, Erlang's own refs aren't created by a strictly-monotonic global generator; but rather are internally a pair of a regular monotonic identifier, and the PID of the requesting process. In other words, they're akin to UUIDv1s, or to https://en.wikipedia.org/wiki/Snowflake_ID s.
You only really need strictly monotonic global identifiers if you need immediately-consistent first/last-write-wins. If you can instead rely on eventually-consistent first/last-write-wins (i.e. if your write events enter an event store/queue that linearizes them by ID, and then all "simultaneous" writes but the one with highest ID priority can be dropped/ignored, either during processing, or on read), then I'd recommend first considering packed (nodeID, seq) pairs. And, if you want global event orderability, considering the snowflake ID formulation (timestampMajor, nodeID, timestampMinor, seq) specifically.
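As a sketch of that snowflake-style packing, a minimal Go example follows. The field widths (41-bit millisecond timestamp, 10-bit node ID, 12-bit sequence) and the custom epoch are illustrative assumptions, and a real generator would also track the last timestamp it used and bump or wait on sequence rollover:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative field widths: 41 bits of milliseconds since a custom epoch,
// 10 bits of node ID, 12 bits of per-millisecond sequence.
const (
	nodeBits = 10
	seqBits  = 12
)

// Hypothetical custom epoch (2020-01-01 UTC).
var epoch = time.Date(2020, 1, 1, 0, 0, 0, 0, time.UTC).UnixMilli()

// packID packs (timestamp, nodeID, seq) into a single sortable uint64.
func packID(nowMilli int64, nodeID, seq uint64) uint64 {
	ts := uint64(nowMilli - epoch)
	return ts<<(nodeBits+seqBits) | nodeID<<seqBits | seq
}

func main() {
	id := packID(time.Now().UnixMilli(), 7, 42)
	fmt.Printf("%d (0x%016x)\n", id, id)
}
```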
Comment out CLOCK_MONOTONIC_RAW because that's not available on FreeBSD, and it looks OK to me? My understanding is that we should see some timestamps repeating if there's a collision, correct? I can't get any collisions...
I'm running it on real hardware (Ryzen 9 5900X, 12 cores 24 threads) and all I'm doing is executing the program once. My understanding is that the program is using runtime.GOMAXPROCS(0) to ensure it's using all available CPU cores because that falls back to runtime.NumCPU ?
Oh, I was only running the C program not the Go binary which orchestrates it, that's why I wasn't seeing the longer stats. But I'm still getting no duplicate collisions.
running longer zeros test ...
sampled 10000000 pairs; 0 time diff zeros = 0.000000%; 0 nano diff zeros = 0.000000%
starting parallel test 24 goroutines x 10000000 samples ...
10000000 samples from a thread; 0 collisions inside the thread; 0 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 945466 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 1982688 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 2739919 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 3334361 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 4109772 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 4489881 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 5157178 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 5596508 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 5854763 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 5937583 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 6434076 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 6521917 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 6932626 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 7104428 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 7285076 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 7514904 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 7833317 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 7737265 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 7723545 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 7730441 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 8311161 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 7965117 collisions with other threads
10000000 samples from a thread; 0 collisions inside the thread; 8460173 collisions with other threads
102297835 final samples; 137702165 total collisions = 57.375902%; possible duplicate collisions? 0
At some point doesn’t this come down to the ISA? A CPU running at 3GHz gets 3 clock cycles per nanosecond.
I bet there is a fair amount of optimization in the compiler that leads to back to back assembly calls of reading the clock register. If subsequent time.Now() calls happen within 3 clock cycles of each other, can you really fairly expect unique nanosecond precision…
It takes like 20 cycles to read the cycle count register on modern chips. Even if collisions are somewhat unlikely, it’s still much worse to get several a day than ‘almost never going to happen’.
Reminds me of some Lotus Notes folklore. Apparently they used to use timestamps with a resolution of 1 second as unique IDs for things. When there was a collision, they would just add 1 second. Eventually, you'd end up with items being in the future because there would be so many collisions.
Silly Rabbit - absolutely accurate times are a security problem. CPU designers (even way back with the Alpha at DEC) intentionally introduced clock jitter, just to prevent total predictability. For x86, I think if you performed 3-4 timestamp reads back to back, saved the values into registers, and then upon completion reported those values, you would find that the time deltas are NOT exactly the same.
Do you have any sources for this? My googling skills are failing me. I'm surprised early x86 (which I assume you're including) were aware of security issues with accurate clocks; I certainly wasn't until this millennium :D I would rather guess observed clock jitter would be explained by interrupts or some such. Not saying you're wrong, I'd just like to learn more.
I’ve met too many people who are surprised by millisecond or microsecond timestamp collisions.
My most memorable and least favorite variety of this is when people try to assemble a timestamp from two system calls, one for the most significant digits and a second for the least.
If the smallest digits roll over from 99x to 00x after you read the large digits, due to process preemption, you can create a time stamp for an entity that happens before the entity that caused it to exist. This breaks some code spectacularly (I’ve seen infinite loops at least twice).
If you haven’t memorized this as a thing to always avoid, you end up with tests that pass 99.5% of the time, and it can take someone with very good pattern matching skills to catch that the same test has been red once a week for a month and a half, which is a very long time for a logic bomb to live in CI/CD code before getting fixed.
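A hedged Go illustration of that failure mode: assembling a timestamp from two separate clock reads (big digits first, small digits second) can produce a value that goes backwards if the boundary rolls over between the reads. It may take many iterations (or an unlucky preemption) to trigger:

```go
package main

import (
	"fmt"
	"time"
)

// buggyStamp assembles a timestamp from two separate clock reads: whole
// seconds first, then the sub-second part. If the second rolls over between
// the two reads (e.g. the goroutine gets preempted), the result can be
// earlier than a timestamp produced before it.
func buggyStamp() int64 {
	sec := time.Now().Unix()        // first read: whole seconds
	nsec := time.Now().Nanosecond() // second read: sub-second digits
	return sec*1e9 + int64(nsec)
}

func main() {
	prev := buggyStamp()
	for i := 0; i < 1_000_000; i++ {
		cur := buggyStamp()
		if cur < prev {
			fmt.Println("timestamp went backwards by", prev-cur, "ns")
		}
		prev = cur
	}
}
```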
The problem is achieving locality in cohesion/meaning, which usually involves locality in time, which provokes using timestamps as part of the ID at the very least. But it's a chain of very lazy thinking IMHO, and it's like a peek at the house of cards that some large systems are built on.
Our Go ULID package has millisecond precision + monotonic random bytes for disambiguation while preserving ordering within the same millisecond. https://github.com/oklog/ulid
If you need unique nanosecond, keep track of the previously generated one and increase it if necessary. Would require global lock or atomic stuff, but should be good enough for practical uses.
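A minimal Go sketch of that approach, using an atomic compare-and-swap so concurrent callers never get duplicates:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var last atomic.Int64

// uniqueNano returns a strictly increasing nanosecond value: the current
// clock reading, or one more than the previously issued value if the clock
// hasn't moved past it (a collision, or a clock that stepped backwards).
func uniqueNano() int64 {
	for {
		now := time.Now().UnixNano()
		prev := last.Load()
		if now <= prev {
			now = prev + 1
		}
		if last.CompareAndSwap(prev, now) {
			return now
		}
		// Another goroutine won the race; retry with the fresh value.
	}
}

func main() {
	a, b := uniqueNano(), uniqueNano()
	fmt.Println(a, b, b > a)
}
```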
"Logical Physical Clocks" may be of interest. The timestamps are monotonic and don't require atomic clocks like in google spanner.
HLC captures the causality relationship like logical clocks, and enables easy identification of consistent snapshots in distributed systems. Dually, HLC can be used in lieu of physical/NTP clocks since it maintains its logical clock to be always close to the NTP clock. Moreover HLC fits in to 64 bits NTP timestamp format, and is masking tolerant to NTP kinks and uncertainties.
UUIDs are designed to solve a slightly different, albeit related, problem: where you don’t have synchronisation of the system as a whole. The solution the GP is solving is for systems that are synchronised and therefore generating a UUID is additional and unnecessary overhead.
This is why it scares me a bit to use a raw timestamp as a sort key in DynamoDB. I append a (random) unique ID to the timestamp text to avoid it. Better safe than sorry, I figure.
I have a use case that uses a similar solution, but for a different reason: it doesn't scare me to use timestamp as sort key I since I know my hash keys are unique and I only write every few seconds. But I still add random amount of ms (well under the frequency of writes/updates), because otherwise I'll be hitting hot partition issues on the GSI (indexed by timestamp, as you can guess).
We ran into this, though we were using millisecond timestamps with the Number type.
Ended up working around it by adding numbers after the decimal point. DynamoDB apparently supports up to 38 digits of precision, and JavaScript (we're NodeJS based) by default encodes numbers to allow for up to 16 digits (combined, before and after the decimal point). A UNIX epoch millisecond timestamp is 13 digits, so we were able to use 3 just to add uniqueness without having to update anything else in our codebase.
We certainly now plan to essentially always use string-based (and generically named) partition and sort keys, which would allow us to more easily do what you're describing.
I was going to post about "use a UUID", but I was surprised to learn that no UUID uses both timestamp + a random component. You can either get fully random with UUID4, or have a time + MAC based UUID with UUID1. Strange, I would have thought there would exist a UUID that uses time + random to minimize collisions like described in the post.
It's not uncommon for standards to provide a way to do it, because it will happen whether you want it or not. See e.g. HTTP headers starting with X- or MIME types under application/vnd.
> I was wondering: how often do nanosecond timestamps collide on modern systems? The answer is: very often, like 5% of all samples, when reading the clock on all 4 physical cores at the same time.
A few things to consider:
1. This would depend a lot on how regularly you are checking, the more regular, the more collisions.
2. There may also be some weirdness where threads doing similar tasks will synchronise or desynchronise to maximize throughput.
3. Your system may not offer accurate nanosecond time. For this reason some OSes tend not to offer lower than microsecond time, as the clocks simply cannot offer higher usable accuracy. You're also measuring the time to call the function(s), create the timestamp, etc.
A simple solution I had years ago was to add random extra precision to reduce collisions. That breaks most tie situations. If you have 5% ties at nanosecond precision, with random picosecond digits your collisions are down to 0.005%. You could also seed the random offset with an ID.
Let me tell you about the time the performance guy figured out one of the big cycle eaters was that every CPU in the SMP system was trying to read the time of day from the same place in memory, so they "fixed" that by giving every CPU its own copy of the current time ...
The high resolution precision time counters are derived from the system base clock, usually operating at ~33MHz, which translates exactly into that 30ns granularity observed.
If you really want robust time-derived timestamp identifiers, truncate the high-resolution timer to at best 10µs resolution and replace the low bits with a hash of them, concatenated with the value of the CPU cycle counter (`RDTSC` on x86).
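A rough Go sketch of that construction. Since portable Go can't read the CPU cycle counter directly, a process-wide atomic counter stands in for RDTSC here, and the 10µs truncation and 16-bit hash field are illustrative choices rather than anything prescribed:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync/atomic"
	"time"
)

// Stand-in for the CPU cycle counter: reading the real RDTSC from Go needs
// assembly or a platform-specific package, so an atomic counter plays that
// role in this sketch.
var pseudoTSC atomic.Uint64

// stampID truncates the wall clock to 10µs buckets and fills the low 16 bits
// with a hash of the discarded sub-bucket bits concatenated with the
// cycle-counter stand-in.
func stampID() uint64 {
	now := uint64(time.Now().UnixNano())
	coarse := now / 10_000 // 10µs buckets
	fine := now % 10_000   // bits we are about to throw away
	tsc := pseudoTSC.Add(1)

	h := fnv.New64a()
	var buf [16]byte
	for i := 0; i < 8; i++ {
		buf[i] = byte(fine >> (8 * i))
		buf[8+i] = byte(tsc >> (8 * i))
	}
	h.Write(buf[:])
	return coarse<<16 | (h.Sum64() & 0xFFFF)
}

func main() {
	fmt.Printf("%x\n%x\n", stampID(), stampID())
}
```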
The HRT patch set for Linux (which is eight years old) claims that Linux had a 10μs clock resolution before the patch, and conversations on SO suggest the resolution is now 1ns, so I believe this information is dated, or at least OS dependent.
I don’t think this is right. I am pretty doubtful of the discussion of the granularity of the results of clock_gettime, but I failed to find any sources and I don’t have a suitable computer to hand to experiment.
But here are two more doubts:
1. On at least some systems, clock_gettime will be implemented by a vdso call that (on x86) uses rdtsc to give you the time, so you should expect its results to be pretty highly correlated with an rdtsc immediately afterwards, so you aren’t really adding useful entropy (whereas you are losing time-locality of your identifiers which may be desirable if they are to end up as primary keys somewhere)
2. On some systems (eg some AMD processors), rdtsc gave me pretty low precision results (eg rounding to the nearest 10 cycles) so you also won’t necessarily get the entropy you would expect from the description of the instruction. I failed to find a reference for this too apart from an offhand mention in a paper about timing attacks.
Because the raw value from the timestamp will be very low entropy and have the short-scale variation concentrated in just a few bits. A hash not only destroys information, it also creates entropy by mixing that information over all the bits that come out of it. And by using a 64-bit hash to replace a 64-bit nanosecond counter that has short-term entropy of less than 16 bits, you're in fact reducing the likelihood of a collision by a factor of 2^48.
The hash, in this case, is just one deterministic way to shorten the CPU counter to a few bits which can then be used to increase the entropy of the timestamp by replacing the timestamp's stale bits. What's being asked here is not whether using some compressed bits of the CPU counter increases the entropy of the timestamp overall. Rather, it's why you'd use a hash of the entropy-providing information to do this, since hashes allow for many:1 mappings (i.e. allow for decreases in entropy) while never providing better than 1:1 mappings (i.e. preserving entropy). At best, the hash preserves the entropy of the timestamp and counter bits and, at worst, it throws some away. The question that follows: is there a better way that preserves the entropy of the inputs without the risk of throwing some away and, if so, was there some other reason to still use the hash? This is what I took amelius to be asking.
Knowing the timestamp delta is ~30ns, even a 1 THz processor would only execute 30,000 cycles before said timestamp increases, so the outcome stays unique anyway. From that perspective, taking the low 16 bits of the cycle counter is already guaranteed to help produce perfectly unique timestamps, and without the risk of introducing hash collisions. It's also significantly easier to compute and is now monotonic*, whereas hashes were neither. From that perspective, I too don't see what value the hash is supposed to add to the information being fed to it in this case.
In threaded/distributed/parallel scenarios you may wish to throw the lane/core/node number in the bits as well. In the case having the same timestamp is supposed to be invalid this leaves you a simple deterministic way to tiebreak the situation as part of the creation of the timestamp itself.
*A 10 GHz CPU running for a little over 58 years could risk a 64 bit counter rollover in the middle of a time delta. If that's too close for comfort or you want the code to work on e.g. 32 bit counter systems, you'd need to eat more cycles and registers to set another bit to the outcome of whether current cycle count is < the one at the start of the current timestamp delta.
Yeah, when I was thinking about the hash, I thought of it as stuffing to fill the unused portion of the number that would look better than using zeroes.
TIMESTAMP010111101010101TSC "looks better" than TIMESTAMP000000000000000TSC even though they contain the same information.
I would drop the hash, it's deceptive.
I don't believe it breaks the monotonicity, though? I mean, it would if it weren't already broken. If you're taking the low 16 bits of the TSC, then a rollover in those 16 bits during the same timestamp will already go backwards. TIMESTAMP0...0 follows TIMESTAMPf...f.
I guess it depends which portion you look at. If you solely look at the time based portion you do indeed still have a function which never decreases but that is true even in the case of reading the raw counter whole on its own anyways. If you look at the whole value, including the hashed portion, it's no longer monotonic.
In the cycle-based case, looking at the whole value is the same thing as looking at a relative timestamp which has more precision than the system clock. In this way, it's "truly" monotonic across the entire value, not just monotonic on one part and unique in another.
Side topic: It also comes with an even stronger guarantee of always increasing instead of just "not changing direction". Is there a name for that?
It's a nitpick, but it concentrates the entropy. It doesn't create any.
I do think it will make the answer clearer, as the hash concentrates the less-than-64 bits of entropy in those 128 bits of original data into a usable 64-bit package.
Actually hashes do create entropy (every computation creates entropy in some form or another). What's the entropy of a 4 bit number? What's the entropy of a 4 bit number hashed by a 64 bit hash function? The act of computation does in fact create entropy, as per the 2nd law of thermodynamics, a part of which shows up in the hash.
I don't think you understand what this conversation is about. We are considering information theoretic entropy, not thermodynamic entropy from the mechanism of computation itself.
The result of applying a deterministic function on a random variable cannot have more entropy than the underlying random variable. This is a theorem, one that is trivial enough to not have a name. But you can find solution sets to homework that will prove it for you: https://my.ece.utah.edu/~rchen/courses/homework1_sol_rr.pdf
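For completeness, a sketch of the standard two-line proof, with H denoting Shannon entropy and f a deterministic function:

    H(X, f(X)) = H(X) + H(f(X) \mid X) = H(X) \qquad \text{since } H(f(X) \mid X) = 0 \text{ for deterministic } f
    H(X, f(X)) = H(f(X)) + H(X \mid f(X)) \ge H(f(X))
    \Rightarrow\ H(f(X)) \le H(X)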
60 bits. Yes, I know, you can compress it down very well. But consider that entropy in computation involves not just the bits you store, but also the bits that the processor touches and eventually dissipates as heat into the universe.
Boltzmann. But it doesn't really matter, it's the same thing. Yes, I know that a sequence of, say, 1000 identical bits looks like it's got just 10 bits of entropy after simple RLE compression. But you must not forget the entropy that was also generated in the computation itself, and subsequently dissipated into the universe.
It's not the same thing. If I define a function that always returns 1, then the Shannon entropy is extremely low regardless of whether the Boltzmann entropy of running it on a CPU is high. That the two measures can be different shows they cannot be the same thing. Related in concept, different in definition. In fact, you can even use the same formulas for calculating it; what differs is what you're calculating it on.
Then its Kolmogorov complexity is also extremely low.
Look, if you have a good enough hash function, its output should be near the Shannon limit and hardly compressible, ideally containing as much entropy as it has bits. But you can feed in just a single bit or the entire knowledge of humanity; in the end you're going to get a fixed number of bits, with entropy close to that, and if you throw any form of lossless compression at it, it will hardly compress.
But quantum mechanics tells us that information cannot be destroyed. So when you feed it more bits than it emits, it's mostly the entropy of the information you feed in that you get out of the hash. But if you feed it just a single bit, the additional entropy comes from the computational process.
I know, this is now getting really philosophical, but here's something to ponder on: How would you implement a hash function for a reversible computing architecture?
Most hashes are really good but the point was why replace the perfectly unique information in the cycle counter + time stamp combo with "most likely nearly unique" in the first place. After all, if the former isn't unique then neither are the hashes anyways.
Hashes are EXTREMELY compressible, though the known algorithms are extraordinarily slow. E.g. I can compress any SHA256 output down to a matter of kilobytes, maybe less, by using the SHA256 algorithm itself as the compressor and iterating through seeds until I get a match. With true entropy you can't guarantee that for all inputs, regardless of how long you take.
Different types of "information" are at play here, just as with the different types of entropy. If I have a 16 bit hash function and feed it a 64 bit value, at least 48 bits of computational information are lost. What happens with the physical information you used to represent the computation after you get the result is separate from what happens with the computational information.
You don't know precisely at which frequency the cycle counter runs. Depending on the system load it might run either faster or slower than the lowest bits of the HPTC. For what it's worth, this part is more or less nondeterministic, so the sane thing to do is to spread out the information as much as possible (maximize entropy), in order to minimize the probability of collisions.
That's ok, the bits past the low bits are just there to avoid collisions, not an actual measure of high precision time beyond the low bits.
It's not worse than the hash solution, I'm just saying it's not necessary to hash it if the only objective is to reduce collisions.
In fact the hashing solution, if it is replacing the low bits with a hash of low bits plus something else, is actually destroying valuable time information.
That only works if you know for certain that the low bits are constant. Otherwise you may run into the issue that, due to the unsteady rate of RDTSC, two processes/threads that get preempted between reading the HPTC and the RDTSC may again end up with colliding timestamps. And if you took the derivative between timestamps taken in succession, you might even find it to be non-monotonic in places.
The combination of multiple counters incremented by individual unsteady clocks used to be a source for pseudo random scrambler sequences; these days we prefer LFSRs, but overall this is something that can be weird.
Hence my recommendation: just throw xxHash32 on the concatenation of the HPTC's low bits and the CPU clock cycle counter, and forgo any pretense of monotonicity in the low bits (because very likely you don't have it anyway).
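One way that suggestion might look (a hedged sketch assuming Linux/x86_64 and the xxHash library, linked with -lxxhash; keeping the hash in the low 16 bits is an arbitrary choice for the sketch):

    #include <stdint.h>
    #include <time.h>
    #include <x86intrin.h>   /* __rdtsc() */
    #include <xxhash.h>      /* XXH32() */

    uint64_t stamp_hashed_low_bits(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        uint64_t ns = (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;

        /* Hash the concatenation of the timer value and the cycle counter,
         * and let the hash fill the non-monotonic low bits. */
        uint64_t buf[2] = { ns, __rdtsc() };
        uint32_t h = XXH32(buf, sizeof buf, 0);
        return (ns & ~0xFFFFull) | (h & 0xFFFFull);
    }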
I thought clock_gettime() usually does use rdtsc(p) on Linux? Possibly depending on the particular clock type (monotonic, realtime, etc). Either way I'd be interested in knowing more.
On a system with frequency scaling you can see that, under higher load, the difference between RDTSC reads in subsequent iterations of a tight loop that does nothing but read that register will drop. Here's how it looks on the system I'm currently using: https://www.youtube.com/watch?v=FKKjSJ1JZ78
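For anyone who'd rather reproduce it than watch the video, a rough sketch of that kind of loop (not the code from the video):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    int main(void)
    {
        enum { N = 100 };
        uint64_t delta[N];
        uint64_t prev = __rdtsc();
        for (int i = 0; i < N; i++) {
            uint64_t now = __rdtsc();   /* constant-rate TSC ticks */
            delta[i] = now - prev;
            prev = now;
        }
        /* With an invariant TSC, smaller deltas mean each iteration took
         * less wall time, i.e. the core is running at a higher frequency. */
        for (int i = 0; i < N; i++)
            printf("%llu\n", (unsigned long long)delta[i]);
        return 0;
    }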
> RDTSC is directly influenced by frequency scaling
Unfortunate wording: RDTSC itself is not influenced by frequency scaling; it has a constant frequency on modern CPUs, after all. Your video nicely shows that the RDTSC delta is influenced by CPU frequency, as expected, but how does that affect using RDTSC as a clock? On my CPU RDTSC seems to tick at 3GHz, for example. I wonder how precise it is, though, and how much its frequency can drift from spec.
What is this "accuracy" you're talking about? Between the scheduler hanging over a process's head, ready to yank away its compute time for milliseconds between the syscall to read out the HPTC or the CPU cycle counter, and the process actually writing the timestamp into a variable/in-memory struct, reading the HPTC from the underlying hardware also not being super accurate, and the CPU cycle counter being influence by frequency scaling, on a bog-standard machine the highest precision you can reasonable expect is on the order of 100µs or so.
Modern CPUs don't really give you accurate nanosecond-scale time stamps anyways. The CPU will randomly speed up or slow down, execute instructions out of order, and even speculatively execute instructions. Not to mention that it'll have a dozen different clocks - which are not guaranteed to be in sync.
This may be coming from a place of ignorance, but if that were the case, then time would drift significantly and constantly due to Intel SpeedStep, for example. And if that were the source of truth for time, then your computer wouldn't be able to keep time while it's off. I'm pretty sure they all have real-time clock chips on the motherboard.
There's time keeping (which uses the real time clock) and high resolution clocks, which are good for relative timers. They're never the same components.
Invariant (Constant) TSC is detectable via `cpuid` and applies to `rdtsc/rdtscp` by default. In that aspect, there's no tradeoff being made there (observable to software) AFAICT.
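For reference, a small sketch of that check using GCC/Clang's cpuid helper; the Invariant TSC flag is CPUID leaf 0x80000007, EDX bit 8:

    #include <cpuid.h>
    #include <stdbool.h>
    #include <stdio.h>

    static bool has_invariant_tsc(void)
    {
        unsigned int eax, ebx, ecx, edx;
        /* Extended leaf 0x80000007; returns 0 if the leaf is unsupported. */
        if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
            return false;
        return (edx >> 8) & 1;
    }

    int main(void)
    {
        printf("invariant TSC: %s\n", has_invariant_tsc() ? "yes" : "no");
        return 0;
    }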
Are there cheaper ways of getting elapsed time with sub microsecond precision? Interested as I've only ever heard of rdtsc at the lowest level in userspace for x86.
I ran across a random Stack Overflow thread with benchmarks claiming it was about 2x the cost of a naive gettime(). But frankly, it's hard to figure out once you factor in all the various caches, out-of-order execution pipelines, etc.
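If you'd rather measure it on your own machine than trust a random thread, a rough sketch of such a micro-benchmark (the numbers will be fuzzy for exactly the reasons above; assumes x86_64, compile with -O2):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <x86intrin.h>

    #define N 1000000

    int main(void)
    {
        struct timespec ts;
        volatile uint64_t sink = 0;   /* keep the compiler from eliding work */

        uint64_t t0 = __rdtsc();
        for (int i = 0; i < N; i++) {
            clock_gettime(CLOCK_MONOTONIC, &ts);
            sink += ts.tv_nsec;
        }
        uint64_t t1 = __rdtsc();
        for (int i = 0; i < N; i++)
            sink += __rdtsc();
        uint64_t t2 = __rdtsc();

        printf("clock_gettime: %.1f TSC ticks/call\n", (double)(t1 - t0) / N);
        printf("rdtsc:         %.1f TSC ticks/call\n", (double)(t2 - t1) / N);
        return 0;
    }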
Well, you can always attempt to catch the CPU cycle counter overflow (happens at roughly 10Hz on current machines), adding up the carries and adding them to a nanosecond counter shifted up by a few bits. The problem with the CPU cycle counter is that it's not in lockstep with the HPTC, due to dynamic frequency scaling.
If you really, really, really need system wide, nanosecond precision timestamps, you'll have to resort to dedicated hardware incrementing a counter at the desired rate, and with a well known latency for each read access. On the hardware and driver level you'd have some MMIO port mapped into user space with an identity transform between bus address and process address space.
However this still doesn't solve the problem of the scheduler being able to throw in several milliseconds between reading the high precision timer value into a register and writing it back into the memory holding the timestamp variable. Seriously, on your typical computer system, software-derived timestamps at better than 100µs resolution are kind of bogus.
It's 64 bit on 64 bit systems. In the world of hard realtime applications there's still a huge armada of systems out there in the field running on 32 bit (think motion control, CNC machines, industrial robotics). If you're developing software that's concerned with this kind of precision, you might find yourself confronted with such "outdated" systems more often than not.
Slack's API uses straight nanoseconds for the IDs of its messages, which I always found very curious, but I figured they must have some way of resolving collisions on the backend.
A message ID only needs to be unique per workspace, in which case you'd expect very few collisions to begin with, and you could even retry on insert failure with a new timestamp. I don't think that would cause significant performance penalties.
> it is unsafe to assume that a raw nanosecond timestamp is a unique identifier
It is reasonable to assume that a raw nanosecond timestamp is a unique identifier in cases where you can reasonably expect such identifiers are not generated too often.
I.e. this is Ok (and even redundant) for identifying the manual interactions a single user would produce. It's probably also Ok for labelling the manual interactions multiple users would produce, if that's a small team.
I guess I'm old, the macOS behaviour is more in line with my expectations.
But this got me thinking: how feasible would it be to tie the various clock systems of a computer to some reference clock, like a 10 MHz GPSDO? Obviously it wouldn't improve the granularity, but you could ensure that the timestamps are actually accurate. Because otherwise I doubt that a random computer clock would be accurate down to 32ns even with NTP.
Getting a 10MHz reference or PPS signal into every machine requires specialized, expensive hardware and typically doesn't scale well to cover every server. That sort of thing is best left to extreme applications with FPGAs or ASICs. In particular the 10MHz version; a lot of commodity hardware only supports the 1Hz one.
A major problem in synchronization of the system clock is PCIe. Hardware can timestamp PPS signal or PTP/NTP packets with accuracy of a few nanoseconds if everything is well compensated, but the PCIe bus between the CPU and the timestamping HW has a latency of hundreds of nanoseconds with potentially large asymmetry, degrading the accuracy significantly.
Measuring that error is difficult. A possibility is to run a program periodically making random reads from memory (avoiding the CPU cache) to generate a PPS signal on the memory bus, which can be observed with a scope. There is a lot of noise due to other memory activity, RAM refresh, etc. From the configured memory speed and tRCD+tCL timings the uncertainty of the error can be reduced.
This might improve with new hardware. There is a feature called Precision Time Measurement (PTM), which is a hardware implementation of an NTP-like protocol in PCIe, but so far I have seen this working only on some onboard NICs.
This was realised in the development of imageboard software (futaba.php, later yotsuba.php, which runs 4chan) in the naming of images: a unix timestamp plus a random 3-digit number, because the timestamp alone often causes clashes.
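A sketch of that naming scheme (my own illustration in C, not the actual futaba/yotsuba PHP code; the millisecond base and .jpg extension are assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <time.h>

    /* Unix timestamp (milliseconds here) plus a random 3-digit suffix. */
    static void image_name(char *out, size_t n)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        long long ms = (long long)tv.tv_sec * 1000 + tv.tv_usec / 1000;
        snprintf(out, n, "%lld%03d.jpg", ms, rand() % 1000);
    }

    int main(void)
    {
        srand((unsigned)time(NULL));   /* weak randomness, fine for a sketch */
        char name[64];
        image_name(name, sizeof name);
        puts(name);
        return 0;
    }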
I don't think this is true in practice. You don't just produce timestamp_ns in a busy loop. There is also the cpu, memory or io operation that generated or copied the data bound to the timestamp. And that isn't sub-ns.
I frankly don't understand how it's good. UUID originally was intended as something you use very sparingly, to name, say, a product SKU maybe, an organization, something like that. Not literally content that collides commonly at the same nanosecond, in the same application, in the same platform/org.
At some point we have to question the sanity of using one single flat address space for everything from the tiniest identifier to the... well, "Universe", as if it makes sense.
We can have registrars for global ids, and we can nest local ids inside them, we can have a hierarchy, and we can have local compact ids, 32, 64 or 128 bit, which will never collide, and be locally cohesive.
So why aren't we doing this? Is it ignorance? Is it because magic is easier than actually figuring out the synchronization and ordering of things in a system like this? (And no, synchronization does NOT imply you need to issue ids strictly linearly.)
> We can have registrars for global ids, and we can nest local ids inside them, we can have a hierarchy, and we can have local compact ids, 32, 64 or 128 bit, which will never collide, and be locally cohesive.
We have this. It's called OID (Object Identifier) and is used in X.509, LDAP, SNMP, and many other technologies. A global registry exists (I have an entry) and it's a giant tree.
> So why aren't we doing this? Is it ignorance?
The problem you are solving here is for durable, long-lasting keys. There is also a need to generate large batches of short-lived keys that need to never collide for security/identification purposes. A centralized registry will not work for that and it requires a wholly different technique.
My point was, let's start with that large OID tree, for example.
And continue this concept downward. Except when it's inside your org, you're the registrar of your namespace's sub-OIDs, and so on. And there's precisely no reason not to have hierarchical sequential ids for everything. You need to generate ids on 100 servers? Good, give each server a namespace. You need to do that in 100 processes on each server? Good, give a sub-namespace to those too.
And the best of all is that the above-organization OID part of the tree is ONLY needed if you show these ids outside your organization. Otherwise you can use the internal private part of the identifiers.
So what am I missing here? Maybe they need to be hard to guess, so add a random key part to them (distinct from the generated sequence part) as a "password". Done.
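A toy sketch of what the innermost layer of such a scheme could look like (the field widths, names, and the per-process atomic counter are choices made purely for illustration):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    static _Atomic uint32_t local_seq;   /* per-process sequence counter */

    /* 16-bit server namespace, 16-bit process namespace, 32-bit local
     * sequence: never collides, with no randomness, as long as each level
     * hands out its sub-namespaces exactly once. */
    static uint64_t next_id(uint16_t server_id, uint16_t process_id)
    {
        uint32_t seq = atomic_fetch_add(&local_seq, 1);
        return ((uint64_t)server_id << 48) | ((uint64_t)process_id << 32) | seq;
    }

    int main(void)
    {
        printf("%016llx\n", (unsigned long long)next_id(42, 7));
        return 0;
    }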
I'd also have to make namespaces dynamic and centrally coordinate their lifecycle along with the rest of my infrastructure whenever I stand up anything that needs some sort of coordinated ID. So I've just moved this issue to an even larger scope with even more complexity.
What do I gain for this? UUIDs solve all of this because the chance of creating a duplicate is so low it can effectively be ignored. I can namespace UUIDs and create chains of them if needed.
This is the reason both exist. We need both. We can use both.
Once you get used to the basic principle that whoever creates something slaps a name on it, it stops being considered a hassle or confusing.
More specifically:
1. When you create something, the creator names it.
2. When you bring something you created outside your scope, you prepend your name to it.
It's kind of what we do with (actual) children, if you think about it: slightly less structured, but the idea was always there. Just add recursion to complete the principle.
I generate children faster than the birth registry can issue certificates, and Cronus eats them before the certificate can be issued anyway. Nesting namespaces doesn't solve the problem of scale within a single namespace.
> I frankly don't understand how it's good. UUID originally was intended as something you use very sparingly, to name, say, a product SKU maybe, an organization, something like that. Not literally content that collides commonly at the same nanosecond, in the same application, in the same platform/org.
What's wrong with using UUIDs for things that are commonly used every nanosecond? It's still improbable to get the same UUID for two different "things" in the system, even if you generate billions of them per second. It's a pretty good way to get non-colliding IDs without a central registry. That's the main feature.
I've never seen a duplicate UUID generated, and if that's a concern, you may think about using a cryptographic randomness source to generate them.
UUID v7 has millisecond precision + 74 bits of randomness, which is _a lot_.
Also, a central registry for IDs seems like a great way to shoot yourself in the foot in case it's down, your network is down, you're on a cellular connection, on a plane...
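For concreteness, a hedged sketch of a UUIDv7-style generator following the layout mentioned above (48-bit millisecond timestamp, then version and variant bits, the rest random; rand() stands in for a real CSPRNG):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static void uuidv7(uint8_t out[16])
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        uint64_t ms = (uint64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;

        for (int i = 0; i < 16; i++)          /* fill with (weak) randomness */
            out[i] = (uint8_t)rand();
        for (int i = 0; i < 6; i++)           /* 48-bit big-endian ms timestamp */
            out[i] = (uint8_t)(ms >> (40 - 8 * i));

        out[6] = (out[6] & 0x0F) | 0x70;      /* version 7 */
        out[8] = (out[8] & 0x3F) | 0x80;      /* RFC variant bits */
    }

    int main(void)
    {
        srand((unsigned)time(NULL));
        uint8_t u[16];
        uuidv7(u);
        for (int i = 0; i < 16; i++) printf("%02x", u[i]);
        printf("\n");
        return 0;
    }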
The problem with hierarchical identifiers is that you need to be extremely diligent when assigning them, and they make it extremely hard to refactor something in the future. But refactoring is something that happens all the time.
For example, you might have "comments" and "posts", and later merge the two entities and call them "notes". Now you have to figure out a way to merge the identifiers -- with UUIDs you don't have issues like this, because they are universally unique.
A lot of developers have found that UUIDv7 provides the properties they want (unique identifiers which sort by creation time), and they don't come with the hassle that other approaches come with.
It's sort of this. Although it would've been nice if the size of the IP address space weren't restricted, so one day we could add an optional segment on top and connect the whole Milky Way, or something.
ULIDs are almost the same as UUIDv7, except that they have a few extra random bits where UUIDv7 encodes the UUID version. Every UUIDv7 is a valid ULID, and the main difference is the encoding when displayed in textual form.
In practice, though, UUIDv7 works great at scale.