Hacker News new | past | comments | ask | show | jobs | submit login
UUIDs generally do not meet security requirements (littlemaninmyhead.wordpress.com)
130 points by xnyhps on Nov 26, 2015 | hide | past | favorite | 62 comments



The problem highlighted here is not so much UUIDs, but instead the use of poor PRNGs (i.e. things other than /dev/urandom or getrandom(2) on Linux). All of the issues with the example libraries are around of the use of PRNGs from math libs, not crypto libs.

A "secure token generation" library might have the same flaws (and many do!).

A correctly generated v4 UUID (S4.4 of the RFC) should be acceptable for use as a secure token, given it has 122 random bits.

The trouble with UUIDs, if any, is that only v4 UUIDs are really suitable for use as secure tokens, with v1 through v3 being entirely unsuitable.

Personally, outside of DB PKs in Postgres (or similar), I prefer to base64 encode 256-bits from /dev/urandom for tokens, as they're a little shorter (and Go makes it easy enough to do that).


> A correctly generated v4 UUID (S4.4 of the RFC) should be acceptable for use as a secure token, given it has 122 random bits.

No, you're missing the point. Security requires more than randomness, it also requires unpredictability. The RFC Section 6 is very clear about this: "Do not assume that UUIDs are hard to guess; they should not be used as security capabilities". I don't know how many UUID generators have cryptographic properties, but it is wrong to assume that it it suitable for security just because the randomness appears good. You can speak to any cryptographer about that.


> A correctly generated v4 UUID (S4.4 of the RFC) should be acceptable for use as a secure token, given it has 122 random bits.

No it's not, from §4.4:

> Set all the other bits to randomly (or pseudo-randomly) chosen values.

I can use a PRNG with a cycle length of 8 and it would be fully correct according to the RFC, but it would be trivial to brute-force all the values.


Kind of... you're technically right and and practically not. Nobody would describe a PRNG with cycle length of 8 as random. It's just reading numbers off of a list.

In the same section of uuid RFC: "See Section 4.5 for a discussion on random numbers.", and then: "Advice on generating cryptographic-quality random numbers can be found in RFC1750 [5].", so it's not like authors were unaware of the implications of trivial PRNG. It's just up to the implementation whether CSPRNG is required (user-visible security tokens), or just a PRNG (internal identifiers).


Correctly generated implies not having such a computationally weak correlation


Obviously I'm interpreting "correctly generated" to mean "generated in accordance with the RFC", so no, it does not.


"Correctly generated" also means, at the very least, "useful to the people implementing it." If you can't use a given UUID implementation without generating collisions in the first 100 items, you won't use that UUID implementation. There's an "implicit spec" behind each spec of what people will or won't actually bother to do.

Generating correlated v4 UUIDs isn't wrong, but it is stupid in the sense of being self-defeating: nobody who wants to generate v4 UUIDs wants them with insufficient randomness, and will have no reason to not consider a bad random-number source to be a bug, because it's interfering with getting them the thing they want to get by generating v4 UUIDs in the first place.

(The real point of the explicit UUID spec, meanwhile, is in saying what constitutes a valid UUID—and it's certainly valid to generate a UUIDv4 using insufficient entropy. There's no way for a peer receiving such UUIDs to guess that they're maybe-predictable-with-enough-effort and reject them on that basis, which is all "validity" can ever mean: what can or cannot be technically enforced by protocol peers.)


I agree. From the title, I would have thought that there was something about the format, and not the generation method, that made it insecure.


Boost.UUID suffers from the same problem, at least at the time I checked.


[deleted]


I'm not sure what the SHA512 buys you. If the source was crypto-random from the start, the only thing a Hash gets you is the potential for a birthday attack.

Fortunately the birthday attack (assuming SHA512 is secure) is at 256 bits. So I don't believe you've lost security in this case... but it would seem that you've almost certainly lost performance.


Fortunately Python is safe in this case. https://hg.python.org/cpython/file/2.7/Lib/uuid.py#l582 - urandom is the source for the UUID4. (same for 3.5: https://hg.python.org/cpython/file/3.5/Lib/uuid.py#l600)




Even if you're only concerned about making unique identifiers, that don't need to be secret, that merely need to be unique, hiding the UUID generation state is a very good idea if users might have the ability to cause new UUID generators to be created. Otherwise, instead of a 2^(128/2) = 2^64 birthday attack, they can make 2^N generators, find the one that overflows into the next the quickest (i.e. a birthday attack on the 2N msb's), and have 2^(128-2N) work to do. This means you could do a 2^36-sized birthday attack and then have 2^56 work left to do, and that part's not memory-bound. For example, this works if you have generators that seed a 128-bit counter with /dev/urandom, and then increment from there. You can avoid this if you "just" use some CSPRNG.


For anyone else wondering about their node.js code: node-uuid uses nodes crypto.randomBytes: https://github.com/broofa/node-uuid/blob/master/uuid.js#L59


I wrote a fairly popular Python library to easily generate short UUIDs (encoded with base57):

https://github.com/stochastic-technologies/shortuuid

Since it turned out that most uses didn't actually need a UUID and just needed a random string, I figured I'd maximize the randomness per bit and added a .random() call that returns a short string that is taken from os.urandom().

It would probably be good if you asked yourself whether you actually need a UUID or a random string, and used the right tool for the job. Using a UUID when you want a random string and vice versa leads to these kinds of problems.


Shouldn't this be 'Javascript UUIDs generally do not meet security requirements' ?

As he mentions in his story, the server was actually using Java UUIDs which are cryptographically secure. So 'UUIDs generally do not meet security requirements' is false, as generally here only applies to Javascript. Python, Ruby, Java all have cryptographically safe UUID.


> Shouldn't this be 'Javascript UUIDs generally do not meet security requirements' ?

Nothing wrong with v4 UUIDs generated in Javascript, so long as you don't use Math.random(). Every current browser supports, eg, crypto.getRandomValues(), which is cryptographically secure. If you use a broken PRNG in any other language, you'll get broken v4 UUIDs too. :)

(Also, don't make the common mistake of conflating "UUID" with "v4 UUID". A v4 UUID contains 122 bits of (hopefully) random data. A v1 UUID contains 0 bits of randomness.)


This kind of problem is so prevalent that I think directly creating an 128-bit random ID from a CSPRNG is much safer than betting on the saneness of UUID implementers.


There's nothing insane about using a faster nonsecure PRNG for a library function that you expect to possibly be used on a hot path, as long as the security properties are clearly documented.


CSPRNGs are not much slower than the alternatives, and really, how often the hot path on data creation falls on the processor, instead of memory access or IO?

Nope, looks like a really bad choice for generic libraries.


Well if you just get 122 random bits from a CSPRNG you get an UUIDv4.


When do you use UUIDs and why?

I use traditional auto increment integer IDs as primary keys in SQL databases. I use those SQL query results in application code. Debugging seems a lot easier with smaller integer values rather than UUIDs that I saw in several enterprise software and SQL-DBs.


I use UUIDs for pretty much any "primary key"-like scenario:

* They're natively supported in most databases and languages

* They're strongly typed, meaning there's no risk of accidentally doing things that make no semantic sense with them (e.g. adding or multiplying)

* They ensure that any API user will use the correct datatype, rather than some clients breaking when your ids go above 2^31

* They avoid exposing information about how many entries they are, e.g. you can't tell how many users I have by signing up and checking your user ID

* The client can generate them without roundtripping to the database; this can save on roundtrips when you're saving several related pieces of data, and makes it easier to have circular datastructures if you need them

* As others have said, they're usable in an AP datastore

None of this is impossible to do with integers, but UUIDs make it very easy.


Doesn't the random nature of uuids make them extremely inefficient candidates for primary key as they can't be indexed well?


It's not like an auto increment key is meaningful, so the read performance will be the same either way. As for writes, a new key could go anywhere in the index, but OTOH you can insert multiple values concurrently (unlike with auto increment keys with MVCC where you'll have collisions and one insert will have to be rolled back and redone). As always, benchmark your use case and see what gives acceptable performance.


> As for writes, a new key could go anywhere in the index

This forces index tree rebalancing to occur on many (even most) writes, which is hugely detrimental to performance.


Which tree structure is this for? Many tree structures (e.g. the classic red-black tree) perform much better (doing less rebalancing) for randomized inserts than for ordered ones.


You forgot to mention that UUID's hurts JOIN performance. You want to join on 64 bits integers and not on 128 bit UUID's. You can still save the UUID in another column as a unique-id for the purposes you mentioned.


If your APIs have user specific endpoints, use a UUID rather than an integer in the API (what your DB uses is a separate question, see below). While it should not matter (because your authentication and authorization mechanisms are good, right?), merely making it so a user can't easily guess what endpoints correspond to another user is an additional protection.

For the database itself, the benefits are simply those that come from the fact that every ID, regardless of where it was generated, should be unique. This means you can, per the_mitsuhiko, generate keys in a distributed fashion, but it means more than that. Even rows from disparate databases have unique identifiers, which has benefits all the way down the line, as everyone has a way to refer to a particular entry that uniquely identifies it, regardless of where it came from. It lets you separate, and recombine the data, without worrying (depending on the type of UUID, at least) about collisions (separating being something like a NoSQL database, where based on the hashing key it gets written to one nodeset, allowing you to scale out; recombining being something like that, or you had multiple regions/databases in a SQL solution, but you need to run queries against both and merge the results for insertion into a single database for various metrics or similar).

Whether you need any of that is entirely dependent on your use cases; an incrementing ID in the database is perfectly valid for many needs.


> When do you use UUIDs and why?

When you need the client to determine the PK before an operation happens. In particularly necessary when working with distributed environments. There are many different forms of UUID and when you know what you are doing you can build very powerful systems with them that you cannot do with auto incrementing integer keys (even with holes).


In distributed systems, you can easily deal with this by assigning each client node an ordinal, and the client generates a sequential id that is a multiple of the highest ordinal of the ring added to the client's local ordinal. You can synchronize ordinal values between nodes with something like zookeeper or etcd.


If you're having to synchronize it using a strongly consistent mechanism, couldn't you just use an auto-incrementing integer anyway (not meaning to be snarky here - I legitimately don't know)? The benefit of a UUID is you don't need strong consistency; even on a partition of < n+1 nodes you can still generate IDs, and even if the network is unpartitioned, you can generate an ID without making a roundtrip.

I would also debate your use of the word 'easily'; what's easier, setting up zookeeper/etcd, or just generating a UUID?

But yes, if your use case is "I want to make sure I have non-conflicting IDs across my cluster", synchronizing them is a possible solution. And the right one, if your requirements stipulate absolute ordering based on the generation of each ID.


You would need to synchronize less often and that can be big benefit performance wise. i.e. you can tolerate more partitions without effects. If done right and in some cases you can have the same benefits as UUIDs with smaller keysizes.

After all UUIDs only make it unlikely that you have conflicting IDs not impossible. Especially if your random source is not that random.


You will never in your lifetime find a collision if you use v4 UUIDs generated from a CPRNG such as /dev/urandom. No one will.


Yes, key being, if your source is not that random i.e. programming errors.


Auto-incrementing using some kind of centralized store is synchronous, UUID generation is asynchronous.


In which case, you could equally use v1 or v5 UUIDs.

Plus a UUID can be generated by a source outside your network -- eg, client side, with the ability to treat them as quasi-nonces for replay detection.

Or inside your network, during long-running partitions.

The bigger question is: why would I install, manage, update, monitor zookeeper/etcd and write my own custom ID generator and all the storage mechanisms around it, when I can just use UUIDs?


Right, but I'm addressing the parent case in which ordered id is required (for whatever use case)


v1 UUIDs are ordered.


Our clients come and go and are largely unknown to us. Even in that case what you are describing is UUID1.


Integer IDs are fine for a non-distributed app where you have a single DB and every app server is directly connected to it. The second that's no longer the case, integer IDs become a nightmare.

On the other hand, a well chosen UUID, can be generated anywhere and be relied on to be globally unique. So maybe your client wants to create some records on the client, sync to a local db, and then do a batched update the next time the laptop it's running on has a wifi connection. Or maybe you want to scale your databases horizontally, or distribute your app across multiple data centers, or basically make any sort of AP (ie, Available and Partition Tolerant) system. In which case, integer IDs are asking for a world of pain, but a well chosen token is great, because you can keep working efficiently when talking to your Single Source Of Integer Truth is slow or impossible.

(And yes, there's other solutions to the issue. But UUIDs can solve it with a minimal amount of design and coding.)

> Debugging seems a lot easier with smaller integer values rather than UUIDs

My experience has been that both are just as easy, and leaking integer IDs can be a (mild) security vulnerability.


You are correct, UUIDs are a pain because they're large and hard to remember/compare when debugging. But, they can be useful for internal functionality in DB engines also. One of our database engine products uses them with the manifests attached to replicated updates. A UUID uniquely identifies each table being replicated (assigned when a table is "published"), and the manifest contains the list of UUIDs that have loaded the update. That way the engine knows whether a particular table can safely ignore the update, which is important with bi-directional replication.


And you're correct for most cases. But if you need to keep a master DB and local DBs on mobile devices without internet connectivity in sync, you'll be thankful to be able to generate v4 UUIDs and assume that the collision rate is nil for all practical purposes.


Another use case for UUIDs are message IDs. Not necessarily for emails, but for other messaging systems that must be able to refer to other messages (like in-reply-to), but without assuming there's a central database that stores every message.


I use Spring Security and it uses UUIDs for CSRF tokens. Luckily, Java does this securely.


UUID4 random is 122 bit of randomness:

https://gist.github.com/miohtama/6e72c2458a7138599dc1


The point here is that UUID4 is only as random as the underlying source of randomness. If the source of randomness is garbage your UUID4 is garbage as well. v8's Math.random() does not even remotely come close to a CSPRNG, and as a post a few days ago noted[0] it has a very, very small cycle length, so you can easily (via brute-force) find where in the cycle it was when it generated a given value, and enumerate all the UUIDs it could generate.

[0] https://medium.com/@betable/tifu-by-using-math-random-f1c308...


Here's the link to the 219 post HN discussion of that Medium post:

https://news.ycombinator.com/item?id=10598065


C's rand() is typically poor as well. Thought it was par for the course for builtin rng to be somewhat shitty, but maybe other languages stack up better?


> Thought it was par for the course for builtin rng to be somewhat shitty, but maybe other languages stack up better?

That seems to be pretty common, although v8's seems especially terrible. In thread linked by PhantomGremlin[0] pcwalton also points out that stronger RNG can be "detrimental" to benchmarks if they're slower, as many benchmarks ultimately bench the speed of the basic RNG (or for an other issue in the same category the speed of the basic hash function), so the incentive for builtin RNGs is to be fast and cheap, not to be good.

[0] https://news.ycombinator.com/item?id=10598065


Luckily we have <random> in modern C++ these days. While they unfortunately left the default as implementation defined you can still manualy pick a good PRNG and random_device should give you good CSPRNG quality entropy in any sane implementation (although I wish they would formally say that it couldn't be based on a PRNG).

Also worth checking out: https://channel9.msdn.com/Events/GoingNative/2013/rand-Consi...


C's rand() (rather, your libcs rand()) is perfectly fine for what it's supposed to be: fast random data, with no particular guarantees to cryptographic security, for things like randomized algorithms or the 99% of cases where you're picking what powerup to spawn instead of generating super important key material.

To make it into a CSPRNG would be a detriment. The only thing I would change is to have a minimum cycle length.


No, C's rand is unsuited for most randomized algorithms. MCMC algorithms won't work with rand unless you're lucky, it's even unsuited for choosing pivots for quicksort. The only application where it might be ok is games.


> The only application where it might be ok is games.

Not even that. Different types of games require different types of RNGs (e.g. in an RPG or strategy game you probably want a stable seeded RNG to preclude RNG save-scumming), the C standard requires almost no guarantee of rand().


> The only application where it might be ok is games.

This is silly, there are plenty of ways randomness is used to communicate randomness to the user (think games, song shuffling, visualizations, art, etc. Algorithms using randomness ≠ randomized algorithms.


https://twitter.com/mjmalone/status/667429857165488130

The PRNG behind Math.random() has been fixed in Chrome very recently.


Fixed?

As far as I know, there are no CSPRNGs that are as fast as, say, mersenne twister. So if 'fixed' means 'made it a CSPRNG', then I'd have to say they broke it. crypto.getRandomValues already exists.


It was fixed as in the PRNG in v8 was awful, even for a non-CSPRNG, and was recently replaced with a much better one: https://code.google.com/p/v8/issues/detail?id=4566


How fast do most users of a PRNG really need it do be? I think it's not unreasonable to pick something like ChaCha20 as a default PRNG (cryptographically secure, huge seed space, much smaller state than MT19937), and let people who need millions of random numbers for simulations use something else.


The algorithm that V8 chose to replace the current generator, xorshift128+, passes BigCrush and can produce a random uint64 in just over 1 nanosecond on a modern processor (that's ~7GB/s). Seems like a good choice. Not sure how it compares to something like ChaCha20 on performance, but I'm guessing it compares favorably (though it's unlikely to be your bottleneck either way).

Since I wrote the above linked article there's been a bunch of discussion, some of which has been about using crypto to back Math.random(). I'm sort of torn on that front - I feel like a good PRNG is useful for some stuff (like array shuffles), but maybe not? Maybe there are undiscovered (or even discovered, but not well known) vulnerabilities that justify CSPRNG even there, as there are with hash collision vulnerabilities.

Anyways, what I learned is that the benchmarks (particularly SunSpider, it seems) are putting pressure on implementors to (over)emphasize Math.random() performance, but nothing is really pressuring them to produce good quality. Sounds like the best thing to do might be to put some quality requirements in the ECMA spec to balance the performance pressures. Check out my recent Twitter likes for more details. I'm @mjmalone.


Actually, this may be even worse than described. The UUID digits all amount to the hex digit of hi in the 163 place (i.e. the 4th least-significant nybble). However, it also turns out that hi[init]*0xFFFF will give the same digit sequence but with a leading 0. This probably can be used to determine the RNG state from a single UUID, but the calculations for that are beyond me.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: