Hacker News new | past | comments | ask | show | jobs | submit login

> replace the low bits with the hash of them, concatenated with the value of the CPU cycle counter (`RDTSC` on x86)

you're concatenating two values and then taking the hash of the combination, ie:

hash(low bits + CPU cycle counter)




But a hash() destroys information.

When you are trying to reduce collisions, why would you use a function that is known to introduce collisions?


Because the raw value from the timestamp will be very low entropy and have the short scale variation concentrated in just a few bits. A hash not just destroys information, it also creates entropy by mixing that information over all the bits that come out of it. And using 64 bit hash replacing a 64 bit nanosecond counter that has short term entropy of less than 16 bits, you're in fact reducing the likelihood of a collision by a factor of 2^48.


The hash, in this case, is just one deterministic way to shorten the CPU counter to few bits which can then be used to increase the entropy of the timestamp by replacing the timestamps stale bits. What's being asked here is not why use some compressed bits of the CPU counter increases the entropy of the timestamp overall. Rather, why you'd use a hash of the entropy providing information to do this since hashes allow for many:1 mappings (i.e. allow for decreases in entropy) while never providing better than 1:1 mappings (i.e. preserving entropy). At best, the hash is preserving the entropy of the timestamp and counter bits and, at worst, it is throwing some away. The question that follows: is there a better way that preserves the entropy of the inputs without the risk of throwing some away and, if so, was there some other reason to still use the hash? This is what I took amelius to be asking.

Knowing the timestamp delta is ~30ns then even a 1 THz processor would only execute 30,000 cycles before said timestamp increases and the outcome stays unique anyways. From that perspective, taking the low 16 bits of the cycle counter is already guaranteed to help produce perfectly unique timestamps, and without the risk of introducing hash collisions. It's also significantly easier to compute and now monotonic* now, whereas hashes were neither. From that perspective, I too don't see what value the hash is supposed to add to the information being fed to it in this case.

In threaded/distributed/parallel scenarios you may wish to throw the lane/core/node number in the bits as well. In the case having the same timestamp is supposed to be invalid this leaves you a simple deterministic way to tiebreak the situation as part of the creation of the timestamp itself.

*A 10 GHz CPU running for a little over 58 years could risk a 64 bit counter rollover in the middle of a time delta. If that's too close for comfort or you want the code to work on e.g. 32 bit counter systems, you'd need to eat more cycles and registers to set another bit to the outcome of whether current cycle count is < the one at the start of the current timestamp delta.


Yeah, when I was thinking about the hash, I thought of it as stuffing to fill the unused portion of the number that would look better than using zeroes.

TIMESTAMP010111101010101TSC "looks better" than TIMESTAMP000000000000000TSC even though they contain the same information.

I would drop the hash, it's deceptive.

I don't believe it breaks the monotonicity, though? I mean, it would if it weren't already broken. If you're taking the low 16 bits of the TSC, then a rollover in those 16 bits during the same timestamp will already go backwards. TIMESTAMP0...0 follows TIMESTAMPf...f.


I guess it depends which portion you look at. If you solely look at the time based portion you do indeed still have a function which never decreases but that is true even in the case of reading the raw counter whole on its own anyways. If you look at the whole value, including the hashed portion, it's no longer monotonic.

In the cycle based case looking at the whole value is the same thing as looking at a relative time stamp which has more precision that the system clock. In this way, it's "truly" monotonic across the entire value, not just monotonic on a part and unique in another.

Side topic: It also comes with an even stronger guarantee of always increasing instead of just "not changing direction". Is there a name for that?


> it also creates entropy

It's a nitpick, but it concentrates the entropy. It doesn't create any.

I do think it will make the answer more clear, as the hash concentrates the less than 64 bits of entropy on that 128 bits of original data into a usable 64 bits package.


Actually hashes do create entropy (every computation creates entropy in some form or another). What's the entropy of a 4 bit number? What's the entropy of a 4 bit number hashed by a 64 bit hash function? The act of computation does in fact create entropy, as per the 2nd law of thermodynamics, a part of which shows up in the hash.


I don't think you understand what this conversation is about. We are considering information theoretic entropy, not thermodynamic entropy from the mechanism of computation itself.

The result of applying a deterministic function on a random variable cannot have more entropy than the underlying random variable. This is a theorem, one that is trivial enough to not have a name. But you can find solution sets to homework that will prove it for you: https://my.ece.utah.edu/~rchen/courses/homework1_sol_rr.pdf


> every computation creates entropy in some form or another

Ok, what is the entropy created by this function that maps a 4-bit number to a 64 bit number:

    0 -> 0
    1 -> 1
    2 -> 1
    3 -> 1
    4 -> 1
    ...
    15 -> 1


60 bits. Yes, I know, you can compress it down very well. But consider that entropy in computation involves not just the bits you store, but also the bits that the processor touches and eventually dissipates as heat into the universe.


What definition of entropy do you use?

(I'm using Shannon entropy.)


Boltzmann. But it doesn't really matter, it's the same thing. Yes, I know that looking at a sequence of, say 1000 identical bits looks like it's got just 10 bits of entropy after simple RLE compression. But you must not forget the entropy that also generated in the computation itself, and subsequently dissipated into the universe.


It's not the same thing. If I define a function that always returns 1 then the Shannon entropy is extremely low regardless if the Boltzmann entropy of running it on a CPU is high. That the two measures can be different shows they cannot be the same thing. Related in concept, different in definition. In fact, you can even use the same formulas for calculating it - what differs is what your calculating it on.


> If I define a function that always returns 1…

then it's Kolmogorov complexity is also extremely low.

Look if you have a well enough hash function, it output should be near the Shannon limit and hardly compressible, and ideally contain as much entropy as it has bits. But you can feed in just a single bit or the entire knowledge of humanity, in the end you're going to get a fixed amount of bits, and entropy near of that, and if you throw any form of lossless compression at it, it will hardly compress.

But quantum mechanics tells us, that information cannot be destroyed. So when you feed it more bits, than it emits, then its mostly the entropy of the information you feed in, that you get out of the hash. But if you feed it just a single bit, the additional entropy comes from the computational process.

I know, this is now getting really philosophical, but here's something to ponder on: How would you implement a hash function for a reversible computing architecture?


Most hashes are really good but the point was why replace the perfectly unique information in the cycle counter + time stamp combo with "most likely nearly unique" in the first place. After all, if the former isn't unique then neither are the hashes anyways.

Hashes are EXTREMELY compressible, albeit known algorithms are extraordinarily slow. E.g. I can compress any SHA256 output to a matter of kilobytes, maybe less, by using the SHA256 algorithm as the compressor algorithm and iterating through seeds until I get a match. With true entropy you can't guarantee that for all inputs, regardless of how long you take.

Different types of "information" ate at play here with the different types of entropy as well. If I have a 16 bit hash function and feed it a 64 bit value 48 bits of computational information is lost (at minimum). What happens with the physical information you used to represent the computation after you get the information result is separate from what happens with the computational information.


Why?

    low bits + CPU cycle counter
is enough. No need of the hash()


You don't know precisely at which frequency the cycle counter runs. Depending on the system load it might either run faster or slower than the lowest bits the HPTC. For what it's worth this part is more or less nondeterministic, so the sane thing to do, is spread out the information as much as possible (maximize entropy), in order to minimize the probability of collisions.


That's ok, the bits past the low bits are just there to avoid collisions, not an actual measure of high precision time beyond the low bits.

It's not worse than the hash solution, I'm just saying it's not necessary to hash it if the only objective is to reduce collisions.

In fact the hashing solution, if it is replacing the low bits with a hash of low bits plus something else, is actually destroying valuable time information.


That only works, if you know exactly, that the low bits are constant. Otherwise you may run into the issue that due to unsteady rate of RDTSC between two processes/threads that may be preemptively unscheduled between reading the HPTC and the RDTSC you might again end up with colliding time stamps. And if you took the derivative between timestamps taken in succession, you might even find is to be non-monotonic in places.

The combination of multiple counters incremented by individual unsteady clocks used to be a source for pseudo random scrambler sequences; these days we prefer LFSRs, but overall this is something that can be weird.

Hence my recommendation: Just throw xxHash32 on concatenation of the HPTC's low bits and the CPU clock cycle counter, and forgo any pretense of monotony in the low bits (because very likely you don't have it anyway).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: