
The whole anonymization of IP addresses by just hashing the date and IP is just security theater.

Cryptographic hashes are designed to be fast. You can do 6 billion MD5 hashes in a second on a MacBook (M1 Pro) via hashcat, and there are only about 4 billion IPv4 addresses. So you can brute force the entire range and find the IP address. Basically, reverse the hash.

And that’s true even if they used something secure like SHA-256 instead of broken MD5
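
A minimal sketch of that brute force, assuming the scheme is SHA-256 over the IP concatenated with the date (the exact input format is an assumption, and `hash_ip` / `recover_ip` are hypothetical names). It only walks a /24 so it finishes instantly; the full IPv4 range is the same loop over about 4 billion candidates.

```python
import hashlib
import ipaddress
from typing import Optional

def hash_ip(ip: str, day: str) -> str:
    # Assumed scheme: hash of the IP followed by the date string.
    return hashlib.sha256(f"{ip}{day}".encode()).hexdigest()

def recover_ip(target: str, day: str, network: str) -> Optional[str]:
    # Try every address in the network until one hashes to the target.
    for addr in ipaddress.ip_network(network):
        if hash_ip(str(addr), day) == target:
            return str(addr)
    return None

leaked = hash_ip("203.0.113.77", "2023-11-01")
print(recover_ip(leaked, "2023-11-01", "203.0.113.0/24"))  # 203.0.113.77
```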




Aside from it being technically trivial to get an IP back from its hash, the EU data protection agency made it very clear that "hashing PII does not count as anonymising PII".

Even if you hash somebody's full name, you can later answer the question "does this hash match this specific full name". Being able to answer this question implies that the anonymisation process is reversible.


We're members of some EU projects, and they share a common help desk. To serve as a knowledge base, the tickets are kept, but all PII is anonymized after 2 years AFAIK.

What they do is pretty simple. They overwrite the data fields with the text "<Anonymized>". No hashes, no identifiers, nothing. Everything is gone. Plain and simple.


I think it would be reasonable to at least number different participants so you don't completely lose the flow of conversation.


Design of the helpdesk keeps the dialogue intact. So no context or flow is lost in that particular case.


KISS. That's the best way to go about it.


I think the word "reversible" here is being stretched a bit. There is a significant difference between being able to list every name that has used your service and being able to check if a particular name has used your service. (Of course these can be effectively the same in cases where you can list all possible inputs such as hashed IPv4 addresses.)

That doesn't mean that hashing is enough for pure anonymity, but used properly hashes are definitely a step above something fully reversible (like encryption with a common key).


I'm not sure the distinction is meaningful. If the police demand your logs to find out whether a certain IP address visited in the past year, they'd be able to find that out pretty quickly given what's stored. So how is privacy being respected?


if it fulfills the same function, does it matter?

if you have an ad ID for a person, say example@example.com, and you want to deduplicate it,

if you provide them with the names, the company that buys the data can still "blend" it with data they know, if they know how the hash was generated... and effectively get back that person's email, or IP, or phone number, or at least get a good hunch that the closest match is such and such person with uncanny certainty

de-anonymization of big data is trivial in basically every case that was written by an advertising company, instead of written by a truly privacy focused business.

if it were really a non-reversible hash, it would be evenly distributed, not predictable, and basically useless for advertising, because it wouldn't preserve locality. It needs to allow for finding duplicates... so the person you give the hash to, can abuse that fact.


It depends. For example, if each day you generate a random nonce and use it to salt that day's PII (and don't store the nonce) then you cannot later determine (a) did person A visit on day N or (b) is visitor X on day N the same as visitor Y on day N+1. But you can still determine how many distinct visitors you had on day N, and answer questions about within-day usage patterns.
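
A rough sketch of that property, assuming SHA-256 and a 16-byte random nonce per day (`visitor_id` and the sample addresses are made up for illustration). In a real deployment the nonce would be discarded at the end of the day; here both are kept in variables only to show that the resulting IDs don't line up.

```python
import hashlib
import secrets

def visitor_id(ip: str, nonce: bytes) -> str:
    # Salt the IP with that day's random nonce before hashing.
    return hashlib.sha256(nonce + ip.encode()).hexdigest()

day1_nonce = secrets.token_bytes(16)
day2_nonce = secrets.token_bytes(16)

# Within a day, repeat visits collapse to the same ID, so distinct counts work.
day1_hits = ["198.51.100.7", "198.51.100.7", "198.51.100.9"]
day1_ids = {visitor_id(ip, day1_nonce) for ip in day1_hits}
distinct_day1 = len(day1_ids)

# Across days, the same IP gets an unrelated ID because the nonce changed.
same_ip_day2 = visitor_id("198.51.100.7", day2_nonce)
linkable = same_ip_day2 in day1_ids

print(distinct_day1, linkable)
```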


Is an IPv4 address really classed as PII? Sounds a bit insane.


It can be used to track you across the web, get a general geographic area, and if you have the right connections one can get the ISP subscriber address. Given that PII is anything that can be used to identify a person, I think it qualifies despite it being difficult for a rando to tie an IP to a person.

Additionally in the case of ipv6 it can be tied to a specific device more often. One cannot rely on ipv6 privacy extensions to sufficiently help there.


That's compounded by the increasing use of static IPs, or at least extremely long-lasting dynamic IPs in some ISPs.


There is a reason I specified ipv4 and not v6


Yes but if the business is not in the EU they don't need to care one bit about GDPR or EU.


If they target residents of the EU, they must care.

Edit:

This is a different bear:

Also, Bear claims to be GDPR compliant: https://bear.app/faq/bear-is-gdpr-compliant/


Author here. I commented down below, but it's probably more relevant in this thread.

For a bit of clarity around IP address hashes: the only use they have in this context is preventing duplicate hits in a day (making each page view unique by default). At the end of each day there is a worker job that scrubs the IP hash, which is by then irrelevant.


Have you considered serving an actual small transparent image with caching headers set to expire at midnight?


For context, this problem also came up in a discussion about Storybook doing something similar in their telemetry [0] and with zero optimization it takes around two hours to calculate the salted hashes for every IPv4 on my home laptop.

[0] https://news.ycombinator.com/item?id=37596757


Hashes should be salted. If you salt, you are fine, if you don't you aren't.

Whether the salt can be kept indefinitely, or is rotated regularly etc is just an implementation detail, but the key with salting hashes for analytics is that the salt never leaves the client.

As explained in the article there seems to be no salt (or rather, the current date seems to be used as a salt, but that's not a random salt and can easily be guessed by anyone who wants to ask "did IP x.y.z.w visit on date yy-mm-dd?").

It's pretty easy to reason about these things if you look from the perspective of an attacker. What would you do to figure out anything about a specific person given the data? If you can't, then the data is probably OK to store.


> Hashes should be salted. If you salt, you are fine, if you don't you aren't.

> Whether the salt can be kept indefinitely, or is rotated regularly etc is just an implementation detail, but the key with salting hashes for analytics is that the salt never leaves the client.

I think I'm missing something.

If the salt is known to the server, then it's useless for this scenario. Because given a known salt, you can generate the hashes for every IP address + that salt very quickly. (Salting passwords works because the space for passwords is big, so rainbow tables are expensive to generate.)

If the salt is unknown to the server, i.e. generated by the client and 'never leaves the client'... then why bother with hashes? Just have the client generate a UUID directly instead of a salt.


Without a salt, you can generate the hash for every IP address once, and then permanently have a hash->IP lookup (effectively a rainbow table). If you have a salt, then you need to do it for each database entry, which does make it computationally more expensive.


People are obsessed with this attack from the 1970s, but in practice password cracking rigs just brute force the hashes, and that has been the practice since my career started in the 1990s and people used `crack`, into the 2000s and `jtr`, and today with `hashcat` or whatever it is the cool kids use now. "Rainbow tables" don't matter. If you're discussing the expense of attacking your scheme with or without rainbow tables, you've already lost.


> If you're discussing the expense of attacking your scheme with or without rainbow tables, you've already lost.

Can you elaborate on this or link to some info elaborating what you mean? I'd like to learn about it.


> why bother with hashes? Just have the client generate a UUID directly instead of a salt.

The reason for all this bonanza is that the ePrivacy directive requires a cookie banner, making exceptions only for data that is "strictly necessary in order to provide a [..] service explicitly requested by the subscriber or user".

In the end, you only have "pinky promise" that someone isn't doing more processing on the server end, so in reality it doesn't matter much especially if the cookie lifetime is short (hours or even minutes). Actually, a cookie or other (short-lived!) client-side ID is probably better for everyone if it wasn't for the cookie banners.


ALL of the faff around cookies is the biggest security theater of the past 40 years. I remember hearing the fear-mongering in the very early 2000s about cookies in the mainstream media - it was self-evidently a farce then, and a farce now.


Isn't the data in this case (the IP address) part of the "strictly necessary" data? That's all that gets collected by that magic CSS + server, no?


ePrivacy directive only applies to information stored on the client side (such as cookies).


> > the salt never leaves the client

> I think I'm missing something.

...

> If the salt is known to the server,

That's what you were missing yes


Did you miss the second half where GP asked why the client doesn't just send up a UUID, instead of generating their own salt and hash?


Salts are generally stored with the hash, and are only really intended to prevent "rainbow table" attacks. (I.e. use of precomputed hash tables.) Though a predictable and matching salt per entry does mean you can attack all the hashes for a timestamp per hash attempt.

That being said, the previous responder's point still stands that you can brute force the salted IPs at about a second per IP with the colocated salt. Using multiple hash iterations (e.g. 1000x; i.e. "stretching") is how you'd meaningfully increase computational complexity, but still not in a way that makes use of the general "can't be practically reversed" hash guarantees.
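
To get a feel for what stretching buys, here is a small sketch that uses PBKDF2 from Python's standard library as a stand-in for "hash it 1000 times" (PBKDF2 is a swapped-in choice, not something the comment names, and the salt value and address range are invented). The timings depend entirely on the machine, so none are quoted here.

```python
import hashlib
import time

def stretched_hash(ip: str, salt: bytes, iterations: int) -> str:
    # PBKDF2-HMAC-SHA256: the iteration count scales the cost of every guess.
    return hashlib.pbkdf2_hmac("sha256", ip.encode(), salt, iterations).hex()

salt = b"stored-next-to-the-hash"  # known to the attacker in this scenario

for iterations in (1, 1000):
    start = time.perf_counter()
    for i in range(200):
        stretched_hash(f"192.0.2.{i}", salt, iterations)
    elapsed = time.perf_counter() - start
    print(f"{iterations} iteration(s): {elapsed:.4f}s for 200 guesses")
```

The attacker's cost grows linearly with the iteration count, but so does the server's, and the candidate space is still only the IPv4 range.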


As I said the key for hashing PII for telemetry is that the client does the hashing on the client side and the client never transmits the salt. This isn't a login system or similar. There is no "validation" of the hash. All the hash is is a unique marker for a user that doesn't contain any PII.


What's the point in hashing the IP + salt then, just let each client generate a random nonce and use that as the key


How does the client generate the same salt every time they visit the page, without using cookies?


Use localstorage!

Kidding, of course. I don't think there's a way to track users across sessions, without storing something and requiring a 'cookie notification'. Which is kind of the point of all these laws.


Storing a salt with 24h expiry would be the same thing as the solution in the article. It would be better from a privacy perspective because the IP would then not be transmitted in a reversible way.

If I hadn't asked for permission to send hash(ip + date) then I'd sure not ask permission if I instead stored a random salt for each new 24h and sent the hash(ip + todays_salt).

This is effectively a cookie and it's not strictly necessary if it's stats only. So I think on the server side I'd just invent some reason why it's necessary for the product itself too, and make the telemetry just an added bonus.


If you can use JS it's easy. For example localStorage.setItem("salt", Math.random()). Without JS it's hard I think. I don't know why this author doesn't want to use JS, perhaps out of respect for his visitors, but then I think it's worse to send PII over the wire (and an IP hashed in the way he describes is PII).


EU's consent requirements don't distinguish between cookies and localStorage, as far as I understand. And a salt that is only used for analytics would not count as “strictly necessary” data, so I think you'd have to put up a consent popup. Which is precisely the kind of thing a solution like this is trying to avoid.


Indeed, but as I wrote in another reply: it doesn't matter. It's even worse to send PII over the wire. Using the date as the salt (as he does) just means it's reversible PII - a.k.a. PII!.

Presumably these are stored on the server side to identify returning visitors - so instead of storing a random number for 24 hours on the client, you now have PII stored on the server. So basically there is no way to do this that doesn't require consent.

The only way to do it is to make the information required for some necessary function, and then let the analytics piggyback on it


IP address is "non-sensitive PII"[0]. It's pretty hard to identify someone from an IP address. Hashing and then deleting every day is very reasonable.

[0] https://www.ibm.com/topics/pii


I think I agree with you there. But again, the idea of a "salt" is then overcomplicating things. It's exactly the same to have the client generate a GUID and just send that up, no salting or hashing required.


Yup for only identifying a system that’s easier. If this is all the telemetry is ever planned to do then that’s all you need. The benefit of having a local hash function is when you want to transmit multiple ids for data. E.g in a word processor you might transmit hash(salt+username) on start and hash(salt+filename) when opening a document and so on. That way you can send identifiers for things that are sensitive or private like file names in a standardized way and you don’t need to keep track of N generated guids for N use cases.

On the telemetry server you get e.g

Function “print” used by user 123 document 345. Using that you can do things like answering how many times an average document is printed or how many times per year an average user uses the print function.
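
A sketch of what that client side could look like, assuming SHA-256 and a salt generated once per installation (`TelemetryClient` and its method names are hypothetical, and how the salt is persisted locally is left out):

```python
import hashlib
import secrets

class TelemetryClient:
    def __init__(self) -> None:
        # Generated on the client and never transmitted.
        self._salt = secrets.token_bytes(16)

    def opaque_id(self, kind: str, value: str) -> str:
        # The kind tag keeps a username and a filename with the same text apart.
        data = self._salt + kind.encode() + b"\x00" + value.encode()
        return hashlib.sha256(data).hexdigest()

    def event(self, function: str, username: str, filename: str) -> dict:
        # Only the function name and opaque IDs go over the wire.
        return {
            "function": function,
            "user": self.opaque_id("user", username),
            "document": self.opaque_id("file", filename),
        }

client = TelemetryClient()
print(client.event("print", "alice", "report.docx"))
```

The same user and document map to the same IDs on every event from that client, so per-user and per-document counts work without the server ever seeing the names.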


What you're describing is called "pepper". What's called "salt" is not varied between rows and therefore not stored with the hash (and best stored far away from the hash).


You have this exactly backwards.


You're right of course, that's what I get by posting late at night. Can't edit or delete anymore.


Salting a standard cryptographic hash (like SHA2) doesn't do anything meaningful to slow a brute force attack. This problem is the reason we have password KDFs like scrypt.

(I don't care about this Bear analytics thing at all, and just clicked the comment thread to see if it was the Bear I thought it was; I do care about people's misconceptions about hashing.)


What do you mean by "brute force" in the context of reversing PII that has been obscured by a one way hash? My IP number passed through SHA1 with a salt (a salt I generated and stored safely on my end) is 6FF6BA399B75F5698CEEDB2B1716C46D12C28DF5 Since this is all that would be sent over the wire for analytics, this is the only information an attacker will have available.

The only thing you can brute force from that is some IP and some salt such that SHA1(IP+Salt) = 6FF6BA399B75F5698CEEDB2B1716C46D12C28DF5 But you'll find millions of such IPs. Perhaps all possible IP's will work with some salt, and give that hash. It's not revealing my IP even if you manage to find a match?


If you also explicitly mentioned the salt used (as bear appear to have done?), this just becomes a matter of testing 4 billion options and seeing which matches


I think it's just unsalted in the example code. Or you could argue that the date is kind of used as a salt. But the point was that salting + hashing is fine for PII in telemetry if and only if the salt stays on the client. It might be difficult to do without JS though.


What does "stay on the client" mean? It has to be consistent across visits and you don't want to use cookies (otherwise you don't need to mess with addresses at all). You have no option except sending every client the same salt.


In this context it means exactly that: staying on the client. And yes that means using cookies (or rather, local storage probably). So this is requiring consent no matter how you do it. But note that the system in the article also requires consent since it sends PII (the IP) over the wire and saves it on the server. It's reversible and not anonymous - so it's even worse than using a cookie without consent I'd say.

Yes if all you ever want to send is a unique visitor ID then there is no point in having a local hash, because you can just generate a random ID and use that to identify the user.

What I mean is that if you want to send multiple pieces of PII (such as an IP, a filename, a username,...) then the only way to do that safely is to send hash(salt+filename) for example, where the salt is not known to the server receiving the hash. The IP in the suggestion to use a locally stored hash here just represented "PII that should be sent anonymously" and not "A good way of identifying a unique system".


> Salting a standard cryptographic hash (like SHA2) doesn't do anything meaningful to slow a brute force attack.

Sure, but it does at least prevent the use of rainbow tables. Arguably not relevant in this scenario, but it doesn't mean that salting does nothing. Rainbow tables can speed up attacks by many orders of magnitude. Salting may not prevent each individual password from being brute forced, but for most attackers, it probably will prevent your entire database from being compromised due to the amount of computation required.


Rainbow tables don't matter. If you're discussing the strength of your scheme with or without rainbow tables, you have already lost.

https://news.ycombinator.com/item?id=38098188


That's just a link where you claim the same thing. What's your actual rationale? Do you think salting is pointless?


I think it's meaningless. Every password KDF is randomized. But randomization isn't what makes them difficult to brute-force; non-password-KDF "salted" hashes are extremely easy to brute force, which is the whole reason we have password KDFs.

"Salted hashes" are one of the more treacherous security cargo cults, because they create the impression that the big problem you have to solve hashing a password is somehow mixing in a salt. No: just doing that by itself doesn't accomplish anything meaningful at all.


Understood, but I think this is just people using "hash" and "KDF" interchangeably. Agree that a KDF is the way to go, but salting it is also a good idea to explode the search space.


I don't follow your point. Yes, you can brute force. But if I have 50,000 unsalted password hashes, I can crack those using rainbow tables much faster than brute forcing all 50,000. For a single password hash, I agree it probably doesn't really matter.


I don't understand why anyone would ever have 50,000 or even 1 unsalted password hash.


Maybe they use a secret salt or rotating salt? The example code doesn't, so I'm afraid you are right. But with one addition it can be made reasonably secure.

I am afraid, however, that this security theater is enough to pass many laws, regulations and such on PII.


> Cryptographic hashes are designed to be fast.

Not really. They are designed to be fast enough and even then only as a secondary priority.

> You can do 6 billion … hashes/second on [commodity hardware] … there’s only 4 billion ipv4 addresses. So you can brute force the entire range

This is harder if you use a salt not known to the attacker. Per-entry salts can help even more, though that isn't relevant to IPv4 addresses in a web/app analytics context because after the attempt at anonymisation you want to still be able to tell that two addresses were the same.

> And that’s true even if they used something secure like SHA-256 instead of broken MD5

Relying purely on the computation complexity of one hash operation, even one not yet broken, is not safe given how easy temporary access to mass CPU/GPU power is these days. This can be mitigated somewhat by running many rounds of the hash with a non-global salt – which is what good key derivation processes do for instance. Of course you need to increase the number of rounds over time to keep up with the rate of growth in processing availability, to keep undoing your hash more hassle than it is worth.

But yeah, a single unsalted hash (or a hash with a salt the attacker knows) on IP address is not going to stop anyone who wants to work out what that address is.


A "salt not known to the attacker" is a "key" to a keyed hash function or message authentication code. A salt isn't a secret, though it's not usually published openly.
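
For illustration, a keyed hash over an IP using HMAC-SHA256 from Python's standard library (the key size and the `keyed_ip_hash` name are assumptions for the sketch). Whoever holds the key can still enumerate all IPv4 addresses; the key only protects against parties who don't have it.

```python
import hashlib
import hmac
import secrets

def keyed_ip_hash(ip: str, key: bytes) -> str:
    # HMAC-SHA256: without the key, candidate IPs cannot be tested offline.
    return hmac.new(key, ip.encode(), hashlib.sha256).hexdigest()

key = secrets.token_bytes(32)  # secret, unlike a salt stored next to the hash
print(keyed_ip_hash("192.0.2.1", key))
```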


> only as a secondary priority

That's not a reasonable way to say it. It's literally the second priority, and heavily evaluated when deciding what algorithms to take.

> This is harder if you use a salt not known to the attacker.

The "attacker" here is the server owner. So if you use a random salt and throw it away, you are good; anything resembling the way people use salt in practice is not fine.


Don't forget that MD5 is comparatively slow & there are way faster options for hashing nowadays:

https://jolynch.github.io/posts/use_fast_data_algorithms/


That is easy to fix though. Just use a temporary salt.

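Roughly, in Python:

    import datetime, hashlib, secrets
    salt = {"day": None, "value": b""}  # the current day's salt
    def ip_hash(ip):
        if salt["day"] != datetime.date.today():  # new day: replace yesterday's salt
            salt.update(day=datetime.date.today(), value=secrets.token_bytes(16))
        return hashlib.sha256(salt["value"] + ip.encode()).hexdigest()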


Assuming you don't store the salts, this produces a value that is useless for anything but counting something like DAU. Which you could equally just do by counting them all and deleting all the data at the end of the day, or using a cardinality estimator like HLL.


DAU in regards to a given page.

Have you read the article? That is what the author's goal seems to be.

He wants to prevent multiple requests to the same page by the same IP counted multiple times.


Is that more efficiently done with an appropriate caching header on the page as it is served?

Cache-Control: private, max-age=86400

This prevents repeat requests for normal browsers from hitting the server.


That same uselessness for long-term identification of users is what makes this approach compliant with laws regulating use of PII, since what you have after a small time window isn't actually PII (unless correlated with another dataset, but that's always the case).


That's precisely all that OP is storing in the original article.

They're just getting a list of hashes per day, and associated client info. They have no idea if the same user visits them on multiple days, because the hashes will be different.


Of course if you have multiple servers or may reboot you need to store the salt somewhere. If you are going to bother storing the salt and cleaning it up after the day is over it may be just as easy to clean the hashes at the end of the day (and keep the total count) which is equivalent. This should work unless you want to keep individual counts around for something like seeing distribution of requests per IP or similar. But in that case you could just replace the hashes with random values at the end of the day to fully anonymize them since you no longer need to increment them.
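
The end-of-day replacement could look something like this, assuming the day's data is a mapping from IP hash to hit count (`scrub_hashes` and the sample keys are made up for the sketch):

```python
import secrets

def scrub_hashes(hits_by_hash: dict) -> dict:
    # Keep each visitor's hit count but swap the hash for a fresh random token.
    return {secrets.token_hex(16): hits for hits in hits_by_hash.values()}

today = {"hash-a": 5, "hash-b": 1, "hash-c": 2}
archived = scrub_hashes(today)
print(sorted(archived.values()))  # [1, 2, 5]
```

The distribution of requests per visitor survives, while nothing derived from the IP remains in the archived keys.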


This is the type of comment that reinforces not even trying to learn or outsource security.

You’ll never know enough.


I think the opposite? I’m a dev with a bit of an interest in security, and this immediately jumped out at me from the story; knowing enough security to discard bad ideas is useful.


Not if they use a password hash like Argon2 or scrypt


Even then it is theatre because if you know the IP address you want to check it's trivial to see if there's a match.


And this is why such a hash will still be considered personal data under legislation like GDPR.


But that's very heavy to compute at scale...


True, but also it's a blogging platform - does it really have that kind of scale to be concerned with?


Probably not, I was mainly thinking if that kind of solution was to be adopted at a scale like Google Analytics.



