Hey, author here. For a bit of clarity around IP address hashes: the only use they have in this context is preventing duplicate hits in a day (making each page view unique by default). At the end of each day a worker job empties them out while retaining the hit info.
If 10 users around the globe share an IP on a VPN and hit your site, do you only count that as 1? What about corporate networks, etc.? IP is a bad indicator.
Yep, this is covered in the writeup. Results are accurate enough to gauge whether your post is doing well or not (keep in mind that this is a simple analytics system for a small blogging platform).
I also tested it in parallel with some other analytics platforms and it actually performed better, because in this context adblockers are more prevalent among readers than IP sharing.
Analytics doesn't need to be accurate. The important thing isn't really exactly how many visitors you have; the important thing is trends. Do we have more users now than last week? Do we get more traffic from X than Y? Whether it's 1,000 or 1,034 isn't so important.
The idea of using CSS-triggered requests for analytics was really cool to me when I first encountered it.
One guy on Twitter (the post is no longer available) used it for mouse tracking: overlay an invisible grid of squares on the page, each with a unique background image triggered on hover. Each background image sends a specific request to the server, which interprets it!
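For anyone who wants to picture it, here's a minimal sketch of the idea (not the original code; the /track endpoint, class names, and grid size are my assumptions), generating the grid of hover-triggered rules server-side in Python:

```python
# Emit CSS for an invisible overlay grid; each cell requests a unique URL on :hover,
# so the server learns which cell the cursor passed over.
GRID = 10  # 10x10 grid of cells (illustrative)

rules = [
    f".track-grid {{ position: fixed; inset: 0; display: grid; "
    f"grid-template: repeat({GRID}, 1fr) / repeat({GRID}, 1fr); }}"
]
for row in range(GRID):
    for col in range(GRID):
        rules.append(
            f".track-grid .c{row}-{col}:hover "
            f"{{ background-image: url('/track?cell={row}x{col}'); }}"
        )

print("\n".join(rules))
```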
Anonymizing IP addresses by just hashing the date and IP is security theater.
Cryptographic hashes are designed to be fast. You can do 6 billion MD5 hashes a second on a MacBook (M1 Pro) via hashcat, and there are only about 4 billion IPv4 addresses. So you can brute force the entire range and find the IP address; basically, you reverse the hash.
And that’s true even if they used something secure like SHA-256 instead of broken MD5
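To make that concrete, here's a rough sketch of the brute force, assuming the scheme is md5(ip + date) as in the article (in practice you'd use hashcat on a GPU; pure Python is much slower, but the point is that the whole space is walkable):

```python
import hashlib
import ipaddress

def reverse_ip_hash(target_hex, date="2023-10-01"):
    # Walk the entire IPv4 space (~4.3 billion addresses) and compare each
    # candidate against the stored hash; a GPU rig does this in seconds.
    for ip_int in range(2**32):
        ip = str(ipaddress.IPv4Address(ip_int))
        if hashlib.md5(f"{ip}{date}".encode()).hexdigest() == target_hex:
            return ip
    return None
```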
Aside from it being technically trivial to get an IP back from its hash, the EU data protection agency made it very clear that "hashing PII does not count as anonymising PII".
Even if you hash somebody's full name, you can later answer the question "does this hash match this specific full name?". Being able to answer that question implies that the anonymisation process is reversible.
We're members of some EU projects, and they share a common help desk. To serve as a knowledge base, the tickets are kept, but all PII is anonymized after 2 years AFAIK.
What they do is pretty simple. They overwrite the data fields with the text "<Anonymized>". No hashes, no identifiers, nothing. Everything is gone. Plain and simple.
I think the word "reversible" here is being stretched a bit. There is a significant difference between being able to list every name that has used your service and being able to check if a particular name has used your service. (Of course these can be effectively the same in cases where you can list all possible inputs such as hashed IPv4 addresses.)
That doesn't mean that hashing is enough for pure anonymity, but used properly hashes are definitely a step above something fully reversible (like encryption with a common key).
I'm not sure the distinction is meaningful. If the police demand your logs to find out whether a certain IP address visited in the past year, they'd be able to find that out pretty quickly given what's stored. So how is privacy being respected?
If you have an ad ID for a person, say example@example.com, and you want to deduplicate it:
If you provide them with the names, the company that buys the data can still "blend" it with data they already know (if they know how the hash was generated) and effectively get back that person's email, or IP, or phone number, or at least get a good hunch, with uncanny certainty, that the closest match is such and such person.
De-anonymization of big data is trivial in basically every case where the system was written by an advertising company instead of a truly privacy-focused business.
If it were really a non-reversible hash, it would be evenly distributed, not predictable, and basically useless for advertising, because it wouldn't preserve locality. It needs to allow for finding duplicates... so the person you give the hash to can abuse that fact.
It depends. For example, if each day you generate a random nonce and use it to salt that day's PII (and don't store the nonce) then you cannot later determine (a) did person A visit on day N or (b) is visitor X on day N the same as visitor Y on day N+1. But you can still determine how many distinct visitors you had on day N, and answer questions about within-day usage patterns.
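A minimal sketch of that scheme (names and details are mine, not from the article):

```python
import hashlib
import secrets

day_nonce = secrets.token_bytes(16)   # regenerated daily, never written to disk
daily_visitors = set()

def record_hit(ip):
    # The same IP maps to the same ID within the day, so repeat views deduplicate.
    daily_visitors.add(hashlib.sha256(day_nonce + ip.encode()).hexdigest())

def end_of_day():
    global day_nonce, daily_visitors
    count = len(daily_visitors)            # distinct visitors for day N
    daily_visitors = set()
    day_nonce = secrets.token_bytes(16)    # old hashes are now unlinkable
    return count
```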
It can be used to track you across the web and to get a general geographic area, and with the right connections one can get the ISP subscriber's address. Given that PII is anything that can be used to identify a person, I think it qualifies, despite it being difficult for a rando to tie an IP to a person.
Additionally, in the case of IPv6 it can be tied to a specific device more often. One cannot rely on IPv6 privacy extensions to sufficiently help there.
Author here. I commented down below, but it's probably more relevant in this thread.
For a bit of clarity around IP address hashes: the only use they have in this context is preventing duplicate hits in a day (making each page view unique by default). At the end of each day there is a worker job that scrubs the IP hash, which is by then irrelevant.
For context, this problem also came up in a discussion about Storybook doing something similar in their telemetry [0], and with zero optimization it takes around two hours to calculate the salted hashes for every IPv4 address on my home laptop.
Hashes should be salted. If you salt, you are fine; if you don't, you aren't.
Whether the salt can be kept indefinitely or is rotated regularly, etc., is just an implementation detail, but the key with salting hashes for analytics is that the salt never leaves the client.
As explained in the article, there seems to be no salt (or rather, the current date is used as a salt, but that's not a random salt and can easily be guessed by anyone who wants to ask "did IP x.y.z.w visit on date yy-mm-dd?").
It's pretty easy to reason about these things if you look at them from the perspective of an attacker. How would you go about figuring out anything about a specific person given the data? If you can't, then the data is probably OK to store.
> Hashes should be salted. If you salt, you are fine; if you don't, you aren't.
> Whether the salt can be kept indefinitely or is rotated regularly, etc., is just an implementation detail, but the key with salting hashes for analytics is that the salt never leaves the client.
I think I'm missing something.
If the salt is known to the server, then it's useless for this scenario. Because given a known salt, you can generate the hashes for every IP address + that salt very quickly. (Salting passwords works because the space for passwords is big, so rainbow tables are expensive to generate.)
If the salt is unknown to the server, i.e. generated by the client and 'never leaves the client'... then why bother with hashes? Just have the client generate a UUID directly instead of a salt.
Without a salt, you can generate the hash for every IP address once and then permanently have a hash->IP lookup (effectively a rainbow table). If you have a salt, then you need to do it for each database entry, which does make it computationally more expensive.
People are obsessed with this attack from the 1970s, but in practice password cracking rigs just brute force the hashes, and that has been the practice since my career started in the 1990s and people used `crack`, into the 2000s and `jtr`, and today with `hashcat` or whatever it is the cool kids use now. "Rainbow tables" don't matter. If you're discussing the expense of attacking your scheme with or without rainbow tables, you've already lost.
> why bother with hashes? Just have the client generate a UUID directly instead of a salt.
The reason for all this bonanza is that the ePrivacy directive requires a cookie banner, making exceptions only for data that is "strictly necessary in order to provide a [..] service explicitly requested by the subscriber or user".
In the end, you only have a "pinky promise" that someone isn't doing more processing on the server end, so in reality it doesn't matter much, especially if the cookie lifetime is short (hours or even minutes). Actually, a cookie or other (short-lived!) client-side ID is probably better for everyone, were it not for the cookie banners.
ALL of the faff around cookies is the biggest security theater of the past 40 years. I remember hearing the fear-mongering about cookies in the mainstream media in the very early 2000s; it was self-evidently a farce then, and it's a farce now.
Salts are generally stored with the hash, and are only really intended to prevent "rainbow table" attacks. (I.e. use of precomputed hash tables.) Though a predictable and matching salt per entry does mean you can attack all the hashes for a timestamp per hash attempt.
That being said, the previous responder's point still stands that you can brute force the salted IPs at about a second per IP with the colocated salt. Using multiple hash iterations (e.g. 1000x; i.e. "stretching") is how you'd meaningfully increase computational complexity, but still not in a way that makes use of the general "can't be practically reversed" hash guarantees.
As I said, the key for hashing PII for telemetry is that the client does the hashing on the client side and never transmits the salt. This isn't a login system or similar; there is no "validation" of the hash. The hash is just a unique marker for a user that doesn't contain any PII.
Kidding, of course. I don't think there's a way to track users across sessions, without storing something and requiring a 'cookie notification'. Which is kind of the point of all these laws.
Storing a salt with 24h expiry would be the same thing as the solution in the article. It would be better from a privacy perspective because the IP would then not be transmitted in a reversible way.
If I hadn't asked for permission to send hash(ip + date) then I'd sure not ask permission if I instead stored a random salt for each new 24h and sent the hash(ip + todays_salt).
This is effectively a cookie and it's not strictly necessary if it's stats only. So I think on the server side I'd just invent some reason why it's necessary for the product itself too, and make the telemetry just an added bonus.
If you can use JS it's easy. For example localStorage.setItem("salt", Math.random()).
Without JS it's hard, I think. I don't know why this author doesn't want to use JS (perhaps out of respect for his visitors), but then I think it's worse to send PII over the wire (and an IP hashed the way he describes is PII).
The EU's consent requirements don't distinguish between cookies and localStorage, as far as I understand. And a salt that is only used for analytics would not count as "strictly necessary" data, so I think you'd have to put up a consent popup, which is precisely the kind of thing a solution like this is trying to avoid.
Indeed, but as I wrote in another reply: it doesn't matter. It's even worse to send PII over the wire. Using the date as the salt (as he does) just means it's reversible PII, a.k.a. PII!
Presumably these are stored on the server side to identify returning visitors, so instead of storing a random number for 24 hours on the client, you now have PII stored on the server. So basically there is no way to do this that doesn't require consent.
The only way to do it is to make the information required for some necessary function, and then let the analytics piggyback on it
I think I agree with you there. But again, the idea of a "salt" is then overcomplicating things. It's exactly the same to have the client generate a UUID and just send that up, no salting or hashing required.
Yup, for only identifying a system that's easier. If that is all the telemetry is ever planned to do, then that's all you need. The benefit of having a local hash function is when you want to transmit multiple IDs for data. E.g. in a word processor you might transmit hash(salt+username) on start and hash(salt+filename) when opening a document, and so on. That way you can send identifiers for things that are sensitive or private, like file names, in a standardized way, and you don't need to keep track of N generated GUIDs for N use cases.
On the telemetry server you then get, e.g.:
"Function 'print' used by user 123, document 345". Using that you can answer things like how many times an average document is printed, or how many times per year an average user uses the print function.
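Roughly, the client side of that could look like this (a sketch; the salt file location and the example event fields are assumptions, not any real product's telemetry):

```python
import hashlib
import secrets
from pathlib import Path

SALT_FILE = Path.home() / ".myapp_telemetry_salt"   # hypothetical location

def local_salt():
    # Generated once per installation, stored locally, never sent to the server.
    if SALT_FILE.exists():
        return bytes.fromhex(SALT_FILE.read_text())
    salt = secrets.token_bytes(16)
    SALT_FILE.write_text(salt.hex())
    return salt

def anon_id(value):
    return hashlib.sha256(local_salt() + value.encode()).hexdigest()

# e.g. report {"event": "print", "user": anon_id(username), "doc": anon_id(filename)}
```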
What you're describing is called a "pepper". Unlike a salt, it is not varied between rows and not stored with the hash (and is best stored far away from the hash).
Salting a standard cryptographic hash (like SHA2) doesn't do anything meaningful to slow a brute force attack. This problem is the reason we have password KDFs like scrypt.
(I don't care about this Bear analytics thing at all, and just clicked the comment thread to see if it was the Bear I thought it was; I do care about people's misconceptions about hashing.)
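To illustrate the KDF point with a sketch (not a recommendation for this particular analytics case):

```python
import hashlib
import os

salt = os.urandom(16)
ip = b"203.0.113.7"   # example address

# Salted SHA-256: still billions of guesses per second on a GPU.
fast = hashlib.sha256(salt + ip).hexdigest()

# scrypt (a memory-hard password KDF): each guess costs real time and memory,
# which is what actually slows an attacker down.
slow = hashlib.scrypt(ip, salt=salt, n=2**14, r=8, p=1)
```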
What do you mean by "brute force" in the context of reversing PII that has been obscured by a one-way hash? My IP address passed through SHA1 with a salt (a salt I generated and stored safely on my end) is
6FF6BA399B75F5698CEEDB2B1716C46D12C28DF5
Since this is all that would be sent over the wire for analytics, this is the only information an attacker will have available.
The only thing you can brute force from that is some IP and some salt such that SHA1(IP+Salt) = 6FF6BA399B75F5698CEEDB2B1716C46D12C28DF5
But you'll find millions of such IPs. Perhaps all possible IPs will work with some salt and give that hash. It's not revealing my IP even if you manage to find a match?
If you also explicitly mention the salt used (as Bear appears to have done?), this just becomes a matter of testing 4 billion options and seeing which one matches.
I think it's just unsalted in the example code. Or you could argue that the date is kind of used as a salt. But the point was that salting + hashing is fine for PII in telemetry if and only if the salt stays on the client. It might be difficult to do without JS though.
What does "stay on the client" mean? It has to be consistent across visits and you don't want to use cookies (otherwise you don't need to mess with addresses at all). You have no option except sending every client the same salt.
In this context it means exactly that: staying on the client. And yes that means using cookies (or rather, local storage probably). So this is requiring consent no matter how you do it. But note that the system in the article also requires consent since it sends PII (the IP) over the wire and saves it on the server. It's reversible and not anonymous - so it's even worse than using a cookie without consent I'd say.
Yes if all you ever want to send is a unique visitor ID then there is no point in having a local hash, because you can just generate a random ID and use that to identify the user.
What I mean is that if you want to send multiple pieces of PII (such as an IP, a filename, a username,...) then the only way to do that safely is to send hash(salt+filename) for example, where the salt is not known to the server receiving the hash. The IP in the suggestion to use a locally stored hash here just represented "PII that should be sent anonymously" and not "A good way of identifying a unique system".
> Salting a standard cryptographic hash (like SHA2) doesn't do anything meaningful to slow a brute force attack.
Sure, but it does at least prevent the use of rainbow tables. Arguably not relevant in this scenario, but it doesn't mean that salting does nothing. Rainbow tables can speed up attacks by many orders of magnitude. Salting may not prevent each individual password from being brute forced, but for most attackers, it probably will prevent your entire database from being compromised due to the amount of computation required.
I think it's meaningless. Every password KDF is randomized. But randomization isn't what makes them difficult to brute-force; non-password-KDF "salted" hashes are extremely easy to brute force, which is the whole reason we have password KDFs.
"Salted hashes" are one of the more treacherous security cargo cults, because they create the impression that the big problem you have to solve hashing a password is somehow mixing in a salt. No: just doing that by itself doesn't accomplish anything meaningful at all.
Understood, but I think this is just people using "hash" and "KDF" interchangeably. Agree that a KDF is the way to go, but salting it is also a good idea to explode the search space.
I don't follow your point. Yes, you can brute force. But if I have 50,000 unsalted password hashes, I can crack those using rainbow tables much faster than brute forcing all 50,000. For a single password hash, I agree it probably doesn't really matter.
Maybe they use a secret salt or rotating salt? The example code doesn't, so I'm afraid you are right. But with one addition it can be made reasonably secure.
I am afraid, however, that this security theater is enough to satisfy many laws, regulations, and such on PII.
Not really. They are designed to be fast enough and even then only as a secondary priority.
> You can do 6 billion … hashes/second on [commodity hardware] … there’s only 4 billion ipv4 addresses. So you can brute force the entire range
This is harder if you use a salt not known to the attacker. Per-entry salts can help even more, though that isn't relevant to IPv4 addresses in a web/app analytics context because after the attempt at anonymisation you want to still be able to tell that two addresses were the same.
> And that’s true even if they used something secure like SHA-256 instead of broken MD5
Relying purely on the computation complexity of one hash operation, even one not yet broken, is not safe given how easy temporary access to mass CPU/GPU power is these days. This can be mitigated somewhat by running many rounds of the hash with a non-global salt – which is what good key derivation processes do for instance. Of course you need to increase the number of rounds over time to keep up with the rate of growth in processing availability, to keep undoing your hash more hassle than it is worth.
But yeah, a single unsalted hash (or a hash with a salt the attacker knows) on IP address is not going to stop anyone who wants to work out what that address is.
A "salt not known to the attacker" is a "key" to a keyed hash function or message authentication code. A salt isn't a secret, though it's not usually published openly.
That's not a reasonable way to say it. It's literally the second priority, and heavily evaluated when deciding which algorithms to adopt.
> This is harder if you use a salt not known to the attacker.
The "attacker" here is the server owner. So if you use a random salt and throw it away, you are good; anything resembling the way people use salts in practice is not fine.
Assuming you don't store the salts, this produces a value that is useless for anything but counting something like DAU. Which you could equally just do by counting them all and deleting all the data at the end of the day, or using a cardinality estimator like HLL.
That same uselessness for long-term identification of users is what makes this approach compliant with laws regulating use of PII, since what you have after a small time window isn't actually PII (unless correlated with another dataset, but that's always the case).
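For the HLL option, a minimal sketch (using the third-party datasketch package as one possible implementation; a plain set works too if you want exact counts):

```python
from datasketch import HyperLogLog

hll = HyperLogLog()

def record_hit(ip):
    # Only the sketch's registers are kept; no per-visitor record exists at all.
    hll.update(ip.encode("utf8"))

print(f"approx. unique visitors today: {hll.count():.0f}")
```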
That's precisely all that OP is storing in the original article.
They're just getting a list of hashes per day, and associated client info. They have no idea if the same user visits them on multiple days, because the hashes will be different.
Of course, if you have multiple servers or may reboot, you need to store the salt somewhere. If you are going to bother storing the salt and cleaning it up after the day is over, it may be just as easy to clean the hashes at the end of the day (and keep the total count), which is equivalent. This should work unless you want to keep individual counts around for something like seeing the distribution of requests per IP or similar. But in that case you could just replace the hashes with random values at the end of the day to fully anonymize them, since you no longer need to increment them.
I think the opposite? I’m a dev with a bit of an interest in security, and this immediately jumped out at me from the story; knowing enough security to discard bad ideas is useful.
Seems clever and all, but `body:hover` will most probably completely miss all "keyboard-only" users and users with user agents (assistive technologies) that do not use pointer devices.
Yes, these are perhaps marginal groups, but it is always a super bad sign to see them excluded in any way.
I am not sure (I doubt) there is a 100% reliable way to detect that "a real user is reading this article" (and issue an HTTP request) from baseline CSS in every single user agent out there; some of them might not support CSS at all, after all, or have loading of any kind of decorative images from CSS disabled.
There are modern selectors that could help, like :root:focus-within (requiring that the user actually focus something interactive there, which again is not guaranteed to trigger the selector in all agents), and/or bleeding-edge scroll-linked animations (`@scroll-timeline`). But again, braille readers will probably remain left out.
I think most mobile browsers emit "hover" state whenever you tap / drag / swipe over something in the page. "active" state is even more reliable IMO. But yes, you are right that it is problematic. Quoting MDN page about ":hover" [1]:
> Note: The :hover pseudo-class is problematic on touchscreens. Depending on the browser, the :hover pseudo-class might never match, match only for a moment after touching an element, or continue to match even after the user has stopped touching and until the user touches another element. Web developers should make sure that content is accessible on devices with limited or non-existent hovering capabilities.
They had to, and had to make that decision when mobile browsers were first developed, because so many sites had navigation flyouts that relied on :hover. So they had to have something trigger that pseudoselector.
I really wish modern touchscreens spent the extra few cents to support hover. Samsung devices from the ~2012 era all supported detection of fingers hovering near the screen. I suspect it’s terrible patent laws holding back this technology, like most technologies that aren’t headline features.
I’d be surprised if it was related to patent law; hover detection is fundamentally straightforward with capacitive touchscreens and it’s just a matter of calibrating the capacitance curve, and deciding whether to expose a hover state or not.
No, the reason is just because it’s too fiddly and too unreliable (to use, I mean, more than the touchscreen itself being unreliable, though the more sensitive you tune it the more that will become a problem as well).
Apple had their 3D Touch and they dropped it because (from what I hear, I never used it) it largely confused people due to non-obviousness/non-familiarity, required more care to use accurately than people liked (or, in some cases, were able to provide), and could become unreliable.
I’ve used a variety of “normal” capacitive touchscreens that could be activated anywhere from a few millimetres from the surface to requiring a firm/large-surface-area touch. It’s all about how they’re calibrated, and how much they’ve deteriorated (public space ones seem to often go very bad astonishingly quickly).
Most recently, in Indian airports the checkin machines that you can choose to use say that they’re touchless, that you can just put your finger near and it will work and isn’t that all lovely and COVID-19-aware of them, but on three separate machines I’ve tried it very carefully and couldn’t get it to activate before touching the screen.
That's interesting. Did they pass the :hover event down to the browser?
I have to have two interfaces for my web apps, because sometimes I want to hide some text of marginal value behind a :hover, but only a mouse can see it unless I break it out somehow for touch.
I'm a keyboard user when on my computer (qutebrowser), but I think your sentiment is correct: the number of keyboard-only users is probably much, much smaller than the number of people using an adblocker. So OP's method is likely to produce more accurate analytics than a JavaScript-only design.
OP just thought of a creative, effective, and probably faster, more code-efficient way to do analytics. I love it. Thanks, OP, for sharing it.
Joking aside, I love to read websites with keyboards, esp. if I'm reading blogs. So it's possible that sometimes my pointer is out there somewhere to prevent distraction.
I think there might be more than ten [1] blind folks using computers out there, most of them not using pointing devices at all or not in a way that would produce "hover".
As written, it depends on where your pointer is, if your device has one. If it’s within the centre 760px (the content column plus 20px padding on each side), it’ll activate, but if it’s not, it won’t. This means that some keyboard users will be caught, and some mouse users (especially those with larger viewports) won’t.
> And not just the bad ones, like Google Analytics. Even Fathom and Plausible analytics struggle with logging activity on adblocked browsers.
I believe that's because they're trying to live in what amounts to a toxic wasteland. Users like us are done with the whole concept, and as such I assume that if CSS analytics becomes popular, attempts will be made to bypass that too.
I manually unblocked Piwik/Matomo, Plausible, and Fathom in uBlock. I don't see any harm in what and how these track. And they do give the people behind the site valuable information "to improve the service".
E.g. Plausible collects less information on me than common nginx or Apache logs do. For me, as a blogger, it's important to see when a post gets on HN, is linked from somewhere, and what kinds of content are valued and which are ignored, so that I can blog about stuff you actually want to read and spread it through channels that make you actually aware of it.
If every web client stopped the tracking, you, as a blogger, could go back to just getting analytics from server logs (real analytics, using maths).
Arguably, the state of the art in that approach to user/session/visit tracking 20 years ago beats today's semi-adblocked disaster. With good use of path aliases (aka routes) and canonical URLs, you can even do campaign measurement without messing up SEO (see Amazon.com URLs).
You're just saying a smaller-scale version of "as a publisher it's important for me to collect data on my audience to optimize my advertising revenue." The adtech companies take the shit for being the visible 10% but publishers are consistently the ones pressuring for more collection.
I'm a website 'publisher' for a non-profit that has zero advertising on our site. Our entire purpose for collecting analytics is to make the site work better for our users. Really. Folks like us may not be in the majority but it's worth keeping in mind that "analytics = ad revenue optimization" is over-generalizing.
Of course analytics from 13 years ago doesn't help us optimize page load times. But it is extremely useful to notice that content that has gotten deep use steadily for a decade suddenly doesn't. Then you know to take a closer look at the specific content. Perhaps you see that the external resource that it depended on went offline and so you can fix it. Or perhaps you realize that you need to reprioritize navigation features on the site so that folks can better find the stuff they are digging for which should no longer include that resource. We have users that engage over decades and content use patterns that play out over years (not months). And understanding those things informs changes we make to our site that make it better for users. Perhaps this is outside your world of experience, but that doesn't mean it isn't true. And we also gather data to help optimize page load times.....
Can you give some examples of changes that you made specifically to make the site work better for users, and how those were guided by analytics? I usually just do user interviews because building analytics feels like summoning a compliance nightmare for little actual impact.
We generally combine what we learn from interviews/usability testing with what we can learn from analytics. Analytics often highlights use patterns that are of a 'we can definitely see that users are doing 'x' but we don't understand why' genre. Then we can craft testing/interviews that help us understand the why. So that's analytics helping us target our interviews/user testing. It also works the other way. User testing indicates users will more often get to where they need to be with design a versus design b. But user testing is always contrived: users are in an "I'm being tested mode" not a "I'm actually using the internet for my own purposes" mode. So it's hard to be sure they'll act the same way in vivo. With analytics you can look for users making the specific move your testing indicated they would. If they do great. But if not you know your user testing missed something or was otherwise off base.
I've decided to either stop working or keep working on some things based on the fact that I did or didn't get any traffic for it. I've become aware some pages were linked on Hacker News, Lobsters, or other sites, and reading the discussion I've been able to improve some things in the article.
And also just knowing some people read what you write is nice. There is nothing wrong with having some validation (as long as you don't obsess over it) and it's a basic human need.
This is just for a blog; for a product knowing "how many people actually use this?" is useful. I suspect that for some things the number is literally 0, but it can be hard to know for sure.
User interviews are great, but it's time-consuming to do well and especially for small teams this is not always doable. It's also hard to catch things that are useful for just a small fraction of your users. i.e. "it's useful for 5%" means you need to do a lot of user interviews (and hope they don't forget to mention it!)
How horrifying that someone who does writing potentially as their income would seek to protect that revenue stream.
Services like Plausible give you the bare minimum to understand what is viewed most. If you have a website that you want people to visit then it’s a pretty basic requirement that you’ll want to see what people are interested in.
When you start “personalising” the experience based on some tracking that’s when it becomes a problem.
> a pretty basic requirement that you’ll want to see what people are interested in.
not really
it should be what you are competent and proficient at
people will come because they like what you do, not because you do the things they like (sounds like the same thing, but it isn't)
there are many proxies for knowing what they like if you want to plan what to publish, when, and for how long; website visits are one of the less interesting ones.
a lot of websites such as this one get a lot of visits that drive no revenue at all.
OTOH there are websites that receive a small number of visits but make revenue based on the number of people subscribing to the content (the textbook example is OF; people there can get from a handful of subscribers what others earn from hundreds of thousands of views on YT or the like)
so basically monitoring your revenue works better than constantly optimizing for views; in the latter case you are optimizing for the wrong thing
I know a lot of people who sell online that do not use analytics at all, except for coarse-grained ones like number of subscriptions, number of items sold, how many emails they receive about something they published, or messages from social platforms, etc.
that's been true in my experience through almost 30 years of interacting with and helping publish creative content online and offline (books, records, etc.)
> people will come because they like what you do, not because you do the things they like (sounds like the same thing, but it isn't)
This isn’t true for all channels. The current state of search requires you to adapt your content to what people are looking for. Social channels are as you’ve said.
It doesn’t matter how you want to slice it. Understanding how many people are coming to your website, from where and what they’re looking at is valuable.
I agree the “end metric” is whatever actually drives the revenue. But number of people coming to a website can help tune that.
Emails received or messages on social media are just another analytic, filling the same need as knowing page hits. And somehow these people are analytics junkies all the same, just not mainlining page hits. You're unconvincing in the argument that "analytics are not needed".
Nothing's gonna block your webserver's access.log fed into an analytics service.
If anything, you're gonna get numbers that are inflated, because it's nearly impossible to dismiss all of the bot traffic just by looking at user agents.
The bit of the web that feels to me like a toxic wasteland is all the adverts; the tracking is a much more subtle issue, where the damage is the long-term potential of having a digital twin that can be experimented on to find how best to manipulate me.
I'm not sure how many people actually fear that. Might get responses from "yes, and it's creepy" to "don't be daft that's just SciFi".
I didn't know this. But with uMatrix you could block by default on all websites and then whitelist those you wanted it for. At least that's the way I used it and uBlock's advanced-user features.
You’re blocking the image, not the CSS. Here’s a rule to catch it at present:
||bearblog.dev/hit/
This is the shortest it can be written with certainty of no false positives, but you can do things like making the URL pattern more specific (e.g. /hit/*/) or adding the image option (append $image) or just removing the ||bearblog.dev domain filter if it spread to other domains as well (there probably aren’t enough false positives to worry about).
I find it also worth noting that all of these techniques are pretty easily circumventable by technical means, by blending content and tracking/ads/whatever. In case of all-out war, content blockers will lose. It’s just that no one has seen fit to escalate that far (and in some cases there are legal limitations, potentially on both sides of the fight).
> In case of all-out war, content blockers will lose. It’s just that no one has seen fit to escalate that far (and in some cases there are legal limitations, potentially on both sides of the fight).
The Chrome Manifest v3 and Web Environment Integrity proposals are arguably some of the clearest steps in that direction, a long term strategy being slow-played to limit pushback.
True, though doing it in CSS does have a couple of interesting aspects, using :hover would filter out bots that didn't use a full-on webdriver (most bots, that is). I would think that using an @import with 'supports' for an empty-ish .css file would be better in some ways (since adblockers are awfully good at spotting 1px transparent tracking pixels, but less likely to block .css files to avoid breaking layouts), but that wouldn't have the clever :hover benefits.
I have a genuine question that I fear might be interpreted as a dismissive opinion but I'm actually interested in the answer: what's the goal of collecting analytics data in the case of personal blogs in a non-commercial context such as what Bearblog seems to be?
I can speak to this from the writer's perspective as someone who has been actively blogging since c. 2000 and has been consistently (very) interested in my "stats" the entire time.
The primary reason I care about analytics is to see if posts are getting read, which on the surface (and in some ways) is for reasons of vanity, but is actually about writer-reader engagement. I'm genuinely interested in what my readers resonate with, because I want to give them more of that. The "that" could be topical, tonal, length, who knows. It helps me hone my material specifically for my readers. Ultimately, I could write about a dozen different things in two dozen different ways. Obviously, I do what I like, but I refine it to resonate with my audience.
In this sense, analytics are kind of a way for me to get to know my audience. With blogs that had high engagement, analytics gave me a sort of fuzzy character description of who my readers were. As with above, I got to see what they liked, but also when they liked it. Were they reading first thing in the morning? Were they lunchtime readers? Were they late-at-night readers? This helped me choose (or feel better about) posting at certain times. Of course, all of this was fuzzy intel, but I found it really helped me engage with my readership more actively.
Feedback loops. Contrary to what a lot of people seem to think, analytics is not just about advertising or selling data, it's about analysing site and content performance. Sure that can be used (and abused) for advertising, but it's also essential if you want any feedback about what you're doing.
You might get no monetary value from having 12 people read the site or 12,000 but from a personal perspective it's nice to know what people want to read about from you, and so you can feel like the time you spent writing it was well spent, and adjust if you wish to things that are more popular.
Curiosity? I like to know if anyone is reading what I write. It's also useful to know what people are interested in. Even personal bloggers may want to tailor content to their audience. It's good to know that 500 people have read an article about one topic, but only 3 people read one about a different topic.
For the curiosity, one solution I've been pondering, but never gotten around to implementing is just logging the country of origin for a request, rather than the entire IP.
IPs are useful in case of an attack, but you could limit yourself to simply logging subnets. It's a little more aggressive to block a subnet, or an entire ISP, but it seems like a good tradeoff.
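A sketch of the "log the subnet, not the address" idea (the /24 and /48 prefix lengths are just common choices, not from the parent comment):

```python
import ipaddress

def coarse_ip(addr):
    # Truncate the address to its network before it ever hits the log.
    ip = ipaddress.ip_address(addr)
    prefix = 24 if ip.version == 4 else 48
    return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))

coarse_ip("203.0.113.57")    # '203.0.113.0/24'
coarse_ip("2001:db8::1234")  # '2001:db8::/48'
```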
I attempted to do this back at the start of this year, but lost motivation building the web ui. My trick is not CSS but simply loading fake images with <img> tags:
> Why not just get this info from the HTTP server?
This is explained in the blog post:
> There's always the option of just parsing server logs, which gives a rough indication of the kinds of traffic accessing the server. Unfortunately all server traffic is generally seen as equal. Technically bots "should" have a user-agent that identifies them as a bot, but few identify that since they're trying to scrape information as a "person" using a browser. In essence, just using server logs for analytics gives a skewed perspective to traffic since a lot of it are search-engine crawlers and scrapers (and now GPT-based parsers).
Yes, no one has argued that Cloudflare Pages isn't using servers. But it is "hard" to track using logs if you are a Cloudflare customer. I guess the only way would be to hack into Cloudflare itself and access my logs that way. But that is "hard" (because yes, theoretically it is possible, I know), and not a realistic alternative.
Let's say I have an e-commerce website, with products I want to sell.
In addition to analytics, I decide to log a select few actions myself such as visits to product detail page while logged in.
So I want to store things like user id, product id, timestamp, etc.
How do I actually store this?
My naive approach is to stick it in a table.
The DBA yelled at me and asked how long I need data.
I said at least a month.
They said ok and I think they moved all older data to a different table (set up a job for it?)
How do real people store these logs?
How long do you keep them?
Unless you’re at huge volume you can totally do this in a Postgres table. Even if you are you can partition that table by date (or whatever other attributes make sense) so that you don’t have to deal with massive indexes.
I once did this, and we didn’t need to even think about partitioning until we hit a billion rows or so. (But partition sooner than that, it wasn’t a pleasant experience)
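For reference, a hedged sketch of what partition-by-date can look like in Postgres (table and column names are invented for illustration):

```python
import psycopg2

# Declarative range partitioning by date; one child table per month.
DDL = """
CREATE TABLE IF NOT EXISTS product_view (
    user_id    bigint,
    product_id bigint,
    viewed_at  timestamptz NOT NULL
) PARTITION BY RANGE (viewed_at);

CREATE TABLE IF NOT EXISTS product_view_2023_09
    PARTITION OF product_view
    FOR VALUES FROM ('2023-09-01') TO ('2023-10-01');
"""

with psycopg2.connect("dbname=shop") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```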
An analytics database is better (clickhouse, bigquery...).
They can do aggregations much faster and can deal with sparse/many columns (the "paid" event has an "amount" attribute, the "page_view" event has an "url" attribute...)
We've got 13 years worth of data stored in mysql (5 million visitor/year). It's a pain to query there so we keep a copy in clickhouse as well (which is a joy to query).
I only track visits to a product detail page so far.
Basically,
some basic metadata about the user (logged in only),
some metadata about the product,
and basic "auditing" columns -- created by, created date, modified by, modified date
(although why I have modified by and modified date makes no sense to me, I don't anticipate to ever edit these, they're only there for "standardization". I don't like it but I can only fight so many battles at a time).
I am approaching 1.5 million rows in under two months.
Thankfully, my DBA is kind, generous, and infinitely patient.
Clickhouse looks like a good approach.
I'll have to look into that.
> select count(*) from trackproductview;
> 1498745
> select top 1 createddate from TrackProductView order by createddate asc;
> 2023-08-18 11:31:04.000
What is the maximum number of rows in a ClickHouse table?
Is there such a limit?
I use Postgres with TimescaleDB. Works unless your e-commerce site is amazon.com. The great thing with TimescaleDB is that it takes care of creating materialized views with the aggregates you care about (like product views per hour), and you can even choose to "throw away" the events themselves and just keep the aggregations (to avoid getting a huge DB if you have a lot of events).
Oh - that is freakin' amazing.
- Using CSS to load an endpoint - clever
- Hashing ip + date for anon tracking - thoughtful
- Using :hover to ensure it's a real user - genius
Probably should be "salted hashes might be considered PII". It has not been tried by the EU courts and the law is not 100% clear. It might be; it might not be.
Correct. This is a flawed hashing implementation as it allows for re-identification.
Having that IP and the user's timezone, you can generate the same hash and trace back the user. This is hardly anonymous hashing.
Wide Angle Analytics adds a daily, transient salt to each IP hash, which is never logged, thus generating a truly anonymous hash that prevents re-identification.
You can estimate the actual numbers based on the collision rate.
Analytics is not about absolute accuracy, it's about measuring differences; things like which pages are most popular, did traffic grow when you ran a PR campaign etc.
> ‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;
This does not reference hashing, which can be an irreversible and destructive operation. As such, it can remove the “relating” part - i.e. you’ll no longer be able to use the information to relate it to an identifiable natural person.
In this context, if I define a hashing function that e.g. sums all IP address octets, what then?
The linked article talks about identification numbers that can be used to link a person. I am not a lawyer but the article specifically refers to one person.
By that logic, if the hash you generate cannot be linked to exactly one, specific person/request - you’re in the clear. I think ;)
If the data gets stored in this way (hash of IP[0]) for a long time I'm with you. But if you only store the data for 24 hours it might still count as temporary storage and should be "anonymized" enough.
IMO (and I'm not a lawyer): if you store ip+site for 24 hours and after that only store "region" (maybe country or state) and site this should be GDPR compliant.
The GDPR is very clear here (https://gdpr-info.eu/art-4-gdpr/). So you must have misunderstood the lawyers you talked to or you are referring about a hash that cannot identify a person. If information can be linked to a person it is considered PII of that person.
Are you sure? Even if you run a headless browser, you might not be triggering the hover event, unless you specifically tell it to or your framework simulates a virtual mouse that triggers mouse events and CSS.
You totally could be triggering it, but not every bot will, even the fancy ones.
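For example (a sketch with Playwright's sync API; the URL is a placeholder), a headless browser only fires the CSS-triggered hit if the script explicitly moves the pointer:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/some-post")   # placeholder URL
    # Without an explicit hover/move, the virtual pointer never enters the body,
    # so the :hover-triggered background request is never made.
    page.hover("body")
    browser.close()
```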
The :hover pseudo-class could be applied and unapplied multiple times during a single page load. This can certainly be mitigated using cache-related HTTP headers, but then if the same page is visited by the same person a second time, coming from the same referrer, the analytics endpoint won't be loaded.
But maybe I'm not aware that browsers guarantee that "images" loaded using url() in CSS will be (re)loaded exactly once per page?
I'm not sure about `url()` in CSS but `<img>` tags are guaranteed to only be loaded once per URL per page. I would assume that `url()` works the same.
This bit me when I tried to make a page that reloads an image as a form of monitoring. However, the URL interestingly includes the fragment (after the #) even though it isn't sent to the server. So I managed to work around this by appending #1, #2, #3... to the image URL.
"The only downside to this method is if there are multiple reads from the same IP address but on separate devices, it will still only be seen as one read. And I'm okay with that since it constitutes such a minor fragment of traffic."
Many ISPs are now using CG-NAT so this approach would miscount thousands of visitors seemingly coming from a single IP address.
I can think of many "downsides" but whether those matter or are actually upsides really depends on your use-case and perspective.
* You cannot (easily) track interaction events (esp. relevant for SPAs, but also things like "user highlighted x" or "user typed Y, then backspaced then typed Z)"
* You cannot track timings between events (e.g. how long a user is on the page)
* You cannot track data such as screen-sizes, agents, etc.
Looks like a clever way to do analytics. Would be neat to see how it compares with just munging the server logs since you're only looking at page views basically.
Re the hashing issue: it looks interesting, but adding more entropy from other client headers and using a stronger hash algorithm should be fine.
> Now, when a person hovers their cursor over the page (or scrolls on mobile)...
I can imagine many cases where real human user doesn't scroll the page on mobile platform. I like the CSS approach but I'm not sure it's better than doing some bot filtering with the server logs.
I sure hope you're being sarcastic here and illustrating the ridiculousness of privacy extremists (who, btw, ruined the web, thanks to a few idiot politicians in the EU).
If not, what's wrong with a service knowing you're accessing it? How can they serve a page without knowing you're getting a page?
I’d like to see a comparison of the server log information with the hit endpoint information: my feeling is that the reasons for separating it don’t really hold water, and that the initial request server logs could fairly easily be filtered to acceptable quality levels, obviating the subsequent request.
The basic server logs include declared bots, undeclared bots pretending to use browsers, undeclared bots actually using browsers, and humans.
The hit endpoint logs will exclude almost all declared bots, almost all undeclared bots pretending to use browsers, and some humans, but will retain a few undeclared bots that search for and load subresources, and almost all humans. About undeclared bots that actually use browsers, I’m uncertain as I haven’t inspected how they are typically driven and what their initial mouse cursor state is: if it’s placed within the document it’ll trigger, but if it’s not controlled it’ll probably be outside the document. (Edit: actually, I hadn’t considered that bearblog caps the body element’s width and uses margin, so if the mouse cursor is not in the main column it won’t trigger. My feeling is that this will get rid of almost all undeclared bots using browsers, but significantly undercount users with large screens.)
But my experience is that reasonably simple heuristics do a pretty good job of filtering out the bots the hit endpoint also excludes.
• Declared bots: the filtration technique can be ported as-is.
• Undeclared bots pretending to use browsers: that’s a behavioural matter, but when I did a little probing of this some years ago, I found that a great many of them were using unrealistic user-agent strings, either visibly wonky or impossible or just corresponding to browsers more than a year old (which almost no real users are using). I suspect you could get rid of the vast majority of them reasonably easily, though it might require occasional maintenance (you could do things like estimate the browser’s age based on their current version number and release cadence, with the caveat that it may slowly drift and should be checked every few years) and will certainly exclude a very few humans.
• Undeclared bots actually using browsers: this depends on the unknown I declared, whether they position their mice in the document area. But my suspicion is that these simply aren’t worth worrying about because they’re not enough to notably skew things. Actually using browsers is expensive, people avoid it where possible.
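A very rough sketch of the kind of user-agent heuristics described in the bullets above (the patterns and the version threshold are illustrative assumptions, not a vetted bot list):

```python
import re

DECLARED_BOT = re.compile(r"bot|crawler|spider|curl|python-requests", re.I)
MIN_CHROME_MAJOR = 110   # majors much older than current are almost never real users

def looks_human(user_agent):
    if not user_agent or DECLARED_BOT.search(user_agent):
        return False
    m = re.search(r"Chrome/(\d+)", user_agent)
    if m and int(m.group(1)) < MIN_CHROME_MAJOR:
        return False   # "browser" far too old: very likely an undeclared bot
    return True
```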
And on the matter of humans, it’s worth clarifying that the hit endpoint is worse in some ways, and honestly quite risky:
• Some humans will use environments that can’t trigger the extra hit request (e.g. text-mode browsers, or using some service that fetches and presents content in a different way);
• Some humans will behave in ways that don’t trigger the extra hit request (e.g. keyboard-only with no mouse movement, or loading then going offline);
• Some humans will block the extra hit request; and if you upset the wrong people or potentially even become too popular, it’ll make its way into a popular content blocker list and significant fractions of your human base will block it. This, in my opinion, is the biggest risk.
• There’s also the risk that at some point browsers might prefetch such resources to minimise the privacy leak. (Some email clients have done this at times, and browsers have wrestled from time to time with related privacy leaks, which have led to the hobbling of what properties :visited can affect, and other mitigations of clickjacking. I think it conceivable that such a thing could be changed, though I doubt it will happen and there would be plenty of notice if it ever did.)
But there’s a deeper question to it: if you don’t exclude some bots; or if the URL pattern gets on a popular content filter list: does it matter? Does it skew the ratios of your results significantly? (Absolute numbers have never been particularly meaningful or comparable between services or sources: you can only meaningfully compare numbers from within a source.) My feeling is that after filtering out most of the bots in fairly straightforward ways, the data that remains is likely to be of similar enough quality to the hit endpoint technique: both will be overcounting in some areas and undercounting in others, but I expect both to be Good Enough, at which point I prefer the simplicity of not having a separate endpoint.
(I think I’ve presented a fairly balanced view of the facts and the risks of both approaches, and invite correction in any point. Understand also that I’ve never tried doing this kind of analysis in any detail, and what examination and such I have done was almost all 5–8 years ago, so there’s a distinct possibility that my feelings are just way off base.)
> Even Fathom and Plausible analytics struggle with logging activity on adblocked browsers.
The simple solution is to respect the basic wishes of those who do not want to be tracked. This is a "struggle" only because website operators don't want to hear no.
I don't know how I feel about this overall. I think we took some rules from the physical world that we liked and discarded others, and we've ended up with a cognitively dissonant space.
For example, if you walked into my coffee shop, I would be able to lay eyes on you and count your visits for the week. I could also observe where you sit and how long you stay. If I were to better serve you with these data points, by reserving your table before you arrive with your order ready, you'd probably welcome my attention to detail. However, if I were to see that you pulled about x watts a month from my outlets and then suddenly locked up the outlets for a fee, you'd rightfully wish to never be observed again.
So what I'm getting at is: the issue with tracking appears to be the perverse assholes versus the benevolent shopkeepers doing the tracking.
To wrap up this thought: what's happening now though is a stalker is following us into every store, watching our every move. In physical space, we'd have this person arrested and assigned a restraining order with severe consequences. However, instead of holding those creeps accountable, we've punished the small businesses that just want to serve us.
--
I don't know how I feel about this or really what to do.
The coffee shop analogy falls apart after a few seconds because tracking in real life does not scale the same way that tracking in the digital space scales. If you wanted to track movements in a coffee shop as detailed as you can on websites or applications with heavy tracking, you would need to have a dozen people with clipboards strewn about the place, at which point it would feel justifiably dystopian. The only difference on a website is that the clipboard-bearing surveillers are not as readily apparent.
> you would need to have a dozen people with clipboards strewn about the place
Assuming you live in the US, next time you're in a grocery store, count how many cameras you can spot. Then consider: these cameras could possibly have facial recognition software; these cameras could possibly have object recognition software; these cameras could possibly have software that tracks eye movements to see where people are looking.
Then wonder: do they have cameras in the parking lot? Maybe those cameras can read license plates to know which vehicles are coming and going. Any time I see any sort of news about information that can be retrieved from a photo, I assume that it will be done by many cameras at >1 Hertz in a handful of years.
I'm in Germany, so even if those places have cameras, they need to post a privacy notice describing what they do and how long they retain data. Sure, most people will not read this, but it's out there. I will make a mental note to read these privacy notices more often from now on.
I think that's the point. It's the level of detail of tracking online that's the problem. If a website just wants to know someone showed up, that's one thing. If a site wants to know that I specifically showed up, and dig in to find out who I specifically am, and what I'm into so they can target me... that's too much.
Current website tracking is like the coffee shop owner hiring a private investigator to dig into the personal lives of everyone who walks in the door so they can suggest the right coffee and custom cup without having to ask. They could simply not do that and just let someone pick their own cup... or give them a generic one. I'd like that better. If clipboards in coffee shops are dystopian, so is current web tracking, and we should feel the same about it.
I think Bear strikes a good balance. It lets authors know someone is reading, but it's not keeping profiles on users to target them with manipulative advertising or some kind of curated reading list.
The coffee shop reserving my place and having my order ready before I arrive sounds nice, but is it not an unnecessary luxury that I would not miss had I never even thought of its possibility? I never asked for it; I was ready to stand in line for my order, and the tracking of my behavior resulted in a pleasant surprise, not a feature I was hoping for. If I really wanted my order to be ready when I arrive, then I would provide that information to you, not expect you to observe me to figure it out.
My point is that I don't get why small businesses should have the right to track me to offer better services that I never even asked for. Sure, it's nice, but it's not worth deregulating tracking and allowing all the evil corps to track me too.
You walk into your favorite coffee shop and order your favorite coffee, every day. But for privacy reasons the coffee shop owner is unaware of anything. He doesn't even track inventory, just orders whatever, whenever.
One day you walk in and now you can't get your favorite coffee, because the owner decided to remove that item from the menu. You get mad: "Where's my favorite coffee?" The barista says, "The owner removed it from the menu," and you get even more upset: "Why? Don't you know I come in here every day and order the same thing?!"
Nope, because you don't want any amount of tracking whatsoever; knowing any kind of habit your visitors have is wrong!
But in this scenario, the owner knowing that you order that coffee every day is what would have kept it on the menu, so you actually do like some tracking.
As much as I agree with respecting folks' wishes not to be tracked, most of these cases are not about "tracking".
It's usually website hosts just wanting to know how many folks are passing through. If a visitor doesn't even want to contribute to incrementing a private visit counter by +1, then maybe don't bother visiting.
If it were just about a simple count, the host could just `wc -l access.log`. Clearly website hosts are not satisfied with that, so they ignore DO_NOT_TRACK and disrespectfully try to circumvent privacy extensions.
> If it was just about a simple count the host could just `wc -l access.log`
That doesn't really work because a huge amount of traffic is from 1) bots, 2) prefetches and other things that shouldn't be counted, and 3) the same person loading the page 5 times, visiting every page on the site, etc. In short, these numbers will be wildly wrong (and in my experience "how wrong" can also differ quite a bit per site and over time, depending on factors that are not very obvious).
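For concreteness, here's a rough sketch (mine, not anything from the blog post) of the kind of filtering a log-based count needs before it means anything. It assumes a combined-format log at `access.log`; the bot heuristic is deliberately crude and all names are placeholders.

```python
# Parse a combined-format access log and compare the raw line count with a
# filtered count that drops obvious bots and repeat loads of the same page by
# the same address on the same day. Everything here is illustrative.
import re
from collections import defaultdict

LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)
BOT_HINTS = ("bot", "crawler", "spider", "curl", "wget")  # crude heuristic

raw_lines = 0
seen = set()                   # (date, ip, path) triples already counted
page_views = defaultdict(int)  # path -> filtered view count

with open("access.log") as fh:
    for line in fh:
        raw_lines += 1
        m = LOG_LINE.match(line)
        if not m:
            continue
        if m["method"] != "GET" or m["status"] != "200":
            continue  # skip POSTs, redirects, errors
        if any(hint in m["agent"].lower() for hint in BOT_HINTS):
            continue  # skip self-identified bots (many don't identify themselves)
        key = (m["date"], m["ip"], m["path"])
        if key in seen:
            continue  # same visitor reloading the same page on the same day
        seen.add(key)
        page_views[m["path"]] += 1

print(f"wc -l says {raw_lines}; filtered page views: {sum(page_views.values())}")
```

Even this much filtering still misses prefetches and bots with browser-like user agents, which is part of why the raw and "useful" numbers diverge so much.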
What people want is a simple break-down of useful things like which entry pages are used and where people came from (as in: "did my post get on the front page of HN?").
I don't see how anyone's privacy or anything is violated with that. You can object to that of course. You can also object to people wearing a red shirt or a baseball cap. At some point objections become unreasonable.
Is there a meaningful difference between recording "this IP address made a request on this date" and "this IP address made a request on this date after hovering their cursor over the page body"? How is your suggestion more acceptable than what the blog describes?
My point is that your `wc -l access.log` solution will also track people who send the Do Not Track header unless you specifically prevent it from doing so. In fact, you could implement the exact same system described in the blog post by replacing the Python code run on every request with an aggregation of the access log. So what is the pragmatic difference between the two?
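To make that concrete, here's a hedged sketch of what the log-based equivalent could look like: the same per-day uniqueness idea, computed offline from parsed log fields instead of in per-request application code. This is not the author's implementation; the salt and function names are placeholders.

```python
# Count unique daily visitors per page from already-parsed log fields.
# Hashing (salt, date, ip) gives per-day deduplication without keeping raw IPs
# in the aggregate counts.
import hashlib
from collections import defaultdict

def daily_visitor_id(ip: str, date: str, salt: str = "rotate-me") -> str:
    """Derive a per-day visitor token; raw IPs never leave this function."""
    return hashlib.sha256(f"{salt}:{date}:{ip}".encode()).hexdigest()

unique_hits = defaultdict(set)  # (date, path) -> set of visitor tokens

def record(ip: str, date: str, path: str) -> None:
    unique_hits[(date, path)].add(daily_visitor_id(ip, date))

# Feeding it the same visitor twice still counts one unique hit:
record("203.0.113.7", "10/Oct/2023", "/my-post")
record("203.0.113.7", "10/Oct/2023", "/my-post")
print({key: len(ids) for key, ids in unique_hits.items()})
# -> {('10/Oct/2023', '/my-post'): 1}
```

Whether the input comes from an access log or from a request handler, the data collected is identical, which is the pragmatic point being made here.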
Even the GDPR makes this distinction. Access logs (with IP addresses) are fine if you use them for technical purposes (monitoring for 404 and 500 errors), but if you use access logs for marketing purposes you need users to opt in, because IP addresses are considered PII by law. And if you don't log IPs then you can't track uniques. Tracking and logging are not the same thing.
Google Cloud, AWS VPSes, and many hosting services collect and provide this info by default. Chances are most websites do this, including the one you are using now. HN does IP bans, meaning they must access visitor IPs.
Why aren't people starting their protest against the website they're currently using instead of directing it at OP?
We all know that tracking is ubiquitous on the web. This blogpost however discusses technology that specifically helps with tracking people who don't want to be tracked. I responded that an alternative approach is to just not. That's not hypocritical.
Again, you don't answer the question: what's the difference between an image pixel or JavaScript logging that you visited the site vs. nginx/apache logging that you visited the site?
You're upset that OP used an image or JavaScript instead of grepping `access.log`, which makes absolutely no sense. The same data ends up there either way.
It's rude to tell people how they feel and it's rude to assert a post makes "absolutely no sense" while at the same time demanding a response.
One difference is intent. When you build an analytics system, you have an obligation to let people opt out. Access logs serve many legitimate purposes, and yes, they can also be used to track people, but that is not why access logs exist. This difference is also reflected in law: using access logs for security purposes is always allowed, but using that same data for marketing purposes may require an opt-in or disclosure.
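As a sketch of what honoring that opt-out could look like in practice (a hypothetical Flask-style handler, not the author's code), the hit endpoint can simply check the opt-out headers before recording anything:

```python
# Hypothetical hit endpoint that records nothing for visitors who opt out.
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/hit/<path:page>")
def hit(page: str) -> Response:
    # "DNT: 1" is the legacy Do Not Track header; "Sec-GPC: 1" is the newer
    # Global Privacy Control signal. Either one means: don't count this visit.
    if request.headers.get("DNT") == "1" or request.headers.get("Sec-GPC") == "1":
        return Response(status=204)  # acknowledge the request, store nothing
    count_unique_hit(page, request.remote_addr)  # however the site dedupes hits
    return Response(status=204)

def count_unique_hit(page: str, ip: str) -> None:
    ...  # e.g. the per-day hashed-visitor counting sketched earlier
```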
I have, unfortunately, become cynical in my old age. Don't take this the wrong way, but...
<cynical_statement>
The purpose of the web is to distribute ads. The "struggle" is with people who think we made this infrastructure so you could share recipes with your grandmother.
</cynical_statement>
No matter how bad the web gets, it can still get worse. Things can always get worse. That's why I'm not a cynic. Even when the fight is hopeless (and I don't believe it is), delaying the inevitable is still worthwhile.
The infrastructure was put in place for people to freely share information. The ads only came once people started spending their time online, because that's where the eyeballs were.
The interstate highway system in the US wasn't built with the intent of advertising to people; it was built to move people and goods around (and maybe to provide a way to move the military around on the ground when needed). Once there were a lot of eyes flying down the interstate, the billboard was used to advertise to those people.
The same thing happened with the newspaper, magazines, radio, TV, and YouTube. The technology comes first and the ads come with popularity and as a means to keep it low cost. We're seeing that now with Netflix as well. I'm actually a little surprised that books don't come with ads throughout them... maybe the longevity of the medium makes ads impractical.
The CSS tracker is as useful as server log-based analytics. If that is the information you need, cool.
But JS trackers offer so much more: time spent on the website, scroll depth, screen sizes, some limited, compliant, and yet useful unique-session tracking. Those things cannot be achieved without some (simple) JS.
Server side, JS, CSS... No one size fits all.
Wide Angle Analytics has strong privacy, DNT support, an opt-out mechanism, EU cloud hosting, compliance documentation, and full process adherence. It employs non-reversible, short-lived sessions that still give you good tracking. Combine it with a custom domain or first-party API calls and you get near-100% data accuracy.
> The CSS tracker is as useful as server log-based analytics.
It is not. Have you read the article?
The whole point of the CSS approach is to weed out user agents that never fire a hover event on the body. You can't see that from server logs.
I've added an edit to the essay for clarity.
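For anyone who hasn't read the essay: here's a minimal sketch of how a hover-gated hit could be wired up, assuming a Flask-style backend. The CSS property choice and endpoint names are my illustration, not copied from the essay. The stylesheet only requests the tracking "image" once a real pointer hovers the body, so agents that never hover never reach the hit endpoint.

```python
# Serve a per-page stylesheet whose body:hover rule points at a tracking URL.
# Browsers typically defer fetching the border-image until the rule applies,
# i.e. until something actually hovers the body.
from flask import Flask, Response

app = Flask(__name__)

TRACKING_CSS = """
body:hover {
    border-image: url('/hit/%(page)s');
}
"""

@app.route("/style/<page>.css")
def style(page: str) -> Response:
    return Response(TRACKING_CSS % {"page": page}, mimetype="text/css")

@app.route("/hit/<page>")
def hit(page: str) -> Response:
    record_hit(page)             # deduplication/hashing would happen in here
    return Response(status=204)  # the response body doesn't matter

def record_hit(page: str) -> None:
    ...  # e.g. the per-day unique-visitor counting sketched earlier
```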