Having previously been in a position where an ad agency thought we were sending ...

Jordrok · on Nov 27, 2019

How did it get to the point where there was enough "bad" data that the ad agency took notice and was angry enough to take action? I like the idea of jamming ad networks with junk data, but I would imagine that it would have to happen at a massive scale to make any sort of difference.

lovehashbrowns · on Nov 27, 2019

They accept a certain rough percentage of bad data because having 100% clean data is a little impossible, but basically we constantly had people using our app on virtual machines, phone farms, reverse engineering our API, etc. It was a constant battle to ban them as soon as possible.

At some point, our ad partner contacted us letting us know that some of our data was coming from blacklisted IP addresses--AWS, Linode, known bots, etc. Ranges where a human almost certainly isn't actually viewing ads, and told us to fix it asap or get out.

We ended up licensing an IP blacklist. It updates daily, and it comes with both individual IP addresses and CIDR ranges. We didn't have time to write a fraud system to ban users, or to do this check via our API. So my solution was to check every IP that came in through our load balancers against the blacklist and blackhole it somehow.
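For illustration only, here is a minimal Python sketch of checking an incoming IP against a list that mixes individual addresses and CIDR ranges. The sample entries, the `#` comment convention, and the function names are hypothetical; the licensed list's real format isn't described here, and the production check was written in Lua, not Python.

```python
import ipaddress

def parse_blacklist(lines):
    """Split raw entries into a set of single addresses and a list of networks."""
    addresses, networks = set(), []
    for raw in lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue  # assumption: blank lines and '#' comments are skipped
        if "/" in entry:
            # strict=False tolerates entries whose host bits are set
            networks.append(ipaddress.ip_network(entry, strict=False))
        else:
            addresses.add(ipaddress.ip_address(entry))
    return addresses, networks

def is_blacklisted(ip, addresses, networks):
    """True if ip is listed directly or falls inside any listed range."""
    addr = ipaddress.ip_address(ip)
    return addr in addresses or any(addr in net for net in networks)

# Hypothetical sample entries (documentation-only address ranges).
SAMPLE = ["203.0.113.7", "198.51.100.0/24", "# comment", ""]
ADDRESSES, NETWORKS = parse_blacklist(SAMPLE)

print(is_blacklisted("198.51.100.42", ADDRESSES, NETWORKS))
print(is_blacklisted("192.0.2.1", ADDRESSES, NETWORKS))
```

Scanning the range list is linear in the number of ranges, which is one reason to prefer a lookup that is a single key check per request.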

Since we were using nginx, I swapped to OpenResty because that comes with Lua already fully baked in. Next, I wrote a Lua script that just checks if an IP address is in the blacklist. OpenResty even had a caching module! That was awesome.

The real hard part was where to keep the IP blacklist. I came up with the solution to use Redis. If an IP address exists as a key in the Redis DB, it's blacklisted. This "if key exists" check is O(1) in Redis as far as I know. So I wrote a cron job that runs every day to download the new blacklist, expand the CIDR ranges, pipe the individual IPs into a second unused Redis DB, save the DB and restart production Redis so it picks up the backup and refreshes its list of addresses. This list was massive, btw, especially when you expanded the CIDR ranges, some of which were /8s. And the Lua script would just run a GET query on Redis. If the key exists, OpenResty would just return a 40x code. Lua+OpenResty and Redis are all super fast so we didn't lose much by checking every single API request this way.
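A rough Python sketch of the expand-then-look-up idea, under stated assumptions: a plain dict stands in for the Redis keyspace, the `403` status is a guess at the "40x" code, and all function names are made up for the example. The real setup used a cron job, Redis, and a Lua script in OpenResty.

```python
import ipaddress

def expand_entry(entry):
    """Yield every individual IP in an entry; a bare address is treated as a /32."""
    net = ipaddress.ip_network(entry, strict=False)
    return (str(ip) for ip in net)

def build_store(entries):
    """Stand-in for loading Redis: one key per expanded IP address."""
    store = {}
    for entry in entries:
        for ip in expand_entry(entry):
            store[ip] = "1"  # the value is irrelevant, only key existence matters
    return store

def status_for(ip, store):
    """Per-request check: a single hash lookup, analogous to a Redis GET."""
    return 403 if ip in store else 200  # 403 is an assumed choice of 40x code

def expanded_size(entry):
    """Number of keys an entry expands to, without actually expanding it."""
    return ipaddress.ip_network(entry, strict=False).num_addresses

# Hypothetical entries (documentation-only address ranges).
store = build_store(["198.51.100.0/30", "203.0.113.7"])
print(status_for("198.51.100.3", store))
print(status_for("198.51.100.4", store))
print(expanded_size("10.0.0.0/8"))
```

A single /8 covers 2^24 = 16,777,216 addresses, which is why the expanded list gets so large: the approach trades memory for a constant-time check on every request.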

After that, the ad agency was happy and we didn't get booted. But it was a super close call. Basically if Redis didn't exist or wasn't as awesome, I'm fairly certain some engineers would have worked a solid 72 hours to write the PHP needed for an effective ban system that could go into production. I wrote the Lua/Redis solution and got it into production in an evening. So simple and really fun to write.

If this were to happen to a company getting bad data from a browser, they'd either have to clean up the data or get kicked out as well. Ad agencies pay for this data, so it's not like they're gonna turn into a charity and accept it. I'm sure it also messes up their datasets. I can't even imagine what it would take to clean data coming from a known good source/IP but with bad info. Yikes.