Excellent post showing how to improve performance the right way, using the scientific method: hypothesis, measuring the baseline, making the change, measuring the effect, all with real tools and real code.
I also found it an interesting post in that it kind of inadvertently proves that for most situations you shouldn't optimise to this extent.
Meaning, yes, the OP got impressive performance improvements, but the code is also completely unreadable and utilises unsafe code sections, which could expose you to security problems, memory leaks, or memory corruption. Not to mention they've recreated, and will need to maintain, an in-house version of the Dictionary class.
Their first optimisations (from Enumerator to List, and from Any() to Count()) are something every codebase could use. Most of their other optimisations make the code a maintenance minefield.
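For context, here is a minimal sketch of that kind of change. This is illustrative only, not the article's actual code, and assumes the hot path just needs to know whether any matches exist:

```csharp
// Illustration only: the general pattern of swapping a per-call LINQ enumerator
// for a materialized List whose Count is a cheap field read.
using System.Collections.Generic;
using System.Linq;

static class MatchChecks
{
    // Before: Any() allocates an enumerator on every call.
    public static bool HasMatchesLinq(IEnumerable<string> matches) => matches.Any();

    // After: materialize once, then checking Count is trivial.
    public static bool HasMatchesList(List<string> matches) => matches.Count > 0;
}
```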
Plus, programmers are expensive. Hardware is cheap. Why spend time on code that's harder to write and harder to maintain in the medium to long term when instead you could just throw money at hardware and call it a day? Just food for thought, not really a criticism in and of itself.
PS - Please don't take this post too seriously. I am not really being critical, just playing devil's advocate. I actually enjoyed the linked article a lot.
Then re-design for map-reduce, and scale horizontally
> What if latency for a single request can't be improved with more cores?
Then look into pre-computing and caching
> What if your product is used by consumers who may have old hardware? Or phones, or watches, or laptops, and they want their battery to last?
I thought we were talking about server requests? If we are, then offload this work to the server
> What if you consistently practiced at making high performance code, maybe then it wouldn't seem "unreadable" to you any more?
But it's not all about you. Unless you're working on a pet project, or you have the credibility and reputation to be the final call on a significant open source project, you might get hit by a bus tomorrow. Or, if you do a really good job, your company will need you to be a force multiplier to teach a dozen others to try to imitate you. Even if you're a "10x" programmer.
And by the way, when you optimize code THIS much, any refactoring or tweaks to new features cause your optimizations to get tossed out, and you have to start over from scratch.
> What if slow software was common today because of modern attitudes, and I wasn't seeing any increase in stability or features to show for it?
Except you are, and you don't even realize it. Optimizations like this blog post matter a LOT on client software. Be it apps or websites, anything run on the client will need this kind of attention sometimes.
But this guy is writing server software. Micro-optimizing on the server side the way he is doing is silly.
> But this guy is writing server software. Micro-optimizing on the server side the way he is doing is silly.
Khm... What if you have millions of requests per second, tight latency requirements measured in milliseconds, and a bunch of business logic to fit into that? Such optimizations aren't so silly.
There are different scenarios on both client and server sides.
I'd suggest that OLTP applications where non-trivial amounts of state are shared between transactions are a reasonable counterexample where single core performance optimisation really matters.
I don't want to pick on one single thing in your response, because I agree with a good portion of it, but if you're developing for watches you're probably going to care about performance from the start and not take the GP's attitude (which is completely valid in certain scenarios).
I wonder if good VCS could solve this problem. For instance, if he documented each step with comments/commits in the code as well as he did for this blog post, it would be easy to go back and see not only why he did what he did, but the much more readable (albeit less performant) original code.
It seems like most git GUIs are pretty commit-focused. I'm not sure of any way to do it on the command line (though it must be possible), but it would be nice to highlight a section of code in your IDE and have git give you a history of just those lines (or that function) as far back as you want to go.
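For what it's worth, plain git can do this from the command line with git log -L; the file and function names below are hypothetical:

```sh
# History of just lines 120-160 of a file (file name is made up here)
git log -L 120,160:src/BotDetector.cs

# Or follow a single function by name, if git's heuristics can locate it
git log -L :IsCrawler:src/BotDetector.cs

# Line-by-line "who last touched this" annotations
git blame src/BotDetector.cs
```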
I cannot believe that you are honestly saying a 2x increase to throughput in production is something you "shouldn't take seriously" because the code isn't as readable as it was before.
Programmers are expensive. Hardware is cheap. That doesn't justify throwing every performance-increasing change out the window just because a fresh college grad won't be able to understand what's going on within 10 minutes.
> I cannot believe that you are honestly saying a 2x increase to throughput in production is something you "shouldn't take seriously" because the code isn't as readable as it was before.
I've worked on large and old codebases for years. I've seen plenty of examples of where a "clever" programmer has optimised the heck out of a section of code, made it completely unmaintainable, and as a result forced a re-write (the resulting code, which was slower, was also easier to maintain and reason about).
"Production" is a meaningless rallying cry. Everything is production sooner or later. Not everything needs to be fast, although there are critical areas of a typical project that do. It is really a question of code quality relative to performance, unless performance itself is actually causing you problems. Typically these overeager performance fixes are done to code preemptively.
This is all fine for your pet and toy projects; go work on something enterprise grade. Big code means clean code. I'd take ten lines of clean, easy-to-consume code over one "clever" line of genius.
I don't know your experiences, but the "performance optimizations" that I've seen lacked obvious technical support around them. Usually people thought "it will be faster" without using a profiler (sin no. 1), or they did not properly isolate and control the performance-critical part of the code (making performance tests part of the build, etc.). But these are other issues, separate from optimizing code.
The usual caveat of making code correct, readable, and then fast, in that order, of course also applies :)
It's not a personal attack because I don't actually know anything about you, and I'm not claiming to. I'm making that statement in regard to your two posts up the chain. They are statements that don't reveal any solid, usable points other than to say "simpler code is easier to understand, sometimes performance doesn't matter as much" but in such a strong form as to suggest that the advice plays out a lot in the real world. It doesn't.
Well, I don't think the author advises going all the way down this path with your code, but rather shows how to do it if you want to. As usual, what anyone does with that knowledge is up to them, but you surely wouldn't advise against using a tool just because you can do bad things with it.
I agree with you: usually you shouldn't optimize to the point where code quality starts to suffer. It's all about trade-offs. If you have one or two hundred servers and millions of RPS, then it could be reasonable. Or, coming back to code quality, perhaps it's worth revisiting the efficiency part and finding another algorithm/approach (someone suggested an FSM in the comments for this case).
> Plus programmers are expensive. Hardware is cheap.
This is entirely a matter of scale. One dev making software faster when it runs on thousands of machines can deliver far more ROI than the hours they spent. This is common in several industries.
Blocking access based on arbitrary user agent strings is a really bad idea. Every single bad bot will avoid known user agent strings or pretend to be Google, so you're only blocking the well-behaved ones. Plus, there are thousands of browser versions out there, so there's a very good chance you're blocking some users for no reason.
The proper way to do this is to block by IP, based on behavior. Block IPs slowing down the site, or throw up a captcha like Cloudflare does.
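As a very rough sketch of what behavior-based blocking could look like (the class name, field names, and threshold below are made up for illustration, not taken from the article):

```csharp
// Minimal sketch: count requests per IP in a sliding one-minute window and
// flag addresses that exceed a threshold for blocking or a captcha challenge.
using System;
using System.Collections.Concurrent;

class IpRateLimiter
{
    private readonly ConcurrentDictionary<string, (DateTime WindowStart, int Count)> _counters = new();
    private readonly int _maxPerMinute;

    public IpRateLimiter(int maxPerMinute = 300) => _maxPerMinute = maxPerMinute;

    // Returns true if the request should be blocked or challenged.
    public bool ShouldBlock(string ip)
    {
        var now = DateTime.UtcNow;
        var entry = _counters.AddOrUpdate(
            ip,
            _ => (now, 1),
            (_, e) => now - e.WindowStart > TimeSpan.FromMinutes(1)
                ? (now, 1)                       // window expired, start over
                : (e.WindowStart, e.Count + 1)); // same window, bump the count
        return entry.Count > _maxPerMinute;
    }
}
```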
Blocking bots sounds great but it just brings Google one step closer to a monopoly. Even good bots just pretend to be people nowadays because lots of people are implementing naive site protection strategies.
Yes, you're right. There are many ways to block robots: IP, UA, behaviour analysis. An advertising company has to have UA-based filtering to be compliant with standards. However, the focus of the blog post is on performance rather than on how to block bots.
The article is not about filtering bad bots. It even says so right near the top:
> We won’t cover black bots because it is a huge topic with sophisticated analysis and Machine learning algorithms. We will focus on the white and grey bots that identify themselves as such.
This is about not wasting time & effort showing advertising banners to good bots.
In theory you are right, but in reality 99% of the rogue bots are actually just scraper tools where ignorant users changed the sane defaults to "make it go faster".
They usually don't have enough knowledge to even understand that they're being routed into a black hole, let alone to do something about it.
Disclaimer: Getting rid of those idiots^Wmisguided poor souls is part of my job description.
Then why not block bots that ignore robots.txt or make requests too quickly, like I suggested? Looking for substrings in the UA header is a hack at best.
Those approaches all require keeping state information on the server side, for overall little benefit. Yes, I do what you proposed, but in hindsight it was a waste of time to set up: 99% simply don't know what they are doing and are therefore easy to catch.
I don't care about the one person who knows to change the user agent they send; those are the ones you can usually talk to, and they'll happily throttle their crawlers.
The other 99 are the problem - and they are a problem that can be solved pretty well by simple string matching.
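That simple string matching can be as small as a substring check on the User-Agent header; a toy example, with a made-up keyword list rather than anything from the article:

```csharp
// Toy example: flag self-identifying crawlers and common scraper libraries by
// case-insensitive substring match on the User-Agent header (keyword list is made up).
using System;
using System.Collections.Generic;

static class SimpleUaFilter
{
    private static readonly List<string> BotKeywords = new()
    {
        "bot", "crawler", "spider", "curl", "wget", "python-requests", "scrapy"
    };

    public static bool LooksLikeBot(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent))
            return true; // missing UA is already suspicious

        foreach (var keyword in BotKeywords)
        {
            if (userAgent.Contains(keyword, StringComparison.OrdinalIgnoreCase))
                return true;
        }
        return false;
    }
}
```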
Rather than block on UA, just add some honeypots: an invisible link. Any bot that pulls that page gets blocked, as scrapers tend to pull all links from the page and follow them (a rough sketch follows below).
Use robots.txt to disallow specific pages. Bots ignore robots.txt 99% of the time, so if they pull those pages anyway: block.
Check how quickly pages are pulled. If it passes a threshold: block.
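The honeypot part of that could look roughly like this; the path and storage are hypothetical, and a real setup would also list the path as disallowed in robots.txt:

```csharp
// Rough sketch: block any IP that requests a hidden "honeypot" URL which is
// published only as an invisible link and disallowed in robots.txt.
using System;
using System.Collections.Concurrent;

class HoneypotFilter
{
    private const string HoneypotPath = "/internal/do-not-crawl"; // made-up path

    private readonly ConcurrentDictionary<string, DateTime> _blockedIps = new();

    // Call once per request; returns true if the request should be rejected.
    public bool ShouldReject(string ip, string path)
    {
        if (_blockedIps.ContainsKey(ip))
            return true;

        if (path.Equals(HoneypotPath, StringComparison.OrdinalIgnoreCase))
        {
            _blockedIps[ip] = DateTime.UtcNow; // remember the offender
            return true;
        }
        return false;
    }
}
```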
I've seen bot traffic claiming to be recent versions of Firefox from residential IPs in the Ukraine pulling robots.txt. Sometimes this is one of the few clues to go on.
I did something similar with nginx, the data file from 51degrees, and some Lua code; each instance only handles 10-20k requests/sec, so no clever optimization was needed.
Sure, but there's nothing really special, just some JSON parsing and some variable assignments back into an internal nginx request. I'm on mobile at the moment, but I'll follow up on this in the next 24 hours.
I'd probably cache results in Dictionary<int, HashSet<string>> allowed, notAllowed; where int == length of the user agent. This should probably be blazing fast as well, instead of repeatedly doing those lookups.
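A sketch of what that cache might look like. The classify callback is a made-up stand-in for the expensive matching path, and this version is not thread-safe:

```csharp
// Sketch: two dictionaries keyed by User-Agent length, each holding the exact UA
// strings already classified, so repeat visitors skip the expensive matching.
using System;
using System.Collections.Generic;

class UaResultCache
{
    private readonly Dictionary<int, HashSet<string>> _allowed = new();
    private readonly Dictionary<int, HashSet<string>> _notAllowed = new();

    // classify() is the expensive check (e.g. keyword matching), run only on cache misses.
    public bool IsAllowed(string ua, Func<string, bool> classify)
    {
        int len = ua.Length;

        if (_allowed.TryGetValue(len, out var ok) && ok.Contains(ua))
            return true;
        if (_notAllowed.TryGetValue(len, out var bad) && bad.Contains(ua))
            return false;

        bool allowed = classify(ua);
        var bucket = allowed ? _allowed : _notAllowed;
        if (!bucket.TryGetValue(len, out var set))
            bucket[len] = set = new HashSet<string>();
        set.Add(ua);
        return allowed;
    }
}
```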
I doubt that exactly that will work. There are tens of thousands of different UAs (maybe 100K). Perhaps some kind of tiny cache (a few CPU cache lines) for the most popular UAs could help. But again: measure, measure, measure :)
An FST seems like a good fit for this problem. I believe it will be much more compact than the Aho-Corasick trie structure, though it depends on the size of the dictionary.
I'd look for a library and not write one from scratch. Lucene [1] and OpenFST [2] are great implementations. I haven't used C#, so I don't know if bindings exist or not.
Also, you may find this talk useful [3] (particularly slide 11).
Great write up by the way. Really thorough on the benchmarking!
What if the "grey" traffic came from residential IP addresses using a normally distributed range of user agents? How would you reliably distinguish them from regular traffic?
Basically, we use two sorts of techniques: technical and behavioral.
Technical: if the User-Agent claims to be a regular browser (say, Chrome 43), we check at the network level whether the client implements the HTTP protocol the way Chrome 43 usually does, and on the JS side whether the JavaScript rendering is correct for Chrome.
If it is a real Chrome, we then check whether the browser is controlled by an automation tool.
Behavioral: we check whether the sequence of requests is regular according to how the website is normally used.
Thanks for sharing!