Excellent post showing how to improve performance the right way, using the scientific method: hypothesis, measuring the baseline, making the change, measuring the effect, all with real tools and real code.
I also found it an interesting post in that it kind of inadvertently proves that for most situations you shouldn't optimise to this extent.
Meaning, yes, the OP got impressive performance improvements, but the code is also completely unreadable and utilises unsafe code sections, which could expose you to security problems, memory leaks, or memory corruption. Not to mention they've recreated, and will need to maintain, an in-house version of the Dictionary class.
Their first optimisations (from Enumerator to List, and from Any() to Count()) are something every codebase could use. Most of their other optimisations make the code a maintenance minefield.
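For context, here is a minimal sketch of that kind of change. This is illustrative only, not the article's actual code, and assumes the hot path just needs to know whether any matches exist:

```csharp
// Illustration only: the general pattern of swapping a per-call LINQ enumerator
// for a materialized List whose Count is a cheap field read.
using System.Collections.Generic;
using System.Linq;

static class MatchChecks
{
    // Before: Any() allocates an enumerator on every call.
    public static bool HasMatchesLinq(IEnumerable<string> matches) => matches.Any();

    // After: materialize once, then checking Count is trivial.
    public static bool HasMatchesList(List<string> matches) => matches.Count > 0;
}
```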
Plus, programmers are expensive. Hardware is cheap. Why spend time on code that's harder to write and harder to maintain in the medium to long term when instead you could just throw money at hardware and call it a day? Just food for thought, not really a criticism in and of itself.
PS - Please don't take this post too seriously. I am not really being critical, just playing devil's advocate. I actually enjoyed the linked article a lot.
Then re-design for map-reduce, and scale horizontally
> What if latency for a single request can't be improved with more cores?
Then look into pre-computing and caching
> What if your product is used by consumers who may have old hardware? Or phones, or watches, or laptops, and they want their battery to last?
I thought we were talking about server requests? If we are, then offload this work to the server
> What if you consistently practiced at making high performance code, maybe then it wouldn't seem "unreadable" to you any more?
But it's not all about you. Unless you're working on a pet project, or you have the credibility and reputation to be the final call on a significant open source project, you might get hit by a bus tomorrow. Or, if you do a really good job, your company will need you to be a force multiplier to teach a dozen others to try to imitate you. Even if you're a "10x" programmer.
And by the way, when you optimize code THIS much, any refactoring or tweaks to new features cause your optimizations to get tossed out, and you have to start over from scratch.
> What if slow software was common today because of modern attitudes, and I wasn't seeing any increase in stability or features to show for it?
Except you are, and you don't even realize it. Optimizations like this blog post matter a LOT on client software. Be it apps or websites, anything run on the client will need this kind of attention sometimes.
But this guy is writing server software. Micro-optimizing on the server side the way he is doing is silly.
> But this guy is writing server software. Micro-optimizing on the server side the way he is doing is silly.
Khm... What if you have millions of requests per second, tight latency requirements measured in milliseconds, and a bunch of business logic to fit into that? Such optimizations aren't so silly.
There are different scenarios on both client and server sides.
I'd suggest that OLTP applications where non-trivial amounts of state are shared between transactions are a reasonable counterexample where single core performance optimisation really matters.
I don't want to pick on one single thing in your response, because I agree with a good portion of it, but if you're developing for watches you're probably going to care about performance from the start and not take the GP's attitude (which is completely valid in certain scenarios).
I wonder if good VCS could solve this problem. For instance, if he documented each step with comments/commits in the code as well as he did for this blog post, it would be easy to go back and see not only why he did what he did, but the much more readable (albeit less performant) original code.
It seems like most git GUIs are pretty commit-focused. I'm not sure of any way to do it on the command line (though it must be possible), but it would be nice to highlight a section of code in your IDE and have git give you a history of just those lines (or that function) as far back as you want to go.
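For what it's worth, plain git can do this from the command line with git log -L; the file and function names below are hypothetical:

```sh
# History of just lines 120-160 of a file (file name is made up here)
git log -L 120,160:src/BotDetector.cs

# Or follow a single function by name, if git's heuristics can locate it
git log -L :IsCrawler:src/BotDetector.cs

# Line-by-line "who last touched this" annotations
git blame src/BotDetector.cs
```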
I cannot believe that you are honestly saying a 2x increase to throughput in production is something you "shouldn't take seriously" because the code isn't as readable as it was before.
Programmers are expensive. Hardware is cheap. That doesn't justify throwing every performance-increasing change out the window just because a fresh college grad won't be able to understand what's going on within 10 minutes.
> I cannot believe that you are honestly saying a 2x increase to throughput in production is something you "shouldn't take seriously" because the code isn't as readable as it was before.
I've worked on large and old codebases for years. I've seen plenty of examples of where a "clever" programmer has optimised the heck out of a section of code, made it completely unmaintainable, and as a result forced a re-write (the resulting code, which was slower, was also easier to maintain and reason about).
"Production" is a meaningless rallying cry. Everything is production sooner or later. Not everything needs to be fast, although there are critical areas of a typical project that do. It is really a question of code quality relative to performance, unless performance itself is actually causing you problems. Typically these overeager performance fixes are done to code preemptively.
This is all fine for your pet and toy projects; go work on something enterprise grade. Big code means clean code. I'd take ten lines of clean, easy-to-consume code over one "clever" line of genius.
I don't know your experiences, but the "performance optimizations" that I've seen lacked obvious technical support around them. Usually people thought "it will be faster" without using a profiler (sin no. 1), or they did not properly isolate and control the performance-critical part of the code (making performance tests part of the build, etc.). But these are other issues, separate from optimizing code.
The usual caveat of making code correct, readable, and then fast, in that order, of course also applies :)
It's not a personal attack because I don't actually know anything about you, and I'm not claiming to. I'm making that statement in regard to your two posts up the chain. They are statements that don't reveal any solid, usable points other than to say "simpler code is easier to understand, sometimes performance doesn't matter as much" but in such a strong form as to suggest that the advice plays out a lot in the real world. It doesn't.
Well, I don't think the author advises going all the way down this path with your code, but rather shows how to do it if you want to. As usual, what anyone does with that knowledge is up to them, but you surely wouldn't advise against using a tool just because you can do bad things with it.
I agree with you: usually you shouldn't optimize to the point where code quality starts to suffer. It's all about trade-offs. If you have one or two hundred servers and millions of RPS, then it could be reasonable. Or, coming back to code quality, perhaps it's worth revisiting the efficiency part and finding another algorithm/approach (someone suggested an FSM in the comments for this case).
> Plus programmers are expensive. Hardware is cheap.
This is entirely a matter of scale. One dev making software faster when it runs on thousands of machines can deliver far more ROI than the hours they spent. This is common in several industries.
Blocking access based on arbitrary user agent strings is a really bad idea. Every single bad bot will avoid known user agent strings or pretend to be Google, so you're only blocking the well-behaved ones. Plus, there are thousands of browser versions out there, so there's a very good chance you're blocking some users for no reason.
The proper way to do this is to block by IP, based on behavior. Block IPs slowing down the site, or throw up a captcha like Cloudflare does.
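As a very rough sketch of what behavior-based blocking could look like (the class name, field names, and threshold below are made up for illustration, not taken from the article):

```csharp
// Minimal sketch: count requests per IP in a sliding one-minute window and
// flag addresses that exceed a threshold for blocking or a captcha challenge.
using System;
using System.Collections.Concurrent;

class IpRateLimiter
{
    private readonly ConcurrentDictionary<string, (DateTime WindowStart, int Count)> _counters = new();
    private readonly int _maxPerMinute;

    public IpRateLimiter(int maxPerMinute = 300) => _maxPerMinute = maxPerMinute;

    // Returns true if the request should be blocked or challenged.
    public bool ShouldBlock(string ip)
    {
        var now = DateTime.UtcNow;
        var entry = _counters.AddOrUpdate(
            ip,
            _ => (now, 1),
            (_, e) => now - e.WindowStart > TimeSpan.FromMinutes(1)
                ? (now, 1)                       // window expired, start over
                : (e.WindowStart, e.Count + 1)); // same window, bump the count
        return entry.Count > _maxPerMinute;
    }
}
```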
Blocking bots sounds great but it just brings Google one step closer to a monopoly. Even good bots just pretend to be people nowadays because lots of people are implementing naive site protection strategies.
Yes, you're right. There are many ways to block robots: IP, UA, behaviour analysis. An advertising company has to have UA-based filtering to be compliant with standards. However, the focus of the blog post is on performance rather than on how to block bots.
The article is not about filtering bad bots. It even says so right near the top:
> We won’t cover black bots because it is a huge topic with sophisticated analysis and Machine learning algorithms. We will focus on the white and grey bots that identify themselves as such.
This is about not wasting time & effort showing advertising banners to good bots.
In theory you are right, but in reality 99% of the rogue bots are actually just scraper tools where ignorant users changed the sane defaults to "make it go faster".
They usually don't have enough knowledge to even understand that they're being routed into a black hole, let alone to do something about it.
Disclaimer: Getting rid of those idiots^Wmisguided poor souls is part of my job description.
Then why not block bots that ignore robots.txt or make requests too quickly, like I suggested? Looking for substrings in the UA header is a hack at best.
Those approaches all require keeping state information on the server side, for overall little benefit. Yes, I do what you proposed, but in hindsight it was a waste of time to set up: 99% simply don't know what they are doing and are therefore easy to catch.
I don't care about the one person who knows to change the user agent they send; those are the ones you can usually talk to, and they'll happily throttle their crawlers.
The other 99 are the problem - and they are a problem that can be solved pretty well by simple string matching.
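That simple string matching can be as small as a substring check on the User-Agent header; a toy example, with a made-up keyword list rather than anything from the article:

```csharp
// Toy example: flag self-identifying crawlers and common scraper libraries by
// case-insensitive substring match on the User-Agent header (keyword list is made up).
using System;
using System.Collections.Generic;

static class SimpleUaFilter
{
    private static readonly List<string> BotKeywords = new()
    {
        "bot", "crawler", "spider", "curl", "wget", "python-requests", "scrapy"
    };

    public static bool LooksLikeBot(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent))
            return true; // missing UA is already suspicious

        foreach (var keyword in BotKeywords)
        {
            if (userAgent.Contains(keyword, StringComparison.OrdinalIgnoreCase))
                return true;
        }
        return false;
    }
}
```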
Rather than block on UA, just add some honeypots: an invisible link. Any bot that pulls that page gets blocked, as scrapers tend to pull all links from the page and follow them (a rough sketch follows below).
Use robots.txt to disallow specific pages. Bots ignore robots.txt 99% of the time, so if they pull those pages anyway: block.
Check how quickly pages are pulled. If it passes a threshold: block.
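The honeypot part of that could look roughly like this; the path and storage are hypothetical, and a real setup would also list the path as disallowed in robots.txt:

```csharp
// Rough sketch: block any IP that requests a hidden "honeypot" URL which is
// published only as an invisible link and disallowed in robots.txt.
using System;
using System.Collections.Concurrent;

class HoneypotFilter
{
    private const string HoneypotPath = "/internal/do-not-crawl"; // made-up path

    private readonly ConcurrentDictionary<string, DateTime> _blockedIps = new();

    // Call once per request; returns true if the request should be rejected.
    public bool ShouldReject(string ip, string path)
    {
        if (_blockedIps.ContainsKey(ip))
            return true;

        if (path.Equals(HoneypotPath, StringComparison.OrdinalIgnoreCase))
        {
            _blockedIps[ip] = DateTime.UtcNow; // remember the offender
            return true;
        }
        return false;
    }
}
```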
I've seen bot traffic claiming to be recent versions of Firefox from residential IPs in the Ukraine pulling robots.txt. Sometimes this is one of the few clues to go on.
I did something similar with nginx, the data file from 51degrees, and some Lua code; each instance only handles 10-20k requests/sec, so no clever optimization was needed.
Sure, but there's nothing really special, just some JSON parsing and some variable assignments back into an internal nginx request. I'm on mobile at the moment, but I'll follow up on this in the next 24 hours.
I'd probably cache results in Dictionary<int, HashSet<string>> allowed, notAllowed; where int == length of the user agent. This should probably be blazing fast as well, instead of repeatedly doing those lookups.
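A sketch of what that cache might look like. The classify callback is a made-up stand-in for the expensive matching path, and this version is not thread-safe:

```csharp
// Sketch: two dictionaries keyed by User-Agent length, each holding the exact UA
// strings already classified, so repeat visitors skip the expensive matching.
using System;
using System.Collections.Generic;

class UaResultCache
{
    private readonly Dictionary<int, HashSet<string>> _allowed = new();
    private readonly Dictionary<int, HashSet<string>> _notAllowed = new();

    // classify() is the expensive check (e.g. keyword matching), run only on cache misses.
    public bool IsAllowed(string ua, Func<string, bool> classify)
    {
        int len = ua.Length;

        if (_allowed.TryGetValue(len, out var ok) && ok.Contains(ua))
            return true;
        if (_notAllowed.TryGetValue(len, out var bad) && bad.Contains(ua))
            return false;

        bool allowed = classify(ua);
        var bucket = allowed ? _allowed : _notAllowed;
        if (!bucket.TryGetValue(len, out var set))
            bucket[len] = set = new HashSet<string>();
        set.Add(ua);
        return allowed;
    }
}
```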
I doubt that exactly that will work. There are tens of thousands of different UAs (maybe 100K). Perhaps some kind of tiny cache (a few CPU cache lines) for the most popular UAs could help. But again: measure, measure, measure :)
An FST seems like a good fit for this problem. I believe it will be much more compact than the Aho-Corasick trie structure, though it depends on the size of the dictionary.
I'd look for a library and not write one from scratch. Lucene [1] and OpenFST [2] are great implementations. I haven't used C#, so I don't know if bindings exist or not.
Also, you may find this talk useful [3] (particularly slide 11).
Great write up by the way. Really thorough on the benchmarking!
What if the "grey" traffic came from residential IP addresses using a normally distributed range of user agents? How would you reliably distinguish them from regular traffic?
Basically, we use two sorts of techniques: technical and behavioral.
Technical: if the User-Agent claims to be a regular browser (say, Chrome 43), we check at the network level whether the client implements the HTTP protocol the way Chrome 43 usually does, and on the JS side whether the JavaScript rendering is correct for Chrome.
If it is a real Chrome, we then check whether the browser is controlled by an automation tool.
Behavioral: we check whether the sequence of requests is regular according to how the website is normally used.
Thanks for sharing!