This comes up fairly often in my work, usually as a slightly different version of the tradeoff between "complicated and considered" versus "cheap and dirty but gets the job done."
e.g. We could spend 3 weeks using feature vectors and backtesting against prior data to figure out what signals are sent by accounts that are likely to churn (for the purpose of proactively identifying them and attempting to derisk them, like by having the customer success team help them out directly)... or we could write a two-line if statement based on people's intuitive understanding of what unsuccessful accounts look like. (No login in 30 days and doesn't have the $STICKY_FEATURE enabled? Churn risk!)
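A minimal sketch of what that two-line heuristic might look like (field names and the 30-day threshold are just the ones from the example above, not anyone's real code):

    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical churn-risk heuristic: inactive for 30+ days and
    // $STICKY_FEATURE disabled => flag the account for customer success.
    public class ChurnHeuristic {
        static boolean isChurnRisk(Instant lastLogin, boolean stickyFeatureEnabled) {
            boolean inactive = Duration.between(lastLogin, Instant.now()).toDays() >= 30;
            return inactive && !stickyFeatureEnabled;
        }
    }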
The chief benefit of the second approach is that it actually ships, 100% of the time, whereas priorities change, an emergency develops, and suddenly the 75%-complete analysis gets thrown in a git repo and forgotten about. Actually shipping is a very useful feature to have.
I get the feeling this is more "era-defining" than that...
What hit me was the "Processors are so fast now we can brute-force grep over 100GB in a second".
We are entering a world where 20TB on a magnetic disk is viable, but randomly accessing all of that data could take months. So how we store data on disks will become vitally important to how we use the data - not unlike the tape drives of the pre-1980s era, where rewinding to the front of the tape cost you several minutes of (expensive) wait time.
Imagine a scenario where these guys design how to stream logs to the disk to maximise streaming reads, then optimise reading that out to SSD thence to RAM and L2 and so forth. All designed to drive text past a regex running at bazillions of times a second.
Let's call it the New Brute Force, where it's just as much effort to get out the door as elegant algorithms, but it is much, much, much simpler. And of course, it sells more hardware.
Expect to see Intel Inside wrapped around the New Brute Force any time soon :-)
Edit: And they used Java! I was expecting low-level C optimisations all over the place.
Indeed, in the not too distant future we expect to move to spinning disks, with the data arranged for streaming reads. Streaming off disk can be pretty fast. LZ4 compression is fast enough to give us another big boost without turning CPU into a bottleneck. But the big enabler will be spreading data across disks, so that in principle we can use every spindle we own in service of every nontrivial query. The fact that server logs have a very high write-to-read ratio really helps us out here.
Today, we're keeping everything on SSD -- twice, in fact, to maintain a hot spare for each server. SSD prices have fallen to the point where, even renting from Amazon, we can make the cost model work; and SSD is a great crutch as we tune our storage layer. But spinning disk will be a lot cheaper.
As for Java: we've been pretty surprised ourselves. Going in, we expected to need a lot more "native" (C) code. As it turned out, currently the only non-Java code we're using is the LZ4 library, and even there we only recently moved to the non-Java implementation. We do soon expect to move a few more bits of code to C, such as the core text-search loop. But based on the performance we've been getting in pure Java, we don't anticipate any need to move more than a fraction of 1% of our overall code base.
We do have some code that lives in a sort of middle ground -- Java code written like it was in C. Our database buffers live in a set of 2GB byte arrays, soon to be replaced by native memory blocks. We do our own (very simple) memory allocation within that, and use a sprinkling of Unsafe operations to access it. This gets us closer to C performance (not all the way there), without any of the hassles of bridging across languages. This part is still well under 1% of our overall codebase.
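For the curious, a minimal sketch of the "Java written like C" style described above -- not Scalyr's actual code, just the general Unsafe pattern with a toy bump-pointer allocator:

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    // Toy off-heap buffer: one native memory block, trivial bump-pointer
    // allocation, and raw reads/writes through Unsafe.
    public class OffHeapBuffer {
        private static final Unsafe UNSAFE = getUnsafe();

        private final long base;      // start address of the native block
        private final long capacity;  // size in bytes
        private long next;            // bump-pointer cursor

        public OffHeapBuffer(long capacityBytes) {
            this.capacity = capacityBytes;
            this.base = UNSAFE.allocateMemory(capacityBytes);
            this.next = base;
        }

        // Reserve `size` bytes and return their absolute address.
        public long alloc(long size) {
            if (next + size > base + capacity) {
                throw new OutOfMemoryError("off-heap buffer exhausted");
            }
            long addr = next;
            next += size;
            return addr;
        }

        public void putLong(long addr, long value) { UNSAFE.putLong(addr, value); }
        public long getLong(long addr)             { return UNSAFE.getLong(addr); }

        public void free() { UNSAFE.freeMemory(base); }

        private static Unsafe getUnsafe() {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }
    }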
I am reminded of John Carmack's comment on Oculus - he was amazed that the hardware had the power but the headsets performed badly - then he looked at the code and saw seas of abstractions on abstractions.
I remember reading "Latency Mitigation Strategies", by Carmack. It doesn't mention the Oculus except in an acknowledgment at the end, but it might be about the same thing lifeisstillgood is talking about:
I just remember a documentary/interview - iirc a five-minute interview at a trade show. It stuck in my head. I am afraid I cannot find the link now, but it was at a trade show: two people sitting down, a demo of the Oculus Rift, then comments.
Interesting.
That must be why C/C++ is always the go-to solution.
Instead of thinking "too many abstractions are making this code slow, therefore let's get rid of the abstractions" I would usually rather pick a better language where abstractions have little or no penalty (Haskell, Scheme, etc).
Maybe we are talking about different abstractions. Perhaps libraries? I reimplemented (probably badly) session handling code in wsgi (Python) - because the Django code was over 2000 lines long, used other libraries, and I did not understand it - especially when all I wanted was to store a 128-bit number in a client cookie and then look it up when I saw it again.
So the idea is simple (cookie sessions) but the different ways of implementing it can hold complexity, errors and abstractions.
There is nothing stopping the same thing happening with Haskell - I can, I am sure, write terrible code even in the best languages (see my entire output for proof :-)
They do? In my experience, lack of abstraction (often due to not analyzing the issue at hand thoroughly) results in non-abstract, verbose code that is hard to understand and refactor. It's the difference between, say, building an SQL query by appending strings to a buffer (move one line and everything blows up) and building a model of your query (projection, etc...). Sure, the abstraction means more code, but it's much easier to manipulate and considerably less risk-prone. It won't be faster than doing it the other way, but it won't necessarily be measurably slower.
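A rough illustration of the difference (the Query type and its methods are made up for the example, not any particular library):

    import java.util.ArrayList;
    import java.util.List;

    public class QuerySketch {
        // Fragile version: the query only exists as a half-built string.
        static String byConcatenation(String table, String filter) {
            return "SELECT * FROM " + table + " WHERE " + filter;
        }

        // Abstracted version: build a small model, render SQL at the end.
        record Query(String table, List<String> filters) {
            Query where(String clause) {
                List<String> next = new ArrayList<>(filters);
                next.add(clause);
                return new Query(table, next);
            }
            String toSql() {
                return "SELECT * FROM " + table
                        + (filters.isEmpty() ? "" : " WHERE " + String.join(" AND ", filters));
            }
        }
    }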
Euhm, abstractions in Haskell carry lots of penalties. Haskell is generally a language that only makes sense if you buy the "sufficiently smart compiler" argument. Haskell's abstractions shouldn't carry a penalty, because the compiler compiles them out when it recognizes them.
That's cute but while it's impressive what it recognizes, it's generally still stupid, and it will get beaten by bad programmers (especially by bad programmers. Becoming good at Haskell means, amongst other things, learning what the compiler will screw up).
Scheme, likewise, doesn't have free abstractions. Unless you mean macros, but those are not really free either imho.
There's one high-level language in wide use that has "free" abstractions, or at least, costs as low as possible, and that's C++.
netty's buffer management libraries do a good job at what you are looking for. Having an instance of the pooled allocator that is configured to prefer "direct" i.e. off heap memory might just be the thing.
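Something along these lines, assuming Netty 4.x (classes from io.netty.buffer):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.PooledByteBufAllocator;

    public class DirectBufferExample {
        public static void main(String[] args) {
            // true = prefer direct (off-heap) memory over heap arrays
            PooledByteBufAllocator allocator = new PooledByteBufAllocator(true);

            ByteBuf buf = allocator.directBuffer(4096); // one 4K buffer from the pool
            try {
                buf.writeLong(42L);
                System.out.println(buf.readLong()); // 42
            } finally {
                buf.release(); // return the buffer to the pool
            }
        }
    }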
I think it was posted a while back; it was an interesting enough article that it stuck in my head and I was able to find it pretty easily by googling "why is grep so fast?".
When analyzing big data algorithms, the amount of computation done by the CPU is often irrelevant. It's ignored. What matters is disk reads and disk access patterns. If you can make a scanning algorithm that starts at byte 0 and reads everything once in a sequential manner, you can do O(n^4) computation with it. The CPU will be long finished with the inefficient computation by the time the next set of data is retrieved from the (spinning) disk.
And on using Java: it's actually pretty fast, and certainly fast enough for a prototype. Once you have the stack built, you can benchmark and optimize the bits that need to be optimized; usually this means C or handcrafted ASM.
I don't think I've ever heard anybody claim that streaming algorithms permit O(n^4) computation. That seems implausible. Most streaming algorithms are linear, or sublinear. I'm not sure what you meant by O or n or 4, however.
Indeed, there will always be a point, as "n" increases toward infinity, when the computation time will overtake the streaming time.
A system's complexity is always bounded by its algorithm with the greatest complexity. As streaming is linear, there will always be a value of "n" after which the computation takes over, even with a quasi-linear complexity (even an O(n^1.00000001) computation will be slower than the streaming for very, very large values of n).
I should have mentioned that it implies that you can work on sufficiently small values of n, usually measured in how many cache lines the data takes up. Saying that "there will always be a value of "n" after which the computation will take over" is, while true, completely theoretical. Not that it doesn't have its uses, but I have yet to see an IO-heavy process being limited by the CPU. If it's streaming, it's usually also quite simple to process, and if it's random access the CPU is doing something else most of the time.
I always think of big-O in terms of algorithm analysis, while real-world performance is more a function of operation costs/memory costs. Its use in the latter is more colloquial but I get what you mean now.
As long as you do not have algorithmic dependence within the data in a cache line, you can regard calculation on that data as a constant that is dependent on the size of the data in the cache line. Even more so if you can utilize SIMD instructions. I know what I wrote wasn't clear on this, I was writing it in a hurry, sorry about that.
The reason I talk in cache lines is that it takes the real-world aspect into the analysis of the algorithm. A theoretical analysis, while valuable in some respects, is very far from real-world scenarios. I have done countless theoretical analyses of algorithms, found the actual (hidden) constants in big-O etc., but it just doesn't model the real world very well. This is the exact reason you see many fast algorithms switch to a naive but cache-friendly algorithm once the input is sufficiently small (e.g. with a divide and conquer algorithm).
I wrote it in a hurry. O(n^4) is only for the size of the current chunk of data being operated on, usually measured in cache lines. I guess it should have been m, as n is usually used for the total input size. It's not unusual to see O(m^c * log n) algorithms or even O(m^c * log log n) if you can use vEB search structure, where m is the number of cache lines the algorithm works on, c is an algorithm specific constant, and n is the total input size.
Ah, OK. So you meant space overhead in practical terms.
I would think that a greater-than-constant-time overhead would not be compatible with a real-time streaming operation. Just a guess, though.
Thanks for mentioning http://en.wikipedia.org/wiki/Van_Emde_Boas_tree as I hadn't heard of that. The last paragraph on the page gives some nice pointers (use a trie or randomized hash table).
I see what you're saying, but the article was not about "cheap and dirty" at all. They describe a precisely tuned system, riding on the controlled application of brute force. Like a rocket to Mars.
This approach is precisely what PHK was arguing against. That position paper basically says "Go ahead and do the logically correct thing and let the OS and VM (memory) take care of making it efficient".
I will always pick the cheap and dirty option until the project is at a stage where you have a lot of breathing room. Simple code trumps fancy code in the real world, where you don't have a large team, you don't have a lot of time, and you need something that works even if it's only half as efficient a solution.
This is the point where I say you won't get breathing room if your code is all quick and dirty. There is a balance between quick-and-dirty and well-formed code that doesn't need tending every few weeks.
Just do high-quality code that will remain maintainable into the future. If the business feels that has too much of a financial or time penalty, the business needs to find money to pay for more high-quality (slower) developers, or to find a different you.
Don't take someone else's problem / role on as your own to solve - it won't help you or them.
I think it depends on the context. Machine Learning and statistics isn't magic, and you definitely don't want to use it for the wrong reasons if you can figure it out with domain knowledge.
I would just like to throw in here: if you are going to be featurizing users to do churn prediction, I would highly recommend building a whole data pipeline and doing profit curves.
A data driven approach is only as good as the algorithms used.
If you use profit curves to augment churn prediction (churn prediction is a prerequisite), you can figure out how much a user is actually worth bringing back. This also allows you to figure out an optimal budget-to-profit ratio for the cost of winning a user back.
That being said, use it only if you have enough data to be accurate. And also only use it for the right reasons. Done routinely, I believe it can be a great investment.
I wrote a simple code search tool a number of years ago that had the ability to run arbitrary regex queries on a codebase, much like a recursive grep. I was working on a medium sized codebase and our primary platform was Windows. The problem with grep was that opening and scanning lots of small files apparently wasn't something that Windows was optimized for. If the file cache was cold, a simple search could turn into minutes of waiting.
My approach was to, essentially, concatenate all files into a single blob. The search tool would then let the regex loose on the contents of that single file. I had a simple index on the side to map byte positions to filenames. To get a line number from the position I would simply count the number of newlines from the start of the file to the regex match position.
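A rough Java sketch of that blob-plus-offsets idea (not the original tool; the file list and pattern are placeholders):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class BlobSearch {
        public static void main(String[] args) throws IOException {
            List<Path> files = List.of(Path.of("a.c"), Path.of("b.c")); // stand-in file list
            StringBuilder blob = new StringBuilder();
            TreeMap<Integer, Path> startOffsets = new TreeMap<>(); // blob offset -> file

            for (Path f : files) {
                startOffsets.put(blob.length(), f);
                blob.append(Files.readString(f, StandardCharsets.UTF_8)).append('\n');
            }

            Matcher m = Pattern.compile("int(8|16|32|64)_t").matcher(blob);
            while (m.find()) {
                Map.Entry<Integer, Path> e = startOffsets.floorEntry(m.start());
                // Line number = newlines between the file's start and the match, plus one.
                int line = 1;
                for (int i = e.getKey(); i < m.start(); i++) {
                    if (blob.charAt(i) == '\n') line++;
                }
                System.out.println(e.getValue() + ":" + line + ": " + m.group());
            }
        }
    }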
Add some smarts to avoid duplicating work, compression and multithreading and I had a tool that could find all lines that contained "int(8|16|32|64)_t" in the Linux kernel source in a third of a second. This was about 20 times faster than grep (with a hot file cache).
It was a simple solution that scaled well enough to handle the codebases I was working on at the time without problems. I later read Russ Cox's article[1] on how Google code search worked. I recommend that article to anyone who's interested in running regular expressions on a large corpus.
Edit: The source[2] is available on github. Tested on Linux and Windows.
Windows is (or at least was) a dog when it comes to this even if the files are in the cache. Like barrkel said, the number I quoted was for the case when the OS doesn't hit the drive.
You have to code for Windows just so. Dear POSIX people in general, there is more to cross platform coding than being able to build it with cygwin.
Every grep I've used on Windows handles wildcards, so it must be doing the expansion itself. The right way to do this sort of thing on Windows is to call FindFirstFile/FindNextFile, which is a cross between readdir and stat in that each call returns not only the name of one directory entry that matches the pattern (as per readdir+fnmatch) but also its metadata (as per stat). So, if you were going to call readdir (to get the names) and then stat each name (to get the metadata), you should really be doing all of it in just this one loop so that each bit of data is retrieved just the once.
But this is exactly what POSIX programmers would seemingly never, ever do. Probably because doing it that way would involve calling a Win32 function. Well, more fool them, because getting this data from FindFirstFile/FindNextFile seems pretty quick, and getting it any other way seems comparatively slow.
I cobbled together a native Win32 port of the_silver_searcher a while ago and it was about twice as fast as the stock EXE thanks to retrieving all the file data in one pass. In this case the readdir wrapper just needed to fill in the file type in the dirent struct; luckily it seems that some POSIX-style systems do this, so there was already code to handle this and it just needed the right #define adding. (I have absolutely no idea why MinGW32 doesn't do this already.)
Prior to switching to the_silver_searcher, I used to use grep; the grep I usually used is the GNU-Win32 one (http://gnuwin32.sourceforge.net/packages/grep.htm), and it looks to call readdir to get the names, and then stat to get each name's details. I checked that just now, and after all those years I can finally imagine why it's so slow.
The source code is there too, but I wouldn't look too closely. "Cobbled together", like I said. Though I've been using it all the time, in conjunction with ag-mode - https://github.com/Wilfred/ag.el - and haven't noticed any obvious problems.
HDDs don't care about opening files. If the FS supported the right operations, I could point grep directly at the block level and have it scan contiguous lists of blocks that cover the files I care about. You will (potentially) get false positives from the overscan, but you will also achieve the raw rate of the disk (150MB/s for spinning and 500MB/s for solid state).
So in effect, these big data serialization systems are attempting to lift the file system into userland so they can make just that optimization.
It's not unlike how we do free text search at my work. We collect all aggregate information for an item, concatenate it and store it in one table. Searching is amazingly fast. So fast in fact that we've had several clients open a ticket claiming that the search is broken. "There is no way you can search for a random text string in our million documents database this fast".
A friend of mine does a lot of work that often boils down to neighborhood searches in high-dimensional spaces (20-200 dimensions). The "Curse of Dimensionality" means that trees and tree-based spatial data structures are ineffective at speeding up these searches, because there are too many different paths to arrive at nearly the same place in 150-dimensional space.
Usually the solution ends up being to use techniques like principal component analysis to bit-pack each item in the dataset as small as possible, then to buy a bunch of Tesla GPUs with as much RAM as possible. The algorithm then becomes: load the entire dataset into the GPUs' memory once, then brute-force linear search the entire dataset for every query. The GPUs have enormous memory bandwidth and parallelism. With a bunch of them running at once, brute forcing through 30 gigabytes of compressed data can be done in milliseconds.
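A single-threaded CPU analogue of that brute-force scan, just to show the shape of it (the real thing runs on GPUs over compressed, bit-packed data):

    public class BruteForceNN {
        // data holds all vectors packed back-to-back: vector i occupies
        // data[i*dim .. i*dim+dim-1]. Returns the index of the nearest vector.
        static int nearest(float[] data, int dim, float[] query) {
            int best = -1;
            float bestDist = Float.MAX_VALUE;
            for (int i = 0; i < data.length / dim; i++) {
                float dist = 0;
                for (int d = 0; d < dim; d++) {
                    float diff = data[i * dim + d] - query[d];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = i; }
            }
            return best;
        }
    }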
I do wish Intel/AMD would make the REP SCASB/CMPSB instructions faster, since they could enable even faster and more efficient string searches limited only by memory bandwidth. (Or they could add a new instruction for memmem-like functionality, although I'd prefer the former.)
I’d certainly seen this at Google, where they’re
pretty good at that kind of thing. But at Scalyr,
we settled on a much more brute-force approach: a
linear scan through the log
The Art of Computer Programming mentions this methodology, and the importance of learning about it. There are entire sections of "tape algorithms" that maximize the "brute force linear scan" of tape drives, complete with diagrams of how a hypothetical tape drive works in the situation.
Few "fancy algorithms" books actually get into the low level stuff. But Knuth knows: modern computers are extremely fast at linear data, and accessing data linearly (as well as learning algorithms that access data linearly) is a good idea.
I'll bet you that a B-tree approach would be fastest actually. B-Trees are usually good at balancing "massive linear performance" with "theoretical asymptotic complexity". IIRC, there is some research into this area: (look up cache sensitive search trees)
We use some special tricks for searches that are executed
frequently, e.g. as part of a dashboard. (We’ll describe
this in a future article.)
And...
(You might wonder why we store log messages in this
4K-paged, metadata-and-text format, rather than
working with raw log files directly. There are many
reasons, which boil down to the fact that internally,
the Scalyr log engine looks more like a distributed
database than a file system. Text searches are often
combined with database-style filters on parsed log
fields; we may be searching many thousands of logs at
once; and simple text files are not a good fit for our
transactional, replicated, distributed data
management.)
It sounds like they're doing more than just "appending to the end of the log". If you're going to make an index of any kind, the index will likely be fastest with some sort of B-Tree.
They're not using any fancy algorithms...except when they implement a customized version of Boyer-Moore for better string searching in a critical section of their algorithm. Not to mention all the fancy algorithms optimizing their brute-force code underneath their immediately visible program.
A better title would be, "How we determined when to use brute force."
Exactly; "fancy" is relative. It's a bit tricky, but it's nothing like the complexity of maintaining and using a keyword index. The reference implementation given in the article we linked to is thirty-odd lines of code. What we're using in practice is somewhat larger, in part because Java is more verbose for this kind of thing, but still reasonable. (If there's interest, we'd be happy to post the code.)
I think the real point is that they did a bona-fide tradeoff analysis and found that for their use case one algorithm was better than another. It's not about how fancy the algorithm is, that's just how great engineering works. It's only surprising if you don't consider "brute force" to be just as valid a tool as any other.
Moreover, however fancy Boyer-Moore is, the data structure it is being applied to is incredibly simple. There are no inverted indexes, B-trees, etc - just a string.
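For reference, the textbook Boyer-Moore-Horspool variant is only a couple of dozen lines (this is the generic algorithm, not Scalyr's customized version):

    import java.util.Arrays;

    public class Horspool {
        // Returns the index of the first occurrence of needle in haystack, or -1.
        static int indexOf(byte[] haystack, byte[] needle) {
            if (needle.length == 0) return 0;
            int[] shift = new int[256];
            Arrays.fill(shift, needle.length);
            for (int i = 0; i < needle.length - 1; i++) {
                shift[needle[i] & 0xFF] = needle.length - 1 - i;
            }
            int pos = 0;
            while (pos <= haystack.length - needle.length) {
                int j = needle.length - 1;
                while (j >= 0 && haystack[pos + j] == needle[j]) j--;
                if (j < 0) return pos;
                // Skip ahead based on the last byte of the current window.
                pos += shift[haystack[pos + needle.length - 1] & 0xFF];
            }
            return -1;
        }
    }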
I don't know that this is simple. There is a lot of fancy work going on getting data in and out fast.
I think it's more that memory is getting slower relative to CPU, as we all know, so complicated data structures that involve lots of random access to memory aren't so efficient anymore vs linear scans through memory that can benefit from cache. It's still engineering, just the tradeoffs have shifted.
Interesting product. Have you considered writing an output plugin for Heka (https://github.com/mozilla-services/heka), so that people could use the parsers and client side aggregators etc written for Heka with your service?
We'd certainly be open to that if asked. We practice Complaint Driven Development [0] when it comes to API integrations and the like -- we prioritize what our customers ask for.
For our core experience, we went with a custom agent because that allowed us to viciously simplify the setup process. But we're very much open to working with other tools as well.
It's taught sometimes that the simple method is never the "good" approach, and that the fancier and more elaborate you are, the better the solution will be. I'm not just talking about code, either.
When did "simple" become such a dirty word? Simple isn't synonymous with lazy.
Reading 300GB from an SSD is going to be way too slow. In a simple test on Linode, the SSD seems to read at about 1GB/s, so a full scan would take on the order of five minutes per search.
300GB of RAM might cost $3000/month on AWS. There are a few dedicated server providers that offer 384GB or even 512GB configs for less than $1000/month.
This randomly reminded me of an ancient kuro5hin article that I found interesting back in the day, where the proposal was to use brute force to implement comment search on kuro5hin.
In BI, a similar tradeoff comes up quite frequently: lots of fancy preaggregation versus optimizing search across the raw, un- (or lightly-) processed base data.
Commonly, the choice of approach is dictated by the number of end users querying the same data. If it's relatively specific queries run frequently by many users, the aggregation approach can be made snappier and can save lots of processing power.
But for power users, it becomes a nuisance real fast to be limited not only by stored base data, but also by the somewhat arbitrarily chosen aggregate structures. I'm assuming people doing log analyses of server logs fall within "power users" in most shops. At least I'd hope so :)
As an aside, the Go language (which I'm currently flapping between loving and not so much), versus Python et al., seems to be born of somewhat similar thinking.
The article’s title is: “Searching 20 GB/sec: Systems Engineering Before Algorithms”.
Current title on Hacker News is misleading as the “simple code” is not compared against any “fancy algorithms”. This is about not spending time devising fancy algorithms when the simple O(n) approach is good enough.
I've implemented a brute-force search with an exponential algorithm, in a context where the user would want instant results. Basically, I implemented a bipartite graph producing algorithm that took the "obvious, low hanging fruit" first, and only worked for a set amount of time. This produced a "sloppy matching" tool that did most of the user's busy work for matching natural gas coming from "upstream" and going to points "downstream." Then the user could eyeball a few optimizations and put those in by hand.
I've also implemented a "web search" that was just a complete scan through a text file. But this was back in 1997, and the amount of data we had didn't justify anything more complicated.
Reminds me of the way LMAX achieves high throughput and low latency on their trading exchange. One of the key points is that they use parallelism but not concurrency. Keep your threads busy doing one thing only and avoid synchronising with other threads. No blocking, no queues, and they achieve >100K messages/s.
Where they do need to pass messages between threads, they use the famous LMAX Disruptor, which keeps data copying and blocking to a minimum.
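A minimal usage sketch, assuming the Disruptor 3.x API (the event type and field names are made up):

    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.RingBuffer;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    public class DisruptorSketch {
        // Mutable event slot reused by the ring buffer; no per-message allocation.
        static class LogEvent { String line; }

        public static void main(String[] args) throws Exception {
            Disruptor<LogEvent> disruptor =
                    new Disruptor<>(LogEvent::new, 1024, DaemonThreadFactory.INSTANCE);

            // Single consumer; no locks or queues, just sequence counters.
            disruptor.handleEventsWith(
                    (EventHandler<LogEvent>) (event, sequence, endOfBatch) ->
                            System.out.println("got: " + event.line));
            disruptor.start();

            RingBuffer<LogEvent> ring = disruptor.getRingBuffer();
            ring.publishEvent((event, sequence, line) -> event.line = line, "hello");

            Thread.sleep(100); // give the daemon consumer thread a moment
            disruptor.shutdown();
        }
    }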
I did a similar thing when holding a forum in Redis - generating a keyword index for searching took up 3x the amount of space of the raw text and, whilst faster for single whole word searches, fetching each article and grep'ing it in Ruby was plenty fast enough for my needs. Plus no overheads of index maintenance (new posts, expiring posts) and search limits ("only posts by X", "last 7 days") etc.
Why not build on top of existing full-text index/search engines? Why would I choose this over something like Kibana (http://rashidkpc.github.io/Kibana/), which provides me with all this and more, is built on top of existing capable, scalable open-source products, and is itself free and open source?
Simple is, more or less, an algorithm. Simple and in-memory is something I've used in a number of cases. The other benefit of simple is reliability. 20-30 years ago this type of solution wasn't really possible, but with today's 64-bit CPUs, tons of RAM, and (if you have to) SSDs, a lot of what used to require cleverness can now be done with simple.
Yes, you can go down that road [0], and in some applications it's a good approach. However, it adds quite a bit of complexity, and the cost for creating and storing the index is substantial. For us, it doesn't pencil out to a win.
I wonder if bitap [0] would be a good fit for the 4K search algorithm. It would let you do linear-time regexp matching for relatively short patterns (32 characters on a 32-bit machine, 64 characters on 64-bit, etc).
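For reference, the exact-match (shift-and) form of bitap fits in a few lines; the regexp version builds on the same bit-parallel state update:

    public class Bitap {
        // Returns the start index of the first exact match of pattern in text, or -1.
        // Pattern length is limited to the machine word size (here, 63 bits of a long).
        static int indexOf(String text, String pattern) {
            int m = pattern.length();
            if (m == 0) return 0;
            if (m > 63) throw new IllegalArgumentException("pattern too long for one word");

            long[] masks = new long[256];       // per-character bit masks
            for (int i = 0; i < m; i++) {
                masks[pattern.charAt(i) & 0xFF] |= 1L << i;
            }

            long state = 0;                     // bit j set = pattern[0..j] matches here
            long matchBit = 1L << (m - 1);
            for (int i = 0; i < text.length(); i++) {
                state = ((state << 1) | 1L) & masks[text.charAt(i) & 0xFF];
                if ((state & matchBit) != 0) return i - m + 1;
            }
            return -1;
        }
    }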
Herb Sutter gave a talk somewhat recently going over the basics of working intelligently with the cache. He had some kinda surprising results where naive algorithms on vectors/arrays outperformed the intuitively better list/tree approaches. Circumstances where you'd think all the copying elements and re-sizing std::vector would be expensive turn out not to matter.
How fancy and efficient the underlying runtime library, memory management, process scheduling, network stack, block I/O, and device drivers must have been designed and implemented, so that someone can just write naive code, achieve such high performance, and think himself a genius.
@Author: First of all, simple code and fancy algorithms are not opposites. Actually, very simple code (elegantly and concisely written) can have really fast performance, on the order of O(1). What you're suggesting is a very naive way of looking at the problem.
Secondly, you need to understand algorithm analysis in a little more detail. If you look at the run-time analysis of your brute-force algorithm, you can determine whether it grows in linear time, logarithmic, exponential, etc., and how to improve upon the core algorithm. Anyone can add more machines, more memory, or a faster processor, which will obviously help in improving performance, but in comparison to the order of growth in 'n' all of that pales into the background.