the parser is all hand-written c. the only library i use outside of libc is xxhash.
documents are read and parsed byte by byte, with "sections" (e.g. side nav) and "contexts" (e.g. comment author) tracked in a stack of states. part of the speed comes from writing a custom hashing tool that generates compact, very fast hash tables with few or no collisions. so identifying html tags, etc. is very fast and keeps everything as close to l1 cache as possible.
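to make that concrete, here's a rough sketch of the idea (not ontolo's actual code): a small open-addressed table keyed with xxhash's XXH64 for tag lookup, plus a stack of section states that gets pushed as known tags are seen. the table size, seed, tag set, and section names are all invented for illustration; a real table generator would search for a size/seed combination with zero collisions for its exact key set, and closing tags would pop the stack (omitted here for brevity).

    /* sketch only: tag lookup via a tiny xxhash-keyed table + a section stack */
    #include <stdio.h>
    #include <string.h>
    #include "xxhash.h"

    #define TABLE_SIZE 64              /* power of two keeps the modulo a mask */

    typedef struct { const char *name; int id; } tag_entry;
    static tag_entry tag_table[TABLE_SIZE];

    enum { SEC_BODY, SEC_NAV };        /* example "sections" */
    static int section_stack[32];
    static int section_top = 0;

    static unsigned slot_for(const char *name, size_t len)
    {
        return (unsigned)(XXH64(name, len, 0) & (TABLE_SIZE - 1));
    }

    static void add_tag(const char *name, int id)
    {
        unsigned s = slot_for(name, strlen(name));
        while (tag_table[s].name)      /* linear probe; a generated table avoids this */
            s = (s + 1) & (TABLE_SIZE - 1);
        tag_table[s].name = name;
        tag_table[s].id = id;
    }

    static int lookup_tag(const char *name, size_t len)
    {
        unsigned s = slot_for(name, len);
        while (tag_table[s].name) {
            if (strlen(tag_table[s].name) == len &&
                memcmp(tag_table[s].name, name, len) == 0)
                return tag_table[s].id;
            s = (s + 1) & (TABLE_SIZE - 1);
        }
        return -1;                     /* unknown tag */
    }

    int main(void)
    {
        add_tag("nav", SEC_NAV);
        add_tag("article", SEC_BODY);

        /* walk the document byte by byte; on '<', read the tag name and,
         * if it maps to a known section, push that section onto the stack */
        const char *doc = "<article><nav>links</nav>text</article>";
        for (const char *p = doc; *p; p++) {
            if (*p != '<')
                continue;
            const char *start = p + 1;
            const char *end = start;
            while (*end && *end != '>' && *end != ' ')
                end++;
            int id = lookup_tag(start, (size_t)(end - start));
            if (id >= 0 && section_top < 32)
                section_stack[section_top++] = id;
        }
        printf("sections pushed: %d\n", section_top);
        return 0;
    }

because the table, the keys, and the state stack are all tiny and fixed-size, the hot path stays in cache instead of chasing pointers around the heap.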
data is stored exclusively in sphinx. i don't know if it's a secret or not, but i've seen no one talk about it: elasticsearch crumbled horribly under this kind of load. sphinx performs beautifully.
depending on the path i take from here, creating my own index and engine is becoming a serious consideration.
nope. it's all in sphinx. sphinx has its own sort of data store called 'attributes', which are analyzed after a full-text match is performed.
this doesn't work well for relational-style searches (as in sql), where you might look for results that match a specific location or metric (e.g. star rating) first. but if the full-text match is the most important part of the query, attributes can then be used to refine the results even further. fortunately for ontolo, that's the case, and it lets me avoid managing yet another software package and data store. though that might change in the future, depending on how the complexity and amount of data evolve over time.
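as a rough illustration of what that kind of query looks like (a sketch under assumptions, not ontolo's code): searchd speaks the mysql protocol ("sphinxql", conventionally on port 9306), so you can query it from c with the ordinary mysql client library. the index name 'pages', the attribute 'star_rating', and the query text below are all invented for the example.

    #include <stdio.h>
    #include <mysql/mysql.h>

    int main(void)
    {
        MYSQL *conn = mysql_init(NULL);
        if (!conn)
            return 1;

        /* connect to searchd's sphinxql listener (9306 by convention) */
        if (!mysql_real_connect(conn, "127.0.0.1", "", "", NULL, 9306, NULL, 0)) {
            fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
            return 1;
        }

        /* full-text match first; attribute filters then refine the result set */
        const char *query =
            "SELECT id FROM pages "
            "WHERE MATCH('guest post guidelines') AND star_rating >= 4 "
            "LIMIT 20";

        if (mysql_query(conn, query) != 0) {
            fprintf(stderr, "query failed: %s\n", mysql_error(conn));
            mysql_close(conn);
            return 1;
        }

        MYSQL_RES *res = mysql_store_result(conn);
        if (res) {
            MYSQL_ROW row;
            while ((row = mysql_fetch_row(res)) != NULL)
                printf("doc id %s\n", row[0]);
            mysql_free_result(res);
        }

        mysql_close(conn);
        return 0;
    }

(build with -lmysqlclient; searchd needs a sphinxql listener configured.)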
properly designing the index here for size and speed was one of the biggest challenges i faced: deciding what to keep in ram, what to keep on disk (attributes), and how to organize that much data so it didn't kill the disk but still let everything you could want be retrieved and searched against. this might have been the most time-consuming part of the entire project in terms of thinking time.
as for xxhash, it's amazing in a ton of ways. i've taken a sort of special interest in hash functions, designed my own suite of testing tools, and written quite a few hash functions of my own. on every metric i've tested xxhash against, it has performed beautifully. the only time i'm ever able to write anything faster is when the quality of the hash is severely compromised in order to make a very customized hash for a very specific and narrow set of data. and compared to other 64-bit hashes out there, it consistently outperforms them in speed and distribution across many hardware architectures. yann collet really made something amazing there (in addition to several of his other projects).
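for anyone curious, the one-shot api is about as simple as hashing gets; this is the real xxhash call (compile and link against xxhash.c from the library):

    #include <stdio.h>
    #include <string.h>
    #include "xxhash.h"

    int main(void)
    {
        const char *key = "href";

        /* one-shot 64-bit hash; changing the seed re-rolls the whole table
         * layout, which is handy when hunting for a collision-free arrangement */
        XXH64_hash_t h = XXH64(key, strlen(key), 0);

        printf("xxh64(\"%s\") = %016llx\n", key, (unsigned long long)h);
        return 0;
    }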
[edited to add]: i forgot to mention that one benefit of storing data in sphinx as attributes is that you can store plain text (or json, etc.) that is returned directly in the query result. this is how we return urls, page titles, etc. to the user in the browser and in exports, eliminating the need for a separate raw data store.