FastMail's Email Search Architecture (fastmail.com)
157 points by alfiedotwtf on Dec 1, 2014 | 31 comments



If you want Xapian search on a local maildir, I highly recommend notmuch[1]. Adding new mail and updating the index can take noticeable time, but searching is super fast, it allows easy custom tagging, and search results are better than Gmail's in my experience.

I use it from the emacs notmuch mode.

[1] http://notmuchmail.org/


I use notmuch, but mu should also be mentioned in the same breath :):

http://www.djcbsoftware.nl/code/mu/

http://www.djcbsoftware.nl/code/mu/mu4e.html


I never used notmuch, but I was much inspired by it when I saw a talk about it at linux.conf.au a few years back. Later I ended up stealing several ideas from notmuch about how to interface with Xapian, like the one-letter prefixes.


I'm always a bit jealous when I see this setup (or mu).

But I don't use emacs, and that seems to lead to subpar support and crazy hacks to get something up and working, unfortunately.


> But I don't use emacs,

Me neither, but mutt-kz has built-in support for notmuch. Not the hacky kind that shells out to 'notmuch'; it actually links against libnotmuch.

http://kzak.redcrew.org/doku.php?id=mutt:start


That looks really neat. You ruined my day (in terms of productivity), but might've given me a nice early Christmas present. Thank you!


I can second this recommendation. I've used mutt-kz for about two years now as my primary email interface and it's been rock solid.

I came from regular mutt so there was no learning curve for me, but it shouldn't be too bad for newcomers either once you get used to mutt's keybindings.


There's also Alot, which is a Python frontend for notmuch. I love it, and use it daily. https://github.com/pazz/alot



Can notmuch do regular expression searches?


Xapian supports a query syntax which uses a trailing '*', but it's not a true wildcard/regexp, it's a range search on an ordered index. Notmuch allows you to use that syntax, see http://notmuchmail.org/searching/
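
To illustrate the "range search on an ordered index" point, here is a minimal Python sketch. It is not Xapian's or notmuch's actual code; the function name and the plain sorted list standing in for the term index are assumptions made for the example. The idea is that every term sharing a prefix sits in one contiguous run of a sorted term list, so a trailing '*' only needs a binary search for the start of that run plus a short scan.

```python
import bisect


def prefix_search(sorted_terms, prefix):
    """Return every term in sorted_terms that starts with prefix.

    sorted_terms is a hypothetical stand-in for an ordered term index.
    """
    # Binary search for the first term that is >= prefix.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    # Terms sharing the prefix are contiguous, so stop at the first miss.
    for i in range(start, len(sorted_terms)):
        if not sorted_terms[i].startswith(prefix):
            break
        matches.append(sorted_terms[i])
    return matches


terms = sorted(["se", "sea", "seal", "search", "second", "xapian"])
print(prefix_search(terms, "sea"))
```

This also shows why a leading wildcard or a general regexp does not fit the same trick: there is no single contiguous range to scan.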


It can do simple wildcarding with the * character, but I'm not aware of any general regexp capability, same as with any indexed database. Since you still have the underlying maildirs you can always just use egrep I guess.


An obvious question that I didn't hit on in the blog is "what about host crashes"? The nice thing is, every index knows exactly which messages it covers - and we can quite quickly (within an hour or so for an entire server) scan all mailboxes and index the missing messages - it's more efficient than doing it in real time, because you are often indexing multiple messages in the same mailbox.

Once the indexes are up to date, we can switch back to being masters again. We index on all the replicas independently so that they are always ready.
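
For readers who want the catch-up scan spelled out, here is a rough Python sketch. It is not the Cyrus implementation; the function name and the dictionary shapes are assumptions for illustration. It compares the messages present in each mailbox with the set the index says it covers, and groups the missing ones per mailbox so they can be indexed in one batch.

```python
def find_missing(mailboxes, indexed):
    """Return {mailbox: sorted UIDs that still need indexing}.

    mailboxes: hypothetical dict of mailbox name -> iterable of message UIDs
    indexed:   hypothetical dict of mailbox name -> set of UIDs already covered
    """
    missing = {}
    for name, uids in mailboxes.items():
        # Anything on disk that the index does not claim to cover.
        todo = sorted(set(uids) - indexed.get(name, set()))
        if todo:
            missing[name] = todo  # batched per mailbox
    return missing


on_disk = {"INBOX": [1, 2, 3, 4], "Archive": [10, 11], "Sent": [7]}
covered = {"INBOX": {1, 2}, "Archive": {10, 11}}
print(find_missing(on_disk, covered))
```

Grouping by mailbox is what makes the batch pass cheaper than real-time indexing in this sketch: each mailbox is visited once, however many messages it is missing.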


(during a clean shutdown, we copy all the indexes over to the SSD, and they get compacted to data in the next day's compact run)


Although in the past I've implemented Xapian [1] over Sphinx [2], Sphinx today seems to be much better, but both Xapian and Sphinx are under-appreciated compared to Solr [3] and Elasticsearch [4].

[1] http://xapian.org/

[2] http://sphinxsearch.com/

[3] http://lucene.apache.org/solr/

[4] http://www.elasticsearch.org/


All of those last three are awesome if you either put all the user's search indexes into a single engine, or have a shit-ton of memory.

With sphinx, we found we had to start and stop daemons all over the place to manage memory, and it was just unworkable. It was either that or run one big index per machine, but there are operational reasons I'd rather not be doing that. We try to keep everything user-sized.

That said, there's still stub Sphinx code in there. Both engines have GPL licensing on them, which means compiling against Cyrus (BSD licensed) causes a non-BSD licensed end result. Not an issue for us, since we publish all our Cyrus code anyway.

There is talk of building an Elasticsearch backend into Cyrus as well - feel free, it's all open source. We'd definitely take the patch if it's good code (he says with his Cyrus Project Board Member hat on rather than his FastMail Director hat on).


Sphinx also had a bunch of really bad bugs around server startup and shutdown, and some ugly code. I ended up rewriting some of their pthreads code and submitting patches. I have no idea if they ever used them or not, because their development tree is internal and not visible outside their company.

Solr (well, Lucene) has awesome natural language stemming abilities, many more languages supported than Xapian. In particular it's much smarter than Xapian about Chinese. But a) the memory requirement makes running multiple shards on the same machine hard, and b) nobody in the company wanted to learn how to handle Java operationally.

EDIT: since 2012 Sphinx appears to have made a public mirror of their internal tree at https://code.google.com/p/sphinxsearch/source/checkout


Apache Solr is best suited for such applications. AOL Mail uses Solr to power search for all users [0].

[0] - http://lucidworks.com/blog/podcast-solr-at-scale-at-aol/


That advice is a bit dated, honestly. Solr is fine, but the current "best practice", if you will, is to use Elasticsearch.

Not saying you can't do it with Solr or that Solr doesn't scale; it does. You'll just have an easier and more fun time doing it with ES.

A couple of related examples:

http://highscalability.com/blog/2014/1/6/how-hipchat-stores-...

http://exploringelasticsearch.com/github_interview.html


Seems like they independently invented a sort of log structured merge tree.


LSM-type storage is frequently used in IR; it's not a new technique. I accidentally "invented" it a few years ago before I realized it had a name. Even Lucene 1.x used a variation of this for its "segment" files (which it still does, at least up to 3.x, afaik).

The reason is that you want to keep the inverted indexes sorted on disk, but you don't want to sort the entire index every time you update. So you create one mini-index per update and merge them lazily when you get too many of them.
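
A toy Python sketch of that scheme, for anyone who hasn't seen it. It is not Lucene's or Xapian's code; the class name, the max_segments threshold, and the (term, doc_id) tuples are assumptions made for the example. Each update becomes its own small sorted segment, searches consult every segment, and the segments are merged into one only when there are too many of them.

```python
import bisect
import heapq


class SegmentedIndex:
    """Toy inverted index kept as several sorted segments (hypothetical)."""

    def __init__(self, max_segments=4):
        self.max_segments = max_segments
        self.segments = []  # each segment: sorted list of (term, doc_id)

    def add_batch(self, postings):
        # One mini-index per update: sort only the new postings.
        self.segments.append(sorted(set(postings)))
        if len(self.segments) > self.max_segments:
            self.merge()  # lazy merge, only when there are too many

    def merge(self):
        merged = []
        # heapq.merge streams the already-sorted segments in order.
        for posting in heapq.merge(*self.segments):
            if not merged or merged[-1] != posting:
                merged.append(posting)
        self.segments = [merged]

    def search(self, term):
        docs = set()
        for seg in self.segments:
            # (term,) sorts before every (term, doc_id) pair.
            i = bisect.bisect_left(seg, (term,))
            while i < len(seg) and seg[i][0] == term:
                docs.add(seg[i][1])
                i += 1
        return docs


idx = SegmentedIndex(max_segments=2)
idx.add_batch([("cat", 1), ("dog", 1)])
idx.add_batch([("cat", 2)])
idx.add_batch([("emu", 3), ("cat", 1)])  # third segment triggers a merge
print(idx.search("cat"), len(idx.segments))
```

The trade-off is the usual one: writes stay cheap because only the new batch is sorted, while reads pay for checking several segments until the next merge.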


During the compaction phase, when a new temp db is installed while the old is being compacted, isn't there a window where messages in the old temp db are not searchable?


Does anyone know why they chose Xapian over elasticsearch?


Xapian is to Cyrus what Lucene is to Elasticsearch - the search engine embedded into the larger server.


For me it would be the use of Java. I've just had too much bad luck with it. Admittedly that's not really very objective reasoning.


From memory (sorry, it was a while back) we actually started with Elasticsearch, but it was way too heavy, so we looked for an alternative solution. Even with a single user it was consuming way too much memory.


Sorry Alfie, your memory fails. We started with Sphinx.

We use Elasticsearch elsewhere (ELK stack), but not for mail search.


Fastmail is a great service, however iCloud completely fails with their web client. I have >50k emails, probably closer to 100k and searching simply times out. Never has worked for me. It's sad, really...


Sorry, I don't understand what you mean. iCloud is a service, and FastMail's web client doesn't talk to iCloud, it only talks to FastMail's servers - at least for now.


I was contrasting the iCloud interface to FastMail, an implication they were connected wasn't intended.


Ahh, ok. I misread your post then, sorry.



