FastMail's Email Search Architecture

ams6110 · on Dec 1, 2014

If you want Xapian search on a local maildir, I highly recommend notmuch[1]. Adding new mail and updating the index can take noticeable time, but searching is super fast, it allows easy custom tagging, and in my experience the search results are better than Gmail's.

I use it from the emacs notmuch mode.

[1] http://notmuchmail.org/

danieldk · on Dec 1, 2014

I use notmuch, but mu should also be mentioned in the same breath :):

http://www.djcbsoftware.nl/code/mu/ http://www.djcbsoftware.nl/code/mu/mu4e.html

gregnbanks · on Dec 1, 2014

I never used notmuch, but I was much inspired by it when I saw a talk about it at linux.conf.au a few years back. Later I ended up stealing several ideas from notmuch about how to interface with Xapian, like the one-letter prefixes.
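For anyone unfamiliar with the prefix idea, here is a minimal sketch of it in plain Python rather than Xapian itself. The `PREFIXES` mapping and the `field_terms` helper are hypothetical and chosen only for illustration; they are not the actual tables used by notmuch or Cyrus. The point is that terms from different fields can share one index because each term carries a short field marker.

```python
# Hypothetical field-to-prefix mapping, for illustration only.
PREFIXES = {"from": "A", "subject": "S", "body": ""}

def field_terms(field, text):
    """Turn the text of one field into index terms tagged with the field's prefix."""
    prefix = PREFIXES[field]
    # Lowercasing the words keeps the uppercase prefix letter unambiguous.
    return [prefix + word for word in text.lower().split()]

subject_terms = field_terms("subject", "Search Architecture")
body_terms = field_terms("body", "search is fast")
```

With this scheme a subject-only query looks up "Ssearch", while a body query looks up plain "search", so the two never collide.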

darklajid · on Dec 1, 2014

I'm always a bit jealous when I see this setup (or mu).

But I don't use emacs, and that seems to lead to subpar support and crazy hacks to get something up and working, unfortunately.

danieldk · on Dec 1, 2014

> But I don't use emacs,

Me neither, but mutt-kz had built-in support for notmuch. Not the hacky kind that calls 'notmuch', but it actually links against libnotmuch.

http://kzak.redcrew.org/doku.php?id=mutt:start

darklajid · on Dec 1, 2014

That looks really neat. You ruined my day (in terms of productivity), but might've given me a nice early Christmas present. Thank you!

xorcist · on Dec 1, 2014

I can second this recommendation. I've used mutt-kz for about two years now as my primary email interface and it's been rock solid.

I came from regular mutt so there was no learning curve, but that shouldn't be too bad either once you get used to those keybindings.

gh02t · on Dec 1, 2014

There's also Alot, which is a Python frontend for notmuch. I love it, and use it daily. https://github.com/pazz/alot

jamespo · on Dec 1, 2014

Similarly I use mairix

http://www.rpcurnow.force9.co.uk/mairix/

pmoriarty · on Dec 1, 2014

Can notmuch do regular expression searches?

gregnbanks · on Dec 1, 2014

Xapian supports a query syntax which uses a trailing '*', but it's not a true wildcard/regexp, it's a range search on an ordered index. Notmuch allows you to use that syntax, see http://notmuchmail.org/searching/
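To illustrate why a trailing '*' is cheap while a leading one is not, here is a rough sketch in plain Python. It is not Xapian's implementation; `prefix_range` and the sample terms are made up for the example. It relies only on the fact that terms sharing a prefix sit next to each other in a sorted list.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return every term that starts with `prefix` from an already-sorted list."""
    # Binary search finds the first term that is >= prefix.
    lo = bisect_left(sorted_terms, prefix)
    hi = lo
    # Matching terms are contiguous, so walk forward until one stops matching.
    while hi < len(sorted_terms) and sorted_terms[hi].startswith(prefix):
        hi += 1
    return sorted_terms[lo:hi]

terms = sorted(["index", "indexer", "indexing", "inbox", "mail", "maildir"])
matches = prefix_range(terms, "index")
```

A pattern such as "*dex" gets no help from the ordering, which is why only the trailing form is supported this way.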

ams6110 · on Dec 1, 2014

It can do simple wildcarding with the * character, but I'm not aware of any general regexp capability, same as with any indexed database. Since you still have the underlying maildirs you can always just use egrep I guess.
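If you would rather script the fallback than call egrep, a minimal sketch using only the Python standard library might look like this. The function name `grep_maildir` is hypothetical, and the scan matches the raw bytes on disk, so base64 or quoted-printable encoded bodies will not match unless you decode the MIME parts first.

```python
import os
import re

def grep_maildir(root, pattern):
    """Return sorted paths of message files under `root` whose raw text matches `pattern`."""
    regex = re.compile(pattern)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Maildir keeps delivered messages in "cur" and "new"; "tmp" holds partial deliveries.
        if os.path.basename(dirpath) not in ("cur", "new"):
            continue
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Decode leniently so an odd encoding does not abort the whole scan.
            with open(path, "rb") as fh:
                text = fh.read().decode("utf-8", errors="replace")
            if regex.search(text):
                hits.append(path)
    return sorted(hits)
```

This reads every message on every search, which is exactly the cost the index avoids, so it only makes sense for occasional queries.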

brongondwana · on Dec 1, 2014

An obvious question that I didn't hit on in the blog is "what about host crashes"? The nice thing is, every index knows exactly which messages it covers - and we can quite quickly (within an hour or so for an entire server) scan all mailboxes and index the missing messages - it's more efficient than doing it in real time, because you are often indexing multiple messages in the same mailbox.

Once the indexes are up to date, we can switch back to being masters again. We index on all the replicas independently so that they are always ready.
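As a rough illustration of that catch-up scan (not the actual Cyrus code), the sketch below assumes the index can report the set of (mailbox, UID) pairs it already covers. The names `missing_by_mailbox`, `mailboxes` and `indexed` are hypothetical. Grouping the missing messages per mailbox is what makes batch indexing cheaper than one-at-a-time indexing.

```python
def missing_by_mailbox(mailboxes, indexed):
    """Group unindexed message UIDs by mailbox.

    mailboxes: dict mapping mailbox name -> iterable of message UIDs on disk.
    indexed:   set of (mailbox name, UID) pairs the index already covers.
    """
    todo = {}
    for mailbox, uids in mailboxes.items():
        missing = sorted(uid for uid in uids if (mailbox, uid) not in indexed)
        # Skip mailboxes that are fully covered so they cost nothing further.
        if missing:
            todo[mailbox] = missing
    return todo

mailboxes = {"INBOX": [1, 2, 3, 4], "Archive": [10, 11]}
indexed = {("INBOX", 1), ("INBOX", 2), ("Archive", 10), ("Archive", 11)}
todo = missing_by_mailbox(mailboxes, indexed)
```

Each entry in `todo` can then be handed to the indexer as one batch.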

brongondwana · on Dec 1, 2014

(during a clean shutdown, we copy all the indexes over to the SSD, and they get compacted to data in the next day's compact run)

kolev · on Dec 1, 2014

Although in the past I've implemented Xapian [1] over Sphinx [2], Sphinx today seems to be much better, but both Xapian and Sphinx are under-appreciated compared to Solr [3] and Elasticsearch [4].

[1] http://xapian.org/

[2] http://sphinxsearch.com/

[3] http://lucene.apache.org/solr/

[4] http://www.elasticsearch.org/

brongondwana · on Dec 1, 2014

All of those last three are awesome if you either put all the user's search indexes into a single engine, or have a shit-ton of memory.

With sphinx, we found we had to start and stop daemons all over the place to manage memory, and it was just unworkable. It was either that or run one big index per machine, but there are operational reasons I'd rather not be doing that. We try to keep everything user-sized.

That said, there's still stub Sphinx code in there. Both engines are GPL-licensed, which means compiling against Cyrus (BSD licensed) causes a non-BSD licensed end result. Not an issue for us, since we publish all our Cyrus code anyway.

There is talk of building an Elasticsearch backend into Cyrus as well - feel free, it's all open source. We'd definitely take the patch if it's good code (he says with his Cyrus Project Board Member hat on rather than his FastMail Director hat on).

gregnbanks · on Dec 1, 2014

Sphinx also had a bunch of really bad bugs around server startup and shutdown, and some ugly code. I ended up rewriting some of their pthreads code and submitting patches. I have no idea if they ever used them or not, because their development tree is internal and not visible outside their company.

Solr (well, Lucene) has awesome natural language stemming abilities, many more languages supported than Xapian. In particular it's much smarter than Xapian about Chinese. But a) the memory requirement makes running multiple shards on the same machine hard, and b) nobody in the company wanted to learn how to handle Java operationally.

EDIT: since 2012 Sphinx appear to have made a public mirror of their internal tree at https://code.google.com/p/sphinxsearch/source/checkout

chatman · on Dec 1, 2014

Apache Solr is best suited for such applications. AOL Mail uses Solr to power search for all users [0].

[0] - http://lucidworks.com/blog/podcast-solr-at-scale-at-aol/

frankwiles · on Dec 1, 2014

That advice is a bit dated honestly, Solr is fine but the current "best practice" if you will is to use ElasticSearch.

Not saying you can't do it with Solr or that Solr doesn't scale, it does. You'll just have an easier and more fun time doing it with ES.

A couple of related examples:

http://highscalability.com/blog/2014/1/6/how-hipchat-stores-...

http://exploringelasticsearch.com/github_interview.html

hendzen · on Dec 1, 2014

Seems like they independently invented a sort of log structured merge tree.

atombender · on Dec 2, 2014

LSM-type storage is frequently used in IR; it's not a new technique. I accidentally "invented" it a few years ago before I realized it had a name. Even Lucene 1.x used a variation of this for its "segment" files (which it still does, at least up to 3.x, afaik).

The reason is that you want to keep the inverted indexes sorted on disk, but you don't want to sort the entire index every time you update. So you create one mini-index per update and merge them lazily when you get too many of them.
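Here is a toy sketch of that mini-index-and-lazy-merge pattern, kept in memory for brevity. It is not how Lucene or FastMail implement it; the class name `SegmentedIndex`, the `max_segments` threshold, and the whitespace tokenizer are all assumptions made for the example.

```python
import heapq
from bisect import bisect_left

class SegmentedIndex:
    """Inverted index stored as several small sorted segments, merged lazily."""

    def __init__(self, max_segments=4):
        self.max_segments = max_segments
        self.segments = []  # each segment: sorted list of (term, sorted doc ids)

    def add_batch(self, docs):
        """Build one new sorted segment from `docs` (dict of doc id -> text)."""
        postings = {}
        for doc_id, text in docs.items():
            for term in set(text.lower().split()):
                postings.setdefault(term, []).append(doc_id)
        self.segments.append(sorted((t, sorted(ids)) for t, ids in postings.items()))
        # Merge only when too many segments have piled up.
        if len(self.segments) > self.max_segments:
            self.merge()

    def merge(self):
        """K-way merge of already-sorted segments; nothing is re-sorted from scratch."""
        merged = []
        for term, ids in heapq.merge(*self.segments, key=lambda entry: entry[0]):
            if merged and merged[-1][0] == term:
                # Same term seen in an earlier segment: union the posting lists.
                merged[-1] = (term, sorted(set(merged[-1][1]) | set(ids)))
            else:
                merged.append((term, list(ids)))
        self.segments = [merged]

    def search(self, term):
        """Look the term up in every segment and union the results."""
        key = (term.lower(),)
        found = set()
        for segment in self.segments:
            i = bisect_left(segment, key)  # binary search within a sorted segment
            if i < len(segment) and segment[i][0] == key[0]:
                found.update(segment[i][1])
        return sorted(found)
```

Updates stay cheap because each one only sorts its own small batch, while searches pay a little extra for checking several segments until the next merge.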

mikebo · on Dec 1, 2014

During the compaction phase, when a new temp db is installed while the old is being compacted, isn't there a window where messages in the old temp db are not searchable?

cobookman · on Dec 1, 2014

Does anyone know why they chose Xapian over elasticsearch?

robn_fastmail · on Dec 1, 2014

Xapian is to Cyrus what Lucene is to Elasticsearch - the search engine embedded into the larger server.

ams6110 · on Dec 1, 2014

For me it would be the use of Java. I've just had too much bad luck with it. Admittedly that's not really very objective reasoning.

alfiedotwtf · on Dec 1, 2014

From memory (sorry, it was a while back) we actually started with Elasticsearch, but it was way too heavy, so we looked for an alternative solution. Even with a single user it was consuming way too much memory.

robn_fastmail · on Dec 1, 2014

Sorry Alfie, your memory fails. We started with Sphinx.

We use Elasticsearch elsewhere (ELK stack), but not for mail search.

iancarroll · on Dec 1, 2014

FastMail is a great service; iCloud, however, completely fails with its web client. I have >50k emails, probably closer to 100k, and searching simply times out. It has never worked for me. It's sad, really...

brongondwana · on Dec 1, 2014

Sorry, I don't understand what you mean. iCloud is a service, and FastMail's web client doesn't talk to iCloud, it only talks to FastMail's servers - at least for now.

iancarroll · on Dec 1, 2014

I was contrasting the iCloud interface with FastMail; I didn't mean to imply they were connected.

brongondwana · on Dec 2, 2014

Ahh, ok. I misread your post then, sorry.