If you want Xapian search on a local maildir, I highly recommend notmuch[1]. Adding new mail and updating the index can take noticeable time, but searching is super fast, it allows easy custom tagging, and search results are better than Gmail in my experience.
I never used notmuch, but I was much inspired by it when I saw a talk about it at linux.conf.au a few years back. Later I ended up stealing several ideas from notmuch about how to interface with Xapian, like the one-letter prefixes.
Xapian supports a query syntax which uses a trailing '*', but it's not a true wildcard/regexp, it's a range search on an ordered index. Notmuch allows you to use that syntax, see http://notmuchmail.org/searching/
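To illustrate what "a range search on an ordered index" means, here is a minimal Python sketch. It is not Xapian's code or API: the sorted list stands in for the ordered term index, and `prefix_matches` is a made-up name. The idea is that one binary search finds where the prefix would sit, and every matching term follows contiguously from there.

```python
import bisect

def prefix_matches(sorted_terms, prefix):
    """Return every term that starts with `prefix`.

    `sorted_terms` is a hypothetical stand-in for an ordered term index.
    All terms sharing a prefix are adjacent in sorted order, so we seek
    once and then scan forward until the prefix stops matching.
    """
    i = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches

terms = sorted(["ma", "mail", "maildir", "main", "notmuch", "xapian"])
print(prefix_matches(terms, "mail"))
```

This is also why only a trailing '*' is cheap: a leading or embedded wildcard has no single contiguous range in the sorted order to seek to.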
It can do simple wildcarding with the * character, but I'm not aware of any general regexp capability, same as with any indexed database. Since you still have the underlying maildirs you can always just use egrep I guess.
An obvious question that I didn't hit on in the blog is "what about host crashes"? The nice thing is, every index knows exactly which messages it covers - and we can quite quickly (within an hour or so for an entire server) scan all mailboxes and index the missing messages - it's more efficient than doing it in real time, because you are often indexing multiple messages in the same mailbox.
Once the indexes are up to date, we can switch back to being masters again. We index on all the replicas independently so that they are always ready.
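For anyone wondering what that catch-up scan amounts to, here is a rough Python sketch. It is not Cyrus code: the dict-of-UIDs layout and the name `missing_by_mailbox` are assumptions made up for illustration. The point is that the work is a set difference per mailbox, and the result comes out grouped by mailbox so each one can be indexed in a single batch.

```python
def missing_by_mailbox(mailboxes, indexed):
    """Work out which messages still need indexing, grouped per mailbox.

    mailboxes: dict of mailbox name -> iterable of message UIDs on disk
    indexed:   dict of mailbox name -> set of UIDs the index already covers
    Both shapes are hypothetical; a real server stores this differently.
    """
    todo = {}
    for name, uids in mailboxes.items():
        gap = sorted(set(uids) - indexed.get(name, set()))
        if gap:
            todo[name] = gap  # one batch per mailbox
    return todo

on_disk = {"INBOX": [1, 2, 3, 4], "Archive": [10, 11], "Sent": [7]}
covered = {"INBOX": {1, 2}, "Archive": {10, 11}}
print(missing_by_mailbox(on_disk, covered))
```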
In the past I've implemented Xapian [1] rather than Sphinx [2], though Sphinx today seems to be much better. Both Xapian and Sphinx are under-appreciated compared to Solr [3] and Elasticsearch [4].
All of those last three are awesome if you either put all the user's search indexes into a single engine, or have a shit-ton of memory.
With sphinx, we found we had to start and stop daemons all over the place to manage memory, and it was just unworkable. It was either that or run one big index per machine, but there are operational reasons I'd rather not be doing that. We try to keep everything user-sized.
That said, there's still stub Sphinx code in there. Both engines have GPL licensing on them, which means compiling against Cyrus (BSD licensed) causes a non-BSD licensed end result. Not an issue for us, since we publish all our Cyrus code anyway.
There is talk of building an Elasticsearch backend into Cyrus as well - feel free, it's all open source. We'd definitely take the patch if it's good code (he says with his Cyrus Project Board Member hat on rather than his FastMail Director hat on).
Sphinx also had a bunch of really bad bugs around server startup and shutdown, and some ugly code. I ended up rewriting some of their pthreads code and submitting patches. I have no idea if they ever used them or not, because their development tree is internal and not visible outside their company.
Solr (well, Lucene) has awesome natural language stemming abilities, many more languages supported than Xapian. In particular it's much smarter than Xapian about Chinese. But a) the memory requirement makes running multiple shards on the same machine hard, and b) nobody in the company wanted to learn how to handle Java operationally.
LSM-type storage is frequently used in IR; it's not a new technique. I accidentally "invented" it a few years ago before I realized it had a name. Even Lucene 1.x used a variation of this for its "segment" files (which it still does, at least up to 3.x, afaik).
The reason is that you want to keep the inverted indexes sorted on disk, but you don't want to sort the entire index every time you update. So you create one mini-index per update and merge them lazily when you get too many of them.
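A toy version of that idea in Python, for the curious. This is only a sketch under made-up assumptions (in-memory lists instead of files, a `SegmentedIndex` class and `max_segments` threshold invented for the example), not how Lucene or Xapian actually store things. Each update becomes its own small sorted segment, searches consult every segment, and segments get merged once there are too many.

```python
import bisect
import heapq

class SegmentedIndex:
    """Toy inverted index: one sorted mini-index per update, merged lazily."""

    def __init__(self, max_segments=4):
        self.segments = []              # each segment: sorted list of (term, doc_id)
        self.max_segments = max_segments

    def add_batch(self, postings):
        # Sort only the new batch; existing segments are left untouched.
        self.segments.append(sorted(set(postings)))
        if len(self.segments) > self.max_segments:
            self.compact()

    def compact(self):
        # Inputs are already sorted, so this is a streaming k-way merge.
        merged = []
        for posting in heapq.merge(*self.segments):
            if not merged or merged[-1] != posting:
                merged.append(posting)
        self.segments = [merged]

    def search(self, term):
        hits = set()
        for seg in self.segments:
            # (term,) sorts before every (term, doc_id) pair.
            i = bisect.bisect_left(seg, (term,))
            while i < len(seg) and seg[i][0] == term:
                hits.add(seg[i][1])
                i += 1
        return sorted(hits)

idx = SegmentedIndex(max_segments=2)
idx.add_batch([("xapian", 1), ("mail", 1)])
idx.add_batch([("xapian", 2)])
print(idx.search("xapian"), len(idx.segments))
```

The trade-off is visible in `search`: every extra segment is one more place to look, which is why you merge them eventually rather than never.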
During the compaction phase, when a new temp db is installed while the old is being compacted, isn't there a window where messages in the old temp db are not searchable?
From memory (sorry, it was a while back) we actually started with Elasticsearch, but it was way too heavy, so we looked for an alternative solution. Even with a single user it was consuming way too much memory.
Fastmail is a great service, however iCloud completely fails with their web client. I have >50k emails, probably closer to 100k and searching simply times out. Never has worked for me. It's sad, really...
Sorry, I don't understand what you mean. iCloud is a service, and FastMail's web client doesn't talk to iCloud, it only talks to FastMail's servers - at least for now.
I use it from the emacs notmuch mode.
[1] http://notmuchmail.org/