Optimal Shard Placement in a Petabyte Scale Elasticsearch Cluster (meltwater.com)
143 points by andrelaszlo on Nov 9, 2018 | 30 comments



Interesting overview.

I understand from elsewhere on their site that they are not using a recent version of ES (2.x?). This has a big impact on how things behave, as later versions improved performance and garbage collection behavior by e.g. doing more things off heap with memory-mapped files, fixing tons of issues with clustering, refactoring the on-disk storage in Lucene 5 and 6, and giving users more control over sharding and index creation. So it's likely that this setup is working around issues that have probably been addressed in later versions.

For example, 6.x has a shard allocation API that allows you to control which nodes host which shards. That sounds like it would be useful in this setup.
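For reference, explicit shard moves along these lines have been possible for a while via the cluster reroute API. A minimal sketch, assuming the plain REST API and the Python requests library, with made-up index and node names:

    # Hypothetical example: explicitly move one shard using the cluster reroute API.
    # The index name, shard number and node names are invented for illustration.
    import requests

    ES = "http://localhost:9200"

    resp = requests.post(ES + "/_cluster/reroute", json={
        "commands": [{
            "move": {
                "index": "docs-2018-11",
                "shard": 0,
                "from_node": "data-node-1",
                "to_node": "data-node-2"
            }
        }]
    })
    resp.raise_for_status()
    print(resp.json().get("acknowledged"))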

They also seem to be mixing query and indexing traffic on the same nodes. I would suggest separating these in a cluster this big and using dedicated query and indexing nodes (as well as dedicated master nodes). That way write node performance would be a lot more predictable and query traffic would no longer be able to disrupt writes. You could also use instances with more CPU/memory for querying, or use elastic scaling groups to add query capacity when needed. You might even get away with using cheaper T instances with CPU credits for this.


Hi. We are actually using an even older version, 1.7.6, for our cluster. This is of course not an ideal set-up and we will upgrade as soon as we can. That is not an easy task in a cluster of this size and given our uptime requirements. We know that general search and indexing performance, shard allocation and general cluster state management have all improved a lot in later versions of Elasticsearch, but we still anticipate that we will need Shardonnay running even after we upgrade.

The shard allocation API would also be a very helpful addition to this solution. Currently we can sometimes get into situations where Elasticsearch actively works against Shardonnay and makes very 'stupid' moves out of our control that need to be 'fixed' by Shardonnay in later iterations. We have tried to disable Elasticsearch's own allocation algorithm as much as we can, but we still want it active as a safety net in case something unexpected happens with the hardware, for example.

We do have dedicated master nodes. We also have query and indexing client nodes that coordinate and merge search responses, handle routing, etc. But behind that we utilize all data nodes for both query and indexing load. That has worked well so far due to the fact that indexing load is fairly constant. And we scale query load on recent indexes by having a large number of replicas for those.

Our largest problem has always been the vast difference in query complexity that we get. But we are slowly getting on top of that problem now as well. Some details on how that is done can be found in the following blog post:

https://underthehood.meltwater.com/blog/2018/09/28/using-mac...


I fully sympathize; I also still have a 1.7.x cluster that I have wanted to get rid of for years now. Several rounds of compatibility-breaking changes have made this hard for us. Impressive that you can make this work at this scale.

The thing is, at the scale you are running, that probably means you are operating at a much higher cost than technically needed with a recent 6.x setup. You would probably see an order-of-magnitude improvement in memory usage.

Recent versions of ES are much better at preventing complex queries from taking down nodes. They've done a lot of good work with circuit breakers preventing queries from getting out of hand. They are also a lot smarter at e.g. doing caching for filtered queries.
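The circuit breaker limits are just dynamic cluster settings, for what it's worth. A rough sketch of tightening them from Python via the REST API; the percentages are arbitrary illustrations, not recommendations:

    # Hypothetical example: tighten the request and fielddata circuit breakers.
    # The limit values below are arbitrary, purely to show the mechanism.
    import requests

    requests.put("http://localhost:9200/_cluster/settings", json={
        "transient": {
            "indices.breaker.request.limit": "40%",
            "indices.breaker.fielddata.limit": "30%"
        }
    }).raise_for_status()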

My guess is you should not need Shardonnay on a recent setup. Most of what it does should be supported natively these days. ES has been working with many companies on setups comparable to yours and they've learned a lot in the last few years how to do this properly.

Another feature that could be of interest to you is cross cluster search introduced in 6.x (replacing the tribe functionality they had in v2 and v5). This would allow you to isolate old data to a separate cluster optimized for reads. Probably whenever you hit those old indices, your query complexity goes through the roof because it needs to open more shards.


We at Meltwater have some custom plugins for Elasticsearch that make a number of modifications to how queries are executed, and completely replace some low-level Elasticsearch query implementations. We're also running a custom ES 1.7 version with some features backported from version 2+. The end result is something like 5-10x lower GC pressure and radically increased performance and query throughput for our particular workload. Without these changes we would not be able to sustain our workload without a massive amount of additional hardware and cost, just like you say.

Our flavor of Elasticsearch 1.7 is faster than vanilla 2.* for our workload, though still slower than ES 2.* with our customizations applied.

Recent Elasticsearch versions still use the same basic shard allocation algorithms as far as we know. Our workload is very imbalanced towards recent data, but it's not a binary hot-cold matter, rather a more exponential decay in workload for older indexes. We fully expect to need Shardonnay to balance the workload even with ES 7+.

We're also in early conversations with Elastic about shard placement optimization. They seem to be interested in applying linear optimization in a similar way, with a goal of solving the fundamental problems with shard allocation based on observed workload.
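To make the linear-optimization idea a bit more concrete, here is a toy sketch of shard placement as an integer program that minimizes the load on the busiest node. It uses the PuLP library and invented workload numbers, and is nothing like the real Shardonnay model, just the general shape of the idea:

    # Toy sketch only: assign shards to nodes so the most-loaded node carries as
    # little as possible. Shard workloads and node names are invented.
    from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

    shard_load = {"idx-a-s0": 40, "idx-a-s1": 35, "idx-b-s0": 20, "idx-c-s0": 5}
    nodes = ["node-1", "node-2"]

    prob = LpProblem("shard_placement", LpMinimize)
    # place[s][n] == 1 means shard s goes on node n
    place = LpVariable.dicts("place", (list(shard_load), nodes), cat=LpBinary)
    peak = LpVariable("peak_load", lowBound=0)
    prob += peak  # objective: minimize the busiest node's load

    for s in shard_load:
        prob += lpSum(place[s][n] for n in nodes) == 1  # each shard on exactly one node
    for n in nodes:
        prob += lpSum(shard_load[s] * place[s][n] for s in shard_load) <= peak

    prob.solve()
    for s in shard_load:
        for n in nodes:
            if place[s][n].value() == 1:
                print(s, "->", n)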


Honestly, I bet 6.x doesn't work well at PB scale. There may be different problems with different solutions, but I bet it would still need custom shard balancing, etc. I work at TB scale and 6.x largely works well with the exception of reindexing. Reindexing without downtime is still tricky.

The reason I think there are likely still problems at PB scale is the attitude of the ES core developers. They collectively act as if reindexing is no big deal, and their proposed solution to many things I consider bugs is "just reindex". Reindexing is the last thing I want to do given how hard it is to do at TB scale with zero downtime. I don't think the core developers have experience with large clusters themselves, so I find it unlikely that just upgrading to 6.x would solve all the problems at PB scale.


I know people running ES at PB scale. You definitely need to know what you are doing of course but it is entirely possible. There are definitely companies doing this. Any operations at this scale need planning. So, I don't think you are being entirely fair here.

I'm not saying upgrading will solve all your problems, but I sure know of a lot of problems I've had with 1.7 that are much less of a problem or not a problem at all with 6.x.

Reindexing is indeed time consuming. However, if you engineer your cluster properly, you should be able to do it without downtime. For example, I've used index aliases in the past to manage this. Reindex in the background, swap over to the new index atomically when ready. Maybe don't reindex all your indices at once. Also they have a new reindex API in es 6 to support this. At PB scale of course this is going to take time. I've also considered doing re-indexing on a separate cluster and using e.g. the snapshot API to restore the end result.
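As a rough sketch of that alias pattern, with made-up index and alias names, using the _reindex and _aliases endpoints from Python:

    # Hypothetical example of zero-downtime reindexing behind an alias.
    # "articles" is the alias clients query; the v1/v2 index names are invented.
    import requests

    ES = "http://localhost:9200"

    # 1. Copy documents into the new index in the background (returns a task to poll).
    requests.post(ES + "/_reindex?wait_for_completion=false", json={
        "source": {"index": "articles-v1"},
        "dest": {"index": "articles-v2"}
    }).raise_for_status()

    # 2. Once the new index has caught up, repoint the alias atomically.
    requests.post(ES + "/_aliases", json={
        "actions": [
            {"remove": {"index": "articles-v1", "alias": "articles"}},
            {"add": {"index": "articles-v2", "alias": "articles"}}
        ]
    }).raise_for_status()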


In our case at Meltwater we also have new documents as well as updates to old documents coming in, meaning older indexes are actively being modified with low latency requirements on visibility and consistency of those modifications. This makes it trickier both to reindex and to do a seamless, no-downtime-whatsoever upgrade to a new major version. It's not infeasible at all though, and we're working on it. Reindexing a PB of data should be possible in a matter of weeks based on our estimates, but we shall see!


We’re still running 1.6 and 0.90 for a couple production workloads. Thankfully the 0.90 cluster requires no maintenance, because pre-1.0 things are pretty different.


This article represents an ongoing trend that we've seen over the last 3 years: a resurgence of optimization techniques for directly modeling tasks which historically were considered "not amenable to optimization techniques." We saw this last year as well with the landmark "The Case for Learned Index Structures" [0] and in prior years with Coda Hale's "usl4j And You" [1], which can easily be woven in realtime into modern container control planes.

We've also seen research in using search and strategy based optimization techniques [2] (such as Microsoft's Z3 SMT prover[3]) to validate security models. Some folks are even using Clojure's core.logic at request time to validate requests against JWTs for query construction!

[0]: https://arxiv.org/abs/1712.01208

[1]: https://codahale.com/usl4j-and-you/

[2]: http://www0.cs.ucl.ac.uk/staff/b.cook/FMCAD18.pdf (and by the way, this made it to prod last year, so an SMT solver is validating all your S3 security policies for public access!)

[3]: https://github.com/Z3Prover/z3


Props to the person coming up with the Shardonnay name. Tiny bits of fun in a pretty serious operation are awesome.


Thanks! We have a fair bit of punny humor in our Gothenburg, Sweden engineering office where the information retrieval and big data magic happens. One of our plugins for Elasticsearch that significantly reduces GC pressure and improves performance is called the "Greased Pig" plugin, because it makes Elasticsearch all slippery so it doesn't get stuck in GC hell ;)


This is pretty cool tech.

I'd like to see this contrasted with the hot/warm/cold architecture, which is the common approach to this problem.

Usually people deal with this problem by allocating faster, less storage-dense hardware for recent data and slower but more storage-dense hardware for older data. You can read more about this here: https://www.elastic.co/blog/hot-warm-architecture-in-elastic...
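Under the hood it is mostly node attributes plus allocation filtering. A minimal sketch, assuming the data nodes were started with a box_type node attribute and using Python against the REST API; the index name and attribute values are made up:

    # Hypothetical sketch of hot/warm allocation filtering. Assumes nodes carry a
    # node attribute such as box_type: hot|warm in their config.
    import requests

    ES = "http://localhost:9200"

    # A fresh (hot) index is pinned to the hot nodes.
    requests.put(ES + "/logs-2018-11-09/_settings", json={
        "index.routing.allocation.require.box_type": "hot"
    }).raise_for_status()

    # Later it is demoted to the warm tier; ES migrates the shards automatically.
    requests.put(ES + "/logs-2018-11-09/_settings", json={
        "index.routing.allocation.require.box_type": "warm"
    }).raise_for_status()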

Maybe something about it didn't work for them, but it's not clear to me from the article.

Disclaimer: I work for Elastic.


A hot/cold tier architecture might work for setups that have a predictable and sharp cutoff from hot to cold, e.g. a logging setup where the current day's index gets all the indexing and queries, and there's only a low-concurrency occasional query for cold data.

Our workload is rather different from this. Indexing and document updates occur in a somewhat exponential decay pattern back in time, and the same goes for queries. So there are fewer sharp cutoffs.

If we ran a hot/cold architecture we'd run into a few issues:

Within each tier we'd get an imbalanced workload over the nodes, since within the tier the workload still varies greatly with the age of the indexes.

We use AWS i3 NVMe SSD instances. d2 instances with HDDs, or EBS, have IO latency/throughput/IOPS that fall short even for our "cold" data workload. So a cold tier would scale based on storage needs, but in this tier we'd be wasting lots of compute capacity. And a hot tier would scale based on compute needs, and waste tons of storage capacity.

By running both hot and cold workloads on the same set of nodes we get much more cost-effective utilization, since the hot workload uses most of the aggregate compute capacity and the cold data uses most of the storage.

But this then necessitates using Shardonnay to ensure we spread the workload optimally across the clusters. And the more evenly we can spread it, the higher total utilization we can put on the clusters without single nodes overloading.

A hot/cold architecture would be much more costly for our workload, since we'd have unused storage on the hot tier and unused compute capacity on the cold tier. A single tier just makes much more sense for our particular use case.


We tried the hot/cold architecture as well, and used it before in our data centre architecture.

But currently we have concluded that it is not worth the added complexity. That might change again if/when we learn more or get new requirements.

We do change the number of replicas for recent data vs old data. We actually have a 4 tiered approach to how many replicas we use.


Thanks for the article! We have a tiny hosted ELK cluster for our app logging, but we don't have much expertise in good cluster design. We recently implemented monthly rolling indices, and currently have a single 'archive' index for the old data (from the past two years) sharded into 25GB chunks. Would it make sense from a query performance perspective to add extra replicas to the 'archive' shards, to help spread the load between the nodes? Thanks for any thoughts!
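For reference, the change I'm asking about is just a settings update on the archive index; a minimal sketch with made-up names:

    # Hypothetical sketch: bump the replica count on an existing archive index.
    # The index name and replica count are invented.
    import requests

    requests.put("http://localhost:9200/archive-2016/_settings", json={
        "index": {"number_of_replicas": 2}
    }).raise_for_status()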


Hot/warm architecture is nice, especially for logging setups. But it requires quite a bit of supporting logic (usually scripts) which is not available in ES itself.


Since I see meltwater folks in this discussion, can you say a bit about the distribution of your queries? In the other blog post about routing [1], you mention that most are a few ms but others are minutes and more.

What’s the rough breakdown? Are most users performing queries “scoped” to a fairly small subset?

[1] https://underthehood.meltwater.com/blog/2018/09/28/using-mac...


Our query complexity and query response times follow a somewhat exponential pattern, with a lot of quick queries and a long tail of more monstrous queries.

We do have a large number of very complex queries coming in. What we deem to be towards the "simple" end could easily be hundreds of terms, wildcard, near and phrase operators. "Difficult" queries are things with hundreds of thousands of terms or many wildcards within nears/phrases that expand to millions of term combinations.
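As a toy illustration, a single "near a wildcard" fragment of the kind these queries are built from looks roughly like the sketch below in Elasticsearch's query DSL; the field name and terms are invented, and a real query combines thousands of such clauses:

    # Toy fragment: a proximity ("near") clause between a term and a wildcard,
    # expressed as an Elasticsearch span query. Field name and terms are invented.
    near_clause = {
        "span_near": {
            "clauses": [
                {"span_term": {"body": "merger"}},
                {"span_multi": {"match": {"wildcard": {"body": "acquisit*"}}}}
            ],
            "slop": 5,
            "in_order": False
        }
    }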

For Meltwater's customers it's usually important to get both very high recall and high precision in the dataset described by a query. It's much less about getting a ranked result list and finding a hit (e.g. like what Google does), and much more about running analytics/dashboards/reports/trends over the dataset delineated by the query, and exactness in the analytics matters a lot to Meltwater customers.

This all makes for complicated queries, to get both high precision and recall for whatever a customer is interested in analyzing. Our sales and support organizations help customers write good queries, and we also use AI systems to generate queries.


Interesting take on the shard optimization.

I run a setup which is nowhere near as big as this one, and is mostly logging. I have seen the hotspots as described. Our solution is also to run beefy instances (i3 helps).

One thing that is sorely needed is a better solution than Curator. Curator is cool if you are only doing, say, daily shard rotation.

But it is dumb and can't make any decisions. If you want to do something more complicated than that, even a simple hot/warm architecture, then things get messy quickly.

Want to add a 'shrink' process and using hot/warm? Ok, now Curator will easily do dumb things like assign warm shards to a shrink node that also happens to be tagged as 'hot'. And nothing works.

Want to change the number of shards new indexes will have based on historical data (so they will fall under 50GB each)? Curator can't do that (see the sketch at the end of this comment). Or maybe even change the number of replicas based on the number of nodes (because you have some auto-scaling going on) – I know you can do auto_expand_replicas 0-N, but N changes. Curator also doesn't do that.

Have a big cluster with hundreds of indices that need deletion every day? Have fun with the bulk deletion – if it fails, it won't retry. And now your cluster falls over because disk watermarks are tripping everywhere.

You also need a single place to run it from. Since in many clusters it is critical, you may want to replicate it. But then the Curators will step on one another, and even if they don't, you need to make sure they are all in sync.

So custom code it is. Frankly, the basic functions performed by Curator should be available in ES itself – and actions pulled from an index stored in ES.

So now we have some ad-hoc scripts. If I get some time, I'll package them into some utility.
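For what it's worth, the shard-count-from-history case from a couple of paragraphs up boils down to something like the following sketch. The index naming, the 50GB target and the template name are all invented, the template syntax assumes ES 6.x, and a real script needs error handling and retries:

    # Rough sketch of "pick tomorrow's shard count from today's observed size".
    # Index name pattern, size target and template name are invented.
    import math
    import requests

    ES = "http://localhost:9200"
    TARGET_BYTES = 50 * 1024**3  # keep each shard under ~50GB

    # Estimate tomorrow's index size from today's primary store size.
    stats = requests.get(ES + "/_cat/indices/logs-2018-11-09",
                         params={"format": "json", "bytes": "b"}).json()[0]
    primary_bytes = int(stats["pri.store.size"])
    shard_count = max(1, math.ceil(primary_bytes / TARGET_BYTES))

    # Write the shard count into the template that tomorrow's index will pick up.
    requests.put(ES + "/_template/logs", json={
        "index_patterns": ["logs-*"],
        "settings": {"index": {"number_of_shards": shard_count}}
    }).raise_for_status()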


At least for your ‘run it once in the entire cluster’ I can offer a simple solution: curator has a ‘master_only’ flag that makes it execute only on the elected master. You can run it on all nodes, but on all other nodes, the run will end up a noop: https://www.elastic.co/guide/en/elasticsearch/client/curator...


Really good overview. We have a similar problem with our 2.1 cluster as well; we originally set up routing by customer, but some customers can run some REALLY hot tasks that leave us with shards that are much larger than the others.

One thing I'm curious about from the article, what did you use to visualize the heatmaps?


We use datadog for monitoring our cluster. Shardonnay also produces lots of custom metrics that you don't get from the vanilla elasticsearch datadog integration.


Nice, we do too, interested in getting better monitoring around shard allocation.


OT: Their pricing page appears to be broken in Firefox: https://www.meltwater.com/request-pricing/


As much as I like elasticsearch, it's been less fun to work with after Kibana v3 (which, I think, only works with ES v1?). I can only imagine people needing really large elasticsearch clusters nowadays don't really use Kibana for the majority of queries against it.


Craziest thing to me about this whole article is that this one company has 40 billion social media posts!


I personally have 25B just for fun / side projects / analysis. The surprising thing for me is that for them it is petabyte scale (25KB+ each!) while for me they take up 2TB (80 bytes each, compressed). They must have a tremendous amount of metadata, or Elasticsearch is super inefficient.


Our whole dataset exported as compressed JSON is likely not much more than 100TB. The petabytes come from all the index datastructures Elasticsearch/Lucene builds, as well as the high replication factor needed to keep up with the query throughput.

We index a lot of NLP and other enrichments on our documents. This also adds a lot of storage on top of the base text. And like Karl mentioned, we have lots of small social media documents available for analytics (currently 34B, sometime next year upward of 230B). But also 7+ billion documents of news, blogs and other long-text articles indexed, basically 10 years of all news media from the entire world online and available for analytics.

We're building the https://fairhair.ai/ data science platform to allow other companies to access and run online search and analytics on top of this massive dataset and these compute clusters, for example to embed analytics over this dataset into their own SaaS products.


In my experience, the Elasticsearch index is triple the size of the original data:

    First is the actual json data in quasi plain text.
    Second is the _source field that duplicates the original input object (necessary for reindexing/rebuilding)
    Third is the _all field that duplicates the json data as text (only used for some text search, better disable it).
Finally, the index is duplicated to replicas, at least one if you want any redundancy.

Index compression with lz4 takes 20 or 30% off, new feature of elasticsearch v5.0, on by default.
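For reference, the _all field can be disabled in the mapping, and there is also an optional heavier codec (index.codec: best_compression) on top of the default LZ4. A hypothetical sketch with a made-up index and type name, using the pre-6.x mapping syntax where _all still exists:

    # Hypothetical example: create an index with the _all field disabled and the
    # DEFLATE-based "best_compression" codec instead of the default LZ4.
    import requests

    requests.put("http://localhost:9200/docs-v1", json={
        "settings": {"index": {"codec": "best_compression"}},
        "mappings": {
            "doc": {
                "_all": {"enabled": False}
            }
        }
    }).raise_for_status()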


The social data are fairly small documents, but we index lots of other things as well that are larger in size. Another factor is our large replica count, especially for the hot data.

We also have quite a lot of metadata.



