The Dynamo incident highlights an important lesson when using consistently hashed distributed data stores: make sure the actual distribution of hash keys mirrors the expected distribution (though, to their credit, someone writing an automated test against a hard-coded key was beyond their control).
Incidents like this are generally why rate limits exist, which they don't currently have [0], but perhaps they'll consider putting a burst limiter in place that dissuades automated tests without penalizing organic human load spikes.
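A per-key token bucket is the usual shape of that kind of limiter. A minimal sketch, with completely made-up rate and burst numbers:

    import time
    from collections import defaultdict

    # Illustrative numbers only; a real limiter would tune these per customer/tier.
    RATE_PER_SEC = 50   # tokens refilled per second, per key
    BURST = 500         # bucket size; absorbs short organic spikes

    buckets = defaultdict(lambda: {"tokens": float(BURST), "ts": time.monotonic()})

    def allow(key):
        # Token bucket: a test hammering one key drains the bucket and gets
        # rejected, while a brief human-driven spike fits inside the burst.
        b = buckets[key]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE_PER_SEC)
        b["ts"] = now
        if b["tokens"] >= 1:
            b["tokens"] -= 1
            return True   # accept the write
        return False      # the API would answer 429 Too Many Requests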
Unfortunately there doesn't seem to be an easy way to fix the per-user-ID write bottleneck, short of adding a rate limit to the API – which would just push backpressure from Dynamo onto the Segment API consumer. Round-robin partitioning of values would fix the write bottleneck, but carries heavy read costs because you have to query every partition and merge the results. They undoubtedly performed this analysis and found it didn't fit their desired tradeoffs :)
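For concreteness, a rough sketch of that tradeoff as DynamoDB write sharding – the table name, key schema, and shard count below are just assumptions for illustration, not anything Segment actually runs:

    import itertools
    import boto3
    from boto3.dynamodb.conditions import Key

    # Hypothetical table: partition key "pk", sort key "ts"; 10 shards per user.
    table = boto3.resource("dynamodb").Table("events")
    N_SHARDS = 10
    _counter = itertools.count()

    def put_event(user_id, ts, payload):
        # Round-robin the key suffix so no single partition absorbs the burst.
        shard = next(_counter) % N_SHARDS
        table.put_item(Item={"pk": f"{user_id}#{shard}", "ts": ts, "payload": payload})

    def get_events(user_id):
        # The read cost: every shard has to be queried and the results merged.
        items = []
        for shard in range(N_SHARDS):
            resp = table.query(KeyConditionExpression=Key("pk").eq(f"{user_id}#{shard}"))
            items.extend(resp["Items"])
        return sorted(items, key=lambda i: i["ts"])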
Great post, very informative. Thanks for sharing! Also, love the slight irony of loading AWS log data into an AWS product (Redshift) to find cost centers.
We currently have an internal project underway to detect hot keys in our pipeline and collapse back-to-back writes before they're written out to DynamoDB.
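To illustrate the idea only (this isn't our pipeline code, just the gist): the collapsing step amounts to a last-write-wins merge over a buffered batch before it goes out to DynamoDB.

    def collapse(batch):
        # batch: ordered list of (key, item) pairs pulled off the pipeline.
        # Later writes to the same key win, so N back-to-back updates to a
        # hot key become a single put against DynamoDB.
        latest = {}
        for key, item in batch:
            latest[key] = item
        return list(latest.items())

    # 1,000 rapid-fire writes to one key collapse to a single write.
    assert len(collapse([("user-123", {"n": i}) for i in range(1000)])) == 1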
It's difficult to apply throttling for these conditions synchronously within the ingestion API (i.e. returning a 429 based on too many writes to one key) because of the flexibility of the product: that workload is perfectly acceptable for some downstream partners. It also gives me pause from a reliability perspective. We try to keep our ingestion endpoints as simple as possible to avoid outages, which for our product mean data loss.
Ah, gotcha. Yeah, it makes sense to avoid synchronously turning away data, as that does defeat the point of the product. And the cost of a false positive is high: the moment a client is receiving lots of data is exactly when it's most important for them to store it.
If you don't mind answering: for your warehouse offering, do you pull data from some services (e.g. Zendesk, SFDC), have them push it to you (which is what I interpreted your "downstream partners" comment to mean – though perhaps those are "upstream partners"), or a mix of both?
By downstream partners I mean data flows from user -> Segment -> partner. Event data is ingested through our API & SDKs and fed to partners' APIs. For Cloud Sources, data is generally pulled from partners using their public APIs at an interval and pushed to customer warehouses in batches. In a few special cases partners push data to our APIs.
The basic approach is to have fewer but more powerful machines. It helps to handle the targeted bursts.
The advanced approach is to have a layer of queuing in front of ingestion, where you can do magic with distribution rules, rate limiting, and dropping peak traffic.
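A minimal sketch of the dropping part: a bounded queue in front of the writers, with an arbitrary size and a stubbed-out downstream write standing in for the real one.

    import queue
    import threading

    ingest_q = queue.Queue(maxsize=10_000)   # arbitrary bound = the backpressure point

    def write_to_store(event):
        pass  # placeholder for the real database/warehouse write

    def accept(event):
        # Called by the ingestion endpoint; never blocks the caller.
        try:
            ingest_q.put_nowait(event)
            return True
        except queue.Full:
            return False  # peak traffic beyond the bound is shed (or dead-lettered)

    def writer():
        # Drains at whatever rate the database can actually sustain.
        while True:
            write_to_store(ingest_q.get())
            ingest_q.task_done()

    threading.Thread(target=writer, daemon=True).start()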
Yep, you're absolutely right. In multitenant SaaS apps with extremely uneven distribution of traffic, it's pretty common for large customers to get their own dedicated DB servers.
> The advanced approach is to have a layer of queuing in front of ingestion, where you can do magic with distribution rules, rate limiting, and dropping peak traffic.
I didn't mean having special databases; that is another level of optimization.
I mean having bigger servers for everything. For instance, a farm of 4x 10-core servers is likely to process data more consistently than 10x 4-core servers.
[0]: https://segment.com/docs/sources/server/http/#rate-limits