The Dynamo incident highlights an important lesson when using consistently hashed distributed data stores: make sure the actual distribution of hash keys mirrors the expected distribution (though, to their credit, someone writing an automated test against a hard-coded key was beyond their control).
Incidents like this are generally why rate limits exist, which they don't currently have [0], but perhaps they'll consider putting a burst limiter in place that dissuades automated tests without penalizing organic human load spikes.
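A per-key token bucket is the usual shape of that kind of limiter. A minimal sketch, with completely made-up rate and burst numbers:

    import time
    from collections import defaultdict

    # Illustrative numbers only; a real limiter would tune these per customer/tier.
    RATE_PER_SEC = 50   # tokens refilled per second, per key
    BURST = 500         # bucket size; absorbs short organic spikes

    buckets = defaultdict(lambda: {"tokens": float(BURST), "ts": time.monotonic()})

    def allow(key):
        # Token bucket: a test hammering one key drains the bucket and gets
        # rejected, while a brief human-driven spike fits inside the burst.
        b = buckets[key]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE_PER_SEC)
        b["ts"] = now
        if b["tokens"] >= 1:
            b["tokens"] -= 1
            return True   # accept the write
        return False      # the API would answer 429 Too Many Requests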
Unfortunately there doesn't seem to be an easy way to fix the per-user-ID write bottleneck, short of adding a rate limit to the API – which would just push backpressure from Dynamo onto the Segment API consumer. Round-robin partitioning of values would fix the write bottleneck, but carries heavy read costs because you have to query every partition and merge the results. They undoubtedly performed this analysis and found it didn't fit their desired tradeoffs :)
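For concreteness, a rough sketch of that tradeoff as DynamoDB write sharding – the table name, key schema, and shard count below are just assumptions for illustration, not anything Segment actually runs:

    import itertools
    import boto3
    from boto3.dynamodb.conditions import Key

    # Hypothetical table: partition key "pk", sort key "ts"; 10 shards per user.
    table = boto3.resource("dynamodb").Table("events")
    N_SHARDS = 10
    _counter = itertools.count()

    def put_event(user_id, ts, payload):
        # Round-robin the key suffix so no single partition absorbs the burst.
        shard = next(_counter) % N_SHARDS
        table.put_item(Item={"pk": f"{user_id}#{shard}", "ts": ts, "payload": payload})

    def get_events(user_id):
        # The read cost: every shard has to be queried and the results merged.
        items = []
        for shard in range(N_SHARDS):
            resp = table.query(KeyConditionExpression=Key("pk").eq(f"{user_id}#{shard}"))
            items.extend(resp["Items"])
        return sorted(items, key=lambda i: i["ts"])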
Great post, very informative. Thanks for sharing! Also, love the slight irony of loading AWS log data into an AWS product (Redshift) to find cost centers.
We currently have an internal project underway to detect hot keys in our pipeline and collapse back-to-back writes before they're written out to DynamoDB.
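To illustrate the idea only (this isn't our pipeline code, just the gist): the collapsing step amounts to a last-write-wins merge over a buffered batch before it goes out to DynamoDB.

    def collapse(batch):
        # batch: ordered list of (key, item) pairs pulled off the pipeline.
        # Later writes to the same key win, so N back-to-back updates to a
        # hot key become a single put against DynamoDB.
        latest = {}
        for key, item in batch:
            latest[key] = item
        return list(latest.items())

    # 1,000 rapid-fire writes to one key collapse to a single write.
    assert len(collapse([("user-123", {"n": i}) for i in range(1000)])) == 1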
It's difficult to apply throttling for these conditions synchronously within the ingestion API (i.e. returning a 429 based on too many writes to one key) because of the flexibility of the product: that workload is perfectly acceptable for some downstream partners. It also gives me pause from a reliability perspective. We try to keep our ingestion endpoints as simple as possible to avoid outages, which for our product mean data loss.
Ah, gotcha. Yeah, it makes sense to avoid synchronously turning away data, as that does defeat the point of the product. And the cost of a false positive is high: the moment a client is receiving lots of data is exactly when it's most important for them to store it.
If you don't mind answering: for your warehouse offering, do you pull data from some services (e.g. Zendesk, SFDC), have them push it to you (which is what I interpreted your "downstream partners" comment to mean – though perhaps those are "upstream partners"), or a mix of both?
By downstream partners I mean data flows from user -> Segment -> partner. Event data is ingested through our API & SDKs and fed to partners' APIs. For Cloud Sources, data is generally pulled from partners using their public APIs at an interval and pushed to customer warehouses in batches. In a few special cases partners push data to our APIs.
The basic approach is to have fewer but more powerful machines. It helps to handle the targeted bursts.
The advanced approach is to have a layer of queuing in front of ingestion, where you can do magic with distribution rules, rate limiting, and dropping peak traffic.
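A minimal sketch of the dropping part: a bounded queue in front of the writers, with an arbitrary size and a stubbed-out downstream write standing in for the real one.

    import queue
    import threading

    ingest_q = queue.Queue(maxsize=10_000)   # arbitrary bound = the backpressure point

    def write_to_store(event):
        pass  # placeholder for the real database/warehouse write

    def accept(event):
        # Called by the ingestion endpoint; never blocks the caller.
        try:
            ingest_q.put_nowait(event)
            return True
        except queue.Full:
            return False  # peak traffic beyond the bound is shed (or dead-lettered)

    def writer():
        # Drains at whatever rate the database can actually sustain.
        while True:
            write_to_store(ingest_q.get())
            ingest_q.task_done()

    threading.Thread(target=writer, daemon=True).start()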
Yep, you're absolutely right. In multitenant SaaS apps with extremely uneven distribution of traffic, it's pretty common for large customers to get their own dedicated DB servers.
> The advanced approach is to have a layer of queuing in front of ingestion, where you can do magic with distribution rules, rate limiting, and dropping peak traffic.
I didn't mean having special databases; that is another level of optimization.
I mean having bigger servers for everything. For instance, a farm of 4x 10-core servers is likely to process data more consistently than 10x 4-core servers.
[0]: https://segment.com/docs/sources/server/http/#rate-limits