How Databases handle 10M devices in high-cardinality benchmarks (questdb.io)
63 points by elorant on July 4, 2021 | 14 comments



This article is nonsense. The whole premise is based on a badly designed schema.


Agreed. The firmware version has no business being in the schema.

And to make things worse, the most used time-series database, Prometheus, is missing.

10M is not particularly high. If your company runs Prometheus/node_exporter monitoring 1,000 servers, you should be in that order of magnitude (expect 1,000 to 10,000 metrics per machine).
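Back-of-envelope, in Python, using the numbers above:

  servers = 1_000
  metrics_per_server = 10_000              # upper end of the estimate
  total_series = servers * metrics_per_server
  print(total_series)                      # 10,000,000 active series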


This comes off more as a marketing blog post to me than an objective evaluation. I don’t understand why DynamoDB was not included as a technology for comparison, as it is well suited to these types of scenarios.


I believe they were specifically looking at TSDBs vs document DBs.


I’ll also add that DDB would be unfathomably expensive for that type of workload. With 1,000,000 writes per second it would cost $3.25M/month with on-demand pricing, or $93K/month with provisioned capacity and an annual agreement. That doesn’t include storage or reads. They were running on a 4-thread EC2 VM, so like $60/mo with a nice NIC. TSDBs are really good at logging, storing, and aggregating lots of floats efficiently. DDB shines at storing documents like product pages and returning them in milliseconds, or microseconds with a DAX cache, even with thousands of requests per second.
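For what it's worth, the on-demand figure roughly checks out, assuming DynamoDB's list price of $1.25 per million write request units and 1 KB items:

  # rough check of the on-demand number quoted above
  writes_per_month = 1_000_000 * 86_400 * 30    # ~2.59e12 writes
  cost = writes_per_month / 1_000_000 * 1.25    # ~$3.24M/month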


I believe it's a use case for Redshift if anything, not DynamoDB.

Redshift is column-oriented; if there are suitable types for numbers/time series, it should be able to do the job. It might still be cost-prohibitive though.

I personally find the 1M writes/s figure misleading. It's typically 1,000 sensors each generating 1,000 metrics (a large key=value payload), which should be ingested in batches, as in the sketch below. I guess it's not impressive enough if you call it 1,000 requests per second.
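A minimal sketch of that batching, assuming an InfluxDB-style line protocol endpoint such as the one QuestDB listens on (the host, port, and metric names here are made up):

  import socket, time

  # one device's 1000 metrics flushed as a single batched request
  readings = {f"metric_{i}": float(i) for i in range(1000)}
  ts = time.time_ns()
  payload = "".join(
      f"sensors,device=dev1 {name}={value} {ts}\n"
      for name, value in readings.items()
  )
  with socket.create_connection(("localhost", 9009)) as sock:
      sock.sendall(payload.encode())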


> one possibility was to use something radically different from what we already had, such as including LSM trees or B-trees, commonly used in time series databases. Adding trees would bring the benefit of being able to order data on the fly without inventing a replacement storage model from scratch.

> What bothered us most with this approach is that every subsequent read operation would face a performance penalty versus having data stored in arrays.

Unless I missed something non-obvious from the article, I'm not convinced B-tree storage inevitably faces a read performance penalty compared with storing data in arrays on storage - because B-trees can be made of arrays.

A B-tree can be thought of as a flat set of storage blocks, each of which contains the contents for a range of keys in a representation of your choice, and an index mapping key ranges to storage block locations. Usually the index is the same data structure but smaller, and through this recursive definition you get a tree. This is an unorthodox way of thinking about B-trees but it works and generalises well.

That storage block representation can be a sorted array of data if that's what you want.

Reading from a B-tree involves locating the storage block using the index, then reading the data you want from the block, in whatever representation is used. Presumably "data stored in arrays" also involves non-contiguous segments in storage at some segment size, so small lookups involve finding the right segment, perhaps a binary search to locate the first key, and then scanning it, while large range scans involve iterating over segments.

When the database revolves around fast SIMD array scanning operations, it can do that just as much inside a B-tree block that uses array representations for its contents as it can inside a segment of another array storage structure.
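A minimal sketch of that framing (hypothetical Python; a single index level over sorted-array blocks):

  import bisect

  class FlatBTree:
      """One index level over fixed-size sorted-array blocks."""
      def __init__(self, keys, block_size=4):
          keys = sorted(keys)
          self.blocks = [keys[i:i + block_size]
                         for i in range(0, len(keys), block_size)]
          self.index = [b[0] for b in self.blocks]  # first key per block

      def contains(self, key):
          if not self.blocks:
              return False
          # index lookup, then an in-block search that could just as
          # well be a linear/SIMD scan over the array
          i = max(bisect.bisect_right(self.index, key) - 1, 0)
          block = self.blocks[i]
          j = bisect.bisect_left(block, key)
          return j < len(block) and block[j] == key

A range scan would simply start from the located block and iterate over subsequent blocks, scanning each array in turn.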

Maintaining this structure will slow writes with out-of-order values, but you have that problem with either data structure.


It makes me wonder whether you would have more or fewer cache misses at the start of a search in an array-based B-tree (vs. the potential misses from following pointers).

However, this is all based on "searching". If you just want to read the latest N values in a time-series store, then yeah, an array or some kind of BRIN index would be better for reads.


It seems to me like an anti-pattern to have a time series whose identity is formed by (device, location, firmware version, sensor). (device, location, sensor) seems like a fine identity for what is presumably real-valued sensor data, while (device, location, firmware) is a good identity for, I guess, a string-valued time series; and if you really needed sensor data aggregated by firmware version you'd do a join, in this rare and, in my opinion, highly contrived case.


Maybe you want to be able to detect data errors when updating firmware versions? But there should be a way to compute the firmware version of a device with some kind of join, maybe against a series of device updates?
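A minimal sketch of that join, assuming a small per-device update log kept as its own series (all names here made up):

  import bisect

  # (timestamp, version) update events per device, sorted by time
  updates = {"dev1": [(0, "1.0"), (500, "1.1"), (900, "1.2")]}

  def version_at(device, ts):
      """As-of join: the firmware version in effect at timestamp ts."""
      log = updates[device]
      i = bisect.bisect_right([t for t, _ in log], ts) - 1
      return log[i][1] if i >= 0 else None

  print(version_at("dev1", 700))   # -> "1.1"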


What you do instead is have an entirely separate extra metric per device. It's easier to have 1,000 extra metrics than an extra label's worth of cardinality on 1,000,000 metrics for every single update.

Within Prometheus:

  cpu{hostname="host1",metadata="...",cpu_socket="0"} 5.0
  cpu{hostname="host1",metadata="...",cpu_socket="1"} 6.0
  cpu{hostname="host2",metadata="...",cpu_socket="0"} 5.0
  cpu{hostname="host2",metadata="...",cpu_socket="1"} 4.0
  version{hostname="host1",version_number="1.0"} 1.0
instead of baking version_number into every series. Just as an example, you can then see "oh, this hostname has that version number" on a chart, or even invert it and count by version number to find stragglers.


So he has 1000 devices at 20 locations, but the cardinality is now 1000*20??? Why, is each device reporting from all 20 locations?? What type of math is that?


Time-series databases are optimized for performance by keeping an index for each label/type of value. That inversion allows a tremendous number of values to be captured and reported on with low latency, but with the trade-off of fewer label permutations being viable at scale. The same applies to Prometheus, Graphite, etc.
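Roughly, the inversion looks like this (illustrative Python; real TSDB index structures are more involved):

  # posting lists: (label, value) -> set of series ids
  index = {}

  def add_series(series_id, labels):
      for label, value in labels.items():
          index.setdefault((label, value), set()).add(series_id)

  def lookup(**labels):
      # intersect posting sets; every extra label value is another
      # entry the index has to maintain, hence the cardinality cost
      sets = [index.get((l, v), set()) for l, v in labels.items()]
      return set.intersection(*sets) if sets else set()

  add_series(1, {"name": "cpu", "hostname": "host1"})
  add_series(2, {"name": "cpu", "hostname": "host2"})
  print(lookup(name="cpu", hostname="host1"))   # -> {1}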


He meant to say he has 20,000 devices total: 1,000 devices in each of 20 locations.

It's not out of the ordinary if you have different industrial facilities to instrument.



