Timeseries data storage in MongoDB (slideshare.net)
39 points by seigenblues on July 25, 2011 | 25 comments



Aw, this was physically painful to skim.

What you really want for time-series data is a column DB such as Cassandra (or Vertica, etc.), perhaps HBase, perhaps an RDBMS, or perhaps a plain old log file.

What you most definitely don't want is Microsoft Access or MongoDB. Thinking about it, MS Access might still work to a degree.


As long as all the data for all metrics fits in RAM, it will work great in MongoDB. Of course, as soon as it doesn't, you're totally hosed.

To expand on the parent, a column store like HBase or Cassandra is perfect: each row can represent a timeslice, and each column can represent a single event or record within that timeslice. As a row gets evicted from its initial storage in memory, it is written to disk contiguously, in sorted order, and batch reads of this data are sequential.
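For what it's worth, a minimal sketch of that layout with pycassa (the keyspace, column family, comparator, and sensor-hour bucketing here are my assumptions, not from the slides):

    import time
    import pycassa

    pool = pycassa.ConnectionPool('metrics')         # assumed keyspace
    cf = pycassa.ColumnFamily(pool, 'readings')      # assumed CF, LongType comparator

    now_ms = int(time.time() * 1000)
    row_key = 'sensor42:%d' % (now_ms // 3600000)    # one row per sensor-hour

    # column name = event timestamp (ms), column value = the reading
    cf.insert(row_key, {now_ms: '21.5'})

    # reading back a timeslice is one sorted, sequential column slice
    bucket = cf.get(row_key, column_start=now_ms - 60000, column_finish=now_ms)

Since columns within a row are kept sorted by name, that slice read maps to a contiguous range on disk.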

It is possible to use MongoDB to store a timeslice as a document, but it is not designed to scale to very large numbers of columns within a single document.
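By contrast, a hedged sketch of the timeslice-as-a-document pattern in pymongo (all names made up): the embedded array is the part that can't grow without bound.

    import datetime
    import pymongo

    db = pymongo.MongoClient().metrics               # assumed database name
    hour = datetime.datetime(2011, 7, 25, 14)        # the hour bucket for this slice

    # one document per sensor-hour; $push grows an embedded array of events
    db.readings.update_one(
        {'sensor': 'sensor42', 'hour': hour},
        {'$push': {'events': {'t': datetime.datetime.utcnow(), 'v': 21.5}}},
        upsert=True)

MongoDB's per-document size limit is what ultimately caps how far that events array can grow.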


MongoDB needs indexes to fit in memory, not the entire dataset. This is an important distinction.


Happily, this is not a synchronous application with tens of thousands of concurrent users, so "totally hosed" for us may have a very different definition.

What do you mean by "very large numbers of columns"? I've seen some Mongo users with very rich, i.e. large, document models.


1,000s, 10,000s, 100,000s, millions. MongoDB documents are designed for serializing rich objects, not for storing unbounded ranges of data values.

Even without tens of thousands of concurrent users, you'll eventually run into a deeply critical performance wall when MongoDB starts reading from disk. It's really best to think of it as an in-memory database.


Glad you didn't like it.

Vertica was not free, HBase was really heavyweight, & I knew I didn't want an RDBMS. The sensor data itself is plain CSVs ;)

Cassandra still does look interesting, although it looked like it would take me much longer to get it going; perhaps it will get rewritten. (The Mongo solution only took a week to get something usable.)


Curious to hear why you thought it would take longer to get going with Cassandra. A single-node Cassandra cluster can be up with a one-line command. Was it the availability of clients or interface abstractions? Tutorials & documentation are not as readily available, so that would be understandable.


It wasn't the ease of install or simplest-case deployment, that's for sure. I did install it and kick the tires.

This was around the end of 2009, and I don't think there were many clients available (I see pycassa dates to 2010-04).

Again, ease of development is much more important to us than performance; it'll probably be a long, long time before our DB engine is the choke point, at which point we'll have the resources to use a heavier tool.


This makes sense; the landscape was very different. Cassandra wasn't something I would have used back then either.


Is it just about the number of dimensions? Seems like you'd get decent performance with MongoDB if each timeseries is a collection and each entry is a separate document?
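For what it's worth, that design is a one-liner per reading in pymongo (names made up; the index on the timestamp is what has to stay in RAM for range scans):

    import datetime
    import pymongo

    db = pymongo.MongoClient().metrics               # assumed database name

    # one collection per series, one small document per entry
    db.sensor42.insert_one({'t': datetime.datetime.utcnow(), 'v': 21.5})
    db.sensor42.create_index('t')                    # supports time-range queries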


Being in a similar industry, I share the sentiments of the last couple of slides: dataloggers are expensive and horrible pieces of hardware, proprietary solutions with no regard for real-life scenarios (limited connectivity, power failures, connection failures, weird and inflexible data formats...).

I would love to have a look at their Arduino-based solution.


For a temporal or time-based key-value store (I think this is kind of what the presentation shows), I used a collection that was something like:

Temporal collection:

    { _id: "X1",
      data_temporal: [
        { time_start: SomeDate,  time_stop: SomeDate,  _id: "ID1" },
        { time_start: SomeDate2, time_stop: SomeDate2, _id: "ID2" } ] }

Data collection (two documents):

    { _id: "ID1", parent: "X1", data: { field1: "some info",     field2: 34 } }
    { _id: "ID2", parent: "X1", data: { field1: "Some info new", field2: 34 } }

What is cool about this is that if you have a data document's ID, like ID1, you can easily find out when it was added and how it changed.

If you have access to the temporal ID, X1, then at any time you can see what the data looked like.

If you need to relate data, the "foreign key" used is the data_temporal ID. In this way, it is possible to ask what your key-value store data looked like at any time.
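A rough sketch of that point-in-time lookup in pymongo, using the collections above (the query shape is my assumption, not code from the comment):

    import datetime
    import pymongo

    db = pymongo.MongoClient().kvstore               # assumed database name
    t = datetime.datetime(2011, 7, 25, 12, 0)

    # which version of X1's data was live at time t?
    doc = db.temporal.find_one(
        {'_id': 'X1',
         'data_temporal': {'$elemMatch': {'time_start': {'$lte': t},
                                          'time_stop': {'$gt': t}}}},
        {'data_temporal.$': 1})                      # project only the matching entry

    if doc:
        data_id = doc['data_temporal'][0]['_id']
        snapshot = db.data.find_one({'_id': data_id})   # the value as of time t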

But this could be off from what the article does.

This also works quite well in a relational database.


Are there advantages over storing data in HDF? I've been working with a few hundred gigabytes of financial data this summer and I'm finding that Python's data-oriented libraries (h5py, numpy, scipy, matplotlib, scikits.learn) cover my needs.


Depends on your usage. The rest of our toolchain is all Python: numpy, scipy, and matplotlib especially.

This might be poorly titled: it's not so much about the storage as about the aggregation of disparate sensor data into coherent, continuous data streams.


I'm totally ignorant of MongoDB: what does it do for you (in the way of data aggregation) that's not easy in numpy?


If your data can fit into arrays, then there's no advantage in terms of the types of aggregations.

However, Mongo allows you to store complex structures (think nested dictionaries/lists) and query on those nested structures, even reaching inside them to do the querying (small sketch below).

I guess you could do nested structured arrays in numpy; I've never done that before.
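A small sketch of that kind of nested query in pymongo (collection and field names are invented for illustration):

    import pymongo

    db = pymongo.MongoClient().metrics               # assumed database name

    # dot notation reaches into embedded documents and arrays
    hot = db.readings.find({'meta.location.site': 'roof',    # field in a nested dict
                            'events.v': {'$gt': 20.0}})      # field inside an array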


I use h5py's datasets (which are organized hierarchically and stored in compressed chunks) to do basic filtering and then load a fraction of my data into memory as numpy arrays.
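Something like this, as a minimal sketch (the file layout is hypothetical):

    import h5py

    # one group per instrument, chunked/compressed datasets inside
    with h5py.File('ticks.h5', 'r') as f:
        prices = f['/AAPL/price']      # dataset handle; nothing read yet
        morning = prices[:100000]      # slicing loads only this range as a numpy array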


I saw this presentation live at MongoDC and it was awesome.


What about Hadoop? You could also use Hadoop-style filesystems as a backend. What about using Hive? Does it also require a fixed schema?


He needed to find this board for his datalogger: http://www.amazon.com/Webcontrol-Universal-Temperature-Humid...


Interesting presentation; it would do better with some more details in a blog post or PDF.


Thanks, I'll try to write them up. Any particular areas you'd like to see expanded?


I am particularly interested in the software implementation. I am a mechanical engineer involved with high-rise construction, looking to disrupt Building Management Systems.


Displacing a BMS has huge problems (I know, I am in that space). First, you're talking about certified hardware for a variety of needs; hardware certified to the various ASHRAE/ANSI/etc. specs is expensive for a small company, no two ways about it. Second is the "no one got fired for buying IBM" mentality: if you hook up with Siemens, JCI, and the like, you have paid that huge maintenance contract so the vendor will fix their problems, and you won't end up with VAV controllers that don't work from an insolvent small vendor.


Think positive: that is why you and I are on HN while our colleagues live in a world of 30-year-old specs.



