What you really want for time-series data is a column db such as Cassandra (or Vertica etc.), perhaps HBase, perhaps an RDBMS, or perhaps a plain old log file.
What you most definitely don't want is Microsoft Access or MongoDB.
Thinking about it, MS Access might still work to a degree.
As long as all the data for all metrics fits in RAM, it will work great in MongoDB. Of course, as soon as it doesn't, you're totally hosed.
To expand on the parent, a column store like HBase or Cassandra is perfect, as each row can represent a timeslice and each column can represent a single event or record within that timeslice. As the row gets evicted from its initial storage in memory, it is written to disk contiguously, in sorted order, so batch reads of this data are sequential.
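To make the row-per-timeslice layout concrete, here is a minimal sketch in plain Python. It is only a toy in-memory model, not the HBase or Cassandra client API; `WideRowStore`, `row_key`, and the one-hour `SLICE_SECONDS` bucket width are all made-up names and values for illustration.

```python
from bisect import insort
from collections import defaultdict

SLICE_SECONDS = 3600  # assumed bucket width: one row per metric per hour


def row_key(metric, ts):
    """Row key = metric name plus the start of the timeslice containing ts."""
    return (metric, ts - ts % SLICE_SECONDS)


class WideRowStore:
    """Toy stand-in for a column family (hypothetical, not a real client)."""

    def __init__(self):
        # row key -> list of (timestamp, value) "columns", kept sorted
        self.rows = defaultdict(list)

    def insert(self, metric, ts, value):
        # Columns stay ordered by timestamp within their row.
        insort(self.rows[row_key(metric, ts)], (ts, value))

    def read_slice(self, metric, ts):
        # One row lookup returns the whole timeslice, already in time order.
        return list(self.rows.get(row_key(metric, ts), []))
```

In a real column store the sorted columns of a row end up next to each other on disk, which is what makes reading a whole timeslice a sequential scan.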
It is possible to use MongoDB to store a timeslice as a document, but it is not designed to scale out to store very large numbers of columns within a single document.
Happily, this is not a synchronous application with tens of thousands of concurrent users, so "totally hosed" for us may have a very different definition.
What do you mean by "very large numbers of columns"? I've seen some mongo users with very rich, i.e. large, document models.
1,000s, 10,000s, 100,000s, millions. MongoDB document fields are designed for serializing rich documents, not for storing unbounded ranges of data values.
Even without tens of thousands of concurrent users, you'll eventually run into a deeply critical performance wall when MongoDB starts reading from disk. It's really best to think of it as an in-memory database.
Vertica was not free, HBase was really heavyweight, & I knew I didn't want an RDBMS. The sensor data itself is plain CSVs ;)
Cassandra still does look interesting, although it looked like it would take me much longer to get it going; perhaps it will get rewritten. (The mongo solution only took a week to get something usable.)
Curious to hear why you thought it would take longer to get going with Cassandra. A single-node Cassandra cluster can be up with a 1-line command. Was it the availability of clients or interface abstractions? I understand that tutorials & documentation are not as readily available, so that's understandable.
It wasn't the ease of install or simplest-case-deployment, that's for sure. I did install it and kick the tires.
This was around the end of 2009, and I don't think there were many clients available (I see pycassa dates to 2010-04).
Again, ease of development is much more important to us than performance; it'll probably be a long, long time before our db engine is the choke point, at which point we'll have resources to use a heavier tool.
Is it just about the number of dimensions? Seems like you'd get decent performance with mongodb if each timeseries is a collection, and each entry was another document?
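For comparison, the two document shapes being discussed (one document per entry versus one document per timeslice) could look roughly like this. It is a sketch with plain dicts and no MongoDB driver; the field names (`series`, `ts`, `start`, `values`) and the one-hour slice are assumptions, not anyone's actual schema.

```python
def per_reading_docs(series, readings):
    """One small document per reading; document size stays bounded."""
    return [{"series": series, "ts": ts, "value": value} for ts, value in readings]


def per_slice_docs(series, readings, slice_seconds=3600):
    """One document per timeslice; it gains a field for every reading."""
    buckets = {}
    for ts, value in readings:
        start = ts - ts % slice_seconds
        doc = buckets.setdefault(
            start, {"series": series, "start": start, "values": {}}
        )
        # Field names in a document are strings, so the offset is stringified.
        doc["values"][str(ts - start)] = value
    return [buckets[start] for start in sorted(buckets)]
```

The first shape keeps every document tiny and pushes the work onto indexes; the second is the wide-row idea, where a single document keeps growing with the number of readings in its slice.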
Being in a similar industry, I share the sentiments of the last couple of slides: Dataloggers are expensive and horrible pieces of hardware. Proprietary solutions with no regard for real-life scenarios (Limited connectivity, power failures, connection failures, weird and inflexible data formats...)
I would love to have a look at their Arduino based solution.
What is cool about this is that if you have access to the data like ID1, you can easily find out when it was added and how it changed.
If you have access to the temporal ID, X1, then at any time you can see what the data looked like.
If you need to relate data, the "foreign key" used is the data_temporal ID. In this way, it is possible to ask what your key value store data looked like at any time.
But, this could be off from the article.
This also works quite well in a relational database.
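A rough sketch of the relational variant, using Python's built-in `sqlite3`. The table layout is a guess at what is being described, not the article's schema: `data_temporal`, `temporal_id`, `data_id`, and `valid_from` are hypothetical names, and every change is stored as a new row rather than an update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE data_temporal (
        temporal_id INTEGER PRIMARY KEY,  -- identifies one version (the "X1")
        data_id     TEXT NOT NULL,        -- identifies the datum (the "ID1")
        value       TEXT NOT NULL,
        valid_from  INTEGER NOT NULL      -- timestamp this version took effect
    )
    """
)


def write(data_id, value, ts):
    """Append a new version instead of updating in place; returns its temporal_id."""
    cur = conn.execute(
        "INSERT INTO data_temporal (data_id, value, valid_from) VALUES (?, ?, ?)",
        (data_id, value, ts),
    )
    return cur.lastrowid


def history(data_id):
    """When the datum was added and how it changed, oldest first."""
    return conn.execute(
        "SELECT valid_from, value FROM data_temporal "
        "WHERE data_id = ? ORDER BY valid_from",
        (data_id,),
    ).fetchall()


def as_of(data_id, ts):
    """What the datum looked like at time ts, or None if it did not exist yet."""
    row = conn.execute(
        "SELECT value FROM data_temporal "
        "WHERE data_id = ? AND valid_from <= ? "
        "ORDER BY valid_from DESC LIMIT 1",
        (data_id, ts),
    ).fetchone()
    return row[0] if row else None
```

Other tables would then reference `temporal_id` as their foreign key, so a related record always points at the exact version it was created against.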
Are there advantages over storing data in HDF? I've been working with a few hundred gigabytes of financial data this summer and I'm finding that python's data-oriented libraries (h5py, numpy, scipy, matplotlib, scikits.learn) cover my needs.
Depends on your usage. The rest of the toolchain is all python, numpy, scipy, matplotlib especially.
This might be poorly titled: it's not so much about the storage as it is about the aggregation of disparate sensor data into coherent, continuous data streams.
If your data can fit into arrays, then there's no advantage in terms of the types of aggregations.
However, mongo allows you to store complex structures (think nested dictionaries/lists) and query on those nested structures, even allowing you to reach inside of nested structures to do the querying.
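To illustrate what "reaching inside" a nested structure means, here is a plain-Python imitation of a dotted-path lookup. It only mimics the idea of MongoDB's dot notation on in-memory dicts and lists; `get_path` and `find` are hypothetical helpers, not driver calls.

```python
def get_path(doc, path):
    """Follow a dotted path through nested dicts/lists; None if any step is missing."""
    current = doc
    for part in path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif (
            isinstance(current, list)
            and part.isdigit()
            and int(part) < len(current)
        ):
            current = current[int(part)]
        else:
            return None
    return current


def find(docs, path, expected):
    """Return the documents whose value at the dotted path equals expected."""
    return [doc for doc in docs if get_path(doc, path) == expected]
```

For example, `find(docs, "sensor.site.floor", 3)` would pick out readings from sensors on the third floor without flattening the documents first.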
I guess you could do nested structured arrays in numpy, I've never done that before.
I use h5py's datasets (which are organized hierarchically and stored in compressed chunks) to do basic filtering and then load a fraction of my data into memory as numpy arrays.
I am particularly interested in the software implementation. I am a Mechanical Engineer, involved with high rise construction. Looking to disrupt Building Management Systems.
Overtaking a BMS has huge problems (I know, I am in that space). On one end you're talking about certified hardware for a variety of needs (hardware, especially certified to various ASHRAE/ANSI/etc specs is expensive for a small company, no two ways about it). Second is the "no one got fired for buying IBM" mentality - if you hook up with Siemens, JCI and the like, you have paid that huge maintenance contract to make the vendor fix their problem and you won't end up with an insolvent small vendor's VAV controllers which don't work.