Big data analytics is the biggest new trend in enterprises right now.
And one of the key parts of it is building machine learning models which
can do things like look at a customer's previous behaviour in order to predict
what their future behaviour might be. So this is one scenario where a time series database is infinitely better than, say, an RDBMS or columnar database.
Likewise, performing analytics on IoT devices (e.g. sensors in trucks or oil/gas
equipment) requires events to be captured, stored and later mined rapidly. And many
time series databases, being schemaless, allow you to manage data from disparate sources
in one table.
> So this is one scenario where a time series database is infinitely better than, say, an RDBMS or columnar database
I still don't get it.
Most of my background is in "small data" and database programs tend either to store time series in "objects" (other kinds of objects are models and graphs), like Eviews, or as ultimately-isomorphic-to-spreadsheets tables with some added syntactic sugar for time structure that's only understood by some functions.
I'm beginning to do some "not so small data" (text analysis from news websites now; but the essential thing is the time sequencing of information cascades) with sqlite and pandas (pandas is just horrible, but it's already there) and basically the only problem I have is that raw data is unevenly sampled and I have to make some choices when downsampling to a fixed grid so statistical analysis proper can be performed.
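To make the downsampling choice concrete, here is a minimal sketch in plain Python (no pandas) of bucketing unevenly sampled observations onto a fixed grid. The function name `downsample`, the `(timestamp, value)` tuple layout and the two aggregation options are assumptions for illustration, not anything from my actual pipeline. The point is that "mean per bucket" and "last value per bucket" give different series from the same raw data, and empty buckets still have to be dealt with somehow.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, step, how="mean"):
    """Bucket (timestamp, value) pairs onto a fixed grid of width `step`.

    Returns a sorted list of (bucket_start, aggregated_value).
    Buckets with no observations are omitted; filling them (carry forward,
    interpolate, leave missing) is a separate modelling choice.
    """
    if how == "mean":
        aggregate = mean
    elif how == "last":
        aggregate = lambda values: values[-1]
    else:
        raise ValueError(f"unknown aggregation: {how!r}")

    buckets = defaultdict(list)
    # Sort by timestamp so that "last" really is the latest observation.
    for ts, value in sorted(samples, key=lambda s: s[0]):
        bucket_start = (ts // step) * step
        buckets[bucket_start].append(value)

    return [(start, aggregate(values)) for start, values in sorted(buckets.items())]


# Hypothetical raw data: timestamps in seconds, unevenly spaced.
raw = [(0, 1.0), (3, 2.0), (12, 4.0), (14, 6.0), (31, 5.0)]

by_mean = downsample(raw, step=10, how="mean")
by_last = downsample(raw, step=10, how="last")
```

Note that the bucket starting at 20 has no observations and simply does not appear in either result.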
That said: I understand, as the grandparent poster does, that in some cases (high-end physics experiments) time-structured data is incrementally produced in enormous quantities, and just storing a thousand variables at ten thousand samples per second is a challenge in its own right.
But server logs? Sensor readings? At one point I had an Arduino hobby where I had a robot moving "of his own will" based on thermistors and did the necessary downsampling and filtering in the Arduino sketch itself for later analysis in Matlab. I mean, who's getting raw data from IoT devices into servers? Even my news scraping thing does some pre-selection before committing to the db.
People underestimate how quickly the signal-to-noise ratio decays at high frequencies; and it seems to me that Cargo Cult IT is massively overestimating its need for CERN-level compute.
Because it was nearly impossible to store all that data before, we had to figure out what we wanted to look at before we got the data. Now that we can store that data easily, it gives us the flexibility to find things we didn't know we wanted. For instance, if you are constantly storing raw sensor and proprioception data for your robot and you run into problems with, say, its movement routines, you can go back to the raw data and look for trends. Imagine how much easier debugging would be if you could just store the state of your software constantly rather than either looking for patterns ahead of time, or just toggling the data dump for small periods of time.
Now instead of a robotic hobbyist, imagine if you were a Tier 1 ISP or large banking institution.
My company deploys wireless sensor networks. Each sensor periodically reports its battery status, which we store in a database so we can see which sensors need their battery replaced.
Recently we found out that some batteries run empty much quicker than expected. The hardware vendor asked us for 'all battery updates for the past year or so'. We didn't save those; each update was just replacing the current value in the database. So it becomes very hard to diagnose this problem because the historic data is lost.
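For what it's worth, the difference between "replace the current value" and "keep the history" can be sketched with nothing more than SQLite. This is a rough illustration, not our production schema: the table name `battery_readings`, its columns and the helper functions are all made up for the example. Every update is appended as a new row, and the "current" level is derived by a query instead of being the only thing stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE battery_readings (
        sensor_id   TEXT    NOT NULL,
        reported_at INTEGER NOT NULL,  -- Unix time in seconds
        level_pct   REAL    NOT NULL,
        PRIMARY KEY (sensor_id, reported_at)
    )
    """
)


def record_reading(conn, sensor_id, reported_at, level_pct):
    """Append a reading; existing rows are never overwritten."""
    conn.execute(
        "INSERT INTO battery_readings VALUES (?, ?, ?)",
        (sensor_id, reported_at, level_pct),
    )


def current_levels(conn):
    """Latest known level per sensor, derived from the full history."""
    rows = conn.execute(
        """
        SELECT r.sensor_id, r.level_pct
        FROM battery_readings AS r
        JOIN (
            SELECT sensor_id, MAX(reported_at) AS latest
            FROM battery_readings
            GROUP BY sensor_id
        ) AS m
          ON r.sensor_id = m.sensor_id AND r.reported_at = m.latest
        """
    ).fetchall()
    return dict(rows)


def history(conn, sensor_id):
    """All readings for one sensor, oldest first."""
    return conn.execute(
        "SELECT reported_at, level_pct FROM battery_readings "
        "WHERE sensor_id = ? ORDER BY reported_at",
        (sensor_id,),
    ).fetchall()


# Hypothetical readings for two sensors.
record_reading(conn, "sensor-a", 1000, 98.0)
record_reading(conn, "sensor-a", 2000, 71.5)
record_reading(conn, "sensor-b", 1500, 99.0)
```

With this layout the dashboard question ("which sensors need a new battery?") is answered by `current_levels`, while the vendor's question ("all battery updates for the past year") is answered by `history`.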
I understand that I have never dealt with really large datasets, whether as a hobby or as a professional (doing econometrics and related analytical work).
What's not clear to me is what TSDBs (such as those listed by Wikipedia; I looked a bit into InfluxDB in particular) do better than RDBMSes.
This isn't "negative" skepticism, it's an earnest lack of knowledge.