InfluxDB v0.10 GA with hundreds of thousands writes/sec, 98% better compression (influxdata.com)
152 points by pauldix on Feb 4, 2016 | 75 comments



We are using 0.10 GA in production. We have seen a big disk space reduction, from 22GB to 700MB, and an order-of-magnitude performance improvement. InfluxDB is the modern time series database.

Keep Rocking !!


I wonder what sort of data you're storing with that reduction? I'm using it as a drop-in replacement for Graphite (with Grafana as a front-end, loving both BTW) and just had disk space reduced from 140GB to 250MB :D


Be _very_ skeptical of this. I wasted weeks trying to get git master versions of this not to fall over, including the tsm1 storage engine. They are very good at marketing and built a nice query language and API, but data storage has been a total shit show.

Looking at less than a month of issues:

* https://github.com/influxdata/influxdb/issues/5440

* https://github.com/influxdata/influxdb/issues/5482

* https://github.com/influxdata/influxdb/issues/5534

* https://github.com/influxdata/influxdb/issues/5540

If you need something that won't fall over: it's not glamorous and it's a bear to set up, but OpenTSDB will sail along under massive load once you've got it running.


Not sure what issues you're seeing, but we have people running this in production at significant scale.

One of the issues you linked to was for 0.9.6. Others were there because they had super high tag cardinality and not enough memory to actually run.

When people post comments like "I have 50 million series and it crashes on my box with 2GB of memory!!!", they're not relevant. You shouldn't expect miracles...


No, and you're countering rhetoric with rhetoric, so here are the specs: bare metal, 96GB RAM, 16 cores, 36 disks (also tried 36 SSDs). Nothing exotic, and similar to what you should be using to test before releasing another ill-fated storage engine on the strength of "It works on my MacBook Pro with an artificial load test, ship it!". On the same box carbon is usable, and TSDB can scale linearly across a cluster of these machines. Influx burns down as soon as more than a few GB of metrics are persisted, even after backing off collection.


I'm not quite sure what you're doing. If you have that level of hardware I don't know why you're running out of memory. Our tests with 100B points split across 1M series ran on much more modest hardware than that and didn't have a problem.

There's something different with what you're doing than with what we're testing. If you can give more detail about what your actual data looks like, I may be able to help.

Are you doing this on v0.10 (beta1 or greater) and still seeing this problem?


> If you have that level of hardware I don't know why you're running out of memory.

Did you folks fix the "oops, I accidentally selected too much and OOMed the machine" crash that was supposed to be resolved by the new query planner in .10? That was my immediate showstopper for InfluxDB for a very high-volume infrastructure, because I didn't feel like proxying InfluxDB just to enforce chunking, and the mere existence of the bug gave me pause. I hit that in 24 hours and shelved the system in 48.

Even pre-1.0, I'm not encouraged when I have to "work around" pretty obvious oversights in reliability, so I put you back on the back burner to give you time to mature and transitioned back to a Kappa-style architecture for my needs. I might still use InfluxDB for low-volume aggregations out of my stream processing, but it's for sure out of the hot path for the foreseeable future.


The current version still has the problem that if you do a massive SELECT then it will fill up memory until the process gets killed. In the future we'll give controls to limit how much memory a query can take.
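In the meantime, the usual workaround is to bound queries by time on the client and page through the range in chunks the server can comfortably hold in memory. A minimal sketch in Go against the standard /query HTTP endpoint (the database and measurement names here are just placeholders):

  // Page through a large time range with bounded queries instead of one
  // giant SELECT, so the server never has to materialize everything at once.
  package main

  import (
      "fmt"
      "io"
      "net/http"
      "net/url"
      "time"
  )

  func main() {
      const queryURL = "http://localhost:8086/query" // standard query endpoint
      start := time.Now().Add(-24 * time.Hour)
      end := time.Now()
      window := time.Hour // size each chunk to fit comfortably in RAM

      for t := start; t.Before(end); t = t.Add(window) {
          // "mydb" and "cpu_idle" are placeholders for your database/measurement.
          q := fmt.Sprintf("SELECT value FROM cpu_idle WHERE time >= '%s' AND time < '%s'",
              t.UTC().Format(time.RFC3339), t.Add(window).UTC().Format(time.RFC3339))
          resp, err := http.Get(queryURL + "?db=mydb&q=" + url.QueryEscape(q))
          if err != nil {
              fmt.Println("query failed:", err)
              return
          }
          body, _ := io.ReadAll(resp.Body)
          resp.Body.Close()
          fmt.Printf("chunk starting %s: %d bytes\n", t.UTC().Format(time.RFC3339), len(body))
      }
  }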

I think it's common with databases that if you throw a massive query at a server that doesn't have the resources to handle it, things will go wrong. It'll thrash, or crash, or generally have poor performance. We'll be working on improving the failure conditions, but if you send a query that's too big for the server to handle, you'll hit some sort of failure scenario. It's just a question of how it's handled.

Also, there's no new query planner in 0.10. We have a bunch of work getting merged in for the query engine for 0.11. But if you throw a huge query at the DB that the server can't handle, you'll still have great sadness.

Which DB ended up solving your big query problems? Maybe we're just a poor fit? Or what kinds of queries should we be working on optimizing?


> The current version still has the problem that if you do a massive SELECT then it will fill up memory until the process gets killed.

That means any ad-hoc query can trivially crash your server. That's pretty serious. It's even worse if the ingest is using statsd-influxdb-backend with UDP, since during the crash you'll lose data.

Do you really think it's OK for a novice user to be able to crash the server from your web GUI if they make a minor mistake (e.g. something like

  SELECT value FROM /series.*/ 
not realizing how many series their regex actually matches until it's too late and the GUI has stopped responding)?

> I think it's common with databases that if you throw a massive query at a server that doesn't have the resources to handle it, things will go wrong.

This thinking is worse than wrong. Not only is it wrong - a simple unqualified select won't crash any "common" database - it makes anyone with any "common" database experience doubt your other claims.


> I think it's common with databases that if you throw a massive query at a server that doesn't have the resources to handle it, things will go wrong.

This has never ever happened to me in the last 20 years I've used SQL databases. Sure it might take a long time but that's it.


> Did you folks fix the "oops, I accidentally selected too much and OOMed the machine" crash that was supposed to be resolved by the new query planner in .10?

> The current version still has the problem that if you do a massive SELECT then it will fill up memory until the process gets killed. In the future we'll give controls to limit how much memory a query can take.

...

It locks up on the query until it's killed manually and/or it kills the query automatically.

It doesn't crash.


That is more or less how my employer worked around this glaring bug. That being said, we've worked together with Influx to fix some issues and add features to Telegraf, and have even had their developers fix several of our issues found under excessive load (/me waves to Alex and Sebastian). Influx has a long way to go, but I've got faith that they're going in the right direction and things are getting a lot better. It is true this announcement is a bit too full of blatant marketing rhetoric, but the tech is good and it is getting better. If you haven't tried it very recently, give it another shot.


I don't have a lot of time to spare on this and we've found a viable option, so I won't be using Influx any time soon. I can give you access to my production system if you are interested, and while you watch, arrange for between 1k and 10k hosts (sending 2-20 values per second each) to feed it.


If you found something else that works for you, that's great. I'm sorry that InfluxDB didn't work for you when you tried it, but we'll continue to improve and build. My feeling is that this release is a significant milestone in our evolution. We'll see how people react over the coming weeks after actually using it.


I'm sad that pauldix disappears when the comments get direct like this - I've used influxdb at smaller scales and I'm genuinely curious to see responses to a few different threads going on in this discussion. But they are left without a reply as of now.


I disappeared because I can't spend all evening responding to every person on the internet who may be having trouble with our software. An offer to SSH into a system that isn't running Influx and try to troubleshoot some ambiguous problem isn't direct.

We have people that have given us useful troubleshooting information that we've helped (and have helped us improve InfluxDB). Constructive criticism with an offer to help us improve is the best.

While I'd love to trace down every problem for everyone on the internet, my time on this planet is limited.

As for your concerns about whether it scales, we'll try to put out more benchmarks and test code over the coming weeks. For reference, we built a tool for stress testing that you can see here:

https://github.com/influxdata/influxdb/tree/master/cmd/influ...

You'll have to make your own decisions.


We would install Influx; as a Go app that's fairly trivial, and it speaks the same collection protocol...

As a random problem on the internet, this is the workload you should expect from any midsize business and probably something your company would like to target for revenue. Good luck.


This is why I said they are great at marketing. There's really little value in working with me, but at least responding while the article is on the front page makes it look like they are responsive and looking into issues.

Maybe they can sign a big user/customer that's willing to gut it out through all the pain of helping to make an operational data store... someone in this thread mentioned Mongo and it sounds very familiar.

I wonder if I'm being too negative, but the flip side is that this cost me and others a lot of stress, lost face, and time. Most developers and operators are overworked already, so I am trying to save them some pain, since the blog post makes it seem like everything is fine and awesome.


This is a very odd thread.

I'm in no way involved in influx, we're just evaluating it, but this entire thread reads like complete FUD at the moment, so it's odd that you're calling out a maintainer or whoever pauldix is for "disappearing".


If you're just now evaluating it, you might not realize the history behind the project. It's more backlash than FUD... infrastructure people don't like to be jerked around by marketing tactics any more than we already are. At various times each of the storage engines was promoted as the bee's knees: LevelDB -> BoltDB -> bz1 -> tsm1. That kind of fumbling speaks for itself. It's strange for a high profile investment (including YC). For a historical contrast, compare with FoundationDB, which was founded/funded around the same time.


> For a historical contrast, compare with FoundationDB, which was founded/funded around the same time.

FoundationDB was in stealth mode for at least three years before they launched their closed Alpha release, so perhaps this isn't a fair comparison.

(I was an intern at FoundationDB during summer 2012.)


At the time I wrote the comment, there were direct questions left unanswered for 6 hours, while replies were given to other less-interesting comments in the meantime - that to me looks like someone ignoring something. He came back, that's great!


He replied an hour ago...


An hour after I said that, yeah. But not in the like 6 hours prior while he was answering other stuff.


How about issues like these? https://github.com/influxdata/influxdb/issues?utf8=&q=is%3Ao...

In my case I had a 0.9.4 influxdb setup fed by statsd for about a week. The server hard crashed the first time I tried to back it up.


We just have to close those out. Those are all fixed in the release we have today.


I have been forwarding Riemann metrics to it and visualizing with Grafana, pretty fun so far. Love that it's written in Go and fairly lightweight on resources compared to JVM-based technologies.


I just love the ability to GROUP BY in queries; it works so fast and saves me from writing a stupid map-reduce-style job for such a basic thing. I am building a nice time series analysis product with it and this change will work very well.


It depends greatly on the shape of your data, but with regularly spaced timestamps at second level precision and float64s, we’ve seen compression reduce each point down to around 2.2 bytes per point.

AFAIK you're using the float64 compression scheme from the "Gorilla" paper. 2.2 bytes per data point is possible with it, but only on data that doesn't utilize the full double precision (example: small integers converted to float64). You should compare the compression algorithm used by the TSM storage engine with zlib or another general-purpose compression algorithm; otherwise this number is meaningless.


Honestly if you have a series of increasing unix timestamps you should be able to do much better than 2.2 bytes each.


Even simple delta+RLE can squeeze increasing unix timestamps down to a few bytes. I'm talking about floating point data compression. Even the specialized state-of-the-art algorithms can't guarantee you 2.2 bytes per float in general.
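To make that concrete, here's a toy sketch in Go (timestamps assumed sorted) of delta+RLE on timestamps: a million points spaced exactly 10 seconds apart collapse to a single (delta, run length) pair, which is why the hard part is the float values, not the times.

  // Toy delta+RLE encoder for sorted unix timestamps: regularly spaced
  // series collapse to a handful of (delta, runLength) pairs.
  package main

  import "fmt"

  func deltaRLE(ts []int64) [][2]int64 {
      var runs [][2]int64
      for i := 1; i < len(ts); i++ {
          d := ts[i] - ts[i-1]
          if n := len(runs); n > 0 && runs[n-1][0] == d {
              runs[n-1][1]++ // extend the current run
          } else {
              runs = append(runs, [2]int64{d, 1}) // start a new run
          }
      }
      return runs
  }

  func main() {
      // One million timestamps, 10 seconds apart.
      ts := make([]int64, 1000000)
      for i := range ts {
          ts[i] = 1454617920 + int64(i)*10
      }
      fmt.Println(deltaRLE(ts)) // [[10 999999]] -- a few bytes for the whole column
  }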


I had another go at this new release after failing to get any of the 0.9.x releases to suit my use case of massively high volume writes. Now that I've managed to get some of my data into it, aggregation queries seem to be lagging in speed. I would like to know more about what is actually being done to improve query speed, rather than waiting for the v0.11 release, as I am deciding whether InfluxDB is the way forward for my use case.


The new query engine work should get merged into master early next week. You'll be able to test against a nightly build then to see what kind of performance improvement will be part of 0.11.

What kinds of queries are you seeing poor performance with? That would help us troubleshoot and improve, thanks.


Thanks for the update. Most of the queries I am testing now are 'mean' and 'group by' aggregations for downsampling and visualization in Grafana. It looks like the problem is CPU-bound, as all 8 cores get maxed out when processing the queries. Currently there are about 50 series and 5B data points in total.


Been using 0.9 for a few months now and been pretty happy with it. Though I'm not really looking forward to figuring out how to move from telegraf 0.2 to 0.10, since that had some breaking changes.

I do love the simplicity of influx+telegraf, kapacitor also looks cool. Chronograf seems like a bad version of grafana still... maybe in the future if it's FOSS and somehow manages to be better than grafana I'd use it.


I don't get all the hate. Reading the comments here yesterday and then again today made me really question what's going on in this community.

They (InfluxDB) made huge progress, they work hard on an open source DB (and ecosystem), and people insult them and question Paul Dix's >30 minute response time to support questions in an HN thread.


I would love to see some good (and up to date) documentation on replacing graphite with influxdb (retention, rollups, etc).

Last time I tried it out I vaguely recall that configuring rollups was kind of painful -- lots of nearly duplicate continuous queries, even for a relatively small number of series.


Is the text ingest engine usable with TCP yet? UDP is a bit weird for this when I may care greatly about whether my server is successfully feeding a stream of lines to InfluxDB.


We don't have the TCP listener wired up yet, but it's on the roadmap. Thanks for bringing it up, I'll try to get it prioritized for sometime in the next few months.


They took the MongoDB approach: build a shitty DB, then build a normal one and claim it's 100x faster.

And then they add synchronized disk commits (like PostgreSQL has always done) and performance goes poof (meaning lower than pg) (examples: Elasticsearch, MongoDB).


fsyncs happen on every write. They're durable. This release may be 100x faster than our previous release, but it's also a competitor to other solutions.

We haven't run any comparisons vs. other solutions, but we will do that soon and I expect that we'll be competitive and better in some cases.


Very tasteful way to handle the parent comment :)

Does Todd Persen talk about durability testing in the video or is it more about performance benchmarking? I haven't watched yet.

https://vimeo.com/153802549


Thanks :). He talks about doing kill -9s on the process while doing a bunch of writes, trying to corrupt the DB, and perf testing, in addition to a little bit of talk about what came before and some lessons learned.


Have you had a chance to put Influx through the paces of Kyle Kingsbury's Jepsen distributed data store tester?


Not yet, but it's on our TODO list. However, we have to finish the clustering implementation before we bother going through that testing. At this point we're telling people it's experimental and not meant for production use.


This is not the first time I have seen an anti-InfluxDB post that also subtly pushes PostgreSQL in its place. Now, I totally agree (I am not happy with InfluxDB's performance either), and maybe I'm just being paranoid, but it's interesting that I've seen this a handful of times.


Nothing against influxdb. But if you have 100X, you probably were doing something shitty before.

I use postgresql as the comparison because antirez did in http://oldblog.antirez.com/post/redis-persistence-demystifie... (and almost no one else has done something similar).


Have a look at this release. The write performance is massively better than it was. We did tests writing up to 100B points against a single server.

The next release will focus on improving query performance, but this one still works for many cases.


"Massively better" doesn't mean deployable to production either. Especially not if you've only tested with a single point of failure with a single sized payload. But I'm willing to give it a third (and final) trial:

1) Does "DELETE * FROM foo" still cause the system to lockup, freeze, and require a restart to free memory? Or are there other conditions/queries that cause the system to become unstable?

2) There's no README/CHANGELOG/dependencies info on the download page. Which is the preferred version of Go to install on my servers for Influxdb - 1.4 or 1.5?


I'd guess the Go version doesn't matter; you should install the binary, which is statically linked, no?


Massive performance improvements like that tend to come down to overall changes in the data structure and/or algorithm. This is like, "oh yeah, maybe we should do LSM?" or "maybe compressed bitmaps?" kind of stuff. What was the magic?


New storage engine.

Introduction from last October: https://influxdata.com/blog/new-storage-engine-time-structur...


Oh right, I remember when that blog post came out. It was amusing because I'd built an in-house metric store much like InfluxDB (using a LevelDB engine), and reached much the same inflection point and decision. I so should have open sourced that so you poor saps didn't have to experience my pain.


What is the query performance like today? And, what are your goals for query performance after the next release?


I would like to see an actual comparison of both. I'm going the pg way myself, due to the millisecond-precision timestamps required for handling scientific data.


InfluxDB supports timestamps down to the nanosecond. We recommend that you only use the precision you need, since you'll get better compression that way.


Good to know, thanks!


can you efficiently store logs in InfluxDB? would love to read if somebody have tried that before... (just wondering)


We store some logs at a small scale (5k/s) and queries are fine on 0.9.5. You just have to be smart about how you do tags. Hopefully 0.10 is the same.


InfluxDB is optimized for storing samples. I don't know for sure about 0.9+, but I can't imagine the architecture has improved to the point where it would be happy about storing large amounts of text data.


Perhaps if the logs used logfmt with a parser as an intermediary.


A lot of comments have mentioned Time Series use. How does it compare to Cassandra?


But does it work? It wins the award for the buggiest database server I've used in 20 years.

Is writing less than 20MB/sec of data something to brag about?


We're using it with GitLab.com and we like it. 0.9 didn't work at all due to the volume but with 0.10 everything is functioning OK


A full cluster or a single node?


We currently run InfluxDB 0.10.0-nightly-614a37c (I have yet to upgrade it to the stable release) on a single DigitalOcean instance with 8GB of RAM and 30-something GB of storage. The previous stable release (0.9 something) didn't fare very well, even after we significantly reduced the amount of data we were sending (we were sending a lot of data we didn't really need).

Switching to 0.10.0-nightly-614a37c in combination with switching to the TSM engine resulted in a very stable InfluxDB instance. So far my only gripe has been that some queries can get pretty slow (e.g. counting a value in a large measurement can take ages) but work is being done on improving the query engine (https://github.com/influxdb/influxdb/pull/5196).

To give you an idea of the data:

* Our default retention policy is currently 30 days

* 24 measurements, 11975 series. Our largest measurement (which tracks the number of Rails/Rack requests) has a total of 28 539 279 points

* Roughly 2.3 out of the 8 GB of RAM is being used

* Roughly 4 GB of data is stored on disk

This whole setup is used to monitor GitLab.com as well as aid in making things faster (see https://gitlab.com/gitlab-com/operations/issues/42 for more info on the ongoing work).


Thanks for the information. :)

Unfortunately, I need 2+ instances with Active/Active or failover to seriously consider anything for production which is why I've not touched InfluxDB beyond some light testing.


Am I correct in assuming that you got to the 20MB/sec number by taking our 3 bytes per point number times the number of data points?

The input is actually much higher than that. Data points over the network look like this:

cpu_idle,host=serverA,region=uswest value=23.0 1454617920

That's actually a toy example. Most real data would probably have more tags and longer measurement names. Obviously that's much more than 3 bytes.

We persist that to disk in a write ahead log (WAL) and then later we can do compression and compactions on the data to squeeze it down to 3 bytes per point. However, that takes more than a single write against the disk to get to.

Run a load test against it. See how much network bandwidth you can use. See what your HD utilization looks like. My guess is you'll be surprised by what you see.
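For reference, a bare-bones version of such a load test is just a loop POSTing batches of line protocol at the /write endpoint and watching points/sec alongside network and disk utilization. A rough sketch in Go (the "loadtest" database name is a placeholder and assumed to already exist; precision=s matches the seconds timestamp in the example line above):

  // Minimal write load test: POST batches of line protocol, report points/sec.
  package main

  import (
      "bytes"
      "fmt"
      "net/http"
      "time"
  )

  func main() {
      // Placeholder database "loadtest"; create it first. precision=s matches
      // the seconds timestamps written below.
      const writeURL = "http://localhost:8086/write?db=loadtest&precision=s"
      const batches, batchSize = 100, 5000

      base := time.Now().Unix()
      start := time.Now()
      for b := 0; b < batches; b++ {
          var buf bytes.Buffer
          for i := 0; i < batchSize; i++ {
              // Unique timestamp per point so nothing gets overwritten as a duplicate.
              fmt.Fprintf(&buf, "cpu_idle,host=server%d,region=uswest value=%d.0 %d\n",
                  i%100, i%100, base+int64(b*batchSize+i))
          }
          resp, err := http.Post(writeURL, "text/plain", &buf)
          if err != nil {
              fmt.Println("write failed:", err)
              return
          }
          resp.Body.Close()
      }
      secs := time.Since(start).Seconds()
      total := batches * batchSize
      fmt.Printf("wrote %d points in %.1fs (%.0f points/sec)\n", total, secs, float64(total)/secs)
  }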


And about the bugs...? My experience with 0.8 and 0.9 was somewhat sub-par. I personally am awaiting 1.x and it wouldn't surprise me if others were too.


I don't like the fact that old data will continue to use the old engine after the upgrade.


There's a migration script that can be used for moving old data to the tsm engine.


I have nothing to do with influx and I probably won't in the future.

There are way too many haters on HN. You venomous minority who shit on every bit of good news that isn't yours -- keep your negativity to yourself.

You fucking monkeys infected with rage.


On the one hand I agree with your sentiments, on the other hand I disagree with your approach.


Looks like all the hordes of oldies from Slashdot are now coming to HN.


It's true that there's too much negativity on HN, but it's also true that people contribute to it without meaning to. Your comment kind of demonstrates that.

On the other hand, your use of the word 'minority' there was both astute and thoughtful.



