InfluxDB 2.0 Alpha and the Road Ahead (influxdata.com)
119 points by pauldix on Jan 24, 2019 | 54 comments



Been using InfluxDB for a rather long time now, and while I've had my share of problems (including https://news.ycombinator.com/item?id=17768860), generally I do like the ecosystem a lot. When it works, it just works, and it's awesome.

That said, while TICKscripts are not exactly a pleasure to write, I wouldn't say a new language is high on my wishlist. For example, as the blog post states:

> We also need to add features for backup & restore, bulk data import and export, and data deletes.

I'd take any of those over a new query language any day. The restore part in particular is nearly non-existent at the moment after the backup format change in InfluxDB, so good luck if you happen to run into data loss or just want to move data around in general. The same goes for data deletion outside of whole measurements.

I am hopeful that those features arrive before the 2.0 launch, ideally in some sort of backwards-compatible way (or at least ported to InfluxDB 1.x).


Hi, I'm one of the engineers working on the storage side of InfluxDB. Improving the performance of ad-hoc deletes, as well as import/export (backup/restore), are features that my team will be actively working on in the coming weeks and months.


Hi @e-dard! That's great to hear; looking forward to the next releases! Keep up the good work!


> We also need to add features for backup & restore, bulk data import and export, and data deletes

Interesting to see that we at VictoriaMetrics are also paying attention to these features. Backup and restore already work - see below. Bulk data import and export, plus data deletes for arbitrary metric selectors, will be available soon.

It would be great if InfluxDB 2.0 had instant snapshots like VictoriaMetrics already does - https://medium.com/@valyala/how-victoriametrics-makes-instan... .

Also glad to see that future InfluxDB versions will support the Prometheus exposition format and PromQL :)


Those are indeed blockers for a full 2.0 release. There is backup and restore in the 1.7 release, but it's not nearly as powerful as it needs to be. We aim to correct that before any full GA release of 2.0.


InfluxDB creator here, happy to answer questions and add more commentary here!


TICKscript is indeed the killer feature; the problem is that debugging is difficult because the language itself is esoteric. Simply adding the ability to put printf statements to the console would be a game changer. The ability to write tests outside of Chronograf would be a game changer. The ability to mock inputs would be a game changer. The problem is you've reinvented the wheel and now you have to build the debugging ecosystem that is a well-beaten path for other languages. Love you guys' product. It's also hard to give you money, because there's no offering you make that quite fits one of our needs.


Thanks for the TICKscript feedback. That's all stuff we want to address in Flux. This alpha release doesn't have it yet, but printf, a test runner built into the influx CLI, and test inputs and outputs are all on the near-term roadmap.


That's my feedback on TICKscript as well; it's pretty hard to debug and implement something like this:

https://github.com/gravitational/monitoring-app/blob/master/...

I looked at Flux, and it seems pretty compatible with TICKscript, although I don't yet have a clear understanding of how to make it easy to write alerts/queries and debug them.

I would be curious to try out the REPL:

> We plan to provide a flux command line program that exposes a REPL and talks to various data sources.

Especially interested to see how easy it would be to edit and troubleshoot a multi-line transformation query like this one in it:

   cpu = data 
       // only get the last 5m of data
       |> range(start: -5m)
       // only get the "usage_user" data from the _measurement "cpu"
       |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_user")

Why create a new language vs. using JavaScript or Lua?

e.g. your flux example above could have been just:

   let cpu = data.range({start: "-5m"}).filter(r => r.measurement == "cpu" && r.field == "usage_user")

Is there any specific feature of Flux that requires a new language? I read your blog post here:

https://www.influxdata.com/blog/why-were-building-flux-a-new...

and it makes some valid points on Flux vs. SQL; however, I would be interested in how it would compare as Flux vs. JavaScript or Flux vs. Lua :)


There isn't a specific feature of Flux that requires a new language, although part of the language is a query planner and optimizer that is tied to certain functions in it. Technically any Turing-complete language is equivalent, but in practice that is seldom true. Performance is one vector, but so are expressiveness, aesthetics, ease of use, etc.

We chose not to use Lua because it's not a widely known language. Instead, we chose to create a language that looks and feels much like JavaScript, which I think is the most widely used language today. However, we wanted to limit its scope and to add new syntax over time to make frequent tasks easier to express.

Any time you create a new API or library, you create new surface area for a developer to know. A language is no different. Some languages are also much easier to learn than others. For example, Go is a language whose basics I could pick up very quickly, while Rust is something I'm still learning after months of effort. I like both languages, but their learning curves are very different.

I think Flux is quite easy to pick up for many programmers, although I'll have to see questions and have countless interactions with people trying to learn it to prove that out and make adaptations over time.

Finally, one goal of including the UI in InfluxDB 2.0 is that as we improve it, most users won't have to learn Flux at all. They'll be able to accomplish what they want by just clicking around the interface. We're not there yet, but it's what we aspire to. And we want to have control over the language to build it in conjunction with UI tooling that automatically manipulates it.


>We chose not to use Lua because it's not a widely known language. Instead, we chose to create a language that looks and feels much like JavaScript

This makes sense; in that case, why not take a strict subset of JavaScript?

> Finally, one goal of including the UI in InfluxDB 2.0 is that as we improve it, most users won't have to learn Flux at all. They'll be able to accomplish what they want by just clicking around the interface.

This will definitely be helpful, and auto-generation makes sense; however, in our use cases we treat Kapacitor alerts as code that is reviewed.


Flux is almost a subset of JavaScript. There are only two exceptions: our pipe-forward syntax, which I've seen proposed for JavaScript (but who knows if that will ever happen), and our named parameters with optional defaults. You can get close to the latter in JS by passing an object literal combined with destructuring.
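
To make that concrete, here's a rough sketch of both features (the function, bucket, and parameter names are invented for illustration):

   // a function with named parameters, one of which has a default
   scale = (tables=<-, factor=2.0) =>
       tables |> map(fn: (r) => ({r with _value: r._value * factor}))

   // pipe-forward chains the output of one function into the next
   from(bucket: "telegraf")
       |> range(start: -5m)
       |> scale(factor: 10.0)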

However, there are other things we'll probably be adding to the language over time. Also, if you're only allowing a subset of JS, is it actually JS anymore? I'd imagine that some JS programmers would get frustrated when things they think should work (because they're valid JS) don't, because they're not in the subset.

But say we went with that. Then we'd need to adopt an existing JS engine (most likely written in C++) and then modify it based on our needs. And then integrate that into our broader Go codebase. All of that seems a bit less than ideal.

Ultimately, I really do feel that if you're someone who knows JavaScript, you can learn the elements of the Flux language in less than an hour. The bigger learning curve is the library of functions and the API, which would exist regardless of which language we chose.

I think the proof will come over time based on what kinds of things we enable in the language and the platform. In the near term I expect a large number of totally reasonable people to question the choice of a new language. After all, that's probably the rational response. But as we improve it, add to it, refine it, and improve the developer and user experience, I expect to win more converts. Developers pick up new tools because they enable them to get their jobs done faster. Ease of use, speed of development, and productivity are our guiding lights.


> But say we went with that. Then we'd need to adopt an existing JS engine (most likely written in C++) and then modify it based on our needs.

This is definitely not an easy choice to make, and writing a new language interpreter in Go seems a way easier approach initially.

One thing Go helped me understand, though, is that the language itself does not matter as much as the tooling around it: debugging, compilation, and support around the language are much harder to achieve than writing a language parser.

Thanks for sharing your thoughts on this design choice. I think this discussion is very relevant for a couple of reasons:

There are other Go projects taking the same approach to achieve simplicity and not picking an existing language, like OPA [1]. Others, like Helm, are picking Lua [2], so there is clearly a problem the Go and infrastructure community is facing, and a split in the way people are approaching it.

We've faced a similar dilemma with Teleport as we design our extensions system. My original plan was to use Lua; however, after discussions with the team we settled on gRPC with Go [3], trading the expressiveness, simplicity, and freedom of Lua/JS in favor of the industrial features the Go runtime and gRPC provide out of the box.

For smaller extension plugins we decided not to create a new language, and ended up with an interpreted subset of Go [4].

I wonder if there is a place for some subset of JavaScript or TypeScript that is fully interpreted by Go, with native extensions for debugging, to be used by the community.

[1] OPA policy agent: https://www.openpolicyagent.org/
[2] Helm 3.0 Lua plugins: https://github.com/helm/community/blob/master/helm-v3/005-pl...
[3] Teleport Plugin Design Document: https://docs.google.com/document/d/1sPXXxx03P8VXWy-YD5w190g7...
[4] Subset of Golang: https://github.com/vulcand/predicate


They're probably looking at Flux as a sunk cost now, but I agree, Lua would have probably fit the bill.

Either way, I'm hoping this is "The Release" where things become consistent. They've pivoted several times, and it's all been good, but we do need a couple of years of consistency before we can make long-term investments in Flux.


Lua was certainly an option, and it's one I was even considering as an embedded scripting language back in the fall of 2014. I gave a talk in London where I asked for a show of hands on Lua vs. JavaScript as the choice, and the majority raised their hands for JS.

One of the reasons we didn't go with Lua is that we didn't want a separate query vs. scripting experience like you have with SQL engines that embed programming languages. We wanted something that felt and looked seamless.


Awesome, great explanation, thank you. I disagree on your point about Lua, but given what you guys have built so far I have no doubt you'll be successful.


I tried out InfluxDB a while ago in my spare time and was intrigued by the feature set, but ultimately couldn't get past the abstruse query language, especially coming from the simpler and more flexible PromQL (not being able to do ad-hoc math across time series was a big deal for my use case). I'm eagerly looking forward to giving it another shot with Flux and have super-high hopes.

What does the data model for time series look like in 2.0? Mostly the same as 1.x, or has that gotten more flexible as well?


For now we take writes in the 1.x line protocol, so it's still measurement, tags, and fields. However, Flux doesn't really make that a requirement, so in the future we plan on having a way to write series without requiring a field or even a measurement.

Once the planner gets the data to the Flux processing engine, it views everything as a table of data with columns and records. So it's much more flexible in how we can represent data.
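
As a rough illustration (names invented), a 1.x line protocol point like this:

   cpu,host=serverA usage_user=21.5 1548000000000000000

shows up in the Flux engine as a record in a table along these lines:

   _measurement | host    | _field     | _value | _time
   cpu          | serverA | usage_user | 21.5   | 2019-01-20T16:00:00Z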


Sounds good; thanks for the reply!


FYI, the article mentions that InfluxDB will eventually support PromQL:

> With InfluxDB 2.0 we support both push and pull models out of the box. Eventually we’ll also support querying via PromQL out of the box


I love InfluxDB, apart from the need to have separate retention policies when reducing granularity over time.

Are there any plans for a more unified method of performing continuous queries, so that we can query high-granularity and older, downsampled data at the same time?

That would be a killer feature for me.


Absolutely. We won't address that in the initial release of 2.0, but there will be ways to get it done. The eventual solution will probably revolve around using the tasks system to downsample into buckets with different retention, then using a function in Flux at query time that looks at the metadata of the buckets in the query and the time range and selects the precision based on that. We should be able to show examples of how to do this in Flux later this year.
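
As a rough sketch (bucket and measurement names invented for illustration), the downsampling half might look something like this as a Flux task:

   option task = {name: "downsample-cpu", every: 1h}

   from(bucket: "telegraf")
       |> range(start: -task.every)
       |> filter(fn: (r) => r._measurement == "cpu")
       // roll the raw data up into 5m averages
       |> aggregateWindow(every: 5m, fn: mean)
       // write the result into a longer-retention bucket
       |> to(bucket: "telegraf_downsampled")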


Awesome, I'll look forward to it. Thanks Paul!


I really dig the line protocol. Pretty simple.

Any HA features? Any sharding to look out for in 2.0? Or is the general idea to set up streaming relays of InfluxDB TSMs and treat HA as an L7 proxy routing problem (shadowing metric traffic using Envoy, for example)? How do people handle this in their production setups? Curious to know.

It would be cool if the query engine could talk to multiple shards spanning multiple machines for dealing with high cardinality series.


Right now we’re prioritizing work on the single open source server and our cloud service, which has a very different design. Flux will be able to query multiple servers and combine their results (in OSS), but that would be a building block for some HA or clustering.

So you could certainly layer in your own HA solution. We're still working out what, if any, clustered or federated features will exist in open source.


The fact that clustering is not part of the core is a showstopper for me.


And not making any money would be a showstopper for them. This is probably never going to change, so complaining about it is probably pointless.

(I'm not affiliated with Influx, but I strongly believe developers need to get paid with money.)


I totally agree, and I am a developer too, but their offering is outrageously expensive.


Are there any changes to the data storage level? Optimisations etc?

And can data points be incremented instead of the current field-replacement crap when you get new points with the same tag set?


In the last few months we have made quite a few improvements to data storage and indexing. Features that are available by default in 2.0, and which are significant changes from releases earlier than, say, 1.6, include:

  - Significant TSM encoding and decoding performance improvements.
  - The TSI index will be on by default.
  - Queries that use the same tag key/value filters will be answered from the index more quickly using an LRU cache.
  - Field keys will now be indexed in 2.0, making filtering/grouping on field keys more efficient.
  - Improvements to how series are extracted from the index, and points data from the TSM engine, which helps with memory performance for queries.
  - Significant performance improvements to measurement deletion. 

> And can data points be incremented instead of the current field-replacement crap when you get new points with the same tag set?

Can you elaborate on that?


Excellent info on the engine improvements!

I want to use influx to store _statistics_ not _events_. Basically, my data points are tag-sets and counts.

There are several ways to achieve this; for example, you can send the events to influx and have continuous queries to gather the statistics. That doesn't work well when you have a lot of events, and where they arrive out of order and at high latency, etc.

So what you typically end up having to build is a stats thing that sits in front of influx, tracks the counts of events with particular tag sets in particular time buckets, and then keeps uploading these to influx.

And there are two ways to do that:

1) you are not stateful and you keep uploading deltas and incrementing the nanoseconds to avoid data-point collision; you can then get the data out of influx with sum() on the fields and grouping by whatever the time bucket is. I tried this and influx grinds to a halt eventually.

2) you are stateful and track the totals outside influx, and keep uploading a newly-written data-point to overwrite the fields for that bucket in influx. This is much less data in influx and much easier to query, avoiding sum() etc. It's like I end up with something in front of influx doing what I want influx to do.

What would greatly simplify life is if the line format, which looks like this:

   measurement,tag1=x,tag2=y,tag3=z f1=total,f2=total timestamp

could look like this:

   +measurement,tag1=x,tag2=y,tag3=z f1=delta,f2=delta timestamp

and in the second case, where the line is prefixed with a + sign, influx knows to add to the fields when a data point collides with another, rather than overwrite them.

This would mean that people trying to store statistics in influx could add to those statistics statelessly. A massive simplification.

I've had other problems, like I have way more than 1M series. It's painful. My influx boxes hit iowait far too often, which is weird because the boxes have more RAM than the total dataset.


> Our vision for 2.0 is to collapse the TICK Stack into one cohesive and consistent whole ...

After InfluxDB 2.0, what applications would I have to run to be completely TICK-compatible?

I ask this because I would rather not have to run a DB on every server if Telegraf were fully merged into InfluxDB.


Hi, just to clarify - Telegraf is still stand-alone. You will not need to run InfluxDB 2.0 on every host that you need Telegraf on.


Backup and easy restore (not having to run SQL statements to switch up tables) are the things I care about the most.

I agree with the folks below suggesting you use JavaScript as the language instead of inventing your own.


I really like what Influx is doing. I find getting information in and out much easier in Influx than most others.


There's a canned dashboard for that, actually. Set up the Telegraf input for your Influx instance and it can tell you all kinds of neat stuff about what it's doing internally.


To other users of InfluxDB: How do you read data if you want to group, filter, or sort by something other than time?

We used to use InfluxDB to store our perf data, where every data point contained a timestamp (thus we thought InfluxDB was ideal for our use case). But soon we wanted to group, filter, and sort by various dimensions, and it led to performance issues.

Bringing it all in-mem and using pandas to do that was very slow for us. Also, creating indexes for so many columns didn't seem like a good idea.

We switched to postgres and the decision has served us well so far. I just want to understand whether influx isn't suitable for our kind of use case or we used it incorrectly.


InfluxDB 1.x was definitely not designed for that. With Flux in InfluxDB 2.0 you will be able to do things like store reference data in other places and join it with time series data in InfluxDB at query time. You can also sort, group, and filter by any measurement, tag, field, or value. However, there are no user-defined secondary indexes, so the scope of this will be a bit more limited based on how things are stored. I'd have to know more about the specific kinds of queries to figure out if it's something that would make sense within InfluxDB 2.
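
For example (bucket, measurement, and tag names invented), a Flux query that filters, regroups, and sorts on something other than time might look roughly like:

   from(bucket: "perf")
       |> range(start: -1h)
       |> filter(fn: (r) => r._measurement == "requests" and r.region == "us-east")
       // regroup by a tag instead of by series
       |> group(columns: ["service"])
       // order by value rather than by time
       |> sort(columns: ["_value"], desc: true)
       |> limit(n: 10)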


> Kapacitor is the killer app

Um. Yes. Wait, there are people that use influxdb _without_ kapacitor alerting??? Someone name any commercial or open source equivalent, b/c I'd like to know in case they kill it.


I haven't had any other needs since switching to Prometheus / Prometheus AlertManager / Grafana and a Slack binding.

AM looks ugly as hell ("designed by engineers"), but it's damn rock solid and versatile.

Tried Influx in the early days, but it was slow and buggy and they changed their paradigms every other version. Well, looks like they're doing it again...


How do you store long term data?


Prometheus can write data to another system for long term storage: https://prometheus.io/docs/operating/integrations/#remote-en...


I know. As the OP didn't like Influx, I was wondering which other solution is used.


> Um. Yes. Wait, there are people that use influxdb _without_ kapacitor alerting???

People who use InfluxDB for reporting, not for alarms.

Our reporting team uses InfluxDB-sourced data to send customers reports on things such as bandwidth utilization and performance statistics, and we also have some reports and dashboards for internal purposes. And honestly, we use Zabbix for a lot of internal metrics too. For simple things, it's much easier for me to just set up an item and a graph in Zabbix than it is to stand up a "cron" job to farm data into InfluxDB ("cron" in quotes because it's not literal cron but a Jenkins-based system).

For actually monitoring devices for problems and generating alarms, we use Zabbix for internal systems and Netcool for customer-facing systems.

(for the record, I'm an NMS engineer at a mid-sized telecom)


I use Grafana with an InfluxDB data source for alerting; Kapacitor's alert creation and sorting is less intuitive than Grafana's.


Kapacitor is not always intuitive, whether using TICKscript or Flux. We ended up using Python in addition, to cater for use cases that were hard to implement or maintain in TICKscript.


Nope, our goal is to take it forward with Flux and have it integrated into the main API. Flux functionality will be a superset of what's possible in TICKscript.


well that's a relief :)


I hadn't seen Flux and a few other things before, but it sounds really exciting. It looks like the Influx people release a lot of this info (and related tech like Grafana) on their YouTube channel: https://www.youtube.com/channel/UCnrgOD6G0y0_rcubQuICpTQ

Good content, it's a shame it's not more popular.


They quote something like 8 grand a node. That is a bit outside the 'popular' zone.


You can still do a lot with the open version. Especially if you need it for a small/side project, it's great.

If you really need the HA, you can start spending on the enterprise edition. Or move to another storage.


Our current cloud offering has much smaller initial pricing. Our Cloud 2.0 offering that we'll launch later this year will have usage based pricing as well as a free tier.


What's the story regarding high availability for InfluxDB 2.0? I know it is a commercial offering in the current release, and while I'm not super thrilled about that, I do think it's reasonable. It would be nice if you could buy the self-hosted version without a "contact sales" step, which I honestly will never do. (I would just put a proxy in front of the multiple instances that tries to write to as many of them as possible and does reads from one at random. What could go wrong!)

The other thing I never figured out in the current version is how to write the following query. I store samples like (device, network, direction) -> packet count. I then want to know how many packets were sent across the network in total.

With monitoring systems I've used in the past (internal to my former employer), this was easy. You would do the delta calculation at the lowest level to convert device packet counters into the number of packets sent in the last time interval, which varies randomly because samples do not necessarily arrive at discrete intervals. Then you would do an align, to bring the randomly-spaced sampling times into alignment across all the "streams" (each of which is a unique (device, network, direction)). Once the data is aligned, you can then do a group-by to get rid of a certain tag, like device (and just have (network, direction) -> total packets sent in the time interval).

With InfluxDB, I can't figure out how to do this. It has the group by time concept, but not alignment, so I can't write a query that will work. I ended up just computing the deltas before inserting the item, at which point everything works fine.

I have not tried the same query with Prometheus, but I suspect it would work like I expect, as it seems very heavily inspired by a certain internal monitoring system I am most familiar with.


This release of 2.0 is not a commercial offering. It's completely free and open source under an MIT license without restrictions of any kind. For example, you could use it as the basis of your own commercial offering or use code from it and that is all fair game.

For the query you mention, you can do that in 1.x using a combination of group by time, fill, and subqueries. In 2.0 you can do this in Flux using similar operators. More complex possibilities for interpolating missing data are also in the works.
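
As a rough sketch (bucket, measurement, and tag names invented for illustration), that kind of query might look like this in Flux:

   from(bucket: "net")
       |> range(start: -1h)
       |> filter(fn: (r) => r._measurement == "packets")
       // per-device counter -> per-sample delta
       |> difference(nonNegative: true)
       // drop the device tag so series merge on (network, direction)
       |> group(columns: ["network", "direction"])
       // align to 1m windows and total across devices
       |> aggregateWindow(every: 1m, fn: sum)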

Solving more complex query processing like what you mention is a specific reason why we started developing Flux. We couldn't figure out how to address some of the more advanced query feature requests in the old language, so we started with something new.

In addition to giving Flux more functions, flow control, and more power, we will also be adding more out-of-the-box functions and syntactic sugar to make some of these more advanced queries possible without being overly verbose and complex. Our order of operations for developing the language is:

1. Make it powerful
2. Make it easy
3. Make it fast

So it'll take us time to get to all of our goals, but I think we've laid down a pretty good foundation.



