Hacker News
Influxdb made the switch from Go to Rust (reddit.com)
340 points by tim_sw on Oct 1, 2023 | 157 comments



I love influx but damn do they like moving (too?) fast and quickly changing stuff. In a way, it's pretty cool since it means that they don't get stuck with bad decisions for backwards compatibility reasons, but it's a bit of a roller coaster for users.

Not sure what's the best solution though. Having a "stable" but fundamentally limited product (I guess influxdb v1) or breaking stuff in hopes of ending up with a way better technical foundation.


We're migrating off of InfluxDB due to that rollercoaster, honestly. It's hard enough to find time to maintain the monitoring stack at work. Casually dropping "Oh, and now you get to rebuild all the Grafana dashboards to change the query language" on top of that doesn't help. And apparently, version 3 does the same thing, except backwards.

Sorry, but at that point, we've decided to rebuild the entire metric visualization once on TimescaleDB, since we're running postgres a lot anyhow.


Fair warning, I had serious scaling issues with Timescale.

Solutions like Grafana Mimir, Victoria Metrics, Clickhouse, or yes, the new Influx implementation, are much more scalable and will give you much fewer headaches.

ClickHouse is really brilliant, btw; it's a powerhouse. Especially with the fairly recent additions that enable a hybrid local + S3 setup, pushing older metrics to S3 for cheap long-term storage.


Agreed. We initially used Timescale for our GraphQL Metrics product[0] but very quickly ran into scaling & performance issues. We switched to ClickHouse and have scaled 10,000x+ since with almost no issues.

[0]: https://stellate.co/graphql-metrics


Also, Timescale similarly introduced S3 for bottomless data tiering:

https://www.timescale.com/blog/expanding-the-boundaries-of-p...


Only for their managed solution at the moment.


If you are open, would love to hear more about some of the challenges you had with Timescale, esp. with your workload.

mike (at) timescale or DM on twitter?


The cloud storage option for CH looks like a game changer for time-based data. Any concerns there about accidentally causing thrashing when cold data is needed? I believe the MergeTree system works by splitting the table into parts during insertion that later get merged together, so you have to be careful that during merging, only the "hot" data is touched; otherwise you'll start pulling in cloud storage data that's supposed to be effectively read-only.


The merging happens primarily on hot data, I haven't run into any issues there.

But there are lots of approaches, depending on your needs.

You can (should) define a "cache disk" for S3, which will cache up to X GB locally to avoid thrashing.

Another option is to move data into separate (purely S3-backed) tables after a certain time to avoid accidentally fetching large amounts of data from S3. You can still easily join the data together if needed.
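A hedged sketch of that tiered setup in ClickHouse SQL; the table, columns, and the storage policy name 'hot_and_s3' are made up, and it assumes an S3-backed 'cold' volume (fronted by a cache disk) is already configured server-side:

```sql
-- Hypothetical table: recent parts live on local disk, parts older than
-- 30 days are moved to the S3-backed 'cold' volume by the TTL rule.
CREATE TABLE metrics
(
    ts    DateTime,
    name  LowCardinality(String),
    value Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (name, ts)
TTL ts + INTERVAL 30 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'hot_and_s3';
```

Because merges mostly touch recently written parts, the old S3-resident parts stay effectively read-only.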


Can you elaborate on what scaling issues you had with Timescale?


I ran into issues with TS too. The main issue I recall was that maintaining a grand total count of events already rolled up into daily counts was slow, since it always looked back at all the data. There was no way in the TS patterns to express it efficiently without hand-rolling something. The issue was that a grand total count can't be expressed in terms of a hypertable, because there's no time column.

It’s fantastic for workloads that neatly fit in the hypertable pattern though.


You tried to use a timeseries database without a time column?


> Don't be snarky.

https://news.ycombinator.com/newsguidelines.html

Are you expecting a real answer of how I was hoping timescale's internal watermark system would help me roll up a total count or are you just implying I'm an idiot?


What's the difference between the watermark and a time column? Not the person you were replying to, but I'm curious since I also thought that TimescaleDB had a similar "timestamp" to Influx.


TS requires you to specify a time column in each hypertable or view (continuous aggregate) where you want it to work its magic. It then stores an internal watermark that it compares to the time column in the table to figure out from where to read when refreshing.

My issue was that for a grand total I didn’t have a time column, so I couldn’t define my query as a continuous aggregate and the query had to start counting from the start of my underlying series each time.
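To make the constraint concrete: a continuous aggregate is defined over a time_bucket of the hypertable's time column, so a bucketless grand total doesn't fit the pattern. A hedged sketch with made-up table and column names:

```sql
-- This works: the aggregate is keyed by a time bucket, so Timescale's
-- watermark can refresh it incrementally.
CREATE MATERIALIZED VIEW events_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', created_at) AS day,
       count(*)                         AS n
FROM events
GROUP BY day;

-- A grand total has no time column to bucket, so it can't itself be a
-- continuous aggregate; the closest workaround is summing the rollups:
SELECT sum(n) FROM events_daily;
```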


Perhaps: add a time column with an artificially huge time range? This defines the grand total as an interval sum that happens to include all possible intervals.


Same here. I joined my current company 3 years ago when Influx v2 was coming out. I was supposed to build some analytics on top of it. It was very painful. The Flux compiler was often giving internal errors, the docs were unclear, and it was hard to write any slightly more complicated code. The dashboarding is subpar to Grafana, but Grafana had only raw Flux support. There was no query builder for Flux, so I tried building dashboards in Influx v2, but the whole experience was excruciating. I still have an issue open where they have an internal function incorrectly written in their own Flux code; I provided the fix and explained the issue, but it was never addressed. Often I had a feeling that I found bugs in situations so basic that it felt like I was the only person on the planet writing Flux code.


I'm running InfluxDB 1 and 2 in parallel for a personal project, waiting for v2 to get mature and stable enough to replace v1. It's never happening I guess. v1 still works great for me.


We are influxdb enterprise customers and looking to do the same thing. They've kept their enterprise offering on 1.x, which has kept us mostly happy, but seeing what's going on in their OSS stuff is horrifying and we're looking to avoid the crash and burn at the end of the tunnel.


We announced the availability of the v3 successor to Enterprise v1. It supports the v1 API. We're still building data migration tooling, but if you're interested in testing it out just email support or your sales rep.


How does one upgrade from v2 beta to the latest v2? The docs for doing that seem to no longer exist https://github.com/influxdata/influxdb/issues/24393


To be honest I'm not sure. Upgrading individual releases on the way should take you there, but the v2 beta was quite a while ago.


Are you running the "OLAP" TimescaleDB on the same instance as your regular OLTP Postgres? This is the only reason I would entertain TimescaleDB, if I had a strict "1 server" requirement. I briefly deployed and looked into it and there were a lot of footguns like with compression.

If not, I would suggest looking at a proper OLAP DB. VictoriaMetrics has been great and was easy to set up.


We're much rather looking at reducing the number of technologies we have and exchanging one specialized one-use database for another one doesn't sound great. And sure, TimescaleDB is a hefty extension and will require some work to understand it, but things like HA, backups and overall management of Postgres are pretty much solved for us.

And beyond that, TimescaleDB works with a few things we have already. We could migrate Zabbix to use TimescaleDB for a large performance boost. Also 1-2 teams are building reporting solutions for the product platform, and they are generating some significant timeseries data in a Postgres database as well.


That's a fair point, but it's worth appreciating the fundamental differences between OLTP RDBMS and OLAP timeseries. I'm not saying deploy N different DBs, I'm saying pick a good OLTP solution (Postgres) and a good OLAP solution.

My bad experience with TimescaleDB 3 years ago was that enabling compression required disabling the "dynamic labels" feature, which was a total nonstarter for us. A proper timeseries DB is designed to achieve great compression while also allowing flexibility of series. Hopefully Timescale will/has fixed that without adding another drastic perf tradeoff, but given how Postgres is architected for OLTP I would be surprised.


Can you say more about "dynamic labels"? Do you just mean that as you evolve, you want to add a new type of "key-value" pair?

The most common approach here is just to store the set of "dynamic" labels in JSON, which can be evolved arbitrarily.
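A minimal sketch of that approach, with hypothetical table and column names:

```sql
-- Free-form labels live in a JSONB column, so new label keys need no
-- schema migration.
CREATE TABLE metrics (
    ts     TIMESTAMPTZ NOT NULL,
    name   TEXT        NOT NULL,
    value  DOUBLE PRECISION,
    labels JSONB
);
SELECT create_hypertable('metrics', 'ts');

-- A never-before-seen label key is just a new JSON key:
INSERT INTO metrics VALUES
    (now(), 'http_requests', 1, '{"region": "eu", "canary": "true"}');
```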

And we've found that this type of data actually compresses quite well in practice.

Also regarding compression, Timescale supports transparent mutability on compressed data, so you can directly INSERT/UPDATE/UPSERT/DELETE into compressed data. Under the covers, it's doing smart optimizations to manage how it asynchronously maps individual mutations into segment level operations to decompress/recompress.

(Timescale cofounder)


>My bad experience with TimescaleDB 3 years ago was that enabling compression required disabling the "dynamic labels" feature, which was a total nonstarter for us.

What is the "dynamic labels" feature? Is it a part of Postgres or Timescale?


It's a Timescale feature that allows you to "just insert" metrics without first doing a schema migration to make sure a possible new label schema is supported. I.e. a feature that is taken for granted in native timeseries databases, which don't have to work around RDBMS schemas.

I assume it's doing an automatic ALTER TABLE when necessary, which modifies each row and somehow breaks compression across the sharded tables. Or at least an automatic re-compression would cause massive latency on insert that they wanted to avoid.


You threw out a database because it didn’t offer compression in your specific use case? That’s it?

Just solve compression on the block level, why are you so specific about it happening in the database? It’s probably one of the least interesting feature comparisons when betting on which database to trust.


Not all compression is created equally. Timeseries data has well-defined characteristics that generic block level compression doesn't understand. It's a great example where application-level compression is worthwhile. The proof is in the pudding.


A reason I would still bias towards postgres is the maturity of managed solutions (including decoupled compute/storage solutions like Aurora and AlloyDB).

Are managed "proper OLAP DB" solutions competitive with managed RDBMS from a price and ease of use standpoint?


Using TimescaleDB from a managed provider is limited, unless of course that provider is Timescale. Other managed providers are only permitted to use the TimescaleDB Apache 2 Edition.

This link has a comparison of features[1].

[1] https://docs.timescale.com/about/latest/timescaledb-editions...


Another alternative to Timescale could be Hydra. Haven't tried it myself, but the promise of columnar tables seems wildly useful.


What are you migrating to?


It’s funny how for the longest time, I was upset with how slowly the web moved. At times I wished they wouldn’t care as much about backwards compatibility.

But now with these VC-funded tech products that have spawned over the last 5-7 years, who have a move-fast-and-break-things attitude, I’m seeing the benefits of the old approach.

I suppose it’s all a matter of trade offs, as with all things, and there’s no silver bullet.


We just left it. Too many changes, the new query language is incomprehensible to drive-by graphing, and the rest of the industry seems to be building around PromQL/Prometheus.

Victoriametrics so far works very well.


I'm honestly surprised the CTO is still employed.


serious question on behalf of the uninformed: why? I feel that society encourages people to double down and be consistent even when they are armed with better information. we could be better if we didn't have to stick to the one true path, no?


He is a founder.


Been using both 1.x and 2.x for telemetry (OSS & paid both). I am pretty excited about 3.x's interoperability. Archiving to standard data formats makes the data science team's jobs easier, and with a more standard ANSI SQL query engine with JDBC support, plus high cardinality tags, it will greatly speed up front end development and analysis use cases.

As well, I am one of those folks that happens to find the Flux query language powerful, but it's not easy enough for folks to just make that jump from SQL. Flux is much closer to Splunk's search language. It is good at what it does. FluxQL doesn't even have date parsing (which is really odd for a time series query language), but FlightSQL in 3.x seems to be more complete.


Yes, I think v3 is pretty solid, and it's nice that they are still supporting v1 and v2. But the "migration" from v1 to v2 was the painful part. Not because it was too hard to migrate (I guess you don't even have to, since it's still supported), but because it introduced a very different approach that was supposed to be the future of Influx and was then basically dropped in the next release. Some commitment towards v3 might help in that regard. As you said, Flux is powerful and took some time to get used to, but it's now basically useless even if you took the time to get into it.

I like that they are converging towards SQL, but at the same time it's a bit like going back to square one. They seem more convinced about going full SQL this time though, but yeah

Just searching for this, I stumbled on this documentation page that illustrates the point very well:

https://docs.influxdata.com/influxdb/v1/query_language/

On that same page (about the original InfluxQL in v1), there is a deprecation notice for v1 stating that v2 is the stable version, implying that InfluxQL is not recommended. And a pop-up notice stating that v2 (Flux) is basically deprecated and just in maintenance mode, and that you should use InfluxQL. But as I said in my earlier comment, I guess in some ways that's better than being too rigid and sticking with bad or less ideal technical decisions.


For v3 we're supporting InfluxQL natively in addition to SQL. The InfluxQL implementation is actually just a front end on top of DataFusion, the SQL engine we use.

We really wanted to bring Flux along too, but found that it was too difficult in the near term to have it work well with v3. We spent a bunch of time building a gRPC API that Flux uses to talk to v3 (the same thing we have in our Cloud v2 product), but that API was designed with the previous storage engine in mind. It ended up being brittle and performed very poorly.

So at this point the long term supported languages are InfluxQL and SQL, but we're continuing to support Flux for our customers.


I've always had a soft spot for influxdb after using it for a self hosted datadog/newrelic etc solution many (6+) years ago with great success. Still use it in conjunction with telegraf and grafana for personal project monitoring, but I've not brought myself to upgrade from the 1.x series.

Hopefully it's improved, but last time I tried upgrading I found the UX in Grafana to be subpar on the newer versions; as I recall, you lost the autocomplete/UI to build your queries. Obviously Grafana is its own project, but it feels like they (Influx) should invest more resources in areas like this to encourage people to upgrade: if you're going to do major upgrades, make sure they have feature parity.


Like you, I've stuck with Influx v1, Telegraf and Grafana. My policy is to upgrade only when there are significant reasons to. When I evaluated InfluxDB 2, there were no major reasons for me to switch. Of course, the data ingested in my case is relatively small. YMMV.

I looked at TimescaleDB but at the time there was no easy way to get data from Telegraf to TimescaleDB. Telegraf finally merged code that allows writes to Postgres databases, but it took like 3 years to do that.

Ultimately, I still stuck with InfluxDB v1 because sending data to it via the InfluxDB line protocol is so simple. I have a couple of bash scripts that use awk to transform command output to Influx line protocol and send it to InfluxDB. It's just so simple. I love it.
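A hedged sketch of that kind of pipeline; the measurement, tags, fields, and database name here are made up for illustration, and the POST target is the standard v1 /write endpoint:

```shell
# Turn `df` output into InfluxDB line protocol: measurement,tags fields
df -P / | awk -v h="$(hostname)" 'NR > 1 {
    # $3 = used KB, $4 = available KB, $6 = mount point
    printf "disk,host=%s,mount=%s used=%si,free=%si\n", h, $6, $3, $4
}'
# Then ship it to InfluxDB v1, e.g.:
#   ... | curl -s -XPOST 'http://localhost:8086/write?db=telemetry' --data-binary @-
```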

I love learning about new things, but InfluxDB v1 keeps working fine, so I may not switch from it until something forces me to.


Was very similar for me. No good reason to go v2; the new query language, on top of sucking for anyone that only uses it once every few weeks, also wasn't really supported well in Grafana.

I ended up trying VictoriaMetrics by near accident, as InfluxDB didn't like something on my Raspberry Pi, and honestly it has been pretty painless. It is a Prometheus-like stack, which means you can use any PromQL-compatible things with it. There is an "all in one" binary, and a version split by function.

VM has tools to migrate from InfluxDB v1. I ended up just sticking the old InfluxDB data in one database, as I wanted to change the format of what I write to it along with the migration.

> Ultimately, I still stuck with InfluxDB v1 because sending data to it via the InfluxDB line protocol is so simple. I have a couple of bash scripts that use awk to transform command output to Influx line protocol and send it to InfluxDB. It's just so simple. I love it.

It also has an agent whose job is to convert from various protocols and do the scraping; that includes InfluxDB and a few other popular protocols.


I've heard good things about VictoriaMetrics. I'll have to carve out some time to check it out.

Thanks for sharing your experience.


have you given QuestDB a try? it includes its own implementation of the InfluxDB Line Protocol, adds SQL for queries, and can sustain a higher ingestion rate without high-cardinality limitations


I delayed upgrading to flux and finally bit the bullet this summer, and a month later read the announcement deprecating it.

Next time around I'm going to give TimescaleDB a look.


>they like moving fast

They are always… in flux * sun glasses on*


We moved from InfluxDB to Prometheus for this reason. InfluxDB is far more powerful, but ain't nobody got time to fix all the graphs in Grafana or learn the very mathematical-looking QL.

If we had dedicated personnel to manage our monitoring we might have stuck with it.


Sounds like tech I wouldn't wanna depend on.


We're trying to make the transition from v1 to v3 easier by bringing the write and query APIs from that version forward. We wanted to do the same for Flux, but found it was too difficult in the near term. We might be able to do something in the future, but for now we're focused on making core improvements to the v3 engine.

We'll have data migration tools for v1 and v2 into v3 later this year/early next.


Thanks for your reply! Dumb question (I couldn't find a definitive answer) but will v3 InfluxQL be compatible with v1? Is there an article about the changes between v1 and v3?


Yes, the goal was that for anyone with Grafana dashboards or queries elsewhere, they wouldn't have to rewrite them. Just point at v3 and pretend that it's a v1 database (use the v1 API).

But a few things aren't there: continuous queries, SELECT INTO, and anything that modifies data.


Why would you find that cool? It's anything but unless it is a personal project. If people depend on your work, that is irresponsible.


I guess I'd rather have that than ossifying on a completely flawed architecture. Apparently flux was kind of a dead end, and while it's super risky and illustrates issues in decision making, it's still better than just doubling down on something that their own team consider to be futureless or too flawed.


The backstory here is they were doing a rewrite anyways, for reasons that had not much to do with languages; they expected to write some C++ for the new version. Rust was the right call for them.


This was discussed on HN at the time (2020):

https://news.ycombinator.com/item?id=25049253

At some point HN is going to have to decide if it's the Rust subreddit or a news site.


Rust is one of the few languages that have a chance to climb out of the hobbyist/academic/ultra-niche range, so it's interesting for me to hear about developments towards the direction of reaching mainstream status. I'd say the same thing about Zig but with less strength.

After that nobody wants to hear about Java.


I think they just completed the transition which is why it came up again.


I started using HN because I'm a rust fanboy and there was a lot of rust content (same with lobsters). I'm glad to say there is a lot of other HN content that interests me, I might never have known.

Funny enough, in contrast to when I joined, the pendulum seems to have swung, and comments disparaging rust seem to be en vogue.


The issue is that InfluxDB is an infrastructure product. Changing the core impacts the way users interact with the product. If Figma decided to change their backend, it could be transparent to users.

Opinions could be different if first they implemented a complete compatibility layer, Flux included, prior to making the migration.


Would be great to see an in-depth blog post by Andrew and team about Rust, the bad and the good. They didn't just build a system but one that was optimized for performance. What were the major challenges during the rewrite? Have you optimized CI build times?


Is there a comparison of the new InfluxDB against https://clickhouse.com?

I ask because ClickHouse is quite hot at the moment from my experience in consulting and that seems to be reflected in Google Trends [1].

And there are some startups relying on ClickHouse for their log/monitoring products like https://signoz.io and https://hyperdx.io.

[1] https://trends.google.com/trends/explore?date=all&q=ClickHou...


https://github.com/influxdata/influxdb

The non-reddit link target


Yeah but the Reddit link is very relevant for the cto’s comments on the change.


It's weird because I got here from a link on lemmy.


This is intriguing. I wonder how this new Influx engine competes in terms of performance with VictoriaMetrics (which is written in Go and really fast)?

They moved their entire stack from Go to Rust, rewrote the system from the ground up, and spent a lot of time on it; I guess this was a big cost.

Is it worth it?


If I'm reading this [0] right, there will be no standalone OSS InfluxDB 3.0 version. So there's no point in comparing. I also wonder if publishing benchmarks of the ENT version by 3rd parties would even be allowed.

[0] https://www.influxdata.com/blog/the-plan-for-influxdb-3-0-op...


How big are y'all that you need InfluxDB?

For one database that receives 100M 600-byte JSON records/day, a single-node AWS PostgreSQL RDS instance is handling it effortlessly, and the DBA work is very part-time. We keep a year+ of summaries and 48 hours of detail, unloading the rest to S3 as Parquet files, queryable by Athena if we need. AFAICT, we're spending <$4K/mon all in, including backups.

p.s. a buddy at a top-3 TV streaming service is also doing this for logging all viewing activity, but with Aurora.


we're storing 120M records a day in influx 1.8, offloading cold data to S3, all on a single m5.xlarge instance that runs other backend services on the side. Less than 400 bucks per month overall aws cost, with lots of other VMs in there. We could use RDS too, but why change it to something more expensive if it already works...


thanks! killer example.

how are you dealing with failover?

how are you dealing with interactive reporting?

how large are your records?

is the s3 data in cold storage? how much is stored in S3 right now? is it warm-enough to query via (e.g.) athena?


update: it's actually more like $1500/mon and that includes an RDS proxy, failover instance and online backups with one-click restore.


I love InfluxDB 1.x and the TICK stack. They abandoned a beautiful piece of software to chase shiny things with 2.x... sad to see them do it again.

Someone needs to pick up the original ideas of 1.x since they can't seem to stay focused, as their market share is ripe for grabbing.


With 3.0 we've worked hard to pull in the v1 API. Both the v1 write and query endpoints are supported in v3. There may be some gaps here and there, but our goal is to make it so that all the things that worked with v1 will work with v3. The only two exceptions would be the subscription API that Kapacitor used and Continuous Queries. We warned people off from using both of those features in v1, as they didn't work well under load.

With v3, I prefer to think of it as us doubling down on core database performance and functionality. With v2 we tried to create this whole development platform. V3 brings our focus back to the core database, which I think will yield better results for everyone.


Can someone explain what InfluxData's market is? Or how they make/plan to make money?

If we're speaking about metrics, Prometheus just wins.


So it sounds like you're asking about our use cases. We have customers across almost every vertical. But what's most common are application and server monitoring, sensor data (industrial, rockets, satellites, etc.), and network monitoring.

Metrics is certainly one use case that people pay us for. With v3, we expect that real-time analytics and some more data warehousing types of use cases will become interesting. We always envisioned InfluxDB as a store for observational data of all kinds, not just metrics.

On how we make money, we sell our products. We have at this time:

- InfluxDB v1 Enterprise (a self-managed, clustered implementation of InfluxDB)

- InfluxDB v1 Cloud (Enterprise, but as a single-tenant managed service. We still run this for hundreds of customers)

- InfluxDB v2 Cloud (multi-tenant, usage based, we're running this for thousands of customers)

- InfluxDB v3 Serverless (multi-tenant, usage based v3)

- InfluxDB v3 Cloud Dedicated (single-tenant, resource based pricing)

- InfluxDB v3 Clustered (self-managed v3, clustered database)

We'll have single server versions in the future, but we're a bit off from that. Right now our focus is on continuing support for our v1 and v2 customers, and further developing our v3 products for new customers and customers that want to migrate over.


Does InfluxDB support event-based data models? For example, imagine something simple like an SQL database's query log. Each query is an event, with data such as latency, rows, block I/O and so on, and metadata such as the database, the full SQL query, and so on. This is the kind of use case where traditional "measurement"-based time series databases like Prometheus aren't a good fit, because you have huge column cardinality for the labels (one value per query). Meanwhile, more general-purpose columnar databases like ClickHouse and BigQuery have no issues with this type of data.
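Sketching the shape of such a query log as an event-per-row table, in ClickHouse-style SQL with hypothetical names:

```sql
-- One row per query event; the full query text is effectively unique
-- per event, which would explode cardinality in a labels-based TSDB.
CREATE TABLE query_log
(
    ts         DateTime64(3),
    database   LowCardinality(String),
    query      String,
    latency_ms Float64,
    rows_read  UInt64
)
ENGINE = MergeTree
ORDER BY (database, ts);
```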


InfluxDB v3 is built to handle this kind of data. It's a columnar database, using object storage and Parquet files for persistence.


Therein lies the problem. With shifting priorities and shifting strategies I don't see them actually knowing what their market is.


I see Influx pop up a lot in communities like Home Assistant for doing time series data collection. I'm using Postgres for that myself (not a great experience but I knew that when I was too lazy to find alternatives during setup) but people seem very pleased with its performance.

I imagine similar dashboard services that don't necessarily work well in Prometheus are a good market for these types of databases. Prometheus is nice, but I don't think it's suitable for all Influx use cases (and vice versa).


VictoriaMetrics all-in-one binary is honestly everything that I need in my home IoT things.

Prometheus-compatible interface for query, a bunch of ingest protocols, smaller memory usage than InfluxDB (v1; haven't tested v2 because the new language has less Grafana support). Options to scale, too.


Do you know if there is any particular advantage of Influx over Prometheus for IOT stuff? I also have noticed that Influx is way more popular in that space, but I don’t know whether the reasons are technical or just social (more tutorials, more shared experience, etc).


Influx is a full time series database. It's less opinionated than Prometheus.


As is tradition, hosted service: https://www.influxdata.com/influxdb-pricing/

Basically either you manage it yourself, or you pay them to do Serverless/Dedicated/Clustered hosted setup for you.



Yeah, I know, but there is a lot of competition in this field. Every cloud provider, Grafana, and other players have their own managed Prometheus-compatible solution.


This I guess is the mystery to me. Projects like this are great. But how are people funding them? (AKA how can I get my own paws on some investment $$ to work on cool database^Wother-systems-level tech? I'll even promise to try to make $$)


I am about the biggest Prometheus stan that you can find, but I will not mock or denigrate InfluxDB. They are a runner-up, but there is room in this market for runner-ups, and their feature set does match some things that Prometheus is not good at.


I think they missed one thing: Nobody wants to get locked in with the runner-up.

If they, on top of that, provided some compatibility layer for Prometheus/PromQL, like a few other competitors did, then prospective enterprise clients would have warm fuzzies that if they don't like it, they don't need to rewrite the entirety of their stack to work with something else. People could also use the existing ecosystem and "just plug it in", even replacing Prometheus instances they might have.

In the end, people want to ingest the metrics and display them in Grafana. They don't need another visualisation solution that has less support and documentation. They don't want to learn a new weird query language that is simultaneously more verbose and less readable than PromQL or even InfluxQL from v1.


Don't get me wrong, I don't denigrate InfluxDB. They have some interesting features that Prometheus doesn't have, I just don't think that these features can lead to mass adoption (and money).

Let's put it this way: is there any killer feature that can displace most of the Prometheus installations in their favor?


And it *has* to be a feature, not just "it's faster", because there are (much) faster time-series databases that support PromQL and are near-drop-in replacements in your infrastructure.


I think you'd find that Prometheus is way behind a number of solutions in the corporate world, including Graphite, DogStatsD, etc. Popularity on HN does not translate to real world popularity.


Most of the companies in the corporate world have more than one monitoring system, at least 3 in my experience: sysadmins have at least one, the network folks another one, and one for legacy platforms (mainframes, AS/400 and so on).

Plus, in the last years with the rise of Kubernetes, most corporates have at least one cluster with Prometheus monitoring it.

As always, based on my experience: Prometheus is very popular, but adoption is not so wide due to its "OSS nature": CTOs want someone to blame when things go wrong.


That's only a matter of time: it changes when they hire younger engineers who are familiar with modern monitoring systems and eager to apply their knowledge in practice.


What is it that you think prometheus offers over other solutions? It is more likely that the younger engineer is going to learn that companies don't care about what is popular on HN.


> What is it that you think prometheus offers over other solutions?

I like Prometheus and think it is a great piece of software. But even if we don't go into actual details, Prometheus is baked into Kubernetes monitoring [0]. That's the first monitoring system young engineers will meet when learning k8s. Also, k8s and Prometheus are both CNCF projects, which means both of them will be promoted in synergy with each other.

> It is more likely that the younger engineer is going to learn that companies don't care about what is popular on HN.

This is not what I think younger engineers do :)

[0] https://kubernetes.io/docs/tasks/debug/debug-cluster/resourc...


In the year of our lord 2023, Graphite, Statsd, etc. are anachronisms. Prometheus is wildly superior to those tools.


Prometheus only handles aggregated data, though, while with Influx you can store the events themselves, with labels etc. Prometheus is often good enough for standard metrics, but there are just things it can't handle.


> Prometheus only handles aggregated data, though.

That's not true. You're referring to the pull-based approach for metrics collection. It has its tradeoffs (like fixed-interval scraping), but it has a lot of benefits too (like higher reliability). Check the following link [0] from the VictoriaMetrics docs; it supports both push and pull approaches. Prometheus also gained push support this year, though.

However, the main difference between Prometheus-like systems (Thanos, Mimir, VictoriaMetrics) and more traditional DBs for time series like InfluxDB or TimescaleDB is that the former are designed to reflect a system's state, while the latter are designed to record a system's events. That's the main difference in paradigm, data model, and query language. There is a reason why PromQL is so easy in 99% of cases, and so complex and annoying when users want to express what they are used to in traditional databases.

I'm saying this because I went through creating a Grafana datasource for ClickHouse [1] and I felt how complicated it is to express even the most straightforward PromQL query in SQL, and vice versa.

If you'd like to learn more about the differences between common time-series plotting queries in PromQL and SQL, see my talk here [2].

[0] https://docs.victoriametrics.com/keyConcepts.html#write-data

[1] https://grafana.com/grafana/plugins/vertamedia-clickhouse-da...

[2] https://youtu.be/_zORxrgLtec?t=835


AFAIK the only difference is that in Influx a given row can have more than one value, so say "interface traffic" would be

    interface_if_octets host=router,instance=eth0,rx=123584,tx=213956
while in prometheus it would be

    interface_if_octets host=router,instance=eth0,type=rx 123584
    interface_if_octets host=router,instance=eth0,type=tx 213956
which, yes, is more compact in theory, but it gave me more annoyances than advantages when querying.

> While Prometheus is often good enough for standard metrics, there are just things it can't handle.

My experience is that just about anything made to ingest and analyze logs ends up mediocre for metrics, and vice versa. I don't think I've seen a single product that did both well or efficiently. So I'd rather have good metrics and just use ELK/Graylog/whatever else for logs.


Not really the biggest difference, I feel. In Prometheus, you only have the data you scrape. So let's say you scrape every minute. Then all you have is a counter per pod increasing from some value to something else.

With influx you can save every event. So you know exactly when it happened, the unique labels for that event etc. It's a completely different paradigm.

So in Prometheus you have

  myCounter,pod=1,time=20:23,value=1000
  myCounter,pod=2,time=20:23,value=500
  myCounter,pod=1,time=20:24,value=1100
  myCounter,pod=2,time=20:24,value=700
so all you know is that some event happened 100 times on pod1 and 200 times on pod2 the last minute. But with influx you could have a row for every single event. Of course that explodes the query time in comparison, but allows you to do much more with the data if needed.
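The gap can be sketched with a toy model (hypothetical types, not real Prometheus or Influx client APIs): scraped samples only let you recover deltas between scrapes, while per-event rows preserve detail the aggregate has erased.

```rust
// Toy illustration of the two storage paradigms discussed above.
// All types here are made up for the example.

/// Prometheus-style: at each scrape you only see the counter's running total.
struct Sample { pod: u32, minute: u32, total: u64 }

/// Influx-style: one row per event, with per-event detail preserved.
struct Event { pod: u32, second: u32, user: &'static str }

/// From samples you can only recover the delta between first and last scrape.
fn delta(samples: &[Sample], pod: u32) -> u64 {
    let vals: Vec<u64> = samples.iter()
        .filter(|s| s.pod == pod)
        .map(|s| s.total)
        .collect();
    vals.last().unwrap_or(&0) - vals.first().unwrap_or(&0)
}

/// Events let you answer questions the aggregate has erased,
/// e.g. "how many distinct users hit this pod?".
fn distinct_users(events: &[Event], pod: u32) -> usize {
    let mut users: Vec<&str> = events.iter()
        .filter(|e| e.pod == pod)
        .map(|e| e.user)
        .collect();
    users.sort();
    users.dedup();
    users.len()
}
```

With only the scraped samples, `distinct_users` is simply unanswerable; that is the paradigm difference in a nutshell.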


Influx is a complete solution, whereas Prometheus is just the time series database.


Maybe I'm missing something, but version 3 is just the database; they abandoned the TICK stack some time ago.


And I thank them for it. I was using the TICK stack a few years ago and the switch to TIG (Telegraf, InfluxDB, and Grafana) has been a breath of fresh air.

Telegraf and InfluxDB are solid, but Chronograf was behind Grafana in usability and features.

Kapacitor was pretty rough. The language it used was hard to write, the docs were barebones, somewhat confusing, and sometimes inaccurate. When I switched off of Kapacitor, the CPU usage on the server dropped significantly too. So I'm guessing Kapacitor wasn't too CPU friendly either.


Can you expand on what you think is missing?

Prometheus is not just the db itself, it’s the ecosystem around it. You’ve got service-discovery, alertmanager and basically every application in existence having a /metrics endpoint and some pre-made Grafana dashboard.


People are missing that InfluxDB is no longer just a time series database. This isn't just a language change but feature additions, and with the massive dependence on C++ libraries it's pretty foolish to continue using Go.


Downvoters, could you please explain your disagreement with the statement?


Haven't downvoted, but I can't make sense of your comment in the context of this article.

Maybe you have some context or background info I don't have?


As a side note, the Flux language that they introduced in v2 never seems to have taken off: there are only a few (2) public Grafana dashboards made with it, whereas the older InfluxQL language has around 1345 currently.

Unfortunately I was stupid enough to build my dashboard on Flux, which I'm really sorry to say I dislike quite a bit, while still wanting to be respectful of the people who built the stuff.

All that said, I think that influx is a great tool although I’m mostly using it for personal projects and haven’t run anything at scale.


I tried it and it was more complex to do anything compared to PromQL.

Like, it looks super powerful for complex queries I will never need to make...


Yes, and in another HN post the Influx founder stated (paraphrasing) that Flux is done and it's all about SQL compatibility now.


Focus more on making money and less on replatforming for a second time.


Replatforming to chase the hype cycle and sell to non-technical CTOs can be an effective money-making strategy.


Rust is not hype.


My problem with Rust is that compilation is too slow, as is downloading the gazillion crates needed to go anywhere. When I heard it was going to replace C I was expecting similar build times, but in reality Rust builds much slower than C++.

With C/C++ (and CMake + Ninja) it seemed we were finally getting to a point where incremental builds would complete before hitting the 400ms attention span "Doherty Threshold", and now it seems we are going back to the days of having to spend our time sword-fighting (xkcd) during slow builds.


It's way better compared to the early days and we have incremental compilation in dev. I don't consider it a problem anymore.

People got the memo that slapping Serialize on everything has a cost, and unless you're pulling in huge native dependencies that require long compile times, it's pretty snappy.


Seems like they have their priorities right. /s

https://news.ycombinator.com/item?id=36657829


This is really surprising for a commercial company. Not the choice of language itself, but the fact that they prioritize such rewrites over features, similar to the concerns from other commenters. They mention "performance" and "garbage collector" and "error handling", which are almost purely technical details.

A company can make good money for quite a while as long as the performance is "good enough", and usually focuses on adding features instead of worrying about any rewriting/re-architecting until there is a bottleneck or issues start to significantly slow down development. How "successful" such a rewrite is remains to be seen, but this is risky for most companies/products.


In this case, the features we kept getting asked for by our customers necessitated a change in underlying database architecture. I talk about that quite a bit in the reddit thread.

I totally agree that a rewrite is risky. It's not something I'd choose to do again, but at the time we didn't really see any way around rewriting the bulk of the database (even if we kept it implemented in Go).

Using Rust and the Arrow ecosystem of projects (Parquet, DataFusion, Flight) meant that there were a ton of things we didn't have to do from scratch. One of our staff engineers, Andrew Lamb, has called it a toolkit for building databases. Thanks in part to his contributions, I think he's right.


What is meant by separating compute from storage? This keeps being mentioned as if it were some new paradigm shift so I assume there's a non-obvious situation.


A traditional monolithic database assumes that you have locally attached storage. All of your ingest, indexing and query processing happens on the machine with that storage (i.e. your compute and storage live together). The cloud kind of complicated things with EBS and high IOPS network storage, but generally, those work kind of the same way. The volume is mounted on a single machine that uses it.

When people talk about separating compute from storage, they mean pulling compute heavy tasks like query, ingest, indexing, and compaction apart and using a shared storage tier that many systems can talk to. Usually this is object storage paired with some sort of catalog (kept in either object storage or some other store or API).
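The separation can be sketched in a few lines (a toy model, not InfluxDB's actual code): compute roles become stateless and coordinate only through a shared, S3-like storage tier.

```rust
// Hypothetical sketch of separated compute and storage: stateless compute
// nodes share one object-store-like tier instead of owning local disks.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for an object store (think S3): shared, durable, dumb.
#[derive(Default)]
struct ObjectStore {
    objects: Mutex<HashMap<String, Vec<u8>>>,
}

impl ObjectStore {
    fn put(&self, key: &str, data: Vec<u8>) {
        self.objects.lock().unwrap().insert(key.to_string(), data);
    }
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.objects.lock().unwrap().get(key).cloned()
    }
}

/// Ingesters and queriers are separate processes that scale independently;
/// they only "meet" through the shared storage tier.
struct Ingester { store: Arc<ObjectStore> }
struct Querier { store: Arc<ObjectStore> }

impl Ingester {
    fn write_segment(&self, key: &str, points: &[u8]) {
        self.store.put(key, points.to_vec());
    }
}

impl Querier {
    fn read_segment(&self, key: &str) -> Option<Vec<u8>> {
        self.store.get(key)
    }
}
```

The point is that neither the ingester nor the querier holds the data itself, so you can add or kill either kind of node without moving any storage around.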

Snowflake popularized this approach in the data warehousing and OLAP space with great success. Their papers submitted to VLDB are great reads on the topic.


The first comment is from the Cofounder and he lists the _features_ that motivated a rewrite:

> Then there's the question of why we did a rewrite at all. We wanted to get at some important requirements:
>
> - Unlimited cardinality
> - Analytics queries against time series at the performance of a columnar DB
> - Use object store as the durability layer for historical data (i.e. separate compute from storage)
> - SQL and broader ecosystem compatibility
>
> All of that stuff taken together meant that we'd be rewriting most of the core of the database. ...


Sure, the rewrite definitely could have helped with releasing these new features, but the language transition itself takes non-trivial effort that could have gone into additional features instead. I haven't seen any evidence that such a language transition was desperately needed. Also, I said they deprioritized adding features, not that they stopped, mind you. The core question here is what kind of benefit we are seeing from changing the language.

If you are a fan of Rust and want to rewrite all your code just because of that and you are the CTO, whatever, what can I say. But the title here is a bit clickbait-y and I really don't see much reflection on the languages themselves or on how to balance the business needs for such a transition.


> So this isn't the approach I'd recommend for this kind of project, but we started fresh from scratch.

wow, the from scratch rewrite. I can't even imagine that for a major piece of software


The author was looking for an excuse to use Rust for something since 2018:

https://www.influxdata.com/blog/rust-can-be-difficult-to-lea...

The rewrite started in 2020... they rationalize now, but it's pretty clear they just really wanted Rust and found the reasons they list later as a post-decision justification. Nothing wrong with that, if you don't mind risking the future of your business on a risky rewrite... though if I had a job working on the Go code base and my employer suddenly announced we should drop everything and start a rewrite, so go learn Rust, which has a much smaller job pool in my area as far as I can see, I guess I would've been really pissed off and would leave as fast as possible.


To be fair, we didn't drop everything and do a rewrite. Over the last 3.5 years (the length of time for this project), our total engineering team has ranged from 50-90 people. For the first year it was me and two other people. Then for the 2 years following that it was 9 people total.

It wasn't until late last year that we made the decision to go all in on the rewrite and made that the focus of everyone in engineering. And we did that because we had 4 years of experience trying to get v2 and Flux to be successful, with modest results.

Most of the time we were developing this version, we were spending massively more engineering effort on developing v2 or maintaining v1 for our customers.


I think that was a risky move, but sometimes you need to take risks to get high rewards, so I agree it's sometimes the best move given your circumstances... I imagine you made the calculation that this was a risk worth taking... do you already know if that is the case? Are your developers "happier"? Customers praising improved performance? More people applying for jobs due to Rust? Fewer bugs? Easier to implement features?


So what's gonna happen to v1 and v2? Because the number of features you're offering in v3 is gonna be difficult to maintain compared to v1 and v2.


We're going to continue to support our customers on v1 and v2. We're building migration tooling over to v3 for those that want it. We have multiple years of transition ahead of us, we expect to have customers on all 3 versions for quite a while.


Thanks for the info


>excuse to use rust

Using Rust for any reason other than it being the best technical option is just fad chasing.


Smaller job pool?

How does learning Rust reduce your chances of getting a new job? Learning new stuff increases your chances, not decreases them.


It's definitely not something I'd do again. I'd warn off anyone from taking this approach. But the early results are looking good and I'm very excited about what this enables for the next few years ahead of us.


I'd love a blog or a write up on the rewriting process, lessons learnt etc. It would be awesome!


This is my dream job.


We're in the process of doing the opposite. Wish we had chosen the right tool for the job from day one.


tl;dr

- No garbage collector

- Fearless concurrency (thanks Rust compiler)

- Performance

- Error handling

- Crates

- they thought they were gonna use C++ and wanted interop (ended up not using C++?)

- ecosystem: Apache Arrow DataFusion

- "I thought that if we're going to rewrite most of the database anyway, we might as well do it in the best language choice in 2020"

But the real reason might be: "Rust good, Go bad" /s
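The "fearless concurrency" bullet can be made concrete with a minimal sketch (illustrative only, not InfluxDB code): shared mutable state has to go through types like `Arc<Mutex<..>>`, so data races become compile-time errors rather than production bugs.

```rust
// Summing chunks in parallel. The compiler forces shared mutable state
// behind Arc<Mutex<..>>; handing a plain `&mut u64` to the threads
// would be rejected at compile time.
use std::sync::{Arc, Mutex};
use std::thread;

fn parallel_sum(chunks: Vec<Vec<u64>>) -> u64 {
    let total = Arc::new(Mutex::new(0u64));
    let mut handles = Vec::new();
    for chunk in chunks {
        let total = Arc::clone(&total);
        handles.push(thread::spawn(move || {
            // Each thread sums its chunk locally, then takes the lock once.
            let sum: u64 = chunk.iter().sum();
            *total.lock().unwrap() += sum;
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let result = *total.lock().unwrap();
    result
}
```

Nothing here is clever; the point is that forgetting the lock is not a subtle bug you find in production, it is code that does not compile.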


Rust gets you on the frontpage!


Go does too. Those two feel like the most hyped languages currently on HN.


Go isn’t really hyped these days, Zig maybe. But Go is old reliable now (which is good).


No love for C#


C# is lovely, and with eventual first class AOT support I believe it will become more widespread


I love C# :)


No F# ?


Yeah, I'm sure they spent 3 years working on it to get on the front page for only a few hours. I'd get this argument maybe if you're talking about a pet project...


I still love Go, but I think Rust is a better fit for very performance sensitive systems software like a database.


You can write a database in Go, sure, but if you want to compete at the very top end, you need all the control. Go is nice in the way that Go is nice (if you even agree with that statement) precisely because it doesn't have that control. Unless it is a terminal goal for a database to be in Go simply to be in Go (BoltDB seems to fit into this category) or you don't care to compete at the very top end, I think it's a mistake to even start in Go.

To be honest, most of the "we started in Go and switched to Rust" stories read to me as "you should have always known that you should have started in Rust" (or C++ or something, though I'd choose Rust out of the viable set of languages here too). It has IMHO always been obvious that Go was not really viable here. I mean, sure, you can get farther than you could by starting with Python, but it's not a problem Go was even trying to solve.


Not sure I follow since there are very competitive tools written in Go such as https://victoriametrics.com for an example in this space.


Most "we started in Go" projects are a lot older than Rust itself. Rust 1.0 was only released in 2015, and the language only really became usable with the 2018 edition. C++ is not really a comparable language, it's not even memory safe.


What problem, according to you, is Go "trying" to solve?


If you're scaling horizontally, it's doubtful that the choice of Rust vs Go is going to make a difference.

Not enough, anyway, to make a complete rewrite more profitable over adding features.

After all, it's not as if Influx was plagued by concurrency bugs, was it?

A move to a language safer than Go just doesn't seem worth it, and there's little practical performance gain when you can just throw more hardware at that tiny difference.


While one can bring a lot of nuance to the conversation, most imperative languages do not have the &/&mut separation that Rust does and are therefore bad. &/&mut is basically essential to writing correct code in an imperative language.
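The rule in one sketch: any number of shared borrows (`&`), or exactly one mutable borrow (`&mut`), but never both alive at once; the compiler enforces it.

```rust
// Minimal illustration of the &/&mut separation. The live code compiles
// because the borrows never overlap; the commented-out version is the
// aliasing bug the compiler rejects.
fn demo() -> (i32, i32) {
    let mut v = vec![1, 2, 3];

    let first = v[0]; // copy out through a (temporary) shared view
    v.push(4);        // mutate afterwards: fine, borrows don't overlap

    // This would NOT compile: a shared borrow held across a mutation.
    //
    //   let first_ref = &v[0];
    //   v.push(5);            // error[E0502]: cannot borrow `v` as mutable
    //   println!("{first_ref}");

    (first, v.len() as i32)
}
```

In C++ or Go the commented-out version compiles happily, and whether it is a bug depends on invariants only the programmer knows; in Rust the aliasing question is settled by the type system.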


Is it possible to do HA with Influxdb OSS yet?


That shift is what lost me.

Admittedly I was at a shop that wasn't likely to shift to an enterprise version, so it wasn't a direct economic loss on their part, but I definitely stopped thinking of it as a solution to consider.


HA is a hosted/enterprise option only (paid).


How can they know what is defined behavior and what is undefined? How can it be proven? Rust has no spec and is therefore all undefined behavior.

</cope>



