
You should write something about this…


Great idea, poor naming. If you’re aiming for a standard of sorts, tying it to a specific piece of software by reusing its name feels counterproductive.

“Ducklake DuckDB extension” really rolls off the tongue /s.


True. The format looks really open, so it would be better to have a more independent name. DuckLake is great as the name of the DuckDB extension, in my opinion, but for the table format something like SQLake or AcidLake might be more apt. The latter doesn't sound very appealing though, probably especially to ducks.


Quite a bummer, particularly because the main selling point is that it can be used with any SQL database (iiuc).


If I understand the Manifesto correctly, the metadata DB can be any SQL database, but the client needs to be DuckDB + the DuckLake extension, no?


*for now. The principle on the client side (especially read-only) should be the same as with Iceberg. Ideally there would be an Iceberg adapter for clients.


Good point. I think any DuckLake implementation for any SQL-compliant database will work.

Of course, the performance will depend on the database.


but that'd be real money, not the Monopoly money they used to buy Ive/Windsurf...


Not sure I would qualify sharding a DB that gets 1M QPS as straightforward. I agree with you that an org seems like a natural sharding key, but we know that at this scale nothing is ever really straightforward, especially when it's your first rodeo.


> Not sure I would qualify sharding a DB that gets 1M QPS as straightforward.

Sharding at the application layer (basically figuring out the shard from the org/user in your application code before interacting with the DB) will scale to any QPS rate. This is what I was referring to.
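A minimal sketch of the idea, using in-memory SQLite databases as stand-ins for real physical shards (the names and shard count are made up; a real setup would also need connection pooling and a resharding story):

    import hashlib
    import sqlite3

    SHARD_COUNT = 4  # arbitrary for the sketch; real setups fix this up front

    # Stand-in shards: in-memory SQLite DBs instead of physical databases.
    shards = [sqlite3.connect(":memory:") for _ in range(SHARD_COUNT)]
    for db in shards:
        db.execute("CREATE TABLE users (org_id TEXT, user_id TEXT, name TEXT)")

    def shard_for_org(org_id: str) -> int:
        # Stable hash so a given org always routes to the same shard.
        digest = hashlib.sha256(org_id.encode()).digest()
        return int.from_bytes(digest[:8], "big") % SHARD_COUNT

    def insert_user(org_id: str, user_id: str, name: str) -> None:
        db = shards[shard_for_org(org_id)]
        db.execute("INSERT INTO users VALUES (?, ?, ?)", (org_id, user_id, name))

    def fetch_users(org_id: str):
        db = shards[shard_for_org(org_id)]
        return db.execute(
            "SELECT * FROM users WHERE org_id = ?", (org_id,)
        ).fetchall()

    insert_user("org-42", "u-1", "Ada")
    print(fetch_users("org-42"))  # every query for org-42 hits a single shard

Since each query touches exactly one shard, capacity scales roughly linearly with shard count.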


That's true, but that's also why you really should shard long before hitting that point...

If your company is growing at this insane rate, it should be obvious that eventually you must shard. And the longer you delay this, the more painful it will be to accomplish.


I just wish Meta would open source Scuba.


Or sampling :)


Sampling is lossy though


lossy and simpler.

IME, I've found sampling simpler to reason about, and with the sampling rate part of the message, deriving metrics from logs works pretty well.

The example in the article is a little contrived. Healthchecks often originate from multiple hosts, and/or the logs contain the remote address+port, making each log message effectively unique. So sure, one could parse the remote address into remote_address=192.168.12.23 remote_port=64780 and then decide to drop the port in the aggregation, but is the juice worth the squeeze?
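For what it's worth, the sampling-rate-in-the-message trick is simple enough to show in a few lines. A sketch (the field names are made up): each surviving log line records the rate it was kept at, and the consumer weights counts by the inverse.

    import json

    def estimated_count(log_lines) -> float:
        # A line kept at rate r stands in for ~1/r original events.
        total = 0.0
        for line in log_lines:
            event = json.loads(line)
            total += 1.0 / event.get("sample_rate", 1.0)
        return total

    lines = [
        '{"msg": "GET /health", "sample_rate": 0.01}',  # healthchecks, heavily sampled
        '{"msg": "GET /orders", "sample_rate": 1.0}',   # business traffic, kept in full
    ]
    print(estimated_count(lines))  # ~101.0: one line stands in for ~100 healthchecks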


If a service emits a log event, then that log event should be visible in your logging system. Basic stuff. Sampling fails this table-stakes requirement.


Typically, you store your most recent logs in full, and you can move to sampling for older logs (if you don't want to delete them outright).


It's reasonable to drop logs beyond some window of time -- a year, say -- but I'm not sure why you'd ever sample log events. Metric samples, maybe! Log data, no point.

But, in general, I think we agree -- all good!


> It's reasonable to drop logs beyond some window of time -- a year, say [...]

That's reasonable in a reasonable environment. Alas, I've worked in large legacy enterprises (banks etc.) where storage space is at much more of a premium, for reasons.

You are right that sampling naively works better for metrics.

For logs you can still sample, but in a saner way: instead of dropping each log line with an independent probability, you want correlation. E.g. for each log file, for each hour, flip only one weighted coin to decide whether to keep the whole thing.
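A sketch of that coin flip, with a deterministic hash standing in for the coin so the decision is reproducible (the chunking key here is an arbitrary choice):

    import hashlib

    def keep_chunk(file_name: str, hour: str, keep_prob: float) -> bool:
        # One "coin flip" per (file, hour) chunk: every line in the chunk
        # shares the same fate, so surviving chunks stay complete.
        digest = hashlib.sha256(f"{file_name}:{hour}".encode()).digest()
        coin = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
        return coin < keep_prob

    # Keep ~10% of hourly chunks; each kept chunk is intact and greppable.
    print(keep_chunk("api.log", "2024-11-01T12", keep_prob=0.1))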


Metrics can be sampled, because metrics by definition represent some statistically summarize-able set of observations of related events, whose properties are preserved even when those observations are aggregated over time. That aggregation doesn't result in loss of information. At worst, it results in loss of granularity.

That same key property is not true for logs. Logs cannot be aggregated without loss of information, by definition. This isn't up for debate.

You can collect logs into groups based on similar properties, and you can decide that some of those groups are more or less important, based on some set of heuristics or decisions you make in your system or platform. And you can decide that some of those groups can be dropped (sampled) according to some set of rules defined somewhere. But any and all of those decisions result in loss of information. You can say that the lost information isn't important or relevant, according to your own guidelines, and that's fine, but you're still losing information.


If you have finite storage, you need to do something. A very simple rule is to just drop everything older than a threshold. (Or another simple rule: you keep x GiB of information, and drop the oldest logs until you fit into that size limit.)

A slightly more complicated rule is: drop your logs with increasing probability as they age.

With any rule, you will lose information. Yes. Obviously. Sampling metrics also loses information.

The question is what's the best (or a good enough) trade-off for your use case.
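For illustration, a toy version of that age-weighted rule (the half-life is an arbitrary knob):

    import random

    def keep_prob(age_days: float, half_life_days: float = 30.0) -> float:
        # Survival probability halves every half_life_days.
        return 0.5 ** (age_days / half_life_days)

    def should_keep(age_days: float) -> bool:
        return random.random() < keep_prob(age_days)

    print(keep_prob(0))   # 1.0   -> fresh logs always kept
    print(keep_prob(30))  # 0.5
    print(keep_prob(90))  # 0.125 -> old logs mostly dropped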


> If you have finite storage, you need to do something.

Sure.

> A very simple rule is to just drop everything older than a threshold.

Yep, makes sense.

> (Or another simple rule: you keep x GiB of information, and drop the oldest logs until you fit into that size limit.)

Same thing, it seems like.

> A slightly more complicated rule is: drop your logs with increasing probability as they age.

This doesn't really make sense to me. Logs aren't probabilistic. If something happened on 2024-11-01T12:01:01Z, and I have logs for that timestamp, then I should be able to see that thing which happened at that time.

> Sampling metrics also loses information.

I mean, literally not, right? Metrics are explicitly the pillar of observability that can be aggregated over time without loss of information. You never need to sample metric data; you just aggregate at whatever layers exist in your system. You can roll up metric data from some input granularity D1 to some output granularity, e.g. 10·D1, and that "loses" information in some sense, I guess, but the information isn't really lost, it's just made coarser, or less specific. It's not in any way the same as sampling e.g. log data, which literally deletes information. Right?
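To make the roll-up point concrete, a toy re-bucketing of counter data (the bucket widths are arbitrary); the totals survive exactly, only the granularity coarsens:

    from collections import defaultdict

    # Per-minute request counts: minute offset -> count.
    per_minute = {0: 12, 1: 7, 3: 30, 11: 5, 14: 9}

    def roll_up(samples, factor):
        # Re-bucket counts into windows `factor` times wider; the sums
        # are exact, only the time resolution gets coarser.
        out = defaultdict(int)
        for minute, count in samples.items():
            out[minute // factor * factor] += count
        return dict(out)

    per_ten_minutes = roll_up(per_minute, 10)
    print(per_ten_minutes)                # {0: 49, 10: 14}
    print(sum(per_minute.values()))       # 63
    print(sum(per_ten_minutes.values()))  # 63: nothing deleted, just coarser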


Some serious engineering here. Kudos!


At a certain scale, exact computations (p50 for instance) become impractical. I’ve had great luck switching to approximate calculations with guaranteed error bounds.

An approachable paper on the topic is "Effective Computation of Biased Quantiles over Data Streams" http://dimacs.rutgers.edu/%7Egraham/pubs/papers/bquant-icde....
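This isn't the paper's biased-quantiles algorithm (which gives deterministic error guarantees), but a minimal reservoir-sampling sketch shows the flavor of trading exactness for bounded memory; the guarantees here are only probabilistic:

    import random

    class QuantileSketch:
        # Uniform reservoir sample: approximate quantiles in O(capacity) memory.

        def __init__(self, capacity: int = 1024):
            self.capacity = capacity
            self.reservoir = []
            self.seen = 0

        def add(self, value: float) -> None:
            self.seen += 1
            if len(self.reservoir) < self.capacity:
                self.reservoir.append(value)
            else:
                # Classic reservoir sampling: replace with decreasing probability.
                j = random.randrange(self.seen)
                if j < self.capacity:
                    self.reservoir[j] = value

        def quantile(self, q: float) -> float:
            ordered = sorted(self.reservoir)
            return ordered[min(int(q * len(ordered)), len(ordered) - 1)]

    sketch = QuantileSketch()
    for _ in range(1_000_000):
        sketch.add(random.expovariate(1.0))
    print(sketch.quantile(0.5))  # ~0.693; the true p50 of Exp(1) is ln 2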


The quantile() function in ClickHouse is also approximate, though it uses a simpler reservoir-sampling model than GK; quantileGK() is also available. quantileExact() exists, but, as you point out, it becomes impractical.


It'll work. ClickHouse even has experimental support for storing Prometheus metrics natively. A big missing piece is alerting.


ClickHouse is great for logs and traces; for metrics, however, it is still in an early phase. ClickHouse is also a general-purpose, real-time analytics database (see clickhouse.com), whereas Oodle is built specifically for end-to-end metrics observability.


Curious to know what you mean by "early phase" here.


ClickHouse recently gained support for the TimeSeries table engine [1]. It is marked as experimental, so yes, early stage. This engine is quite interesting: data can be ingested via the Prometheus remote-write protocol and read back via the Prometheus remote-read protocol. But reading back is the weakest part here, because remote read requires sending blocks of data back to Prometheus, which unpacks those blocks and does the filtering and transformations on its own. As you can see, this doesn't allow leveraging the true power of ClickHouse: query performance.

Yes, you can use SQL to read metrics directly from ClickHouse tables. However, many people prefer the simplicity of PromQL to the flexibility of SQL. So until ClickHouse gets native PromQL support, it remains early-stage for metrics.

[1] https://clickhouse.com/docs/en/engines/table-engines/special...


Not a fan of bad-mouthing other offerings. As @iampims says, alerting is a big missing piece. ClickHouse is also a general-purpose database for many use cases, including analytics, financial services, ML & gen AI, fraud, and observability.


Heroku is still around…

…otherwise you can try Render, Fly.io, Google Cloud Run, Railway, etc.

