
You should write something about this…


Great idea, poor naming. If you’re aiming for a standard of sorts, tying it to a specific piece of software by reusing its name feels counterproductive.

“Ducklake DuckDB extension” really rolls off the tongue /s.


True. The format looks really open, so it would be better to have a more independent name. DuckLake is great as the name of the DuckDB extension, in my opinion, but for the table format something like SQLake or AcidLake might be more apt. The latter doesn't sound very appealing though, probably especially to ducks.


Quite a bummer, particularly because the main selling point is that it can be used with any SQL database (iiuc).


If I understand the Manifesto correctly, the metadata DB can be any SQL database, but the client needs to be DuckDB + the DuckLake extension, no?


*for now. The principle on the client side (especially read-only) should be the same as with Iceberg. Ideally there would be an Iceberg adapter for clients.


Good point. I think any DuckLake implementation for any SQL-compliant database will work.

Of course, the performance will depend on the database.


but that'd be real money, not the Monopoly money they used to buy Ive/Windsurf...


Not sure I would qualify sharding a DB that gets 1M QPS as straightforward. I agree with you that an org seems like a natural sharding key, but we know that at this scale nothing is ever really straightforward, especially when it's your first rodeo.


> Not sure I would qualify sharding a DB that gets 1M QPS as straightforward.

Sharding at the application layer (basically figuring out the shard from the org/user in your application code before interacting with the DB) will scale to any QPS rate. This is what I was referring to.
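A minimal sketch of the idea, using in-memory SQLite databases as stand-ins for real physical shards (the names and shard count are made up; a real setup would also need connection pooling and a resharding story):

    import hashlib
    import sqlite3

    SHARD_COUNT = 4  # arbitrary for the sketch; real setups fix this up front

    # Stand-in shards: in-memory SQLite DBs instead of physical databases.
    shards = [sqlite3.connect(":memory:") for _ in range(SHARD_COUNT)]
    for db in shards:
        db.execute("CREATE TABLE users (org_id TEXT, user_id TEXT, name TEXT)")

    def shard_for_org(org_id: str) -> int:
        # Stable hash so a given org always routes to the same shard.
        digest = hashlib.sha256(org_id.encode()).digest()
        return int.from_bytes(digest[:8], "big") % SHARD_COUNT

    def insert_user(org_id: str, user_id: str, name: str) -> None:
        db = shards[shard_for_org(org_id)]
        db.execute("INSERT INTO users VALUES (?, ?, ?)", (org_id, user_id, name))

    def fetch_users(org_id: str):
        db = shards[shard_for_org(org_id)]
        return db.execute(
            "SELECT * FROM users WHERE org_id = ?", (org_id,)
        ).fetchall()

    insert_user("org-42", "u-1", "Ada")
    print(fetch_users("org-42"))  # every query for org-42 hits a single shard

Since each query touches exactly one shard, capacity scales roughly linearly with shard count.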


That's true, but that's also why you really should shard long before hitting that point...

If your company is growing at this insane rate, it should be obvious that eventually you must shard. And the longer you delay this, the more painful it will be to accomplish.


I just wish Meta would open source Scuba.


Or sampling :)


Sampling is lossy though


lossy and simpler.

IME, I've found sampling simpler to reason about, and with the sampling rate part of the message, deriving metrics from logs works pretty well.

The example in the article is a little contrived. Healthchecks often originate from multiple hosts, and/or the logs contain the remote address+port, making each log message effectively unique. So sure, one could parse the remote address into remote_address=192.168.12.23 remote_port=64780 and then decide to drop the port in the aggregation, but is the juice worth the squeeze?
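For what it's worth, the sampling-rate-in-the-message trick is simple enough to show in a few lines. A sketch (the field names are made up): each surviving log line records the rate it was kept at, and the consumer weights counts by the inverse.

    import json

    def estimated_count(log_lines) -> float:
        # A line kept at rate r stands in for ~1/r original events.
        total = 0.0
        for line in log_lines:
            event = json.loads(line)
            total += 1.0 / event.get("sample_rate", 1.0)
        return total

    lines = [
        '{"msg": "GET /health", "sample_rate": 0.01}',  # healthchecks, heavily sampled
        '{"msg": "GET /orders", "sample_rate": 1.0}',   # business traffic, kept in full
    ]
    print(estimated_count(lines))  # ~101.0: one line stands in for ~100 healthchecks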


If a service emits a log event, then that log event should be visible in your logging system. Basic stuff. Sampling fails this table-stakes requirement.


Typically, you store your most recent logs in full, and you can move to sampling for older logs (if you don't want to delete them outright).


It's reasonable to drop logs beyond some window of time -- a year, say -- but I'm not sure why you'd ever sample log events. Metric samples, maybe! Log data, no point.

But, in general, I think we agree -- all good!


> It's reasonable to drop logs beyond some window of time -- a year, say [...]

That's reasonable in a reasonable environment. Alas, I've worked in large legacy enterprises (banks etc.) where storage space is at much more of a premium, for reasons.

You are right that sampling naively works better for metrics.

For logs you can still sample, but in a saner way: instead of dropping each log line with an independent probability, you want correlation. E.g. for each log file, for each hour, flip only one weighted coin to decide whether to keep the whole thing.
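A sketch of that coin flip, with a deterministic hash standing in for the coin so the decision is reproducible (the chunking key here is an arbitrary choice):

    import hashlib

    def keep_chunk(file_name: str, hour: str, keep_prob: float) -> bool:
        # One "coin flip" per (file, hour) chunk: every line in the chunk
        # shares the same fate, so surviving chunks stay complete.
        digest = hashlib.sha256(f"{file_name}:{hour}".encode()).digest()
        coin = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
        return coin < keep_prob

    # Keep ~10% of hourly chunks; each kept chunk is intact and greppable.
    print(keep_chunk("api.log", "2024-11-01T12", keep_prob=0.1))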


Metrics can be sampled, because metrics by definition represent some statistically summarize-able set of observations of related events, whose properties are preserved even when those observations are aggregated over time. That aggregation doesn't result in loss of information. At worst, it results in loss of granularity.

That same key property is not true for logs. Logs cannot be aggregated without loss of information, by definition. This isn't up for debate.

You can collect logs into groups based on similar properties, and you can decide that some of those groups are more or less important, based on some set of heuristics or decisions you make in your system or platform. And you can decide that some of those groups can be dropped (sampled) according to some set of rules defined somewhere. But any and all of those decisions result in loss of information. You can say that the lost information isn't important or relevant, according to your own guidelines, and that's fine, but you're still losing information.


If you have finite storage, you need to do something. A very simple rule is to just drop everything older than a threshold. (Or another simple rule: you keep x GiB of information, and drop the oldest logs until you fit into that size limit.)

A slightly more complicated rule is: drop your logs with increasing probability as they age.

With any rule, you will lose information. Yes. Obviously. Sampling metrics also loses information.

The question is what's the best (or a good enough) trade-off for your use case.
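For illustration, a toy version of that age-weighted rule (the half-life is an arbitrary knob):

    import random

    def keep_prob(age_days: float, half_life_days: float = 30.0) -> float:
        # Survival probability halves every half_life_days.
        return 0.5 ** (age_days / half_life_days)

    def should_keep(age_days: float) -> bool:
        return random.random() < keep_prob(age_days)

    print(keep_prob(0))   # 1.0   -> fresh logs always kept
    print(keep_prob(30))  # 0.5
    print(keep_prob(90))  # 0.125 -> old logs mostly dropped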


> If you have finite storage, you need to do something.

Sure.

> A very simple rule is to just drop everything older than a threshold.

Yep, makes sense.

> (Or another simple rule: you keep x GiB of information, and drop the oldest logs until you fit into that size limit.)

Same thing, it seems like.

> A slightly more complicated rule is: drop your logs with increasing probability as they age.

This doesn't really make sense to me. Logs aren't probabilistic. If something happened on 2024-11-01T12:01:01Z, and I have logs for that timestamp, then I should be able to see that thing which happened at that time.

> Sampling metrics also loses information.

I mean, literally not, right? Metrics are explicitly the pillar of observability that can be aggregated over time without loss of information. You never need to sample metric data; you just aggregate at whatever layers exist in your system. You can roll up metric data from some input granularity D1 to some output granularity, e.g. 10·D1, and that "loses" information in some sense, I guess, but the information isn't really lost, it's just made coarser, or less specific. It's not in any way the same as sampling e.g. log data, which literally deletes information. Right?
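To make the roll-up point concrete, a toy re-bucketing of counter data (the bucket widths are arbitrary); the totals survive exactly, only the granularity coarsens:

    from collections import defaultdict

    # Per-minute request counts: minute offset -> count.
    per_minute = {0: 12, 1: 7, 3: 30, 11: 5, 14: 9}

    def roll_up(samples, factor):
        # Re-bucket counts into windows `factor` times wider; the sums
        # are exact, only the time resolution gets coarser.
        out = defaultdict(int)
        for minute, count in samples.items():
            out[minute // factor * factor] += count
        return dict(out)

    per_ten_minutes = roll_up(per_minute, 10)
    print(per_ten_minutes)                # {0: 49, 10: 14}
    print(sum(per_minute.values()))       # 63
    print(sum(per_ten_minutes.values()))  # 63: nothing deleted, just coarser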


Some serious engineering here. Kudos!


At a certain scale, exact computations (p50 for instance) become impractical. I’ve had great luck switching to approximate calculations with guaranteed error bounds.

An approachable paper on the topic is "Effective Computation of Biased Quantiles over Data Streams" http://dimacs.rutgers.edu/%7Egraham/pubs/papers/bquant-icde....
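This isn't the paper's biased-quantiles algorithm (which gives deterministic error guarantees), but a minimal reservoir-sampling sketch shows the flavor of trading exactness for bounded memory; the guarantees here are only probabilistic:

    import random

    class QuantileSketch:
        # Uniform reservoir sample: approximate quantiles in O(capacity) memory.

        def __init__(self, capacity: int = 1024):
            self.capacity = capacity
            self.reservoir = []
            self.seen = 0

        def add(self, value: float) -> None:
            self.seen += 1
            if len(self.reservoir) < self.capacity:
                self.reservoir.append(value)
            else:
                # Classic reservoir sampling: replace with decreasing probability.
                j = random.randrange(self.seen)
                if j < self.capacity:
                    self.reservoir[j] = value

        def quantile(self, q: float) -> float:
            ordered = sorted(self.reservoir)
            return ordered[min(int(q * len(ordered)), len(ordered) - 1)]

    sketch = QuantileSketch()
    for _ in range(1_000_000):
        sketch.add(random.expovariate(1.0))
    print(sketch.quantile(0.5))  # ~0.693; the true p50 of Exp(1) is ln 2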


The quantile() function in ClickHouse is also approximate, though it uses a simpler reservoir-sampling model than GK; quantileGK() is also available. quantileExact() exists, but, as you point out, it becomes impractical.


It'll work. ClickHouse even has experimental support for storing Prometheus metrics natively. A big missing piece is alerting.


ClickHouse is great for logs and traces; for metrics, however, it is still in an early phase. ClickHouse is also a general-purpose, real-time analytics database (see clickhouse.com), whereas Oodle is built specifically for end-to-end metrics observability.


Curious to know what you mean by "early phase" here.


ClickHouse recently gained support for the TimeSeries table engine [1]. It is marked as experimental, so yes, early stage. This engine is quite interesting: data can be ingested via the Prometheus remote-write protocol and read back via the Prometheus remote-read protocol. But reading back is the weakest part here, because remote read requires sending blocks of data back to Prometheus, which unpacks those blocks and does the filtering and transformations on its own. As you can see, this doesn't allow leveraging the true power of ClickHouse: query performance.

Yes, you can use SQL to read metrics directly from ClickHouse tables. However, many people prefer the simplicity of PromQL to the flexibility of SQL. So until ClickHouse gets native PromQL support, it remains early-stage for metrics.

[1] https://clickhouse.com/docs/en/engines/table-engines/special...


Not a fan of bad-mouthing other offerings. As @iampims says, alerting is a big missing piece. ClickHouse is also a general-purpose database for many use cases, including analytics, financial services, ML & gen AI, fraud, and observability.


Heroku is still around…

…otherwise you can try Render, Fly.io, Google Cloud Run, Railway, etc.

