We should probably be having a thread about that actual release, so I've re-upped https://news.ycombinator.com/item?id=43200793, will move most of the comments thither, and will post links to this blog post and the other one that people have been referencing.
That makes sense and I didn't mean to imply that there was anything wrong with either your post or submitting it to HN! Both are good. It's just that it makes more sense for the community to first discuss the main thing itself.
> Using smallpond and 3FS depends largely on your data size and infrastructure:
> Under 10TB: smallpond is likely unnecessary unless you have very specific distributed computing needs. A single-node DuckDB instance or simpler storage solutions will be simpler and possibly more performant.
> 10TB to 1PB: smallpond begins to shine. You'd set up a cluster with several nodes, leveraging 3FS or another fast storage backend to achieve rapid parallel processing.
> Over 1PB (Petabyte-Scale): smallpond and 3FS were explicitly designed to handle massive datasets. At this scale, you'd need to deploy a larger cluster with substantial infrastructure investments.
Makes it very easy to determine if this would be useful for me and how much work I would expect to do to use it.
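For what it's worth, the quoted rule of thumb can be written as a tiny lookup. This is only a sketch: the function name `suggest_setup`, the decimal conversion (1 PB = 1000 TB), and which tier the exact boundary values fall into are my assumptions, not something the article specifies.

```python
def suggest_setup(size_tb: float) -> str:
    """Map a dataset size in TB to the tier from the quoted guidance.

    Hypothetical helper: thresholds come from the quote, but treating
    1 PB as 1000 TB and the boundary handling are assumptions.
    """
    if size_tb < 0:
        raise ValueError("size_tb must be non-negative")
    if size_tb < 10:
        # Under 10TB: the quote says smallpond is likely unnecessary.
        return "single-node DuckDB"
    if size_tb <= 1000:
        # 10TB to 1PB: a cluster with several nodes.
        return "smallpond on a cluster of several nodes"
    # Over 1PB: larger cluster, substantial infrastructure.
    return "smallpond + 3FS on a larger cluster"


for size in (2, 50, 5000):
    print(f"{size} TB -> {suggest_setup(size)}")
```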
IMO means “in my opinion.” I used that phrase to express how the following statement is my opinion and not a universal truth. My “audience” in this case is myself.
I do that because otherwise there’s always a comment saying how things like “obvious” can be subjective.
I also used the word “pretty” to, again, soften the word “obvious” so that readers don’t think that it’s a universal truth.
with some "no s, sherlock" on the ">1PB will require additional infra."
go on...
Like people talking about 1 Gbit iSCSI, where no one thought to say that 120 MB/s, which is technically slower than ATA/133 (which came out twenty years ago), might be the bottleneck. Obviously 10 Gbit will be "as fast as a local drive"!
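The arithmetic behind that comparison, as a quick sketch. The helper name `gbit_to_mb_per_s` is made up for illustration, and it computes the raw line rate only; protocol overhead is ignored, which is why the real-world figure lands nearer 120 MB/s than 125 MB/s.

```python
def gbit_to_mb_per_s(gbit: float) -> float:
    """Convert a link speed in Gbit/s to raw MB/s (decimal units, no overhead)."""
    return gbit * 1000 / 8


# ATA/133's nominal interface rate, for comparison.
ATA_133_MB_PER_S = 133

for link_gbit in (1, 10):
    raw = gbit_to_mb_per_s(link_gbit)
    verdict = "slower" if raw < ATA_133_MB_PER_S else "faster"
    print(f"{link_gbit} Gbit/s ~ {raw:.0f} MB/s raw, {verdict} than ATA/133")
```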
Yes, exactly right! This means you need to buy additional hardware, like network cards[0], and possibly GBICs and fiber optics.
I had a Chinese co-worker and something like this was actually his style of writing, with no use of AI. I know because I was sitting next to him a few times when he was writing documents.
Some was AI-generated, but I made sure everything was accurate. I'd normally rewrite everything, but I wrote this quickly before I had to leave the house. Didn't think it'd be on the front page!
Yeah I reeeaaally want to see benchmarks! Single-node DuckDB is absolutely insane (as in fast) performance-wise, especially compared to something like Spark. There's been a lot of speed-focused work in the project and I don't know of any faster data processing (I'm not counting traditional SQL since a lot of the speed benefits there come from indexing etc. and essentially doing additional work ahead of time).
I guess it comes down to how well written the distributed workflows are. There's a lot to get wrong, but in theory it should be able to achieve very impressive numbers.
My reasoning behind this is Dask, which uses pandas under the hood, being capable of better benchmarks than Spark. I think this is partly down to some good optimisations, but also simply that pandas is faster than Spark's row-based model. DuckDB is, on some benchmarks, more than 10x faster than pandas. You can see where this is going...
Ok, fixed now. (Submitted title was "DeepSeek Drops Distributed DuckDB")
Edit: I've since changed the title above to the article title, in keeping with the site guidelines (https://news.ycombinator.com/newsguidelines.html). It has been taking me a while to figure out what we're looking at here!
"Drop" in the context of databases isn't even close to anything being released or launched. Drop = Delete. "Release" is a much better word for this context.
In denotation, "dropped" can be used equivalently to "released", yes; but in connotation, using "dropped" instead of "released" implies one of the following:
1. the particular release was sudden, unexpected, and not highly pre-advertised or post-advertised — as in an album being "dropped" by a band (where the band more often "releases" albums.) Usage of "dropped" here evokes the feeling that the releaser is casually "dropping" the thing in the public square and walking away, leaving it there to be studied. A band would release an album by going on tour selling it; or they might just drop an album on Spotify one day.
2. the particular release was a single limited production run / limited-time event — where people were anticipating something would be released at a certain specific time, but there was no advance statement from the releaser of exactly what people would be getting. Strong analogy with the NYE "ball drop" — the release is an event that people count down to or line up for. (Think: dropping a new limited-edition colorway of a product people ravenously collect — sneakers, Stanley cups, etc.)
3. the particular release was a bounded-in-size batch or "tranche" of production, all put out to be purchased at once, where "once they sell out, they sell out" for now, but with the expectation that the releaser is producing more; this will take time, during which the item will remain sold out. (Often, the item has actually been produced in quantity, and this limited dribbling-out and repeated fast selling-out is purely a marketing technique to induce hype and demand.) This usage isn't a figurative extension of the literal verb "drop", but rather a shortening of the word "airdrop", as in military resupply and/or NFTs. You would be more likely to see this phrased as "[X] dropped another [Y]" or "[X] dropped more [Y]"; or perhaps "there was a drop of [Y] today."
I think to be clearer it should have been written "DeepSeek Drops Distributed Version of DuckDB". Otherwise it looks like they were using DuckDB (the distributed one?) and now have something new or better that they're using instead.
After posting, I started thinking about how you could push Iceberg (or Delta) partitions into smallpond. Spinning up 3FS will be a lot of work, but distributing compute on an existing Iceberg catalog would be worth trying.