Understanding Smallpond and 3FS

ok123456 · 2025-03-02T18:12:45 1740939165

https://github.com/deepseek-ai/smallpond

dang · 2025-03-02T20:18:52 1740946732

We should probably be having a thread about that actual release, so I've re-upped https://news.ycombinator.com/item?id=43200793, will move most of the comments thither, and will post links to this blog post and the other one that people have been referencing.

mritchie712 · 2025-03-03T11:09:15 1741000155

The repo had already been posted. The reason I wrote the post is it's a bit hard to understand how you'd actually use smallpond for analytics.

When people see "duckdb", they're going to think they can slap it into their local analytics workflows, but it turns out that's not a good idea.

dang · 2025-03-04T03:02:39 1741057359

That makes sense and I didn't mean to imply that there was anything wrong with either your post or submitting it to HN! Both are good. It's just that it makes more sense for the community to first discuss the main thing itself.

westurner · 2025-03-02T18:24:36 1740939876

smallpond: https://github.com/deepseek-ai/smallpond :

> A lightweight data processing framework built on DuckDB and 3FS.

mritchie712 · 2025-03-02T18:37:00 1740940620

updated.

jauntywundrkind · 2025-03-02T18:12:40 1740939160

Smallpond. Runs on their RDMA-powered 3FS ("Fire-Flyer File System").

https://github.com/deepseek-ai/smallpond

https://news.ycombinator.com/item?id=43200793

I didn't find anything of value in this article.

Did enjoy https://mehdio.substack.com/p/duckdb-goes-distributed-deepse... some, which eventually talks about smallpond being built on Ray, and… Smallpond actually running multiple partitioned duckdb instances?! Wow.

memco · 2025-03-02T18:21:11 1740939671

Love this straightforward analysis of use cases:

> Using smallpond and 3FS depends largely on your data size and infrastructure:

> Under 10TB: smallpond is likely unnecessary unless you have very specific distributed computing needs. A single-node DuckDB instance or simpler storage solutions will be simpler and possibly more performant.

> 10TB to 1PB: smallpond begins to shine. You'd set up a cluster with several nodes, leveraging 3FS or another fast storage backend to achieve rapid parallel processing.

> Over 1PB (Petabyte-Scale): smallpond and 3FS were explicitly designed to handle massive datasets. At this scale, you'd need to deploy a larger cluster with substantial infrastructure investments.

Makes it very easy to determine if this would be useful for me and how much work I would expect to do to use it.
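For what it's worth, the quoted rule of thumb could be sketched as a tiny lookup. This is only an illustration: the function name `suggest_setup` is made up, and treating 1 PB as 1000 TB is my own assumption, not something taken from the article or from smallpond itself.

```python
def suggest_setup(dataset_tb: float) -> str:
    """Map a dataset size in TB to the article's rough tier (hypothetical helper)."""
    if dataset_tb < 0:
        raise ValueError("dataset size cannot be negative")
    if dataset_tb < 10:
        # Under 10TB: a single node is likely simpler and possibly faster.
        return "single-node DuckDB"
    if dataset_tb <= 1000:  # assumption: 1 PB = 1000 TB
        # 10TB to 1PB: several nodes plus 3FS or another fast storage backend.
        return "smallpond on a multi-node cluster with fast storage"
    # Over 1PB: larger cluster, substantial infrastructure investment.
    return "smallpond + 3FS on a larger cluster"


print(suggest_setup(2))
print(suggest_setup(250))
print(suggest_setup(5000))
```

The cutoffs are the article's, so treat them as guidance rather than measured break-even points.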

dartos · 2025-03-02T19:02:14 1740942134

I very much felt like that entire portion of the article was AI-generated, actually.

IMO pretty obvious, surface-level information and some prose on each bullet.

xixixao · 2025-03-02T19:28:54 1740943734

Saying something is “obvious” without specifying an audience is meaningless.

(because obviousness is subjective and depends on the knowledge, experience, and context of the audience)

dartos · 2025-03-02T21:40:07 1740951607

Notice the “IMO pretty” before the word “obvious”

IMO means “in my opinion.” I used that phrase to express how the following statement is my opinion and not a universal truth. My “audience” in this case is myself.

I do that because otherwise there’s always a comment saying how things like “obvious” can be subjective.

I also used the word “pretty” to, again, soften the word “obvious” so that readers don’t think that it’s a universal truth.

genewitch · 2025-03-02T19:18:16 1740943096

with some "no s, sherlock" on the ">1PB will require additional infra."

go on...

Like people talking about 1 Gbit iSCSI, where no one thought to say that 120 MB/s, which is technically slower than ATA/133 from twenty years ago, might be the bottleneck. Obviously 10 Gbit will be "as fast as a local drive"!

Yes, exactly right! This means you need to buy additional hardware, like network cards[0], and possibly GBICs and fiber optics.
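The back-of-envelope math, as a rough sketch: it uses decimal units and ignores protocol overhead (which is why real-world 1 Gbit lands nearer 120 MB/s than the theoretical 125). The helper name `link_mb_per_s` is just for illustration.

```python
def link_mb_per_s(gbit_per_s: float) -> float:
    """Theoretical ceiling in MB/s for a link speed in Gbit/s (decimal units, no overhead)."""
    return gbit_per_s * 1000 / 8  # 1 Gbit = 1000 Mbit, 8 bits per byte


ATA_133_MB_PER_S = 133  # nominal Ultra ATA/133 interface rate

for gbit in (1, 10):
    rate = link_mb_per_s(gbit)
    verdict = "slower" if rate < ATA_133_MB_PER_S else "faster"
    print(f"{gbit} Gbit/s is about {rate:.0f} MB/s, {verdict} than ATA/133")
```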

mritchie712 · 2025-03-02T19:42:11 1740944531

I updated the post. In this case, I meant "exotic" infra... e.g. 3FS isn't like adding more EC2 instances.

Adding EC2 instances is trivial; setting up 3FS is hard.

7thpower · 2025-03-02T19:36:22 1740944182

You’ve been wanting to get this off your chest for a while haven’t you.

fs111 · 2025-03-02T20:01:55 1740945715

The authors are Chinese so they may simply use AI to make it sound right in English

varispeed · 2025-03-02T20:07:58 1740946078

I had a Chinese co-worker and something like this was actually his style of writing, no use of AI. I know because I sat next to him a few times while he was writing documents.

mritchie712 · 2025-03-02T19:44:29 1740944669

some was AI generated, but I made sure everything was accurate. I'd normally rewrite everything, but I wrote this quickly before I had to leave the house. Didn't think it'd be on the front page!

dartos · 2025-03-02T21:53:00 1740952380

Not judging you for using AI for a post like this!

Don’t feel bad. I just didn’t think AI generated bullet points were as impressive as the comment I was replying to did.

jimmyl02 · 2025-03-02T18:42:33 1740940953

I wonder at which scale spark fits into this picture and what the tradeoffs / benefits would be

mritchie712 · 2025-03-02T18:46:12 1740941172

spark is certainly the incumbent for this sort of thing.

one benefit for me personally: you should be able to move from local dev to cloud more easily.

benrutter · 2025-03-02T19:45:46 1740944746

Yeah I reeeaaally want to see benchmarks! Single node duckdb is absolutely insane (as in fast) performance wise, especially compared to something like Spark. There's been a lot of speed focussed work in the project and I don't know of any faster data processing (I'm not counting traditional SQL since a lot of the speed benefits there come from indexing etc and essentially doing additional work ahead of time).

I guess it comes down to how well written the distributed workflows are, there's a lot to get wrong, but in theory it should be able to achieve very impressive numbers.

My reasoning behind this is Dask, which uses pandas under the hood and is capable of better benchmarks than Spark. I think this is partly down to some good optimisations, but also simply that pandas is faster than Spark's row-based model. DuckDB is, on some benchmarks, more than 10x faster than pandas; you can see where this is going...

DannyPage · 2025-03-02T18:11:57 1740939117

“Releases” is used in the article - instead of “drops” - and would be a clearer title.

dang · 2025-03-02T18:56:40 1740941800

Ok, fixed now. (Submitted title was "DeepSeek Drops Distributed DuckDB")

Edit: I've since changed the title above to the article title, in keeping with the site guidelines (https://news.ycombinator.com/newsguidelines.html). It has been taking me a while to figure out what we're looking at here!

conqrr · 2025-03-02T18:32:25 1740940345

Drop in the context of Databases isn't even close to anything being released or launched. Drop = Delete. Release is a much better word for this context.

joshuat · 2025-03-02T19:02:32 1740942152

Even in the context of an application stack - my initial read had me believing they were moving away from DuckDB

mritchie712 · 2025-03-02T18:34:50 1740940490

yeah, I thought drop was amusing in this case paired with the tautogram

freehorse · 2025-03-02T19:05:27 1740942327

It was, but people here prioritise lexical unambiguity over fun.

dboreham · 2025-03-02T18:14:07 1740939247

Not only clearer, but 180 degrees different in meaning.

4ndrewl · 2025-03-02T18:21:28 1740939688

I thought "dropped" these days meant released? Not helpful I know...

kaashif · 2025-03-02T18:25:10 1740939910

I was surprised because I thought the title meant they dropped support or something. Weird because I'd never heard of distributed DuckDB.

derefr · 2025-03-02T18:48:50 1740941330

In denotation, "dropped" can be used equivalently to "released", yes; but in connotation, using "dropped" instead of "released" implies either that:

1. the particular release was sudden, unexpected, and not highly pre-advertised or post-advertised — as in an album being "dropped" by a band (where the band more often "releases" albums.) Usage of "dropped" here evokes the feeling that the releaser is casually "dropping" the thing in the public square and walking away, leaving it there to be studied. A band would release an album by going on tour selling it; or they might just drop an album on Spotify one day.

2. the particular release was a single limited production run / limited-time event — where people were anticipating something would be released at a certain specific time, but there was no advance statement from the releaser of exactly what people would be getting. Strong analogy with the NYE "ball drop" — the release is an event that people count down to or line up for. (Think: dropping a new limited-edition colorway of a product people ravenously collect — sneakers, Stanley cups, etc.)

3. the particular release was a bounded-in-size batch or "tranche" of production, all put out to be purchased at once, where "once they sell out, they sell out" for now, with the expectation that the releaser is producing more but that this will take time, during which the item will remain sold out. (Often, the item has actually been produced in quantity, and this limited dribbling-out and repeated fast selling-out is purely a marketing technique to induce hype and demand.) This usage isn't a figurative extension of the literal verb "drop", but rather a shortening of the word "airdrop", as in military resupply and/or NFTs. You would be more likely to see this phrased as "[X] dropped another [Y]" or "[X] dropped more [Y]"; or perhaps "there was a drop of [Y] today."

SteveDR · 2025-03-02T18:30:30 1740940230

Yes, most young people would say an artist “dropped” new music instead of saying that they released new music. Still a bad title though

rvnx · 2025-03-02T18:40:15 1740940815

Dropped could mean abandoned

0xCMP · 2025-03-02T18:29:27 1740940167

I think to be clearer it would have been written "DeepSeek Drops Distributed version of DuckDB". Otherwise it looks like they used DuckDB (the distributed one?) and they have something new or better they're using now.

KaoruAoiShiho · 2025-03-02T18:29:53 1740940193

Dropped could also mean they used to use it but stopped, that's also pretty common parlance in software...

stavros · 2025-03-02T18:29:48 1740940188

Yes but then you lose the alliteration.

mritchie712 · 2025-03-02T18:32:35 1740940355

yes, sorry, I simply couldn't resist

djeastm · 2025-03-02T18:28:36 1740940116

This is one of my "Kids these days..." moments. I've been caught several times mistaking the meaning of this new slang.

BHSPitMonkey · 2025-03-02T19:02:40 1740942160

Not _so_ new:

- https://boards.straightdope.com/t/where-did-the-term-album-d... (2009)
- https://www.talkbass.com/threads/when-did-release-become-dro... (2013)

But it _has_ spread much faster outside of the music scene these last few years, e.g. describing software and products.

wigster · 2025-03-02T18:34:01 1740940441

drop should be un-dropped.

mritchie712 · 2025-03-02T18:34:02 1740940442

Sorry, I couldn't resist the tautogram.

farts_mckensy · 2025-03-02T18:34:33 1740940473

It's pretty clear what is meant by anyone under the age of 50.

ivandenysov · 2025-03-02T18:35:33 1740940533

I’m anyone and it wasn’t clear to me

farts_mckensy · 2025-03-02T19:35:48 1740944148

[flagged]

throitallaway · 2025-03-02T21:08:35 1740949715

Not everyone is immersed in pop culture, not everyone is a native English speaker, etc. It doesn't cost anything to be kind.

mritchie712 · 2025-03-02T18:58:49 1740941929

After posting, I started thinking about how you could push Iceberg (or delta) partitions into smallpond. Spinning up 3FS will be a lot of work, but distributing compute on an existing Iceberg catalog would be worth trying.

maknee · 2025-03-04T20:32:33 1741120353

What are the results from running smallpond and 3fs?

Are claims valid for <10TB, 10TB -> 1PB and over 1PB?

xnx · 2025-03-02T18:32:22 1740940342

"drops" seems to be a fairly recent contronym meaning both "released" and "discontinued".