Btreefs generates executable code at runtime to unpack btree nodes

kzrdude · on June 25, 2023

That particular feature has been discussed in review quite a lot, it seems. The latest messages on the list suggest a different approach that avoids generated code, because the x86 maintainer hated it.

Let's see if they will switch to that.

https://lore.kernel.org/linux-bcachefs/ZIuCFtmnFturKwex@mori...

2h · on June 25, 2023

What the fuck, that page is gigantic

viraptor · on June 25, 2023

Has anyone actually posted some benchmarks of the jit and other methods? I haven't seen them before and it seems weird how much this has been discussed without published numbers.

cwillu · on June 25, 2023

See https://news.ycombinator.com/item?id=36466306

BenjiWiebe · on June 25, 2023

*bcachefs

archi42 · on June 25, 2023

The general optimization is actually not that novel: a DBMS might do that for parts of a query. At least in a high performance database lecture this was taught as a possible optimization. Edit: I'd intuitively expect the improvements to be more than 5% though.

E.g. PostgreSQL can do that these days (not sure if it did back then): https://www.postgresql.org/docs/current/jit-reason.html

BitPirate · on June 25, 2023

I have had a lot of "fun" with this feature in the past. It has earned a permanent place in my setup script:

ALTER SYSTEM SET jit=off;

Basically a performance blackhole for more complex queries. Most OSM related projects disable it straight away as it only creates headaches.

https://github.com/osm-search/Nominatim/pull/2559 https://github.com/gravitystorm/openstreetmap-carto/blob/mas...

jandrewrogers · on June 25, 2023

I have been JIT-ing query predicates in database-like systems for over a decade, and some commercial databases have supported it a lot longer than that. There are a couple relevant aspects that impact the performance benefit.

The performance gains are much higher if the database engine was designed to have JIT-ed execution from day one. Grafting it onto a database engine after the fact, like Postgres, is going to gain substantially less benefit than is theoretically possible. Additionally, it mostly benefits databases where query predicates have a large amount of data to process, it doesn’t do much for OLTP. But in the right system and context, large integer factor performance improvements are routinely achievable.

cmrdporcupine · on June 25, 2023

Yes, this is a thing. But the key difference there is... that's being done ... in user space.

We live in a world where just last week Google turned off io_uring access on a pile of machines, and that's only "executing" restricted sets of operations. Executable code in kernel = big giant target painted on back.

archi42 · on June 26, 2023

Ah, yes, that's an excellent point. Because someone else talked about 5% performance I looked at it with my performance hat on, not the security hat. OTOH we have BPFilter with its VM.

shrubble · on June 25, 2023

FWIW, the IBM mainframes have channel control programs, which as I understand it, can dynamically generate small programs and send them to the channel controller to execute; can include branching / conditionals also. https://en.wikipedia.org/wiki/Execute_Channel_Program

zX41ZdbW · on June 25, 2023

Manually constructing machine code (as in the example) is not the best idea - it is error-prone, difficult to debug, and prevents testing with sanitizers. I'd not do it.

Using LLVM for JIT is also not the best idea because LLVM is a complex codebase with bugs. Example: https://github.com/ClickHouse/ClickHouse/issues/50323#issuec... Although there are some marginal benefits: https://clickhouse.com/blog/clickhouse-just-in-time-compiler...

cwillu · on June 25, 2023

https://lore.kernel.org/linux-bcachefs/5ef2246b-9fe5-4206-ac...

    [...]
    So, without intending any particular hostility:

    <puts on maintainer hat>

    bcachefs's x86 JIT is:
    Nacked-by: Andy Lutomirski <luto@kernel.org> # for x86

    <takes off maintainer hat>

    This makes me sad, because I like bcachefs.  But you can get it merged 
    without worrying about my NAK by removing the x86 part.
    [...]

https://lore.kernel.org/linux-bcachefs/dcf8648b-c367-47a5-a2...

    > No, I'm saying your concerns are baseless and too vague to address.

    If you don't address them, the NAK will stand forever, or at least until a 
    different group of people take over x86 maintainership.  That's fine with me.

    I'm generally pretty happy about working with people to get their Linux 
    code right.  But no one is obligated to listen to me.

    >
    >> text_poke() by itself is *not* the proper API, as discussed.  It
    >> doesn't serialize adequately, even on x86.  We have text_poke_sync()
    >> for that.
    >
    > Andy, I replied explaining the difference between text_poke() and
    > text_poke_sync(). It's clear you have no idea what you're talking about,
    > so I'm not going to be wasting my time on further communications with
    > you.

    No problem.  Then your x86 code will not be merged upstream.

    Best of luck with the actual filesystem parts!

    --Andy

https://lore.kernel.org/linux-bcachefs/20230620201851.qrcabl...

    [...]
    > >> Andy, I replied explaining the difference between text_poke() and
    > >> text_poke_sync(). It's clear you have no idea what you're talking about,
    > >> so I'm not going to be wasting my time on further communications with
    > >> you.
    > 
    > One more specific concern: This comment made me very uncomfortable and
    > it read to me very much like a personal attack, something which is
    > contrary to our code of conduct.

    It's not; I prefer to be direct than passive 
    aggressive, and if I have to bow out of a discussion
    that isn't going anywhere I feel I owe an explanation 
    of _why_. Too much conflict avoidance means things
    don't get resolved.

    And Andy and I are talking on IRC now, so things are 
    proceeding in a better direction.

bigyikes · on June 25, 2023

Yikes, I guess I’m way too sensitive to ever be a kernel dev.

Kent seems way out of line. “Direct” is one thing, telling a maintainer they have no idea what they’re talking about is another. It’s totally uncalled for.

Are they trying to imitate Linus, or something? Sorry, you don’t get a license to be an asshole until you literally invent Linux.

cwillu · on June 25, 2023

Dunno, having read through the rest of the thread, and knowing some of the history, I can see why some frustration leaked out. Worth noting that they were able to pretty much immediately work together after this to come up with what appears to be a good solution to satisfy everyone.

Vogtinator · on June 25, 2023

> Sorry, you don’t get a license to be an asshole until you literally invent Linux.

That doesn't grant you a license either.

coldtea · on June 25, 2023

It's a mindset from a previous era, when there were blunt thick-skinned hackers and not snowflake "I'll report it to the HR" types

bastawhiz · on June 25, 2023

Where I'm from, having a temper tantrum on a mailing list because you're not getting your way is what makes you the snowflake. If you don't want to follow the code of conduct, you can submit your patch to a kernel without one.

coldtea · on June 25, 2023

Yes, that's the new mindset, about "codes of conduct" and such. It's what happens after a project has been bootstrapped, successful, and established, and the later process-focused/bureaucracy/ass-saving/touchy-feely minded people come in.

For Linux that was 15+ years after it started, did fine, and conquered the world, without needing one.

cwillu · on June 25, 2023

Worth noting that it also worked fine in this instance, the two parties that were in conflict worked things out and found a better solution, and the snowflake who brought up the code-of-conduct was treated as network damage and routed around.

It's vital to pay attention to the sheer volume of comments being made that all needed to be substantively responded to by one developer. The sense I get is very much that the appropriate attitude is: don't join the dog-pile if you don't want to be snapped at.

cmrdporcupine · on June 25, 2023

I've worked through both eras, and I'll take the latter and enjoy the higher productivity and job satisfaction from everyone involved, thank you.

_a_a_a_ · on June 25, 2023

I can't see any context in the form of a discussion, it's just code, so I can't see the anticipated trade-offs etc. but I'd expect the cost of access to the Btree of any decent size (that is, overflowing cache) and large fanout to be almost completely about RAM memory latency. Therefore I'd expect compiling just the tree-accessing code to be of little value.

Happy to be put right though.

userbinator · on June 25, 2023

I wonder how much this improves performance over not "JIT'ing" the calculation. I have done similar things in image/video codec code (where it resulted in substantial increases in speed) but this is the first time I've seen it for a filesystem.

cwillu · on June 25, 2023

Looks like 5% on the benchmark in the (long) thread.

https://lore.kernel.org/linux-bcachefs/ZGB1eevk%2Fu2ssIBT@mo...

    [...]
    testing random btree updates:

    dynamically generated unpack:
    rand_insert: 20.0 MiB with 1 threads in    33 sec,  1609 nsec per iter, 607 KiB per sec

    old C unpack:
    rand_insert: 20.0 MiB with 1 threads in    35 sec,  1672 nsec per iter, 584 KiB per sec

    the Eric Biggers special:
    rand_insert: 20.0 MiB with 1 threads in    35 sec,  1676 nsec per iter, 583 KiB per sec
    [...]
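
As a quick check on the "5%" figure, the per-iteration times in the quoted output can be compared directly. This is only arithmetic on the three numbers shown there; the variable and function names are mine, and a single run per variant says nothing about noise.

```python
# Per-iteration times (nsec) copied from the benchmark output quoted above.
nsec_per_iter = {
    "dynamically generated unpack": 1609,
    "old C unpack": 1672,
    "the Eric Biggers special": 1676,
}

baseline = nsec_per_iter["dynamically generated unpack"]


def slowdown_percent(nsec, base=baseline):
    """How much slower (in percent) a variant is than the generated-code unpack."""
    return (nsec / base - 1.0) * 100.0


for name, nsec in nsec_per_iter.items():
    print(f"{name}: {nsec} nsec/iter, {slowdown_percent(nsec):.1f}% slower than generated code")
```

On these numbers the two C variants come out roughly 4% slower per iteration than the generated code (about 6% if you go by the rounded 33 vs 35 second wall-clock times), so "5%" is a fair round figure.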

kzrdude · on June 25, 2023

5% is way too little to risk mainlining over

viraptor · on June 25, 2023

Alternative: 5% gain in a specific issue is totally fine to defer until another patch set. It feels like fighting about it in the first submission is a waste of time for everyone involved.

kzrdude · on June 25, 2023

You said it the way I'd want to.

grumpyprole · on June 25, 2023

Agreed, extraordinary solutions require extraordinary justification.

gavinray · on June 25, 2023

5% can be tens of millions of dollars a year when you're FAANG scale

They have engineers dedicated to 0.5-2.0% improvements in the kernel

kzrdude · on June 25, 2023

Yes and those improvements can happen in due time, after the filesystem first has arrived in linux.

jonathrg · on June 25, 2023

What does mainlining mean in this context?

espadrine · on June 25, 2023

Merging into the main Linux branch (Linus’). One maintainer has said they won’t merge until the JIT is removed. If it is never merged to Linux, it will forever be niche, requiring the same tricks and having the same limitations as zfs.ko. So it seems better to be 5% slower now and widely used, than to always be niche.

All in all, while a good Phoronix benchmark is what can make it supplant all other Linux filesystems, I appreciate the security concerns raised by maintainers, and I agree that a better approach would have been to use the default code at first, and seek advice on how to improve its performance. Thankfully, it looks like that is where it is going now.