From the announcement
“As of now, we have mined 1,580 PySpark tests from the Spark codebase, among which 838 (53.0%) are successful on Sail. We have also mined 2,230 Spark SQL statements or expressions, among which 1,396 (62.6%) can be parsed by Sail”
Kinda early to call this a drop-in replacement with those numbers, no?
But with enough parity, this project could be a dream for anybody dealing with Spark's dreadful performance. Kudos to the team.
The next paragraph explains that: "When looking at the test coverage numbers alone, Sail’s capability may seem limited. But we have found that there is a long tail of failed tests due to formatting discrepancies, edge cases, and less-used SQL functions, which we will continue tackling in future releases."
I am with you that it is still very very early. I'll personally keep an eye on the project.
I'll keep an eye on it too, but for a query engine formatting compliance and edge cases tend to be almost all of the work. It's easy to implement SELECT x FROM y WHERE z.
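To make that concrete (a made-up toy example, not anything mined from Sail's test suite): the trivial query is one line, and the long tail is all the formatting, date patterns, and less-used functions around it, where the output has to match Spark byte-for-byte.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "a", "2024-01-31"), (2, "b", "2024-02-29")],
        ["x", "y", "z"],
    )
    df.createOrReplaceTempView("t")

    # The easy part: every engine gets this right.
    spark.sql("SELECT x FROM t WHERE y = 'a'").show()

    # The long tail: formatting and less-used functions, where results
    # must match Spark's output exactly for compatibility tests to pass.
    spark.sql("""
        SELECT date_format(to_date(z), 'MMM yyyy') AS month,
               format_number(x / 3.0, 4)           AS ratio
        FROM t
    """).show()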
Yeah, but the website literally says “zero code changes”. It’s the long tail that’s dangerous, since most people don’t understand it as well as the core functions.
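(For anyone wondering, "zero code changes" presumably means the Spark Connect route, where only the connection string changes and the rest of the PySpark program stays the same. The endpoint below is a hypothetical placeholder, not something from Sail's docs.)

    from pyspark.sql import SparkSession

    # Same application code as before; only the remote endpoint differs.
    # "sc://localhost:50051" stands in for wherever a Spark Connect-
    # compatible server (stock Spark or Sail) happens to listen.
    spark = (
        SparkSession.builder
        .remote("sc://localhost:50051")
        .getOrCreate()
    )

    spark.range(10).selectExpr("id", "id * 2 AS doubled").show()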
Bit off topic: we are looking for something like this, but with a facility for untrusted users to run sandboxed code instead of trusted code. All I've found so far (though I am relatively new to this field) are hacky and, worse, slow solutions.
Our users (who are perhaps untrusted) write the compute for this, so they cannot just be allowed to do whatever they want. But it still has to be performant, so kicking off containers hasn't been great so far, and even that we couldn't find already done anywhere.
It is refreshing to see multiple Arrow/DataFusion projects trying to bank on Spark's existing, user-friendly API instead of reinventing the API all over again.
There are the likes of Comet and Blaze that replace Spark's execution backend with DataFusion, and then you have single-process alternatives like Sail trying to settle into the "not so big data" category.
I am watching the evolution of projects powered by DataFusion and compatible with Spark with a keen eye. Early days, but quite exciting.
This looks interesting, but the docs are really lacking, to the point where the project is barely understandable.
I see some potential wins, such as it being Rust-based, Spark-compatible, and better suited for single-process environments, but they are just not explained or developed enough.
> The mission of Sail is to unify stream processing, batch processing, and compute-intensive (AI) workloads. Currently, Sail features a drop-in replacement for Spark SQL and the Spark DataFrame API in single-process settings.