> I do not see why it would be much slower than direct access to the storage.
Implementations of protocols like ODBC/JDBC generally implement their custom on-wire binary protocols that must be marshalled to/from the lib - and the performance would vary a lot from one implementation to another. We are seeing a lot of improvements in this space though, especially with the adoption of Arrow.
There is also the question of computing for ML. Data scientists today use several tools/frameworks ranging from scikit-learn/XGBoost to PyTorch/Keras/TensorFlow - to name a few. Enabling data scientists to use these frameworks against near-realtime data without worrying about provisioning infrastructure or managing dependencies or adding an additional export-to-cloud-storage hop is a game changer IMO.
> Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.
It is a solved problem. Essentially you need a central place (with decentralized ownership, for the data-mesh fans) to specify the ACLs (row-based, column-based, attribute-based, etc.) and an enforcement layer that understands those ACLs - a toy sketch below illustrates the idea. There are many solutions, including the ones from Databricks. Data discovery, lineage, data quality, etc. go hand in glove with this.
Security is front and centre for almost all organizations now.
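To make that concrete, here's a toy Go sketch of the idea (the Policy/enforce names are invented for illustration, not any vendor's actual API): ACLs live in one central definition, and a single enforcement layer applies the row- and column-level rules regardless of which engine or client asked for the data.

```go
package main

import "fmt"

// Policy is a toy, centrally defined ACL: which columns a role may see and a
// row-level predicate evaluated per record.
type Policy struct {
	VisibleColumns map[string]bool
	RowFilter      func(row map[string]string) bool
}

// enforce is the enforcement layer: it applies the policy to a result set,
// independently of which engine or client produced it.
func enforce(p Policy, rows []map[string]string) []map[string]string {
	var out []map[string]string
	for _, row := range rows {
		if !p.RowFilter(row) {
			continue // row-level security: drop rows the policy forbids
		}
		filtered := map[string]string{}
		for col, val := range row {
			if p.VisibleColumns[col] {
				filtered[col] = val // column-level security: hide other columns
			}
		}
		out = append(out, filtered)
	}
	return out
}

func main() {
	// An "EMEA analyst" policy: may see region and revenue, never email,
	// and only rows for their own region.
	analyst := Policy{
		VisibleColumns: map[string]bool{"region": true, "revenue": true},
		RowFilter:      func(r map[string]string) bool { return r["region"] == "EMEA" },
	}
	rows := []map[string]string{
		{"region": "EMEA", "revenue": "100", "email": "a@example.com"},
		{"region": "APAC", "revenue": "200", "email": "b@example.com"},
	}
	fmt.Println(enforce(analyst, rows)) // only the EMEA row, without the email column
}
```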
It goes into the details of how this performance is achieved (and not just at 100TB). Part of this could be attributed to innovations in the storage layer (Delta Lake), and part of it is just the new query engine design itself.
:) That is a good question. Why spend eng cycles to submit results to the TPC council - why not just focus on customers?
I believe the co-founders have addressed this in the blog.
> Our goal was to dispel the myth that Data Lakehouse cannot have best-in-class price and performance. Rather than making our own benchmarks, we sought the truth and participated in the official TPC benchmark.
I'm sure anybody seriously looking at evaluating data platforms would want to look at things holistically. There are different dimensions like open ecosystem, support for machine learning, performance etc. And different teams evaluating these platforms would stack rank them in different orders.
These blogs, I believe, show that Databricks is a viable choice for customers when performance is a top priority (along with other dimensions). That IMO is customer obsession.
Informative article elithrar - thanks! Wondering why appContext was used instead of Goji's Context/environment object?
Doesn't a context struct that holds all the handler dependencies make the code harder to reason about and test (because the dependencies of a component would be unclear)?
Wondering whether testing/debugging would be a little less complex if we created separate handler groups/components. That would make main() very verbose with all the wiring (one of the things the inject lib tries to solve).
From what I understand, there seem to be two lines of thought:
#1) Being verbose is good - components whose constructors take all of their dependencies are simple to reason about and straightforward to test (sketched below).
#2) When there are several binaries that share libraries, allocating memory and wiring up the object graph becomes mundane and repetitive. Solving this with a DI library like inject that doesn't add run-time overhead would be good, and it doesn't have to come at the cost of being harder to test or reason about.
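A rough sketch of what #1 looks like in practice (the types and names here are invented for illustration): every component takes its dependencies through its constructor, and main() does all the wiring by hand.

```go
package main

import (
	"database/sql"
	"log"
	"net/http"
)

// Each component declares exactly what it needs in its constructor,
// so its dependencies are obvious and easy to fake in tests.
type UserStore struct{ db *sql.DB }

func NewUserStore(db *sql.DB) *UserStore { return &UserStore{db: db} }

type UserHandler struct{ store *UserStore }

func NewUserHandler(s *UserStore) *UserHandler { return &UserHandler{store: s} }

func (h *UserHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// ... use h.store to serve the request ...
}

func main() {
	db, err := sql.Open("postgres", "dsn-goes-here") // driver import omitted for brevity
	if err != nil {
		log.Fatal(err)
	}

	// Verbose but explicit wiring of the object graph.
	store := NewUserStore(db)
	handler := NewUserHandler(store)

	http.Handle("/users", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The wiring block in main() is exactly the boilerplate that #2 wants a library like inject to take over, without changing how testable the individual components are.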
> Informative article elithrar - thanks! Wondering why appContext was used instead of Goji's Context/environment object? Doesn't a context struct that holds all the handler dependencies make the code harder to reason about and test (because the dependencies of a component would be unclear)?
Goji's context is a request context, and only exists for the lifetime of the request. Re-populating it at the beginning of every request would be a significant amount of overhead, and it's ultimately not designed for "life-of-application" variables. Request contexts like Goji's (or gorilla/context) are best used for passing short-lived data between middleware/handlers.
You could ultimately create a series of smaller structs, but you would need a ton of types (and repetitive code) that satisfy http.Handler to achieve that.
Memory allocation with this approach is minimal: you're only passing a struct pointer around once per request, which is about as good as it gets (better than a closure).
Testing with this approach is also straightforward: you can populate the appContext instance with your test sessions/database/etc. when running tests, and your handlers operate as if they've been passed the real thing.
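For anyone skimming, here's a condensed sketch of the pattern being discussed (field and type names are illustrative rather than lifted from the article): one appContext holds the long-lived dependencies, a small wrapper type satisfies http.Handler, and tests simply build an appContext full of fakes.

```go
package main

import (
	"database/sql"
	"net/http"
)

// appContext holds the "life-of-application" dependencies. It is built once
// in main(), unlike a request context that is re-populated on every request.
type appContext struct {
	db       *sql.DB      // long-lived handle shared by all handlers
	sessions SessionStore // swap in a fake when testing
}

// SessionStore is an illustrative interface so tests can substitute a fake.
type SessionStore interface {
	Get(r *http.Request, name string) (string, error)
}

// appHandler pairs the shared context with a handler function and satisfies
// http.Handler, so only this one extra type is needed.
type appHandler struct {
	*appContext
	h func(*appContext, http.ResponseWriter, *http.Request) (int, error)
}

func (ah appHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// Only a struct pointer is passed per request - nothing is re-populated.
	if status, err := ah.h(ah.appContext, w, r); err != nil {
		http.Error(w, http.StatusText(status), status)
	}
}

func indexHandler(ctx *appContext, w http.ResponseWriter, r *http.Request) (int, error) {
	if _, err := ctx.sessions.Get(r, "session"); err != nil {
		return http.StatusInternalServerError, err
	}
	w.Write([]byte("ok"))
	return http.StatusOK, nil
}

// cookieSessions is a trivial "real" implementation backed by cookies.
type cookieSessions struct{}

func (cookieSessions) Get(r *http.Request, name string) (string, error) {
	c, err := r.Cookie(name)
	if err != nil {
		return "", err
	}
	return c.Value, nil
}

func main() {
	ctx := &appContext{sessions: cookieSessions{}} // db omitted for brevity
	http.Handle("/", appHandler{appContext: ctx, h: indexHandler})
	http.ListenAndServe(":8080", nil)
}
```

Swapping cookieSessions for an in-memory fake in a test is the whole trick - the handler signatures never change.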
I've considered splitting out my handlers into a `handlers` package and the appContext struct/config structs into a `conf` package that handlers imports (and main initialises), and that's probably something I'll do in the near future since it's an easy change.
It's certainly not the one way/single best way to do things, but I've found that aligning to interfaces (like http.Handler) and leaning on structs as much as possible helps keep things easier to reason about and mock out later.
Found it interesting that inject wires up the object graph and runs only once on application startup.
Curious to know your thoughts on why this lib could be considered non-idiomatic.