> I do not see why it would be much slower than direct access to the storage.
Implementations of protocols like ODBC/JDBC generally implement their custom on-wire binary protocols that must be marshalled to/from the lib - and the performance would vary a lot from one implementation to another. We are seeing a lot of improvements in this space though, especially with the adoption of Arrow.
There is also the question of computing for ML. Data scientists today use several tools/frameworks ranging from scikit-learn/XGBoost to PyTorch/Keras/TensorFlow - to name a few. Enabling data scientists to use these frameworks against near-realtime data without worrying about provisioning infrastructure or managing dependencies or adding an additional export-to-cloud-storage hop is a game changer IMO.
> Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.
It is a solved problem. Essentially you need a central place (with decentralized ownership, for the data-mesh fans) to specify the ACLs (row-based, column-based, attribute-based, etc.) and an enforcement layer that understands those ACLs - a toy sketch below illustrates the idea. There are many solutions, including the ones from Databricks. Data discovery, lineage, data quality, etc. go hand in glove with this.
Security is front and centre for almost all organizations now.
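To make that concrete, here's a toy Go sketch of the idea (the Policy/enforce names are invented for illustration, not any vendor's actual API): ACLs live in one central definition, and a single enforcement layer applies the row- and column-level rules regardless of which engine or client asked for the data.

```go
package main

import "fmt"

// Policy is a toy, centrally defined ACL: which columns a role may see and a
// row-level predicate evaluated per record.
type Policy struct {
	VisibleColumns map[string]bool
	RowFilter      func(row map[string]string) bool
}

// enforce is the enforcement layer: it applies the policy to a result set,
// independently of which engine or client produced it.
func enforce(p Policy, rows []map[string]string) []map[string]string {
	var out []map[string]string
	for _, row := range rows {
		if !p.RowFilter(row) {
			continue // row-level security: drop rows the policy forbids
		}
		filtered := map[string]string{}
		for col, val := range row {
			if p.VisibleColumns[col] {
				filtered[col] = val // column-level security: hide other columns
			}
		}
		out = append(out, filtered)
	}
	return out
}

func main() {
	// An "EMEA analyst" policy: may see region and revenue, never email,
	// and only rows for their own region.
	analyst := Policy{
		VisibleColumns: map[string]bool{"region": true, "revenue": true},
		RowFilter:      func(r map[string]string) bool { return r["region"] == "EMEA" },
	}
	rows := []map[string]string{
		{"region": "EMEA", "revenue": "100", "email": "a@example.com"},
		{"region": "APAC", "revenue": "200", "email": "b@example.com"},
	}
	fmt.Println(enforce(analyst, rows)) // only the EMEA row, without the email column
}
```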
It goes into the details of how this performance is achieved (and not just at 100TB). Part of this could be attributed to innovations in the storage layer (Delta Lake), and part of it is just the new query engine design itself.
:) That is a good question. Why spend eng cycles to submit results to the TPC council - why not just focus on customers?
I believe the co-founders have addressed this in the blog.
> Our goal was to dispel the myth that Data Lakehouse cannot have best-in-class price and performance. Rather than making our own benchmarks, we sought the truth and participated in the official TPC benchmark.
I'm sure anybody seriously looking at evaluating data platforms would want to look at things holistically. There are different dimensions like open ecosystem, support for machine learning, performance etc. And different teams evaluating these platforms would stack rank them in different orders.
These blogs, I believe, show that Databricks is a viable choice for customers when performance is a top priority (along with other dimensions). That IMO is customer obsession.
Informative article elithrar - thanks! Wondering why appContext was used instead of Goji's Context/environment object?
Doesn't a context struct that holds all the handler dependencies make the code harder to reason about and test (because the dependencies of a component would be unclear)?
Wondering whether testing/debugging would be a little less complex if we created separate handler groups/components. That would make main() very verbose with all the wiring (one of the things the inject lib tries to solve).
From what I understand, there seem to be two lines of thought:
#1) Being verbose is good - components whose constructors take all of their dependencies are simple to reason about and straightforward to test (sketched below).
#2) When there are several binaries that share libraries, allocating memory and wiring up the object graph becomes mundane and repetitive. Solving this with a DI library like inject that doesn't add run-time overhead would be good, and it doesn't have to come at the cost of being harder to test or reason about.
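A rough sketch of what #1 looks like in practice (the types and names here are invented for illustration): every component takes its dependencies through its constructor, and main() does all the wiring by hand.

```go
package main

import (
	"database/sql"
	"log"
	"net/http"
)

// Each component declares exactly what it needs in its constructor,
// so its dependencies are obvious and easy to fake in tests.
type UserStore struct{ db *sql.DB }

func NewUserStore(db *sql.DB) *UserStore { return &UserStore{db: db} }

type UserHandler struct{ store *UserStore }

func NewUserHandler(s *UserStore) *UserHandler { return &UserHandler{store: s} }

func (h *UserHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// ... use h.store to serve the request ...
}

func main() {
	db, err := sql.Open("postgres", "dsn-goes-here") // driver import omitted for brevity
	if err != nil {
		log.Fatal(err)
	}

	// Verbose but explicit wiring of the object graph.
	store := NewUserStore(db)
	handler := NewUserHandler(store)

	http.Handle("/users", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The wiring block in main() is exactly the boilerplate that #2 wants a library like inject to take over, without changing how testable the individual components are.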
> Informative article elithrar - thanks! Wondering why appContext was used instead of Goji's Context/environment object? Doesn't a context struct that holds all the handler dependencies make the code harder to reason about and test (because the dependencies of a component would be unclear)?
Goji's context is a request context, and only exists for the lifetime of the request. Re-populating it at the beginning of every request would be a significant amount of overhead, and it's ultimately not designed for "life-of-application" variables. Request contexts like Goji's (or gorilla/context) are best used for passing short-lived data between middleware/handlers.
You could ultimately create a series of smaller structs, but you would need a ton of types (and repetitive code) that satisfy http.Handler to achieve that.
Memory allocation with this approach is minimal: you're only passing a struct pointer around once per request, which is about as good as it gets (better than a closure).
Testing with this approach is also straightforward: you can populate the appContext instance with your test sessions/database/etc. when running tests, and your handlers operate as if they've been passed the real thing.
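For anyone skimming, here's a condensed sketch of the pattern being discussed (field and type names are illustrative rather than lifted from the article): one appContext holds the long-lived dependencies, a small wrapper type satisfies http.Handler, and tests simply build an appContext full of fakes.

```go
package main

import (
	"database/sql"
	"net/http"
)

// appContext holds the "life-of-application" dependencies. It is built once
// in main(), unlike a request context that is re-populated on every request.
type appContext struct {
	db       *sql.DB      // long-lived handle shared by all handlers
	sessions SessionStore // swap in a fake when testing
}

// SessionStore is an illustrative interface so tests can substitute a fake.
type SessionStore interface {
	Get(r *http.Request, name string) (string, error)
}

// appHandler pairs the shared context with a handler function and satisfies
// http.Handler, so only this one extra type is needed.
type appHandler struct {
	*appContext
	h func(*appContext, http.ResponseWriter, *http.Request) (int, error)
}

func (ah appHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// Only a struct pointer is passed per request - nothing is re-populated.
	if status, err := ah.h(ah.appContext, w, r); err != nil {
		http.Error(w, http.StatusText(status), status)
	}
}

func indexHandler(ctx *appContext, w http.ResponseWriter, r *http.Request) (int, error) {
	if _, err := ctx.sessions.Get(r, "session"); err != nil {
		return http.StatusInternalServerError, err
	}
	w.Write([]byte("ok"))
	return http.StatusOK, nil
}

// cookieSessions is a trivial "real" implementation backed by cookies.
type cookieSessions struct{}

func (cookieSessions) Get(r *http.Request, name string) (string, error) {
	c, err := r.Cookie(name)
	if err != nil {
		return "", err
	}
	return c.Value, nil
}

func main() {
	ctx := &appContext{sessions: cookieSessions{}} // db omitted for brevity
	http.Handle("/", appHandler{appContext: ctx, h: indexHandler})
	http.ListenAndServe(":8080", nil)
}
```

Swapping cookieSessions for an in-memory fake in a test is the whole trick - the handler signatures never change.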
I've considered splitting out my handlers into a `handlers` package and the appContext struct/config structs into a `conf` package that handlers imports (and main initialises), and that's probably something I'll do in the near future since it's an easy change.
It's certainly not the one way/single best way to do things, but I've found that aligning to interfaces (like http.Handler) and leaning on structs as much as possible helps keep things easier to reason about and mock out later.
Found it interesting that inject wires up the object graph and runs only once on application startup.
Curious to know your thoughts on why this lib could be considered non-idiomatic.