
> I do not see why it would be much slower than direct access to the storage.

Implementations of protocols like ODBC/JDBC generally define their own on-wire binary protocols, and result sets must be marshalled to and from the client library, so performance varies a lot from one implementation to another. We are seeing a lot of improvements in this space, though, especially with the adoption of Arrow.
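
For a concrete (if simplified) picture, here is a minimal sketch of the row-oriented path versus an Arrow-based columnar path. The ODBC DSN, the Flight endpoint at grpc://warehouse-host:8815, and the query are all invented for illustration; this is not any particular vendor's API.

  # Row-oriented path: each row is marshalled from the driver's wire
  # format into Python objects one value at a time.
  import pyodbc

  conn = pyodbc.connect("DSN=warehouse")          # hypothetical DSN
  cursor = conn.cursor()
  cursor.execute("SELECT user_id, amount FROM payments")
  rows = cursor.fetchall()                        # list of Row objects

  # Columnar path: the server streams Arrow record batches, which the
  # client can hand to pandas/NumPy with little or no per-value copying.
  import pyarrow.flight as flight

  client = flight.FlightClient("grpc://warehouse-host:8815")  # hypothetical endpoint
  descriptor = flight.FlightDescriptor.for_command(
      b"SELECT user_id, amount FROM payments"
  )
  info = client.get_flight_info(descriptor)
  reader = client.do_get(info.endpoints[0].ticket)
  table = reader.read_all()                       # pyarrow.Table
  df = table.to_pandas()                          # zero-copy where possible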

There is also the question of computing for ML. Data scientists today use several tools/frameworks ranging from scikit-learn/XGBoost to PyTorch/Keras/TensorFlow, to name a few. Enabling data scientists to use these frameworks against near-realtime data, without worrying about provisioning infrastructure, managing dependencies, or adding an extra export-to-cloud-storage hop, is a game changer IMO; a rough sketch of what that looks like is below.
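
A minimal sketch of training directly against a platform table, with no intermediate export to object storage. It assumes a notebook environment with an active Spark session named `spark` (as Databricks notebooks provide); the table `events.features` and its `label` column are made up for illustration.

  # Train an sklearn-style XGBoost model straight off a table snapshot.
  # `spark` is assumed to be provided by the notebook environment;
  # table and column names are illustrative only.
  from sklearn.model_selection import train_test_split
  from xgboost import XGBClassifier

  pdf = spark.read.table("events.features").toPandas()   # columnar snapshot
  X, y = pdf.drop(columns=["label"]), pdf["label"]
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

  model = XGBClassifier(n_estimators=200)
  model.fit(X_train, y_train)
  print("holdout accuracy:", model.score(X_test, y_test))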




> There is also the question of computing for ML.

A few reasons why the Databricks platform shines here:

1) Not limited to just UDFs: extensions that improve performance, including GPU acceleration for XGBoost and distributed deep learning with HorovodRunner.

2) End-to-end MLOps solution, including a Feature Store, Model Registry, and Model Serving.

3) Open approach built on https://www.mlflow.org/ (a rough tracking sketch follows this list).

4) A glass-box (not black-box) approach to AutoML.
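
For point 3, here is a rough sketch of the open MLflow tracking flow; the experiment path, parameters, and dataset are invented for illustration and stand in for whatever your training code actually does.

  # Rough sketch of MLflow experiment tracking; names and values are
  # invented for illustration.
  import mlflow
  import mlflow.sklearn
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.datasets import make_classification

  X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

  mlflow.set_experiment("/Shared/demo-experiment")        # hypothetical path
  with mlflow.start_run():
      model = RandomForestClassifier(n_estimators=100, max_depth=5)
      model.fit(X, y)
      mlflow.log_param("n_estimators", 100)
      mlflow.log_param("max_depth", 5)
      mlflow.log_metric("train_accuracy", model.score(X, y))
      mlflow.sklearn.log_model(model, "model")            # logged to the run's artifacts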



