This is every data scientists’ dream.

prepend · on May 6, 2021

It’s ok, but surprisingly feature poor since they only index datasets with structured metadata. I kind of wish they would compile all their metadata into a structured mega-catalog and allowed searching by api. Or just dumped it out as a dataset itself.

chatmasta · on May 6, 2021

Then you'll love what we're doing at Splitgraph: https://www.splitgraph.com/connect

As far as your SQL client is concerned, data.splitgraph.com:5432 is a giant Postgres database with ~40,000 tables in it. You can query and join across them using your existing tools. Behind the curtain, we'll forward your query to the upstream data source, translating it from SQL to whatever language it expects. (We can also ingest delta-compressed versioned snapshots).

zX41ZdbW · on May 7, 2021

What are the largest datasets in Splitgraph? Can I list the datasets sorted by size?

We have the need for large public datasets for testing ClickHouse: https://clickhouse.tech/docs/en/getting-started/example-data...

chatmasta · on May 7, 2021

On the public DDN (data.splitgraph.com:5432), we enforce a (currently arbitrary) 10k row limit on responses. You can construct multiple queries using LIMIT and OFFSET, or you can run a local Splitgraph engine without a limit. We also have a private beta program if you want a managed or self-hosted cloud deployment with the full catalog and DDN features. And we are planning to ship some "export to..." type workflows for exporting to CSV and potentially other formats.

For live/external data, we proxy the query to the data source, so there is no theoretical data size limit except for any defined by the upstream.

For snapshotted data, we store the data as fragments in object storage. Any size limit depends on the machine where Splitgraph's Postgres engine is running, and how you choose to materialize the data when downloading it from object storage. You can "check out" an entire image to materialize it locally, at which point it will be like any other Postgres schema. Or you can use "layered querying" which will return a result set while only materializing the fragments necessary to answer the query.

Regarding ClickHouse, you could watch this presentation [0] my co-founder Artjoms gave at a recent ClickHouse meet-up on the topic of your question. We also have specific documentation for using the ClickHouse ODBC client with the DDN [1], as well as an example reference implementation. [2]

[0] https://www.youtube.com/watch?v=44CDs7hJTho

[1] https://www.splitgraph.com/connect

[2] https://github.com/splitgraph/splitgraph/tree/master/example...

prepend · on May 7, 2021

That’s pretty cool, I’ll test it out when I get to a machine with a Postgres client.

Do you translate to api calls and query in real-time? Or are you ingesting and archiving?

chatmasta · on May 7, 2021

We support both! A Splitgraph repository can "mount" an external data source, and we'll proxy queries to it using a system based on Postgres Foreign Data Wrappers (FDW). But a repository can also contain any number of "data images," which are versioned snapshots of data roughly inspired by Docker images. You can define them with a declarative, Dockerfile-like syntax called a Splitfile, and you can rebuild them against upstream sources with caching semantics similar to "docker build."

Our core philosophy has always been that it makes sense to start with data federation (live data), and then selectively warehouse/ingest only what you need (versioned data). We're shipping some upcoming features to support this workflow. You start by providing us (or your private deployment) a set of read-only credentials to any supported data source, which we then "mount" as a repository, making it discoverable in the catalog, and instantly queryable with all the other data on Splitgraph. If or when you decide that you want to warehouse this data, we'll make it easy for you to schedule a loading job to ingest it as a Splitgraph image. This way, you can query the live or versioned data in any repository, by simply changing the tag you use to address it.

You can do all this stuff locally, btw – a decentralized workflow is fully supported, and you can push data between peers. The public Splitgraph.com happens to be a "super peer" with a data catalog, scalability features, etc. But if you just want to experiment on your own, you can try it in five minutes!

I don't want to spam this thread too much, so I'll limit it to one link – maybe take a look at the Splitfile docs: https://www.splitgraph.com/docs/working-with-data/using-spli...

villasv · on May 6, 2021

This is not new, though. So it may be a dream in the sense of people have been asleep?

puzzledobserver · on May 6, 2021

Why the condescension? Do you mean that Google has been offering this service for a while? Or do you mean that similar services have previously been offered by other organizations? In which case, perhaps you could link to them?

villasv · on May 7, 2021

No condescension intended. Google has been offering this service for a while, at least since 2018 iirc.

sneilan1 · on May 6, 2021

How long until Google shuts this service down?