Google Dataset Search (research.google.com)
388 points by abraxaz on May 6, 2021 | 42 comments



Information on how to annotate datasets: https://developers.google.com/search/docs/data-types/dataset

> We can understand structured data in Web pages about datasets, using either schema.org Dataset markup, or equivalent structures represented in W3C's Data Catalog Vocabulary (DCAT) format. We also are exploring experimental support for structured data based on W3C CSVW, and expect to evolve and adapt our approach as best practices for dataset description emerge. For more information about our approach to dataset discovery, see Making it easier to discover datasets.

For more info on those:

- W3C's Data Catalog Vocabulary: https://www.w3.org/TR/vocab-dcat-3/

- Schema.org dataset: https://schema.org/Dataset

- CSVW Namespace Vocabulary Terms: https://www.w3.org/ns/csvw

- Generating RDF from Tabular Data on the Web (examples on how to use CSVW): https://www.w3.org/TR/csv2rdf/
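
For the curious, the markup itself is small. Here's a minimal sketch in Python (all the field values are made up) that emits the schema.org/Dataset JSON-LD you'd embed in a <script type="application/ld+json"> tag on the dataset's landing page:

    import json

    # Minimal schema.org/Dataset description; every value below is
    # illustrative, not a real dataset.
    dataset = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": "Example Air Quality Readings",
        "description": "Hourly PM2.5 readings from a hypothetical sensor network.",
        "url": "https://example.org/datasets/air-quality",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "keywords": ["air quality", "PM2.5", "sensors"],
        "distribution": [{
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/datasets/air-quality.csv",
        }],
    }

    # Paste the output into a <script type="application/ld+json"> tag.
    print(json.dumps(dataset, indent=2))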


It’s funny because Google does not use these standards to validate.

I keep getting errors from Google that some of my dataset’s descriptions are over 5,000 characters even though dcat:description does not have a size limit.

Of course it’s impossible for me to report a bug in how they index.
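
FWIW, until they fix or document it, the pragmatic workaround is a pre-publish check on your own side. A sketch, where the 5,000 figure comes from Google's error message rather than anything in DCAT:

    # Google's indexer rejects descriptions over 5,000 characters (per
    # its error messages), even though dcat:description has no limit.
    GOOGLE_DESCRIPTION_LIMIT = 5000

    def clamp_description(description: str) -> str:
        if len(description) <= GOOGLE_DESCRIPTION_LIMIT:
            return description
        # Truncate on a word boundary rather than let the record fail.
        clipped = description[:GOOGLE_DESCRIPTION_LIMIT].rsplit(" ", 1)[0]
        print(f"warning: description clipped from {len(description)} chars")
        return clipped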


You could submit a dataset containing the bug report :-)


Use cases for such [LD: Linked Data] metadata:

1. #StructuredPremises:

> (How do I indicate that this is a https://schema.org/ScholarlyArticle predicated upon premises including this Dataset and these logical propositions?)

2. #LinkedMetaAnalyses; #LinkedResearch "#StudyGraph"

3. [CSVW (Tabular Data Model),] schema.org/Dataset(s) with per-column (per-feature) physical quantity and unit URIs, e.g. with QUDT and/or https://schema.org/StructuredValue metadata, for maximum data reusability (see the sketch after this list).

4. JupyterLab notebooks:

4a. JupyterLab Metadata Service extension: https://github.com/jupyterlab/jupyterlab-metadata-service :

> - displays linked data about the resources you are interacting with in JupyterLab.

> - enables other extensions to register as linked data providers to expose JSON LD about an entity given the entity's URL.

> - exposes linked data to the user as a Linked Data viewer in the Data Browser pane.

4b. JupyterLab Data Explorer: https://github.com/jupyterlab/jupyterlab-data-explorer :

> - Data changing on you? Use RxJS observables to represent data over time.

> - Have a new way to look at your data? Create React or Lumino components to view a certain type.

> - Built-in data explorer UI to find and use available datasets.
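
Re: use case 3, a rough sketch of what per-column unit annotations can look like in a CSVW metadata document, built here as a Python dict. The file name, columns, and the qudt:hasUnit property are illustrative (the unit annotation is an extension property, not core CSVW):

    import json

    # CSVW table description for a hypothetical temperatures.csv,
    # annotating each column with a datatype and a QUDT unit URI.
    csvw_metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "url": "temperatures.csv",
        "tableSchema": {
            "columns": [
                {"name": "timestamp", "datatype": "dateTime"},
                {
                    "name": "temperature",
                    "datatype": "decimal",
                    # Extension property: ties the column to a unit URI.
                    "qudt:hasUnit": "http://qudt.org/vocab/unit/DEG_C",
                },
            ],
        },
    }

    print(json.dumps(csvw_metadata, indent=2))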


This is a great resource. At Splitgraph, we index ~40k open data sets, and we make sure to include structured metadata for each one, so we show up in these results. (example [0])

One cool aspect of this metadata is that it allows a dataset to have multiple sources. So if two sites index the same dataset, there is no duplicate content penalty like there might be with textual content. If you search for a dataset, it will include links to all its sources (whether canonical or otherwise).

For most of the data we index at Splitgraph, the canonical source is an open government data portal powered by Socrata (e.g. data.cdc.gov). We noticed that Socrata powered a lot of portals, so we wrote a Socrata plugin for Splitgraph, along with a scraper to index the metadata. The plugin basically implements a Postgres FDW so that Splitgraph can translate from SQL to the upstream query language. In this case, the plugin translates to Socrata's bespoke API language. But for private deployments we also have plugins for Snowflake, Postgres, some SaaS services, etc.
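
(For anyone curious what "implements a Postgres FDW" means in practice: this is not Splitgraph's actual plugin, just a minimal sketch of the pattern using Multicorn, with a made-up Socrata domain/dataset and no qual pushdown.)

    import requests
    from multicorn import ForeignDataWrapper

    class SocrataFDW(ForeignDataWrapper):
        """Sketch of an FDW that proxies table scans to a Socrata
        (SODA) endpoint. A real plugin would also translate quals
        into SoQL ($where=...) so filtering happens upstream."""

        def __init__(self, options, columns):
            super().__init__(options, columns)
            # e.g. domain="data.cdc.gov", dataset_id="abcd-1234" (made up)
            self.url = "https://{domain}/resource/{dataset_id}.json".format(**options)

        def execute(self, quals, columns):
            resp = requests.get(self.url, params={"$limit": 50000})
            resp.raise_for_status()
            for record in resp.json():
                yield {col: record.get(col) for col in columns}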

If you find some data on Google Dataset Search with Splitgraph listed as a source, please take a look! Our "Data Delivery Network" (DDN) is implemented on top of the Postgres wire protocol, so you can connect with any Postgres client (or use our web editor). All the Postgres query syntax is available to you; you can even JOIN across any of the other 40k+ datasets indexed at Splitgraph. That includes "live data" like Socrata portals, but also versioned snapshots of data called "data images." Here's an example of a point-in-time query across two snapshots (basically a diff) [1], and another query that joins across tables at data.cityofchicago.org and data.cambridgema.gov [2].

[0] https://www.splitgraph.com/cdc-gov/distribution-of-covid19-d... – "View Source" to see the Schema.org metadata

[1] https://bit.ly/3epvxcj

[2] https://bit.ly/3f1ll8K

(Sorry for the bit.ly links. The URL for our query editor includes the full SQL string, and I don't want to mess up HN formatting.)


Why did you make your website's scrollbar half the normal size? It looks the same as native, except too small to grab? Why?!


This dataset search engine has been around for years! We created DataMarket (https://datamarket.es) inspired by this site (and Auren Hoffman's SafeGraph).


Absolutely. This dataset search was first introduced in September of 2018, and it came out of beta in January of last year: https://www.kdnuggets.com/2020/01/google-dataset-search.html.


Stop, you're making the barrier to entry too low! /s

This is really really cool. Between this and Hugging Face's Datasets and Models hubs, AI/ML is really getting easier to use.


I've actually been on the lookout for model hubs lately. Any that you've seen or would recommend?

I've found https://modelzoo.co/ but it seems more like a curated list of models (some incomplete) rather than a community where users share trained models.


If you use tensorflow, tfhub [1] is the go-to model repository.

[1] https://tfhub.dev/


Dataset with 9,000 annotated cat images! => https://datasetsearch.research.google.com/search?query=cat&d...


I have a lot to read before I get excited but if the team is here: Can we get DCAT for sets that are otherwise only discoverable with OAI-PMH? Seems like a divide between govt and academic repos that hinders harvesting.


Shameless plug. I wrote a piece about The State of Open Data Portals https://uxdesign.cc/designing-open-data-portals-for-governme... where I predict that it'll take a Google to really provide a single searchable dataset portal for the whole world.

Doesn't take a genius to predict, but there ya go! Governments are assembling datasets in a very fragmented way. It'll take a private company to provide a single website for exploring and finding all datasets from around the world, making it easier to spot holistic global patterns or compare them between countries.

Though, I would expect a much better UX from Google nowadays. This site has more in common with Google Scholar than Google Search.

And ultimately I'd like to see them build something where people don't need to download datasets in order to make use of the data.

I compare the state of open data to the state of mapping software before Google Maps. You needed to download map files and open them in special software on your computer to make sense of the data. Then Google Maps came along and flipped that whole model. Open data needs the same leap forward for more people to make greater use of it.


Discussion from Jan 2020: https://news.ycombinator.com/item?id=22130874 | 32 comments

Discussion from Sept 2018: https://news.ycombinator.com/item?id=17919297 | 76 comments


I’ve come across this a few different times over the years... always seems enticing and potentially useful, but I’ve never found a real use for it. I suppose it provides a library of well-prepped datasets to test ML models on? Anyone ever used this for any practical purpose beyond a sandbox-type use case?


Well, you could use it to answer questions of interest to you.

I certainly tried my current projects on it, and found some useful stuff (most of which I've seen before).


another good resource that's more specific to machine learning is https://paperswithcode.com/datasets


The lab I work in has a project that helps annotate datasets with metadata and register their schemas: https://discovery.biothings.io/

A common barrier to making FAIR datasets is that not all data lends itself to be schema.org compliant. The idea is that instead of enforcing one schema to rule them all, we allow people to make their own schemas by extending existing ones, and register them in an API to be easily discoverable.
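
In case the "extending existing schemas" part is abstract, here's a toy sketch in Python of the kind of JSON-LD class declaration involved: a hypothetical subclass of schema.org's Dataset (the class name and namespace are made up):

    import json

    extension = {
        "@context": {
            "schema": "http://schema.org/",
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
            "myschema": "https://example.org/schema/",
        },
        "@graph": [{
            # A made-up domain-specific class extending schema:Dataset.
            "@id": "myschema:GeneExpressionDataset",
            "@type": "rdfs:Class",
            "rdfs:label": "GeneExpressionDataset",
            "rdfs:comment": "A dataset of gene expression measurements.",
            "rdfs:subClassOf": {"@id": "schema:Dataset"},
        }],
    }

    print(json.dumps(extension, indent=2))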


They should add an "I'm Feeling Lucky" button.


Also in case the team is here... the updated date for ERA5 back extension to 1950-1978 (Preliminary version - https://datasetsearch.research.google.com/search?query=ERA5%...) is incorrect as this was only released last year (2020) but is stated as 2011.


No mention of https://frictionlessdata.io/, a dataset metadata format which is also used by Kaggle.


This is every data scientist’s dream.


It’s OK, but surprisingly feature-poor, since they only index datasets with structured metadata. I kind of wish they would compile all their metadata into a structured mega-catalog and allow searching by API. Or just dump it out as a dataset itself.


Then you'll love what we're doing at Splitgraph: https://www.splitgraph.com/connect

As far as your SQL client is concerned, data.splitgraph.com:5432 is a giant Postgres database with ~40,000 tables in it. You can query and join across them using your existing tools. Behind the curtain, we'll forward your query to the upstream data source, translating it from SQL to whatever language it expects. (We can also ingest delta-compressed versioned snapshots).
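
A quick sketch of what that looks like from Python. The host and port are real (see above); the credentials, database name, and repository/table names are placeholders you'd swap for your own:

    import psycopg2

    # The DDN speaks the Postgres wire protocol, so any client works.
    conn = psycopg2.connect(
        host="data.splitgraph.com",
        port=5432,
        user="YOUR_API_KEY",         # placeholder
        password="YOUR_API_SECRET",  # placeholder
        dbname="ddn",                # assumed from their connect docs
    )

    with conn.cursor() as cur:
        # Repositories are addressed as quoted schema names
        # (hypothetical names below).
        cur.execute('SELECT * FROM "some-namespace/some-repo".some_table LIMIT 10')
        for row in cur.fetchall():
            print(row)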


What are the largest datasets in Splitgraph? Can I list the datasets sorted by size?

We need large public datasets for testing ClickHouse: https://clickhouse.tech/docs/en/getting-started/example-data...


On the public DDN (data.splitgraph.com:5432), we enforce a (currently arbitrary) 10k row limit on responses. You can construct multiple queries using LIMIT and OFFSET, or you can run a local Splitgraph engine without a limit. We also have a private beta program if you want a managed or self-hosted cloud deployment with the full catalog and DDN features. And we are planning to ship some "export to..." type workflows for exporting to CSV and potentially other formats.
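
In case it saves someone a few minutes, paging under the 10k cap is the usual LIMIT/OFFSET loop. A sketch (assumes a cursor like the psycopg2 one upthread; the ORDER BY column is hypothetical but needed for stable pages):

    PAGE_SIZE = 10_000  # the DDN's current per-response row cap

    def fetch_all(cur, table):
        """Yield every row of `table` in PAGE_SIZE chunks."""
        offset = 0
        while True:
            cur.execute(
                f'SELECT * FROM {table} ORDER BY id LIMIT %s OFFSET %s',
                (PAGE_SIZE, offset),
            )
            rows = cur.fetchall()
            if not rows:
                break
            yield from rows
            offset += PAGE_SIZE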

For live/external data, we proxy the query to the data source, so there is no theoretical data size limit except for any defined by the upstream.

For snapshotted data, we store the data as fragments in object storage. Any size limit depends on the machine where Splitgraph's Postgres engine is running, and how you choose to materialize the data when downloading it from object storage. You can "check out" an entire image to materialize it locally, at which point it will be like any other Postgres schema. Or you can use "layered querying" which will return a result set while only materializing the fragments necessary to answer the query.

Regarding ClickHouse, you could watch this presentation [0] my co-founder Artjoms gave at a recent ClickHouse meet-up on the topic of your question. We also have specific documentation for using the ClickHouse ODBC client with the DDN [1], as well as an example reference implementation. [2]

[0] https://www.youtube.com/watch?v=44CDs7hJTho

[1] https://www.splitgraph.com/connect

[2] https://github.com/splitgraph/splitgraph/tree/master/example...


That’s pretty cool, I’ll test it out when I get to a machine with a Postgres client.

Do you translate to api calls and query in real-time? Or are you ingesting and archiving?


We support both! A Splitgraph repository can "mount" an external data source, and we'll proxy queries to it using a system based on Postgres Foreign Data Wrappers (FDW). But a repository can also contain any number of "data images," which are versioned snapshots of data roughly inspired by Docker images. You can define them with a declarative, Dockerfile-like syntax called a Splitfile, and you can rebuild them against upstream sources with caching semantics similar to "docker build."

Our core philosophy has always been that it makes sense to start with data federation (live data), and then selectively warehouse/ingest only what you need (versioned data). We're shipping some upcoming features to support this workflow. You start by providing us (or your private deployment) a set of read-only credentials to any supported data source, which we then "mount" as a repository, making it discoverable in the catalog, and instantly queryable with all the other data on Splitgraph. If or when you decide that you want to warehouse this data, we'll make it easy for you to schedule a loading job to ingest it as a Splitgraph image. This way, you can query the live or versioned data in any repository, by simply changing the tag you use to address it.
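
(To illustrate that last sentence: the tag lives inside the schema-qualified name, so switching between live and versioned data is a one-token change. The repository, tags, and table below are hypothetical.)

    # Same query shape; only the tag changes.
    QUERY = 'SELECT count(*) FROM "example/repository:{tag}".some_table'

    latest_query = QUERY.format(tag="latest")    # most recent / live view
    pinned_query = QUERY.format(tag="20210506")  # a fixed snapshot (image)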

You can do all this stuff locally, btw – a decentralized workflow is fully supported, and you can push data between peers. The public Splitgraph.com happens to be a "super peer" with a data catalog, scalability features, etc. But if you just want to experiment on your own, you can try it in five minutes!

I don't want to spam this thread too much, so I'll limit it to one link – maybe take a look at the Splitfile docs: https://www.splitgraph.com/docs/working-with-data/using-spli...


This is not new, though. So it may be a dream in the sense that people have been asleep?


Why the condescension? Do you mean that Google has been offering this service for a while? Or do you mean that similar services have previously been offered by other organizations? In which case, perhaps you could link to them?


No condescension intended. Google has been offering this service for a while, at least since 2018 iirc.


How long until Google shuts this service down?


I was looking up dentistry data sets (my industry) and came across this:

https://www.arcgis.com/home/item.html?id=9850793c688e4eebaab...

Can anybody explain why this showed up in a dataset search and what exactly the data is?


Someone trying to do SEO by getting links to their site from a popular one?


The person who added the data set is the same person that owns the business, so this seems pretty likely.


Do they have a deprecation notice up already?


The privacy problem should be considered


Privacy is the liability of those who collect data and share it in the form of a dataset, not of those who search for datasets or crawl them.


Can you explain what you mean by "the privacy problem"?


I'm not a scientist, so this wasn't the most scholarly first lookup, but I tried searching for penis data [0]. The first link sent me to a site that requires signup to use [1]. No fun. Won't use again.

[0] - https://datasetsearch.research.google.com/search?query=penis... [1] - https://data.world/jemus42/world-penis-data


I'll keep using it because the inconvenience of the occasional sign-up is trumped by the convenience of searchable datasets.



