Faster PostgreSQL to BigQuery Transfers (marksblogg.com)
118 points by fhk on Jan 10, 2023 | 23 comments



Back when I was working with shapefiles, they were the type of thing that tended to be far more convenient to process in-process using something like GDAL [1] (which can operate directly on an in-memory copy, gzip files, SQLite databases and far more) and query with GDAL's SQL support, especially when built with SpatiaLite [2], rather than loading them into a separate database (a sketch of this approach follows the links below). It would have been interesting if the author had talked about what's stopping him from taking that approach, given he's clearly aware of GDAL and given that 130M records and a few tens of GB isn't a particularly big GIS dataset.

[1] https://gdal.org

[2] https://www.gaia-gis.it/fossil/libspatialite/index
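
To make that concrete, a minimal sketch of the in-process approach; the file, layer, and column names here are hypothetical, and SpatiaLite functions like ST_Area are only available if GDAL was built against it:

    from osgeo import ogr

    # Open the shapefile directly; nothing is loaded into a separate database.
    ds = ogr.Open("parcels.shp")

    # The SQLITE dialect enables SQLite (and, if compiled in, SpatiaLite)
    # SQL instead of OGR's basic dialect.
    layer = ds.ExecuteSQL(
        "SELECT ST_Area(geometry) AS area, zone FROM parcels WHERE zone = 'R1'",
        dialect="SQLITE",
    )
    for feature in layer:
        print(feature.GetField("area"), feature.GetField("zone"))
    ds.ReleaseResultSet(layer)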


I'll take a look into it.

Last year I found ClickHouse did H3 enrichment much faster than PG or BQ, so it has become a go-between for larger workloads with simple enrichment requirements.
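
For anyone curious what that enrichment looks like, a rough sketch using ClickHouse's built-in geoToH3() function via clickhouse-driver; the table and column names are hypothetical, and the lon/lat argument order should be double-checked against your ClickHouse version:

    from clickhouse_driver import Client

    client = Client("localhost")
    # Tag each point with its H3 cell at resolution 7.
    rows = client.execute("""
        SELECT
            id,
            geoToH3(lon, lat, 7) AS h3_cell
        FROM points
        LIMIT 10
    """)
    print(rows)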

The dataset in the post is an example anyone can download. I handle 30B+ records at work.


>It would also be good to see GEOS, GDAL and PROJ integrated into ClickHouse.

You can just add feature requests like this to the ClickHouse issue tracker: https://github.com/ClickHouse/ClickHouse/issues/45129



Very cool to see a walkthrough with actual benchmarks. Not entirely surprised that Parquet shines here. Another big advantage of Parquet over CSV is that you don't have to worry about data integrity. Perhaps less relevant for GIS data, but not having to think about things like string escaping is rather nice.
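
A quick illustration of the escaping point: a string containing quotes, a newline, and a comma round-trips through Parquet untouched, with no quoting rules to reason about (assumes pyarrow is installed):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # A value that would need careful quoting/escaping in CSV.
    table = pa.table({"note": ['he said "hi,\nbye"'], "n": [1]})
    pq.write_table(table, "demo.parquet")

    # Comes back byte-for-byte identical, types intact.
    print(pq.read_table("demo.parquet").column("note")[0])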

"It would be great to see data vendors deliver data straight into the Cloud Databases of their customers. It would save a lot of client time that's spent converting and uploading files."

Hear, hear! Shameless plug: this is exactly what we enable at prequel.co. If there are any data vendors reading this, or anyone who wants easier access to data from their vendor, we're here to help.

edit: quote fmt


As someone whose interaction with spatial data is very limited, I found the article to be a treasure trove of information.

Also, thanks for sharing S2! It'll be nice to look at.


I really love working with Parquet and the general Arrow ecosystem. The performance and cost ratios you can get out of it are really insane. AWS S3 + Parquet + Athena is one of the best and cheapest databases I've ever used.
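
For a sense of how little glue that stack needs, a sketch with the awswrangler library against a hypothetical Glue database over Parquet files in S3 (assumes AWS credentials are already configured):

    import awswrangler as wr

    # Athena scans the Parquet files in S3 directly; you pay per TB scanned.
    df = wr.athena.read_sql_query(
        "SELECT COUNT(*) AS n FROM trips",
        database="my_db",
    )
    print(df)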


[I work on BigQuery]

BigQuery also supports in-place querying of datasets on GCS (or S3/Azure using Omni) via external/BigLake tables. https://cloud.google.com/bigquery/docs/query-cloud-storage-u...
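
A minimal sketch of that from the Python client; the bucket and dataset names are hypothetical, and the dataset must already exist:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Define an external table over Parquet files sitting in GCS.
    client.query("""
        CREATE EXTERNAL TABLE my_dataset.trips_ext
        OPTIONS (
            format = 'PARQUET',
            uris = ['gs://my-bucket/trips/*.parquet']
        )
    """).result()

    # Query the files in place; no load job needed.
    rows = client.query("SELECT COUNT(*) FROM my_dataset.trips_ext").result()
    print(list(rows))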


Can we get any of Iceberg/Delta/Hudi that isn't terribly complex to set up? Like configurable completely from Standard SQL via the CREATE EXTERNAL TABLE syntax.


Great to see this combination of tech shared here, and I'm very interested to see how others are ingesting spatial data at scale.


Anything to go the other way? I'd like to use BQ to warehouse the data and be able to examine it, but PG to do the heavy analytics, due to the cost once you really start running many repeated queries.

I guess I could just dump directly to CSVs and download but BQ is a nice convenient bottomless data bucket.
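
For the dump-to-CSV route, a rough sketch of both ends: EXPORT DATA on the BigQuery side and COPY on the Postgres side. All names here are hypothetical; assumes google-cloud-bigquery and psycopg2, with the exported shards downloaded in between (e.g. via gsutil cp):

    from google.cloud import bigquery
    import psycopg2

    # BQ side: export the query result to GCS as CSV shards.
    bigquery.Client().query("""
        EXPORT DATA OPTIONS (
            uri = 'gs://my-bucket/export/trips-*.csv',
            format = 'CSV',
            overwrite = true,
            header = true
        ) AS
        SELECT * FROM my_dataset.trips
    """).result()

    # PG side, after downloading a shard: bulk-load it with COPY.
    conn = psycopg2.connect("dbname=analytics")
    with conn, conn.cursor() as cur, open("trips-000000000000.csv") as f:
        cur.copy_expert("COPY trips FROM STDIN WITH CSV HEADER", f)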


Do you already have an ETL pipeline to BQ? You could use Airbyte [0] for PG to BQ. It's pretty heavy if this is your only source, but should work well if you need to query the PG data in BQ frequently.

0 - https://airbyte.com/connections/PostgreSQL-to-BigQuery


They said "the other way", so BQ to Postgres.


BigQuery and Postgres are both sources and destinations in Airbyte; the data can flow either way.


Airbyte looks nice! Thanks.


You can use BI Engine for this exact purpose: https://cloud.google.com/bigquery/docs/bi-engine-query


Does anyone know what tools can be used to stream the result of a MongoDB query into BigQuery?




Thank you!


Nice to read this. I had a similar type of assignment 15 years ago: visualize the rollout of the fiber-optic network across the city. But we had a lot less data to deal with.


Is there a reason not to use federated queries to hit Postgres directly?


Federation is slow.

(I was the PM for BigQuery federation with GCS and Cloud SQL.)
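
For anyone unfamiliar, federation here means BigQuery's EXTERNAL_QUERY, which pushes a query down to a Cloud SQL Postgres instance through a connection resource. A minimal sketch, with a hypothetical connection ID and table:

    from google.cloud import bigquery

    client = bigquery.Client()
    # The inner query runs on the Cloud SQL Postgres instance itself.
    rows = client.query("""
        SELECT *
        FROM EXTERNAL_QUERY(
            'my-project.us.my-pg-connection',
            'SELECT id, name FROM trips LIMIT 10'
        )
    """).result()
    print(list(rows))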



