
Very much agree with this general idea, and I believe a lot of it was inspired by the team we hired at Crunchy Data to build it, as they had been socializing the idea for a while. Looking forward to pg_duckdb advancing; for now it still seems pretty early and has some maturing to do. As others have said, it needs to be a bit more stable and production grade. But the opportunity is very much there.

We recently submitted our benchmark results (for Crunchy Bridge for Analytics, at the broadest level based on the same idea) to ClickBench by ClickHouse (https://benchmark.clickhouse.com/), which puts us at #6 overall among managed service providers and gives Postgres a real, viable option as an analytics database (at least per ClickBench). Also of note, there are a number of other Postgres variations, such as ParadeDB, that are definitely not 1000x slower than ClickHouse or DuckDB.


Hey Craig, for the public record: pg_duckdb was not inspired by the team at Crunchy Data. Our early MVP version, "pg_quack", was made public (Apache 2.0) on February 2nd. About two months later, Crunchy's analytics product shipped on April 30th. If you were working on it around a similar time, it was a coincidence. Let's call it a case of great minds thinking alike.


Craig fan here; agreed, it's the zeitgeist, and I'm loving the PG ecosystem.


I just did a project for a YC startup, and we reverted to Postgres from DuckDB + SQLite over concerns that enterprises might not see the local-file combo as mature/professional.

Really excited about the idea of being able to have everything under the postgres umbrella even with sacrifices.

From the engineering side I have nothing but good things to say about duckdb.

I opened up the database to the frontend (it's an internal reporting tool, not unlike Grafana, and I filtered queries through an allowlist), and it was pure delight to have the metrics queries right next to the graph. Very rapid iterations.


As Craig said, Crunchy has a very mature, enterprise-grade offering for analytics in Postgres and is very much leading the charge here. ParadeDB is built in a similar way, also ranks high on ClickBench, and is available in the open source as well.

I'm hopeful the pg_duckdb project will mature enough to be a stable foundation for ParadeDB and others, but that appears to come down to MotherDuck and how much they're willing to push it forward.


ParadeDB chose the GPL, so I could see pg_duckdb accelerating past them. But then you never know; either of them could change the license at any time.


ParadeDB itself is AGPL, yes. Our core offering is pg_search, which offers Elasticsearch inside Postgres. What we build will be AGPL, and if pg_duckdb moves forward we will be happy to rebuild our analytics offering on top of it.


Hey Phil, the blog post says pg_duckdb is being taken forward by DuckDB Labs, Hydra, MotherDuck, Neon, and Microsoft Azure. We're fully invested in developing pg_duckdb, and I'm happy to work collaboratively. Do you have something valuable to add to pg_duckdb?


There is a lot missing from it, as you know. We'd be happy to be part of the project if we get commit access, or even a partnership :)


Are you guys planning to open-source your work at Crunchy?


I love when friends do this. It's hard to keep up with people and what they're up to. Publishing and letting people subscribe to me is a great way to share things. A few examples of some friends who are doing this:

Justin Searls (fairly well known in the Ruby and Rails community) quit most of his various social channels, though he still publishes one-directionally on some of them. He started a podcast that wasn't meant to have guests or some specific topic; it's just him updating you on things: what he's working on, what he's learning, random stories, etc. - https://justin.searls.co/casts/

Brandur, who I've worked with at a couple of places (Heroku previously, and now Crunchy Data), writes great technical pieces that often end up here and also has more of a personal newsletter. While there are technical pieces in there at times, he'll also talk about personal experiences; my favorite is one on some of his unique experiences hiking the Pacific Crest Trail (https://brandur.org/nanoglyphs/039-trails).


This gives me heart. I like writing about technical things, but I also like writing about personal things, concerts I went to, whatever. I'm a whole person, and I never liked the pressure (mostly from social media) to build your "brand" around one genre or style of writing. For me, my site is a personal one where I post about things I'm interested in. Ham radio, machine learning, my travels, pay phones, whatever. Maybe less useful for a reader or audience building but...I just like to write and share things.


For many it has supplanted social media: IG, TG, even TikTok (shudder) channels. It monetizes the same motivation.


This is very much why we built the Postgres playground, which has Postgres embedded in your browser with guided tutorials - https://www.crunchydata.com/developers/tutorials


At first glance this would be much closer to Neon, with the separated storage and compute.

Crunchy Postgres for Kubernetes is great if you're running Postgres inside Kubernetes, but it is more standard Postgres than something serverless. Citus isn't really serverless at all either; Citus is more focused on performance scaling, where things are very co-located and you're outgrowing the bounds of a single node.


It's not DataFusion; it's much more custom, with a number of extensions underlying the various pieces. And we've got a number of other extensions in the works. To the user it's still a seamless experience, but we've seen that smaller extensions that know how to work together are easier to maintain. For example, we're working on a map type extension that knows how to understand the map types within Parquet files from within Postgres. In time we may open source some of these pieces, but we don't have a time frame for that, and it's case by case for each of the extensions.


It's a custom extension, and actually a number of custom extensions, with quite a few more planned to further enhance the product experience. All the extensions work together as a single unit to compose Crunchy Bridge for Analytics, but under the covers it's lots of building blocks working together.

Marco and team were the architects behind the Citus extension for Postgres and have quite a bit of experience building advanced Postgres extensions. Marco gave a talk at PGConf EU on all the mistakes you can make when building extensions and the best practices to follow, so in short, quite a bit has gone into the quality of this vs. a quick one-off. Even in the standup with the team today it was remarked, "we haven't even been able to make it segfault yet, which we could pull off quite quickly and commonly with Citus".


Do you have a link to the slides or the video of that presentation? I found this, but no links: https://www.postgresql.eu/events/pgconfeu2023/schedule/sessi...


Hmm, let me see if we have a link to slides. There was no video recording unfortunately but we can definitely get slides posted.


How do AWS credentials get loaded? Is it a static set populated in the CREATE SERVER or can it pull from the usual suspects of AWS credential sources like instance profiles?

Is the code for the extension itself available?


The credentials are currently managed via the platform, so you enter them in the dashboard. We wanted to avoid specifying credentials via a SQL interface, because they can easily leak into logs and such. We'll add more authentication options over time.
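
To illustrate the concern, here's a minimal sketch of the kind of pattern we avoid, using generic foreign data wrapper syntax (the wrapper, server, and option names are made up for illustration):

    -- Credentials passed as SQL options can end up in server logs,
    -- pg_stat_activity, and shell/psql history.
    CREATE SERVER my_lake
      FOREIGN DATA WRAPPER some_s3_wrapper  -- hypothetical wrapper name
      OPTIONS (region 'us-east-1');

    CREATE USER MAPPING FOR app_user
      SERVER my_lake
      OPTIONS (access_key_id 'AKIA...', secret_access_key '...');  -- leaks if logged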


How does the extension get access to them? Is there some other “master” token for the Crunchy PG server itself that is used to fetch the real token?

The extension is not FOSS, right?


There is coordination between the Crunchy Bridge control plane and the data plane, which the extension is then aware of.

At this time it's not FOSS. We are going to consider opening some of the building blocks in time, but at the moment they are pretty tightly coupled both to the other extensions and to how Crunchy Bridge operates.


> Crunchy Bridge for Analytics

What exactly is "Crunchy Bridge for Analytics"? Is it some hosted cloud infra? Can I not install it locally as an extension?


Crunchy Bridge is a managed PostgreSQL service by Crunchy Data available on AWS, Azure, and GCP.

Bridge for Analytics is a special instance/cluster type in Crunchy Bridge with additional extensions and infrastructure for querying data lakes. Currently AWS only.
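
As a rough sketch of what querying a data lake from Postgres looks like (the server name and options here are illustrative assumptions, not exact documentation):

    -- Point a foreign table at a Parquet file in S3, then query it with
    -- plain SQL alongside regular Postgres tables.
    CREATE FOREIGN TABLE events (
      created_at timestamptz,
      user_id    bigint,
      action     text
    )
    SERVER analytics_lake  -- hypothetical server name
    OPTIONS (path 's3://my-bucket/events.parquet');

    SELECT date_trunc('day', created_at) AS day, count(*)
    FROM events
    GROUP BY 1
    ORDER BY 1;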


Fully agree with this sentiment; it's very much our focus and goal at Crunchy Data, with one big thing I'd add: great support.

I recall seeing them crop up in the early days of building and running Heroku Postgres; they were a very, very early managed service provider. To my knowledge they never seemed to grow to massive scale, but they were a steady business (though I don't know any of the details for sure). That they were still around after more than a decade is more of a testament than a lot of others can claim.


Love the callout to Dataclips. It was easily my favorite feature, though among the least used by Heroku customers. Blazer and PgHero both drew a bunch of inspiration from some of the early things we built at Heroku, and it's amazing having Andrew crank out so many high-quality projects to make some of that tooling more broadly available.


Shameless plug, but we aim to get pretty close to this on Crunchy Bridge (our hobby-0 with 2 vcores starts at $10 a month) - https://www.crunchydata.com/pricing/calculator.


Marco (the author) is probably asleep at this point and could give a deeper perspective. He sort of hits on this when talking about disk latency... Depending on your setup, and just from some personal experience, I know it's not crazy for Postgres queries to run at 1 ms per query. From there you can start to do some math on how many cores, how many queries per second, etc.
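
To make that math concrete (a back-of-the-envelope sketch, assuming simple queries that each take about 1 ms): a single busy backend tops out around 1,000 queries per second, so a 32-core machine with a connection pooler keeping every core busy lands on the order of 30k simple queries per second before locks, disk, or network get in the way.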

Single-node Postgres (on a beefy machine) can definitely manage on the order of 100k transactions per second. When you're pushing from the high 100k range into the millions, read replicas are a common approach.

When we're talking transactions, there's a question of whether it's simply basic queries or bigger aggregations, and whether it's writes or reads. For writes, if you can manage any form of multi-row insert or batching with COPY, you can push basic Postgres really far (see the sketch below)... From some benchmarks, Citus, as mentioned, can safely hit millions of records per second with those approaches, and even without Citus you can get pretty high write throughput.
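
A minimal sketch of those two write paths (the events table and its columns here are hypothetical):

    -- Multi-row INSERT: many rows per statement means one round trip
    -- and one transaction instead of one per row.
    INSERT INTO events (user_id, action) VALUES
      (1, 'login'),
      (2, 'click'),
      (3, 'logout');

    -- COPY: Postgres's fastest built-in bulk-load path. This form reads a
    -- file on the server; from psql, \copy does the same from the client side.
    COPY events (user_id, action) FROM '/tmp/events.csv' WITH (FORMAT csv);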


The "disappointing" benchmark mentioned in the article is a shame for GigaOm who published it and for Microsoft who paid for it. They compare Citus with no HA to CockroachDB and YugabyteDB with replication factor 3 Multi-AZ, resilient to data center failure. And they run Citus on 16 cores (=32 vCPU) and the others on 16 vCPU. But your point about "beefy machine" shows the real advantages of Distributed SQL. PostgreSQL and Citus needs downtime to save cost if you don't need that beefy machine all days all year. Scale up and down is downtime, as well as upgrades. Distributed SQL offers elasticity (no downtime to resize the cluster) and high availability (no downtime on failure or maintenance)


RE: "Distributed SQL offers elasticity (no downtime resize"). I'm not sure this is as much of an advantage of distributed databases vs single host databases anymore. Some of the tech to move virtual machines between machines quickly (without dropping TCP connections) is pretty neat. Neon has a blog post about it here[1]. Aurora Serverless V2 does the same thing (but I can't find a detailed technical blog post talking about how it works). Your still limited by "one big host" but its no longer as big of a deal to scale your compute up/down within that limit.

[1] https://neon.tech/blog/scaling-serverless-postgres


A second yes to that: warm PostgreSQL with plenty of RAM can do some fancy things and return an answer sub-millisecond too.

cache is King


But a large cache is expensive in the cloud, and you cannot scale up/down without downtime.


4 TB of RAM is only $71 per hour on AWS RDS. If you're at planetary scale, that's not bad.
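(At ~730 hours a month, that works out to roughly $52k/month, for anyone doing the math.)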

