Started down the path of using this when it was in beta, but had to abort when we saw there was no option to connect to it from Python App Engine Standard.
Now that it's GA...it looks like that hasn't changed. Is the classic Python App Engine standard environment becoming a second-class citizen? Or was there some reason why this wasn't considered GA-worthy for Postgres?
Trying to understand if going forward Google is trying to push everyone to the flexible environment or not - as I would have really expected connectivity between these two products.
And no, appengine standard is not a second class citizen. Hand-wave-ily, the connectivity path that flex uses works for postgres with minimal changes, but unfortunately some additional work is required to get appengine standard for other languages working for postgres. :(
Last I checked it wasn't possible to whitelist internal IPs (e.g. Kubernetes nodes or VM instances) to access Cloud SQL instances at all -- the options are either to use the non-standard cloud SQL proxy sidecar app, or allow connections from all endpoints (public or private).
This seems like a major omission, and AWS has had this for ages.
Ah, misremembered exactly what the issue was -- you're right, individual endpoints can be whitelisted. Internal networks cannot, which is what I (or anyone else using GKE) would need, since node IPs are ephemeral.
I believe the same issue would apply to VM instances that are not pets (in auto-scaling groups, for example), since I'm not aware of being able to auto-assign static IPs there either.
There is also a third option: a small pod listening for node changes on the Kubernetes API that whitelists the IPs on Cloud SQL. I have been using this for two years now.
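The core of such a pod is a simple reconciliation loop. A minimal sketch of just that logic (the function names and patch shape are illustrative; a real pod would feed it node IPs from a Kubernetes watch and send the patch via the Cloud SQL Admin API):

```python
# On every node change event, recompute the set of node IPs and diff it
# against the instance's authorizedNetworks list; patch only on change.

def desired_networks(node_ips):
    """Turn a set of node IPs into sorted /32 authorized-network entries."""
    return sorted(f"{ip}/32" for ip in node_ips)

def reconcile(node_ips, current_networks):
    """Return a settings patch to apply, or None if already in sync."""
    desired = desired_networks(node_ips)
    if desired == sorted(current_networks):
        return None  # nothing to do; avoid needless instance updates
    return {"settings": {"ipConfiguration": {"authorizedNetworks": [
        {"value": cidr} for cidr in desired
    ]}}}
```

Keeping the diff check cheap matters because patching a Cloud SQL instance is a slow, serialized operation, so you only want to issue it when the node set actually changed.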
I wouldn't say second class citizen (just yet) but the docs and Googlers have been gently nudging people to the flex environment. It can do everything the standard one does and more, so there's really very little reason to stick around on standard.
Is point-in-time recovery available now in GA? I had checked a couple of weeks ago for the beta service and it was not available. I think for a managed DB offering, point-in-time recovery is a critical feature.
[I'm the Cloud SQL TL]
No, it isn't. We agree with you that it's an important feature for managed databases, and we're working to get it right. We decoupled it from this launch to get PostgreSQL to GA faster.
Unrelated to pg but could you badger the spanner team to make a mini spanner product :)
Related to Postgres: we have many, many concurrent connections, but a load that an n1-standard-4 satisfies at the moment. Do you recommend a connection pooler or something similar to help us get down to the 100-200 connections we need to be at to use Cloud SQL?
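The usual answer for this shape of workload is PgBouncer in transaction-pooling mode between the app and the database, so thousands of client connections funnel into a small server-side pool. A minimal sketch (the host, database name, and pool sizes are placeholders, not Cloud SQL specifics):

```ini
; pgbouncer.ini -- illustrative values only
[databases]
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; many client connections share few server connections
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 50
```

One caveat: transaction pooling breaks session-level features such as named prepared statements and session advisory locks, so check your driver settings before switching it on.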
Does anyone have insight or experience using this in production? We're currently running PostgreSQL 10 w/ pg_partman on our own hardware but looking at several options for cloud migration. Unfortunately, Citus Cloud on GCP doesn't appear to be an option (yet?)
We've been running a production workload on Postgres/Google Cloud SQL for about half a year now.
While things are good for the most part, a couple of serious problems related to connectivity have us completely boggled. We're connecting from Google Kubernetes Engine, which seems like it should be a standard combination, but run into constant problems that we've dumped many many hours into debugging.
We still haven't figured this problem out. I've found the docs to be very weak on Google's part. A lot of the troubleshooting tips are not very helpful (and can consist of unhelpful broad strokes like "be sure to use indexes!"). Because Google Cloud is not as popular as AWS, there is less community guidance from others. And what guidance does exist is often in forum threads that feel less than reputable. There's a big push to get you to talk to sales reps who are not technically knowledgeable and just try to upsell.
Very frustrating. Unclear if moving back to AWS, or hosting our own Postgres, would help.
[I'm the Cloud SQL TL] Note that we currently only support PostgreSQL 9.6. Obviously supporting major versions across both MySQL and PostgreSQL is a priority for us.
Hi, Craig from Citus here. If you're interested in Citus being available on other infrastructure providers aside from AWS as a fully managed service, please feel free to reach out to me directly: craig at citusdata.com.
Latest software (using v10.3), better performance (NVMe SSDs), better backups (point in time, instant cloning), better features (more extensions, cross-region replication even across different clouds), better flexibility (migrate master across different clouds), better monitoring (logs and Datadog metrics export), and more focused support with a smaller team.
Yes, I know. This thread is about managed database services, specifically about Aiven vs cloud-direct. Aiven still runs VMs on AWS or GCP but offers NVMe disks while neither RDS nor Cloud SQL have that available.
For our needs, yes. They run on cloud VMs so there will be a markup, but their startup-4 and higher plans on GCP use local NVMe SSDs so we get much better performance for the price.
We also make use of cross-regional replicas and are looking at doing it across clouds so if you want that then there isn't any other option other than doing it yourself. It's more of the complexity of this deployment rather than raw db size for us so if you have several TBs then maybe it's not the best fit.
I would appreciate if you could list some of those differences and tradeoffs, for some of us who are interested in Citus but haven't yet had time to look at it in more detail. Thanks!
We actually just migrated off yesterday... we went with an independent vendor like Aiven instead of the clouds because they move too slowly and don't have enough features.
Like aiven or aiven? :) I’m not aware of too much competition in this space. We’re happy with aiven so far, but they’re quite unknown / under the radar it seems...
They don't have a great story around extensions - the ones they do have are unsupported/buggy. For example, PostGIS is missing "ST_GeomFromGeoJSON" because it was compiled with the wrong flag - and has been this way for over a year despite hundreds of user complaints.
As someone who was migrating from on-premises to gcp, and who needs extensive postgis support, this information is a deal breaker for me. Is there any place I can find more information about postgis and other extensions status? Are you aware of any other bug related to postgis?
They're all the same quality. Sometimes great, sometimes terrible, but will get the job done as long as you put in the effort. Also bigger customers will obviously get more.
At the least, they were waiting for timeouts to appear in PLV8, which came in at 2.3.1. I do not know the current status of it being brought into Cloud SQL though.
My understanding is people are able to run and manage Citus themselves on GCP (not Cloud SQL), but Citus Cloud (the managed solution) is only available in AWS.
I'm looking around for more details on these "regional" disks that replicate between two zones at the block level. Is that just a fancy term for OS-level mirrored disks using the cloud persistent disks?
Block device based replication for Postgres seems a bit unconventional given that Postgres has native synchronous replication support with WAL streaming.
Intuition tells me that you might get better performance if you let the DB itself do the replication but I can't really justify that without real review of what happens.
The postgres docs (https://www.postgresql.org/docs/10/static/different-replicat...) say that the WAL solution has no "Master server overhead" in contrast to the File System Replication solution, but it's not explained and I'm not sure what is meant by that.
I guess with a block device based solution, recovery takes longer: failover entails actually mounting the block device (since no two machines can mount it read-write at the same time) and then starting the DB (or, in a more basic implementation, booting the entire second machine as part of failover), while with WAL streaming both Postgres instances would already be running. So failover would be faster with WAL streaming?
It would be great if somebody from GCP could elaborate on what the tradeoffs here are, how long failover takes, and whether we can expect similar performance and behaviour as with WAL shipping.
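For comparison, native synchronous WAL streaming on vanilla Postgres is configured roughly like this on the primary (a sketch for self-managed Postgres 10; Cloud SQL doesn't expose these knobs):

```ini
# postgresql.conf on the primary (illustrative minimal settings)
wal_level = replica
max_wal_senders = 3
# wait for the standby to confirm WAL flush before commit returns,
# giving zero-data-loss failover like the block-level approach
synchronous_commit = on
synchronous_standby_names = 'standby1'
```

The key operational difference is the one noted above: with streaming replication the standby is a live, running Postgres instance, so failover is a promotion rather than a mount-and-recover.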
Amazon's Aurora Postgres database does a similar thing: your master in one zone replicates to a disk that is in all the other zones. Unlike normal Postgres RDS instance it also auto-scales storage to what you use.
Amazon claims better scaling than ordinary Postgres for this.
Just speculating but it’s possible the block level is faster because it’s replicated over a dedicated and optimized SAN rather than (potentially) contending with normal network traffic. I assume the database state would only be crash consistent though.
A regional disk is a logical disk that is synchronously replicated at the block level across exactly two zones within the same region [1]. Since the disks are always identical, with no replication lag, the HA control plane can seamlessly fail over the whole database to a new master that plugs into the same disk. It's all in the article.
Regional disks aren't publicly available yet, but they are in alpha [2]. Like normal persistent disks, everything is backed by Google's internal Colossus system [3].
Has anyone used the beta and got any feeling for how maintenance downtime impacts things? A bit nervous about how you can only set a "maintenance window" and not be able to plan ahead for disruption; as far as I can tell, they won't even tell you ahead of time. The HA seems really solid (zero-lag "regional disks"), but it's still a bit disconcerting.
The updates take the entire instance down for 2-5 minutes each month. While you can't avoid them, they can be scheduled for particularly low-traffic times. If you're trying to avoid downtime, it's a giant PIA. Even with HA enabled, you still lose master, slave, and read replicas. Not entirely sure what they define HA as, but a mandatory monthly downtime doesn't usually fit into mine.
[Update]
That said, from what I understand, they have a road map to maintaining read replicas and queued writes. Not sure what the date on it is though.
[I'm the Cloud SQL TL] I can't comment on timelines, but we're aware that customers are interested in more features around maintenance window scheduling, deferral, and notification, as well as shorter downtime for updates and smarter scheduling within a group of replicas.
[I'm the Cloud SQL TL] Confirmed. We know it's a problem that we need to fix. HA reduces downtime in unexpected failure cases (live migration for your primary only helps in planned shutdown cases, not if the physical machine fails), but doesn't currently help with maintenance-related downtime.
Unfortunately last time I used CloudSQL for MySQL it was incredibly unstable. They would take down our master AND standby at the same time for maintenance. When we filed a ticket they just said it was a known bug with no plans to fix.
A major client of mine migrated to AWS because of this and other issues.
I've been thinking about moving us to Google Cloud Platform. What I found in regards to maintenance here: https://cloud.google.com/compute/docs/regions-zones/#mainten... states that they do live migrations without any downtime. Can anyone elaborate? Is this only for Compute Engine? In that case, if one can run Postgres on a Compute Engine instance, why not do that instead? Surely, if one can set up a highly available Postgres cluster, Google can do updates without affecting uptime???
To be fair, we wouldn't use GCP for anything but virtual servers and storage replication... I have no desire to tie us to Google's infrastructure any more than necessary.
Were your master and standby in the same availability zone? Can't you set diff maintenance windows? WTF?
"Live migration" refers to how Compute Engine transparently migrates a VM to another physical host [1]. Disk and memory is copied over, and they have some ridiculous technology that keeps network connections alive and re-attaches them to the new VM when it's been switched over, so that it causes, in principle, zero disruptions. This is much more magical than other providers, such as AWS and DigitalOcean, where such a migration results in a reboot.
You can run PostgreSQL on a VM just fine. You just have to manage it yourself. Cloud SQL comes with some upsides (zero management, spectacular HA failover capabilities) and some downsides (lack of extensions, lives on a separate network, no control over maintenance window); you have to decide what you're willing to live with.
You can set the upgrade window, but it can't be predicted. What you can control is the order — e.g. set your staging instance to "early" and production instance to "late", then hopefully staging will be upgraded first and you'll know ahead of the production upgrade if any issues arise.
GCP has the best compute, storage, and networking of all the clouds. They are cheaper, faster, more scalable and more reliable than the others. Their managed services leave a lot to be desired (beta status, non-standard interfaces, and other limits) but if you're just looking to run VMs then that is the perfect fit for their cloud.
We consolidated everything on GKE now, which lets us use VMs but still have the Kubernetes control plane looking after things for us, which has been great so far.
Maintenance windows are set for the cluster, not single instances. We were distributed across 3 AZs and Google had no suggestions for mitigating the ~5 minutes of downtime we were seeing every week or two.
The whole experience was so amateur and unprofessional it really soured me on GCE. They do have some cool tech but it seems like their cloud division needs to mature a bit.
There is disruption yes. It's usually short however we always see retries in our logs for a few minutes. Our app doesn't need perfect uptime though and we haven't tried the HA setup.
I know that it's definitely not going to be 100% the same (especially since Spanner doesn't even support SQL DML right now), but I think a drop-in replacement into a managed autoscaling database is a really nice alternative to manual sharding.
Right now basically the options are Aurora, Citus, and running CockroachDB yourself.
Last time I was at Google for a workshop (If you have the chance to visit Google, do it. The food alone is worth it), they didn’t seem to push CloudSQL a lot, because they wanted to guide people more in the spanner direction.
Without a solid RDS counterpart, however, I don't think bigger companies will consider moving from AWS. Happy to see they changed their mind and continue to expand their SQL services. The competition from Google put a lot of pressure on AWS, who seemed to have gotten a bit lazy. Google was ahead of the game with their global load balancers and network speed and quality. Now AWS has countered with their 5th-series C5/M5 instances, which solve the bandwidth problem of the smaller C and M instances.
I asked that yesterday in a meeting with my gcloud rep. They said all work has gone into getting to GA, and once that is done, look for them to start doing things like version updates, more new features, etc.
Since there seem to be some Googlers who work on Cloud SQL here, I wonder: is there any chance Cloud SQL will be available in asia-southeast1 soon? It's the only region (I believe) where Cloud SQL isn't available at all, and one of the main reasons we can't fully migrate from AWS to GCP just yet.
[I am a Cloud SQL-er, and worked on region expansion... among other things]
Cloud SQL has only been available in regions with at least three zones (since we believe that is the minimum to make sure we can maintain HA in the event of a single zone failure). asia-southeast1 currently only has two zones; when a third zone is launched, Cloud SQL will become available in that region.
When you say you couldn't get the root certs to work... what do you mean?
Cloud SQL automatically generates server certificates, and we offer UI+API for creating additional client certificates. The two should not share a root CA.
Yes, you can use both standard and SSD persistent disks. If you create a larger instance with more vCPUs and a big enough disk, you can achieve greater than 240 MB/s, see the docs: