Keycloak with PostgreSQL on Kubernetes

photonios · on April 10, 2023

If there's anyone reading this that is planning on deploying Keycloak in a high availability environment, I would highly recommend that you persist all sessions in the database as offline sessions.

At work, I ran 9 Keycloak clusters in production, handling tens of millions of sessions where the cost of losing sessions was high. The amount of time we wasted on getting it to work reliably with its default configuration of storing the sessions in its distributed, in-memory cache (Infinispan) is insane. It just isn't designed to handle such a workload reliably. Unless you're willing to spend months tuning it for every possible scenario, you WILL lose sessions.

If you are in this situation, shoot me an email. I have been through this pain and it took a lot of painstaking work to get to a highly reliable setup at scale.

thegreatunknown · on April 11, 2023

Hi, You might want to take a look at the new storage in keycloak[1].

Newer keycloak versions (19 and up) have a configurable storage for the auth sessions (see storage-area-auth-session and storage-area-user-session). I haven't checked them but the documentation is promising.

For older versions (the last one I checked was Keycloak 15) you might want to use offline sessions, but they don't allow SSO after the auth session has been evicted from Infinispan.

1 - https://www.keycloak.org/2022/07/storage-map.html

zakokor · on April 11, 2023

> I would highly recommend that you persist all sessions in the database as offline sessions.

Please! Post it, thanks

paulmd · on April 11, 2023

Was it something about Infinispan, or Keycloak?

We were wondering about Redis in a similar IAM use-case (PingFederate) but it wasn't officially supported, so we decided to just go with persistent Postgres. I wonder if we saved ourselves a bunch of heartache.

photonios · on April 11, 2023

We often experienced cascading failure, especially during rolling restarts. A node would start shutting down and Infinispan would start to try to rebalance. Due to the large volume of sessions, other nodes would start to become unresponsive and stop replying. Eventually, you'd end up in a situation where it would give up trying to shut the node down cleanly and just kill itself. That wouldn't be a big deal if you weren't doing a rolling restart. When the first node doesn't shut down cleanly, the data should be "safe" since it is replicated to at least N owners. In practice, the other nodes also get restarted, also shut down uncleanly and sessions are lost. Secondly, as the cluster became unresponsive, requests to refresh sessions would start to time out, which would also cause those sessions to be "lost" since they would eventually hit the maximum idle time.
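
To illustrate the "replicated to at least N owners" point, here is a toy simulation in Python. It is only a sketch under stated assumptions: it does not model Infinispan's actual hashing or rebalancing, and the node names, session count and `num_owners=2` are made up for the example. It shows why one unclean shutdown is survivable while several in a row (before a rebalance completes) are not.

```python
import random

def place_sessions(num_sessions, nodes, num_owners, seed=0):
    """Assign each session to `num_owners` distinct nodes.

    This is a stand-in for a distributed cache's placement logic,
    not Infinispan's real consistent-hashing algorithm.
    """
    rng = random.Random(seed)
    return {sid: set(rng.sample(nodes, num_owners)) for sid in range(num_sessions)}

def surviving_sessions(placement, killed):
    """A session survives if at least one owner was not killed
    before a rebalance could copy it elsewhere."""
    killed = set(killed)
    return {sid for sid, owners in placement.items() if owners - killed}

# Hypothetical four-node cluster with two owners per session.
nodes = ["kc-0", "kc-1", "kc-2", "kc-3"]
placement = place_sessions(10_000, nodes, num_owners=2)

# One unclean shutdown: every session still has a live copy.
one_down = surviving_sessions(placement, ["kc-0"])

# Two unclean shutdowns before rebalancing finishes:
# sessions owned only by those two nodes are gone.
two_down = surviving_sessions(placement, ["kc-0", "kc-1"])

print(f"survive 1 failure: {len(one_down)}, survive 2 failures: {len(two_down)}")
```

In a rolling restart where nodes keep dying uncleanly, the second case is the one you hit.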

As long as we didn't do any restarts, it would sort of work. Problems would pop up when, due to high load, one or more nodes became unresponsive and liveness probes restarted them. That would often cause the kind of cascading failure described above.

Most of these problems are also the result of running it in Kubernetes. We very quickly learned to remove the liveness probes and to massively increase the grace period. This helped, but only so much. We still had rather frequent failures similar to the one I just described.

Maybe if we hadn't run it in Kubernetes and had been more knowledgeable about Infinispan, we could've gotten a stable setup. As a small team without that specialized knowledge, we struggled to get there.

ilyt · on April 11, 2023

Ah, the infinite fun of managing distributed systems. I've seen similar failure modes in pretty much anything distributed. While in single-node systems a spike of traffic just makes things run slowly, cascading failures caused by latency plague most of the distributed ones.

Whether it's process management or just, say, a node having too little memory and spinning in GC too much.

Mixing app and DB (which I guess is happening here) can also be fun, as an overloaded app can now cause an overloaded DB. You'd probably be just fine if Infinispan was used as a remote database instead of an embedded one.

ArchOversight · on April 10, 2023

Do you have a blog post or something detailing what you did and how you did it?

rad_gruchalski · on April 10, 2023

I found this: https://www.janua.fr/offline-sessions-and-offline-tokens-wit.... janua.fr is a very solid Keycloak resource. The write up is for a pretty aged Keycloak version but there are probably some decent pointers in there.

photonios · on April 11, 2023

This article gets pretty close, but it misses a very critical piece. If you're running Keycloak 16 or older, you'll explicitly want to enable lazy offline session loading [0]. Otherwise, Keycloak will attempt to load ALL offline sessions in memory during startup.

Keycloak 17 made offline sessions lazily loaded by default.
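
For anyone unsure what the difference means in practice, here is a conceptual sketch in Python. It is not Keycloak's implementation; the class name, the `preload` flag and the dict standing in for the database are all hypothetical. It only contrasts loading everything at startup with loading on first access.

```python
class OfflineSessionStore:
    """Toy session store. `db` is any mapping of session id -> session data."""

    def __init__(self, db, preload):
        self.db = db
        # Eager: copy every session into memory at startup.
        # Lazy: start with an empty cache and fill it on demand.
        self.cache = dict(db) if preload else {}

    def get(self, session_id):
        if session_id not in self.cache and session_id in self.db:
            # Lazy path: fetch a single session the first time it is needed.
            self.cache[session_id] = self.db[session_id]
        return self.cache.get(session_id)
```

With tens of millions of rows, the eager variant pays the full load cost (time and heap) on every restart, while the lazy one only pays for sessions that are actually used.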

[0] https://www.keycloak.org/docs/16.1/server_admin/#offline-ses...

vsviridov · on April 10, 2023

As a possible alternative, I've recently started using Zitadel (https://zitadel.com/) which is a very full-fledged open source IDP, in active development.

dariusj18 · on April 10, 2023

This looks very interesting. I have recently tried using auth0 and was so horribly disappointed with how you go from 0 to enterprise as soon as you need any modern security feature. Plus, I had assumed that they had a mature product, but it is chaotic and it is difficult to know what you are doing for even the simplest use cases.

hardwaresofton · on April 11, 2023

Wow Zitadel looks absolutely amazing. All the features you want out of login, permissive license, good pl choice, easy to deploy, postgres-first.

Since I somehow can’t resist complaining about absolute manna from heaven:

The only issue I see is this reliance on event sourcing — I get the reasoning but I much prefer regular state saving + audit log approaches.

Event sourcing seems like a complexity and performance liability — does anyone have any insight on the implementation/why I am wrong about my misgivings?

vsviridov · on April 11, 2023

The amount of events in such system isn't going to be too crazy, unless it's some massive enterprise with thousands of principals, I would imagine...

It also seems cockroachdb first, but I'm glad i can use postgres. One fewer database to deploy and manage, and for my use case (basically myself and occasional friends and family) that's perfectly fine.

hardwaresofton · on April 11, 2023

> The amount of events in such system isn't going to be too crazy, unless it's some massive enterprise with thousands of principals, I would imagine...

Right, but it's like... why take that liability in the first place when you have a rock solid and extensible DB like Postgres under the hood.

Why not take the CQRS (good idea), but not go as far as full-on Event Sourcing, and just make sure you keep an audit table log or even executed operation log?

IMO in practice almost no one actually goes back in time with Event Sourcing. Also, there are so many things that can bite you that it just seems unnecessary.

I did some digging through the code, and I really wish they'd made a big DB interface and then made the event store an implementation of that. It looks like they did it the other way -- the default interface being the event store, and PG/CockroachDB being the underlying. It's a subtle difference but means a huge deal for actual swappability of backends.

https://github.com/zitadel/zitadel/blob/main/internal/events...

I have to say, the code is also REALLY confusingly laid out. I just want to find the grpc/http handler that does like "create a user". I've been searching and clicking around for 10s of minutes -- maybe I don't read enough go.

> It also seems cockroachdb first, but I'm glad i can use postgres. One fewer database to deploy and manage, and for my use case (basically myself and occasional friends and family) that's perfectly fine.

I think of cockroachdb as basically postgres-with-stuff-bolted-on (albeit very good stuff, cockroach seems awesome), so I still consider it postgres-first! :)

upcoming-sesame · on April 10, 2023

Ory Hydra / Kratos is another good one

https://www.ory.sh

vbezhenar · on April 11, 2023

Is there solid documentation or a tutorial on deploying hydra + kratos? That's what stopped me last year. I've found some subtle pointers in some buried github issues, but nothing definitive. Which surprised me, because one would have thought that's the main use-case for many people and should be documented well.

whilenot-dev · on April 11, 2023

there's the repository with examples: https://github.com/ory/examples

vsviridov · on April 10, 2023

I guess with Zitadel they don't paywall any features, and with the self-hosted option you get essentially the same thing as with hosted. I think you can probably even do multi-instance, however maybe without a management interface for that part of it... When I read Ory language, it says to me "you can try it out locally and integrate, but we want you in our hosted solution right after" (I could be wrong, I was just glancing casually...)

rad_gruchalski · on April 10, 2023

The Ory stack doesn’t hide any features behind a paywall. Their documentation is subpar, for sure.

It’s a decent product but setting things up can be a bit dull.

One has to thread multiple Github repos and their incomplete scattered documentation together.

But once it works, it works.

jonas-w · on April 10, 2023

How does zitadel compare to authentik (https://goauthentik.io/) or authelia (https://www.authelia.com/)?

vsviridov · on April 10, 2023

I don't think authelia has a UI, and it also has a mode where it integrates a bit more deeply with the routing mesh, to protect apps that do not do auth themselves. Authentik I've not looked into. It also seems that they differentiate self-hosted options into free and paid with different features...

mekster · on April 11, 2023

It took 2 minutes to read Authentik's site to find this clause in pricing FAQ.

> Features from the enterprise version are periodically moved to the open source version.

rad_gruchalski · on April 10, 2023

Any service mesh will integrate with any OpenID provider. It’s a token verification and a redirect.

zinclozenge · on April 11, 2023

Zitadel could be amazing, but as far as I can tell they don’t allow using your own UI screens, and it’s not obvious to me how you’d build a multi tenant SSO feature. They have the concept of organizations, but it’s not obvious to me how you’d route a user to the right login.

mffap · on April 12, 2023

You can enable Domain Discovery to route users to the correct organization. Or you can send a reserved scope with the auth request to select the organization. Building your own login UI will be available in a couple of weeks (https://github.com/zitadel/zitadel/issues/5015)

vsviridov · on April 11, 2023

Organizations are tied to domains, but yeah, that part is more confusing than the rest...

Instance also has a domain on top of that, but there are plans for a "simple" mode, assuming single org.

zinclozenge · on April 11, 2023

I mean nothing wrong with using a domain for routing, but I think most SaaSes would rather have a routing based on the email address.

mffap · on April 13, 2023

That's basically what it does. You can activate Domain Discovery and verify a domain on an organization; with that, Zitadel routes users to the organization based on the suffix (i.e. the email domain).
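
The suffix-based routing described here can be sketched roughly as follows. This is an illustration in Python, not Zitadel's code: the function name, the mapping of verified domains to organization ids, and the example domains are all hypothetical.

```python
def organization_for(login_name, verified_domains):
    """Return the organization whose verified domain matches the
    login name's suffix, or None if there is no match.

    `verified_domains` maps a lower-case domain to an organization id.
    """
    _, separator, domain = login_name.rpartition("@")
    if not separator:
        # No "@" in the login name, so there is no suffix to route on.
        return None
    return verified_domains.get(domain.lower())


# Hypothetical data: two organizations, each with one verified domain.
domains = {"acme.example": "org-acme", "globex.example": "org-globex"}
print(organization_for("alice@acme.example", domains))
```

A login name that matches no verified domain returns `None`, at which point a real system would fall back to some default login flow.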

zinclozenge · on April 13, 2023

Thanks for clarifying, I must have missed it in the docs. If you see this comment, I'm wondering if this discovery functionality will also be customizable when the custom UI screens feature gets added?

ahachete · on April 10, 2023

This is a good and interesting recipe to get Keycloak and Postgres on Kubernetes.

There is an important improvement, though: the Postgres deployed here is not production ready (high availability, backups, monitoring, etc).

We run Keycloak on StackGres [1] which gives us production-ready Postgres setup (disclaimer: it's dogfooding). Happy to share the YAML manifests used to deploy Keycloak with StackGres. Maybe we will write a blog post as a follow-up to this one, for completeness.

[1]: https://stackgres.io

rad_gruchalski · on April 10, 2023

> There is an important improvement, though: the Postgres deployed here is not production ready (high availability, backups, monitoring, etc).

Another omission is that one could use a Keycloak operator instead of rolling custom YAML.

bbu · on April 11, 2023

I gotta say the (new) keycloak operator is super basic and doesn't really support changes in image tag. it always assumes a keycloak upgrade and will automatically scale down your keycloak to 1 instance to do an "upgrade". of course this will overload that one node, it will crash, and all your sessions are gone. I'm not sure if anybody is actually using keycloak & kc-operator in production on kubernetes. the state of the documentation and guides make it look like it's an abandoned product.

ahachete · on April 11, 2023

Yes, absolutely.

slig · on April 11, 2023

How do you compare StackGres to CrunchyData's `postgres-operator`?

ahachete · on April 11, 2023

I don't want to go too offtopic on this one --feel free to join StackGres Slack Community [1] to discuss further.

As a one-liner, though, for completeness: StackGres is fully open source (unlike Crunchy that needs a license for production); comes with a Web Console; 150+ Postgres extensions (including Timescale, Citus and many others); and many Day 2 operations fully automated.

[1]: https://slack.stackgres.io

slig · on April 11, 2023

Thank you, I will join.

vbezhenar · on April 11, 2023

I miss the ability to use some kind of GitOps with Keycloak. There's a Terraform plugin, but I hate it (because of state). I wish there was some kind of config file which Keycloak would read at startup and create/update/delete its resources according to it. I know that I can initialize a realm with JSON (with unreadable structure), but I can't maintain a realm with a config file.

skhro87 · on April 11, 2023

you can try keycloak-config-cli https://github.com/adorsys/keycloak-config-cli we've been using it in production for 2 years and it works well! we are running it as part of our CICD, to sync settings to all Keycloak realms. As the tool supports variable substitution, it makes it quite flexible. The config file it uses is basically the same realm.json you can export from Keycloak, so it doesn't re-invent the wheel.
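
The general idea of a realm file with variable substitution looks roughly like the Python sketch below. To be clear about assumptions: this uses Python's `string.Template` and its `${NAME}` syntax purely for illustration and is not keycloak-config-cli's own placeholder syntax or code; the variable names and URLs are made up. The JSON keys (`realm`, `enabled`, `clients`, `clientId`, `redirectUris`) are the ones found in a realm export.

```python
import json
from string import Template

# One template shared by every environment; only the variables differ.
REALM_TEMPLATE = """
{
  "realm": "${REALM_NAME}",
  "enabled": true,
  "clients": [
    {"clientId": "web", "redirectUris": ["${WEB_BASE_URL}/*"]}
  ]
}
"""

def render_realm(template_text, variables):
    """Substitute variables and parse the result as JSON.

    `substitute` raises KeyError for a missing variable, so a
    half-rendered realm never reaches the server.
    """
    rendered = Template(template_text).substitute(variables)
    return json.loads(rendered)


staging = render_realm(
    REALM_TEMPLATE,
    {"REALM_NAME": "staging", "WEB_BASE_URL": "https://staging.example.com"},
)
print(json.dumps(staging, indent=2))
```

The rendered document is what a sync tool would then push to the server from CI.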

vbezhenar · on April 11, 2023

It looks promising, thanks!

pat2man · on April 11, 2023

Have you tried the operator? https://www.keycloak.org/operator/advanced-configuration

vbezhenar · on April 11, 2023

I looked at it but I don't see the features I'm talking about. I can configure pod myself, so I don't understand why would I need this operator.

hotpotamus · on April 10, 2023

I've been down this road a bit, though actually in Docker Swarm. One aspect I spent a lot of time digging into was running multiple keycloak containers with shared cache. On metal or a VM with multicast, they'll find each other no problem, and it works beautifully, but I'm not aware of any container orchestration that brings multicast out of the box (and I don't think AWS does either). Keycloak has a built in Kubernetes DNS discovery mechanism to find its peer containers and share cache which also worked quite well on Swarm, though I lost a day or two tweaking it.
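
Conceptually, DNS-based discovery boils down to resolving one name that returns an address per peer (a headless Service on Kubernetes, or the `tasks.<service>` name on Swarm). Here is a rough Python sketch of that idea. It is not what Keycloak or JGroups actually runs; the function name is hypothetical and the default port of 7800 is an assumption about the cluster transport.

```python
import socket

def discover_peers(service_name, port=7800, resolver=socket.getaddrinfo):
    """Resolve a DNS name that has one A record per peer and return
    the unique peer IP addresses, sorted for stable output.

    `resolver` is injectable so the function can be exercised
    without a real DNS server.
    """
    infos = resolver(service_name, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address.
    return sorted({info[4][0] for info in infos})
```

For example, `discover_peers("keycloak-headless.default.svc.cluster.local")` would list the pod IPs behind a headless Service of that (hypothetical) name.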

vbezhenar · on April 11, 2023

Yes, Keycloak cluster works fine on Kubernetes. It takes some time to read all the docs and understand things, but nothing outrageous, that was my experience at least.

rad_gruchalski · on April 10, 2023

AWS supports multicast in VPCs.

hotpotamus · on April 10, 2023

Curious - I've seen several references that it doesn't support it, and that keycloak has a dedicated ec2 cache discovery option. But I don't use AWS anyway, so I'm far from knowledgeable about it.

rad_gruchalski · on April 10, 2023

https://aws.amazon.com/blogs/networking-and-content-delivery...

rubentanlz · on April 11, 2023

Hopefully this doesn't come across as off-topic but the "smooth scrolling" or whatever that is that hijacks the normal scroll behavior is throwing off my scrollwheel, making the website nigh impossible to navigate. Only way I can scroll properly is by clicking on the scroll bar and dragging it up and down manually.

toomanydoubts · on April 11, 2023

Yes. As a rule of thumb, unless you're making some kind of experiment/poc/artistic work please don't hijack the native browser scroll ever. Hate that this still has to be said in 2023. We like and (maybe) configured the scroll on the OS to our own taste, it's pointless to try and impose a slower, inferior and bug-ridden non-native implementation of the scroll just to add some kind of "smoothness".

brakmic · on April 11, 2023

Hi, OP here.

Thanks for the hint.

I just disabled smooth scrolling. Sorry, not a sophisticated designer guy, using wordpress plus some UI themes.

Regards,

rubentanlz · on April 11, 2023

Thanks, the article content is very useful for me as we're using k8 and keycloak together right now.

Too · on April 11, 2023

I may be wrong but is it not preferable to use StatefulSet for databases, rather than PV, PVC and Deployments?

You will not be able to scale anything up anyway since it’s a single instance mounting the same data.
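
For reference, a single-instance Postgres StatefulSet could look roughly like the sketch below. It is written as a Python dict dumped to JSON (which `kubectl apply -f` accepts) and is not the article's manifest. The names, image tag, storage size and the `postgres-secret` Secret are assumptions, and a headless Service with the same name is assumed to exist.

```python
import json

def postgres_statefulset(name="postgres", storage="10Gi"):
    """Build a minimal single-replica Postgres StatefulSet manifest."""
    labels = {"app": name}
    container = {
        "name": "postgres",
        "image": "postgres:15",  # assumed tag, pin whatever you actually run
        "ports": [{"containerPort": 5432}],
        "env": [
            {
                "name": "POSTGRES_PASSWORD",
                "valueFrom": {
                    # Hypothetical Secret; must be created separately.
                    "secretKeyRef": {"name": "postgres-secret", "key": "password"}
                },
            },
            # Use a subdirectory so a lost+found dir on the volume does not upset initdb.
            {"name": "PGDATA", "value": "/var/lib/postgresql/data/pgdata"},
        ],
        "securityContext": {"privileged": False},
        "volumeMounts": [{"name": "data", "mountPath": "/var/lib/postgresql/data"}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,  # headless Service assumed to exist
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
            # The controller creates one PVC per replica from this template,
            # replacing the hand-written PV/PVC pair.
            "volumeClaimTemplates": [
                {
                    "metadata": {"name": "data"},
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "resources": {"requests": {"storage": storage}},
                    },
                }
            ],
        },
    }


print(json.dumps(postgres_statefulset(), indent=2))
```

This still gives you only one Postgres instance; it just lets the controller manage the claim and gives the pod a stable identity.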

brakmic · on April 11, 2023

Hi,

I don't think you're wrong. StatefulSets are simply on a "higher level" than raw Deployments and manually provided PersistentVolumes/Claims. I am thinking about writing another article that shows how to use StatefulSets in similar scenarios.

Regards,

xupybd · on April 10, 2023

I've just started using Keycloak to provide OpenID for F# Safe stack applications.

Wow the learning curve was steep on that one. Not having ever touched OpenID or anything other than forms based authentication and not knowing ASP.Net very well.

But it's neat to get it all up and running. Still a few issues with getting Keycloak to redirect to HTTPS but we will get there.

rad_gruchalski · on April 10, 2023

Maybe this will be helpful? https://gruchalski.com/posts/2022-02-20-keycloak-1700-with-t.... I’m the author.

andix · on April 10, 2023

That’s exactly what needs to be done. It is also in the keycloak documentation, but not as easy to find as in your post.

rad_gruchalski · on April 10, 2023

Thanks. I have recently rolled out Keycloak on k8s with Istio and ACME cert-manager. I’m going to write an article about it and post here when I find some time.

xupybd · on April 10, 2023

Thank you!

That looks like the exact problem I'm facing. I'll try it out today!

Thanks again!

andix · on April 10, 2023

The most disappointing problem with asp.net for me was, that there is no backchannel logout. So you can’t easily force-logout users from oidc/keycloak.

Everything else was going pretty smooth, although the authentication documentation for asp.net really sucks.

xupybd · on April 10, 2023

Yeah I've hit that. So if you log them out in your application that will remove the cookies and they won't show as logged in but the next redirect to Keycloak shows an error.

The documentation sucks for ASP.net and it's far worse for the Safe stack.

You have to understand the stack so you have to read up on the following.

  ASP.net
  Giraffe   
  Saturn
  Fable remoting
  Keycloak
  OpenID

Once you have a good understanding of all of those you can start to understand the half a dozen blog posts that attempt something similar.

andix · on April 11, 2023

You can do a logout via redirect though. So you have to call SignOutAsync on the oidc scheme, and then the user will get redirected to a logout page.

Need to enable SaveTokens in session though, because you need the logout token for that.
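
For readers wondering what that redirect looks like on the wire, here is a language-neutral illustration in Python (not ASP.NET code, which builds this for you). It assumes the provider follows OpenID Connect RP-Initiated Logout, where the saved ID token is passed as `id_token_hint`. The function name and all URLs and token values are made up, and the end-session endpoint should be read from the provider's discovery document rather than hard-coded.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def build_logout_url(end_session_endpoint, id_token, post_logout_redirect_uri):
    """Build the URL the browser is redirected to in order to end the
    session at the identity provider."""
    query = urlencode(
        {
            # The ID token saved at login tells the provider which session to end.
            "id_token_hint": id_token,
            # Where the provider sends the user after logout; it must be
            # registered for the client.
            "post_logout_redirect_uri": post_logout_redirect_uri,
        }
    )
    return f"{end_session_endpoint}?{query}"


# Hypothetical values for illustration only.
url = build_logout_url(
    "https://sso.example.com/realms/demo/protocol/openid-connect/logout",
    "abc.def.ghi",
    "https://app.example.com/signed-out",
)
print(url)
print(parse_qs(urlsplit(url).query))
```

Without the saved token there is nothing to put in `id_token_hint`, which is why the tokens need to be kept in the session.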

If you have any issues, please ask and I will post some (short!) code snippets here :)

Ps: I also love f#, but I concluded that it’s not worth using it for asp.net. There are just too many f# specific things you need to figure out first. Just going with c# is the safer bet. But you can still use f# for your service layer and for tests!

pharmakom · on April 11, 2023

Been doing similar stuff and found that ASP.Net stuff wasn’t worth the hassle compared to a custom set of functions on top of Giraffe.

xupybd · on April 11, 2023

Yeah you're right.

I built my own auth, forms-based with blowfish encryption, in a few hours. Then I felt like I was doing it wrong, so I looked at OpenID. It's been two weeks and it's just working. It'll take me a few days to document it well enough to be happy I can keep it running.

I made a poor choice and introduced too much complexity.

boris-ning-usds · on April 11, 2023

I've been reading up on Keycloak recently, and had questions for hosting keycloak in prod.

How do people in the field handle configuration updates with code? For example, if I want to set it up as an identity broker to an idp, I would want that configuration backed by code, reviewed by my team. Is anybody using the keycloak terraform provider https://registry.terraform.io/providers/mrparkers/keycloak/l... in production?

Do people diff the realm json configuration as code and use that instead?
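
As a rough idea of what diffing realm JSON in review could look like, here is a small Python sketch. It is only an illustration: the helper names are hypothetical, list items are compared by position (so a reordered export shows up as changes), and the two example realms are made up.

```python
MISSING = "<missing>"  # marker for a path that exists on only one side

def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {path: leaf_value}."""
    if isinstance(value, dict) and value:
        items = value.items()
    elif isinstance(value, list) and value:
        items = enumerate(value)
    else:
        # Scalars and empty containers are treated as leaves.
        return {prefix: value}
    flat = {}
    for key, child in items:
        flat.update(flatten(child, f"{prefix}/{key}"))
    return flat

def diff_realms(old, new):
    """Return {path: (old_value, new_value)} for every path that differs."""
    a, b = flatten(old), flatten(new)
    changes = {}
    for path in sorted(a.keys() | b.keys()):
        before, after = a.get(path, MISSING), b.get(path, MISSING)
        if before != after:
            changes[path] = (before, after)
    return changes


# Hypothetical exports from a test and a production realm.
test_realm = {"realm": "demo", "sslRequired": "none", "clients": [{"clientId": "web"}]}
prod_realm = {"realm": "demo", "sslRequired": "external", "clients": [{"clientId": "web"}]}
for path, (before, after) in diff_realms(test_realm, prod_realm).items():
    print(f"{path}: {before!r} -> {after!r}")
```

A path-level diff like this is easier to review than raw JSON, where a single change can be buried in thousands of lines of export.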

vbezhenar · on April 11, 2023

Same pain here. I've tried terraform and it works, but I just hate it because of state management. We're doing things manually, like someone changes stuff on test server, writes everything down and then at deploy hour repeats those changes on production server. This is not nice.

skhro87 · on April 11, 2023

you can try keycloak-config-cli https://github.com/adorsys/keycloak-config-cli we've been using it in production for 2 years and it works well! we are running it as part of our CICD, to sync settings to all Keycloak realms. As the tool supports variable substitution, it makes it quite flexible. The config file it uses is basically the same realm.json you can export from Keycloak, so it doesn't re-invent the wheel.

vxxzy · on April 10, 2023

Ah nice! I use Keycloak in conjunction with NetMaker. It seems to work well! I’d like to figure out a way to somehow get ssh authentication with keycloak. I’ve read of oauth + ssh certs, but all of it seems so cumbersome. It would be cool to have an open source alternative to StrongDM.

bebop · on April 10, 2023

Super roughly, but you might be able to implement an Authentication SPI[0] and wire that into an Authorization Code flow.

[0] - https://www.keycloak.org/docs/latest/server_development/#_au...

hsn915 · on April 10, 2023

Am I the only one for whom this kind of thing is painful to read?

I am kinda curious though about the kind of personality type that enjoys this kind of stuff.

Of course, I have never heard of "Keycloak" before, so I checked their homepage:

"No need to deal with storing users or authenticating users."

Wait a second, is dealing with storing users and authenticating them _so much pain_ that you'd rather inflict on yourself the pain of setting up and managing a k8s cluster?

I seriously don't get it.

rad_gruchalski · on April 10, 2023

No. The article talks about setting up Keycloak in Kubernetes (and it does that poorly because it’s not a production setup) but you don’t need Kubernetes: https://gruchalski.com/posts/2022-02-20-keycloak-1700-with-t....

hsn915 · on April 10, 2023

I find that equally triggering

rad_gruchalski · on April 10, 2023

Can’t appeal to everyone. Tell me what triggers you. I’m the author and I’m looking forward to your feedback.

hsn915 · on April 11, 2023

You proposed it as an alternative to the complexity in the OP, but it's more or less the same thing: edit a bunch of config files and run some cryptic commands

rad_gruchalski · on April 11, 2023

It’s not an alternative. It’s a different technology stack altogether. I posted it as a response to show you that one does not need Kubernetes to run Keycloak.

What would you expect to see instead of a bunch of configuration files and cryptic commands?

hsn915 · on April 11, 2023

> What would you expect to see instead of a bunch of configuration files and cryptic commands?

The fact that you even need to ask ..

Like, it's so _obvious_ that any computer system can't possibly be made to work properly without a bunch of cryptic configuration files and cryptic commands.

rad_gruchalski · on April 11, 2023

> it's so _obvious_ that any computer system can't possibly be made to work without

Indeed. The sad world of configuration files and cryptic commands.

vbezhenar · on April 11, 2023

Why is postgres running as privileged?

brakmic · on April 11, 2023

Hi,

Many thanks for the hint. It must have been a leftover setting from one of the other variants I am using locally.

Yes, postgres doesn't need to run as privileged. I changed it to "false" and updated the github repo.

Regards,