Keycloak with PostgreSQL on Kubernetes (brakmic.com)
158 points by brakmic on April 10, 2023 | 77 comments



If there's anyone reading this that is planning on deploying Keycloak in a high availability environment, I would highly recommend that you persist all sessions in the database as offline sessions.

At work, I ran 9 Keycloak clusters in production, handling tens of millions of sessions where the cost of losing sessions was high. The amount of time we wasted on getting it to work reliably with its default configuration of storing the sessions in its distributed, in-memory cache (Infinispan) is insane. It just isn't designed to handle such a workload reliably. Unless you're willing to spend months tuning it for every possible scenario, you WILL lose sessions.

If you are in this situation, shoot me an email. I have been through this pain and it took a lot of painstaking work to get to a highly reliable setup at scale.


Hi, you might want to take a look at the new storage in Keycloak [1].

Newer keycloak versions (19 and up) have a configurable storage for the auth sessions (see storage-area-auth-session and storage-area-user-session). I haven't checked them but the documentation is promising.

For older versions (the last one I checked was Keycloak 15) you might want to use offline sessions, but they don't allow SSO after the auth session has been evicted from Infinispan.

1 - https://www.keycloak.org/2022/07/storage-map.html


> I would highly recommend that you persist all sessions in the database as offline sessions.

Please! Post it, thanks


Was it something about Infinispan, or Keycloak?

We were wondering about Redis in a similar IAM use-case (PingFederate) but it wasn't officially supported, so we decided to just go with persistent Postgres. I wonder if we saved ourselves a bunch of heartache.


We often experienced cascading failure, especially during rolling restarts. A node would start shutting down and Infinispan would start to try to rebalance. Due to the large volume of sessions, other nodes would start to become unresponsive and stop replying. Eventually, you'd end up in a situation where it would give up trying to shut the node down cleanly and just kill itself. That wouldn't be a big deal if you weren't doing a rolling restart. When the first node doesn't shut down cleanly, the data should be "safe" since it is replicated to at least N owners. In practice, the other nodes also get restarted, also shut down uncleanly, and sessions are lost. Secondly, as the cluster became unresponsive, requests to refresh sessions would start to time out, which would also cause those sessions to be "lost" since they would eventually hit the maximum idle time.

As long as we didn't do any restarts, it would sort of work. Problems would pop up when, due to high load, one or more nodes became unresponsive and liveness probes restarted them. That would often cause the kind of cascading failure described above.

Most of these problems are also the result of running it in Kubernetes. We very quickly learned to remove the liveness probes and to massively increase the grace period. This helped, but only so much. We still had rather frequent failures similar to the one I just described.

Maybe if we hadn't run it in Kubernetes and had been more knowledgeable about Infinispan, we could've gotten a stable setup. As a small team without that specialized knowledge, we struggled to get there.
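
To make the "replicated to at least N owners" point concrete, here is a toy sketch of why one unclean shutdown is survivable but two in a row (before any rebalance finishes) is not. The placement rule, node names and session counts are all made up for illustration; this is not how Infinispan's consistent hashing actually assigns owners.

```python
def owners_for(session_id, nodes, num_owners):
    """Toy placement: num_owners consecutive nodes, starting at a slot derived from the id."""
    start = session_id % len(nodes)
    return {nodes[(start + i) % len(nodes)] for i in range(num_owners)}


def lost_sessions(session_ids, nodes, num_owners, killed):
    """Sessions whose every owner was killed before a rebalance could copy them elsewhere."""
    killed = set(killed)
    return [s for s in session_ids if owners_for(s, nodes, num_owners) <= killed]


nodes = ["kc-0", "kc-1", "kc-2", "kc-3"]  # hypothetical pod names
sessions = range(1000)

# One unclean shutdown: every session still has a surviving owner.
one_unclean = lost_sessions(sessions, nodes, 2, ["kc-0"])
# Two unclean shutdowns in a row, no rebalance in between.
two_unclean = lost_sessions(sessions, nodes, 2, ["kc-0", "kc-1"])

print(len(one_unclean), len(two_unclean))
```

In this toy model, a quarter of the sessions are gone after the second kill, which is roughly the shape of what a rolling restart does when nodes stop waiting for each other.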


Ah, the infinite fun of managing distributed systems. I've seen similar failure modes in pretty much anything distributed. While in single-node systems a spike in traffic just makes things run slowly, cascading failures caused by latency plague most of the distributed ones.

Whether it's process management or just, say, a node having too little memory and spinning in GC too much.

Mixing app and DB (which I guess is happening here) can also be fun, as now an overloaded app can cause an overloaded DB. You'd probably be just fine if Infinispan was used as a remote database instead of an embedded one.


Do you have a blog post or something detailing what you did and how you did it?


I found this: https://www.janua.fr/offline-sessions-and-offline-tokens-wit.... janua.fr is a very solid Keycloak resource. The write up is for a pretty aged Keycloak version but there are probably some decent pointers in there.


This article gets pretty close, but it misses a very critical piece. If you're running Keycloak 16 or older, you'll explicitly want to enable lazy offline session loading [0]. Otherwise, Keycloak will attempt to load ALL offline sessions in memory during startup.

Keycloak 17 made offline sessions lazily loaded by default.

[0] https://www.keycloak.org/docs/16.1/server_admin/#offline-ses...


As a possible alternative, I've recently started using Zitadel (https://zitadel.com/) which is a very full-fledged open source IDP, in active development.


This looks very interesting. I have recently tried using auth0 and was so horribly disappointed with how you go from 0 to enterprise as soon as you need any modern security feature. Plus I had assumed that they had a mature product, but it is chaotic and it is difficult to know what you are doing for even the simplest use cases.


Wow Zitadel looks absolutely amazing. All the features you want out of login, permissive license, good pl choice, easy to deploy, postgres-first.

Since I somehow can’t resist complaining about absolute mana from heaven:

The only issue I see is this reliance on event sourcing — I get the reasoning but I much prefer regular state saving + audit log approaches.

Event sourcing seems like a complexity and performance liability — does anyone have any insight on the implementation/why I am wrong about my misgivings?


The amount of events in such system isn't going to be too crazy, unless it's some massive enterprise with thousands of principals, I would imagine...

It also seems cockroachdb first, but I'm glad i can use postgres. One fewer database to deploy and manage, and for my use case (basically myself and occasional friends and family) that's perfectly fine.


> The amount of events in such system isn't going to be too crazy, unless it's some massive enterprise with thousands of principals, I would imagine...

Right, but it's like... why take that liability in the first place when you have a rock solid and extensible DB like Postgres under the hood.

Why not take the CQRS (good idea), but not go as far as full-on Event Sourcing, and just make sure you keep an audit table log or even executed operation log?

IMO in practice almost no one actually goes back in time with Event Sourcing. Also there are so many things that can bite you, it just seems unnecessary.

I did some digging through the code, and I really wish they'd made a big DB interface and then made the event store an implementation of that. It looks like they did it the other way -- the default interface being the event store, and PG/CockroachDB being the underlying. It's a subtle difference but means a huge deal for actual swappability of backends.

https://github.com/zitadel/zitadel/blob/main/internal/events...

I have to say, the code is also REALLY confusingly laid out. I just want to find the grpc/http handler that does like "create a user". I've been searching and clicking around for 10s of minutes -- maybe I don't read enough go.

> It also seems cockroachdb first, but I'm glad i can use postgres. One fewer database to deploy and manage, and for my use case (basically myself and occasional friends and family) that's perfectly fine.

I think of cockroachdb as basically postgres-with-stuff-bolted-on (albeit very good stuff, cockroach seems awesome), so I still consider it postgres-first! :)


Ory Hydra / Kratos is another good one

https://www.ory.sh


Is there solid documentation or a tutorial on deploying hydra + kratos? That's what stopped me last year. I've found some subtle pointers in some buried github issues, but nothing definitive. Which surprised me, because one would have thought that's the main use-case for many people and should be documented well.


there's the repository with examples: https://github.com/ory/examples


I guess with Zitadel they don't paywall any features, and with the self-hosted option you get essentially the same thing as with hosted. I think you can probably even do multi-instance, however maybe without a management interface for that part of it... When I read Ory language, it says to me "you can try it out locally and integrate, but we want you in our hosted solution right after" (I could be wrong, I was just glancing casually...)


The Ory stack doesn’t hide any features behind a paywall. Their documentation is subpar, for sure.

It’s a decent product but setting things up can be a bit dull.

One has to thread multiple Github repos and their incomplete scattered documentation together.

But once it works, it works.


How does zitadel compare to authentik (https://goauthentik.io/) or authelia (https://www.authelia.com/)?


I don't think authelia has a UI, and it also has a mode where it integrates a bit more deeply with the routing mesh, to protect apps that do not do auth themselves. Authentik I've not looked into. It also seems that they differentiate self-hosted options into free and paid with different features...


It took 2 minutes to read Authentik's site to find this clause in pricing FAQ.

> Features from the enterprise version are periodically moved to the open source version.


Any service mesh will integrate with any OpenID provider. It’s a token verification and a redirect.
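
A rough sketch of that "token verification and a redirect" decision, in case it helps. Everything named here is an assumption for illustration: the endpoint URL, the client id, and `verify_token`, which stands in for real verification (signature against the provider's published keys, issuer, audience, expiry). The query parameters are the standard ones for an OpenID Connect authorization code request.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse


def build_authorization_redirect(authorization_endpoint, client_id, redirect_uri, state):
    """Build the URL an unauthenticated user gets redirected to."""
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": "openid",
        "state": state,
    })
    return f"{authorization_endpoint}?{query}"


def handle_request(headers, verify_token, **oidc):
    """Return ("allow", None) for a valid bearer token, else ("redirect", url)."""
    auth = headers.get("Authorization", "")
    token = auth[len("Bearer "):] if auth.startswith("Bearer ") else None
    if token and verify_token(token):
        return ("allow", None)
    return ("redirect", build_authorization_redirect(**oidc))


# Hypothetical values; a real proxy would read these from its configuration.
state = secrets.token_urlsafe(16)
oidc = dict(
    authorization_endpoint="https://idp.example.com/authorize",
    client_id="my-app",
    redirect_uri="https://app.example.com/callback",
    state=state,
)

decision, url = handle_request({}, lambda token: False, **oidc)
params = parse_qs(urlparse(url).query)
print(decision, params["response_type"], params["scope"])
```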


Zitadel could be amazing, but as far as I can tell they don’t allow using your own UI screens, and it’s not obvious to me how you’d build a multi tenant SSO feature. They have the concept of organizations, but it’s not obvious to me how you’d route a user to the right login.


You can enable Domain Discovery to route users to the correct organization. Or you send a reserved scope with the auth request to select the organization. Building an own Login UI will be available in a couple of weeks (https://github.com/zitadel/zitadel/issues/5015)


Organizations are tied to domains, but yeah, that part is more confusing than the rest...

Instance also has a domain on top of that, but there are plans for a "simple" mode, assuming single org.


I mean nothing wrong with using a domain for routing, but I think most SaaSes would rather have a routing based on the email address.


That's basically what it does. You can activate Domain Discovery and verify a Domain on an organization, with that zitadel routes users to the organization based on the suffix (ie. email domain)
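
As a sketch of what suffix-based routing amounts to (not Zitadel's actual implementation; the function name, the mapping and the organization ids are invented for the example):

```python
from typing import Dict, Optional


def discover_organization(login_name: str, verified_domains: Dict[str, str]) -> Optional[str]:
    """Map the part after the last '@' to an organization, if that domain is verified."""
    _, separator, domain = login_name.rpartition("@")
    if not separator:
        return None  # no email-style suffix, nothing to route on
    return verified_domains.get(domain.lower())


# Hypothetical verified domains and organization ids.
verified_domains = {"acme.example": "org-acme", "globex.example": "org-globex"}

print(discover_organization("alice@Acme.example", verified_domains))
print(discover_organization("bob@unknown.example", verified_domains))
```

The first lookup resolves to `org-acme`; the second returns `None`, which is where a fallback login page would have to take over.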


Thanks for clarifying, I must have missed it in the docs. If you see this comment, I'm wondering if this discovery functionality will also be customizable when the custom UI screens feature gets added?


This is a good and interesting recipe to get Keycloak and Postgres on Kubernetes.

There is an important improvement, though: the Postgres deployed here is not production ready (high availability, backups, monitoring, etc).

We run Keycloak on StackGres [1] which gives us production-ready Postgres setup (disclaimer: it's dogfooding). Happy to share the YAML manifests used to deploy Keycloak with StackGres. Maybe we will write a blog post as a follow-up to this one, for completeness.

[1]: https://stackgres.io


> There is an important improvement, though: the Postgres deployed here is not production ready (high availability, backups, monitoring, etc).

Another omission is that one could use a Keycloak operator instead of rolling custom YAML.


I gotta say the (new) keycloak operator is super basic and doesn't really support changes in image tag. it always assumes a keycloak upgrade and will automatically scale down your keycloak to 1 instance to do an "upgrade". of course this will overload that one node, it will crash, and all your sessions are gone. I'm not sure if anybody is actually using keycloak & kc-operator in production on kubernetes. the state of the documentation and guides make it look like it's an abandoned product.


Yes, absolutely.


How do you compare StackGres to CrunchyData's `postgres-operator`?


I don't want to go too offtopic on this one --feel free to join StackGres Slack Community [1] to discuss further.

As a one-liner, though, for completeness: StackGres is fully open source (unlike Crunchy that needs a license for production); comes with a Web Console; 150+ Postgres extensions (including Timescale, Citus and many others); and many Day 2 operations fully automated.

[1]: https://slack.stackgres.io


Thank you, I will join.


I miss the ability to use some kind of GitOps with Keycloak. There's a Terraform plugin, but I hate it (because of state). I wish there was some kind of config file which Keycloak would read at startup and create/update/delete its resources according to it. I know that I can initialize a realm with JSON (with unreadable structure), but I can't maintain a realm with a config file.


you can try keycloak-config-cli https://github.com/adorsys/keycloak-config-cli we've been using it in production for 2 years and it works well! we are running it as part of our CICD, to sync settings to all Keycloak realms. As the tool supports variable substitution, it makes it quite flexible. The config file it uses is basically the same realm.json you can export from Keycloak, so it doesn't re-invent the wheel.


It looks promising, thanks!



I looked at it but I don't see the features I'm talking about. I can configure the pod myself, so I don't understand why I would need this operator.


I've been down this road a bit, though actually in Docker Swarm. One aspect I spent a lot of time digging into was running multiple Keycloak containers with a shared cache. On metal or a VM with multicast, they'll find each other no problem, and it works beautifully, but I'm not aware of any container orchestration that brings multicast out of the box (and I don't think AWS does either). Keycloak has a built-in Kubernetes DNS discovery mechanism to find its peer containers and share cache, which also worked quite well on Swarm, though I lost a day or two tweaking it.


Yes, Keycloak cluster works fine on Kubernetes. It takes some time to read all the docs and understand things, but nothing outrageous, that was my experience at least.


AWS supports multicast in VPCs.


Curious - I've seen several references that it doesn't support it, and that keycloak has a dedicated ec2 cache discovery option. But I don't use AWS anyway, so I'm far from knowledgeable about it.



Hopefully this doesn't come across as off-topic but the "smooth scrolling" or whatever that is that hijacks the normal scroll behavior is throwing off my scrollwheel, making the website nigh impossible to navigate. Only way I can scroll properly is by clicking on the scroll bar and dragging it up and down manually.


Yes. As a rule of thumb, unless you're making some kind of experiment/poc/artistic work, please don't hijack the native browser scroll, ever. Hate that this still has to be said in 2023. We like and (maybe) configured the scroll on the OS to our own taste; it's pointless to try and impose a slower, inferior and bug-ridden non-native implementation of the scroll just to add some kind of "smoothness".


Hi, OP here.

Thanks for the hint.

I just disabled smooth scrolling. Sorry, not a sophisticated designer guy, using wordpress plus some UI themes.

Regards,


Thanks, the article content is very useful for me as we're using k8 and keycloak together right now.


I may be wrong but is it not preferable to use StatefulSet for databases, rather than PV, PVC and Deployments?

You will not be able to scale anything up anyway since it’s a single instance mounting the same data.


Hi,

I don't think you're wrong. StatefulSets are simply on a "higher level" than raw Deployments and manually provided PersistentVolumes/Claims. I am thinking about writing another article that shows how to use StatefulSets in similar scenarios.

Regards,


I've just started using Keycloak to provide OpenID for F# Safe stack applications.

Wow the learning curve was steep on that one. Not having ever touched OpenID or anything other than forms based authentication and not knowing ASP.Net very well.

But it's neat to get it all up and running. Still a few issues with getting Keycloak to redirect to HTTPS but we will get there.


Maybe this will be helpful? https://gruchalski.com/posts/2022-02-20-keycloak-1700-with-t.... I’m the author.


That’s exactly what needs to be done. It is also in the keycloak documentation, but not as easy to find as in your post.


Thanks. I have recently rolled out Keycloak on k8s with Istio and ACME cert-manager. I’m going to write an article about it and post here when I find some time.


Thank you!

That looks like the exact problem I'm facing. I'll try it out today!

Thanks again!


The most disappointing problem with asp.net for me was, that there is no backchannel logout. So you can’t easily force-logout users from oidc/keycloak.

Everything else was going pretty smooth, although the authentication documentation for asp.net really sucks.


Yeah I've hit that. So if you log them out in your application that will remove the cookies and they won't show as logged in but the next redirect to Keycloak shows an error.

The documentation sucks for ASP.net and it's far worse for the Safe stack.

You have to understand the stack so you have to read up on the following.

  ASP.net
  Giraffe
  Saturn
  Fable remoting
  Keycloak
  OpenID

Once you have a good understanding of all of those you can start to understand the half a dozen blog posts that attempt something similar.


You can do a logout via redirect though. So you have to call SignOutAsync on the oidc scheme, and then the user will get redirected to a logout page.

Need to enable SaveTokens in session though, because you need the logout token for that.
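
For what the redirect itself looks like on the wire, here is a language-neutral sketch in Python rather than ASP.NET code. It assumes the saved ID token is what gets passed along as `id_token_hint`, per the OpenID Connect RP-Initiated Logout parameters; the host, realm name and token value are placeholders.

```python
from urllib.parse import urlencode


def build_logout_url(end_session_endpoint, id_token, post_logout_redirect_uri):
    """Build the URL the browser is sent to in order to end the session at the provider."""
    query = urlencode({
        "id_token_hint": id_token,
        "post_logout_redirect_uri": post_logout_redirect_uri,
    })
    return f"{end_session_endpoint}?{query}"


# Placeholder values: the host and realm are made up, and the token is not a real JWT.
end_session_endpoint = "https://keycloak.example.com/realms/demo/protocol/openid-connect/logout"
logout_url = build_logout_url(
    end_session_endpoint,
    id_token="eyJ.fake.token",
    post_logout_redirect_uri="https://app.example.com/signed-out",
)
print(logout_url)
```

Older Keycloak distributions serve the same path under an `/auth` prefix, so the endpoint is best read from the realm's discovery document rather than hardcoded.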

If you have any issues, please ask and I will post some (short!) code snippets here :)

Ps: I also love f#, but I concluded that it’s not worth using it for asp.net. There are just too many f# specific things you need to figure out first. Just going with c# is the safer bet. But you can still use f# for your service layer and for tests!


Been doing similar stuff and found that ASP.Net stuff wasn’t worth the hassle compared to a custom set of functions on top of Giraffe.


Yeah you're right.

I built my own auth, forms based with blowfish encryption, in a few hours. Then I felt like I was doing it wrong. So I looked at OpenID. It's been two weeks and it's just working. It'll take me a few days to document it well enough to be happy I can keep it running.

I made a poor choice and introduced too much complexity.


I've been reading up on Keycloak recently, and had questions for hosting keycloak in prod.

How do people in the field handle configuration updates with code? For example, if I want to set it up as an identity broker to an idp, I would want that configuration backed by code, reviewed by my team. Is anybody using the keycloak terraform provider https://registry.terraform.io/providers/mrparkers/keycloak/l... in production?

Do people diff the realm json configuration as code and use that instead?
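
If you go the diff route, one thing that helps is normalizing the export before comparing, so key ordering doesn't show up as noise in review. A minimal sketch, assuming two already-loaded realm exports; the field values are illustrative and the file names in the diff header are made up:

```python
import difflib
import json


def normalize(realm):
    """Serialize with sorted keys so ordering differences don't appear in the diff."""
    return json.dumps(realm, indent=2, sort_keys=True).splitlines()


def realm_diff(old, new):
    """Unified diff between two realm representations."""
    return list(difflib.unified_diff(
        normalize(old), normalize(new),
        fromfile="realm.old.json", tofile="realm.new.json", lineterm="",
    ))


# Tiny stand-ins for real exports, which are far larger.
old = {"realm": "demo", "enabled": True, "sslRequired": "none"}
new = {"sslRequired": "external", "realm": "demo", "enabled": True}

diff = realm_diff(old, new)
for line in diff:
    print(line)
```

Note that `sort_keys` only orders object keys. Lists in the export (clients, roles and so on) keep their order, so they may still need sorting by name or id before the diff is stable.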


Same pain here. I've tried terraform and it works, but I just hate it because of state management. We're doing things manually, like someone changes stuff on test server, writes everything down and then at deploy hour repeats those changes on production server. This is not nice.


you can try keycloak-config-cli https://github.com/adorsys/keycloak-config-cli we've been using it in production for 2 years and it works well! we are running it as part of our CICD, to sync settings to all Keycloak realms. As the tool supports variable substitution, it makes it quite flexible. The config file it uses is basically the same realm.json you can export from Keycloak, so it doesn't re-invent the wheel.


Ah nice! I use Keycloak in conjunction with NetMaker. It seems to work well! I’d like to figure out a way to somehow get ssh authentication with keycloak. I’ve read of oauth + ssh certs, but all of it seems so cumbersome. It would be cool to have an open source alternative to StrongDM.


Super roughly, but you might be able to implement an Authentication SPI[0] and wire that into an Authorization Code flow.

[0] - https://www.keycloak.org/docs/latest/server_development/#_au...


Am I the only one for whom this kind of thing is painful to read?

I am kinda curious though about the kind of personality type that enjoys this kind of stuff.

Of course, I have never heard of "Keycloak" before, so I checked their homepage:

"No need to deal with storing users or authenticating users."

Wait a second, is dealing with storing users and authenticating them _so much pain_ that you'd rather inflict on yourself the pain of setting up and managing a k8s cluster?

I seriously don't get it.


No. The article talks about setting up Keycloak in Kubernetes (and it does that poorly because it’s not a production setup) but you don’t need Kubernetes: https://gruchalski.com/posts/2022-02-20-keycloak-1700-with-t....


I find that equally triggering


Can’t appeal to everyone. Tell me what triggers you. I’m the author and I’m looking forward to your feedback.


You proposed it as an alternative to the complexity in the OP, but it's more or less the same thing: edit a bunch of config files and run some cryptic commands


It’s not an alternative. It’s a different technology stack altogether. I posted it as a response to show you that one does not need Kubernetes to run Keycloak.

What would you expect to see instead of a bunch of configuration files and cryptic commands?


> What would you expect to see instead of a bunch of configuration files and cryptic commands?

The fact that you even need to ask ..

Like, it's so _obvious_ that any computer system can't possibly be made to work properly without a bunch of cryptic configuration files and cryptic commands.


> it's so _obvious_ that any computer system can't possibly be made to work without

Indeed. The sad world of configuration files and cryptic commands.


Why is postgres running as privileged?


Hi,

Many thanks for the hint. It must have been a leftover setting from one of the other variants I am using locally.

Yes, postgres doesn't need to run as privileged. I changed it to "false" and updated the github repo.

Regards,



