I had a GCP Cloud Run function that rendered videos. It was fine for one video per request, but after that it slowed to a crawl and needed to shut down to clear out whatever was wrong. I assume a memory leak in MoviePy? I spent a couple of days looking at multiple options and trying different things; in the end I just duplicated the service so I had three of them, rotated which one we sent video renders to, and did each render one at a time. It was by far the cheapest solution, and since it meant we processed renders in parallel rather than serially it was faster too. All in all it worked a treat.
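The rotation described above is just round-robin dispatch, which can be sketched in a few lines. This is a minimal sketch, not the author's actual code; the service URLs and function name are made up:

```python
from itertools import cycle

# Hypothetical URLs for the three duplicated Cloud Run services
SERVICES = [
    "https://render-1-example.run.app",
    "https://render-2-example.run.app",
    "https://render-3-example.run.app",
]

# cycle() yields the services in order, forever, wrapping around at the end
_rotation = cycle(SERVICES)

def next_service() -> str:
    """Return the next service URL in the round-robin rotation."""
    return next(_rotation)
```

Each incoming render request is then sent to `next_service()`, so every instance handles one render at a time and gets a breather (or a fresh instance) before its next one.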
Hah, welcome to Cloud Run! I was evaluating it a few years ago to host an internal corporate app.
It worked great and was way easier to deploy than a k8s setup. However, after some testing we found that the core logic of the app, a long-running process, would grind to a halt after some time.
It turned out Google wanted to push you toward their (paid) queue / pubsub solutions, but they didn't want it to be _too_ obvious, so Cloud Run would actually throttle the CPU sometime after the request that spawned the instance had returned.
Our logic was based on pushing work into a queue and having it processed outside of a request, and the throttling just f*ked with that approach.
And it would have been fine if that were upfront info, but it was buried in small print on some obscure doc page…
That was the whole point of Cloud Run, which was obvious when you looked at the pricing for it: if an instance is not processing requests, the CPU is not allocated and you're not charged.
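For what it's worth, Cloud Run does expose this as a setting nowadays: you can opt into keeping the CPU allocated between requests, which is what background work outside the request cycle needs, at the cost of being billed for instance time. A sketch with the gcloud CLI (the service name is made up):

```shell
# Keep the CPU allocated even when no request is in flight, so work
# that continues after the response is sent doesn't get throttled.
# Note: billing shifts from per-request to instance-based.
gcloud run services update my-service --no-cpu-throttling
```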
This reminds me of a service I recently found that was routinely crashing out and being restarted automatically. I fixed the crash, but it turns out it had ALWAYS been crashing on a reliable schedule - and keeping the service alive longer created a plethora of other issues, memory leaks being just one of them.
That was a structural crash and I should not have addressed it.
At Fastmail, the ops team ran failovers all the time, just to get our failovers so reliable they worked no matter what. Only once in my tenure did a failover fail, and in that case there was a --yolo flag involved.
At Reddit we would randomly select a process to kill every 10 minutes out of the 10 or so on each machine, just so they would all get restarted in case we didn't do a deployment for a few days.
At Amazon they schedule service bounces during code freeze for any service known to have memory leaks, because it's easier than finding the leak. It's not usually an issue the rest of the year, since services get redeployed so often anyway.
Oooh, you've just reminded me of the email server at my first dev job. It would crash every few days and no one could work out why. In the end someone just wrote a cron-job-type thing to restart it once a day, problem solved!
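That kind of fix is often just one crontab line. A sketch, assuming the server runs as a systemd unit with a hypothetical name:

```shell
# crontab entry: restart the mail server every night at 03:00
# (unit name is made up; the original could as easily have been
#  an init script or a kill-and-relaunch wrapper)
0 3 * * * /usr/bin/systemctl restart mailserver.service
```

Inelegant, but when the restart is cheaper than the root cause, it's a perfectly rational trade, as the Amazon and Reddit stories above suggest.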