hey! thanks for the reply. I'm sure it would be fine if we had it configured correctly - the issues we're seeing are that a) the queue/worker library we're using behaves in unexpected ways, b) it was installed before most of us got there, and we don't understand its failure modes, and c) we don't have good visibility into the state of the system when it fails.
I have full faith in Redis as a tool, and I'm sure there are reliable queue/worker libraries that work with it, but we don't have one of them. even with a reliable datastore you need to make sure you're not dequeuing things twice or putting things back on the queue or suddenly having your workers stop processing things - we've seen all of these, and like everything, it's been tricky to balance "rewrite the entire thing" vs. "make it good enough & focus on delivering business value".
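to make the "not dequeuing twice / not losing work when a worker dies" part concrete, here's a rough sketch of the kind of guarantee I mean. it's a toy in-memory version, not Redis and not the library we actually run; the class name, method names, and timeout value are all made up for illustration. the idea is that taking a job moves it to an "in flight" set with a deadline instead of deleting it, a worker acks when it's done, and anything not acked by its deadline goes back on the queue.

```python
import time
from collections import deque


class InMemoryReliableQueue:
    """Toy stand-in for a queue backend. All names here are hypothetical."""

    def __init__(self, visibility_timeout=30.0, clock=time.monotonic):
        self.pending = deque()          # jobs waiting for a worker
        self.in_flight = {}             # job_id -> (payload, deadline)
        self.visibility_timeout = visibility_timeout
        self.clock = clock              # injectable so tests can fake time

    def enqueue(self, job_id, payload):
        self.pending.append((job_id, payload))

    def dequeue(self):
        """Hand out one job and record it as in flight, or return None."""
        if not self.pending:
            return None
        job_id, payload = self.pending.popleft()
        deadline = self.clock() + self.visibility_timeout
        # The job leaves `pending` and enters `in_flight` in one step,
        # so a second dequeue() call cannot hand out the same job.
        self.in_flight[job_id] = (payload, deadline)
        return job_id, payload

    def ack(self, job_id):
        """Mark a job finished. Returns False if it was not in flight."""
        return self.in_flight.pop(job_id, None) is not None

    def requeue_expired(self):
        """Put back jobs whose worker never acked before the deadline."""
        now = self.clock()
        expired = [
            job_id
            for job_id, (_, deadline) in self.in_flight.items()
            if deadline <= now
        ]
        for job_id in expired:
            payload, _ = self.in_flight.pop(job_id)
            self.pending.append((job_id, payload))
        return expired
```

in a real multi-worker setup the "take and mark in flight" step has to be atomic in the datastore itself; with Redis, as far as I understand it, that's usually done with an atomic list move such as `RPOPLPUSH` / `LMOVE` into a processing list. the sketch also doubles as a visibility aid, since `pending` and `in_flight` are exactly the state we can't currently see when things go wrong.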