I agree with most of this. If you have non-trivial message handling logic in a production system, you probably shouldn't use SQS directly to drive your work. Your SQS handling logic should be simple and reliable. In most cases, if the handling logic is complex, long-running, or needs operational visibility (logging, monitoring, etc.), I'd write the message handler so that it just kicks off a workflow via Step Functions or some other workflow system. You'll pay for that in initial development cost, because it's more complicated (you need to write Lambda handlers, wire them up with CloudFormation, etc.), but in return you get a central place to look at each unit of work, instead of having your artifacts scattered across various logs (if they were logged at all).
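To make that concrete, here's a rough sketch of what I mean by a handler that only kicks off a workflow. It assumes a Lambda function subscribed to the queue with partial batch responses (`ReportBatchItemFailures`) enabled, and a state machine whose ARN is in a `STATE_MACHINE_ARN` environment variable. That variable name, the `start_workflows` helper, and the shape of the execution input are all made up for illustration. The only AWS call is boto3's `start_execution` on the Step Functions client.

```python
import json
import os


def start_workflows(event, sfn_client, state_machine_arn):
    """Start one workflow execution per SQS record.

    Returns a partial batch response so that only the records that
    failed to start are redelivered by SQS.
    """
    failures = []
    for record in event.get("Records", []):
        message_id = record["messageId"]
        try:
            sfn_client.start_execution(
                stateMachineArn=state_machine_arn,
                # Assumption: naming the execution after the SQS message ID
                # makes each unit of work easy to find in the console.
                name=message_id,
                # Execution input must be a JSON string, so wrap the raw body.
                input=json.dumps({"messageId": message_id, "body": record["body"]}),
            )
        except Exception:
            # Deliberately coarse: any error means "let SQS retry this message".
            # A real handler would probably treat an already-existing execution
            # for a redelivered message as success rather than a failure.
            failures.append({"itemIdentifier": message_id})
    return {"batchItemFailures": failures}


def lambda_handler(event, context):
    # Imported here so the helper above can be tested without boto3 installed.
    import boto3

    return start_workflows(
        event,
        boto3.client("stepfunctions"),
        os.environ["STATE_MACHINE_ARN"],  # hypothetical variable name
    )
```

The handler itself does nothing interesting, which is the point: all of the complicated, long-running logic lives in the workflow, where each execution has its own history you can inspect.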

The takeaway for me is: distributed systems are hard. If you have distributed workers, you have entered a vastly more complex realm. SQS gives you some tools to work successfully in that environment, but it doesn't (and can't) get rid of that complexity. Most of the problems I've seen come from engineers not understanding the fundamental complexity of coordinating distributed work. Your choice of tech stack for your queues isn't going to make a big difference if you don't understand what you're fundamentally dealing with.