Not really, I don't see how this would be more difficult than any other kind of queue most programs already have--email notification retries especially. Unlike email though, this is HTTP and has better status codes. Also, you'd technically still have to implement a queue to send to svix, no? Otherwise if they go down you lose critical messages.
Webhooks are easy-ish to send and retry. Building the UX to help users successfully use webhooks is not simple. You need debugging tools, retry handling, notifications when they break (but not the first time they break, when they break repeatedly), etc.
You're conflating low-level plumbing with a ready-to-go, multi-tenant feature.
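To make the "notify when they break repeatedly, not the first time" point concrete, here is a minimal sketch of a consecutive-failure tracker. The class name, the threshold, and the in-memory dict are all assumptions for illustration; a real product would persist this state and probably add time windows.

```python
class FailureTracker:
    """Counts consecutive failures per endpoint (hypothetical helper)."""

    def __init__(self, threshold=5):
        self.threshold = threshold   # assumed value, tune per product
        self._streak = {}            # endpoint -> current run of failures

    def record(self, endpoint, ok):
        """Record one delivery result; return True exactly when an alert should fire."""
        if ok:
            # Any success resets the streak, so a single blip never alerts.
            self._streak.pop(endpoint, None)
            return False
        self._streak[endpoint] = self._streak.get(endpoint, 0) + 1
        # Fire once when the streak reaches the threshold, not on every later failure.
        return self._streak[endpoint] == self.threshold
```

With `threshold=3`, three failures in a row trigger one alert, a fourth failure stays quiet, and a success resets the count.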
It's potentially a bit different from normal queues in that while you can scale up your own queue processing, you can't scale up the webhook receiver. And unlike something like newsletter emailing, you probably care very much about latency.
This means that in a naive implementation, unless you run as many parallel workers as there are messages in the queue, someone will block someone else from delivering. Depending on your latency requirements, this might not be acceptable.
Making delivery truly parallel — that is, each distinct receiver should not block anyone else, no matter how slow or failure-prone they are — and low latency is a bit more tricky, essentially requiring one logical queue per webhook.
You can solve it in various ways, depending on what solution (Kafka, NATS/JetStream, Pulsar, Google Pub/Sub, etc.) you choose, but as far as I know, nobody provides this out of the box. In particular, one-queue-per-webhook requires worker coordination in a way that classical pub/sub doesn't — after all, you don't want to run one worker per webhook if they're not all full of pending messages — and some systems don't scale to many queues very well (e.g. Google Pub/Sub has a hard limit of 10,000 topics per project).
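As a rough illustration of "one logical queue per webhook" without one worker per webhook, here is an in-memory, single-process sketch. It is not how any of the systems named above work; the class and method names are made up, and capping each endpoint at one in-flight delivery is a simplifying assumption.

```python
from collections import deque


class PerEndpointQueues:
    """One FIFO per endpoint, at most one in-flight delivery per endpoint."""

    def __init__(self):
        self._pending = {}       # endpoint -> deque of messages
        self._ready = deque()    # endpoints with pending work and nothing in flight
        self._in_flight = set()

    def enqueue(self, endpoint, message):
        self._pending.setdefault(endpoint, deque()).append(message)
        # The membership test on a deque is O(n); acceptable for a sketch.
        if endpoint not in self._in_flight and endpoint not in self._ready:
            self._ready.append(endpoint)

    def next_job(self):
        """Return (endpoint, message) for an idle worker, or None if nothing is deliverable."""
        if not self._ready:
            return None
        endpoint = self._ready.popleft()
        message = self._pending[endpoint].popleft()
        self._in_flight.add(endpoint)
        return endpoint, message

    def done(self, endpoint):
        """Called by a worker once its delivery attempt has finished."""
        self._in_flight.discard(endpoint)
        if self._pending[endpoint]:
            self._ready.append(endpoint)   # back of the line: round-robin across endpoints
        else:
            del self._pending[endpoint]
```

A small shared worker pool calls `next_job()` and `done()`. A slow endpoint can tie up at most one worker, so its backlog waits behind it rather than in front of everyone else.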
Retrying can also pose some challenges. What if the webhook has been down for days? Do you still keep messages in the queue, or do you throw them away? If the webhook comes up, do you prioritize new deliveries or do you mix in the old ones? How do you keep track of this so that you can alert the webhook owner about the flakiness?
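One possible set of answers to the first two questions, sketched as code: back off on a fixed schedule and drop a message once the next retry would fall outside a maximum age. The schedule and the three-day cutoff are invented values, not anyone's actual policy.

```python
from datetime import datetime, timedelta

# Hypothetical policy values, picked only for illustration.
BACKOFF = [timedelta(seconds=5), timedelta(minutes=5),
           timedelta(hours=1), timedelta(hours=6)]
MAX_AGE = timedelta(days=3)


def next_attempt(created_at, failed_attempts, now):
    """Return when to retry a message, or None if it should be dropped.

    failed_attempts is the number of attempts that have already failed (>= 1).
    """
    # Walk the schedule, then keep reusing the last (longest) delay.
    index = min(failed_attempts - 1, len(BACKOFF) - 1)
    retry_at = now + BACKOFF[index]
    if retry_at - created_at > MAX_AGE:
        return None   # the endpoint has been down too long; give up on this message
    return retry_at


created = datetime(2024, 1, 1)
first_retry = next_attempt(created, 1, created)   # 5 seconds after creation
```

Whether old messages should be interleaved with new ones after recovery is a separate scheduling decision that this sketch does not address.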
As the other poster says, the devil is in the details. It's all solvable, but nine times out of ten, I personally prefer having something off-the-shelf that's been built once, rather than building it from scratch every time.
I might be missing something but it seems like all of your details are either things you would need to configure anyways in Svix (not all services should have the same retry/expiry) or things that are not solved by this service. This service takes HTTP as input and output, so you wouldn't need a worker per topic anyway, right? The workload is http-in, http-out, with a failure condition for retry.
If I already have a queue of http messages (which I need to have to protect from Svix downtime) configured with their policies for retry/expiry (which I need to configure since it's not the same for all) then what does this service do that is not basically a curl loop with an error check?
But a queue to protect against Svix downtime is fundamentally different from delivering webhooks.
I already outlined some challenges with implementing webhooks. I think you're missing my point about parallel delivery. If the workload is "HTTP-in, HTTP-out", you need to make sure that a single slow "out" does not cause head-of-line blocking that would prevent other, fast workloads from being executed. One way to accomplish that is to scale up to have N_workers >= N_pending, which is typically a terrible solution. So a mature webhook solution needs to be more clever about this.
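To put numbers on the head-of-line blocking argument, here is a toy simulation of N workers draining a single FIFO queue. The durations are made up (two receivers that hang for 30 seconds, then one that answers in half a second), and the function name is just for this example.

```python
import heapq


def completion_times(durations, n_workers):
    """Simulate n_workers draining one FIFO queue.

    durations[i] is how long receiver i takes to respond, in seconds.
    Returns the time at which each delivery finishes, in queue order.
    """
    free_at = [0.0] * n_workers    # when each worker next becomes idle
    heapq.heapify(free_at)
    finished = []
    for d in durations:
        start = heapq.heappop(free_at)   # the earliest idle worker takes the next message
        end = start + d
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished


print(completion_times([30, 30, 0.5], 2))   # [30.0, 30.0, 30.5]
print(completion_times([30, 30, 0.5], 3))   # [30.0, 30.0, 0.5]
```

With two workers the fast receiver waits 30 seconds behind strangers. It only gets its 0.5 seconds once there are as many workers as pending messages, which is the N_workers >= N_pending situation described above.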
Queues are great for situations where either the latency doesn't matter, or where you can scale up your resources to decrease latency; but in the case of webhooks, the latency of the webhook receiver is outside your control — you can't scale them up.
Here's another detail where devils are hiding: delivering webhooks to arbitrary URLs is a security concern. To mitigate this, the delivery agent needs to run in an isolated environment so that it cannot possibly interfere with private hostnames/IPs in your cluster.
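Network isolation is the real fix, but an application-level check is a common extra layer. Here is a minimal sketch using only the Python standard library; the function names are made up, and it assumes you reject any URL whose hostname resolves to a non-public address.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(ip_text):
    """True only for globally routable addresses (not loopback, private, link-local, ...)."""
    # Drop an IPv6 scope id such as "%eth0" before parsing.
    return ipaddress.ip_address(ip_text.split("%")[0]).is_global


def url_is_safe(url):
    """Reject webhook URLs that are not http(s) or that resolve to internal addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        infos = socket.getaddrinfo(parts.hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False   # unresolvable hostnames are treated as unsafe
    # Every resolved address must be public, not just the first one.
    return bool(infos) and all(is_public_address(info[4][0]) for info in infos)
```

This alone does not stop DNS rebinding: the address checked here can differ from the one the HTTP client later connects to, unless the client is pinned to the validated IP. That is one more reason to isolate the delivery agent at the network level as well.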
You don't need a queue to protect from Svix downtime. It can be as simple as logging failures to send to Svix (when they happen), and replaying these events. Though as I said elsewhere, this scenario is something you'd need to deal with for Twilio and SendGrid too.
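For what "log failures and replay them" could look like in practice, here is a minimal sketch. The `send` argument is a placeholder for whatever call hands an event to the provider (it is not a real Svix API), and the JSON-lines file is an assumed storage format.

```python
import json


def send_or_log(event, send, log_path):
    """Try to hand the event to the provider; on failure, append it to a local log."""
    try:
        send(event)
        return True
    except Exception:   # broad on purpose in this sketch; narrow it in real code
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
        return False


def replay(log_path, send):
    """Re-send logged events in order, keep the ones that still fail, return the count sent."""
    with open(log_path, encoding="utf-8") as f:
        events = [json.loads(line) for line in f if line.strip()]
    remaining = []
    for event in events:
        try:
            send(event)
        except Exception:
            remaining.append(event)
    # Rewrite the log so it only holds events that have not gone through yet.
    with open(log_path, "w", encoding="utf-8") as f:
        for event in remaining:
            f.write(json.dumps(event) + "\n")
    return len(events) - len(remaining)
```

`replay` would be run by hand or from a cron job after an incident. There is no locking, ordering guarantee, or deduplication here.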
As for what this service does that is not basically a curl loop with an error check: see the rest of the comments. People chimed in from their own experience better than I could have said it myself. Or even look at https://svix.com and see what we offer; you'll see that there's much more nuance. :)
We know that people underestimate webhooks, it's a challenge we need to overcome, but there really is more to it than just a POST request.
> It can be as simple as logging failures to svix (when they happen), and replaying these events
That's a manually implemented queue, right?
I looked at the site and this thread and I still don't get it. I don't think I underestimate webhooks; rather, I don't see why adding another webhook in between will help.
We commented about the "queue to send to svix" in the post above.
> Deliverability to user endpoints (servers) is very different to deliverability to Svix. User endpoints fail all the time and for various reasons, and each of them can fail independently. This means developers need a robust and scalable delivery system that can deal with failures on an ongoing basis. While with Svix, outages are rare, and are dealt with as incidents. The same way you would with SendGrid, Twilio and other API providers.
Anyhow, this was not a comment about our uptime against any particular service, but rather about our uptime against the collective (so how often any of those fail), because that's what matters here.
Though there's definitely a big difference in uptime between a service that has SLAs and random user-endpoints that don't necessarily promise the same.