I did some work optimizing a similar problem, but simpler and on another OS[1]. The basic concept that worked was Receive Side Scaling (RSS), which was developed by Microsoft for Windows Server. Did you come across that? It needs support in the NIC and the driver, but Intel gigE cards do it, so you don't need the really fancy cards. I don't know what the interface is like for Windows, but inbound RSS for FreeBSD is pretty easy, and from skimming the Windows docs, it seemed like you could do more advanced things there.
The harder part was aligning the outgoing connections. For max performance, you want all of the related connections pinned to the same CPU, so that there's no inter-CPU messaging. For me, that meant a frontend connection needs to hash to the same NIC queue as its backend connection; for you, all of the demultiplexed connections need to be on the same queue as the multiplexed connection. Windows may have an API to make connections that will hash properly; FreeBSD didn't (doesn't?), so my code had to manage the local source IP and port when connecting to remote servers so that the connection would hash as needed. Assuming a lot of connections, you end up needing to self-manage source IP and port anyway, and at least HAProxy has code for that already, but running the RSS hash to qualify ports was new development, and a bit tricky because bulk calculating it gets costly.
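A minimal sketch of the port-qualifying idea, in Python for readability; it is an illustration, not the actual patch. It assumes the standard Toeplitz hash over the TCP/IPv4 4-tuple as the NIC sees the reply packets (so the remote end is the "source"), uses the sample key from Microsoft's RSS documentation as a stand-in for whatever key the kernel or NIC is really programmed with, and simplifies the indirection table to "low bits of the hash pick the queue". The function names are made up for this example.

```python
import ipaddress
import struct

# Assumption: sample key from Microsoft's RSS docs. A real setup must use the
# key the kernel/NIC is actually programmed with.
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa"
)


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash: for every set input bit (MSB first), XOR in the
    32-bit window of the key that starts at that bit offset."""
    key_bits = len(key) * 8
    if len(data) * 8 + 32 > key_bits:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def rx_hash_tcp4(remote_ip, remote_port, local_ip, local_port, key=RSS_KEY):
    """Hash of the packets we will *receive*: source is the remote end."""
    data = (
        ipaddress.IPv4Address(remote_ip).packed
        + ipaddress.IPv4Address(local_ip).packed
        + struct.pack("!HH", remote_port, local_port)
    )
    return toeplitz_hash(key, data)


def pick_source_port(remote_ip, remote_port, local_ip, want_queue, num_queues,
                     ports=range(1024, 65536), in_use=frozenset()):
    """Return the first free local port whose receive hash maps to want_queue,
    or None. Simplification: queue = low bits of the hash, which only matches
    real hardware when the indirection table is laid out that way."""
    if num_queues < 1 or num_queues & (num_queues - 1):
        raise ValueError("sketch assumes a power-of-two queue count")
    for port in ports:
        if port in in_use:
            continue
        if rx_hash_tcp4(remote_ip, remote_port, local_ip, port) & (num_queues - 1) == want_queue:
            return port
    return None
```

Usage would be something like `pick_source_port("10.0.0.5", 8080, "10.0.0.1", want_queue=3, num_queues=16)`, then `bind()` the socket to that local IP and port before `connect()`. Scanning like this on every connect is exactly the bulk-calculation cost mentioned above, so in practice you would want to precompute or cache per-queue port pools per backend.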
Once I got everything set up well with respect to CPUs, things got a lot better. There were still some kernel bottlenecks, though; I wouldn't know how to resolve those on Windows, but there were some easy wins for FreeBSD.
Low core count is the right way to go though. I think the NICs I used could only do 16-way RSS hashing, so my dual 14-core Xeons (2690v4) weren't a great fit: 12 cores were 100% idle all the time. A power-of-two core count would be best.
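The arithmetic behind that, as a tiny sketch: it assumes one RSS queue pinned per core and nothing else running on the box, and the helper name is made up.

```python
def rss_core_usage(total_cores: int, nic_rss_queues: int) -> tuple[int, int]:
    """Return (busy_cores, idle_cores) when each RSS queue is pinned to one core."""
    busy = min(total_cores, nic_rss_queues)
    return busy, total_cores - busy


# Dual 14-core box with a NIC limited to 16-way RSS.
print(rss_core_usage(2 * 14, 16))  # (16, 12)
```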
Email in profile if you want to continue the discussion off HN (or after it fizzles out here).
[1] Load balancing/proxying, but no TLS and no multiplexing, on FreeBSD.
Do you actually use RSS via options RSS / options PCBGROUP? I've tried it several times, and it's just so hard to get right and have matching cores / rx rings, etc. I've made it work with a local patch to nginx, but it was so fragile that I abandoned it.
I had been thinking that RSS/PCBGROUP was totally abandoned and could potentially be removed.
I no longer work where I did this (and it's been shut down, as it was a transitional proxy), so I can't be 100% sure what the kernel configuration was. I was able to release patches on the HAProxy mailing list; they weren't incorporated, but at least I can reference them [1].
But yes, I think I ended up using both RSS and PCBGROUP. This was on a server running only one application (plus like sshd and crond and whatever), so it was dead simple to line up listen socket RSS and CPU affinity. I had a config generator script that would look at the number of configured queues and tell HAProxy process 0 to bind to CPU 0 and RSS queue 0, and so on up until I ran out of RSS queues; we needed a config generator script anyway, because the backend configuration was subject to frequent changes. If it had only been listen sockets, RSS would have been sufficient without needing PCBGROUP, but locking around opening new outgoing sockets was a bottleneck. PCBGROUP helped considerably, but it was still a bottleneck. This was on FreeBSD 12.
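A rough sketch of what the pairing logic in such a generator could look like. This is not the original script: the function names and the rendered output format are invented for illustration, and the output is deliberately not real HAProxy syntax (the RSS queue binding came from out-of-tree patches). The queue count is passed in rather than read from the system.

```python
def plan_affinity(num_processes: int, num_rss_queues: int) -> list[dict]:
    """Pair process i with CPU i and RSS queue i until the RSS queues run out."""
    return [
        {"process": i, "cpu": i, "rss_queue": i}
        for i in range(min(num_processes, num_rss_queues))
    ]


def render(plan: list[dict]) -> str:
    # Hypothetical, human-readable format; translate to whatever the proxy
    # config actually expects.
    return "\n".join(
        f"process {p['process']}: cpu={p['cpu']} rss_queue={p['rss_queue']}"
        for p in plan
    )


if __name__ == "__main__":
    # 28 cores available, but only 16 RSS queues: only 16 processes get a slot.
    print(render(plan_affinity(num_processes=28, num_rss_queues=16)))
```

Keeping the plan as plain data makes it easy to feed the same mapping into both the proxy config and any CPU pinning done outside the proxy.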
Edit: I also found some patches[2] I sent to freebsd-transport that I don't know if anyone saw; I don't remember if I updated the patches after this... I know I tried some more stuff that I wasn't able to get working. Don't apply these patches blindly, but these were some of the things I had to fiddle with anyway. I think I saw there was some stuff in 13 that likely made outgoing connections better.
On further reflection, I just want to emphasize how much of an improvement RSS/PCBGROUP and the couple of minor tweaks made for this use case. With unmodified FreeBSD 12 and HAProxy and the load we had, there was basically zero concurrency available: you could run as many processes (or threads) as you wanted, and the capacity would be the same. It was sad; I think we could only get about 100k clients on a box before it would run out of steam.
With everything tweaked, we got to 2M clients per server, and actually it was hard to find the limit, because I wasn't able to direct enough traffic to the machines under test.
The software and configuration changes weren't big, but the impact was huge. On the other hand, if RSS and PCBGROUP weren't in the kernel, I don't think I would have been able to add something similar, and we would have had to do something wild and crazy (or try Linux and see if it would do the job). Now, I really did want to write a raw packet TCP proxy in userspace, but I knew it would be a lot easier to manage and quicker to get working with something off the shelf.
Of course, maybe there's a better solution to the root bottleneck, which was always opening a new outgoing TCP connection. Even with all the tweaks, that was still the bottleneck, but fixing it needs someone more skilled than me, and I guess it's a pretty niche use case to be opening so many outgoing sockets. Accepting tons of sockets is way more common and way more optimized.