Google has probably sent data to almost every /24 in the last hour. Probably 99% of their egress data goes to destinations where they've sent enough data recently to make a good estimate of bottleneck link speed and queue size.
Having to pick a particular initcwnd to be used for every new TCP connection is an architectural limitation. If they could collect data about each destination and start each TCP connection with a congestion window based on the recent history of transfers from any of their servers to that destination, it could be much better.
It's not a trivial problem to collect bandwidth and buffer size estimates and provide them to every server without delaying the connection, but it would be fun to build such a system.
> It's not a trivial problem to collect bandwidth and buffer size estimates and provide them to every server without delaying the connection, but it would be fun to build such a system.
Tons of fun. Sadly, I don't have access to enough clients to do it anymore.
But here's a napkin architecture. Collect per-connection stats and report them on connection close (you can do a lot with TCP_INFO, or the equivalent stats from QUIC). That goes into some big map/reduce whatever data pipeline.
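Roughly, the collection side could look like the sketch below (C on Linux; the tcp_info fields are real, but struct sample and report_sample are just stand-ins for whatever the reporting path is):

    /* Sample per-connection stats right before close() on a Linux socket fd. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    struct sample {
        struct in_addr dst;      /* destination address (IPv4 for brevity) */
        unsigned int   snd_cwnd; /* final congestion window, in segments */
        unsigned int   snd_mss;  /* negotiated MSS */
        unsigned int   rtt_us;   /* smoothed RTT, microseconds */
        unsigned int   retrans;  /* total retransmits, a rough loss proxy */
    };

    void report_sample(const struct sample *s);  /* stand-in reporting sink */

    void sample_on_close(int fd)
    {
        struct tcp_info ti;
        socklen_t ti_len = sizeof ti;
        struct sockaddr_in peer;
        socklen_t peer_len = sizeof peer;

        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &ti_len) != 0)
            return;
        if (getpeername(fd, (struct sockaddr *)&peer, &peer_len) != 0)
            return;

        struct sample s = {
            .dst      = peer.sin_addr,
            .snd_cwnd = ti.tcpi_snd_cwnd,
            .snd_mss  = ti.tcpi_snd_mss,
            .rtt_us   = ti.tcpi_rtt,
            .retrans  = ti.tcpi_total_retrans,
        };
        report_sample(&s);
    }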
The pipeline ends up with a recommended initial segment limit and an MSS suggestion [1]; you can probably fit both of those into 8 bits. For IPv4, you could probably just put them into a 16 MB lookup table... shift off the last octet of the address and that's your index into the table. For IPv6 it's trickier, the address space is too big; there are techniques, though.
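The IPv4 lookup itself is almost nothing; something like the following (the 4-bit/4-bit split of the byte into a cwnd bucket and an MSS bucket is just one way to pack it):

    /* Flat IPv4 hint table: one byte per /24, 16 MB total. */
    #include <arpa/inet.h>
    #include <stdint.h>

    #define TABLE_ENTRIES (1u << 24)          /* one entry per /24 */

    static uint8_t hint_table[TABLE_ENTRIES]; /* 0 means "no data" */

    /* addr is an IPv4 address in network byte order (e.g. sin_addr.s_addr). */
    static inline uint8_t lookup_hint(uint32_t addr)
    {
        return hint_table[ntohl(addr) >> 8];  /* shift off the last octet */
    }

    /* One possible packing: high nibble = cwnd bucket, low nibble = MSS bucket. */
    static inline unsigned int hint_cwnd_bucket(uint8_t h) { return h >> 4; }
    static inline unsigned int hint_mss_bucket(uint8_t h)  { return h & 0x0f; }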
At Google scale, they could probably regenerate this data hourly, but weekly would probably be plenty fast.
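The per-/24 reduce that fills in each byte could be as dumb as the sketch below (the percentile choice, bucket sizes, and minimum-sample threshold are all made-up knobs):

    /* Turn recent samples for one /24 into the packed 8-bit hint. */
    #include <stdint.h>
    #include <stdlib.h>

    struct sample24 {
        unsigned int snd_cwnd; /* final cwnd of one connection, in segments */
        unsigned int snd_mss;  /* negotiated MSS of that connection */
    };

    static int cmp_cwnd(const void *a, const void *b)
    {
        const struct sample24 *x = a, *y = b;
        return (int)x->snd_cwnd - (int)y->snd_cwnd;
    }

    /* Returns the hint byte for one /24, or 0 if there isn't enough data. */
    uint8_t reduce_prefix(struct sample24 *samples, size_t n)
    {
        if (n < 32)                    /* too few observations to trust */
            return 0;

        /* Be conservative on cwnd: take the 25th percentile of final cwnds
         * so one fast host in the /24 doesn't set the hint for everyone. */
        qsort(samples, n, sizeof *samples, cmp_cwnd);
        unsigned int cwnd = samples[n / 4].snd_cwnd;
        unsigned int cwnd_bucket = cwnd >= 60 ? 15 : cwnd / 4; /* 4-segment buckets */

        /* Be conservative on MSS too: use the smallest one observed. */
        unsigned int min_mss = samples[0].snd_mss;
        for (size_t i = 1; i < n; i++)
            if (samples[i].snd_mss < min_mss)
                min_mss = samples[i].snd_mss;
        unsigned int mss_bucket = min_mss >= 1460 ? 0   /* full ethernet MTU */
                                : min_mss >= 1452 ? 1   /* minus 8 */
                                : min_mss >= 1440 ? 2   /* minus 20 */
                                : 3;                    /* minus 28 or worse */

        return (uint8_t)((cwnd_bucket << 4) | mss_bucket);
    }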
[1] This is its own rant (and hopefully it's outdated), but the MSS on a SYN+ACK should really start at the lower of what you can accept and what the client told you they can accept. Instead, the consensus has been to always send what you can accept. But path MTU discovery doesn't always work, so a lot of services just send a reduced MSS. If you have the infrastructure, it's actually pretty easy to tell whether clients can send you full-MTU packets or not... with per-network data, you could have four reasonable options: reflect the sender's MSS, reflect it minus 8 (PPPoE), reflect it minus 20 (IPIP tunnel), or reflect it minus 28 (IPIP tunnel plus PPPoE). If you have no data for a network, pick one at random.
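In code, the SYN+ACK choice is roughly this (the per-network adjust value would come from the same pipeline; everything here is illustrative):

    /* Pick the MSS to advertise on a SYN+ACK. */
    #include <stdint.h>
    #include <stdlib.h>

    enum mss_adjust {
        REFLECT          = 0,  /* reflect the client's SYN MSS as-is */
        REFLECT_MINUS_8  = 8,  /* PPPoE overhead */
        REFLECT_MINUS_20 = 20, /* IPIP tunnel */
        REFLECT_MINUS_28 = 28, /* IPIP tunnel plus PPPoE */
    };

    /* client_mss: the MSS option from the client's SYN.
     * local_max:  the largest MSS this server can accept.
     * adjust:     per-network observation, or -1 if we have no data. */
    uint16_t choose_synack_mss(uint16_t client_mss, uint16_t local_max, int adjust)
    {
        static const int options[] = { REFLECT, REFLECT_MINUS_8,
                                       REFLECT_MINUS_20, REFLECT_MINUS_28 };

        if (adjust < 0)                /* no data for this network: pick at random */
            adjust = options[rand() % 4];

        uint16_t mss = client_mss > adjust ? (uint16_t)(client_mss - adjust)
                                           : client_mss;
        return mss < local_max ? mss : local_max;  /* lower of the two */
    }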
That's about local link loss; at best you get bufferbloat from confusing the wired desktop and the wireless laptop that share an 800-1200 Mbit/s DOCSIS downlink.
Or worse, different service tiers in a neighborhood getting bundled together behind CGNAT; though that's a clear argument for IPv6.
Spiders that send too much traffic tend to get blocked, so they are already having to contend with some sort of coordination. Whatever system they’re using for that coordination (server affinity being the simplest) can also propagate the congestion windows.