Forward error correction in QUIC (docs.google.com)
62 points by jsnell on Feb 27, 2016 | 12 comments



A similar idea (coded TCP) is patented: http://www.codeontechnologies.com/technology/use-cases/coded... They are selling licenses to Cisco: http://www.codeontechnologies.com/partners/

The idea is simple: XOR random combinations of packets and use Gaussian elimination to restore them. Google uses the simplest version of it: they XOR a group of N packets into a single parity packet and send N+1 packets in total, so they can recover one packet if it is lost.
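
A minimal sketch of that N+1 scheme, just to make the mechanism concrete (illustrative only; packet sizes are made up and this is not QUIC's actual FEC framing):

    # Single-parity XOR FEC: one parity packet protects a group of N equal-length packets.
    from functools import reduce

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def make_parity(packets):
        """XOR N packets into one parity packet (so N+1 packets are sent in total)."""
        return reduce(xor_bytes, packets)

    def recover_one(survivors, parity):
        """If exactly one packet of the group was lost, XOR the parity with the survivors."""
        return reduce(xor_bytes, survivors, parity)

    # Example: packets 0..3 protected by one parity packet; packet 2 is lost in transit.
    pkts = [bytes([i] * 8) for i in range(4)]
    parity = make_parity(pkts)
    assert recover_one([pkts[0], pkts[1], pkts[3]], parity) == pkts[2]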


That sounds like network coding


Ya, I don't understand how that's patented. It must be a very specific patent. A few years ago, when I first learned of the woes of TCP over lossy links, literally the first thought that popped into my head was "why not code N packets into M>N and recover using Gaussian elimination over GF(2)?" Such an idea could have passed for non-obvious maybe 50 years ago, but it's a pretty basic idea in the 21st century.

To be clear, you can trivially apply any erasure-coding technique (https://en.wikipedia.org/wiki/Erasure_code) across packets in a channel.
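
A toy sketch of that "code N packets into M>N and recover with Gaussian elimination over GF(2)" idea; the coefficient masks are fixed here so the example is deterministic, while real schemes typically draw random coefficients over a larger field such as GF(2^8):

    # Code N source packets into M > N XOR combinations; any N linearly independent
    # coded packets are enough to recover the originals via Gauss-Jordan over GF(2).

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def encode(sources, masks):
        """Each mask is a bitmask selecting which source packets get XORed together."""
        coded = []
        for mask in masks:
            payload = bytes(len(sources[0]))
            for i, s in enumerate(sources):
                if mask >> i & 1:
                    payload = xor_bytes(payload, s)
            coded.append((mask, payload))
        return coded

    def decode(coded, n):
        """Gauss-Jordan elimination over GF(2) on (mask, payload) pairs."""
        rows = [[mask, payload] for mask, payload in coded]
        pivot = [None] * n                    # pivot[col] = row index that ends up solving source col
        for col in range(n):
            r = next((i for i, row in enumerate(rows)
                      if i not in pivot and row[0] >> col & 1), None)
            if r is None:
                raise ValueError("not enough linearly independent packets received")
            for i, row in enumerate(rows):    # clear bit `col` from every other row
                if i != r and row[0] >> col & 1:
                    row[0] ^= rows[r][0]
                    row[1] = xor_bytes(row[1], rows[r][1])
            pivot[col] = r
        return [rows[pivot[col]][1] for col in range(n)]

    # 4 source packets coded into 6 (4 plain + 2 mixed); losing one still decodes.
    sources = [bytes([i] * 8) for i in range(4)]
    coded = encode(sources, [0b0001, 0b0010, 0b0100, 0b1000, 0b1111, 0b0110])
    received = [c for i, c in enumerate(coded) if i != 2]   # packet carrying source 2 is lost
    assert decode(received, 4) == sources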


The goal of network coding is increased throughput, and that is only possible in the multicast case. The authors of CTCP also call their scheme "network coding", but in their case it just decreases delay when packets are lost. The throughput of normal TCP is lower, but only because it reacts to loss by reducing its rate, so the same throughput can be obtained with more aggressive congestion control.


I've often wondered why not implement G.709 on the packet payload (as opposed to the whole frame) and see how that compares to TCP retransmission.

https://en.wikipedia.org/wiki/G.709


I don't know about G.709, but in general coding, including FEC, is used at the link layer, where it can be tuned finely enough to achieve throughput close to channel capacity and a nearly zero packet loss ratio. That is the problem of channel estimation and rate control. In an ideal world where users don't contend and always transmit at the same rate, it would be enough.

TCP solves another problem: it adjusts the source rate. It assumes that the link layer works correctly and that packets are not lost because an incorrect rate (i.e. MCS) is used. Instead, packet loss is used as the indicator of congestion: users are sending data faster than the bottleneck can carry it. Congestion is hard to predict, and packets are not lost with a stable probability on the TCP timescale (one RTT, around 50 ms), so TCP can't react in time and select a code that would correct the losses without adding too much overhead. Thus, coding is not used at the TCP layer.

That is why Google focuses on latency, not throughput: that is the area where you can benefit from FEC at the transport layer. Another problem QUIC can work around is bad rate control, such as Wi-Fi rate control on old Windows systems, but as wireless internet usage moves to mobile, and Google controls Android, it becomes less of a problem: Google can just improve Wi-Fi rate control in Linux if it fails.


Yeah. I understand the problem that TCP is trying to solve.

My thinking was that there is a whole class of network applications that don't work great under TCP, because the retransmission mechanism drives latency super high when the network is congested. TCP backs off the transmission that might be causing the packet loss, but those packets still have to be transmitted again, and the bandwidth timeslot consumed by each dropped packet amounts to wasted resources in the network. In a datacenter microburst situation, retransmissions just tickle the problem you're trying to avoid.

FEC over a lossy network would trade a slight increase in baseline resource utilization for the elimination of retransmissions, which might be beneficial in a microburst environment. I don't really know; I'm just thinking aloud.


Consider three conditions:

- high link-layer integrity; packet loss is mainly due to congestion

- low packet loss; most flows unaffected

- bursty packet loss; individual flows, where they see any loss at all, have a non-negligible probability of losing multiple back-to-back packets (e.g. FIFO queue tail drop, or a dynamic routing transient like a black hole or a brief (~TTL) forwarding loop).

Consider the limit in which the number of simultaneous flows becomes large. In that limit, there is a lot of additional FEC data that has been generated, successfully transported across a network, and ultimately discarded as redundant. There is also FEC data that fails to help in the reconstruction of flows which experience a multi-packet loss too great for the erasure code being used.

We can calculate wasted power in this limit, and in some situations that waste can be significant.
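
As a toy illustration of that waste, with made-up numbers (assuming the simple one-parity-per-N XOR scheme and independent per-packet loss probability p): a group's parity packet is pure overhead whenever the group sees no loss, and it only repairs anything when exactly one packet of the group is lost.

    # Back-of-envelope: how often is the parity packet in a 1-parity-per-N group useful?
    def parity_usefulness(n, p):
        no_loss = (1 - p) ** n                     # parity transported but never needed
        one_loss = n * p * (1 - p) ** (n - 1)      # the only case the parity can repair
        multi_loss = 1 - no_loss - one_loss        # parity sent, but not sufficient
        return no_loss, one_loss, multi_loss

    for p in (0.001, 0.01, 0.05):
        wasted, repairs, insufficient = parity_usefulness(10, p)
        print(f"p={p}: wasted {wasted:.1%}, repairs {repairs:.1%}, insufficient {insufficient:.1%}")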

One situation is when the FEC-generator is sending a large number of flows that stay near steady-state equilibrium amongst themselves and with other long-lived flows, which is reasonable to expect when multi-minute adaptive streaming video is the dominant component of the traffic.

Conversely, for a transmission that is short enough that it never enters the moral equivalent of congestion avoidance, and where the application is also latency-sensitive, small amounts of FEC in the early parts of conversations might be a win over retransmissions, depending on how often the FEC actually forms part of loss recovery.

In the modern Internet, the cost-benefit analysis of the extra power the extra FEC requires is probably strongly time-dependent, and it may be counter-intuitive. (E.g., if FEC doubles initial traffic, it may exacerbate any burstiness, which may in turn lead to greater burst synchronization; that is, in a microburst environment the additional FEC data, unless it's somehow congestion-avoiding (hard to do for very short flows), might be the opposite of beneficial.)

It's probably also worth considering differences in FEC receivers, for instance whether they are drawing power from a small battery or not. Consider a toy question like, "What drains more battery energy: two hours of receiving FEC or two and a bit hours of receiving non-FEC?" "A bit" may matter. It might also be dynamically calculable: lots of retransmits? Transition to sending FEC proactively. No FEC use? Transition to reactive packet loss recovery.
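
A hypothetical sketch of that last idea, a hysteresis toggle between proactive FEC and reactive retransmission; the thresholds and window size are invented for illustration and are not anything QUIC actually specifies:

    # Hypothetical on/off switch: enable FEC when retransmissions are frequent,
    # fall back to reactive loss recovery when they are rare. Thresholds are made up.
    class FecToggle:
        def __init__(self, enable_at=0.02, disable_at=0.005, window=1000):
            self.enable_at = enable_at        # loss rate above which FEC is turned on
            self.disable_at = disable_at      # loss rate below which FEC is turned off
            self.window = window              # packets per measurement window
            self.sent = 0
            self.retransmitted = 0
            self.fec_enabled = False

        def on_packet_sent(self, was_retransmission):
            self.sent += 1
            self.retransmitted += int(was_retransmission)
            if self.sent >= self.window:
                loss_rate = self.retransmitted / self.sent
                if loss_rate >= self.enable_at:
                    self.fec_enabled = True       # lots of retransmits: send FEC proactively
                elif loss_rate <= self.disable_at:
                    self.fec_enabled = False      # FEC not earning its keep: go reactive
                self.sent = self.retransmitted = 0
            return self.fec_enabled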

Anyway, it's cool to get even a little look at what Google's learned from its experiments. Everyone should encourage the sharing of these sorts of details, and we should probably encourage other large scale entities with different types of traffic (e.g. non-YouTube Google; Facebook; Twitter) to do likewise.


It is worth noting that this document reads more like a first draft; in fact, its title is 'QUIC FEC v1'. No details or illustrations of the implementation and experimental results are given; instead you find statements like 'Median and mean join latency increased'.


It's not academic writing, just a short explanation of why something that was expected to be a major advantage of QUIC is being removed from the spec for now, due to having a negative impact on YouTube. (Of course, that shows they're able to revert bad decisions, which is definitely an advantage over TCP...)

Though there is an interesting rebuttal in the related mailing list thread [1] suggesting that the feature is sound as-is, but was being applied in the wrong way in these experiments. Specifically, that the largest expected gain is for the client-sent packets very early during the connection setup.

[1] https://groups.google.com/a/chromium.org/forum/#!topic/proto...


Why not use something more advanced like LDPC instead of basic XOR (unless I'm missing a trick)?


I think the reason is that the computational power needed to use LDPC is much greater than that required for XOR.



