We’ve shared a lot about how much we love QUIC (and why we’re building our own implementation called quicly). It promises latency reduction, improved throughput, resilience to client mobility, and increased privacy and security. Excitingly, the QUIC working group at the IETF is now on the cusp of getting the first version of QUIC wrapped up and ready for internet-wide deployment. While many of the people and teams building and planning to use it are eager to see wide deployment, one concern keeps coming up: what is the computational cost of QUIC’s additional features and protections? QUIC is being touted as a replacement for TCP; how can it replace TCP if it requires significantly more computational power?
We ran some tests to try to find answers, and here’s the high-level one: yes, QUIC can be as computationally efficient as TCP!
Before the bottles of champagne come out, we’ll admit what will soon become obvious: this is a simple setting and benchmark, and we need to do more testing with more realistic and representative hardware and traffic scenarios. Importantly, we did not have any hardware offload enabled for TCP or QUIC. Our goal was to use a simple scenario with synthetic traffic to eliminate some of the more obvious computational bottlenecks and gain insights into how we might reduce QUIC’s costs.
That said, we were surprised to find that QUIC did as well as TCP in even our simple scenario.
You could think of the exercise in this post as trying to make our car race well against a Ferrari on a race track. A race track is a highly artificial environment, and the experience of driving our car on it is not representative of what you will experience daily (unless you are a race car driver). Working through the problem of racing well on that track, however, helps us uncover bottlenecks. The important and transferable things here are the measures we take to remove those bottlenecks. That’s what this post is about.
TCP has been the workhorse of the web for a long time, and a lot of effort has gone into optimizing its implementations over the years to make it more computationally efficient. QUIC, however, is still a nascent protocol; it has yet to be widely deployed and tuned for computational efficiency. How would such a new protocol compare against the venerable TCP? Importantly, can QUIC be as efficient as TCP in the near future?
We anticipated two major sources of computational cost for QUIC:
Acknowledgement processing: A large fraction of packets in a typical TCP connection carry only acknowledgements. TCP acknowledgements are processed within the kernel, both at the sender and the receiver. QUIC does these in user space, resulting in more data copies across the user-kernel boundary and more context switches. Additionally, TCP acknowledgements are in plaintext, while QUIC acknowledgements are encrypted, increasing the cost of sending and receiving acknowledgements in QUIC.
Per-packet sender overhead: The kernel knows about TCP connections, and can remember and reuse state that is expected to remain unchanged for all packets sent in a connection. For instance, the kernel needs to typically only look up the route for the destination address or apply firewall rules once at the start of the connection. Since the kernel has no connection state for QUIC connections, these kernel operations are performed on every outgoing QUIC packet.
Since QUIC runs in user space, both of these costs are higher with QUIC than with TCP: every packet that QUIC sends or receives crosses the user-kernel boundary, incurring a context switch.
The experiment and first impressions
To start answering the question we laid out earlier, we set out to do a simple benchmark. We used quicly as both the QUIC server and the QUIC client. QUIC packets are always encrypted using TLS 1.3, and quicly uses picotls, H2O’s TLS library, for this purpose. Our reference TCP setup used picotls over native Linux TCP, to minimize the differences between it and the QUIC setup.
Computational efficiency can be measured in one of two ways: by measuring the amount of computational resources required to saturate a network, or by measuring the throughput sustainable with all available computational power. Saturating the network introduces variability due to packet losses and the subsequent loss recovery and congestion controller actions. While it is important to include these effects when measuring performance, we wanted to avoid this variability, and therefore chose the latter method. We measured computational efficiency as the throughput that a sender could sustain on an otherwise idle high-bandwidth network with all of its computational power.
The sender's computational efficiency is important for two other reasons. First, senders tend to bear the brunt of computational cost in transport protocols. This happens because the sender is responsible for most computationally expensive transport functions, such as running timers to detect packets dropped in the network and retransmitting them, monitoring the network’s round-trip time, and running bandwidth estimators so that it doesn’t congest the network. Second, servers are typically senders and improving the computational efficiency of protocol processing matters more at servers. (This doesn’t mean that computational efficiency at clients is not important, but protocol processing isn’t typically their primary bottleneck.)
A brief bit about our experimental setup before diving into results. Our sender used Ubuntu 19.10 (linux kernel version 5.3.0) on an Intel Core m3-6Y30, capped to single core and single thread. The sender was connected to the local network through a USB Gigabit-Ethernet adapter that uses the ASIX AX88179 controller. Checksum offloading was enabled for both TCP and UDP. Other hardware optimizations for TCP, such as TCP Segmentation Offload (TSO), were not used; we plan to use these in future comparisons along with similar hardware optimizations for UDP. The receiver was a quad-core MacBook Pro running at 2.5 GHz, connected to the sender via a commodity Gigabit-Ethernet switch. Importantly, to ensure that the sender would not saturate the network when using all of its computational power, we limited the CPU’s clock from 2.2GHz down to 400MHz. We confirmed that there were no losses in the network when running the experiments.
We are using fairly low-end hardware for our sender here. And that’s just fine – as noted earlier, in this first step, we care about the measures that we would take to remove the bottlenecks that appear and about their transferability to other environments. In our future steps, we will be looking at server-grade hardware.
As the first reference benchmark, we measured the maximum achievable raw unencrypted TCP throughput using iperf. This throughput was 708Mbps.
As the second reference benchmark, we measured sustained throughput achievable by TLS 1.3 over TCP, using picotls with AES128-GCM as the cipher. This throughput was 466Mbps, roughly 66% of what we saw with unencrypted TCP. This performance reduction is due to the cost of cryptography, the use of a non-blocking socket, and the cost of interrupting user-space execution to handle incoming acknowledgements. Importantly however, the overheads don't really impact the question we are investigating here, since these costs apply to both TLS over TCP and QUIC.
Finally, we measured sustained throughput with off-the-shelf quicly. This throughput was 196Mbps, approximately 40% of the throughput of TLS 1.3 over TCP. QUIC suffered from the costs that we expected, and the initial number was sobering.
We wanted not only to measure the cost of running QUIC, but also to see what we could do to lower it. And we did not want to limit ourselves to our implementation; we would consider changes or tweaks to the protocol as well. We did this in three steps, and we’ll walk through each one next.
Reducing acknowledgement frequency
As in TCP, the QUIC specification recommends that a receiver send an acknowledgement for every two packets that it receives. While this is a reasonable default, receiving and processing acknowledgements is a source of computational cost for a data sender. A receiver could simply send fewer acknowledgements, but doing so can reduce the connection’s throughput, especially early in the connection.
To understand this, imagine that each acknowledgement received by the sender allows it to increase its rate. The sooner it receives an acknowledgement, the sooner it increases its sending rate. If that rate is low, there are fewer packets going to the receiver per round trip and therefore fewer acknowledgements coming back from it. Reducing the number of acknowledgements for such a connection can measurably reduce the sender’s sending rate and overall performance.
For a connection that has high throughput, reducing the number of acknowledgements could still mean that enough of them are coming back to not measurably impact the sender’s rate. (This is a bit of a simplification, and the other considerations are explained in a bit more detail here). In our test setup, we can reduce the acknowledgement frequency without any impact on the sender’s throughput.
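To make this concrete, here is a back-of-the-envelope sketch in Python. The transfer size is an assumption we chose for illustration, and we use quicly's 1280-byte packet size as an approximation of the per-packet payload:

```python
# Back-of-the-envelope: acknowledgement packets generated by a receiver
# for a bulk transfer, at different ack frequencies. The transfer size
# is an illustrative assumption, not a measurement from our experiment.

PACKET_SIZE = 1280           # quicly's default QUIC packet size, in bytes
TRANSFER_SIZE = 100 * 2**20  # an assumed 100 MB bulk transfer

def ack_packets(ack_every_n):
    """Number of ack packets the receiver sends if it acks every N packets."""
    data_packets = -(-TRANSFER_SIZE // PACKET_SIZE)  # ceiling division
    return data_packets // ack_every_n

acks_default = ack_packets(2)   # spec-recommended: ack every 2 packets
acks_reduced = ack_packets(10)  # reduced frequency: ack every 10 packets

# The sender must receive, decrypt, and process each of these in user
# space, so one fifth as many acks is a real reduction in its workload.
print(acks_default, acks_reduced)  # -> 40960 8192
```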
We reduced the acknowledgement frequency from once every two packets to once every ten packets, and as a result, quicly was able to sustain a throughput of 240Mbps. Reducing the number of acknowledgement packets benefits the network, and this experiment shows that it also benefits the sender by reducing computational overhead. This result convinced us to implement the QUIC Delayed ACK extension proposal – more on this later.
Coalescing packets with Generic Segmentation Offload (GSO)
Following the advice of Willem de Bruijn and Eric Dumazet in "Optimizing UDP for content delivery: GSO, pacing and zerocopy", we next turned to see if UDP Generic Segmentation Offload might help with reducing the overhead caused by single packet writes and context switches per packet. Generic Segmentation Offload (GSO) is a Linux feature that allows user-space applications to provide a series of packets as a single unit to the kernel. This unit is passed through the kernel as one large virtual packet, so the kernel makes its decisions once per large virtual packet instead of once per small packet. Unlike TCP Segmentation Offload, GSO in Linux is not necessarily a hardware offload. When the hardware does not support segmentation offload of UDP packets, the large virtual packet is divided into multiple small UDP packets as it reaches the network card driver. That was the case in this experiment.
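To illustrate the API, here is a minimal Python sketch (quicly itself is written in C) of handing ten packets to the kernel as one GSO unit. `UDP_SEGMENT` is the Linux socket option number from linux/udp.h (kernel 4.18 or later); the destination address, port, and packet sizes are arbitrary choices for the example, and the fallback path keeps the sketch runnable where UDP GSO is unavailable:

```python
import socket

UDP_SEGMENT = 103  # Linux setsockopt option from linux/udp.h

def send_burst(sock, dest, packets):
    """Send a list of equal-sized UDP packets as one GSO 'virtual packet'."""
    seg_size = len(packets[0])
    try:
        # Ask the kernel to split the buffer into seg_size-byte datagrams.
        sock.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, seg_size)
        return sock.sendto(b"".join(packets), dest)  # one system call
    except OSError:
        # Fallback where UDP GSO is unsupported: one system call per packet.
        return sum(sock.sendto(p, dest) for p in packets)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
burst = [b"\x00" * 1200 for _ in range(10)]  # ten 1200-byte packets
sent = send_burst(sock, ("127.0.0.1", 4433), burst)
print(sent)  # 12000 bytes handed to the kernel
```

Either way ten datagrams reach the wire; the difference is how many times the sender pays the per-packet cost of crossing the user-kernel boundary and traversing the kernel's send path.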
By coalescing at most ten UDP packets into one object and sending them using GSO, QUIC’s throughput increased from 240Mbps to 348Mbps – a 45% increase! To see if coalescing even more would improve performance, we tried coalescing up to 20 UDP packets with GSO. This led to a further 24% increase in throughput, and QUIC was now zooming along at 431Mbps.
This was huge. The per-packet cost of QUIC was clearly a significant bottleneck, and addressing that with GSO helped enormously. We needed to choose a GSO size, which we will discuss later. We then turned our eyes to another parameter that does not get as much attention as it should: packet size.
Increasing packet size
The QUIC specification recommends a conservative default minimum QUIC packet size of 1200 bytes, and quicly used 1280 bytes. Implementations are allowed to increase packet size if they have reason to believe that the path might support larger packets. Given that the path was able to support 1472-byte QUIC packets, and that TCP was using 1460-byte packets on this path, it made sense for QUIC to use larger packets as well. Increasing this maximum packet size reduces computational cost, since it reduces the number of packets required to transfer a certain amount of data. That reduces computational inefficiency at both the sender and the receiver because there is a fixed per-packet processing cost at both sides.
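A quick calculation shows how much per-packet work the larger size saves. The transfer size here is an assumption for illustration, and we compare whole packet sizes rather than exact payload sizes:

```python
# Rough per-packet accounting for the two QUIC packet sizes. Fewer
# packets means fewer per-packet costs paid at both sender and receiver.

TRANSFER_SIZE = 100 * 2**20  # an assumed 100 MB transfer

def packets_needed(packet_size):
    """Packets required to carry the transfer at a given packet size."""
    return -(-TRANSFER_SIZE // packet_size)  # ceiling division

small = packets_needed(1280)  # quicly's original packet size
large = packets_needed(1460)  # matching the TCP payload size on this path
savings = 1 - large / small

print(small, large, f"{savings:.1%}")  # roughly 12% fewer packets
```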
So, we changed the QUIC packet size from 1280 bytes to 1460 bytes, for parity with the TCP payload size. With this change, quicly was able to sustain a throughput of 466Mbps – an 8% increase in throughput.
QUIC was now faster than TLS over TCP!
Using the results in production
This experiment showed a clear path forward for improving quicly’s efficiency: reducing acknowledgement frequency, coalescing packets with GSO, and using as large a packet size as possible. We now turn to generalizing and adopting these optimizations so that they work well across various environments, minimizing the risk of side effects.
Reducing acknowledgement frequency
Reducing the acknowledgement frequency to a fixed rate of once per ten packets has a couple of issues. First, as noted above, it can measurably hurt throughput when the connection’s throughput is low to begin with. Second, the client in these experiments is quicly, whereas when running in production, the client will be a browser, which we do not control.
The answer to both of these problems is the Delayed ACK extension for QUIC, which enables a sender to dynamically control the acknowledgement frequency of the receiver based on its current sending rate.
Since running this experiment, we implemented the delayed ACK extension, with the sender controlling acknowledgement frequency to be once every eighth of its congestion window, which is approximately once every eighth of its round-trip time. In our experimental setup above, this reduces acknowledgements to once every sixty packets, which is a significantly larger reduction than in our experiment.
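In sketch form, the sender's rule looks like this. The function name and the floor of two packets are ours for illustration, and the congestion-window value is an assumption, roughly what a fully ramped-up connection reaches in our setup:

```python
# Sketch of the delayed-ACK sender logic: ask the receiver to send one
# acknowledgement per eighth of the congestion window (measured in
# packets), which is approximately one ack per eighth of a round trip.

def ack_frequency(cwnd_packets):
    """Packets per acknowledgement, as signalled by the sender."""
    # Assumed floor: never ask for less than the ack-every-2 default.
    return max(2, cwnd_packets // 8)

print(ack_frequency(480))  # -> 60: one ack per sixty packets when cruising
print(ack_frequency(10))   # -> 2: small windows keep acks frequent
```

Because the frequency scales with the congestion window, a low-throughput connection automatically keeps the frequent acknowledgements it needs to grow, while a high-throughput connection sheds most of the acknowledgement load.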
Coalescing packets with GSO
The GSO experiment showed that coalescing more packets made QUIC more efficient. There’s a cost to GSO however. Coalescing with GSO means that the sender bursts out all these packets into the network, leading to increased short-term pressure on the network’s buffers and an increased probability of packet loss.
So, with this tradeoff, how many packets should quicly coalesce? The specified acceptable burst size for a QUIC sender is ten, which seems like a pretty good recommendation to use here. quicly has now implemented an option to send GSO bursts of ten packets.
We note that currently a sender’s kernel cannot pace these packets out – that is, send the constituent packets of a GSO burst out at a rate that keeps the network from having to absorb the burst. We would be interested in using such a facility if it were to be implemented in the Linux kernel.
Choosing a packet size
The larger the packet size, the better the performance. However, larger packets risk being dropped on some network paths. We could implement mechanisms such as Path MTU Discovery to detect the largest packet size that can be used on a connection, but in practice this is only useful for long-lived connections. For most connections, and at the beginning of all connections, a sender needs another way to determine a good packet size.
Instead of using a fixed packet size, the quicly server now determines its packet size based on the size of the first packets it receives from the client. Since these packets successfully made it through the network to the server, it is reasonable to expect that packets of the same size have a good chance of making it back through the network to the client.
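A hypothetical sketch of this heuristic (the names and bounds handling are ours, not quicly's actual API; 1200 bytes is the specified minimum for initial packets, and 1472 bytes is the largest QUIC packet our Ethernet path carried):

```python
# Derive the server's packet size from the size of the client's first
# packets, clamped to conservative bounds. This is an illustrative
# sketch of the heuristic, not quicly's implementation.

MIN_QUIC_PACKET = 1200  # spec-recommended minimum initial packet size
MAX_QUIC_PACKET = 1472  # largest QUIC packet on a standard Ethernet path

def choose_packet_size(first_client_packet_len):
    """Pick the server's packet size from the client's first packet."""
    return max(MIN_QUIC_PACKET, min(first_client_packet_len, MAX_QUIC_PACKET))

print(choose_packet_size(1350))  # -> 1350 (Chrome's default size)
print(choose_packet_size(1460))  # -> 1460 (quicly-to-quicly on this path)
```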
Conclusion and next steps
With these changes in place, quicly now achieves 464Mbps (within 1% of TLS 1.3 over TCP) when the first QUIC packets sent by the client are 1460 bytes, and 425Mbps (roughly 9% slower than TLS 1.3 over TCP) when the first QUIC packets sent by the client are 1350 bytes – the default packet size used by Chrome.
The workload and environment in this experiment represent only one point in the vast space of workloads and environments. Our goal was to see if we could get QUIC to meet TCP’s throughput in this very specific microbenchmark. With further testing and experiments, we will explore and improve QUIC performance in other representative settings and scenarios.
Significantly, however, this experiment shows us that, by using system optimizations and protocol mechanisms judiciously, QUIC has a fair chance of being as computationally efficient as TCP.
As we continue work on tuning our implementation’s performance with further testing and deployment, we will continue to report our findings. Stay tuned!