Fastly's Response to SegmentSmack

Jana Iyengar

VP, Product, Infrastructure Services, Fastly

Ryan Landry

Vice President, Technical Operations, Fastly

Marc Eisenbarth

Director of Application Security

August 14, 2018

Security Engineering

A remotely exploitable denial-of-service (DoS) attack against the Linux kernel, called SegmentSmack, was made public on August 6th, 2018, as CVE-2018-5390. Fastly was made aware of this vulnerability prior to that date through responsible disclosure.

As part of our initial investigation, Fastly discovered a candidate patch proposed by Eric Dumazet from Google to address this vulnerability. We discussed the vulnerability and the patch with Eric, reproduced the attack, validated the patch as a fix, and estimated the impact of the vulnerability on our infrastructure. We immediately deployed temporary mitigations where we were most vulnerable, while simultaneously preparing and rolling out a patched kernel to our fleet.

As of this post, our entire fleet has been upgraded to the fixed kernel, and our customers and traffic are protected from SegmentSmack.

The Vulnerability

The SegmentSmack attack targets an expensive operation in Linux’s TCP segment assembly code, and was quickly identified as high risk. To understand the vulnerability, it is helpful to understand TCP receivers in general and the Linux TCP receiver in particular.

TCP presents an ordered and reliable bytestream to applications using it. Internet infrastructure can be quite chaotic – packets get lost, reordered, duplicated – and TCP is responsible for creating order out of this chaos. Packets carrying TCP segments may arrive out of order at a TCP receiver, which stores them until they can be delivered in order to the application.

The Linux TCP implementation uses an “out-of-order queue” to hold segments that are not received in order. When a new segment is received, a Linux TCP receiver walks through the queue trying to find the correct position for this segment and to coalesce existing segments. Coalescing these existing segments is an expensive operation. In the vulnerable kernels, this operation was executed even when the segments were not contiguous and therefore could not be coalesced. An attacker could simply send small non-contiguous segments, and the receipt of every new segment would cause the receiver to spend enormous CPU time trying to coalesce segments in the queue. This would delay or entirely disrupt servicing of requests from legitimate users.
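To make the cost difference concrete, here is a toy model in Python. It is not the kernel code and not the actual patch: the function names, the list-based queue, and the "work unit" counter are all assumptions made for illustration. It contrasts a receiver that runs a full coalescing pass on every arrival with one that only attempts a merge when a neighbor is actually contiguous.

```python
import bisect

def insert_naive(queue, seg):
    """Insert seg=(start, end), then run a coalescing pass over the whole queue.

    Returns the number of segments examined (our illustrative "work units").
    """
    bisect.insort(queue, seg)
    work = 0
    merged = []
    for start, end in queue:
        work += 1  # every queued segment is examined on every arrival
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)  # contiguous: coalesce
        else:
            merged.append((start, end))
    queue[:] = merged
    return work

def insert_checked(queue, seg):
    """Insert seg and coalesce only with neighbors that are contiguous."""
    i = bisect.bisect_left(queue, seg)
    queue.insert(i, seg)
    work = 1
    # Merge with the right-hand neighbor if it starts where seg ends.
    if i + 1 < len(queue) and queue[i][1] == queue[i + 1][0]:
        queue[i] = (queue[i][0], queue[i + 1][1])
        del queue[i + 1]
        work += 1
    # Merge with the left-hand neighbor if it ends where seg starts.
    if i > 0 and queue[i - 1][1] == queue[i][0]:
        queue[i - 1] = (queue[i - 1][0], queue[i][1])
        del queue[i]
        work += 1
    return work

def attack_cost(insert, n):
    """Send n one-byte segments separated by one-byte gaps (never contiguous)."""
    queue = []
    total = 0
    for k in range(n):
        start = 1 + 2 * k
        total += insert(queue, (start, start + 1))
    return total, len(queue)

if __name__ == "__main__":
    for name, fn in (("naive", insert_naive), ("checked", insert_checked)):
        work, queued = attack_cost(fn, 1000)
        print(f"{name}: {work} work units, {queued} segments queued")
```

In this model the naive receiver's total work grows with the square of the number of queued segments, while the checked receiver's work grows linearly, which is the shape of the problem an attacker exploits by sending many tiny non-contiguous segments.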

In simple terms, the patches that fix this vulnerability eliminate unnecessary attempts at coalescing segments in the out-of-order queue.

Our Response

We reproduced the attack, and found that even a single weakly-provisioned attacker could severely impact customer traffic.

The kernel patches had already been made public, but rolling out new kernels without adversely affecting our customers takes time and care. We wanted to roll out protections for our servers as quickly as possible, with minimal risk to our customers and those traversing the Fastly network.

We realized that we could simply reduce the receive buffer allocated by the kernel to incoming TCP connections with no discernible impact on web and video traffic. A TCP receiver uses a receive buffer to hold data that has not yet been delivered to the application, including data received out of order. A TCP receiver communicates this buffer size to the sender, and any data that does not fit in the buffer is discarded by the receiver. A SegmentSmack attack is only effective if there is a large number of segments in the receiver’s out-of-order queue. Reducing the receive buffer size at our servers would limit the size of the out-of-order queue, and consequently reduce the potency of the attack.

Unfortunately, a smaller receive buffer also limits the sender’s throughput. At any given time, the sender can only send into the network as much data as can fit in the receiver’s buffer. In Fastly’s favor was the fact that almost all of our traffic is download traffic, where our servers are data senders and not receivers. As a result, we expected that reducing the receive buffer at our servers would not affect the throughput of most of our traffic.
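The limit described above is the usual window-per-round-trip bound: a sender can have at most one receive window of data in flight per round trip, so throughput cannot exceed the window size divided by the round-trip time. The short Python sketch below works through that arithmetic. The window sizes and the 100 ms round-trip time are illustrative assumptions, not values from our deployment.

```python
def max_throughput_bits_per_sec(receive_window_bytes, rtt_seconds):
    """Upper bound on TCP throughput imposed by the receive window.

    At most one window of data can be in flight per round trip,
    so throughput <= window / RTT.
    """
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return receive_window_bytes * 8 / rtt_seconds

if __name__ == "__main__":
    rtt = 0.100  # assumed 100 ms round-trip time
    for window in (16 * 1024, 64 * 1024, 1024 * 1024):  # assumed window sizes
        mbps = max_throughput_bits_per_sec(window, rtt) / 1e6
        print(f"window={window:>8} bytes  rtt={rtt * 1000:.0f} ms  "
              f"max ~{mbps:.2f} Mbit/s")
```

With these assumed numbers, a 64 KiB window over a 100 ms path caps a single connection at roughly 5.2 Mbit/s. This is why shrinking the buffer matters on paths where our servers receive large amounts of data, and matters very little where they mostly send.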

We anticipated that reducing the buffer size would reduce throughput where our servers were data receivers. This is true in two cases: when users upload data through our edge nodes, and when our edge nodes pull content from an origin server. Through experimentation, we were able to determine an ideal buffer size that mitigated the attack with minimal impact on the performance of user uploads, which addressed the first case.

The second case was more critical: reducing the receive buffer size would cause user-visible impact on page load latency. We needed to increase the receive buffer for POP-to-POP connections and for connections between edge nodes and origins. Luckily, kernel probes (kprobes) provide an easy way to inject code into a running kernel. In particular, a kprobe can be attached to the entry of a function, where it can examine and modify the function’s arguments. We used this to our advantage by intercepting the connect() system call (a call that is only executed on TCP connections initiated by our edge nodes) and setting the receive buffer to large values for known good connections.
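For readers unfamiliar with per-connection receive buffers, the following Python sketch shows a userspace analogue of the same idea: raising SO_RCVBUF on an outbound socket before connect(), but only for trusted destinations. This is not our kprobe, which ran inside the kernel and is not reproduced here. The allowlist, the buffer size, and the function names are hypothetical and exist only for illustration.

```python
import ipaddress
import socket

# Hypothetical allowlist of trusted peers (a documentation-only address range).
KNOWN_GOOD_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

# Illustrative size only; the kernel may cap it (net.core.rmem_max on Linux).
LARGE_RCVBUF = 1024 * 1024

def is_known_good(host):
    """Return True if host (an IP address string) is in the allowlist."""
    addr = ipaddress.ip_address(host)
    return any(addr in net for net in KNOWN_GOOD_NETWORKS)

def prepare_outbound_socket(host):
    """Create a TCP socket for an outbound connection to host.

    Trusted destinations get a larger receive buffer. The option is set
    before connect() so that it is in place when the TCP handshake
    negotiates window scaling.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if is_known_good(host):
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, LARGE_RCVBUF)
    return sock

if __name__ == "__main__":
    for host in ("192.0.2.10", "198.51.100.7"):
        s = prepare_outbound_socket(host)
        size = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        print(f"{host}: known_good={is_known_good(host)} SO_RCVBUF={size}")
        s.close()
```

Doing this from a kprobe instead of in application code meant that no application had to be changed or restarted, which suited a temporary mitigation that was meant to be removed once the patched kernel was in place.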

With our buffer reduction workaround in place, we were protected from this vulnerability while kernel upgrades progressed at a measured and monitored pace across the fleet, guarding important cache-hit ratios for our customers. As the kernel upgrades were deployed through the fleet, both parts of the receive buffer mitigation were rolled back at the same time: the receive buffer limit was raised and the kprobe was removed. However, on a few edge nodes, our deployment code left the receive buffer limit in place while removing the kprobe after the kernel upgrade. This caused a limited and temporary performance regression, which was discovered quickly and reverted. The bug in our deployment code has been fixed, to protect future kernel upgrades.

Summary

Fastly is an extension of our customers’ infrastructure. Security, availability, and performance are foundational attributes of everything we do. Our deep relationships in the operational security community, along with our talented engineering teams, enabled us to respond well in advance of the public disclosure, protecting our customers, their customers, and the internet as a whole.

If problems and challenges like those covered in this blog post sound exciting to you, we’re always adding to our engineering and security teams. Check out our careers page and join us in making the internet better!