Preventing outages with resilient architectures

Laura Thomson

SVP, Engineering, Fastly

Inés Sombra

VP of Engineering, Core Systems, Fastly

Hossein Lotfi

SVP of Engineering, Fastly

November 07, 2023

Fastly’s resilient architecture principles prevent outages, mitigate severity, and deliver on our availability promises without compromising performance. We systematically eliminate “single point of failure” risks and always, ALWAYS prioritize distributed and resilient solutions that are built to scale.

We did not expect the reliability and distributed architecture of our control plane to be the key differentiating feature that everyone wanted to hear about this week, but here we are. No cloud vendor is immune to outages, but Fastly continuously and proactively acts to address this risk by building extra resilience and redundancy into our network, control plane, and data plane. This work is designed to protect us against catastrophic failures by preventing them from occurring, and mitigating the severity of their impact if they do occur.

In the spirit of #hugops, we have a lot of empathy for our colleagues in the industry who have experienced outages recently. It’s all of our worst nightmares. We decided to write about our approach to resilience today because it’s a question we’ve been asked numerous times by customers in the last week or so.

Now, let’s get into the decisions we’ve made, and the work we started several years ago to make Fastly more resilient. We’re talking about resilience against everything, from black swan events (like the complete loss of a datacenter), to more common scenarios like internet outages or sophisticated DDoS attacks. This means building resilience into the control and data planes, but also into Fastly’s overall network, traffic handling, and more. By the end of this post you’ll understand why resilience is Fastly’s middle name. And why redundancy is our first name… and why redundancy is also our last name too.

Distributed solutions reduce single points of failure

Two of the most important principles we apply throughout the Fastly platform are to build systems that are distributed and to remove single points of failure.

About three years ago Fastly began a formal process to continuously assess and strengthen our architecture, led by an internal cross-functional team called the Fastly Reliability Task Force (RTF). The RTF meets on a regular basis to triage, evaluate, discuss, and prioritize the mitigation of existential risks to the platform. This forum has been hugely instrumental in driving major improvements, while constantly planning what to tackle next. Part of the impetus for launching the task force was our recognition that Fastly could have been (in the past) unprepared to handle a data center power outage. In order to address the risks we identified, we started a (truly) massive company-wide initiative that we lovingly codenamed “Cloudvolution” for two reasons. First, we wanted to apply an evolutionary, transformative leap to the way we run our entire platform and achieve a more resilient, multi-cloud, multi-region architecture. And second, because we did not think we would ever be saying that name publicly and sometimes we like silly names. Not everything needs to be an acronym.

Control and data plane resilience

“Cloudvolution” was intended to strengthen both our control plane and our data plane, and bring high availability to core platform services for additional platform resilience. A key goal was to iteratively evolve system design abstractions to improve resilience, have clearer service boundaries, and set ourselves up for easy scaling to the next stage of customer growth (and beyond). Basically, we wanted fewer dependencies, less risk, fewer points of failure, and more resilience.

We knew that this architecture upgrade would be key to maintaining reliable service in the event of a catastrophic failure. Today, our control and data planes are multi-cloud and multi-region. We worked hard to make it so that Fastly does not depend on a single datacenter, a single availability zone, or a single cloud provider. Our control plane is run on two independent and geographically dispersed regions with a warm failover to the secondary region if needed. Similarly, our data plane (powering our observability & analytics) also has two independent regions and a warm failover, but resides in a separate cloud provider from the control plane. This effectively limits the ‘blast radius’ of a catastrophic event, making it significantly less likely that something could take out both our ability to observe (data plane) and take action (control plane).
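To make the warm failover idea concrete, here is a minimal sketch in Python. It is an illustration only, not Fastly's implementation: the `Region` and `WarmFailover` names, the region labels, and the "promote the standby after N consecutive failed health checks" rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str             # hypothetical region label
    healthy: bool = True  # whether this region can accept traffic


class WarmFailover:
    """Serve from the active region; promote the warm standby after
    a run of consecutive failed health checks (assumed policy)."""

    def __init__(self, primary: Region, secondary: Region, failure_threshold: int = 3):
        self.active = primary
        self.standby = secondary
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def observe_health_check(self, ok: bool) -> Region:
        """Record one health check result and return the region to use."""
        if ok:
            self.consecutive_failures = 0
            return self.active
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and self.standby.healthy:
            # Swap roles: the standby becomes active, the old active becomes standby.
            self.active, self.standby = self.standby, self.active
            self.consecutive_failures = 0
        return self.active
```

Requiring several consecutive failures before switching is one simple way to avoid flapping between regions on a single missed check. The same object can be exercised on purpose during a scheduled drill by feeding it failed checks.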

We also ensure that our data centers have failovers in place from grid power to uninterruptible power supply (UPS) batteries with generator backups, but it’s not enough just to have these in place on a checklist. As part of our Business Continuity Planning (BCP) we also execute failovers between active and standby regions to continuously test these capabilities in the event of a regional cloud provider outage.

Once we shifted our control and data planes to this better architecture we knew it was important to make it easy for our engineering teams to build on so that EVERYTHING Fastly built would be as resilient as possible. We made our control and data planes available as an integration platform so that engineers could spin up new products and features that are safe, scalable, and resilient out-of-the-box – even the smallest tests and alpha projects. This helps us further reduce risk, as smaller engineering teams are not forced to recreate their own version of core systems to launch something new. It’s easy for everyone to build in the safest and most resilient way, but it doesn’t stop there.

The RTF continues to identify areas to improve even as we have tackled our original priorities, and we are already working on the next iteration of our control and data planes. These improvements can take years to fully implement, so it’s important to start working on them before you need them. We are already making progress on plans to further decouple sticky system abstractions in the control plane over the coming year, which will not only make it more resilient, but also help us to accelerate product development.

Network resilience (beyond the control plane)

Data center failures that impact the control and data planes of a platform are not the only challenges to face when running a global edge network that promises low latency and high availability. Now that we’ve talked through the ways in which our control and data planes are built to be distributed and resilient, here’s a look at the ways in which the Fastly edge network is built to avoid single points of failure, and to mitigate problems automatically, immediately, and intelligently. Most of the time when we talk about resilience at Fastly, we’re talking about our edge network and content delivery services, so let’s dig in.

Resilient handling of traffic and network outages

Traffic anomalies, latency problems, and internet outages are daily realities for a network the size of Fastly. On any given day, internet transit providers collectively experience anywhere from just a few issues all the way up to hundreds of these temporary, short-lived, connectivity or performance degradations. In aggregate these are referred to as “internet weather.” Some internet outages are large enough to capture global attention. Most are (relatively) smaller events that pass quickly, but even the “smaller” weather events cause latency and performance degradation along the way that can have serious impacts.

Current industry best practices often employ techniques like Border Gateway Protocol (BGP) routing changes, but because BGP doesn’t have any application-level failure detection capabilities built into it, it can take a long time for a problem to get resolved – sometimes hours. A monitoring or observability system outside of BGP has to detect the issue, infer the problematic routes, plot a solution, and then use BGP to issue instructions to change the network topology and eliminate faulty routes. Once those instructions are issued, BGP is fast to fix the issue, but all the stuff that comes before it can take minutes or hours to get to that point. So BGP isn’t very effective for fine-tuning changes around smaller interruptions or outages in the network. Most of the time the issue has resolved itself by the time the BGP change would have an impact, and it does nothing to help the sites and applications that suffered real consequences for every second of the outage, and just had to wait for things to work again.

At Fastly, the fact that we don’t control the entire global network is no excuse for failing to find ways to provide better and more resilient service for our customers. Here are some of our innovations for providing our customers, and their end users, with the fastest, most reliable experience available. These advances in edge network operations are only possible due to our modern network architecture and the fact that we have a truly software-defined network. It allows us to apply logic at any layer and programmatically scale and adjust networking flows as desired to circumvent internet problems and ensure uptime and reliability.

Keep an eye out for these common themes: 1) our systems are automated, self-healing and can respond immediately without waiting for human intervention, and 2) they are provisioned across our entire fleet of POPs.

Removing dependencies for resolving “internet weather”

We love problem solving, so the worst thing about internet weather is that it’s not within our control to fix the actual source of the problem! It’s something happening out there on a part of the global internet infrastructure that someone else owns or manages, and whatever event is occurring is out of our control. But certain things ARE in our control, and we’ve developed ways to improve our service and route around bad weather. The first is a technology we call “fast path failover.”

Fast path failover automatically detects and re-routes underperforming edge connections at the transport layer, allowing us to mitigate the impact of internet weather issues that are occurring outside of our own POPs. A lot of internet weather isn’t a full break. Often the link in the network is still technically connected but heavily degraded, with high latency or other issues, and this causes problems. The standard approach to remediation uses BGP to route traffic away from broken internet paths, and it does an ok job for complete breakages, but it’s a terrible solution for degraded connections.

When a link along the path becomes unavailable, BGP can withdraw the routes involving that link and signal alternative paths, if available. This triggers ISPs to reroute traffic and bypass the issue. But in situations where a path is heavily degraded, but not entirely failed, a BGP route withdrawal might not be triggered at all. Sometimes the service provider has to detect and manually reroute traffic to mitigate the issue, and this process can take anywhere from several minutes to a few hours depending on the provider.

Fast path failover doesn’t wait for BGP to fix things for Fastly customers – if something is failing, we want it to fail fast and reroute fast and start working again – fast! Fast path failover automatically detects and re-routes underperforming edge connections without waiting for a network-wide resolution issued via BGP. We don’t need to wait for peers or transit providers to tell us that a path is broken, we can see it for ourselves. And we don’t need to wait for their routing solution to try a different route from our end. Our edge cloud servers can determine if connections are making forward progress, infer the health of the internet path, and select an alternate path quickly whenever necessary.
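The per-connection logic described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Fastly's transport-layer code: the `Connection` class, the path labels, and the rule "switch paths after `max_stalled_checks` checks with no newly acknowledged bytes" are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    paths: list              # candidate paths in preference order (hypothetical labels)
    path_index: int = 0      # index of the path currently in use
    last_acked: int = 0      # highest acknowledged byte count seen so far
    stalled_checks: int = 0  # consecutive checks without forward progress

    @property
    def current_path(self):
        return self.paths[self.path_index]


def check_progress(conn: Connection, acked_bytes: int, max_stalled_checks: int = 2) -> bool:
    """Return True if the connection was moved to a different path."""
    if acked_bytes > conn.last_acked:
        # Forward progress: the current path looks healthy for this connection.
        conn.last_acked = acked_bytes
        conn.stalled_checks = 0
        return False
    conn.stalled_checks += 1
    if conn.stalled_checks < max_stalled_checks or len(conn.paths) < 2:
        return False
    # No progress for too long: try the next candidate path for this connection only.
    conn.path_index = (conn.path_index + 1) % len(conn.paths)
    conn.stalled_checks = 0
    return True
```

Because the state lives on each connection, a stalled flow can move without disturbing healthy flows that share the same original path.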

In another win for distributed architecture we get even faster routing because we don’t rely on centralized hardware routers that bottleneck routing decisions for an entire POP. Our edge servers can act faster in a distributed fashion to make routing decisions on a per-connection basis. This enables precise failover decisions that only reroute degraded connections without impacting healthy ones. This mechanism is remarkably effective at identifying and mitigating internet weather conditions that are typically too small and too short to be accurately detected and mitigated using standard network monitoring techniques. Read more about fast path failover.

To go even further, in cases where the internet weather is sufficiently central in the network topology, there may be no alternate path that exists to move the traffic away from the failed route. Other providers can get stuck behind these issues with no viable alternatives, but we simply don’t take “no” for an answer. Fastly has massive transit and peering diversity that significantly reduces the risk of getting caught behind network bottlenecks when trying to reach our end users.

Smart, automated traffic routing

While fast path failover improves connectivity for content requests from Fastly’s edge cloud platform moving across parts of the internet we don’t control, we have also added Precision Path and Autopilot to improve performance across parts of the network that Fastly can control.

Precision Path is used to improve performance across internet paths between customers’ origin servers and our network, and Autopilot is our automated egress traffic engineering solution. They do amazing things when used in combination, and they let us react immediately without needing to wait for a human to analyze the situation and determine a plan. This is critical because reacting faster prevents issues from cascading and affecting more of the network.

Precision Path

Precision Path continuously monitors all origin connections in every Fastly POP worldwide. When it detects an underperforming origin connection (due to internet weather, for example), it automatically identifies all possible alternative paths to that impacted origin and re-routes the connection to the best alternative in real-time. We can often re-establish a healthy origin connection before 5xx errors get served to end users, effectively fixing network issues so fast that it’s like they never existed. Our real-time log streaming feature can also be used to monitor for origin connection rerouting events that may occur on Fastly services.
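As a rough illustration of "detect an underperforming origin connection, then pick the best alternative," consider the Python sketch below. The metrics, thresholds, scoring formula, and path names are assumptions chosen for the example and are not taken from Precision Path itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathStats:
    name: str         # hypothetical path label
    rtt_ms: float     # measured round-trip time
    loss_rate: float  # fraction of packets lost, 0.0 to 1.0


def score(path: PathStats) -> float:
    """Lower is better. Loss is weighted heavily (assumed weighting)."""
    return path.rtt_ms * (1 + 10 * path.loss_rate)


def select_origin_path(current, alternatives, max_loss=0.02, max_rtt_ms=300.0):
    """Keep the current path unless it is degraded and a better one exists."""
    degraded = current.loss_rate > max_loss or current.rtt_ms > max_rtt_ms
    if not degraded or not alternatives:
        return current
    best = min(alternatives, key=score)
    return best if score(best) < score(current) else current
```

The function leaves healthy connections alone, which matters because moving a connection has a cost of its own.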

Precision Path also focuses on reliably delivering content to end users from our edge cloud platform. When delivering this content, we track the health of every TCP connection. If we observe connection-impacting degradation (e.g., congestion), we use fast path failover to automatically switch delivery to a new network path and route around the issue. This automatic mitigation is enabled by default on all of our POPs and applies to all Fastly traffic. No additional configuration is required.

Autopilot

Autopilot is what enabled us to deliver a record 81.9 Tbps of traffic during the last Super Bowl with zero manual intervention over the course of this high-traffic, high-stakes event. Since February 2023 we’ve had many other days where traffic has exceeded Super Bowl levels to set new records, so any day has the potential to be as big as a Super Bowl day. This ability to scale is not just useful once per year. It’s in use every day, all year round, optimizing our traffic and maximizing Fastly’s efficiency.

Similar to fast path failover, Autopilot was built to address shortcomings in BGP. BGP has a “capacity awareness gap” – it can only be used to communicate whether an internet destination can be reached or not. It cannot tell whether there is enough capacity to deliver the desired amount of traffic or what the throughput or latency would be for that delivery. It’s like if a courier said they could deliver a package and took it from you, only to find out later that it didn’t fit into their car.

Autopilot addresses this issue by continuously estimating the residual capacity of our links and the performance of network paths. This information is collected every minute via network measurements and used to optimize traffic allocation so that we can prevent links from becoming congested. Precision Path is lightning fast, but it’s mostly about moving away from bad connections – it doesn’t “know” a lot about the new connection when it makes those decisions. Autopilot has a slightly slower reaction time than Precision Path, but it makes a more informed decision based on several minutes of high resolution network telemetry data. Rather than just moving traffic away from a failed path (like Precision Path), it moves larger amounts of traffic toward better parts of a network.
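The capacity-aware idea can be illustrated with a small Python sketch: estimate how much room each link has below a target utilization, then shift excess load from hot links toward the links with the most residual capacity. The link names, the 80% target, and the greedy allocation are assumptions for illustration and do not describe Autopilot's actual algorithm.

```python
def rebalance(links, target_utilization=0.8):
    """links maps a link name to (capacity_gbps, load_gbps).

    Returns a dict of link name -> new load, moving traffic off links that
    exceed the target utilization onto links with residual capacity.
    """
    new_load = {name: load for name, (_, load) in links.items()}
    limit = {name: cap * target_utilization for name, (cap, _) in links.items()}

    for name in links:
        excess = new_load[name] - limit[name]
        if excess <= 0:
            continue
        # Offer the excess to the links with the most residual capacity first.
        by_headroom = sorted(links, key=lambda n: limit[n] - new_load[n], reverse=True)
        for other in by_headroom:
            if other == name or excess <= 0:
                continue
            room = limit[other] - new_load[other]
            if room <= 0:
                continue
            moved = min(excess, room)
            new_load[other] += moved
            new_load[name] -= moved
            excess -= moved
    return new_load
```

For example, with three 100 Gbps links carrying 95, 40, and 70 Gbps, the sketch moves 15 Gbps from the first link to the second, so no link ends up above the 80% target and the total load is unchanged.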

Working together, Precision Path and Autopilot make it possible to rapidly reroute struggling flows onto working paths and periodically adjust our overall routing configuration with enough data to make safe decisions. These systems operate 24/7, but looking at the most recent Super Bowl we can see one example of the efficiency they provide. They rerouted 300 Gbps and 9 Tbps of traffic (respectively), which would have otherwise been delivered over faulty, congested or underperforming paths, and clogged up more of Fastly’s network capacity. These self-managing capabilities enable us to react faster and with higher frequency to potential failures, congestion and performance degradation issues on our network.

Lastly, while Autopilot brings many benefits to Fastly, it is even better for our customers who can now be even more confident in our ability to manage events like network provider failures or DDoS attacks and unexpected traffic spikes – all while maintaining a seamless and unimpacted experience for their end users. Read more about Autopilot and Precision Path.

Automated protection against massive DDoS attacks

Not all network issues are unintentional. A critical part of network resilience is being able to withstand massive Distributed Denial of Service (DDoS) events like the recent Rapid Reset attack. This attack continues to create problems around the internet, but Fastly wasn’t vulnerable because we had an automated system in place that was able to begin identifying and mitigating it immediately using a technique we call “Attribute Unmasking.”

Attribute Unmasking

DDoS attacks have gotten more powerful over time, as well as increasingly fast to scale. They often scale from zero requests per second (RPS) to millions or hundreds of millions of RPS within just a few seconds, and then they might end just as quickly – sometimes terminating in less than a minute. DDoS attacks are also becoming more sophisticated, like the recent Rapid Reset attack which relied on a characteristic of the HTTP/2 protocol that had not been previously exploited.

For most of the large platforms affected by Rapid Reset this was a novel attack that wreaked havoc. Attribute Unmasking allowed us to rapidly and automatically extract accurate fingerprints out of the network traffic when we were being hit with Rapid Reset, and it works the same way for other complicated attacks. Every request coming through a network has a huge number of characteristics that can be used to describe the traffic, including Layer 3 and Layer 4 headers, TLS info, Layer 7 details, and more. Our system ingests the metadata from inbound requests on our network and uses it to tell the malicious traffic apart from the good traffic. This allows us to block attack traffic while letting legitimate traffic through.
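One simple way to picture fingerprint extraction is to compare how often each attribute value appears during an attack window against a baseline window, and keep the values that suddenly dominate. The Python sketch below does exactly that. It is a toy model under stated assumptions: the attribute names (`tls`, `ua`, `path`), the thresholds, and the share-ratio heuristic are hypothetical and are not a description of how Attribute Unmasking works internally.

```python
from collections import Counter


def attribute_shares(requests):
    """Fraction of requests carrying each (attribute, value) pair."""
    if not requests:
        return {}
    counts = Counter((k, v) for r in requests for k, v in r.items())
    return {pair: n / len(requests) for pair, n in counts.items()}


def extract_fingerprint(baseline, current, min_share=0.5, min_ratio=3.0):
    """Return the set of (attribute, value) pairs that dominate the current
    window and are far more common than they were in the baseline."""
    base = attribute_shares(baseline)
    cur = attribute_shares(current)
    return {
        pair
        for pair, share in cur.items()
        if share >= min_share and share >= min_ratio * base.get(pair, 0.0)
    }


def matches(request, fingerprint):
    """A request matches only if it carries every pair in the fingerprint."""
    return bool(fingerprint) and all(request.get(k) == v for k, v in fingerprint)
```

Requiring a request to match every pair in the fingerprint keeps the block narrow: a legitimate client that shares one attribute with the attack traffic, such as a common TLS configuration, is still let through.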

For faster response times, DDoS protection is handled at the edge of our network, with detection and defense capabilities built into our kernel and network application layer processing stack. This is another instance of a distributed solution (just like with fast path failover) that is only possible because our network is completely software defined, allowing us to run functions in a more distributed fashion across our servers in parallel. Our system is also modular, so we can rapidly enhance our detection and mitigation capabilities as new classes of attacks are discovered, without needing to develop an entirely new mechanism to respond. When an attack like Rapid Reset comes along, we simply add a few new functions to our detection and response modules, keeping our response times incredibly short, even for novel attacks. Read more about Attribute Unmasking and the Rapid Reset attack.

Resilience is a process

There’s a lot of detail in this post, but the main takeaway is that at Fastly:

We prioritize efforts to think about what could go wrong BEFORE it goes wrong
We allocate significant resources to improve our architecture
We continually identify and eliminate single point of failure risks
We find innovative ways to prepare for, and solve, problems that occur outside of our control
We consider this work to be continuous. We are always working to be prepared for tomorrow, and we are always asking ourselves what else could be done.

If you want to learn more about how we work, and the performance and security benefits that Fastly customers receive as a result of the efficiencies that come with these same innovations, try us out with our free tier or get in touch today.