Fastly’s Edge Network: Building for Availability

Brian Haberman

Distinguished Engineer

Fastly's resilient architecture helps prevent outages and mitigate severity

On June 12, 2025, a widespread outage rippled across the internet, disrupting access to major platforms and services. Infrastructure dependencies on Google Cloud Services (GCS) led to service degradation for apps millions rely on daily, from Spotify and Discord to YouTube and GitLab. The outage also affected other cloud providers (e.g., Cloudflare). At a high level, Google deployed new code in May 2025 that introduced new quota policy checks, but that code remained dormant during the initial rollout. An event on June 12 activated the dormant code, which then attempted to reference data via an uninitialized pointer. This set off a crash loop in which services failed and restarted repeatedly. Google has posted a post-mortem for those interested in more details.

The impact of this incident highlights what has become the normal approach to providing a global service: an extremely complex web of multi-cloud and multi-platform dependencies. Services gain resiliency from being able to perform regional failovers in such configurations, but that same web of interconnectedness is what allowed this outage to affect such a wide variety of services. Cloud-based backend services are viewed as a more reliable and scalable way of hosting global services. This incident is a reminder that catastrophic cloud failures will happen, and that when they do, they will impact many services at once. It also showed that while regional failover improves resiliency, it is not the answer to every failure mode. The architecture of a global service needs to consider the full range of failure modes and incorporate capabilities that allow the service to fail open and continue operating in a safe state.

Designing for Resilience

An incident like this underscores the importance of a resilient architecture that can withstand unforeseen challenges. Resilience isn’t something achieved with minor tweaks to an infrastructure after an incident; it’s the result of an architectural foundation focused on ensuring service availability, and it is manifested in how a platform is built and operated. Fastly’s architecture starts from a simple premise: the internet is inherently unpredictable. Failures happen, and they range from very localized incidents (micro-outages) to global events. What matters is how you handle them. We will discuss our resiliency strategy more fully at a later date, but a few of its aspects took center stage on June 12:

  • Distributed Decision Making: Fastly’s edge network is decentralized and operates independently from centralized controllers. Each node is capable of serving requests autonomously, making the overall system more fault-tolerant by default. Real-time decision making based on self-sustained operational state at each node allows for rapid adaptation to changing conditions without relying on centralized backends. This flexibility is a cornerstone of operational resilience.

  • Infrastructure Diversity: We build and manage our own critical systems, including TLS termination, DNS resolution, and configuration deployment, which means fewer external dependencies that could become single points of failure. Multiple instances of these systems are operational to provide fast failover in case they are impacted by an incident. Some of those systems rely on external services to ensure continuity of operation.

  • Graceful Degradation: While we strive to develop capabilities that are available 100% of the time, we recognize incidents happen. What we try to avoid is a failure in one component creating a cascading failure across multiple services. Rather, we focus on developing approaches that allow components to operate in a degraded state, which in turn allows other components to continue to function; a simplified sketch of this pattern follows this list.
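
To make those last two ideas concrete, here is a minimal sketch, in Go, of an edge node that makes serving decisions purely from local state and fails open when its backend is unreachable. The EdgeNode type, its fetch hook, and the cache layout are hypothetical illustrations for this post, not Fastly’s implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// cacheEntry is a hypothetical cached object plus the time it was stored.
type cacheEntry struct {
	body     string
	storedAt time.Time
}

// EdgeNode sketches a node that makes serving decisions from local state only.
type EdgeNode struct {
	mu    sync.RWMutex
	cache map[string]cacheEntry
	ttl   time.Duration
	// fetch stands in for a call to an origin or backend service.
	fetch func(key string) (string, error)
}

// Serve prefers fresh content, but fails open: if the backend is unavailable,
// it returns stale cached content rather than an error.
func (n *EdgeNode) Serve(key string) (body string, stale bool, err error) {
	n.mu.RLock()
	entry, cached := n.cache[key]
	n.mu.RUnlock()

	// Fresh enough: serve straight from local state, no backend needed.
	if cached && time.Since(entry.storedAt) < n.ttl {
		return entry.body, false, nil
	}

	fresh, fetchErr := n.fetch(key)
	if fetchErr == nil {
		n.mu.Lock()
		n.cache[key] = cacheEntry{body: fresh, storedAt: time.Now()}
		n.mu.Unlock()
		return fresh, false, nil
	}

	// Degraded mode: the backend failed, so serve stale content if we have it.
	if cached {
		return entry.body, true, nil
	}
	return "", false, errors.New("backend unavailable and nothing cached")
}

func main() {
	down := errors.New("origin unreachable")
	node := &EdgeNode{
		cache: map[string]cacheEntry{
			"/index.html": {body: "<html>cached copy</html>", storedAt: time.Now().Add(-10 * time.Minute)},
		},
		ttl:   time.Minute,
		fetch: func(string) (string, error) { return "", down },
	}

	body, stale, err := node.Serve("/index.html")
	fmt.Println(body, "stale:", stale, "err:", err) // serves the stale copy instead of failing
}
```

The ordering of checks is the point: local state first, then the backend, and stale content before an error. That ordering is what lets a node keep serving when a centralized dependency disappears.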

Being Transparent: What Was Affected

While our core content delivery functionality continued operating as expected, some Fastly systems experienced degraded performance during the June 12 incident, largely due to downstream effects of the GCS outage.

We believe in being transparent about what happened. Fastly was not immune to the GCS outage. Some of the more noticeable impacts of this outage included degraded performance in:

  • KV Store: Our KV store uses GCS for durable storage, which means we were unable to perform write actions during the outage. Fortunately, customers were generally still able to read from KV, allowing them to continue operating on cached content. Again, allowing a system to keep operating in a degraded state let customers continue to function; a simplified sketch of this read-only fallback follows this list.

  • Management interfaces: In some instances, Fastly customers experienced intermittent degradation in their ability to manage their services. For example, customers encountered issues with our API for managing TLS private keys and certificates, which meant they were unable to upload new material during the event. However, because critical API resources are cached, many users were able to continue accessing management functionality, albeit with increased latency or limited features. While any degradation in service can be problematic, we prefer limited availability over a full outage.

  • Internal alerting: Some of our monitoring and alerting systems use Google’s BigQuery for storage. During the outage, Fastly could not retrieve certain metrics from BigQuery, which created some operational challenges. However, Fastly employs multiple monitoring pipelines, so operational staff were able to derive system state from other metrics available through these secondary systems; a sketch of this fallback also follows this list.

  • Support systems: Some of our support systems, such as online documentation and observability, were impacted by the outage. Those systems operate independently from our caches, but do have some dependencies on GCS for off-premises storage. Our resilience strategy drives modularity and independence in our systems to reduce the effects of incidents.
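
As an illustration of how a read path can stay useful while the write path is down, here is a minimal sketch in Go. The KVClient type and its durablePut hook are hypothetical stand-ins, not Fastly’s KV Store implementation; the point is simply that a failed durable write degrades to an error the caller can handle, while reads keep serving cached values.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrDurableStoreUnavailable stands in for a failure of the durable backend
// (for example, an object store outage). These names are illustrative only.
var ErrDurableStoreUnavailable = errors.New("durable store unavailable")

type KVClient struct {
	mu         sync.RWMutex
	local      map[string]string              // locally cached values, still readable
	durablePut func(key, value string) error  // write path to durable storage
}

// Get serves from the local cache, which keeps working even when the
// durable backend is down.
func (c *KVClient) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.local[key]
	return v, ok
}

// Put attempts a durable write. If the backend is unavailable, the write is
// rejected, but the client stays usable for reads: a degraded state rather
// than a full outage.
func (c *KVClient) Put(key, value string) error {
	if err := c.durablePut(key, value); err != nil {
		return fmt.Errorf("write rejected, store is read-only right now: %w", err)
	}
	c.mu.Lock()
	c.local[key] = value
	c.mu.Unlock()
	return nil
}

func main() {
	kv := &KVClient{
		local:      map[string]string{"feature-flags": `{"beta": true}`},
		durablePut: func(string, string) error { return ErrDurableStoreUnavailable },
	}

	if v, ok := kv.Get("feature-flags"); ok {
		fmt.Println("read still works:", v)
	}
	if err := kv.Put("feature-flags", `{"beta": false}`); err != nil {
		fmt.Println("write degraded:", err) // caller can retry later or keep using cached data
	}
}
```

The important behavior is that a durable-storage failure surfaces as a rejected write rather than a broken client, so downstream code can keep reading cached data.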
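
Similarly, the monitoring fallback described above can be sketched as a priority-ordered query across independent pipelines. The MetricSource interface and staticSource type below are hypothetical illustrations, not Fastly’s monitoring stack.

```go
package main

import (
	"errors"
	"fmt"
)

// MetricSource is a hypothetical interface over a metrics pipeline.
type MetricSource interface {
	Name() string
	Query(metric string) (float64, error)
}

// queryWithFallback walks the configured pipelines in priority order and
// returns the first answer it can get, so a single storage outage does not
// blind the operators.
func queryWithFallback(sources []MetricSource, metric string) (float64, string, error) {
	var lastErr error
	for _, s := range sources {
		v, err := s.Query(metric)
		if err == nil {
			return v, s.Name(), nil
		}
		lastErr = err
	}
	return 0, "", fmt.Errorf("all metric pipelines failed: %w", lastErr)
}

// staticSource is a toy implementation used to demonstrate the fallback path.
type staticSource struct {
	name  string
	value float64
	err   error
}

func (s staticSource) Name() string { return s.name }
func (s staticSource) Query(string) (float64, error) {
	if s.err != nil {
		return 0, s.err
	}
	return s.value, nil
}

func main() {
	sources := []MetricSource{
		staticSource{name: "primary (warehouse-backed)", err: errors.New("storage backend unavailable")},
		staticSource{name: "secondary (edge-local)", value: 0.997},
	}

	v, from, err := queryWithFallback(sources, "cache_hit_ratio")
	fmt.Println(v, from, err) // 0.997 secondary (edge-local) <nil>
}
```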

These effects were short-lived and did not interfere with customer traffic delivery. That’s a key validation of our design philosophy: in moments of uncertainty, the edge must continue serving.

Building for Availability

The June 12 event is a clear reminder that resilience is not theoretical. It’s tested in real-world moments – often unexpectedly. We built Fastly to help customers thrive not just in ideal conditions, but in the unpredictable, high-stakes scenarios that define the modern internet. Incidents can happen to anyone at any time. Fastly is obsessed with ensuring that our services are highly available, so that our customers’ services are highly available. We take every incident, whether it occurs in one of our systems or in someone else’s, as an opportunity to learn and improve our posture.

While we may have shown resiliency on June 12, we were by no means immune, and we take that seriously. These moments challenge us to reflect, adapt, and improve. We’re actively applying what we’ve learned from this incident to identify areas of potential risk, strengthen our systems, and improve failover strategies.

Fastly remains committed to providing a robust, reliable, and available edge cloud platform, empowering businesses to deliver seamless digital experiences to their users, especially when the unexpected happens.