Summary of June 8 outage
We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change. We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95% of our network was operating as normal.
This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them.
On May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.
Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.
Here’s a timeline of the day’s activity (all times are in UTC):
09:47 Initial onset of global disruption
09:48 Global disruption identified by Fastly monitoring
09:58 Status post is published
10:27 Fastly Engineering identified the customer configuration
10:36 Impacted services began to recover
11:00 Majority of services recovered
12:35 Incident mitigated
12:44 Status post resolved
17:25 Bug fix deployment began
Once the immediate effects were mitigated, we turned our attention to fixing the bug and communicating with our customers. We created a permanent fix for the bug and began deploying it at 17:25.
Where do we go from here?
In the short term:
We’re deploying the bug fix across our network as quickly and safely as possible.
We are conducting a complete post mortem of the processes and practices we followed during this incident.
We’ll figure out why we didn’t detect the bug during our software quality assurance and testing processes.
We’ll evaluate ways to improve our remediation time.
We have been — and will continue to — innovate and invest in fundamental changes to the safety of our underlying platforms. Broadly, this means fully leveraging the isolation capabilities of WebAssembly and Compute@Edge to build greater resiliency from the ground up. We’ll continue to update our community as we make progress toward this goal.
Even though there were specific conditions that triggered this outage, we should have anticipated it. We provide mission critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologize to our customers and those who rely on them for the outage and sincerely thank the community for its support. Customers should always feel free to email firstname.lastname@example.org for more information.