Revenir au blog

Follow and Subscribe

Disponible uniquement en anglais

Cette page n'est actuellement disponible qu'en anglais. Nous nous excusons pour la gêne occasionnée, merci de revenir sur cette page ultérieurement.

An Indispensable Pillar of Resilience - The Human

Brian Haberman

Distinguished Engineer

This article is part five of Fastly’s "Pillars of Resilience" series, exploring how we design, build, and operate our global network for maximum availability and performance. Read the full series:

In our introduction to this series on resilience, we explored our foundational layer of observability that enables automated responses, intelligent traffic routing, and proactive system-level defenses. We talked about designing for failure, but a system is never truly resilient until it accounts for the unexpected. Handling the unknown unknowns. This is where the most critical component of our resiliency comes into play: the human operator.

While our automation is robust and our self-healing capabilities are sophisticated, the ultimate safety net that prevents a local failure from becoming a global crisis is our team of skilled, thoughtful, and well-trained engineers. The human factor isn't a vulnerability to be eliminated or a point of failure to be optimized away; it's a cognitive advantage that must be leveraged.

Adaptive Problem Solving vs Novel Failure

Automation is inherently limited to handling the failures it was explicitly designed to mitigate. It operates on a predefined set of rules, triggers, and expected outcomes. Even in AI-based tools, there is a limitation driven by the data being used to develop the automation. But in a complex, distributed environment like Fastly's global network, failures don’t always arrive in an expected form.

When a cascade of events combines known failure modes in a novel, unexpected sequence, automation can be overwhelmed or, worse, make an incorrect decision based on incomplete information. Or a new failure mode emerges due to a novel threat vector, new functionality, or a unique traffic pattern, leading automated tools to make no decisions. A human, possessing the ability to recognize anomalous patterns and apply abstract reasoning, can quickly step outside the automated playbook and attack the problem.

Resilience is about making the right trade-offs under pressure. Do you prioritize maximum availability, even if it means temporary service degradation? Or do you sacrifice a small segment of traffic to protect the core service? This requires situational judgment, a uniquely human capacity to weigh competing risks and rewards that no algorithm can yet fully replicate.

The Holistic System View

Modern Content Delivery Network (CDN) architectures are defined by complexity and interdependence. While monitoring tools provide terabytes of data, they may often present the system as a collection of isolated metrics. An algorithm can see that CPU utilization is spiking in a single Point of Presence (POP), but it takes a human to understand the systemic context.

Our operators and engineers don't just see a series of dashboards. They see the system as a living organism. They can connect seemingly disparate events, such as a sudden drop in error rates on one service with an unexplained increase in latency on another, and form a complete hypothesis. They understand the organizational and architectural history that led to the current design, allowing them to pinpoint the root cause based on accumulated and shared knowledge.

Abstract reasoning allows the human to see beyond the dashboards

Effective resilience places the human at the center of the control plane, not as a replacement for automation, but as a smart override. Automation handles the vast majority of events efficiently, reserving the most critical, high-stakes decisions for the humans who can bring critical thinking to bear.

The Engine of Improvement: Learning and Feedback

If an incident doesn't lead to system improvement, it's a double failure. The ability to learn from adversity is arguably the most resilient trait of any organization, and that learning process is fundamentally human.

Fastly, like all SRE-driven organizations, relies on a blameless post-mortem culture. This process requires honest self-reflection, meticulous analysis, and a commitment to systemic change. Activities that cannot be automated. Engineers are the ones who distill operational chaos into actionable items, driving the creation of new tooling, better runbooks, and, critically, improved automation. Doing so in a blameless manner is critical to addressing the issue as effectively and efficiently as possible while also building stronger relationships across development teams.

Resilience is about detecting and recovering from a disruption; antifragility is about getting better because of it. The human operator is the mechanism by which Fastly's system achieves antifragility. Every incident handled by a human becomes a lesson embedded in the culture, software, and architecture, making the entire system stronger for the next inevitable challenge.

Empowering the Human

The engineers and operators are the system's greatest cognitive asset. So our goal is to equip them with the tools necessary to amplify their judgment, not replace it. Tooling in a resilient architecture serves to reduce cognitive load, provide clarity during chaos, and accelerate decision-making. To that end, we are constantly looking for ways to leverage emerging capabilities that will give our staff superpowers.

Dashboards are a key way to track the myriad of metrics collected at Fastly. And they are easy to create and share. Which means we have a lot of them. But, during an active incident, finding the right dashboard can be critical to stopping a minor disruption from becoming an outage. Rather than asking a human to find the dashboard, we have developed an AI assistant for operators. This allows them to simply ask for particular information, such as “show me the kernel versions that are deployed in the London POP” or “Retrieve the time series of the top 20 customers with the highest traffic volume globally, over the last 7 days”. The AI assistant finds or creates the right dashboard to satisfy the request.

Effective post-mortems require a shared understanding of the incident being reviewed. Given the variety of information generated and collected during incident response, it can be difficult for everyone to internalize all the raw data before the post-mortem. To facilitate more effective learning, we augmented our AI assistant to generate synopses of incidents to allow all interested parties to come up to speed on the issues quickly. The figure below shows an example of an incident summary that is far more accessible than gigabytes of documentation, audio/video recordings, and chat logs.

Leveraging an AI assistant to summarize the contributing factors of an incident

Empowering the Human Operator: The Future of System Resilience

Ultimately, the goal of robust automation in a system like Fastly's is not to eliminate humans, but to empower them. By taking over the predictable, repetitive, and simple failures, automation frees up our engineers to focus on the complex, novel, and high-impact problems that truly test the limits of resilience. A resilient system isn't one that never fails, but one that contains failures gracefully and learns from every single one. That process demands the ingenuity, judgment, and expertise of the human operator, the true key to engineering resilience at scale.