The Sleep Test: How Embracing Chaos Unlocks API Resilience

Lorraine Bellon

Senior Product Marketing Manager, Security

Anna Jensen

Technical Product Manager

December 11, 2025

Security, Product

TL;DR:

Modern systems are so complex that no single human (or team) can realistically see how all the pieces interact.
That complexity creates “unknown unknowns,” brittle dependencies, and endless firefighting.
Engineers need visibility, and Fastly’s API Discovery delivers exactly that: automatic, continuous mapping of your API traffic and changes.
New capabilities like Tree View and Inventory give teams richer context, better auditability, and real-time understanding.
API Discovery helps teams shift from reactive panic to proactive planning, creating space for strategic work (and maybe even lunch!).

What’s it like to be a platform engineer? In a word: distracting.

Constant phone alarms. Endless Slack messages. Lots and lots of coffee and energy drinks. A few moments of relative peace and quiet, usually spent in planning sessions and post-mortems, but always with one ear towards the alarm bells.

Picture this: you have a nice, meeting-free day planned to catch up on necessary reading: new frameworks, languages, integrations, and internal procedures… until you get a Slack message from one of your devs asking why a production service is broken. You immediately reprioritize your day to focus on this.

This bug proves to be quite troublesome to fix. You find yourself getting hungry and realize you forgot to eat breakfast. Eventually, you find a good stopping point and decide to take lunch instead. Some days lunch might be 15 minutes, and some days it might be an hour. Today, you allow yourself 30 minutes. You bring your work phone with you while you eat, just in case something urgent comes up. You’ve learned the hard way to always have it around, even when you’re not explicitly on-call.

Finally, after a bit of sustenance, you’re back to working through that bug. You wish you had thought to write more tests to cover more possible cases that your automated provisioning workflow might encounter, but you also know how hard it is to find time to do all those little things when every day involves so much context switching. You hope that in the process of excavating this bug, you don’t find a bigger one.

Just as you’re achieving flow in your troubleshooting, another Slack message interrupts you. A customer is bombarding your support team nonstop with messages because a different service isn’t working how they expect. Any stakeholders involved with the customer account are panicking. It’s a P0 – drop everything and fix it because otherwise the sky might fall. You jump into the Zoom room spun up for this purpose. So much for getting that reading done today.

Figure 1: A day in the life of a platform engineer (thank you, KC Green)

Several false starts and a few choice expletives later, your team resolves the issue. But now you have to document what went wrong and make recommendations for how it can be avoided in the future. You spend a few hours doing that, and now you can finally go back to your other work. You realize you haven’t seen the sun since yesterday because you arrived at work before it came up. So you decide to walk to the local coffee shop and order a latte. You come back from your coffee journey to discover four new random requests for you in Slack.

After briefly considering a new career, you start drafting your to-do list for tomorrow, knowing you’re likely to finish none of it. You look at tomorrow’s calendar and see a 90-minute planning meeting, where you’ve been invited to help “influence” the architecture of a new service. This usually means pointing out where the engineering team’s proposed design could very obviously fall over when running in production. You quickly jot down a few ideas to bring up at the meeting, to make sure they’re thinking about the scale and reliability side of things, like speed, latency, availability, and so on. This part of the work excites you and reminds you why you got into this stuff in the first place: you’re fulfilled by figuring out ways to make systems work better.

You reflect on how much you actually enjoy your work, and start looking forward to going on a run this evening. Then, you get pinged asking if you could “jump onto a Zoom real quick.”

…

What’s at the root of all of this chaos, and what could possibly make it better?

Modern software systems have reached an unprecedented level of complexity, which often makes a complete understanding by any single individual or even a small team nearly impossible. Codebases are massive and can span millions of lines, coupled with a web of dependencies between countless interconnected components, APIs, and external services. These components are, of course, not static; they constantly and rapidly evolve, leading to more and more technical debt over time. These challenges are exacerbated by the abstract and dynamic nature of software in general, where issues arise from emergent properties and "unknown unknowns," or the unexpected interactions inherent to a complex system that are difficult to predict or trace. Consequently, platform teams must often operate with only a partial mental model, focusing on narrow slices of the system, while accepting the inherent fragility and obscurity of the overall structure.

Figure 2: A typical production system

Imagine for a moment that you have magic powers to see and understand what’s happening everywhere. You’d have the full picture of what’s happening in production and what could potentially put your systems at risk. You’d be able to understand what exists and gain confidence that everything is working as expected. With those magic powers, you would be able to uncover and understand the mysteries plaguing your platforms, and break down the barriers that keep you glued to your desk and wreak havoc on sleep schedules and workout routines. Instead of constantly fighting fires, you would get to do the exciting part of your job: building truly resilient systems that scale and power amazing things.

The good news is that magic is real! 🪄

On September 30, we launched API Discovery, which gives you one-click access to discover, monitor, and secure your APIs. It continuously monitors your API traffic across Fastly’s Edge network to maintain an up-to-date snapshot of your APIs, keeping you aware of any new, updated, or unexpected API requests reaching your origin. Since launch, we’ve kept adding capabilities to API Discovery, including Tree View for better real-time contextual viewing. Now, we’re adding Inventory, which helps you create audit logs and additional evergreen context for your teams. Looking ahead, we’re delivering more capabilities to help teams define API standards and monitor and enforce API behaviors for specific and targeted mitigation actions. And this is only the beginning.
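To make the idea of a traffic-derived API snapshot concrete, here is a minimal sketch in Python. It is not how API Discovery is implemented; it only illustrates the general technique of grouping observed requests into endpoint templates and flagging anything outside a known set. The function names, the ID-matching heuristic, and the sample requests are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical heuristic: treat purely numeric or UUID-like path segments as parameters.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalize_path(path):
    """Collapse ID-like segments so /users/42 and /users/7 map to one endpoint."""
    path = path.split("?", 1)[0]  # drop the query string
    segments = [s for s in path.split("/") if s]
    return "/" + "/".join("{id}" if _ID_SEGMENT.match(s) else s for s in segments)


def discover(requests, known_endpoints):
    """Count observed (method, path template) pairs and list those not in the known set."""
    observed = Counter(
        (method.upper(), normalize_path(path)) for method, path in requests
    )
    unexpected = sorted(ep for ep in observed if ep not in known_endpoints)
    return observed, unexpected


# Made-up sample traffic and a made-up list of documented endpoints.
sample_requests = [
    ("GET", "/users/42"),
    ("GET", "/users/7?verbose=1"),
    ("post", "/orders"),
    ("DELETE", "/internal/debug"),
]
known = {("GET", "/users/{id}"), ("POST", "/orders")}

observed, unexpected = discover(sample_requests, known)
print(unexpected)
```

In this sketch, the two `/users/...` requests collapse into a single `GET /users/{id}` entry, and `DELETE /internal/debug` is reported as unexpected because nobody listed it. A real system would need far more than this (schemas, authentication context, change tracking over time), but the shape of the problem is the same: you cannot plan around endpoints you do not know exist.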

What’s driving this development direction? In two words: embracing chaos. The way we’re designing and building our API security centers on fostering “API resilience” in the face of an undeniable reality: too many APIs to secure, not enough hours in the day, and not enough information to make meaningful decisions that take you out of firefighting mode and into strategic vision. This means establishing a set of tools and datasets that help you shift your mindset from reactive whac-a-mole to thoughtful, actionable decision making, with scalable planning and design that prevents problems and keeps critical business systems up and running. It goes beyond simply fixing things when they break, and toward understanding WHY things break at scale and what happens when they do.

A truly resilient system is designed with the understanding that failure is inevitable, and in fact, should be welcomed as an opportunity to pressure test your system and help promote faster healing in the future. The golden business metrics might be minimizing downtime and data loss via higher availability and greater operational confidence. But the outcome that really matters is creating a system that doesn’t rely on the tears of its operators to stay up and running.

It might sound counterintuitive, but chaos is never going away. It’s time that we and our systems stop fighting it and start embracing and learning from it. What’s the first step? Taking a look at what’s happening right here and right now.

Ready to give API Discovery a try? It’s easy to turn on with just one click. Get instant visibility, cut the noise, and keep your APIs secure – without the hassle. See it in action with a personalized demo or chat with our team of security experts to see what Fastly can do for you.
