Chaotic Good: Resilience Stress Tests at the Edge

Chaos experimentation is the buzzy software-flavored version of resilience stress testing: simulating adverse conditions to observe how a system behaves. Conducting chaos experiments means we can observe how components interact within the system when encountering stresses and surprises, and thereby learn how prepared we, and our software systems, are for failure. It’s a hot topic for a reason.

But the top question I get about chaos experimentation and resilience stress testing is, “how do we start?” Like, yes, all this sounds lovely and transformative and sunshine, rainbows, etc. but how do we get from today to tomorrowland?

In this post, we’ll walk through a starter experiment that reflects a reasonable place for most organizations to begin their resilience stress test journey: verifying basic security assumptions about a website, like requiring cookies and not allowing cross-origin requests.

The short answer to “how do we start?” is that we start small. Very small. Just enough to get the flywheel going, to start building muscle memory around conducting experiments so eventually you can become a chaos sorcerer performing dozens of experiments 24/7.

But, much like writing letters or blog posts, starting small can sometimes feel more difficult than beginning with a big bang. How does a bite-sized experiment look? Will it even be valuable? Will I have to get a bunch of teams involved? How do I make sure I don’t crash production and wake up to an avalanche of alerts and messages despite my chaotic good intentions?

This blog post answers these questions and introduces a copy-paste way to start small with chaos magic. If you are already familiar with chaos experiments and resilience stress tests, feel free to skip to the Chaos@Edge section.

What even is chaos?

Chaos experimentation is the practice of simulating adverse conditions to generate evidence of how a system behaves in response to adversity. In basically all other complex systems domains, this practice is called “resilience stress testing” (my preference), so use that terminology instead if you think it’ll help get buy-in within your org – especially for leaders who may be skittish about the idea of introducing “chaos.”

I like to think of resilience stress testing as the scientific method in action. Let’s say we have a hypothesis like: in the event of a superstorm hitting New York City, we are confident that streets across all five boroughs will not flood. We can either wait for the next superstorm to hit and find out whether our hypothesis is affirmed or denied, or we can conduct an experiment to generate evidence more proactively.

To conduct the experiment, we have a few options. The least ethical one is hiring a bunch of jumbo planes to dump as much water on NYC as possible to see how its urban water infrastructure handles it. If our assumption doesn’t hold, people could drown, panic, lose their homes or property, and experience other horrible outcomes. So, probably we don’t want to do that.

Another option is to recreate a physical model of NYC and dump water into that replica instead. This is likely to be quite expensive at the level of scale required to achieve high fidelity and possibly bears its own ethical quandaries if we try to replicate real humans. Also likely a no-go.

Thus, the most popular option across complex systems domains is to use software to virtually model the system and virtually model the adverse condition to generate experimental evidence. Using software also means you can run the simulation repeatedly, since the nature of complex systems is that they’re non-deterministic (i.e. the same sequence of events won’t always result in the same outcome).

While computer-based simulation is the better choice, it’s still expensive and achieving fidelity is still fraught, so there’s a ton of research going into this problem area. I read this research for fun and the tl;dr is that these other domains are lowkey a bit desperate for easier, more effective ways to conduct resilience stress tests. They dream of being able to understand a system’s resilience to adverse scenarios, whether the system is urban water infrastructure, the human brain, shrubland habitats, nuclear reactors, or financial networks – because this evidence will allow them to better prepare for inevitable failure.

In software, we have it so, so easy in comparison. Our world is already virtual. We don’t have to simulate things like horchata latte cups and half-eaten bagels blocking storm drains. All things considered, software is one of the simplest and most cost-effective domains to conduct resilience stress tests… yet we rarely do them.

Why? Again, in my experience, it’s often because organizations don’t know where to start. They believe it’s inherently complex and messy and costly. They see examples from Silicon Valley giants and think, “Well, we struggle to even implement smoke tests or integration tests, how could we possibly get there?” It’s a valid question.

The good news is that we don’t need to begin our experimental journey with superstorms or solar flares or especially complex scenarios. We don’t even need to maintain a separate experimentation environment (although you can absolutely conduct experiments in a staging or test environment if you wish), nor fiddle with infrastructure configuration if we don’t want to.

The next section will explore how we can get started on our chaos engineering journey in a simpler way.

Chaos@Edge

The best place to start with chaos experimentation is not only with small experiments but also with those that validate assumptions you are extremely confident are true. If you aren’t confident an assumption holds, then you should do the work to gain that confidence first before running an experiment.

When I thought about what kind of chaos experiment would be a worthy beginner one that would apply to most organizations, I targeted the most basic security assumptions a company might have about their site – especially their login pages. These include things like:

My site requires cookies or authentication headers
My site does not allow cross-origin requests

Most organizations are super confident that their login pages (or other relevant pages) require cookies and don’t allow cross-origin requests. But as changes accumulate, those assumptions may erode or break – so we can run an experiment to verify them on an ongoing basis.

I built this straightforward experiment using Fastly’s serverless edge platform (i.e. Fastly Compute), which meant I didn’t have to play sysadmin. When we have hot girl shit to do, like writing chaos logic, it makes sense to abstract away infra concerns like figuring out the right instance size, memory allowances, security groups, and other tedium. My language of choice was JavaScript, although you could certainly use Go or Rust for this, too.

Doing this on a high-performance serverless platform also means we can clone requests without disrupting user traffic and keep the experiment speedy (thanks to the fast cold boot times). A common concern I hear from organizations is that they don’t want to disrupt production or degrade their users’ experience while conducting chaos tests. Performing chaos at the edge reduces those hazards by design.

And reducing those hazards by design not only keeps the business and its end users happy, but also bypasses the barrier many cybersecurity teams face when implementing chaos experimentation: the lack of software delivery expertise. If your team is migrating from the “moat and castle” model and upskilling from the “configure the security box” status quo, using a safe-by-design compute platform means you don’t have to worry about making deployment mistakes; the platform handles that work for you (and even software-immersed security teams benefit from removing that toil, too).

Let’s walk through the code example to see how it works:

/// <reference types="@fastly/js-compute" />
addEventListener("fetch", event => event.respondWith(app(event)));

// define your backend here
// outside of fiddle, it might look like:
// const backend = "https://http-me.glitch.me/"
const backend = "origin_0";

Above, we add an event listener with the “fetch” type so we can intercept network requests and customize our responses. Then, we define the backend we want to use; this will be the site on which you want to conduct the experiment, like https://myorganization.com.

With that foundation, we can define our logic flow. For a given incoming request, we will clone it twice to perform our various experiments. This means we will make three separate requests and compare the responses:

Normal (unmodified) request
Modified request to validate missing Cookie header
Modified request to validate explicit CORS Origin header

We then compare the responses and log what we find. If we strip cookies from an incoming request and the site’s response is the same as for the unmodified request, then it appears a cookie isn’t required. And if we force a cross-site Origin header but the response is the same, then the site appears to allow cross-origin requests. Whether the evidence confirms or refutes our assumptions, we’ll better understand the reality of how our system operates.
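As a side sketch (not part of the Compute program itself), the comparison rule can be written as one small pure function. The name `probeVerdict` and its return strings are hypothetical, chosen only for illustration; the rule itself mirrors the comparison the experiment performs: a probe that changes the status code or the body suggests the stripped or forced header matters.

```javascript
// Hypothetical helper, not from the original program: turn one probe's
// outcome into a verdict about the assumption being tested.
//   baselineStatus: status code of the normal (unmodified) request
//   probeStatus:    status code of the modified request
//   bodiesMatch:    true if both response bodies are byte-identical
function probeVerdict(baselineStatus, probeStatus, bodiesMatch) {
  // Any difference in status or body means the site treated the
  // modified request differently, which is what we expect to see.
  const responsesDiffer = baselineStatus !== probeStatus || !bodiesMatch;
  return responsesDiffer ? "assumption holds" : "assumption violated";
}

// Example: the site answers 401 once the cookie is stripped.
console.log(probeVerdict(200, 401, false));
```

Keeping the rule in one place like this makes it easy to reuse for both the cookie probe and the origin probe.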

You can see how this logic unfolds in the rest of the code example:

async function app(event) {
  
  try {
    const req = event.request;
    const anonymousRequest = req.clone();
    const anonymousRequestDifferentOrigin = req.clone();

    // Fetch request from backend with any incoming cookies
    let cookieResponse = await fetch(req, {
      backend: backend,
    });

    // Usable even if there's no cookie
    let cookieBuffer = await cookieResponse.arrayBuffer();

    // Fetch again with the cookies removed
    if (anonymousRequest.headers.has("cookie")) {
      anonymousRequest.headers.delete("cookie");
      let backendResponse = await fetch(anonymousRequest, {
        backend: backend,
        cacheOverride: new CacheOverride("pass"),
      });

      // Compare responses and report which URLs require the cookie header 
      let responseBuffer = await backendResponse.arrayBuffer();
      if (cookieResponse.status !== backendResponse.status || !buffersAreEqual(cookieBuffer, responseBuffer)) {
        console.log("URL requires cookie: " + req.url);
      } else {
        console.log("URL appears not to require a cookie: " + req.url);
      }
    } else {
      console.log("Incoming request did not include a cookie: " + req.url);
    }

    // Fetch again with an origin header differing from the backend
    if (!anonymousRequestDifferentOrigin.headers.has("Origin")) {
      // An Origin header value is a full origin (scheme plus host), not a bare hostname
      anonymousRequestDifferentOrigin.headers.set("Origin", "https://fastly.com");
      let backendResponse = await fetch(anonymousRequestDifferentOrigin, {
        backend: backend,
        cacheOverride: new CacheOverride("pass"),
      });

      // Compare responses and report which URLs require the same origin
      let responseBuffer = await backendResponse.arrayBuffer();
      if (cookieResponse.status !== backendResponse.status || !buffersAreEqual(cookieBuffer, responseBuffer)) {
        console.log("URL requires same origin: " + req.url);
      } else {
        console.log("URL appears not to require same origin: " + req.url);
      }
    } else {
      console.log("Incoming request already included an Origin header: " + req.url);
    }

    // Send the original (unmodified request's) response back to the client from the buffered body
    return new Response(cookieBuffer, {
      headers: cookieResponse.headers,
      status: cookieResponse.status,
    });
  }

The full program also has some error-handling code towards the bottom (the catch that pairs with the try above) as well as some buffer stuff (the buffersAreEqual helper), but above is the engine of our little Compute program. Again, there’s nothing particularly fancy here. This is not an earth-shattering experiment that challenges our conceptions of space-time.
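The excerpt calls `buffersAreEqual` without showing it. Here is a minimal sketch of what such a helper could look like, assuming a plain byte-by-byte comparison of two `ArrayBuffer` values; this is an assumption for illustration, not necessarily how the original program implements it.

```javascript
// Assumed implementation: the post does not show the original helper.
// Returns true only when both ArrayBuffers hold exactly the same bytes.
function buffersAreEqual(a, b) {
  // Different lengths can never be equal, so bail out early.
  if (a.byteLength !== b.byteLength) {
    return false;
  }
  // ArrayBuffers can't be indexed directly; wrap them in byte views.
  const viewA = new Uint8Array(a);
  const viewB = new Uint8Array(b);
  for (let i = 0; i < viewA.length; i++) {
    if (viewA[i] !== viewB[i]) {
      return false;
    }
  }
  return true;
}

// Example usage with two small buffers.
const first = Uint8Array.from([1, 2, 3]).buffer;
const second = Uint8Array.from([1, 2, 3]).buffer;
console.log(buffersAreEqual(first, second));
```

A strict byte comparison is the simplest choice, but it will flag any difference at all, including ones unrelated to the header being tested.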

Its simplicity is a strength. It makes this experiment an excellent starting point for teams who want to experiment with experiments, to get familiar with how it works and feels so they can chart a course for wider adoption. It also makes it a good candidate for a continuous experiment; whenever engineers push changes to the site, you can confirm your basic assumptions about its behavior are holding over time.
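If you do run this continuously, you’ll want a quick way to spot when an assumption stops holding. Below is a rough sketch of one way to do that, assuming you can collect the program’s log output as an array of strings. The `findViolations` helper is hypothetical; the prefixes it matches are copied from the console.log calls in the example.

```javascript
// Log prefixes copied from the example's console.log calls that indicate
// an assumption did NOT hold for a URL.
const VIOLATION_PREFIXES = [
  "URL appears not to require a cookie: ",
  "URL appears not to require same origin: ",
];

// Hypothetical helper: return the unique URLs whose log lines contradict
// one of our assumptions. How the lines are collected is up to you.
function findViolations(logLines) {
  const urls = new Set();
  for (const line of logLines) {
    for (const prefix of VIOLATION_PREFIXES) {
      if (line.startsWith(prefix)) {
        // Everything after the prefix is the request URL.
        urls.add(line.slice(prefix.length));
      }
    }
  }
  return [...urls];
}

// Example usage with a few sample log lines.
const sampleLogs = [
  "URL requires cookie: https://example.com/account",
  "URL appears not to require a cookie: https://example.com/login",
  "URL appears not to require same origin: https://example.com/login",
];
console.log(findViolations(sampleLogs));
```

An empty result after a deploy means the evidence still matches your assumptions; anything else is a prompt to go investigate.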

Conclusion

Getting started with chaos experimentation isn’t about “breaking” production or blowing people’s minds. It’s about starting small with one of your “no duh” assumptions, building a lightweight program to run the experiment, analyzing the resulting evidence, and building the muscle memory to gain confidence in chaos.

I personally love Compute as a place to run experiments since it means I can be lazy and focus on chaos logic rather than all the infra stuff. And it imparts the extremely important business benefit of not disrupting production or real users. If you decide to try out this code example on Compute, I’d love to chat with you about it and brainstorm other chaos spells we could concoct.
