The pursuit of innovation necessitates what The New York Times CTO Nick Rockwell calls “good risk taking,” which is calculated, distributed, and hedged. In this Altitude series, the technical leaders behind some of the leading online destinations explain how their teams succeed by viewing failure as an opportunity to learn (and find better solutions for future failures).
Last June, we shared Seth Vargo’s story of HashiCorp’s quick and relatively painless recovery (despite a nerve-wracking “purge all”) during an outage. In this post, we’ll share tips for successful failure and recovery from Kenton Jacobsen, Director of Engineering at Vogue.com and Glamour.com.
Like HashiCorp, Kenton’s team at Vogue understands the importance of “failing gracefully,” and how to draw on lessons learned to continuously improve going forward: “It’s not about failing less, but failing more intelligently.” They keep the following in mind for thoughtful risk taking:
- Information is imperfect. The “oral tradition” at many organizations creates dead ends – critical information needs to be translated into documentation.
- Everyone makes mistakes.
- Complex systems have complex failures. The more pieces and parts that interact in particular ways, the more complex failures can be.
- Sometimes there is a single point of failure. There shouldn’t be a single button/command/event that requires manual intervention (e.g., “Never turn that server off because it won’t turn on again”), but it happens. Those sorts of systems should be fixed as soon as they’re identified, not back-burnered until a crisis hits.
- Fundamental surprises happen. Failures can occur in ways that can’t be predicted (e.g., the AWS outage).
- Some things can only be observed in production, AKA the real world.
- Release pipelines get blocked; it’s important to keep fire escapes clear. As noted by former Etsy CTO John Allspaw, you need to be able to deploy code and spin up infrastructure rapidly; if you’ve parked stuff there that’s untested or unfinished, you’re blocking your fire escape.
- Failure can cascade — limit your blast radius to keep the cost of failure low. A system or feature should only affect itself or its immediate neighbors, so you don’t take down your entire system.
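As an illustration of limiting blast radius, here’s a minimal sketch (a hypothetical helper, not from Vogue’s codebase) of containing a dependency call behind a timeout and a safe fallback, so one failing subsystem only degrades its own feature:

```javascript
// Hypothetical sketch: contain a failing dependency behind a timeout
// and a safe fallback, so the failure stays local to one feature
// instead of cascading through the page.
function withFallback(call, fallback, ms = 2000) {
  const timeout = new Promise((resolve) => {
    const t = setTimeout(() => resolve(fallback), ms);
    // Don't keep a Node.js process alive just for the timeout.
    if (t.unref) t.unref();
  });
  // Race the real call against the timeout; on any error, use the fallback.
  return Promise.race([Promise.resolve().then(call), timeout])
    .catch(() => fallback);
}

// Usage: a broken recommendations service renders an empty module
// rather than taking the whole article page down with it. The
// "/api/recommendations" endpoint is illustrative only.
withFallback(() => fetch("/api/recommendations").then((r) => r.json()), [])
  .then((items) => console.log("rendered", items.length, "recommendations"));
```

The design choice here is that the fallback is decided by the caller, so each feature defines what “degraded but alive” means for itself.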
The faster you go, the higher your risk.
That said, tech teams “[have] to take risks to innovate.” It’s like driving a car — while it’s certainly riskier than sitting at home all day, you have to increase your velocity in order to go places.
Optimizing a 404 page, or asking the wrong questions
The culprit turned out to be jQuery’s AJAX method, whose global cache property had been set to false, resulting in a timestamp being appended to every request.
(Editor’s note: an interesting and little-known side effect of using a timestamp as a cache-buster is that requests can still hit cache at the edge. If the timestamp has one-second resolution, i.e., it changes once per second, and multiple clients make requests within the same second, Fastly will see requests at the edge for the same URL and collapse them into one.)
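To make the mechanics concrete, here’s a minimal sketch in plain JavaScript (an illustration of the behavior, not jQuery’s actual source) of what cache-busting like this looks like: a timestamp is appended as a `_` query parameter, making each URL unique to the cache.

```javascript
// Sketch of timestamp-based cache busting, in the style of what jQuery
// does when its AJAX cache property is false (e.g. set globally via
// $.ajaxSetup({ cache: false })): append a timestamp as a "_" query
// parameter so caches treat each request URL as new.
function bustCache(url) {
  const sep = url.includes("?") ? "&" : "?";
  return url + sep + "_=" + Date.now();
}

// Every call yields a distinct URL down to the timestamp's resolution,
// so an edge cache sees a miss unless multiple requests land within the
// same timestamp tick and can be collapsed into one fetch.
const a = bustCache("/api/articles");        // e.g. /api/articles?_=<ms>
const b = bustCache("/api/articles?page=2"); // e.g. /api/articles?page=2&_=<ms>
```

Note that a millisecond-resolution timestamp makes same-tick collapsing far rarer than the one-second case described in the editor’s note above, which is exactly why the resolution of the cache-buster matters.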
In an effort to fail quietly and quickly by creating 404 pages that were more entertaining, the team ended up with cascading failures. Not to mention, as Kenton put it: “When you’re optimizing your 404 pages to that degree, you’re probably asking some of the wrong questions.”
(Editor’s note: Vogue’s instinct to do something interesting on their 404 page is not unusual — other Fastly customers have also produced interesting “Not found” pages, including the Financial Times, who offer a delightful and informative 404 page with definitions of various economic theories.)
Restarting an unfamiliar database
In an effort to deliver a new content search feature, Vogue’s team needed to restart a MySQL database. MySQL is a very complex system with many thousands of possible error states, but because restarting the process almost always works without error, it can be tempting to think you don’t need expert support: although the team hadn’t done it on this system before, they’d restarted mysqld many times. In this case, an unexpected and unfamiliar error message, “Failed: intervention required” (one they’d never seen before and haven’t since), led to elevated pulses, some frantic googling, and the hard-earned lesson that not having the skills and knowledge on hand to debug a problem can lead to a massive incident. Fortunately, repeating the restart eventually resolved the problem, and the experience left the team with clear takeaways: understanding the risks of production-affecting changes, knowing where to reach out for expertise, and knowing how quickly a crisis can be addressed are all critical to preventing catastrophe.
These war stories yielded the following sage advice:
- Fail fast so you can ultimately fix faster. If it takes you a long time to roll out something that fails, it will also take you a long time to fix.
- Failure will happen. It’s a matter of when, not if.
- Take smart risks, so you can live to fail another day.
Check out more war stories here, and stay tuned — we’ll continue to recap customer tales of success and failure going forward.