The pursuit of innovation necessitates what The New York Times CTO Nick Rockwell calls “good risk taking,” which is calculated, distributed, and hedged. In this Altitude series, the technical leaders behind some of the leading online destinations explain how their teams succeed by viewing failure as an opportunity to learn (and find better solutions for future failures).
Last June, we shared Seth Vargo’s story of HashiCorp’s quick and relatively painless recovery (despite a nerve-wracking “purge all”) during an outage. In this post, we’ll share tips for successful failure and recovery from Kenton Jacobsen, Director of Engineering at Vogue.com and Glamour.com.
Like HashiCorp, Kenton’s team at Vogue understands the importance of “failing gracefully,” and how to draw on lessons learned to continuously improve going forward: “It’s not about failing less, but failing more intelligently.” They keep the following in mind for thoughtful risk taking:
The faster you go, the higher your risk:
That said, tech teams “[have] to take risks to innovate.” It’s like driving a car — while it’s certainly riskier than sitting at home all day, you have to increase your velocity in order to go places.
It turned out that jQuery’s AJAX settings include a cache option, which had been set to false globally, so jQuery appended a timestamp to every request URL.
(Editor’s note: an interesting and little-known side-effect of using a timestamp as a cache-buster is that it still often hits cache at the edge — if it’s a to-the-second-resolution timestamp, i.e., changes once per second, and multiple clients make requests in the same second, Fastly will see requests at the edge for the same URL and collapse them into one.)
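The effect of timestamp resolution on caching can be sketched in a few lines of Python. This is only an illustration, not the code Vogue or Fastly ran: the helper names (`add_cache_buster`, `distinct_urls`), the example URL, and the request times are all hypothetical, and the `_` parameter name follows what the jQuery documentation describes for `cache: false`. It also assumes the full URL, query string included, is what a cache uses to tell requests apart.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_cache_buster(url: str, now_ms: int, resolution_ms: int = 1) -> str:
    """Append a `_=<timestamp>` parameter, rounded down to resolution_ms."""
    stamp = now_ms - (now_ms % resolution_ms)
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("_", str(stamp)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def distinct_urls(url: str, request_times_ms: list[int], resolution_ms: int) -> set[str]:
    """Return the set of different URLs a cache would see for these requests."""
    return {add_cache_buster(url, t, resolution_ms) for t in request_times_ms}


# Five hypothetical clients asking for the same resource within one second.
base_ms = 1_700_000_000_000
request_times_ms = [base_ms + offset for offset in (5, 120, 340, 610, 990)]
url = "https://example.com/api/articles?page=2"

# Millisecond resolution: every request carries a different URL.
per_ms = distinct_urls(url, request_times_ms, resolution_ms=1)

# Second resolution: all five requests share a single URL.
per_second = distinct_urls(url, request_times_ms, resolution_ms=1000)

print(len(per_ms), len(per_second))
```

With millisecond stamps the five requests produce five different URLs, so none of them can share a cached response. With second-resolution stamps they produce one URL, which is the case where requests arriving together can be served from a single fetch.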
In an effort to fail quietly and quickly while making their 404 pages more entertaining, the team ended up with cascading failures. On top of that: “When you’re optimizing your 404 pages to that degree, you’re probably asking some of the wrong questions.”
(Editor’s note: Vogue’s instinct to do something interesting on their 404 page is not unusual — other Fastly customers have also produced interesting “Not found” pages, including the Financial Times, who offer a delightful and informative 404 page with definitions of various economic theories.)
In an effort to deliver a new content search feature, Vogue’s team needed to restart a MySQL database. MySQL is a very complex system with many thousands of possible error states, but since restarting the process almost always works without error, it can be tempting to think you don’t need expert support: although the team hadn’t done it on this system before, they’d restarted mysqld many times. In this case, an unexpected and unfamiliar error message, “Failed: intervention required” (one they’d never seen and haven’t since), led to elevated pulses, some frantic googling, and the hard-earned lesson that not having the skills and knowledge on hand to debug a problem could lead to a massive incident. Fortunately, repeating the restart eventually solved the problem. The overall takeaways: understanding the risks of production-affecting changes, knowing where to reach out for expertise, and knowing how quickly a crisis can be addressed are critical to preventing catastrophe.
From these war stories resulted the following sage advice:
Check out more war stories here, and stay tuned — we’ll continue to recap customer tales of success and failure going forward.