HashiCorp on recovering from failures

Altitude NYC brought together a brilliant group of Fastly customers – we heard war stories from industry leaders at Vogue, Spotify, The New York Times, HashiCorp, and more. In this post, we’ll recap Seth Vargo’s talk about how HashiCorp recovered from the infamous S3 outage and prepared to mitigate future failures.

Recovering from failures with Fastly

Seth Vargo at Altitude NYC

Seth is the Director of Technical Advocacy at HashiCorp. The San Francisco-based open source software company is an active participant in Fastly’s open source program, and he shared some of HashiCorp’s impressive growth stats from 2016:

  • 14,243,908 downloads

  • 549.9 TB bandwidth

  • Serving an average of 62.6 GB/hour

  • 80 requests/minute: “These are big binaries, not web requests,” with individual binaries ranging from 24 to 124 MB

  • 96% cache coverage

HashiCorp often sees major spikes in demand; during annual conferences and user events, they experience massive bursts of traffic, where “all of a sudden” they’ll have 300,000 people trying to download Vagrant at once. This wouldn’t scale well if they were running their own infrastructure, but “Fastly handles it just great.”

Fastly & the AWS outage

According to Seth, it’s the job of systems administrators and DevOps engineers to prepare for the reality that “the cloud is inherently unreliable.” Seth cited the AWS S3 outage from February 2017 and asked for a show of hands to see how many people in the Altitude audience were affected, but he assured us that this wasn’t a bashing talk: “A number of companies, HashiCorp included, built services that singly depended on Amazon S3.”

HashiCorp hosts all of their static sites on Fastly; the content is stored in an S3 bucket fronted by Fastly’s CDN. The content lives in that cache “forever” (with a “pretty infinite TTL”) because part of their deploy process is to manually purge the cache. All of the sites are stored in the same S3 bucket (an important implementation detail for later), and they use Custom VCL to rewrite each request’s URL so that every site routes to its own folder within that shared bucket.
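
That rewrite is a small piece of Custom VCL. A minimal sketch of the pattern, assuming hypothetical hostnames, folder names, and bucket name (illustrative only, not HashiCorp’s actual configuration):

    sub vcl_recv {
      # Map each site's hostname to its own folder inside the shared bucket.
      if (req.http.host == "www.terraform.io") {
        set req.url = "/site-terraform" req.url;
      } else if (req.http.host == "www.consul.io") {
        set req.url = "/site-consul" req.url;
      }
      # Point the Host header at the single S3 bucket that stores every site
      # (bucket name is an assumption for illustration).
      set req.http.host = "hashicorp-sites-example.s3.amazonaws.com";
    }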

As part of a deploy, they issue a purge on a key (e.g., purge site-terraform or site-consul) to purge just the HTML for those sites, and they also spider the site (i.e., use bots to hit pages on their sites). They “cheat” to pre-warm the cache: because they use edge caching and shielding, recursively spidering the site forces the content into the shield POP they chose, and all of the other edge POPs will then gradually pull from that shield rather than from origin.
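
For those keyed purges to work, each object needs a surrogate key attached at the edge (or sent by the origin). A rough sketch, reusing the folder prefixes from the rewrite above and the site-terraform / site-consul key names Seth mentioned (again illustrative, not their actual config):

    sub vcl_fetch {
      # Tag each site's objects with a key so a deploy can purge just that
      # site (e.g. "purge site-terraform") instead of the whole service.
      if (req.url ~ "^/site-terraform/") {
        set beresp.http.Surrogate-Key = "site-terraform";
      } else if (req.url ~ "^/site-consul/") {
        set beresp.http.Surrogate-Key = "site-consul";
      }
    }

The deploy script can then issue the keyed purge through Fastly’s purge API (e.g. a POST to /service/<service_id>/purge/site-terraform) before the spider re-warms the cache.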

When the S3 outage happened, Fastly’s CDN could no longer talk to HashiCorp’s origin, but it still had the cached data. Seth knew this because they had issued the spider. Although all data might not have propagated to the edge, it should be in the origin shield node, and the edges should be able to pull from there. So why did all of their sites go down?

It turned out that as part of their deploy, they had a few new pages that were pushed to S3. The spider step, which happens immediately after the deploy, failed to find those new pages they’d deployed. Seth and his team assumed that the spider had failed and that the cache was outdated (“Maybe we hit an edge server that hadn’t propagated yet”) so they decided to purge all. “You see where this is going,” Seth said, amidst knowing murmurs from the audience.

HashiCorp saw “a number of systematic failures.” They didn’t know S3 was down; in fact, “the internet hadn’t even reported it yet,” and because they deploy their static sites so frequently (10-15 times a day), HashiCorp was one of the first customers to notice and report the issue to Amazon. Additionally, they had identified some caching issues but hadn’t yet invested the engineering resources in figuring out the reasons behind them; they were working around them (i.e., avoiding them) by purging the cache, which further compounded the S3 issue. (They’ve since fixed that.)

The “cool part” was that Fastly still had the data. One of Fastly’s customer support engineers pinged Seth on Slack offering to help. In the process, HashiCorp learned a bit more about Fastly’s architecture: when you tell Fastly’s CDN to cache something, it uses a cache key, a composition of hostname + URL + ###generation###, which is what you see when you look at the custom-generated VCL. When that VCL is compiled onto the boxes, that placeholder is replaced with an auto-incrementing ID.
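
In the generated VCL, that composition shows up in the hash subroutine. A simplified sketch of what that boilerplate roughly looks like (not copied from HashiCorp’s service, and the placeholder text is illustrative):

    sub vcl_hash {
      # The cache key combines the URL, the host, and a generation marker.
      # When the VCL is compiled, the placeholder below is swapped for an
      # auto-incrementing ID.
      set req.hash += req.url;
      set req.hash += req.http.host;
      set req.hash += "###generation###";
      return (hash);
    }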

When you issue a purge-all command, Fastly takes that generation value and increments it, so your data isn’t actually gone; it’s just no longer being served from the cache. The data is still there, living in the background, and all you have to do is restore that generation value. Seth pointed out that you have to work with the Fastly support team to do this: they’ll go into your service and revert to the previously known generation value, and you’ll start serving the traffic that still lives in memory, effectively reverting the purge all. Seth noted that “what’s great about Fastly” is that a purge all is effectively a “fake purge”: the data is still there, and if you “seriously mess up” you can undo it. You can always reach out to the support team for help: “Fastly support is pretty instamagic,” Seth said. “I think the response time is negative seconds.”

Follow-up steps: mitigating future failures

As part of the recovery from the S3 issue, Fastly’s support engineer also identified other areas where HashiCorp wasn’t optimizing caching and was doing some things they “shouldn’t have been.” Seth reiterated, “If we were using Fastly properly, none of this would have happened.”

Before the S3 incident, users had been reporting that browsers were caching the content for too long, so Seth’s team lowered the time to live (TTL) from 24-48 hours down to 4. Although this was the right change to make, they were using the same cache-control header to control both how long the content is stored in Fastly and how long it’s stored in the browser. Because their content is very static, they can actually tell Fastly to cache it indefinitely, since they purge the cache as part of their deploy. Browsers, on the other hand, should refresh every 2-4 hours; those refreshes hit Fastly, not the origin.

They use Fastly’s Surrogate-Control header to separate how long Fastly caches content from the Cache-Control header they send to the browser. They send the browser a Cache-Control header set to 4 hours and set the Surrogate-Control header to a week: Fastly will hold static site data in memory for about a week, and it will gradually fall out of cache after that. That way, if “someone has an outage that lasts longer than a week,” they’ll have other opportunities to recover.
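
In VCL terms, the split looks roughly like the sketch below. The durations come from the talk (about a week at the edge, 4 hours in the browser), but the code itself is an illustrative sketch; the same effect can come from the origin sending a Surrogate-Control header, which Fastly consumes and strips before the response reaches the browser.

    sub vcl_fetch {
      # Keep objects in Fastly's cache for about a week ...
      set beresp.ttl = 7d;
      # ... while browsers only see Cache-Control and revalidate after 4 hours.
      set beresp.http.Cache-Control = "max-age=14400";
    }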

They also added two “super important” headers, stale-if-error and stale-while-revalidate. Stale-if-error tells Fastly to serve cached content if there’s an error retrieving content from the origin, even if the cache time has expired. Stale-while-revalidate is “a little more nuanced,” and is especially important if you’re using soft purge, which HashiCorp “uses for everything now.” Soft purge tells Fastly’s CDN to purge data from the cache but not actually get rid of it immediately: content is marked as outdated instead of permanently purged while new content is fetched from the origin.
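
Both directives can be supplied on the Surrogate-Control (or Cache-Control) header from the origin, or set directly in VCL. A hedged sketch with illustrative durations (not necessarily the values HashiCorp uses):

    sub vcl_fetch {
      # If the origin is erroring, keep serving the stale copy for up to a day.
      set beresp.stale_if_error = 86400s;
      # Serve stale content for up to a minute while a fresh copy is fetched
      # in the background; soft purge leans on this, since purged objects are
      # marked outdated rather than deleted.
      set beresp.stale_while_revalidate = 60s;
    }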

Want more HashiCorp? Join us at our West Coast customer summit June 28-29 to hear from HashiCorp CTO Armon Dadgar, who’ll discuss how Nomad, their application scheduler, empowers developers & increases resource utilization.

Anna MacLachlan
Content Marketing Manager

Anna MacLachlan is Fastly’s Content Marketing Manager, where she talks to brands and partners to tell stories about scale, security, and performance. She received her MA in Comparative Literature from NYU and loves megafauna and mountains.
