ButterCMS lets developers add a content management system to any website in minutes. Our business requires us to deliver near-100% uptime for our API, but after multiple outages that nearly crippled our business, we became obsessed with eliminating single points of failure. In this post, I’ll discuss how we use Fastly’s edge cloud platform and other digital strategies to make sure we keep our customers’ websites up and running.
When you hear “CMS” or “blogging,” you probably think of WordPress. ButterCMS is an alternative that allows development teams to add CMS and blogging functionality into their own native codebases using the ButterCMS API. Today, ButterCMS powers thousands of websites across the world, helping serve millions of requests a month.
Downtime is fatal
Our customers typically build websites that make an API request to Butter for page content during their request/response lifecycle. If that request fails, their page most likely won’t render. In other words, if Butter’s API goes down, our customers’ websites go down with us.
This is a lesson we learned the hard way in our early days. Unreliable server hosting led to frequent intermittent outages and performance degradations that frustrated customers. Then a botched DNS migration caused hours of API downtime, taking dozens of customers’ websites offline for nearly half a day. Several customers left, and many more questioned whether they could continue relying on us.
After this incident, we recognized that ensuring near-100% uptime was an existential issue. A significant outage in the future could lead to us losing hard-earned customers and put our business in crisis.
Delivering a global, fast, resilient API
Avoiding failure completely is not possible; you can only do your best to reduce the chances of it.
For example, “controlling your own fate” by running your own physical servers protects you against a hosting provider going down, but puts you in the position of having to handle security and scalability, both of which can easily take you down and be difficult to recover from.
For our team, keeping our API up at all times and making sure it delivered high performance across the globe was crucial. But as a smaller company, we knew we didn’t have the resources to deliver global, highly scalable performance with near-100% uptime. So we turned to a company that did: Fastly.
We use Fastly’s edge cloud platform in front of our API as a cache layer; all API requests are served via Fastly’s CDN.
When one of our customers updates their website content in ButterCMS, we invalidate the cached API responses for the specific pieces of content that were edited. Non-cached requests hit our servers, but we have a 93% hit rate because content on our customers’ websites changes infrequently relative to the number of visitors they get. This means that even if our database or servers experience intermittent outages, our API remains up. Although we wouldn’t want this, in theory, if our servers went down completely for several hours, our customers’ websites would stay up for any content already cached, so long as Fastly was up.
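To make the caching behavior concrete, here is a minimal in-memory sketch of the idea, not Fastly's actual implementation or API. The `Origin` and `EdgeCache` classes and the `/pages/home` path are hypothetical names made up for illustration; a real CDN also handles TTLs, headers, and distributed purging.

```python
from typing import Callable, Dict


class Origin:
    """Hypothetical origin API that can be switched 'down' for the demo."""

    def __init__(self, content: Dict[str, str]) -> None:
        self.content = content
        self.up = True

    def fetch(self, path: str) -> str:
        if not self.up:
            raise ConnectionError("origin unavailable")
        return self.content[path]


class EdgeCache:
    """Toy stand-in for a CDN cache layer in front of the origin."""

    def __init__(self, fetch_from_origin: Callable[[str], str]) -> None:
        self.fetch_from_origin = fetch_from_origin
        self.store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> str:
        # Cached responses are served without touching the origin at all.
        if path in self.store:
            self.hits += 1
            return self.store[path]
        # Only non-cached requests reach the origin (and can fail with it).
        self.misses += 1
        body = self.fetch_from_origin(path)
        self.store[path] = body
        return body

    def invalidate(self, path: str) -> None:
        # Called when the content behind `path` is edited.
        self.store.pop(path, None)
```

In this sketch, a request for a cached path keeps succeeding after `origin.up` is set to `False`, while a request for a path that was just invalidated would fail until the origin recovers. That is why a high hit rate translates directly into resilience.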
Eliminating single points of failure
During the early days of ButterCMS, we dealt with two separate DNS incidents that left us scarred. In the first incident, our DNS provider at the time accidentally “cancelled” our account from their system, leading to an outage that took nearly six hours for us to fully recover from. The second incident occurred when routine DNS editing led to a malfunction by our DNS provider, and took nearly half a day to resolve. DNS incidents are particularly damaging because even after an issue is identified and fixed, you have to wait for various DNS servers and ISPs to clear their caches before customers see the fix on their end (DNS servers also tend to have minimum or maximum TTL constraints that they impose regardless of the TTL you have set).
Our experiences made us extremely focused on eliminating any single point of failure across our architecture.
For DNS, we switched to using multiple nameservers from different DNS providers. DNS providers often allow and encourage you to use four to six redundant nameservers (e.g., ns1.example.com, ns2.example.com). This is great: if one fails, requests will still be resolved by the others. But if all your nameservers come from a single company, you’re putting a lot of faith in that company having 100% uptime.
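As a rough illustration of checking for this kind of redundancy, the sketch below groups a list of nameserver hostnames by provider and reports whether more than one provider is present. The function names and the `provider-a.example` / `provider-b.example` hostnames are hypothetical, and treating the last two DNS labels as the "provider" is a simplifying assumption: it would miscount a provider that spreads its nameservers across several domains.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_provider(nameservers: List[str]) -> Dict[str, List[str]]:
    """Group nameserver hostnames by a naive provider key (last two labels)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ns in nameservers:
        # Normalize case and drop the trailing root dot, if any.
        labels = ns.rstrip(".").lower().split(".")
        provider = ".".join(labels[-2:])
        groups[provider].append(ns)
    return dict(groups)


def has_provider_redundancy(nameservers: List[str]) -> bool:
    """True when the nameservers span at least two distinct providers."""
    return len(group_by_provider(nameservers)) >= 2
```

With only `ns1.provider-a.example` and `ns2.provider-a.example`, the check reports no redundancy at the provider level, even though there are two nameservers. Adding a nameserver from a second provider makes it pass.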
For our application servers, we use Heroku’s monitoring and auto-scaling tools to make sure our performance doesn’t degrade from spikes in traffic. In addition to caching our API with Fastly’s CDN, we also cache our API at the application level using Memcached, which provides an additional buffer against intermittent database or server failure.
To protect against the rare possibility of a total outage across Heroku or AWS (which Heroku runs on), we maintain a separate server and database instance running on Google Cloud that we can fail over to quickly.
Failure is inevitable
No matter how reliable we make our API, we have to accept that networks are unreliable and failures are bound to occur. We’ve all experienced trouble connecting to Wi-Fi, or had a phone call drop on us abruptly. Outages, routing problems, and other intermittent failures may be statistically unusual on the whole, but are still bound to be happening all the time at some ambient background rate.
To overcome this sort of inherently unreliable environment, we help customers build applications that will be robust in the event of failure. Our SDKs offer features such as automatic retries when API requests fail and support for a failover cache, such as Redis, on the client.
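The general pattern of retrying and then falling back to a local cache can be sketched as follows. This is not the ButterCMS SDK's actual interface: `fetch_with_fallback`, its parameters, and the backoff schedule are assumptions for illustration, and a plain dictionary stands in for a client-side cache such as Redis.

```python
import time
from typing import Callable, Dict, Optional


def fetch_with_fallback(
    fetch: Callable[[str], str],
    key: str,
    fallback_cache: Dict[str, str],
    retries: int = 3,
    backoff_seconds: float = 0.1,
) -> str:
    """Try `fetch` up to `retries` times, then fall back to the last good copy."""
    last_error: Optional[Exception] = None
    for attempt in range(retries):
        try:
            value = fetch(key)
            # Remember every successful response for future outages.
            fallback_cache[key] = value
            return value
        except ConnectionError as exc:
            last_error = exc
            # Exponential backoff between attempts, but not after the last one.
            if attempt < retries - 1:
                time.sleep(backoff_seconds * (2 ** attempt))
    # All attempts failed: serve the last known good copy if there is one.
    if key in fallback_cache:
        return fallback_cache[key]
    raise RuntimeError(f"no live or cached copy of {key!r}") from last_error
```

The trade-off in this design is that the fallback may serve stale content during an outage, which for most content sites is preferable to a page that fails to render.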
Without realizing it, many of us are building single points of failure into our stack. At ButterCMS, success depends on ensuring that our customers’ applications don’t ever go down because of us. We do this by eliminating as many single points of failure as possible from our backend infrastructure, and providing SDKs that make it easy for our customers to achieve resiliency and fault-tolerance within their applications.