Migrating MySQL Stats to Bigtable with no downtime | Fastly

Staff Software Engineer

July 06, 2017

Learning from the past is an essential step in decision making; at Fastly, we offer our Historical Stats API to help customers make quicker and better decisions based on past events. This API enables you to retrieve historical caching statistics by the minute, hour, and day, offering key insight into events and informing future decisions. In this blog post, I’ll walk through how we switched the Historical Stats API database from self-managed MySQL to Google Cloud Bigtable without any downtime or service disruption.

Priorities and technologies

One of many technical challenges of building a Historical Stats API (and its globally distributed pipeline) is the immense amount of small-datum that needs to be processed, stored, and organized for fast retrieval.

In our early architecture, we chose to build a MySQL-based data service to store and serve historical statistics. MySQL/InnoDB was an easy choice because of its strong ecosystem and our familiarity with the software. This is still true today and we run MySQL in many other services. So why move the historical stats data off MySQL? The original architecture worked well, but we also knew that we would soon outgrow this architecture: the ever-increasing data volume was making us run into physical limitations (such as storage capacity), which also led to the schema becoming more sensitive to modify. We were faced with a decision — whether to continue growing our MySQL-based stats infrastructure, or look for another solution, which would include reviewing cloud database providers. The latter was appealing because offloading the management of the underlying storage mechanism would allow us to put more focus into improving our analytical products, which in turn leads to delivering more value to our customers. We ultimately decided to go with Google Cloud Bigtable because of its capacity to handle high volumes of data (e.g. time series data), sparse table design, and scalability through the dynamic node mechanism. As a whole, Google Cloud Platform (GCP) made sense to us because of their rigorous compliance support and the global interconnects between Fastly and GCP.

Cloud Bigtable is a massively scalable, multi-dimensional wide-column database service that is available through the GCP. Since this blog post is not about Cloud Bigtable itself, I’ll refrain from going over the technical details. However, do point your browser at this lovely paper if you’re interested in the design and low-level details of Bigtable.

Anatomy of Fastly’s Historical Stats API

Historical stats include aggregated data such as the number of requests, bandwidth usage, and hit ratio, providing you with a comprehensive overview of your traffic. No personal information or remote log streaming data is included.

The public-facing Historical Stats API is served by a small read-only Ruby web application that consumes historical data from a separate service. Data population, on the other hand, is performed by a dedicated service that consumes and processes incoming messages from our globally distributed stats pipeline. (There’s a lot more going on in reality, but you get the idea.) In essence, we needed to create another pipe that would connect Bigtable to the global stats pipeline and write some abstraction to make it easier for our front end Ruby application to pull data out of Bigtable in various ways, such as per service and datacenter. Doesn’t sound too bad, right? Think again.

Question everything that can go wrong

Swapping out the old data service with a new one and hoping for the best would have been irresponsible. We made sure that the switch would not cause any disruptions to our customers by questioning many things, including:

Is the data in Bigtable consistent with MySQL?
Are responses from the Historical Stats API correct?
What happens if there’s a failure in the stats pipeline?
Are the performance characteristics of the new system acceptable?
Are there any edge cases that we might have missed?
Is our system monitoring sufficient to discover anomalies?

Testing for correctness at scale

To ensure that data is consistent between Bigtable and MySQL, we built multiple verification methods and ran them independently. One of them periodically runs at the database layer, comparing rows inside Bigtable, MySQL, and other databases. Another verification method runs at the web API layer, where every Historical Stats API request is passed on to an asynchronous verification worker. This verification worker is responsible for comparing responses computed using Bigtable and MySQL for the given request, without disrupting its lifecycle. If an inconsistency is found, the worker then logs the event to our data analytics platform for later analysis and real-time discovery of data inconsistencies. Because our API front end is a Ruby application, we were able to leverage GitHub’s scientist gem in our verification worker, which helped us safely experiment with new data retrieval code.

We ended up running this “shadow” verification process for months before cutting over production requests to the new data service. We highly recommend taking this type of approach as it can really uncover all kinds of interesting facts, including unexpected computation cost. For example, our app performance monitoring showed that the majority of the time spent on computing Bigtable-based responses was due to network latency, which led us to realize that we needed to migrate our Stats API servers closer to Bigtable as well. There’s just something special about the production environment that can teach us a lot.

Gradually cut over API routes

Thanks to months of testing, we were relatively confident that our Bigtable deployment (and relevant software) were ready for production traffic. Even then, we decided to be extra cautious by cutting over only one or two API routes at a time to ensure that everything was running as expected. While this partial cutover strategy sounds tricky, things went very smoothly thanks to the ease of configuring request routing with VCL. Here’s a very simple example to give you an idea:

Given two backends, api_node and api_experimental_node:

backend api_node {
  .host = "127.0.0.1";
  .port = "443";
  … snip ...  
}
 
backend api_experimental_node {
  .host = "127.0.0.2";
  .port = "443";
  … snip ...
}

One option is to point API requests against a specific resource to the experimental node by matching on the request path.

sub vcl_recv {
  if (req.url ~ "^/path/to/resource") {
    set req.backend = api_experimental_node;
  } else { 
    set req.backend = api_node;
  }
}

Similarly, you could force a request to hit the experimental node if a particular HTTP header is present in the request. This is especially useful for end-to-end debugging.

sub vcl_recv {
  if (req.http.Api-Debug ~ "experimental" || req.url ~ "^/path/to/resource") {
    set req.backend = api_experimental_node;
  } else { 
    set req.backend = api_node;
  }
}

A particularly VCL-savvy reader might now be wondering — what if I have multiple API nodes for performance and redundancy reasons (which is often the case)? Never fear, this is also straightforward to configure — check out our docs on fine-tuning your load balancing configuration.

Our takeaways: thinking boldly & strategically

Switching the persistent database system of an active service while avoiding downtime and service disruption is not only difficult, it requires patience and a lot of testing. It’s a classic “easier said than done” problem to get just right. The methodologies introduced in this blog post worked for us, and we hope they might help you if you’re thinking about tackling a similar problem.

The database switch from MySQL to Bigtable and its new data pipe has turned out well so far; data remains consistent and client services are continuing to behave as expected. However, we would be lying if we said we got everything perfect the first time: we've made various tweaks along the way, including rethinking and re-architecting the way we write to Bigtable to gain more throughput. Stay tuned: we hope to cover these lower-level learnings and improvements in a future blog post.

For more on our Historical Stats API, check out our documentation, and we also recommend exploring our Real-Time Analytics API.