Powering the world’s checkout counter: behind Stripe’s multi-CDN strategy | Altitude NYC 2019

Suhas Muralidhar and Shulin Jia, Software Engineers at Stripe, share how the company not only uses CDNs to serve static assets on Stripe’s sites, but also to serve its secure hosted payment page. They'll also explore how and why Stripe recently implemented a multi-CDN strategy to amplify resiliency.

(00:00):
Good afternoon everyone. In this presentation, we'll talk about multi-CDN integrations for resiliency and share some of the experiences we had at Stripe. We'll go over our thought process behind why we thought a multi-CDN setup was important, look into some of the considerations we had before starting this project, then go into some implementation details on how we built tools to ease integration with multiple CDNs, followed by some observations and takeaways. We are from Stripe. Who here has used Stripe or knows about Stripe? Great, I see a lot of hands. That's awesome. Stripe is a technology company that builds economic infrastructure for the internet. A lot of companies, ranging from very small startups to big enterprise companies, use our software to accept payments and manage their businesses online. I'm Suhas.

(01:00):
And I'm Shulin.

(01:00):
We're both engineers on the Edge team at Stripe. The Edge team builds and manages infrastructure to ingress low-latency API connections for Stripe, provides an edge computing framework, and manages our CDN infrastructure. Shulin and I recently worked on spinning up a second CDN for Stripe, and we'll talk about that.

(01:22):
To give a bit of background, Stripe started out using Fastly as its primary CDN, and the usage spans three different Stripe products. The Stripe Dashboard is the user interface our merchants use to log in and manage their Stripe account. You can think of this as a place where merchants can see recent payments, manage disputes with their customers, issue refunds, and monitor their entire integration. We serve static assets for the Stripe Dashboard through CDNs. Stripe.js is our foundational JavaScript library, along with Stripe Elements, which provides prebuilt UI components. Stripe.js provides a unified way to tokenize customer information, accept credit card information, and process payments. Stripe.js also provides an easy way to accept different payment methods, like Apple Pay and Google Pay. Millions of businesses use Stripe.js, and we serve it through the CDNs.

(02:24):
Stripe Checkout is a secure, Stripe-hosted payments page that's easily customizable with a few lines of code. It's designed to be a drop-in payments page, it's desktop- and mobile-optimized, and it helps drive conversion. We serve assets for Checkout using CDNs as well. So why do we need multiple CDNs? We thought about this a lot when we started the project, and in this section I'll go over some of the process we used while deciding. The primary reasoning was around resiliency. At Stripe, we try to avoid single points of failure in our infrastructure, and having a redundant CDN provider helps us a lot with serving production traffic during any CDN provider outage.

(03:14):
This also helps us shield our origin servers from extra load when one of the providers has an outage. Another consideration was latency. Having multiple CDNs gives us a way to do smart routing at the edge. As an example, if our users let us know that they're having issues loading assets in one part of the world, we have systems to detect that and do automatic failover between the CDNs based on their availability or whether they're having an outage. Cost is another reason. Oftentimes, having multiple CDN providers will help you negotiate a better cost structure for your use cases. This can be done by contacting the vendor's sales team, explaining your use case, and laying out the bandwidth requirements you have.

(04:06):
Simplicity was another thing we considered a lot at Stripe. We have a slightly complicated deploy pipeline for assets, because we wanted redundant origin servers and low latency for our end users. Having multiple CDNs lets us simplify that pipeline and make it easier to reason about: we can rely on a redundant CDN to serve assets in times of failure, and not worry about maintaining multiple origin servers. Once we understood the need for multiple CDNs, the next thing we thought a lot about was what to look for when considering a new provider. We'll go over that here.

(04:46):
The first and primary one was performance. We wanted to ensure we did not degrade performance for end users with the new CDN setup; it should be as transparent as possible to them whether we have multiple CDNs or a single CDN. One way we ensured this was by identifying high-traffic paths for our static assets across the different products and running benchmarking tools against the new provider. That way, we ensured that whichever provider we chose would not degrade existing performance for our clients.

(05:17):
Cost is another factor to consider. It's always good to have a back-of-the-napkin calculation of how much the new provider would cost. This helps you ensure you don't go over your existing CDN budget. You can do this by looking at the publicly available cost structure, or by talking to the sales team to understand how much it would cost for your specific use cases and requirements.

(05:44):
Another consideration to think about in the initial stages is feature parity. For example, we had slightly complicated VCL logic with Fastly, where we were using functionality like modifying request and response headers and sending custom Content Security Policies for different products, and we also needed ways to do redirection and use custom certificates. It's good to have a new provider that offers these functionalities out of the box, so that we don't have to build custom solutions that lock us into a specific provider. So we looked closely at the documentation to see if we could directly use their solutions. We also looked at the setup experience for the new CDN: for example, whether they have good API documentation for us to integrate with them easily, or good Terraform support.

(06:38):
Another consideration was points of presence. Stripe currently operates in 34 countries, and we wanted to ensure the new provider is also available in different regions around the world. This helps reduce latency for our end users by letting them connect to the POP closest to them. Change propagation is pretty critical for us, primarily because we have multiple product teams that make changes to the CDNs quite a lot, and we wanted to ensure those changes are quickly propagated to the edge of every CDN. You can think of this as a fast iteration loop for the downstream users; having a way to quickly propagate changes to all the edge servers was critical.

(07:23):
Support is another aspect we often overlook when starting new projects, but when you're working with a completely new vendor, you trip up a lot. So it's good to have a support site where things are clearly documented, and failing that, a good support model. We have good support with Fastly, and we wanted to ensure we could get the same from the new vendor we chose.

(07:47):
Contract negotiation is another piece that often takes some time to complete. Keeping that in the back of your mind when starting the project will help you get a good estimate for the project itself.

(08:05):
Once we understood the need and the considerations we were looking for in a new provider, we looked into how to implement it for our downstream users. The first thing we thought about was how to make the experience seamless for those downstream users, in this case the product teams behind the Dashboard, Checkout, and Stripe.js. They were used to managing a single CDN provider, where we had a way for them to define the configs and functionality they wanted. But it's hard to translate that to a new provider, which might have a very different way of expressing or operating it. So we built a common framework that exposes a single interface to our downstream users, through which they can define things like custom certificates, custom headers for their use cases, and custom redirection logic. We then built tooling around that to automatically transpile this into each CDN's specific configuration. That way, the product teams don't have to worry about one versus n CDNs; they can focus on their functionality and leave it to us to propagate it to the CDNs. Another piece of tooling we thought a lot about is an easy way to deploy to more than one CDN. It's good to have a documented and well-tested way to deploy your changes to multiple CDNs. So we built a CLI. In this example, let's say we have two CDNs, A and B, both serving 50-50% of production traffic in an active-active manner. Now, if we want to deploy, the first thing we do is use the CLI to issue a deploy command, which changes the routing weights going to the individual CDNs. In this case we move production traffic completely away from B, over to A, and then we issue the deploy command on B.
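
To make that concrete, here's a minimal sketch of what such a single interface might look like. Everything here is illustrative; the type and function names are hypothetical, not Stripe's actual framework:

```typescript
// A provider-agnostic config that product teams write once.
interface EdgeConfig {
  hostname: string;                         // e.g. "js.stripe.com"
  certificate: string;                      // reference to a custom TLS cert
  responseHeaders: Record<string, string>;  // e.g. Content-Security-Policy
  redirects: { from: string; to: string; status: number }[];
}

// Each provider supplies a translator from the common config to its
// native format (VCL, JSON, Terraform, ...), plus a deploy hook.
interface CdnProvider {
  name: string;
  render(config: EdgeConfig): string;
  deploy(rendered: string): Promise<void>;
}

// Propagating a change is then the same loop for one CDN or n CDNs.
async function propagate(config: EdgeConfig, providers: CdnProvider[]): Promise<void> {
  for (const provider of providers) {
    await provider.deploy(provider.render(config));
  }
}
```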

(10:12):
Once a deploy is done, we run validations on top of it. You can think of validations as hitting custom endpoints for assets, verifying the response headers, and also checking some CSP rules. Once validation succeeds, we do the same thing for A, in reverse: we shift the production traffic over to CDN B. That way, if the changes are mismatched or wrong, we can automatically roll back. Once we've moved the traffic to B, we do the same process: we deploy to A, run validations, and verify. At the end of this we return success to the client, and before exiting the tool sets the routing back to 50-50%.
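
Put together, the deploy sequence looks roughly like this sketch. The setWeights, deployTo, validate, and rollback helpers are hypothetical stand-ins for our actual tooling:

```typescript
type Cdn = "A" | "B";

async function setWeights(weights: Record<Cdn, number>): Promise<void> {
  /* stand-in: update the routing weights between the two CDNs */
}
async function deployTo(cdn: Cdn): Promise<void> {
  /* stand-in: push the transpiled config to this provider */
}
async function validate(cdn: Cdn): Promise<void> {
  /* stand-in: hit asset endpoints on this CDN, verify status codes,
     response headers, and CSP rules; throw on any mismatch */
}
async function rollback(cdn: Cdn): Promise<void> {
  /* stand-in: redeploy the last known-good config */
}

// Drain one CDN, deploy to it, and validate before it takes traffic again.
async function deployOne(target: Cdn): Promise<void> {
  await setWeights(target === "B" ? { A: 100, B: 0 } : { A: 0, B: 100 });
  await deployTo(target);
  try {
    await validate(target);
  } catch (err) {
    await rollback(target); // a bad deploy never sees production traffic
    throw err;
  }
}

async function deployBoth(): Promise<void> {
  try {
    await deployOne("B");
    await deployOne("A");
  } finally {
    // Success or failure, exit at active-active 50-50 so neither CDN
    // is left quietly serving with a stale config.
    await setWeights({ A: 50, B: 50 });
  }
}
```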

(10:55):
So, the reason we maintain active traffic across the CDNs is to ensure that both of them, or all of them, have updated configs and there are no stale configs hanging around. Now I'll hand it over to my teammate Shulin to talk more about some of the tooling we built around incident response, observability, and takeaways.

(11:14):
Oh, thank you.

(11:20):
So, as my coworker Suhas was saying, we built common tooling to manage two CDNs, both for making changes to the CDNs and for deploying those changes. It turns out we also had to extend that tooling to how we respond to incidents. I know this has been mentioned a lot, but purging is very easy in Fastly. Previously, when we had an incident where an incorrect asset was deployed to the origin server and got propagated to the CDN, we would roll back that change on the origin and purge the cache on our single CDN, which was really easy to do. We actually ended up using the console a lot, just because it was so fail-safe. But you can imagine that's quite difficult to do when you have an incident spanning two CDNs and you're logging in to both consoles; it's kind of nerve-wracking. So we made sure that when we built the common tooling, we had a single command to invalidate both caches at once.
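
As a sketch, that command can simply fan out to each provider's invalidation API. The Fastly purge-all endpoint below is the documented one; the second provider's call is a stand-in, and the credentials are placeholders:

```typescript
// One command, both caches: run this during incident response instead
// of clicking through two consoles.
const FASTLY_SERVICE_ID = "<service-id>"; // placeholder
const FASTLY_API_TOKEN = "<api-token>";   // placeholder

async function purgeFastly(): Promise<void> {
  // Fastly's purge-all endpoint: POST /service/{service_id}/purge_all
  const res = await fetch(
    `https://api.fastly.com/service/${FASTLY_SERVICE_ID}/purge_all`,
    { method: "POST", headers: { "Fastly-Key": FASTLY_API_TOKEN } },
  );
  if (!res.ok) throw new Error(`Fastly purge failed: ${res.status}`);
}

async function purgeProviderB(): Promise<void> {
  /* stand-in for the second provider's cache-invalidation API */
}

export async function purgeAll(): Promise<void> {
  await Promise.all([purgeFastly(), purgeProviderB()]);
}
```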

(12:32):
And then another scenario, and this is one of the primary reasons we set up the second CDN, is when a provider has an outage. When a CDN provider has an outage, we want to make sure our tooling can route traffic, not just for our deployments, but also for incident response. We use DNS for routing traffic to our two CDNs, and this is really helpful because we can even use things like geo DNS or anycast to reroute traffic for specific regions that are experiencing issues.
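
In spirit, the failover step is just a weight change in DNS. Here's a minimal sketch, assuming a hypothetical updateDnsWeights call that maps onto whatever weighted-record API your DNS provider exposes:

```typescript
// Shift all traffic to the healthy CDN by rewriting DNS weights.
type Cdn = "A" | "B";

async function updateDnsWeights(weights: Record<Cdn, number>): Promise<void> {
  /* stand-in: e.g. update weighted records for the assets hostname;
     a geo-DNS variant would scope this change to affected regions */
}

async function failover(unhealthy: Cdn): Promise<void> {
  await updateDnsWeights(
    unhealthy === "A" ? { A: 0, B: 100 } : { A: 100, B: 0 },
  );
}
```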

(13:14):
And I'll talk in a bit about how we actually detect those incidents, but first I want to talk about observability. Observability was really important to us on this project, because we need to know when things are going wrong, but also when things are going right. We set up a second CDN where all of the configuration was expressed differently than on the first, so we wanted to make sure the end results were very similar. Some of the metrics we wanted to collect: are the cache hit/miss ratios very similar? That can tell us whether we've configured the TTLs correctly. Is the amount of bandwidth we serve at a 50-50 split similar? That can tell us whether we've set up compression correctly.

(14:06):
And then of course, metrics help us determine if we're seeing any problems or incidents. The way we detected whether a CDN provider was having issues was to set up a health check endpoint on both CDN providers, along with network monitoring agents that constantly ping their local POPs to check that they're reachable. From there, we can notify the on-call engineer if there's a problem with a CDN, and they run the command to reroute traffic to the CDN that's healthy.
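
A monitoring agent in that style can be as simple as the following sketch; the health-check URLs and the pageOnCall hook are illustrative:

```typescript
// Periodically probe each CDN's health-check endpoint from an agent
// near a POP, and page on-call when a provider stops responding.
const HEALTH_URLS: Record<string, string> = {
  A: "https://cdn-a.example.com/healthz", // illustrative URLs
  B: "https://cdn-b.example.com/healthz",
};

async function isHealthy(url: string): Promise<boolean> {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(2_000) });
    return res.ok;
  } catch {
    return false; // a timeout or network error counts as unreachable
  }
}

async function pageOnCall(cdn: string): Promise<void> {
  /* stand-in for the paging/alerting integration */
}

setInterval(async () => {
  for (const [cdn, url] of Object.entries(HEALTH_URLS)) {
    if (!(await isHealthy(url))) await pageOnCall(cdn);
  }
}, 30_000);
```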

(14:51):
And we also wanted to make sure we had the ability to observe when we caused issues with our own deployments. This is a graph from when we did a deployment, and we had an elevated number of cache misses and elevated latency, which is kind of expected: when you do a deployment you introduce a new asset, and since you haven't warmed the cache, you're going to have more cache misses. You can see that in this case it decreased, so it wasn't an incident; but in other cases, where maybe we see an increase in 400s or 500s, that might be an incident we'd have to deal with.

(15:39):
So that's why metrics are really important for us to collect, and we spent a lot of effort collecting them, because it turns out that when you have two CDN providers, they're not going to give you the same APIs, and they're also not going to give you the same metrics. For example, the metric names are really different. One provider gave us the cache hit/miss ratio as counts of requests that were considered hits versus misses, and the other provider gave us those numbers as a percentage. As another example, latency was expressed in seconds or in milliseconds, depending on the provider. Thankfully, it was really easy to integrate those into our dashboards and manipulate the formats of those metrics to make them look very similar.
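
The normalization itself is mechanical. Here's a sketch of the kind of shim involved; the input shapes are made up for illustration, not either provider's real API:

```typescript
// Normalize both providers' cache and latency metrics into one shape.
interface NormalizedMetrics {
  hitRatio: number; // 0..1
  latencyMs: number;
}

// Provider 1 (illustrative): raw hit/miss counts, latency in seconds.
function fromCounts(hits: number, misses: number, latencySec: number): NormalizedMetrics {
  return { hitRatio: hits / (hits + misses), latencyMs: latencySec * 1000 };
}

// Provider 2 (illustrative): hit percentage, latency already in ms.
function fromPercent(hitPercent: number, latencyMs: number): NormalizedMetrics {
  return { hitRatio: hitPercent / 100, latencyMs };
}
```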

(16:34):
But we had to spend a lot of work getting the logs into our logging platform. Fastly provides a stream that you can feed into your logging provider; with the other provider we used, we had to do a lot more setup. And similar to the metrics, the logs are expressed differently. As I learned during this conference, Fastly's logs are highly configurable, but maybe the second provider's are not, and the log formats differ too. For example, take the user agent: one provider gave us strings that were URL-encoded, and the other provider's strings weren't URL-encoded. So we had to consolidate those two, because things like the user agent are really important when we're trying to debug issues.
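
Consolidation on ingest can be a tiny shim like this sketch, which decodes the URL-encoded variant into one canonical form:

```typescript
// Normalize user-agent strings from both log streams: one provider
// URL-encodes them, the other ships them raw.
function normalizeUserAgent(raw: string): string {
  try {
    return decodeURIComponent(raw);
  } catch {
    // Not valid percent-encoding (i.e. already raw); keep as-is.
    return raw;
  }
}

// normalizeUserAgent("Mozilla%2F5.0%20(Macintosh)") and
// normalizeUserAgent("Mozilla/5.0 (Macintosh)") both yield the same string.
```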

(17:40):
So aside from all of this data and observability we can gather from our CDN providers, we also wanted to make sure we collected information about how actual customers were experiencing our product. So we set up synthetic traffic to hit the POPs local to the regions our customers are in. This was really useful to us because not only could we see a breakdown of latency, we could also use these monitors to see what route a packet took to actually get to the POP, or how DNS was resolved; just more information to help us debug issues in specific regions.

(18:28):
But of course this doesn't actually represent what real users are seeing, because the network conditions in a datacenter are not the same as what you and I get at home. So we plan on augmenting this data with real-user metrics. We actually do this for our API already, and we'll soon do it for our CDN assets. What we'll use is the Resource Timing API, which is available in most browsers. From it you can get a breakdown of latency: not just the overall total, but the DNS lookup time, the TCP connection time, and the request and response time. And I think that will give us a better idea of how we're serving our users with our CDNs.
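
For instance, a page could report something like the following browser-side sketch; the asset host filter and where you ship the numbers are illustrative choices:

```typescript
// Break down latency for CDN-served assets with the Resource Timing API.
const entries = performance.getEntriesByType(
  "resource",
) as PerformanceResourceTiming[];

for (const e of entries) {
  if (!e.name.includes("js.stripe.com")) continue; // filter host (illustrative)
  const timing = {
    url: e.name,
    dnsMs: e.domainLookupEnd - e.domainLookupStart,
    tcpMs: e.connectEnd - e.connectStart, // includes the TLS handshake
    requestResponseMs: e.responseEnd - e.requestStart,
    totalMs: e.responseEnd - e.startTime,
  };
  // In practice you'd beacon this to an analytics endpoint, e.g.
  // navigator.sendBeacon("/rum", JSON.stringify(timing));
  console.log(timing);
}
```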

(19:22):
And some of the takeaways from doing this project with Suhas: some expertise is needed to set up multiple CDNs. You really have to understand the offerings of both your CDN providers to be able to tell which ones actually mean the same thing. And because our team has an on-call rotation, the entire team has to understand how to operate each CDN, especially in times of emergency, so the custom tooling we built was really helpful for that. There were a few steps we couldn't automate with APIs, so we wrote runbooks, which were quite helpful. This was a lot of work to do just to set up a second CDN, but in the end, for the improved resiliency against single-provider outages, we definitely think it was worth it. Thank you.