Cats, memes, and everything in between: how Giphy delivers billions of GIFs at scale | Altitude NYC 2019

Join Giphy CTO Anthony Johnson as he explains how Fastly enables Giphy to deliver billions of GIFs to millions of users — every single day.

(00:05):

Good afternoon. Ooh, that's loud. Well, hopefully, you guys can all hear me. So one of the things, I mean I've been watching the presentations the whole day and it's a lot of code. This being Giphy, there's not going to be any code here, but a lot of animations. And if there's hard questions, I'm actually going to be calling on that laugh that I've heard just now from Anthony, another Anthony. So I'm the CTO of Giphy. We were asked if we wanted to present here and we were actually very excited to. And one of the reasons we were excited to present is that, for us, Fastly is key to how we operate day to day. A lot of stuff today is about features, about new ways of doing stuff, and so I have no intention of repeating information about a specific feature. What I do want to cover is a little bit about how we think of it, how we use it, and how it's part of our strategy, design, and architecture.

(01:18):

So first of all, let's talk about scale. We serve 10 billion GIFs a day. Those 10 billion GIFs are going through Fastly. There are 700 million active users a day, and that's all through Fastly again. And roughly speaking, that's 16 million hours of GIFs being watched every day, which, and that's a conservative number, again makes you question a lot about what people are doing in the world. So what are we looking for? Actually, I'm going to skip this one. Here. So this is where people use it. It's in Facebook, it's in Twitter, it's in iMessage. Who here has used it in Slack? Great. So that's 98% of people. You've also probably used it in every single other communication mechanism that you have, whether you knew it was us or not, I don't know, but everyone knows Slack.

(02:20):

So about three years ago, we found ourselves having grown massively. Year over year, we saw traffic more than double, and up until that point we had a very traditional viewpoint of CDNs, which was saying, "Listen, we have content, we want to cache it. We have an API, we want to cache it. And we have media and we want to cache it." And that was pretty well the end of the story. Then we thought, "Okay, well, we actually need to understand what's happening at the edge. So let's start bringing in some logs." And three years ago, three and a half years ago, we started running into issues with a lot of the people we were working with at the time. When there was a failure in a POP, it was very hard for us to work with our partners to understand what the issue was, and when we needed to get logs and we wanted real-time information about what was happening, nothing was really designed around that.

(03:24):

So we spent a lot of our time saying, "Hey, that, you know, hourly S3 transfer didn't happen." And we quickly realized that we needed a better relationship with a CDN. So we started looking around and we started talking to Fastly. And you know, from our standpoint, as a company, to grow we need to provide the best experience to our end users and we need to provide a really good experience to our partners. So we started from there and then kind of worked backward. We realized, "Okay, there's a couple of core things we need to do. When there's a slowdown or an issue, if we spend our time talking to four different IT groups and networking groups to work out where the issue is and everyone's blaming someone else, then we're never going to resolve the issue and our partners are not going to get the quality that they need."

(04:24):

So when we talked to Fastly, we said, "Listen, this is core to us." And through the relationship, they've been able to help us with debugging issues that clients are seeing, and that was actually a core deciding factor for us. Performance and uptime, the traditional ones. We did a comparison, we did a bake-off. We found the performance was everything we wanted. Again, worldwide coverage. Then the part that started coming up more and more was DDoS. We started seeing issues there. We needed someone who could provide a little bit more insight into that. But really for us, it's this: Giphy is a small company. We serve a lot of GIFs, but we serve a lot of GIFs with very few engineers, relatively speaking. So what we needed was something that, by default, worked well, that was extensible, and that we could grow our usage of with a partner. Done. That's the sales pitch done.

(05:37):

Now, there's a couple of things we decided when we were looking at this. One of the things is we have, well, hundreds of millions of GIFs and we need to be able to manage those GIFs efficiently. That means we need to be able to remove them, refresh them, change them, replace them, do all that, and we have to have certain guarantees around that. We can't say, "Hey, we cleared a cache. Maybe sometime tomorrow it'll be gone." In general, our partners don't find that acceptable, so we needed something we could do faster.

(06:13):

So for us, the surrogate key piece was actually absolutely core, and there's a couple of examples here of how we do these URLs. We needed a way to be able to clear queries, and we needed to be able to clear individual GIFs. There's a bunch of different ways of looking at the things that are being cached that we need to be able to use to clear that cache meaningfully. And obviously, the most important part was we were kind of interested in this whole VCL thing. We had never done anything at the edge and we saw that as an opportunity for quite a few areas of interest that we needed to expand into. So I'm going to talk about two main pieces. Our world, the Giphy world, consists of our API, the media we're serving, and our owned and operated properties like our websites and mobile apps.
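
To make that concrete, here is a minimal sketch of how responses could be tagged so that both an individual GIF and a whole search query can be cleared later. The framework, key names, and IDs are illustrative assumptions, not Giphy's actual scheme.

```python
# Illustrative sketch only: attach surrogate keys to a response so that
# Fastly can later purge every object tagged with a given key.
# The key naming scheme ("gif-<id>", "search-<term>") is hypothetical.
from flask import Flask, jsonify, make_response

app = Flask(__name__)

@app.route("/v1/gifs/search")
def search():
    term = "cats"                       # stand-in for real query parsing
    gif_ids = ["abc123", "def456"]      # stand-in for real search results
    resp = make_response(jsonify({"data": gif_ids}))
    # One key per GIF plus one key for the whole query, space separated.
    keys = [f"gif-{gid}" for gid in gif_ids] + [f"search-{term}"]
    resp.headers["Surrogate-Key"] = " ".join(keys)
    return resp
```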

(07:12):

Now, today I'm actually just going to focus a little bit on media and mostly on the API. Now, all these diagrams are pretty well simplified, but I want to walk a little bit through some of the complexity that we see in our operations. So with media, as I said, there were three things that really mattered. The features that mattered to us were the ability to purge keys instantaneously, whether that's for a whole bunch of GIFs or one GIF. The surrogate key pairing: being able to look at a bunch of GIFs from a search term, for example, and clear all of them. And caching at the POPs: we have, again, hundreds of millions of GIFs, and we need to cache those as aggressively as possible to get the content quickly around the world.

(08:11):

And this diagram, I'm not going to go into a bunch of detail, but there's a couple of key points that I would love to have a laser pointer to show you that I don't have. So at the bottom left you'll see that there is an ad load event. So every time a piece of media gets loaded, there's an ID that identifies that GIF, and it sends that back. And when we're doing ads, it allows us to reconcile that information with our ad servers. The other part is all our logs, kind of going back to the previous presentation, are streamed in real time. So we actually get those real time through a log analytics server, we get them through a real-time server that is then used for business intelligence, and we also get them in batches every five minutes, which allows us to then build a whole bunch of downstream ETLs, which allow us to understand how users are really using our system.
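
As a rough illustration of that five-minute batch path, a downstream ETL step might look something like the sketch below. The bucket name, key prefix, compression, and field layout are all assumptions, not the actual log schema.

```python
# Sketch of a downstream ETL step over five-minute log batches in S3.
# Bucket name, key prefix, and log format are illustrative assumptions.
import gzip
from collections import Counter

import boto3

s3 = boto3.client("s3")
BUCKET = "example-fastly-logs"          # hypothetical bucket
PREFIX = "media/2019/11/13/"            # hypothetical five-minute partition

def count_views(bucket: str, prefix: str) -> Counter:
    views = Counter()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            text = gzip.decompress(body).decode("utf-8", errors="replace")
            for line in text.splitlines():
                # Assume the GIF ID is the third whitespace-separated field.
                fields = line.split()
                if len(fields) > 2:
                    views[fields[2]] += 1
    return views
```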

(09:13):

But really the thing I want to talk about today is the API. So I mentioned analytics — we serve 10 billion GIFs a day and that's great and we're very happy about that. But four years ago we didn't really exactly know where those GIFs were going. Now, four years later, we have a much, much better understanding. Every time a GIF goes out, we can track it back to roughly what the request was. We understand that whole life cycle. And that's incredibly important for us because we need to be able to serve ads. We need to be able to understand our user base. We need to be able to understand our partners and their usage. And actually, it's core to search quality: every time you pick one GIF over another GIF, we need to know that, because that is the most important signal that we can possibly have to make search better.

(10:14):

Retry logic. So we always want to have the best user experience possible. Now, we all know not everything always works perfectly. So there are always two options. We can be very formal about it and say, "All right, that's a 500 or whatever the issue is and we're going to throw a 500 and hope that the thousand companies that have implemented a REST API all spend a great deal of time reading Fielding's document so that they know exactly how to behave in every single one of the 55 different REST codes." Or we can just say, "Sorry, no GIFs on that one." And it turns out that implementation is a lot easier for our partners, and we break far fewer people's applications by doing that and helping them. Now, we can switch that off. We can get smarter about that. We can be more aggressive about sharing information, but ultimately it gives our users, our partners, a better experience as developers.
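
A minimal sketch of that idea, expressed as ordinary application code rather than edge logic; the backend call, retry count, and backoff values are invented for illustration.

```python
# Illustration of "no GIFs" instead of a 500: retry, then degrade gracefully.
# fetch_results() is a hypothetical stand-in for the real search backend.
import time

def fetch_results(query: str) -> list:
    raise NotImplementedError  # placeholder for the real backend call

def search_with_fallback(query: str, attempts: int = 3) -> dict:
    for attempt in range(attempts):
        try:
            return {"data": fetch_results(query), "status": 200}
        except Exception:
            time.sleep(0.05 * (2 ** attempt))   # small exponential backoff
    # Degrade to an empty, well-formed response rather than an error code.
    return {"data": [], "status": 200}
```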

(11:16):

Stale content cache. Our viewpoint on content is, if I search for something and no one else has searched for it for the last seven minutes or whatever the timeline is, the reality is that something that's maybe a minute out of date is probably perfectly adequate. So we have a bunch of logic around this to give users the best experience. If I'm on a mobile phone, the last thing we need to be doing is adding latency to that very long request and sending a whole bunch of media over. So we optimize for the user experience. Error handling, I already kind of mentioned that. And then DDoS and bot protection — as we scaled, we saw more and more DDoS attacks, and more recently we've been seeing bot attacks. So we now work with PerimeterX, which ties in very nicely to the Fastly platform and has protected us against the daily attacks we see.
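
One common way to express that kind of policy is through Cache-Control extensions such as stale-while-revalidate and stale-if-error. The sketch below uses TTL values invented for illustration, not Giphy's real settings.

```python
# Sketch: short freshness window plus permission to serve slightly stale
# content while revalidating, or when the origin is erroring.
# The specific TTL values are illustrative only.
def cache_headers(ttl: int = 60, swr: int = 600, sie: int = 86400) -> dict:
    return {
        "Cache-Control": (
            f"public, max-age={ttl}, "
            f"stale-while-revalidate={swr}, stale-if-error={sie}"
        )
    }
```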

(12:12):

So this is the ultra-simplified API layout. A request comes in, it hits the Fastly CDN, it hits our API, and in most cases, assuming it's search, it goes to a search service and pulls some metadata. But the reason I wanted to show this is because sometimes people upload new content, sometimes we delete content. It doesn't really matter. The point is content is constantly changing. So what we've done is centralize all that logic across 50 different microservices. Every time content changes, we call our internal cache service, cache control, which then goes out to Fastly, purges it based on surrogate keys or individual keys, and makes sure that the thing that we're presenting to the world is always the best version, the cleanest version, and the most up-to-date version.
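
As a hedged sketch of what such a cache-control call might look like against Fastly's purge-by-surrogate-key endpoint; the service ID, token handling, and key scheme are placeholders, and the current Fastly API reference is the authority on the exact request shape.

```python
# Sketch of purging by surrogate key when content changes.
# FASTLY_SERVICE_ID, FASTLY_API_TOKEN, and the key scheme are placeholders.
import os
import requests

FASTLY_SERVICE_ID = os.environ.get("FASTLY_SERVICE_ID", "example-service-id")
FASTLY_API_TOKEN = os.environ.get("FASTLY_API_TOKEN", "example-token")

def purge_surrogate_key(key: str, soft: bool = True) -> int:
    """Ask Fastly to invalidate everything tagged with `key`."""
    url = f"https://api.fastly.com/service/{FASTLY_SERVICE_ID}/purge/{key}"
    headers = {"Fastly-Key": FASTLY_API_TOKEN}
    if soft:
        # Soft purge marks objects stale instead of evicting them outright.
        headers["Fastly-Soft-Purge"] = "1"
    return requests.post(url, headers=headers, timeout=5).status_code

# e.g. after a GIF is deleted or replaced:
# purge_surrogate_key("gif-abc123")
```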

(13:10):

There were a couple of principles we wanted to operate on. So, again, I mentioned the REST stuff. Ultimately, we want the API to be simple, and simple doesn't always mean following formal practices. It means dealing with real engineers, real users, people who make mistakes, and providing them with a great experience. And the second is always providing the end user with the best possible experience. In our case, the best sticker or the best GIF. So that's why, again, we put together retry logic. We make sure that if there are errors in our systems we try again. If there are timeouts, if there are network issues, let's make sure that the end user experiences those as little as possible. Stale content cache, again, same thing, and the error handling I mentioned.

(14:10):

So the tracking analytics, again, is very cool. So we've done a couple of things. We have the real-time piece — it's being piped through Syslog. Syslog is then ingested, processed, and put into a Kinesis stream, and we have a whole bunch of downstream systems from that. Second is the S3 bucket. I think at one point four years ago, we didn't have access to anything remotely like Syslog, so there was no real time at all, and we actually had to provide an FTP or SFTP server to get logs. And we saw most of these systems fail at least twice a week. So having S3 buckets, having Syslog, is something that has fundamentally changed how we deal with the data and what we're able to build. And both of these combined let us do things like understanding ad performance. They let us understand how search is doing, and they let us understand how the content is being used, and how the content is being used by different partners.

(15:21):

All right. Monitoring. So we have three main systems. The first is Scalyr. And the way we use this is, when we see an error, we need to understand what's happening in real time. So we go to Scalyr, we have the live Syslogs, they're being sliced and diced, and we're able to go through it. We're able to click on it, we're able to see which partner it is. We're able to see if it's systematic, or if it's different aspects, if it's different countries, it doesn't really matter. So we slice and dice through that, and that usually helps us understand the real-time type of issues we see. And that's become a pretty core first step in debugging issues that we see. Second is DataDog. So DataDog provides a complementary set of tools. It provides alerting. It basically allows us to have all that data come in and alert on it. Whether it's a partner that is suddenly seeing an increase in volume or a sudden spike in total traffic, it doesn't really matter. We can put that in, we can do the slicing, and then have that go into a Slack channel to inform us when we see anomalies.
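
For illustration, alerting like that is usually built on metrics emitted from the serving path. A small sketch using the DogStatsD client from the datadog Python package, with metric and tag names invented here:

```python
# Sketch: emit per-partner request metrics so anomaly monitors and
# Slack alerts can be built on top of them. Metric/tag names are made up.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

def record_request(partner: str, country: str, status: int) -> None:
    tags = [f"partner:{partner}", f"country:{country}", f"status:{status}"]
    statsd.increment("api.requests", tags=tags)
    if status >= 500:
        statsd.increment("api.errors", tags=tags)
```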

(16:45):

And that's where the time-series analysis comes in. There's a lot of seasonality. There are far more people looking for GIFs at 1:00 PM Eastern time than there are at 1:00 AM. Now, every country has its own thing, but the point is there's a seasonality component to that and we have to work out how to handle that. DataDog provides that ability. Then obviously the Fastly dashboard. So typically, this is where we go to understand major, major systemic changes, whether it's a cache hit rate change or real-time performance. These are the tools that we probably look at at least once a day, whether things are operating well or there are issues. We will use a combination of these three to really dig down and identify what's going on. And actually, the one feature I will mention is the network debug tool, which is a web page, and we've sent it to virtually every partner that we have to work out why they're having issues, and it's fantastic.
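
A toy sketch of the kind of week-over-week comparison that accounts for that seasonality; the tolerance and numbers are arbitrary and purely illustrative.

```python
# Toy seasonality check: compare the current hour's traffic to the same
# hour one week earlier and flag large deviations. Threshold is arbitrary.
def is_anomalous(current: float, same_hour_last_week: float,
                 tolerance: float = 0.5) -> bool:
    if same_hour_last_week <= 0:
        return current > 0
    ratio = current / same_hour_last_week
    return abs(ratio - 1.0) > tolerance

# e.g. is_anomalous(1_200_000, 2_600_000) -> True (traffic dropped sharply)
```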

(17:54):

So these are the things that we're talking about. So Scalyr, again, we use it for debugging, real-time slicing of data, and kind of a control set for performance. We can slice and dice the data. DataDog, and I had to pull some fake graphs for all this because I'm afraid I can't show the actual traffic numbers, is where we integrate the main infrastructure monitoring and multi-stage network data, which is to say ELBs combined with the Fastly traffic, to get an understanding of how these things all tie together. And then Fastly.

(18:39):

So I want to talk about the analytics pieces. There's two main ones. And by analytics, in this case, I don't mean performance monitoring of production. I mean: how are we getting that data? How are we using that data, and what are the systems we've put in place to be able to take in the 10-plus billion log entries that we get a day and make that into something meaningful? And I won't go into the data engineering aspect of that, but I think it's important for us to talk about how we've leveraged the Fastly integration to make that real.

(19:23):

So Fastly2Kinesis is the name of the service. You can see it at the bottom right of the diagram. What it does is take in the real-time data, pre-process it a little bit, and put it into a Kinesis stream that is then used by several downstream services. This one is a particular example, which is MOAT, an ad-tracking system from Oracle that we are tied into. So this means that every time a GIF that's an ad is used, it goes back to a third-party tracking system and is verified. This is something that, again, we would not have been able to do a few years ago, and it is core to Giphy's ad tech platform.
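
As a rough illustration of what a forwarder along those lines could look like; the stream name, syslog-over-TCP framing, record shape, and batch size are all assumptions, not the actual service.

```python
# Sketch: accept syslog-style log lines over TCP, do light preprocessing,
# and forward them to a Kinesis stream in batches. Names are placeholders.
import json
import socketserver

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "example-fastly-logs"     # hypothetical stream

class LogHandler(socketserver.StreamRequestHandler):
    def handle(self):
        records = []
        for raw in self.rfile:
            line = raw.decode("utf-8", errors="replace").strip()
            if not line:
                continue
            records.append({
                "Data": json.dumps({"raw": line}).encode("utf-8"),
                "PartitionKey": str(hash(line) % 1000),
            })
            if len(records) >= 100:
                kinesis.put_records(StreamName=STREAM_NAME, Records=records)
                records = []
        if records:
            kinesis.put_records(StreamName=STREAM_NAME, Records=records)

if __name__ == "__main__":
    with socketserver.TCPServer(("0.0.0.0", 5140), LogHandler) as server:
        server.serve_forever()
```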

(20:18):

This is the analytics workflow. I wanted to just cover this a little bit because part of the intelligence work we have to do is that we have a whole bunch of discrete and completely unrelated calls. So a client's application calls us, they say, "Hey, I'm looking for cats." We send a bunch of links back, and then the client application calls and downloads all of those GIFs. So we have to be able to find a way of merging these two pieces of data into something unified and meaningful. This is the workflow that happens, and this is all done through a combination of our actual API code and some VCL plugging real-time stats into that. And then downstream, these IDs are merged back together, and we're able to sessionize that data and create a true history of the usage by users.
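
A small sketch of that downstream merge, assuming both event streams carry a shared response identifier; the field names here are hypothetical.

```python
# Sketch: join search-response events with media-load events on a shared
# response ID so downstream jobs can sessionize them. Field names invented.
from collections import defaultdict

def merge_events(search_events: list, load_events: list) -> list:
    loads_by_response = defaultdict(list)
    for load in load_events:
        loads_by_response[load["response_id"]].append(load["gif_id"])

    merged = []
    for search in search_events:
        merged.append({
            "response_id": search["response_id"],
            "user_id": search["user_id"],
            "query": search["query"],
            "loaded_gifs": loads_by_response.get(search["response_id"], []),
        })
    return merged
```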

(21:18):

So everyone here is listening to the latest and greatest products. A couple of days ago, there were several people in Slack very, very excited about containers at the edge. I wasn't able to get a screen capture, but there was a 10-minute discussion in Slack about this particular event, specifically on that topic. So I have a feeling that is definitely coming down the road. For us, strategically, the ability to have code that is a more traditional part of our API, or to push our API out into the container, is something we're looking into. Being able to unit test and provide a traditional CI/CD system for some of the business logic that is getting more and more complex over time, and moving that into the containers, is something we will be doing in 2020.

(22:15):

And implementation. Our code base is written in several different languages. It would be nice to be able to leverage some of those codebases at the edge. So there's virtually no question that we will continue down that path. And then the other one, actually, is we are now in five different locations worldwide. And one of the things that's happened as we've moved into an ad tech space and started having more brand clients is we need to have a unified way of thinking about internal tools, external access, and how we do that with multiple data centers. Do we have to have IT running around? Can we have VPNs? And as we were thinking about this, we ran into the ability to start doing authentication at the edge. So that is one of the strategies we're thinking about and trying to understand: can we have a unified way of looking at all of our access points across internal and external applications by using Fastly?
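
A purely conceptual sketch of the kind of signed-token check that could be pushed out to the edge; this is ordinary Python rather than edge code, and the token format and secret handling are invented.

```python
# Conceptual sketch: validate a signed, expiring access token, the kind of
# check that could be pushed out to the edge. Names and format are invented.
import hashlib
import hmac
import time

SECRET = b"example-shared-secret"       # placeholder secret

def make_token(user: str, ttl: int = 3600) -> str:
    expires = int(time.time()) + ttl
    msg = f"{user}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{user}:{expires}:{sig}"

def verify_token(token: str) -> bool:
    try:
        user, expires, sig = token.rsplit(":", 2)
    except ValueError:
        return False
    msg = f"{user}:{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and int(expires) > time.time()
```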