Rack and Roll: How Fastly’s network grows with purpose | Altitude NYC 2019

Go behind the scenes with Fastly's Senior Director of Hardware Systems Engineering, Davin Camara, and Supply Chain Manager, Kat Diamantine, to learn how our new “Rack and Roll” POP delivery program lets us strategically choose future POP locations and build, test, and ship our hardware.

Davin Camara (00:04):

Hi. So, thank you everyone for taking the time today. We're going to be very different. You've been hearing about serverless, you've been hearing about all this. We're going to talk about a server. We're going to talk about some physical hardware and we're going to go over kind of how Fastly deploys, but first, just introductions real quick. So my name is Davin Camara. I'm Senior Director, Hardware Systems Engineering. I lead an amazing team of hardware engineers, software engineers, and sourcing and supply chain experts in sourcing and building Fastly's hardware platforms throughout the world.

Kat Diamantine (00:34):

And I'm Kat Diamantine, Supply Chain Manager at Fastly. I lead our Global Supply Chain Operations, managing all of our vendor engagements, and planning and executing our quarterly and annual budgets.

Kat Diamantine (00:47):

Right. We can skip that slide, too, because we already just did that so fast. Fastly's network growth over the past couple of years has been absolutely amazing. You're seeing this map on the screen right now. As of September 30th, every one of these dots represents at least one POP and, a lot of times, many physical POPs throughout these countries. And now each one of these POPs on here is easy. It's a dot on the map. But each and every one of these is a physical facility that we're in where there's physical servers and hardware and network equipment and cabling and different customs, different requirements, different facility requirements. We're doing different things. Each and every one of these sites is a little bit different. So as we build Fastly's infrastructure, we have to understand all those requirements and build into those.

Kat Diamantine (01:33):

Within the web-scale world, deploying into large data centers is something that's been done for a long time. You go into a web-scale data center, deploy 200 to 300 racks, kind of roll them in over a period of time. We're trying to apply those models across the globe in a ton of smaller physical facilities. So what we're trying to do and what we've done over the past about a year now, is taking those learnings, taking those lessons that the industry has found and morphing them and applying them to Fastly's model and to Fastly's network infrastructure.

Kat Diamantine (02:03):

So first we're going to talk about foundations, and we're not building houses here or anything like that, but hear me out on it a little bit. So the foundation in any building is the most critical first step to building it. It takes a lot of design, takes a lot of planning. You need to know what's built above it, you need to have a good idea. There are a lot of details that go into it. The foundation, unlike a door, a window, or even a wall within a facility, is hard to change. You mess it up, ripping it all out again is kind of a hard challenge. Redoing it or resizing it is kind of a challenge too, but it's the critical layer for everything that's sitting above that foundation, and it supports everything.

Kat Diamantine (02:44):

So you build the foundation right, the building will sit there and stand strong. It'll be able to support everything you're trying to do and more. It'll stand the test of time. It'll sit there and you won't have many failures. You won't have many struggles with it. But if you misplan it, you might have some problems.

Kat Diamantine (03:04):

I mean, best-case scenario, you're having to replace your foundation earlier. You're having to sit there and tweak and tune the foundation. Worst case scenario, your foundation's failed and you have to fully replace it and rip it up again. So Fastly's foundation is our global POP footprint. All the servers, the hardware, the memory, the SSDs that's deployed. So, that's our critical foundation. Today we're going to talk about what's in now and what we do to put that all together.

Kat Diamantine (03:33):

So let me paint you a picture of Fastly's past before we dive into our present and our future. Prior to Rack and Roll, supply chain would go out and we would place large quarterly orders with all of our hardware suppliers to take advantage of volume discounts. They would then ship all of that hardware to a third party warehouse where it would be staged for future use. Once we were ready to deploy a site, we would send a BOM, or a bill of materials, to the warehouse.

Kat Diamantine (04:01):

They would then go out onto the warehouse floor. They would tear down all of those pallets and pick and pack specific to the build's needs. So there's a lot of manual labor that goes into this process, a lot of auditing, inventory management, and it was clear to us that with how quickly Fastly wanted to grow, this model just wasn't scalable. And most importantly, there was no way for us to validate that the hardware sitting in the warehouse would work once it was received on site.

Kat Diamantine (04:31):

So, the photos represented on the screen are of this old model of deployment. So, it might look neat and organized, but when you realize that the standard Fastly build has over 900 pieces in it, ranging from servers to networking switches to transceivers and cables, that whole workflow that I just described takes on an entirely new life. And for a build to be successful, all of those components packed in separate boxes and pallets need to make their way to the site. They need to be installed correctly and work the first time. Fastly's data center infrastructure team would then fly around the globe. They would get to the data center, collect the pallets. They would unbox all the hardware, de-trash, and then go about inventorying to make sure that what was received on site is representative of the BOM for that build.
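The on-site inventory step described above amounts to reconciling what was received against the BOM. A minimal sketch of that comparison follows; the part names and quantities are made up for illustration and are not Fastly's actual BOM, and `reconcile` is a hypothetical helper, not a Fastly tool.

```python
from collections import Counter


def reconcile(bom, received):
    """Compare expected part counts (the BOM) with what was counted on site.

    Returns a dict of part -> (expected, received) for every mismatch.
    """
    expected = Counter(bom)
    counted = Counter(received)
    mismatches = {}
    # Look at every part that appears on either side, so that both
    # shortages and unexpected extras are reported.
    for part in sorted(set(expected) | set(counted)):
        if expected[part] != counted[part]:
            mismatches[part] = (expected[part], counted[part])
    return mismatches


# Hypothetical, tiny BOM; the talk says a real build has over 900 pieces.
bom = {"server": 16, "switch": 2, "transceiver": 64, "cable": 96}
received = {"server": 16, "switch": 2, "transceiver": 62, "cable": 96, "rail-kit": 1}

issues = reconcile(bom, received)
for part, (want, got) in issues.items():
    print(f"{part}: expected {want}, received {got}")
```

With hundreds of line items spread over many boxes and pallets, any single mismatch found this late means waiting on a replacement shipment, which is the risk the rest of the talk is about removing.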

Kat Diamantine (05:27):

They would then begin racking and stacking and cabling all the hardware. And they were the first ones to validate if all the hardware was working. So for those of you out in the crowd that have worked with server hardware before, you know that DOAs happen. And by DOAs I mean dead on arrivals. So thinking back to Fastly's network, we are deploying to the edge of the internet. So, when we have a build underway in Sao Paulo, Brazil, or Johannesburg, South Africa, or Mumbai, India, and whoops, we find out we have a DOA, getting replacement parts into those countries and regions becomes a much longer and more arduous process.

Kat Diamantine (06:10):

So when that happens, our deadlines end up getting pushed. So in the old model, we have multiple shipments, all of which are critical to launching a site. So if even one box or pallet gets damaged during transit, we're again at serious risk of missing our deadline. We're also managing different freight forwarders and IOR agents. And a lot of the times our freight forwarder will outsource a portion of the shipment to a third party agent who's more familiar with the region or country that we're trying to clear assets into.

Kat Diamantine (06:47):

Now, normally that process works beautifully. But when misses happen, when we have a delay due to bad weather or damage occurs because our freight is making its way through various airports, this all leads us to having delays. But through it all, we've created a work of art and been able to grow an incredible global network. But now the question that we had to pose ourselves was how do we scale this and should we scale in this way?

Kat Diamantine (07:27):

So for those of you who attended Altitude last year, this slide might look familiar, and to quote Tom Daly, our SVP of Infrastructure, working harder is not a strategy. So how do we improve the way we deploy and go faster while building a stronger foundation while also reducing the amount of back and forth handoffs between teams and guaranteeing that hardware is tested before it arrives to the site?

Kat Diamantine (07:55):

Before I answer this, let's first take a quick peek into Fastly's hardware design lifecycle. Fastly's hardware design lifecycle is comprised of four buckets: Design for Function, Design for Deployment, Design for Cost and Design for Supply. So looking at the first one, Design for Function, we have to focus on what our internal engineering teams need to continue to develop Fastly's products and features.

Kat Diamantine (08:24):

So with this in mind, we go out and we have conversations with our hardware suppliers and we learn about their product roadmaps, but we're also educating them on what Fastly needs to see from them in the future. So, we have built incredibly dense boxes. One of them is on stage. I won't go into the specifics because Davin is actually going to talk about that later in the presentation. But we've been able to create an incredibly small but powerful global footprint that we can then deploy to the edge of the internet.

Kat Diamantine (08:59):

Design for Deployment. Here's where Rack and Roll comes in. Rack and Roll is key to pushing Fastly into its next chapter. Again, working smarter, not harder. Rack and Roll has optimized our deployment model for increased efficiency across the wider Fastly engineering organization. Next, we look at Design for Cost. One of my favorites. So, Fastly evaluates hardware on a 12- to 18-month cadence, but during that time we're constantly evaluating new parts and components. So if we see a shift in market trends or a tightening of supply on a certain SKU, we can easily pivot to ensure that our deployment timelines aren't negatively affected.

Kat Diamantine (09:45):

We also have direct relationships with all of our hardware suppliers, so not just our server integrators and our networking supplier. We have direct pricing agreements down to the smallest component on our BOM. So that includes CPUs, transceivers, and SSDs, which really give us a lot of flexibility around the cost of our architecture. And finally, Design for Supply. Fastly looks at hardware design to also take advantage of commodity parts.

Kat Diamantine (10:20):

Now when I say commodity parts, I'm talking about SKUs that our OEMs or ODMs are buying in large quantities. They're sometimes a catalog SKU. Now there are a lot of fancy widgets and parts out on the market that do offer really favorable features. But we have to balance the cons of going with those custom SKUs instead of a commodity part. Two of the main ones are supply scarcity and then cost, because we can't leverage our OEMs' and ODMs' bulk buying power to our advantage.

Kat Diamantine (10:59):

So out of Fastly's hardware design lifecycle came Rack and Roll. Rack and Roll is our solution for building at the edge in the future. It is four fully built, fully integrated racks that are shipped as a unit. These have all of our networking, all of our servers, all of our cabling, power, everything packed up in a shippable rack that can be shipped all over the world. So this is Fastly's building block for how we build our POPs in a repeatable, fast fashion, in the least time humanly possible at the site.

Kat Diamantine (11:32):

So each one of our racks is 16 servers in general. And again, everything is all included within this rack. So everything about this rack is designed for shippability, serviceability, and deployment throughout the world. Everything from the rack and the weights it can support to the shipping pallet and the shock that it can take. These pallets go on planes, trains, trucks, anywhere in the world to be able to get them to our data center.

Davin Camara (11:59):

Now, this would be the point where I'd roll four racks up onto the stage and build a POP. And honestly, in about the time of this talk, we could cable up a POP and probably apply power to it. But I don't think I can pull that off here. So instead we have Fastly's Generation 6 Cache that we'd like to go over real quick. I made a joke earlier that this would be like a normal hardware conference and I'd lift this up and hold it here. The server weighs a ton 'cause there's a ton of stuff in it. So if anyone wants to go and look in it afterward or talk about it, we're happy to walk through everything within this, but Fastly's Generation 6 Cache is dual Intel 6140 CPUs and over 768 gigabytes of RAM, about as much as we can cram into the box that's cost-effective. Twelve 2 TB SSDs, and that's just within this model. We have other models with twelve 8 TB SSDs too, depending on where you are in the world. And then 100 gigabits a second out the back via four 25 Gig ports. So this box went through, as Kat showed earlier, Fastly's full design lifecycle to make sure it's the most effective box that Fastly can use to deliver your content and your services on our platform, and can meet the needs while also being cost-effective.
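To put the per-server figures in context, here is a rough back-of-the-envelope sketch that scales them up to one four-rack Rack and Roll unit. It assumes exactly 16 servers per rack (the talk says "in general"), reads the RAM figure as 768 GB (the talk says "over 768", so this is a lower bound), and uses the twelve 2 TB SSD model. The class and field names are illustrative only, and the totals are nominal hardware sums, not delivered throughput or usable cache capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheServer:
    """Nominal per-server figures quoted in the talk (2 TB SSD model)."""
    ram_gb: int = 768      # "over 768", treated here as a lower bound
    ssd_count: int = 12
    ssd_tb: int = 2
    nic_ports: int = 4
    nic_gbps: int = 25

    @property
    def storage_tb(self) -> int:
        return self.ssd_count * self.ssd_tb

    @property
    def network_gbps(self) -> int:
        return self.nic_ports * self.nic_gbps


SERVERS_PER_RACK = 16   # assumption: "16 servers in general"
RACKS_PER_UNIT = 4      # four racks shipped as one unit

server = CacheServer()
servers = SERVERS_PER_RACK * RACKS_PER_UNIT

totals = {
    "servers": servers,
    "ram_gb": servers * server.ram_gb,
    "storage_tb": servers * server.storage_tb,
    "nic_gbps": servers * server.network_gbps,
}

for name, value in totals.items():
    print(f"{name}: {value}")
```

Under these assumptions a single unit holds 64 servers, which is why small changes to the per-server design multiply quickly across a POP.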

Kat Diamantine (13:08):

So now let's take you through Rack and Roll's first deployment. So because we kind of like a little bit of a challenge and timing worked out horrendously, Fastly's first Rack and Roll deployments were actually in the United Kingdom and in Dublin. So we deployed two sites to London and one site to Dublin. So these sites were to answer a need for an immediate capacity build within those regions to grow these regions. So like most hardware programs, Rack and Roll kind of went through a pathway. So from development to deployment, testing, shipping, and then finally launch. So we'll walk you through the steps here and show you just how Rack and Roll was deployed to these countries.

Kat Diamantine (13:49):

So during development, we're working with our OEM and ODM partners. So this is the point in time where everything that we're doing is on paper. Everything is easy to adapt, everything is easy to explore. So we're sitting there and trying numerous different combinations of hardware, different ideas, different concepts to see what will work, what won't work. This is usually about 12 months out from when we're looking to actually ship our first hardware platform.

Davin Camara (14:14):

Revisions are easy at this point. So you have a lot of daydreaming, you have a lot of, "Hey, what if we did this?" in our conversations. You have a lot of different engineering teams really excited about putting some expensive widget into the platform, one that usually winds up not going into the box, so we can have those conversations. Remember what Kat said earlier: Fastly's POPs have over 900 items. That's not even completely including all the components in each one of the servers. Again, 12 SSDs, tons of RAM, and every other component within these servers. So there's a lot of different combinations of hardware that we're looking at.

Kat Diamantine (14:46):

On top of that, we try to dual-vendor everything that we do. So you have multiple vendors that we're looking at to see how they work together, how those components work together, and how they're sourced and supplied. So Fastly is passionate about selecting specific components. For anyone that's built a server before: you go to Dell.com (and Dell is one of Fastly's providers of hardware), you choose some dropdowns, pick some SSDs, pick some memory. Hey, I have a server. Most of those SSDs, or most of those options you have, are what's called blended options. It's a great thing within the market: there are multiple vendors that produce a very similar part. They are blended together under one brand, so Dell brands them so that you can consume them and use them.

Kat Diamantine (15:29):

It helps supply chain, helps you from a resiliency standpoint. Problem with us is we're so passionate and so detail-oriented about how our SSDs perform, how our memory performs in this platform, we don't accept those. Every component is tested by Fastly before it's accepted into our platform. It's certified and tested by us and it's also certified and tested by our vendors.

Kat Diamantine (15:47):

So we have a lot of iterations we go through within these boxes: getting parts that we like, that our vendors may not have on their roadmaps, into their roadmaps, and testing and qualifying them within the platforms themselves. So you're also looking at airflow, cooling, fan curves to make sure that this server can function within our POPs for the lifespan that it's running.

Kat Diamantine (16:09):

So we're now to the build phase. One of the main benefits of working with integrators is they have years and years of experience building and testing these racks. So they're going to be using best practices that sometimes haven't occurred to us as we were scoping out the project. So if you hadn't gleaned from photos earlier in the presentation, Fastly is fastidious about the smallest of details when it comes to our hardware and our build. So during the scoping process, we pulled together hundreds and hundreds of pages of build and design documents.

Kat Diamantine (16:47):

But no matter how much detail you put into a document, it's not always going to translate correctly into the real world. So at this point, the hardware systems team and other members from infrastructure conducted a first article inspection to audit the integration of our racks. While we were on this trip, we were able to meet face to face with the integration team and the logistics company that would be shipping the racks, as well as make some final tweaks and edits to the build before they went into their final testing stage, which includes a 24-hour burn-in.

Kat Diamantine (17:21):

So with all this work that we've put in so far, it'd be really ironic if we shipped something that just didn't work. So you have multiple layers of test and validation that go into every bit of hardware that ships out of Fastly. So two layers: one at our vendors' facilities before the rack's ever packed up and shipped out of the facility. And then the second layer is at our POPs, both from our vendor standpoint, so they check their work first, and then we check our work and check their work. So trust but verify is a model that we follow, because at the end of the day it's you, our customers, that are going to be using this environment. We're passionate about making sure that the environment is correct, it's right, and honestly that we don't have to service it again for a little while. These sites are throughout the world and scattered everywhere.

Davin Camara (18:00):

Having to re-roll a person or work through Smart Hands, if anyone's ever worked with Smart Hands at a data center before, can be extremely frustrating at times. We want to avoid that as much as possible. So we have one shot to get this right. We want to make sure that we get it right. So, as Kat said before, dead on arrivals, DOAs, happen. Failures happen within hardware. We can engineer the living daylights out of this box and brace things to make sure it ships right. But at the end of the day, failures happen. We have to accept that they happen and find them early, before we put production traffic on it. Most failures that a server platform would see in its first 24 months can be made to happen with stress testing in the first 12 hours of operation. So you just need to tease them out of the system, for lack of a better way of saying it.

Davin Camara (18:42):

So internally to Fastly, our test and validation system, which is actually under development right now and has gone through its first proof-of-concept deployments with our first Rack and Roll sites, is what's called Buildbot. So Buildbot is our tool and our infrastructure to be able to build our racks in the future. So this is a platform that'll automate everything from receiving that rack, to identification of hardware, to stress testing the hardware, all the way up through OS application and handing it off into our larger application stack. So this is something that Fastly is developing in order to speed up our build timelines and make them more repeatable and more consistent across the globe.
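The stages listed above (receive, identify, stress test, OS, handoff) can be pictured as an ordered pipeline that stops at the first failure. The sketch below is only an illustration of that idea: it is not Buildbot's actual design or API, and every function name, field name, and check in it is hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Rack = Dict[str, Any]
Stage = Tuple[str, Callable[[Rack], bool]]


# Hypothetical checks, one per stage named in the talk.
def received_intact(rack: Rack) -> bool:
    return not rack["tilt_sensor_tripped"]


def hardware_identified(rack: Rack) -> bool:
    return len(rack["servers"]) == rack["expected_servers"]


def stress_test_passed(rack: Rack) -> bool:
    return all(server["burn_in_ok"] for server in rack["servers"])


def os_applied(rack: Rack) -> bool:
    return all(server.get("os_ready", False) for server in rack["servers"])


STAGES: List[Stage] = [
    ("receive", received_intact),
    ("identify", hardware_identified),
    ("stress_test", stress_test_passed),
    ("os", os_applied),
]


def run_pipeline(rack: Rack, stages: List[Stage]) -> List[Tuple[str, str]]:
    """Run stages in order and stop at the first failure."""
    results = []
    for name, check in stages:
        ok = check(rack)
        results.append((name, "pass" if ok else "fail"))
        if not ok:
            break  # later stages are skipped until the fault is fixed
    return results


# A made-up three-server rack where one server fails burn-in.
rack = {
    "tilt_sensor_tripped": False,
    "expected_servers": 3,
    "servers": [
        {"burn_in_ok": True, "os_ready": True},
        {"burn_in_ok": False, "os_ready": True},
        {"burn_in_ok": True, "os_ready": True},
    ],
}

results = run_pipeline(rack, STAGES)
print(results)
```

Stopping at the first failed stage mirrors the point made in the talk: faults should be found before production traffic ever reaches the hardware.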

Kat Diamantine (19:24):

So the racks have finished their final phase of validation. They're then palletized and equipped with shock pallets and tilt sensors. And these are going to give us a greater level of confidence that, once the racks arrive to their final destination, we're going to have a better understanding of the type of treatment they received along their journey. So Rack and Roll shipping methods are broken out based on region. They will either go via air or by dedicated truck. So you're probably all looking at the photo on the right side of the screens and thinking why dedicated trucks?

Kat Diamantine (19:59):

There's so much empty space that it seems kind of inefficient, but hear me out. The reason we've decided to go with the dedicated truck solution is that we don't want the racks making multiple stops along their journey before reaching their final destination. That's again just going to open up a lot of touchpoints for the racks to get jostled and damaged. And also, it makes tracking down where damage occurs a lot harder.

Kat Diamantine (20:27):

So we have hardware, it's sitting outside our data center site. What do we do next? Truck's there. We just roll it in. Not quite as easy as that, but just about. So when the racks hit the site, every one of these racks is unbolted off the pallet and rolled down this ramp. Now it's about a 2,500-pound rack. Rolling it down a ramp sounds really, really easy 'til you let go of it the first time. Then it becomes this moving lob of metal that's going to find whoever's standing in front of it. So there's the engineering that goes into those ramps, and the processes and procedures and SOPs that go into making sure we get the stuff off the pallet safely. It's only that high. Trust me, it's a challenge.

Kat Diamantine (21:06):

So, and on top of it, every data center is different. Every structure of a data center is different. Some have elevators we have to be challenged with. Some have raised floors that we have to be challenged with. First time you roll a 2,500-pound rack onto an elevator, I don't care what the nameplate of that elevator says, you're going to question it and you're going to watch that thing settle when you push that rack on top of it. It's an experience the first time. Don't be behind the rack, be in front of the rack.

Kat Diamantine (21:30):

But anyways, with that, a lot of validation goes in. A lot of data collection goes in before we ever visit these sites with this hardware. There's a whole survey process. There are teams that go onto these sites to make sure that we know how to deliver to the sites and that we know the requirements are met. We love programs like OCP's Ready Data Center Certification because it can kind of normalize data centers, but we understand that program isn't completely across the industry yet. So we do a lot of our own surveys, leaning off a lot of those requirements. So with this, this is where our white glove delivery teams from our vendors do an amazing job. The truck team that picked up that rack initially is the one that delivers it and brings it into our data center.

Kat Diamantine (22:12):

So with that, the racks are rolled in, secured down to the data center floor if it's needed for the area, bolted together, and then all of our inter-rack cabling, which has already been pre-cabled before the racks reach the site, is connected. So these four racks all interconnect, all labeled, ready to go. We interconnect the racks, apply power, launch the site. On top of that, all of the external connectivity is cabled up to these switches too. So, it's actually a funny story. The first time we designed this rack with our vendor, they looked at the switches and they're like, "Guys, this is a 64-port 100 Gig switch. You are like using a third of the ports. What's going on? That's a waste of money. That's a waste of time." 'Til they saw us deploy them out to the site and realized just how much external connectivity we bring into the sites.

Kat Diamantine (22:54):

All the yellow connectivity that you see on the screen there, is all single-mode connectivity out to our providers out to the IXs and transit partners for these racks. And then that's it. All the cabling has been done for these racks already. All the cables have been validated, labeled, tied back, PDUs are in there and gorgeous and are done just right. And on top of it, it's exactly the same way from site to site. It's repeatable. It's constant. When we call a tech, when we call someone to troubleshoot a site, we know just how it's set up, just how it's laced, just how it's run.

Kat Diamantine (23:30):

So at this point, Fastly's data center infrastructure team will come on-site and they'll be the ones that will finalize cabling and transit connections, and triple-check that all hardware is working and reporting correctly. Then Buildbot, that Davin just mentioned, is run to build, test, and validate all of the systems within the rack. At this point, the site is ready for handoff to our edge cloud operations team. Before the launch notifications go out, ECO takes the site to its final launch phase, beginning with bootstrapping and provisioning. So there are a lot of transit connections that come into the top-of-rack switches and ECO is instrumental in turning those circuits up. Once all of the software, features, and applications are installed, ECO then brings the site to production readiness. They modify traffic routing in the region to bring traffic to the new site. And at this point, the notifications go out that announce our new POP.

Kat Diamantine (24:30):

Rack and Roll has drastically reduced the amount of time, money, and effort spent prior to go-live. It's not only reduced supply chain risk, it's increased hardware quality and allowed for cleaner transitions during handoff. And there you have it, a new Fastly POP. This graph is taken from our recent Dublin site and it shows how traffic currently being served by POPs in neighboring regions is shifted to the new site, bringing our amazing customers' content even closer to the end users. And that is Rack and Roll.