Customer case study

Traveling light: how a simpler architecture — and the promise of edge serverless — put us in a position to win

Presented by

David Annez

Head of Engineering, loveholidays


When David Annez began as Head of Engineering at Loveholidays, the company was only using Fastly for image optimization. Over his first year, David led a full migration onto the platform, simply by moving the next right thing to the edge, again and again. In this talk, David shares actionable insights on how this performance-led engineering culture can enable businesses to drive down costs, simplify infrastructure, and increase conversion. He’ll also speak to the next phase of his team’s journey: realizing the true power of edge serverless, including a new experimentation platform running in Rust on Compute@Edge.


Video Transcript

David Annez (00:08):

Hey there, I'm David Annez, head of engineering at Loveholidays. And I'm here to talk to you about building for the future and focusing really on performance and driving the right things for the customer.

David Annez (00:20):

And just to give you a bit of context around who we are: we're probably the biggest travel company you've never heard of. Last year we were the fastest-growing company in the UK. We offer holidays across a set of countries, the UK, Ireland, the US and the Nordics, both package holidays and hotel-only holidays. We flew a million passengers last year.

David Annez (00:44):

To give you a bit of an overview on stats and where we're at from our website perspective: we get about 4,000 requests a second on the site, and millions of users a month, about five to ten. An interesting stat for us was the P95 origin time to first byte.

David Annez (00:58):

It's only 500 milliseconds. I want to talk a little bit about our growth. When you grow really fast as a company over the last eight years, you tend to cut corners. You tend to focus on the now and not the future. When you do that, infrastructure tends to take a bit of a hit, and you can end up in a place that's not where you want to be.

David Annez (01:20):

So when I joined, we started looking at how we build for the future and how we build the right infrastructure. And we doubled down on performance to actually start driving a really, really great impact. The travel industry has suffered quite a bit recently, so the goals we set ourselves, especially during these times, were these. We wanted to save thousands on infrastructure and make sure that we weren't overspending where we didn't need to.

David Annez (01:48):

And we wanted to focus on conversion rates. With traffic down and with customer bookings down, we wanted to drive a better conversion rate and ensure we were getting more bookings. And finally, travel is going to come back. We want to prepare for that, and we want to ensure that we're enabling stability and scalability in the platform for the future. To set some context to begin with: we went from what you can see currently on the screen, a very complex infrastructure where, in fact, we weren't even using Fastly as a content delivery network, we were using it as a really dumb caching mechanism.

David Annez (02:22):

And there was a lot of indirection with this. There was a lot of complexity and a lot of instability, primarily because we were using some very legacy systems that a lot of the engineers didn't understand. So there wasn't a lot of safety in changing things and actually trying to improve these pieces, because no one really knew how they worked. And we moved all of that. We shifted all of our core logic over to Fastly and started using Fastly as our content delivery network, and this made everything a lot easier and much, much better to manage. And engineers actually felt a lot more comfortable with the current situation.

David Annez (02:59):

And you can see the difference that we had just from our First Contentful Paint and the performance that we had. The impact was massive, right? I think we had about 180% improvement on our First Contentful Paint when we switched over to Fastly. And we switched over all of that complexity onto the edge. But it's not just about performance, it's important that we talk about specifically what we've achieved in the business and actually focusing on those business metrics.

David Annez (03:28):

So we drove up conversion rate by 11% and we saved over two terabytes a month on bandwidth. And with this shift in complexity, and ensuring that Fastly was delivering most of our content, we had about 30% fewer outages. So we're constantly looking at that view from the business side. I want to start with this statement: by focusing on performance, you cater for scalability, stability, and cost reduction.

David Annez (03:57):

I think we built this performance culture in the business, and performance has that knock-on effect across the board where you start thinking about, well, how do you make your applications faster? Well, if they're faster, then they can probably handle more connections. And additionally, if you start thinking about performance, you start wanting to reduce the size of your JavaScript bundles, which saves you money from an infrastructure perspective.

David Annez (04:19):

So it's all connected, but most importantly, performance is directly correlated with conversion rate. And we did this analysis right at the start, when we were starting to look into this. We wanted to make sure we were coming armed with the right information, where we can see that time to interactive is directly correlated with conversion: the better the time to interactive, the better the conversion rate. But before we specifically talk about some of the things that we did and implemented at Loveholidays to drive that performance, we need to ensure that we're set up to measure all of these things.

David Annez (04:55):

I think it's really important to try and gather as much information and as much data as possible upfront, and prepare yourself to actually start looking at that over time and seeing those step changes. And here we can see that massive step change in our real user monitoring, which we capture using the Chrome Web Vitals and the difference that we'd made on performance here.

David Annez (05:16):

And to talk a little bit about logging and actually capturing that information: there's a very good presentation from Hooman back in 2015, around measuring CDN performance and why you're probably doing it wrong. And one of the things that he says there around real user monitoring is that the data infrastructure is not trivial to implement.

David Annez (05:40):

And I think that actually, in 2020, data infrastructure is trivial. With Fastly being able to stream logs into systems like BigQuery, you can start getting that live analysis on logs, and additionally start looking at the long-term analysis and understanding. I think the most important part to talk about here, as well, is the cost. This is accessible to most of us. To give you a bit of context: we spend about $50 a month on streaming 750 gigabytes of data, which includes all of our information.

David Annez (06:14):

So if you wanted to cut that down and save even more money, you could. But at least we have that information so that we can start actually learning from it, and then start applying the next bits of performance improvement to the site. So now that we can measure these things, we want to start testing and learning. We want to start actually testing the performance improvements that we're making and seeing what the impact is on the business. And I think this is where your paid search team is your best friend. We couldn't appropriately A/B test our landing pages, because they were cached, and our A/B testing platform didn't fit into Fastly at that point. So we wanted to find ways of validating the value to the business, that actually focusing purely on performance is going to drive the right numbers.

David Annez (07:05):

And here you can see the massive step change between our test and our control, which was purely our performance tests around our landing pages and trying to make sure that we were continuously testing and learning and we can apply this. And I think this is a really great way for businesses to test and learn without having to spend a lot of money on maybe an A/B testing platform or trying to test SEO pages, which could potentially have detrimental effects.

David Annez (07:34):

So I think that this was a great start for us to then start moving forward and focusing on improving the performance. Now that we've tested and learned this, we wanted to start driving even better performance and start thinking about, well, how do we reduce our costs? How do we start driving down the bandwidth usage of our pages? And I want to quickly talk about time to first byte versus client-side JavaScript.

David Annez (08:03):

Performance is everything, but if you forget that you're dropping three megabytes of JavaScript on your client, and you have 25 marketing tags as well, your time to first byte just doesn't really matter in the grand scheme of things, because you're still going to be delivering poor performance to the customer. And this is something that we'd been suffering from. So we were thinking, well, how do we start improving that experience, and how do we start delivering optimal experiences on each device? So here we start thinking about actually delivering optimal experiences per device. This has been spoken about over the last few years, and there are amazing systems that can do this. But we wanted to go a bit more granular, and in fact, we wanted to start integrating that into our cached pages.

David Annez (08:56):

So what did we do? Well, we took this concept of being able to identify modern versus legacy browsers using Browserslist. And then we started actually applying that regex in the VCL and started caching different variations. With this, we saved terabytes a month on bandwidth, and we delivered a significantly more performant experience for modern browsers, because we didn't need to polyfill. We didn't need to add anything into those modern browsers. And then, I say "new" here, but I'm not really sure it's new anymore.
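The modern-versus-legacy split described here can be sketched in Rust (the language the team later used on Compute@Edge). This is a minimal illustration, not loveholidays' implementation: the real setup applies a Browserslist-generated regex in VCL, and the browser names and version thresholds below are assumptions.

```rust
// Classify a User-Agent as "modern" or "legacy" so the edge can cache and
// serve a variant without polyfills to modern browsers. The minimum major
// versions here are illustrative stand-ins for a Browserslist query.
fn browser_variant(user_agent: &str) -> &'static str {
    let modern = [("Chrome/", 61u32), ("Firefox/", 60), ("Edg/", 79)]
        .iter()
        .any(|(token, min_major)| {
            user_agent
                .split(*token)                                   // find "Name/" token
                .nth(1)                                          // text after the token
                .and_then(|rest| rest.split(|c: char| !c.is_ascii_digit()).next())
                .and_then(|major| major.parse::<u32>().ok())     // major version number
                .map_or(false, |v| v >= *min_major)
        });
    if modern { "modern" } else { "legacy" }
}

fn main() {
    println!("{}", browser_variant("Mozilla/5.0 Chrome/96.0.4664.45 Safari/537.36"));
}
```

The returned variant name would then be folded into the cache key (for example via a custom header named in `Vary`), so each browser group gets its own cached copy of the page.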

David Annez (09:28):

We really should be looking at adopting better ways of compressing things, and Brotli is the one now, right? I haven't seen enough companies adopt Brotli from their origins to start compressing their responses. And I don't think it's something that we should be ignoring, because actually it's widely supported, it is fast, just as fast as [inaudible 00:09:53], and the main consideration around it is actually probably CPU usage.

David Annez (09:59):

But when we implemented this, we didn't see any additional costs, primarily because our [inaudible 00:10:06] could cope with that change in CPU. So there was a huge benefit to this, right? We saw 20% smaller responses across the board from origin. And if you can then add Fastly's Brotli support on top, which is in beta, then you start really driving even better performance to the client. We've improved the performance. We've driven down our bandwidth usage.
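As a sketch of the negotiation involved, here is a minimal Rust function that prefers Brotli when the client advertises it in `Accept-Encoding`. It is deliberately simplified (it ignores q-values, which a production implementation should honor):

```rust
// Pick the best content encoding the client supports, preferring Brotli
// ("br") over gzip, and falling back to no compression ("identity").
// Simplification: q-values like "br;q=0" are not parsed here.
fn pick_encoding(accept_encoding: &str) -> &'static str {
    let offered: Vec<&str> = accept_encoding.split(',').map(str::trim).collect();
    for preferred in ["br", "gzip"] {
        if offered.contains(&preferred) {
            return preferred;
        }
    }
    "identity"
}

fn main() {
    // A typical browser sends: Accept-Encoding: gzip, deflate, br
    println!("{}", pick_encoding("gzip, deflate, br"));
}
```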

David Annez (10:31):

And now we want to start thinking about caching everything. This quote from Hooman really, really struck a chord with me at Altitude last year, talking about how TTLs generally reflect how nervous you are about serving stale content. And I think that was very true of the systems that we were running at that point, where we felt like a lot of things were dynamic and never cacheable.

David Annez (10:57):

And I think a lot of us in this room feel the same, right? Like, we probably have a lot of that. And I'm not really sure all of us actually apply the right caching mechanisms all the time. So we started looking at our infrastructure and we started saying, well, actually, can we cache any of this? Can we take this, apply some rules, filter down to only some query strings, and cache them on Fastly?
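The query-string filtering idea can be sketched like this in Rust: keep only the parameters that actually change the response, and sort them so equivalent URLs collapse onto one cache object. (Fastly's VCL offers `querystring.filter` and `querystring.sort` for this at the edge; the parameter names below are made up for illustration.)

```rust
// Normalize a query string for use in a cache key: drop parameters not in
// the allowlist (tracking params etc.) and sort the rest, so that URLs
// differing only in irrelevant or reordered params share one cache object.
fn normalize_query(query: &str, keep: &[&str]) -> String {
    let mut params: Vec<&str> = query
        .split('&')
        .filter(|p| keep.contains(&p.split('=').next().unwrap_or("")))
        .collect();
    params.sort();
    params.join("&")
}

fn main() {
    // Hypothetical search-page params; "utm_source" is stripped.
    let keep = ["destination", "nights"];
    println!("{}", normalize_query("utm_source=ad&nights=7&destination=crete", &keep));
}
```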

David Annez (11:21):

And we started realizing that a lot of it was cacheable. Maybe not for a long time, but at least we could actually start applying that performance benefit and utilizing the Fastly edge platform even more. So when we started doing this, we actually saw this massive step change in our origin time to first byte. And this is our anomaly detection system currently running, where I got an alert and I was worried that we'd broken something.

David Annez (11:52):

But in fact, what it was alerting us to was that we went from 80 to 100 milliseconds time to first byte down to 10 to 20 milliseconds, with the changes in the caching that we did. So a big, big step change. And this wasn't just about performance: we were also removing a huge amount of load from our Solr index and actually ensuring that we were saving costs a couple of levels down in our ingresses. So we've spoken a little bit about measuring. We've spoken about how, from that measurement, we start testing and learning, and then the things we can apply that improve performance, from Brotli to browser and device detection. And now we've started really thinking about, well, what can we do next?

David Annez (12:36):

What's the future? What does the future hold for us and with Fastly? This is where we started to think about experimentation. We want to actually start A/B testing properly on the edge. We couldn't do so before, because our landing pages were cached. We couldn't do so because we actually use our own A/B testing platform, and we wanted to make sure that we were still using that, because we have some complex configuration there. And we wanted to really emphasize the performance piece here, right? We didn't want to damage that time to first byte that we'd optimized drastically on Fastly. And so this is where Compute@Edge came into play. We actually implemented our A/B testing platform in Rust in a couple of days, and then we service-chained it to our original Fastly service.

David Annez (13:29):

And it is now actually operating in production. We create an assignment based on your user ID, and that assignment then becomes a cache key for all of our pages. You can see here the sheer performance that we're getting out of this system, right? Instead of writing our own system that we would then run in front of Fastly, we utilized Compute@Edge to drive this. And there's no difference whatsoever between a landing page that's running experiments and a landing page that isn't.
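The assignment-as-cache-key idea can be sketched roughly as follows. The hash function and the even split across variants are assumptions for illustration, not loveholidays' actual implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Deterministically assign a user to an experiment variant: the same user
// always lands in the same bucket, and the variant name can double as part
// of the cache key so each variant's pages are cached independently.
fn assign_variant(user_id: &str, experiment: &str, variants: &[&str]) -> String {
    let mut hasher = DefaultHasher::new();
    // Hash user and experiment together so buckets differ per experiment.
    (user_id, experiment).hash(&mut hasher);
    let bucket = (hasher.finish() as usize) % variants.len();
    variants[bucket].to_string()
}

fn main() {
    // Hypothetical experiment name and variants.
    let v = assign_variant("user-123", "landing-page-test", &["control", "test"]);
    println!("{}", v);
}
```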

David Annez (14:00):

This is really exciting, and this is only the beginning. This is doing some very basic A/B testing, but then we start thinking, well, actually, where do we take this next? I'm just really excited about what is next. I think the most exciting part is going to be implementing multi-armed bandit testing. We're eagerly awaiting Compute@Edge releasing the ability to have data structures there, where we can then shift all of our experiment config into Compute@Edge. But then we can start changing that config depending on how we're converting.
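As one illustration of what bandit-style traffic allocation could look like, here is a hypothetical epsilon-greedy sketch in Rust. It is one simple strategy among many, and nothing here reflects what Compute@Edge itself ships; all numbers are illustrative.

```rust
// An arm of the bandit: one experiment variant and its observed stats.
struct Arm {
    name: &'static str,
    impressions: u64,
    conversions: u64,
}

impl Arm {
    fn rate(&self) -> f64 {
        if self.impressions == 0 { 0.0 }
        else { self.conversions as f64 / self.impressions as f64 }
    }
}

// Epsilon-greedy: with probability `epsilon`, explore a random arm;
// otherwise exploit the best-converting one. `random` stands in for an
// RNG draw in [0, 1) so the function stays deterministic to test.
fn choose_arm<'a>(arms: &'a [Arm], epsilon: f64, random: f64) -> &'a Arm {
    if random < epsilon {
        // Explore: reuse the remaining randomness to pick any arm.
        let idx = ((random / epsilon) * arms.len() as f64) as usize;
        &arms[idx.min(arms.len() - 1)]
    } else {
        // Exploit: pick the arm with the highest observed conversion rate.
        arms.iter()
            .max_by(|a, b| a.rate().partial_cmp(&b.rate()).unwrap())
            .unwrap()
    }
}

fn main() {
    let arms = [
        Arm { name: "control", impressions: 1000, conversions: 30 },
        Arm { name: "test", impressions: 1000, conversions: 45 },
    ];
    println!("{}", choose_arm(&arms, 0.1, 0.5).name);
}
```

Over time this shifts most traffic to the better-converting variant while still spending a small fraction of requests checking the alternatives.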

David Annez (14:33):

And I think that's going to be really, really cool, when we can start changing our traffic allocations and running multi-armed bandit tests at scale on the edge, and constantly improving that conversion rate for our customers. We've spoken a lot about performance and some of the things that maybe you can start looking at applying in your company. We've gone from the measurement, to the application, to the future. And I'm just really excited to see what more we can do with Compute@Edge, to continuously focus on performance, drive the best conversion rate, lower our costs, and ensure that we're actually able to scale for the future, especially when travel comes back.