In this exclusive article, Fastly Principal Developer Advocate Andrew Betts — formerly of the Financial Times — walks you through how to dramatically reduce bandwidth consumption for a site hosting versioned downloadable assets.
Andrew Betts
Andrew is a Web Developer and Principal Developer Advocate for Fastly, where he works with developers across the world to help make the web faster, more secure, more reliable, and easier to work with. He founded a web consultancy which was ultimately acquired by the Financial Times, led the team that created the FT’s pioneering HTML5 web app, and founded the FT’s Labs division. He is also an elected member of the W3C Technical Architecture Group, a committee of nine people who guide the development of the World Wide Web.
Requesting the difference between two previously cached files, using just a CDN configuration and a serverless cloud compute function, is a great example of exploiting edge and serverless compute services to make your website more efficient and performant, and lower your bandwidth costs.
In this article, we’ll present a solution which, for a site hosting versioned downloadable assets such as software, documents, and saved games, can reduce bandwidth consumption dramatically.
Traditionally, CDNs were only useful for caching assets closer to your users. But today, modern CDNs like Fastly’s can be used to perform many activities you may have previously implemented in your own infrastructure. Some of these are products you can sign up for as add-ons to your CDN service, while others you can build yourself by deploying your own configuration code directly to edge points of presence (POPs). With Fastly, for example, you can push a config update via web interface or API to all our global edge locations in under five seconds.
In the spirit of using the edge more intelligently, I was recently downloading packages using the npm package manager,1 and realized that although I often have a previous version of a package already installed, npm has to download the entire tarball for the new version if installing an update to a module. This seems very inefficient.
Take my open source service, Polyfill.io.2 It is published as an npm module, the latest version of which is 11MB gzipped, and 99MB uncompressed. Using bsdiff,3 we can produce a patch to summarize the changes from the penultimate version to the latest:
```sh
$ bsdiff polyfill_io-3.16.0.tar polyfill_io-3.17.0.tar polyfill_io-3.16.0...3.17.0.patch
$ ls -lah
total 424
drwxr-xr-x  5 me staff 170B 18 Apr 15:55 .
drwxr-xr-x 14 me staff 476B 18 Apr 16:32 ..
-rw-r--r--  1 me staff 209K 18 Apr 17:27 polyfill_io-3.16.0...3.17.0.patch
-rw-r--r--  1 me staff  99M 18 Apr 15:54 polyfill_io-3.16.0.tar
-rw-r--r--  1 me staff  97M 18 Apr 15:53 polyfill_io-3.17.0.tar
```
So if the client already has 3.16.0, getting to 3.17.0 could be done with a download of only 209KB, a mere 1.8% of the full 11MB (gzipped from 99MB) that you’d otherwise need for the full tarball. However, module hosting services like npm typically store their modules on a static hosting environment like Amazon S3 or Google Cloud Storage, so there is limited or no ability to add this kind of dynamic content generation feature, and pre-generating a diff between every pair of versions of every module seems unlikely to be a good use of compute or storage resources.
A CDN that allows origin services to be selected based on characteristics of the request could be used to route “diff” requests to a patch-generating service. With Fastly’s CDN, we can do this in VCL (Varnish Configuration Language, which we make accessible to customers). First, define a special backend:
```vcl
backend be_diff_service {
  .dynamic = true;
  .port = "443";
  .host = "<<CLOUD-FUNCTIONS-HOSTNAME>>";
  .ssl_sni_hostname = "<<CLOUD-FUNCTIONS-HOSTNAME>>";
  .ssl_cert_hostname = "<<CLOUD-FUNCTIONS-HOSTNAME>>";
  .ssl = true;
  .probe = {
    .timeout = 10s;
    .interval = 10s;
    .request = "GET /healthcheck HTTP/1.1" "Host: <<CLOUD-FUNCTIONS-HOSTNAME>>" "Connection: close"
      "User-Agent: Fastly healthcheck";
  }
}
```
Now, we can decide on a special syntax to use for patch requests, and make a small addition to vcl_recv that detects this syntax and routes the request to the special backend:
```vcl
sub vcl_recv {
  ....
  declare local var.diffUrlPrefix STRING;
  declare local var.diffUrlSuffix STRING;
  # Match diff requests such as /module-name/-/module-name-1.2.3...1.2.4.tgz.patch
  # (the three dots between the versions are escaped so that they match literally)
  if (req.url ~ "^(/.*\/\-\/.*)\-(\d+\.\d+\.\d+)\.\.\.(\d+\.\d+\.\d+)(\.tgz)\.patch") {
    # Common prefix and suffix of the two tarball URLs to compare
    set var.diffUrlPrefix = if (req.http.Fastly-SSL, "https://", "http://") req.http.Host
      ".global.prod.fastly.net" re.group.1 "-";
    set var.diffUrlSuffix = re.group.4;
    # Send the request to the diff service instead of the normal origin
    set req.backend = be_diff_service;
    set req.http.Host = "<<CLOUD-FUNCTIONS-HOSTNAME>>";
    set req.http.Backend-Name = "diff";
    set req.url = "/compareURLs?from=" var.diffUrlPrefix re.group.2 var.diffUrlSuffix "&to="
      var.diffUrlPrefix re.group.3 var.diffUrlSuffix;
  }
  ....
}
```
npm’s downloads use URLs such as /module-name/-/module-name-1.2.3.tgz, so I’d like to also support /module-name/-/module-name-1.2.3...1.2.4.tgz.patch as a diff request. The regular expression in the VCL above captures the requests that fall into this category, and then the VCL builds the full URLs of the two tarballs to compare, switches the request to the diff backend, sets the Host header so we are sending that origin’s domain in the request to the service, and rewrites the URL into a call to /compareURLs that carries the two tarball URLs as its from and to parameters.
For more information on getting started with running your own VCL on the Fastly edge cloud platform, see our introductory guide to VCL.4
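To make the rewrite easier to follow, here is a rough JavaScript rendering of what the VCL does to the URL. It is only an illustrative sketch: the function name toCompareUrl and its parameters are hypothetical, and the pattern is a slightly stricter take on the one in the VCL (it also anchors the end of the path).

```javascript
// Hypothetical helper mirroring the VCL rewrite; not part of the Fastly service itself.
const PATCH_URL = /^(\/.*\/-\/.*)-(\d+\.\d+\.\d+)\.\.\.(\d+\.\d+\.\d+)(\.tgz)\.patch$/;

function toCompareUrl(path, host, secure) {
  const m = PATCH_URL.exec(path);
  if (!m) return null; // not a diff request: leave it for the normal origin

  const [, base, fromVersion, toVersion, ext] = m;
  // Same prefix the VCL builds: scheme, service hostname, module path and a trailing "-"
  const prefix = (secure ? 'https://' : 'http://') + host + '.global.prod.fastly.net' + base + '-';

  return '/compareURLs?from=' + prefix + fromVersion + ext +
         '&to=' + prefix + toVersion + ext;
}

console.log(toCompareUrl('/lodash/-/lodash-4.17.3...4.17.4.tgz.patch', 'differentnpm.com', false));
// /compareURLs?from=http://differentnpm.com.global.prod.fastly.net/lodash/-/lodash-4.17.3.tgz
//   &to=http://differentnpm.com.global.prod.fastly.net/lodash/-/lodash-4.17.4.tgz  (printed on one line)
```

Like the VCL, the sketch does not URL-encode the from and to values; a production version would probably want to.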
This is all very well, but the CDN cache nodes cannot generate diffs by themselves. This is a great use case for serverless compute services, such as AWS Lambda or Google Cloud Functions. We’ll use a Google Cloud Function to handle this.
If you want to use GCF and don’t have it set up already, Google have an excellent quick start guide5 that will get you up and running. The source of the cloud function that we need looks like this:
```javascript
const url = require('url');
const zlib = require('zlib');
const fetch = require('node-fetch');
const bsdiff = require('node-bsdiff').diff;

exports.compareURLs = function compareURLs (req, res) {
  Promise.resolve()
    .then(() => {
      // Fetch both files in parallel and buffer each one fully in memory
      return Promise.all(['from', 'to'].map(param => {
        return fetch(req.query[param])
          .then(resp => {
            // Last path segment of the URL, used to spot gzipped files by extension
            const name = url.parse(req.query[param]).pathname.replace(/^.*\/([^\/]+)\/?$/, '$1');
            const isCompressed = Boolean(resp.headers.get('Content-Encoding') === 'gzip' || name.match(/\.(tgz|gz|gzip)$/));
            // Gunzip where needed so that the diff is computed on the uncompressed data
            const respStream = isCompressed ? resp.body.pipe(zlib.createGunzip()) : resp.body;
            const bufs = [];
            respStream.on('data', data => bufs.push(data));
            return new Promise(resolve => {
              // 'end' fires once the readable side has delivered all of its data
              respStream.on('end', () => {
                resolve(Buffer.concat(bufs));
              });
            });
          })
        ;
      }))
    })

    // Create patch and serve it
    .then(([from, to]) => {
      const patch = bsdiff(from, to);
      res.status(200);
      res.send(patch);
    })
  ;
};
```
I’m using two public npm modules: node-fetch,6 which implements the now-standard WHATWG Fetch API in NodeJS (which at time of writing is not natively supported by Node), and node-bsdiff,7 which implements the amazing binary diff algorithm8 invented by Colin Percival.9
This code includes no error handling or validation, and we can also improve the patch response by adding appropriate Cache-Control information (the patch can be cached for as long as the least-cacheable of the two files being compared), and also by passing through any surrogate-key10 headers present on the input files. I’ve uploaded a more comprehensive solution to GitHub11 with comments, so feel free to make use of that.
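As a minimal sketch of the "least-cacheable of the two files" rule, the helper below derives a Cache-Control value for the patch from the Cache-Control headers of the two inputs. The function names maxAgeOf and patchCacheControl are hypothetical, and the sketch makes simplifying assumptions: it only looks at max-age, and it treats a missing header, no-store, no-cache, or private as "do not cache".

```javascript
// Hypothetical helper: extract max-age (in seconds) from a Cache-Control value.
// Assumption: anything without a usable max-age is treated as uncacheable (0).
function maxAgeOf(cacheControl) {
  if (!cacheControl) return 0;
  const value = cacheControl.toLowerCase();
  if (/\b(no-store|no-cache|private)\b/.test(value)) return 0;
  const m = /(?:^|[\s,])max-age=(\d+)/.exec(value);
  return m ? parseInt(m[1], 10) : 0;
}

// The patch may be cached only as long as the shorter-lived of the two inputs.
function patchCacheControl(fromHeader, toHeader) {
  const ttl = Math.min(maxAgeOf(fromHeader), maxAgeOf(toHeader));
  return ttl > 0 ? 'max-age=' + ttl : 'no-store';
}

console.log(patchCacheControl('max-age=21600', 'max-age=3600')); // max-age=3600
console.log(patchCacheControl('max-age=21600', null));           // no-store
```

In the cloud function, the two input values would come from resp.headers.get('Cache-Control') on each fetched file, and the result could be set on the response before the patch is sent.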
To test the new endpoint, I invented differentnpm.com: a fictitious new domain name for the npm registry for which I could create a Fastly service, and I set it up with the real npm registry as its origin server. A request to download the full tarball of lodash 4.17.4, one of the most popular modules on npm, shows that the new service behaves like the npm registry:
```sh
$ curl "http://differentnpm.com.global.prod.fastly.net/lodash/-/lodash-4.17.4.tgz" -vs 1>/dev/null
< HTTP/1.1 200 OK
< Cache-Control: max-age=21600
< Content-Type: application/octet-stream
< Content-Length: 310669
< X-Served-By: cache-sjc3143-SJC, cache-sjc3628-SJC
< X-Cache: HIT, HIT
```
This request is routed to npm’s real registry, and results in a 310KB file (see the Content-Length header), and as we’d expect, is a cache HIT because this is a popular file so it’s likely to be available at the local CDN cache node.
However, this new registry also transparently supports the new diff URLs:
```sh
$ curl "http://differentnpm.com.global.prod.fastly.net/lodash/-/lodash-4.17.3...4.17.4.tgz.patch" -vs 1>/dev/null
< HTTP/1.1 200 OK
< Cache-Control: max-age=21600
< content-type: application/octet-stream
< Content-Length: 1207
< Connection: keep-alive
< X-Served-By: cache-sjc3132-SJC
< X-Cache: HIT
```
Here, the request for the difference between lodash 4.17.3 and 4.17.4 returns a patch of only 1,207 bytes, just 0.39% of the size of the full tarball.
Bsdiff ships with a companion bspatch tool, which can take the old file and the patch, and produce the new one:
```sh
$ ls -la
-rw-r--r-- 1 me staff 2254848 18 Apr 16:30 lodash-4.17.3.tar
-rw-r--r-- 1 me staff    1207 19 Apr 17:35 lodash-4.17.3...4.17.4.tgz.patch
$ bspatch lodash-4.17.3.tar lodash-4.17.4.tar lodash-4.17.3...4.17.4.tgz.patch
$ tar tf lodash-4.17.4.tar
package/package.json
package/README.md
package/LICENSE
package/_baseToString.js
```
To work out how useful this kind of thing could be, I made a list of npm’s most depended-upon modules,12 and for each one, gathered its monthly downloads, the size of its tarball, and the size of a patch that upgrades an earlier version to it.
One thing we can’t know from public data is how often a user has a prior version of a file in cache locally. Let’s look at the impact if that number were 5%, 15%, and 50%:
| Module | Downloads (1000s) | Size (bytes) | Monthly transfer (GB) | Patch size, abs (bytes) | Patch size, rel (%) | Monthly savings (GB), 5% cache ratio | Monthly savings (GB), 15% cache ratio | Monthly savings (GB), 50% cache ratio |
|---|---|---|---|---|---|---|---|---|
| lodash | 42,866 | 310,669 | 12,403 | 1,207 | 0.39% | 618 | 1,853 | 6,177 |
| request | 24,756 | 56,636 | 1,306 | 3,248 | 5.73% | 62 | 185 | 615 |
| async | 43,923 | 97,968 | 4,008 | 23,083 | 23.56% | 153 | 459 | 1,532 |
| express | 11,577 | 52,372 | 565 | 602 | 1.15% | 28 | 84 | 279 |
| chalk | 21,045 | 5,236 | 103 | 1,027 | 19.61% | 4 | 12 | 41 |
| bluebird | 14,327 | 135,089 | 1,803 | 2,669 | 1.98% | 88 | 265 | 883 |
| underscore | 12,229 | 34,172 | 389 | 6,879 | 20.13% | 16 | 47 | 155 |
| commander | 26,118 | 13,425 | 327 | 1,309 | 9.75% | 15 | 44 | 147 |
| debug | 45,226 | 16,144 | 680 | 588 | 3.64% | 33 | 98 | 328 |
| moment | 9,219 | 497,477 | 4,271 | 891 | 0.18% | 213 | 640 | 2,132 |
| Total (top 10 modules) | 251,286 | | 25,853 | | | 1,229 | 3,687 | 12,290 |
| Relative saving | | | | | | 4.75% | 14.26% | 47.54% |
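The savings columns can be reproduced with a short calculation: for the share of downloads where the client already has a prior version, the patch is sent instead of the full tarball. The sketch below uses the lodash row; the function name monthlySavingsGB is hypothetical, and the use of 2^30 bytes per GB is an assumption inferred from the figures in the table rather than something stated in the data.

```javascript
// Assumption: the table's "GB" is 2^30 bytes, which is what reproduces its figures.
const BYTES_PER_GB = Math.pow(2, 30);

// Data saved per month when `cacheRatio` of downloads can be served as a patch.
function monthlySavingsGB(downloads, fullBytes, patchBytes, cacheRatio) {
  const savedPerDownload = fullBytes - patchBytes;
  return (downloads * cacheRatio * savedPerDownload) / BYTES_PER_GB;
}

// lodash row: 42,866 thousand downloads, 310,669-byte tarball, 1,207-byte patch
const lodash = { downloads: 42866000, size: 310669, patch: 1207 };

for (const ratio of [0.05, 0.15, 0.5]) {
  const saved = monthlySavingsGB(lodash.downloads, lodash.size, lodash.patch, ratio);
  console.log((ratio * 100) + '%: ' + Math.round(saved) + ' GB');
}
// 5%: 618 GB
// 15%: 1853 GB
// 50%: 6177 GB
```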
Diff sizes obviously vary, and the most popular npm modules also tend to be quite small, but if 50% of npm’s module requests could be diffs, then this data suggests that they would eliminate almost that same percentage of their bandwidth.
Package managers are not the only type of business that could benefit from this. Android uses binary diffs to update apps from the Google Play store, and in any scenario where you need to send a user an update to something they already have, diffs can make your bandwidth use dramatically more efficient.
Whether or not your business can make use of this kind of solution, there are many ways you can get more value from your CDN. Whatever your scaling challenge, a globally distributed edge compute and caching network is often a key part of the solution, so choose one that gives you as much control and flexibility as possible.
1 https://www.npmjs.com/
2 https://polyfill.io/
3 http://www.daemonology.net/bsdiff/
4 https://docs.fastly.com/guides/vcl/guide-to-vcl/
5 https://cloud.google.com/functions/docs/quickstart
6 https://www.npmjs.com/package/node-fetch
7 https://www.npmjs.com/package/node-bsdiff
8 http://www.daemonology.net/bsdiff/
9 https://twitter.com/cperciva
10 https://docs.fastly.com/guides/purging/getting-started-with-surrogate-keys
11 https://github.com/fastly/diff-service
12 https://www.npmjs.com/browse/depended