Testing HTTP freshness in CDNs
CDNs all use HTTP caching to optimize performance, but different CDNs often do it in slightly different ways, and that can make things more complicated for our customers. This blog post makes the case for CDN interoperability and introduces a common test suite that helps identify differences between CDNs, to start paving the way.
## Why Test Freshness?
Per [RFC7234](https://httpwg.org/specs/rfc7234.html#expiration.model), “Freshness” is one of the core concepts of HTTP caching; fresh responses can be used without contacting the origin server, improving efficiency. Effectively, it’s a contract between the publisher and all downstream HTTP caches that amounts to “You can serve my content for this long before checking back with me.”
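For example, a hypothetical origin response like the one below declares a freshness lifetime of an hour; any cache along the way may serve its stored copy for up to 3600 seconds before it has to check back with the origin:
```
HTTP/1.1 200 OK
Date: Tue, 05 Mar 2019 18:00:00 GMT
Cache-Control: max-age=3600
Content-Type: text/html
```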
Over the years, CDNs and reverse proxies have evolved to ignore some of HTTP’s freshness controls (for example, `Cache-Control` in requests), and have added a few new ones, like `Surrogate-Control`. Doing so is often sensible in the moment; CDNs and reverse proxies have a special relationship with the origin server, and they take advantage of that to optimize their services (adding “Varnish”, in the words of Poul-Henning Kamp).
Unfortunately, anecdotal evidence suggests that each CDN is doing so in different, sometimes incompatible ways, and that’s not great for customers. If CDNs and reverse proxies break the rules of HTTP caching in an unanticipated way, it can bring unpleasant surprises.
For example, imagine you use a third party server framework that depends on `Cache-Control: private` working to assure that personalized content isn’t cached. If you interpose a CDN that doesn’t honor it, the results could be disastrous; content intended for one user could end up with another, because the framework assumed (rightly) that HTTP caching would work correctly.
At the same time, people are using HTTP in much more imaginative and fine-grained ways than they were even five years ago, especially for HTTP-based APIs. As a result, they expect CDNs and reverse proxies to behave in predictable, specification-conformant ways, so that they can reliably use all of the protocols’ relevant features.
Ideally, there would be a __common profile for how CDNs handle the low-level details of HTTP caching__, with detailed documentation about any deviation from “normal” behaviors. This would reduce the friction -- and risk -- of switching between CDNs, and allow third-party software to target one thing, rather than try to integrate with the growing number of CDNs out there.
However, defining such a profile is tricky. There are valid reasons for the varying individual behaviors of CDNs today -- even if they’re not compatible with the specification or other CDNs. And no CDN (including Fastly) wants to change how it behaves for existing customers in surprising ways.
What we can do, though, is gather data to help shape an idea of how a CDN (or reverse proxy) should behave, and identify any problem areas. In other words, we need __a test suite for CDNs and reverse proxies__ that they and their customers can use to help understand how they behave. In time, that might help us drive a discussion about where it makes sense to converge our behavior.
A while back, I wrote a [set of tests](https://github.com/web-platform-tests/wpt/tree/master/fetch/http-cache) for HTTP caching in Web browsers that became part of [Web Platform Tests](https://web-platform-tests.org/) (the W3C’s common test suite for browsers). It turns out that they serve as a good basis for a [public CDN test suite](https://github.com/mnot/cdn-tests) too.
Right now, the tests look for strict, vanilla HTTP conformance, as per [the specification](https://httpwg.org/specs/rfc7234.html). As discussed above, CDNs diverge from the specification for sometimes good reasons, and there’s not yet any common agreement about what the “correct” behavior is, so they should not be treated as a conformance test -- i.e., there is no prize for passing them all.
However, they are a great tool for figuring out how a CDN behaves, as well as getting insight into how it treats your content, and how you can tailor your content to improve performance.
## Fastly’s Results
So how does Fastly do today? By the numbers, right now it’s 69 pass, 36 fail. That’s actually pretty good, as we’ll see below when we walk through the details and the reasons for diverging from strict HTTP conformance.
Why isn’t it perfect? Like many CDNs, Fastly has a “special” relationship with the origin server; we want to encourage caching wherever possible, and we often have external information (e.g., in VCL) that overrides caching metadata. HTTP caching was designed before CDNs were thought of, and some things that make sense for a generic forward proxy cache just don’t make sense for a CDN.
Also, being derived from Varnish, we have the baggage of history; in some cases, changing how we behave is going to hurt existing customers because they’ve come to depend on the quirks of Varnish and/or our service.
The big difference with Fastly is our not-so-secret weapon, [VCL](https://docs.fastly.com/vcl/). __Using VCL, you can make Fastly pass nearly all of the tests__ (103 out of 105) if you need to. If you want more HTTP-conformant behavior, all you need to do is add the snippets below to your account and activate; five seconds later, it’ll be live.
Doing that may or may not be appropriate for your site; read on to find out why.
## Explicit Freshness: Expires and Cache-Control: max-age
Most people are familiar with [`Expires`](https://httpwg.org/specs/rfc7234.html#header.expires) and [`Cache-Control: max-age`](https://httpwg.org/specs/rfc7234.html#cache-response-directive); they’re the primary way you declare a freshness lifetime in HTTP responses. __Fastly passes almost all of the tests__ regarding `Expires`, response `Cache-Control: max-age` (and `s-maxage`), their relative priority, [age calculation](https://httpwg.org/specs/rfc7234.html#age.calculations), and so forth; we will only ignore these headers if you set `Surrogate-Control` (see below) or explicitly override them in VCL.
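As a hypothetical illustration, in a response like the one below, a shared cache such as a CDN should use the `s-maxage` value (7200 seconds); `max-age` applies only where `s-maxage` is absent, and `Expires` only where neither `Cache-Control` directive is present:
```
HTTP/1.1 200 OK
Date: Tue, 05 Mar 2019 18:00:00 GMT
Expires: Tue, 05 Mar 2019 18:10:00 GMT
Cache-Control: max-age=3600, s-maxage=7200
```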
The only exception is when a response uses only `Expires` to set a freshness lifetime, and there’s an upstream cache -- for example, when you have a reverse proxy between Fastly and your origin -- with significant clock synchronization issues. See “Date, Age and Expires” below for details of the problem here.
## Heuristic Freshness
If the response doesn’t have explicit freshness assigned to it, HTTP caches are allowed to calculate their own [heuristic freshness lifetime](https://httpwg.org/specs/rfc7234.html#heuristic.freshness) in certain situations. This is because lots of content doesn’t explicitly set freshness. However, to keep things safe, heuristic freshness is only allowed on certain status codes, and only when there isn’t __any__ explicit information.
Fastly allows control over the heuristic we use with the [ttl parameter](https://docs.fastly.com/guides/performance-tuning/controlling-caching#how-long-fastly-caches-content) in VCL and our control panel. And, the tests show that __we correctly avoid using heuristic freshness with all response status codes that don’t allow it.__
However, like many browsers and other intermediary caches, Fastly is very conservative about heuristic freshness; the tests show __we don’t use it on all status codes that HTTP allows__, including `204 (No Content)`, `405 (Method Not Allowed)`, `414 (URI Too Long)`, `501 (Not Implemented)`, and any response with [`Cache-Control: public`](https://httpwg.org/specs/rfc7234.html#cache-response-directive.public) on it. Instead, these responses won’t be stored in our cache if there isn’t explicit freshness information.
If you want to cache these responses as well, use this VCL in `vcl_fetch`:
```
# Assign a heuristic TTL of one hour when there's no explicit
# freshness (max-age, s-maxage or Expires), Last-Modified is
# present and in the past, and the status code (or
# Cache-Control: public) allows heuristic caching.
if (
  (  ! beresp.http.Cache-Control:max-age
  && ! beresp.http.Cache-Control:s-maxage
  && ! beresp.http.Expires
  && time.is_after(now, std.time(beresp.http.Last-Modified, now))
  ) && (
    http_status_matches(beresp.status, "200,203,204,404,405,410,414,501")
    || beresp.http.Cache-Control:public
  )
)
{
  set beresp.ttl = 3600s;
  set beresp.cacheable = true;
}
```
## Private, No-Cache, and No-Store
Other response `Cache-Control` directives can be used to restrict caching in various ways.
The [`private`](https://httpwg.org/specs/rfc7234.html#cache-response-directive.private) directive disallows caching in “shared” caches like Fastly, and the __tests show we correctly avoid caching responses with `Cache-Control: private`__ (as [documented](https://docs.fastly.com/guides/tutorials/cache-control-tutorial.html#do-not-cache)).
The [`no-cache`](https://httpwg.org/specs/rfc7234.html#cache-response-directive.no-cache) and [`no-store`](https://httpwg.org/specs/rfc7234.html#cache-response-directive.no-store) response directives are often confused. `no-cache` allows caches to store something, but it can’t be reused without checking with the origin server first (e.g., with an `If-None-Match` validation). `no-store` outright disallows caching; effectively, it means the response is required to bypass the caching system.
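To make the distinction concrete, here are two hypothetical responses. The first may be stored, but must be revalidated (here, using its `ETag`) before each reuse; the second must not be stored at all:
```
HTTP/1.1 200 OK
Cache-Control: no-cache
ETag: "abc123"
```
```
HTTP/1.1 200 OK
Cache-Control: no-store
```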
Because of its Varnish roots, __[Fastly ignores both `no-cache` and `no-store` by default](https://docs.fastly.com/guides/tutorials/cache-control-tutorial.html#do-not-cache)__, both storing and serving the response without checking with the origin server, and thus __we fail the corresponding tests__.
However, you can support them in `vcl_fetch`:
```
if (beresp.http.Cache-Control:no-store) { return (pass); }
if (beresp.http.Cache-Control:no-cache) {
  set beresp.ttl = -1s;
  set beresp.grace = 0s;
  return (deliver);
}
```
__Warning:__ The `no-cache` support here is correct according to the spec, but be aware that it causes significant inefficiency, because it effectively serializes these requests. Be cautious when deploying in production; if you don’t need the exact semantics of `no-cache`, it’s better to use `no-store`.
## Surrogate-Control
Tucked away in the fairly obscure [Edge Architecture Specification](https://www.w3.org/TR/edge-arch/) is the `Surrogate-Control` header, which gives content providers a way to explicitly convey freshness information to their CDN, overriding any explicit freshness controls in the response.
Surrogate-Control has been adopted by a number of CDNs over the years, including Fastly (see [our docs](https://docs.fastly.com/guides/performance-tuning/controlling-caching.html)). However, the brevity of the specification (as editor, my fault; sorry) and the header’s similarity to Cache-Control has led to a lot of different interpretations of its functionality, syntax and semantics.
So, the tests aim to establish a reasonable interoperable base for `Surrogate-Control`, starting with the `max-age` and `no-store` directives. They expect a cache to honor `Surrogate-Control` before `Expires` or `Cache-Control`, and to still account for things like the `Age` header, so that it isn’t over-cached.
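For example, given a hypothetical response like the one below, the CDN should take its freshness lifetime from `Surrogate-Control` (an hour), while the `Cache-Control` header continues to govern downstream caches such as browsers (here, requiring them to revalidate on every use):
```
HTTP/1.1 200 OK
Surrogate-Control: max-age=3600
Cache-Control: no-cache
```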
Here, the tests tell us that __Fastly handles `Surrogate-Control: max-age` properly, but doesn’t honor `Surrogate-Control: no-store` yet__.
Similar to `Cache-Control: no-store` above, this can be added in `vcl_fetch`:
```
if (beresp.http.Surrogate-Control:no-store) { return (pass); }
```
## Cache-Control in Requests
Like many CDNs, Fastly ignores [Cache-Control directives in requests](https://httpwg.org/specs/rfc7234.html#cache-request-directive) and so __we fail almost all of the request `Cache-Control` tests__.
CDNs do this for a good reason: allowing clients more control over the cache means that you (the website operator) have less control. If you’re relying on a CDN for performance and availability, this can be an attack vector, and so ignoring request `Cache-Control` is the safer default for most customers.
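As a hypothetical illustration, if request directives were honored unconditionally, any client could send something like this in a loop and force every request through to your origin, defeating the cache entirely:
```
GET /expensive-search?q=anything HTTP/1.1
Host: www.example.com
Cache-Control: no-cache
```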
However, if your site needs to honor request `Cache-Control` (for example, you’re serving an HTTP API and the clients are authenticated), it’s easy enough to do so with VCL:
In `vcl_recv`:
```
# Cache-Control is removed from requests when we try to cache,
# so we make a copy of the fields we need forwarded.
if (req.http.Cache-Control) {
  set req.http.Forward-Cache-Control = req.http.Cache-Control;
}

# handle no-store
if (req.http.Cache-Control:no-store) {
  # For the edge node
  set req.http.Forward-Cache-Control:no-store = "1";
  return(pass);
}
if (req.http.Forward-Cache-Control:no-store) {
  # For the shield node
  unset req.http.Forward-Cache-Control;
  return(pass);
}

if (req.http.Forward-Cache-Control:max-stale) {
  declare local var.maxstale RTIME;
  set var.maxstale = std.atoi(req.http.Forward-Cache-Control:max-stale);
  set req.max_stale_while_revalidate = var.maxstale;
} else {
  set req.max_stale_while_revalidate = 0s;
}
```
In `vcl_hit`:
```
declare local var.ttl INTEGER;

if (req.http.Forward-Cache-Control:no-cache) {
  set obj.ttl = 0s;
  restart;
}

if (req.http.Forward-Cache-Control:max-age) {
  declare local var.maxage INTEGER;
  set var.maxage = std.atoi(req.http.Forward-Cache-Control:max-age);
  set var.ttl = var.maxage;
  set var.ttl -= obj.ttl;
  if (var.ttl < 1) {
    return (pass);
  }
}

if (req.http.Forward-Cache-Control:min-fresh) {
  declare local var.minfresh INTEGER;
  set var.minfresh = std.atoi(req.http.Forward-Cache-Control:min-fresh);
  set var.ttl = obj.ttl;
  set var.ttl -= var.minfresh;
  if (var.ttl < 1) {
    return (pass);
  }
}
```
In `vcl_miss`:
```
if (req.http.Forward-Cache-Control:only-if-cached) {
  error 504 "Gateway Error";
}
if (req.http.Forward-Cache-Control) {
  set bereq.http.Cache-Control = req.http.Forward-Cache-Control;
}
unset bereq.http.Forward-Cache-Control;
```
And in `vcl_fetch`:
```
# Save responses for request Cache-Control: max-stale
set beresp.stale_while_revalidate = 1h;
```
Note that `beresp.stale_while_revalidate` is set in `vcl_fetch` to enable request `Cache-Control: max-stale`. This acts as a maximum for stale-while-revalidate as well as for max-stale handling, so adjust its value accordingly. If you want to use stale-while-revalidate, make sure you change `req.max_stale_while_revalidate` to the value you desire in `vcl_recv`.
## Status Codes and Caching
In HTTP, just about any status code -- even an unknown one -- can be cached, as long as the response has explicit freshness information (or `Cache-Control: public`). Like many HTTP caches, Fastly is more conservative, only caching those status codes it knows about, but this can be changed in (you guessed it) `vcl_fetch`:
```
if (
  beresp.http.Cache-Control:max-age
  || beresp.http.Cache-Control:s-maxage
  || beresp.http.Expires
)
{
  set beresp.cacheable = true;
}
```
[`500 (Internal Server Error)`](https://httpwg.org/specs/rfc7231.html#status.500) and [`503 (Service Unavailable)`](https://httpwg.org/specs/rfc7231.html#status.503) deserve special mention here. Because they indicate a server is having trouble, Fastly will, by default, retry the request on the origin, possibly for a long time. If you want to write them through to the client, do this in `vcl_fetch`:
```
if (beresp.status == 500 || beresp.status == 503) {
  set req.http.Fastly-Cachetype = "ERROR";
  return (deliver);
}
```
## Date, Age and Expires
While Fastly handles responses whose freshness is based upon `Cache-Control` correctly, we found a few issues with responses whose freshness is based upon just `Expires`.
The `Age` response header helps a cache account for the time that a response spends in upstream caches. It’s used when calculating how much freshness a response has left, even when that freshness is based upon `Expires`.
HTTP’s algorithm for calculating `Expires`-based freshness is effectively:
remaining_freshness = (Expires - Date) - Age
Since both `Date` and `Expires` come from the origin server, it doesn’t matter whether the cache’s clock is well-synchronized to the origin server’s. However, Fastly doesn’t do this; instead, we compare the `Expires` header to our local clock and decide how much longer the response is fresh for, ignoring `Age`.
Based upon [code comments](https://github.com/varnishcache/varnish-cache/blob/2.1/bin/varnishd/rfc2616.c#L46), it looks like we inherited that approach from Varnish. It works great when the origin server’s clock is well-synchronised with ours, but if they’re significantly out of sync, it can cause the response to be cached longer or shorter than intended.
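To make that concrete, here’s a hypothetical example. Suppose the origin’s clock runs ten minutes ahead of Fastly’s, and it sends:
```
Date: Tue, 05 Mar 2019 12:10:00 GMT
Expires: Tue, 05 Mar 2019 13:10:00 GMT
```
Using the specification’s calculation, the freshness lifetime is `Expires` minus `Date` -- 3600 seconds -- no matter whose clock is correct. Comparing `Expires` against a local clock that reads 12:00:00 instead yields 4200 seconds, so the response would be served for ten minutes longer than the publisher intended.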
This is a bug that the tests have helped us to identify. We’re currently assessing how to fix it without affecting existing customers’ VCL. In the meantime, if you use a cache between your origin and Fastly and use `Expires`, make sure that your origin’s clock is well-synchronised using [NTP](http://www.ntp.org/).
These headers are also important to get right for downstream caches -- for example, those in browsers. The tests show that Fastly sends the `Age` header to its clients. However, when freshness is based upon `Expires`, the `Age` we send reflects only how long the response has been stored in Fastly’s cache; if it was cached upstream, that time isn’t accounted for, so a cache downstream from Fastly could store the response for too long.
This is another bug that the test suite helped us find, and it only affects responses that use just `Expires` for freshness. We’re currently exploring how to fix it while making sure that existing customers aren’t impacted. In the meantime, if you have a cache between Fastly and your origin (like Varnish or another reverse proxy) and you want to assure that the freshness lifetime of an object that uses `Expires` for freshness doesn’t exceed the specified amount across that cache, Fastly, and downstream caches, add this in `vcl_fetch`:
```
if (beresp.http.Age) {
  set beresp.http.Before-Fastly-Age = beresp.http.Age;
}
```
… and in `vcl_deliver`:
```
if (resp.http.Before-Fastly-Age) {
  declare local var.age INTEGER;
  set var.age = std.atoi(resp.http.Age);
  set var.age += std.atoi(resp.http.Before-Fastly-Age);
  set resp.http.Age = var.age;
  unset resp.http.Before-Fastly-Age;
}
```
Similarly, the [`Date` header](https://httpwg.org/specs/rfc7231.html#header.date) is used to assure that the `Expires` header is honored correctly. Fastly updates `Date` on every response to the current time, and this has a potentially bad interaction with `Age`, because downstream HTTP caches also use the algorithm for `Expires` freshness explained above.
Because we update `Date` and also send `Age`, the time a response has already spent in our cache ends up being counted twice, so downstream caches (like those in browsers) will consider the response fresh for a shorter amount of time than `Expires` allows.
Again, this is a bug that the test suite helped us identify, and it only affects responses that use just `Expires` for freshness. We’re evaluating how to fix it without impacting current customers, but if it’s important to you that responses using just `Expires` for freshness see their full lifetime in downstream caches, you can address it in `vcl_fetch`:
```
if (beresp.http.Date) {
  set beresp.http.Before-Fastly-Date = beresp.http.Date;
}
```
… and in `vcl_deliver`:
```
if (resp.http.Before-Fastly-Date) {
  set resp.http.Date = resp.http.Before-Fastly-Date;
  unset resp.http.Before-Fastly-Date;
}
```
## What’s Next?
In the long run, I’d love nothing more than to see all CDNs and reverse proxies passing all of these tests (and many more: freshness is just the start). To get there, it’s going to require a number of the tests to change, some adjustments by CDNs and reverse proxies, a lot of discussion, and I suspect a fair amount of time.
That’s OK. Having open tests not only helps guide a larger discussion, but also informs implementation decisions and gives customers greater transparency into how we handle HTTP in the meantime.
Improving CDN interoperability is in everyone’s interests. No CDN (that I know of) differentiates their products based upon how they handle the `Cache-Control` header, but those differences can impair customers, both in functionality and efficiency. The more consistently we handle low-level details like this, the more our customers and third-party tools and frameworks can consider CDNs a solid platform to build upon, rather than something to be configured as a one-off.
If you want to help, please use the test suite, file issues (for bugs or new tests), and make pull requests. If you’re a CDN customer, tell your CDN provider that interoperability is important to you. If you’re a CDN, I’d love to start talking about which tests should pass and what else needs to change to get there. As explained above, there are good reasons for not following the HTTP specifications in some cases, but arbitrary differences between CDNs’ protocol handling don’t help anyone.
*Thanks to Rogier Mulhuijzen and Andrew Betts for their help with VCL.*