Testing HTTP freshness in CDNs
CDNs all use HTTP caching to optimize performance, but different CDNs often do it in slightly different ways, and that can make things more complicated for our customers. This blog post makes the case for CDN interoperability and introduces a common test suite that helps identify differences between CDNs, to start paving the way.
## Why Test Freshness?
Per [RFC7234](https://httpwg.org/specs/rfc7234.html#expiration.model), “Freshness” is one of the core concepts of HTTP caching; fresh responses can be used without contacting the origin server, improving efficiency. Effectively, it’s a contract between the publisher and all downstream HTTP caches that amounts to “You can serve my content for this long before checking back with me.”
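For example, a hypothetical origin response like the one below declares a freshness lifetime of an hour; any cache along the way may serve its stored copy for up to 3600 seconds before it has to check back with the origin:
```
HTTP/1.1 200 OK
Date: Tue, 05 Mar 2019 18:00:00 GMT
Cache-Control: max-age=3600
Content-Type: text/html
```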
Over the years, CDNs and reverse proxies have evolved to ignore some of HTTP’s freshness controls (for example, `Cache-Control` in requests), and have added a few new ones, like `Surrogate-Control`. Doing so is often sensible in the moment; CDNs and reverse proxies have a special relationship with the origin server, and they take advantage of that to optimize their services (adding “Varnish”, in the words of Poul-Henning Kamp).
Unfortunately, anecdotal evidence suggests that each CDN is doing so in different, sometimes incompatible ways, and that’s not great for customers. If CDNs and reverse proxies break the rules of HTTP caching in an unanticipated way, it can bring unpleasant surprises.
For example, imagine you use a third party server framework that depends on `Cache-Control: private` working to assure that personalized content isn’t cached. If you interpose a CDN that doesn’t honor it, the results could be disastrous; content intended for one user could end up with another, because the framework assumed (rightly) that HTTP caching would work correctly.
At the same time, people are using HTTP in much more imaginative and fine-grained ways than they were even five years ago, especially for HTTP-based APIs. As a result, they expect CDNs and reverse proxies to behave in predictable, specification-conformant ways, so that they can reliably use all of the protocols’ relevant features.
Ideally, there would be a __common profile for how CDNs handle the low-level details of HTTP caching__, with detailed documentation about any deviation from “normal” behaviors. This would reduce the friction -- and risk -- of switching between CDNs, and allow third-party software to target one thing, rather than try to integrate with the growing number of CDNs out there.
However, defining such a profile is tricky. There are valid reasons for the varying individual behaviors of CDNs today -- even if they’re not compatible with the specification or other CDNs. And no CDN (including Fastly) wants to change how it behaves for existing customers in surprising ways.
What we can do, though, is gather data to help shape an idea of how a CDN (or reverse proxy) should behave, and identify any problem areas. In other words, we need __a test suite for CDNs and reverse proxies__ that they and their customers can use to help understand how they behave. In time, that might help us drive a discussion about where it makes sense to converge our behavior.
A while back, I wrote a [set of tests](https://github.com/web-platform-tests/wpt/tree/master/fetch/http-cache) for HTTP caching in Web browsers that became part of [Web Platform Tests](https://web-platform-tests.org/) (the W3C’s common test suite for browsers). It turns out that they serve as a good basis for a [public CDN test suite](https://github.com/mnot/cdn-tests) too.
Right now, the tests look for strict, vanilla HTTP conformance, as per [the specification](https://httpwg.org/specs/rfc7234.html). As discussed above, CDNs diverge from the specification for sometimes good reasons, and there’s not yet any common agreement about what the “correct” behavior is, so they should not be treated as a conformance test -- i.e., there is no prize for passing them all.
However, they are a great tool for figuring out how a CDN behaves, as well as getting insight into how it treats your content, and how you can tailor your content to improve performance.
## Fastly’s Results
So how does Fastly do today? By the numbers, right now it’s 69 pass, 36 fail. That’s actually pretty good, as we’ll see below when we walk through the details and the reasons for diverging from strict HTTP conformance.
Why isn’t it perfect? Like many CDNs, Fastly has a “special” relationship with the origin server; we want to encourage caching wherever possible, and we often have external information (e.g., in VCL) that overrides caching metadata. HTTP caching was designed before CDNs were thought of, and some things that make sense for a generic forward proxy cache just don’t make sense for a CDN.
Also, being derived from Varnish, we have the baggage of history; in some cases, changing how we behave is going to hurt existing customers because they’ve come to depend on the quirks of Varnish and/or our service.
The big difference with Fastly is our not-so-secret weapon, [VCL](https://docs.fastly.com/vcl/). __Using VCL, you can make Fastly pass nearly all of the tests__ (103 out of 105) if you need to. If you want more HTTP-conformant behavior, all you need to do is add the snippets below to your account and activate; five seconds later, it’ll be live.
Doing that may or may not be appropriate for your site; read on to find out why.
## Explicit Freshness: Expires and Cache-Control: max-age
Most people are familiar with [`Expires`](https://httpwg.org/specs/rfc7234.html#header.expires) and [`Cache-Control: max-age`](https://httpwg.org/specs/rfc7234.html#cache-response-directive); they’re the primary way you declare a freshness lifetime in HTTP responses. __Fastly passes almost all of the tests__ regarding `Expires`, response `Cache-Control: max-age` (and `s-maxage`), their relative priority, [age calculation](https://httpwg.org/specs/rfc7234.html#age.calculations), and so forth; we will only ignore these headers if you set `Surrogate-Control` (see below) or explicitly override them in VCL.
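As a hypothetical illustration, in a response like the one below, a shared cache such as a CDN should use the `s-maxage` value (7200 seconds); `max-age` applies only where `s-maxage` is absent, and `Expires` only where neither `Cache-Control` directive is present:
```
HTTP/1.1 200 OK
Date: Tue, 05 Mar 2019 18:00:00 GMT
Expires: Tue, 05 Mar 2019 18:10:00 GMT
Cache-Control: max-age=3600, s-maxage=7200
```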
The only exception is when a response uses only `Expires` to set a freshness lifetime, and there’s an upstream cache -- for example, when you have a reverse proxy between Fastly and your origin -- with significant clock synchronization issues. See “Date, Age and Expires” below for details of the problem here.
## Heuristic Freshness
If the response doesn’t have explicit freshness assigned to it, HTTP caches are allowed to calculate their own [heuristic freshness lifetime](https://httpwg.org/specs/rfc7234.html#heuristic.freshness) in certain situations. This is because lots of content doesn’t explicitly set freshness. However, to keep things safe, heuristic freshness is only allowed on certain status codes, and only when there isn’t __any__ explicit information.
Fastly allows control over the heuristic we use with the [ttl parameter](https://docs.fastly.com/guides/performance-tuning/controlling-caching#how-long-fastly-caches-content) in VCL and our control panel. And, the tests show that __we correctly avoid using heuristic freshness with all response status codes that don’t allow it.__
However, like many browsers and other intermediary caches, Fastly is very conservative about heuristic freshness; the tests show __we don’t use it on all status codes that HTTP allows__, including `204 (No Content)`, `405 (Method Not Allowed)`, `414 (URI Too Long)`, `501 (Not Implemented)`, and any response with [`Cache-Control: public`](https://httpwg.org/specs/rfc7234.html#cache-response-directive.public) on it. Instead, these responses won’t be stored in our cache if there isn’t explicit freshness information.
If you want to cache these responses as well, use this VCL in `vcl_fetch`:
```
# Assign a heuristic TTL of one hour when there's no explicit
# freshness (max-age, s-maxage or Expires), Last-Modified is
# present and in the past, and the status code (or
# Cache-Control: public) allows heuristic caching.
if (
  (  ! beresp.http.Cache-Control:max-age
  && ! beresp.http.Cache-Control:s-maxage
  && ! beresp.http.Expires
  && time.is_after(now, std.time(beresp.http.Last-Modified, now))
  ) && (
    http_status_matches(beresp.status, "200,203,204,404,405,410,414,501")
    || beresp.http.Cache-Control:public
  )
)
{
  set beresp.ttl = 3600s;
  set beresp.cacheable = true;
}
```
## Private, No-Cache, and No-Store
Other response `Cache-Control` directives can be used to restrict caching in various ways.
The [`private`](https://httpwg.org/specs/rfc7234.html#cache-response-directive.private) directive disallows caching in “shared” caches like Fastly, and the __tests show we correctly avoid caching responses with `Cache-Control: private`__ (as [documented](https://docs.fastly.com/guides/tutorials/cache-control-tutorial.html#do-not-cache)).
The [`no-cache`](https://httpwg.org/specs/rfc7234.html#cache-response-directive.no-cache) and [`no-store`](https://httpwg.org/specs/rfc7234.html#cache-response-directive.no-store) response directives are often confused. `no-cache` allows caches to store something, but it can’t be reused without checking with the origin server first (e.g., with an `If-None-Match` validation). `no-store` outright disallows caching; effectively, it means the response is required to bypass the caching system.
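To make the distinction concrete, here are two hypothetical responses. The first may be stored, but must be revalidated (here, using its `ETag`) before each reuse; the second must not be stored at all:
```
HTTP/1.1 200 OK
Cache-Control: no-cache
ETag: "abc123"
```
```
HTTP/1.1 200 OK
Cache-Control: no-store
```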
Because of its Varnish roots, __[Fastly ignores both `no-cache` and `no-store` by default](https://docs.fastly.com/guides/tutorials/cache-control-tutorial.html#do-not-cache)__, both storing and serving the response without checking with the origin server, and thus __we fail the corresponding tests__.
However, you can support them in `vcl_fetch`:
```
if (beresp.http.Cache-Control:no-store) { return (pass); }
if (beresp.http.Cache-Control:no-cache) {
  set beresp.ttl = -1s;
  set beresp.grace = 0s;
  return (deliver);
}
```
__Warning:__ The `no-cache` support here is correct according to the spec, but be aware that it causes significant inefficiency, because it effectively serializes these requests. Be cautious when deploying in production; if you don’t need the exact semantics of `no-cache`, it’s better to use `no-store`.
## Surrogate-Control
Tucked away in the fairly obscure [Edge Architecture Specification](https://www.w3.org/TR/edge-arch/) is the `Surrogate-Control` header, which gives content providers a way to explicitly convey freshness information to their CDN, overriding any explicit freshness controls in the response.
Surrogate-Control has been adopted by a number of CDNs over the years, including Fastly (see [our docs](https://docs.fastly.com/guides/performance-tuning/controlling-caching.html)). However, the brevity of the specification (as editor, my fault; sorry) and the header’s similarity to Cache-Control has led to a lot of different interpretations of its functionality, syntax and semantics.
So, the tests aim to establish a reasonable interoperable base for `Surrogate-Control`, starting with the `max-age` and `no-store` directives. They expect a cache to honor `Surrogate-Control` before `Expires` or `Cache-Control`, and to still account for things like the `Age` header, so that it isn’t over-cached.
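For example, given a hypothetical response like the one below, the CDN should take its freshness lifetime from `Surrogate-Control` (an hour), while the `Cache-Control` header continues to govern downstream caches such as browsers (here, requiring them to revalidate on every use):
```
HTTP/1.1 200 OK
Surrogate-Control: max-age=3600
Cache-Control: no-cache
```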
Here, the tests tell us that __Fastly handles `Surrogate-Control: max-age` properly, but doesn’t honor `Surrogate-Control: no-store` yet__.
Similar to `Cache-Control: no-store` above, this can be added in `vcl_fetch`:
```
if (beresp.http.Surrogate-Control:no-store) { return (pass); }
```
## Cache-Control in Requests
Like many CDNs, Fastly ignores [Cache-Control directives in requests](https://httpwg.org/specs/rfc7234.html#cache-request-directive) and so __we fail almost all of the request `Cache-Control` tests__.
CDNs do this for a good reason: allowing clients more control over the cache means that you (the website operator) have less control. If you’re relying on a CDN for performance and availability, this can be an attack vector, and so ignoring request `Cache-Control` is the safer default for most customers.
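As a hypothetical illustration, if request directives were honored unconditionally, any client could send something like this in a loop and force every request through to your origin, defeating the cache entirely:
```
GET /expensive-search?q=anything HTTP/1.1
Host: www.example.com
Cache-Control: no-cache
```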
However, if your site needs to honor request `Cache-Control` (for example, you’re serving an HTTP API and the clients are authenticated), it’s easy enough to do so with VCL:
In `vcl_recv`:
```
# Cache-Control is removed from requests when we try to cache,
# so we make a copy of the fields we need forwarded.
if (req.http.Cache-Control) {
  set req.http.Forward-Cache-Control = req.http.Cache-Control;
}

# handle no-store
if (req.http.Cache-Control:no-store) {
  # For the edge node
  set req.http.Forward-Cache-Control:no-store = "1";
  return(pass);
}
if (req.http.Forward-Cache-Control:no-store) {
  # For the shield node
  unset req.http.Forward-Cache-Control;
  return(pass);
}

if (req.http.Forward-Cache-Control:max-stale) {
  declare local var.maxstale RTIME;
  set var.maxstale = std.atoi(req.http.Forward-Cache-Control:max-stale);
  set req.max_stale_while_revalidate = var.maxstale;
} else {
  set req.max_stale_while_revalidate = 0s;
}
```
In `vcl_hit`:
```
declare local var.ttl INTEGER;

if (req.http.Forward-Cache-Control:no-cache) {
  set obj.ttl = 0s;
  restart;
}

if (req.http.Forward-Cache-Control:max-age) {
  declare local var.maxage INTEGER;
  set var.maxage = std.atoi(req.http.Forward-Cache-Control:max-age);
  set var.ttl = var.maxage;
  set var.ttl -= obj.ttl;
  if (var.ttl < 1) {
    return (pass);
  }
}

if (req.http.Forward-Cache-Control:min-fresh) {
  declare local var.minfresh INTEGER;
  set var.minfresh = std.atoi(req.http.Forward-Cache-Control:min-fresh);
  set var.ttl = obj.ttl;
  set var.ttl -= var.minfresh;
  if (var.ttl < 1) {
    return (pass);
  }
}
```
In `vcl_miss`:
```
if (req.http.Forward-Cache-Control:only-if-cached) {
  error 504 "Gateway Error";
}
if (req.http.Forward-Cache-Control) {
  set bereq.http.Cache-Control = req.http.Forward-Cache-Control;
}
unset bereq.http.Forward-Cache-Control;
```
And in `vcl_fetch`:
```
# Save responses for request Cache-Control: max-stale
set beresp.stale_while_revalidate = 1h;
```
Note that `beresp.stale_while_revalidate` is set in `vcl_fetch` to enable request `Cache-Control: max-stale`. This acts as a maximum for stale-while-revalidate as well as for max-stale handling, so adjust its value accordingly. If you want to use stale-while-revalidate, make sure you change `req.max_stale_while_revalidate` to the value you desire in `vcl_recv`.
## Status Codes and Caching
In HTTP, just about any status code -- even an unknown one -- can be cached, as long as the response has explicit freshness information (or `Cache-Control: public`). Like many HTTP caches, Fastly is more conservative, only caching those status codes it knows about, but this can be changed in (you guessed it) `vcl_fetch`:
```
if (
  beresp.http.Cache-Control:max-age
  || beresp.http.Cache-Control:s-maxage
  || beresp.http.Expires
)
{
  set beresp.cacheable = true;
}
```
[`500 (Internal Server Error)`](https://httpwg.org/specs/rfc7231.html#status.500) and [`503 (Service Unavailable)`](https://httpwg.org/specs/rfc7231.html#status.503) deserve special mention here. Because they indicate a server is having trouble, Fastly will, by default, retry the request on the origin, possibly for a long time. If you want to write them through to the client, do this in `vcl_fetch`:
```
if (beresp.status == 500 || beresp.status == 503) {
  set req.http.Fastly-Cachetype = "ERROR";
  return (deliver);
}
```
## Date, Age and Expires
While Fastly handles responses whose freshness is based upon `Cache-Control` correctly, we found a few issues with responses whose freshness is based upon just `Expires`.
The `Age` response header helps a cache account for the time that a response spends in upstream caches. It’s used when calculating how much freshness a response has left, even when that freshness is based upon `Expires`.
HTTP’s algorithm for calculating `Expires`-based freshness is effectively:
remaining_freshness = (Expires - Date) - Age
Since both `Date` and `Expires` come from the origin server, it doesn’t matter whether the cache’s clock is well-synchronized to the origin server’s. However, Fastly doesn’t do this; instead, we compare the `Expires` header to our local clock and decide how much longer the response is fresh for, ignoring `Age`.
Based upon [code comments](https://github.com/varnishcache/varnish-cache/blob/2.1/bin/varnishd/rfc2616.c#L46), it looks like we inherited that approach from Varnish. It works great when the origin server’s clock is well-synchronised with ours, but if they’re significantly out of sync, it can cause the response to be cached longer or shorter than intended.
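To make that concrete, here’s a hypothetical example. Suppose the origin’s clock runs ten minutes ahead of Fastly’s, and it sends:
```
Date: Tue, 05 Mar 2019 12:10:00 GMT
Expires: Tue, 05 Mar 2019 13:10:00 GMT
```
Using the specification’s calculation, the freshness lifetime is `Expires` minus `Date` -- 3600 seconds -- no matter whose clock is correct. Comparing `Expires` against a local clock that reads 12:00:00 instead yields 4200 seconds, so the response would be served for ten minutes longer than the publisher intended.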
This is a bug that the tests have helped us to identify. We’re currently assessing how to fix it without affecting existing customers’ VCL. In the meantime, if you use a cache between your origin and Fastly and use `Expires`, make sure that your origin’s clock is well-synchronised using [NTP](http://www.ntp.org/).
These headers are also important to get right for downstream caches -- for example, those in browsers. The tests show that Fastly sends the `Age` header to its clients. However, when freshness is based upon `Expires`, the `Age` we send reflects only how long the response has been stored in Fastly’s cache; if it was cached upstream, that time isn’t accounted for, so a cache downstream from Fastly could store the response for too long.
This is another bug that the test suite helped us find, and it only affects responses that use just `Expires` for freshness. We’re currently exploring how to fix it while making sure that existing customers aren’t impacted. In the meantime, if you have a cache between Fastly and your origin (like Varnish or another reverse proxy) and you want to assure that the freshness lifetime of an object that uses `Expires` for freshness doesn’t exceed the specified amount across that cache, Fastly, and downstream caches, add this in `vcl_fetch`:
```
if (beresp.http.Age) {
  set beresp.http.Before-Fastly-Age = beresp.http.Age;
}
```
… and in `vcl_deliver`:
```
if (resp.http.Before-Fastly-Age) {
  declare local var.age INTEGER;
  set var.age = std.atoi(resp.http.Age);
  set var.age += std.atoi(resp.http.Before-Fastly-Age);
  set resp.http.Age = var.age;
  unset resp.http.Before-Fastly-Age;
}
```
Similarly, the [`Date` header](https://httpwg.org/specs/rfc7231.html#header.date) is used to assure that the `Expires` header is honored correctly. Fastly updates `Date` on every response to the current time, and this has a potentially bad interaction with `Age`, because downstream HTTP caches also use the algorithm for `Expires` freshness explained above.
Because we update `Date` and also send `Age`, the time a response has already spent in our cache ends up being counted twice, so downstream caches (like those in browsers) will consider the response fresh for a shorter amount of time than `Expires` allows.
Again, this is a bug that the test suite helped us identify, and it only affects responses that use just `Expires` for freshness. We’re evaluating how to fix it without impacting current customers, but if it’s important to you that responses using just `Expires` for freshness see their full lifetime in downstream caches, you can address it in `vcl_fetch`:
```
if (beresp.http.Date) {
  set beresp.http.Before-Fastly-Date = beresp.http.Date;
}
```
… and in `vcl_deliver`:
```
if (resp.http.Before-Fastly-Date) {
  set resp.http.Date = resp.http.Before-Fastly-Date;
  unset resp.http.Before-Fastly-Date;
}
```
## What’s Next?
In the long run, I’d love nothing more than to see all CDNs and reverse proxies passing all of these tests (and many more: freshness is just the start). To get there, it’s going to require a number of the tests to change, some adjustments by CDNs and reverse proxies, a lot of discussion, and I suspect a fair amount of time.
That’s OK. Having open tests not only helps guide a larger discussion, but also informs implementation decisions and gives customers greater transparency into how we handle HTTP in the meantime.
Improving CDN interoperability is in everyone’s interests. No CDN (that I know of) differentiates their products based upon how they handle the `Cache-Control` header, but those differences can impair customers, both in functionality and efficiency. The more consistently we handle low-level details like this, the more our customers and third-party tools and frameworks can consider CDNs a solid platform to build upon, rather than something to be configured as a one-off.
If you want to help, please use the test suite, file issues (for bugs or new tests), and make pull requests. If you’re a CDN customer, tell your CDN provider that interoperability is important to you. If you’re a CDN, I’d love to start talking about which tests should pass and what else needs to change to get there. As explained above, there are good reasons for not following the HTTP specifications in some cases, but arbitrary differences between CDNs’ protocol handling don’t help anyone.
*Thanks to Rogier Mulhuijzen and Andrew Betts for their help with VCL.*