Cache-Control in the wild
RFC 7234 defines the syntax and semantics of Cache-Control. Since I’m one of the editors currently revising the core HTTP specifications, I wanted to know how web content creators actually use Cache-Control in the wild: Do people get the syntax right? Do they use the cache directives in a clear way? I also wanted to see how common use of Cache-Control intersected with how browser, proxy, and CDN caches actually handle it, as shown by the HTTP cache test suite I’ve been working on.
To get a better idea, I wrote some scripts to process the crawl dumps of the HTTP Archive. The HTTP Archive saves the results of loading the homepages of the top 5 million or so websites in a browser, resulting in an archive of several hundred million HTTP requests. This is done without logging into or otherwise interacting with the site, so the results are bound to differ from the web as a whole. That said, the archive can give us some interesting insights, and lots of web performance researchers use it as a way to answer questions about trends on the web.
There’s a lot of detail below, but at a high level, we’ll see that the vast majority of sites are using Cache-Control correctly and in a way that will be interpreted as intended by caches. However, it also becomes clear that many developers don’t have confidence in their understanding of (or caches’ implementation of) Cache-Control, so they overuse it, to the point of sending everything that looks like it might help.
How much data are we talking about? In the May 2020 crawl, there were 428,599,822 requests that the scripts could extract headers for. Of those, 317,894,908 (74.1%) contained Cache-Control header fields, and 317,010,929 of those successfully parsed. In total, there were Cache-Control headers from 4,616,541 distinct origin servers.
The numbers above contain our first insight: 74% of responses contained explicit Cache-Control directives rather than leaving the decision of whether to cache up to the recipient, as HTTP allows with heuristic freshness. This is a promising start; heuristic freshness can cause things to be over- or under-cached and often surprises people. What else can we learn?
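To see why heuristic freshness tends to surprise people, here is a minimal sketch of one common heuristic: treating a response as fresh for 10% of the time since its Last-Modified date, which RFC 7234 mentions as a typical setting. The function name and the divisor parameter are illustrative assumptions; real caches use different fractions and often cap the result.

```python
from datetime import datetime, timedelta, timezone


def heuristic_freshness(date: datetime, last_modified: datetime, divisor: int = 10) -> timedelta:
    """Guess a freshness lifetime as a fraction of the time since last modification."""
    unchanged_for = date - last_modified
    if unchanged_for <= timedelta(0):
        # A Last-Modified at or after the response date gives no usable signal.
        return timedelta(0)
    return unchanged_for / divisor


# A response generated 10 days after the resource last changed.
date = datetime(2020, 5, 11, tzinfo=timezone.utc)
last_modified = datetime(2020, 5, 1, tzinfo=timezone.utc)
print(heuristic_freshness(date, last_modified))
```

A resource that has not changed for a year would be reused for over a month under this rule, which is exactly the kind of outcome an explicit max-age avoids.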
How are standard cache directives used?
Those Cache-Control headers contained a total of 634,923,558 cache directives. Of those, 628,437,364 (98.9%) were well-recognised, standards-defined response directives. This is how widely they were used:
The dominance of max-age is good to see, but not surprising. The older Expires header requires assigning a specific date, rather than a lifetime, so max-age fits the common case better; it’s not usually useful to specify a fixed time for expiry. Additionally, it’s easy to get the date formatting wrong in Expires, whereas max-age is a simple number.
The popularity of public isn’t surprising either, but it does illustrate how prevalent misconceptions about this directive are. 97.3% of responses containing it have max-age or s-maxage too, making it redundant (unless HTTP authentication is in use, or in cases where the status code doesn’t allow heuristic freshness). As a result, almost all instances of Cache-Control: public are just wasting response bytes; caches don’t need it to store them.
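The redundancy check described above can be sketched roughly as follows. This is not the script used for the analysis; parse_cache_control and public_is_redundant are hypothetical helpers, the parser is deliberately naive, and the status-code exception is not modelled.

```python
def parse_cache_control(value: str) -> dict:
    """Naive parser: splits on commas, so quoted arguments containing commas are mishandled."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def public_is_redundant(value: str, has_authorization: bool = False) -> bool:
    """True when 'public' adds nothing because explicit freshness is already present."""
    directives = parse_cache_control(value)
    if "public" not in directives:
        return False
    if has_authorization:
        # Conservative simplification: with HTTP authentication, assume public may matter.
        return False
    return "max-age" in directives or "s-maxage" in directives


print(public_is_redundant("public, max-age=3600"))
print(public_is_redundant("public"))
```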
Cache-Control: no-cache is pretty straightforward; it prohibits a cache from reusing the response without first validating it with the origin server. However, 73% of responses that contain no-cache don’t have either ETag or Last-Modified present, which means they could have just used no-store. Not a big deal, since caches will likely handle it the same way, but this does show that there’s underlying confusion between the two directives.
On the other hand, so much use of must-revalidate is a bit surprising. This directive requires a response to be updated from the origin once it becomes stale; if the origin server is down, it can’t be used (no-cache is the same as max-age=0 with must-revalidate). While there are certainly cases where this is important (e.g., HTML that has transactional or time-specific data in it), in the majority of cases it’s not necessary; certainly not on 16% of responses.
This becomes clearer when you break down must-revalidate use by response content type. 7% of image/jpeg responses contain the must-revalidate directive; 8% of image/png do. It’s hard to believe that even that many images require such strict freshness. In contrast, 49% of text/html responses carry must-revalidate; while more believable because of the nature of HTML, that’s still quite high.
Furthermore, almost 80% of responses with must-revalidate also included no-cache or no-store, which override it. I suspect this is because a lot of folks aren’t sure what different directives do, so they “throw the kitchen sink” at caches.
In some ways, the converse of must-revalidate is stale-while-revalidate. This directive allows caches to reuse stale responses while they’re refreshed in the background. Doing so effectively hides latency from clients and can particularly help perceived performance when a lot of requests are being made for a single page load.
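The contrast between the two directives can be sketched as a small decision function. This is an illustrative model rather than any particular cache’s implementation: cache_action and its return labels are made-up names, all times are in seconds, and letting must-revalidate take precedence over stale-while-revalidate is this sketch’s conservative assumption.

```python
def cache_action(age: int, max_age: int, stale_while_revalidate: int = 0,
                 must_revalidate: bool = False) -> str:
    """Decide how a cache might handle a stored response of the given age."""
    if age < max_age:
        # Still fresh: reuse without contacting the origin.
        return "serve-fresh"
    if not must_revalidate and age < max_age + stale_while_revalidate:
        # Stale, but inside the window: answer immediately, refresh in the background.
        return "serve-stale-and-revalidate-in-background"
    # Stale and outside the window (or must-revalidate): check with the origin first.
    return "revalidate-first"


for age in (10, 70, 100):
    print(age, cache_action(age, max_age=60, stale_while_revalidate=30))
```

Only the last case makes the client wait on the origin, which is where the latency saving comes from.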
Another thing to note is that because Cache-Control can carry multiple directives, it’s possible for them to conflict; for example, you can say both “cache this” and “don’t cache this” at the same time. Luckily, these situations are comparatively rare; 0.263% of responses contained a positive max-age and either no-cache or no-store; 0.331% contained public and either no-cache or no-store. 0.241% contained must-revalidate and either stale-while-revalidate or stale-if-error.
When caches encounter conflicts like this, they generally follow the more conservative directive and avoid reusing the response. Fastly is an exception here. Early on, we found that our customers sent Cache-Control: no-cache and no-store because they couldn’t control downstream caches, but our extremely fast purge capability gave them much finer-grained control over what we cache, so we ignored them. This was useful for getting the most value out of Fastly for our early customers, but over time we’ve become convinced that aligning with other caches’ behavior (as well as the standard) is more important and are looking at ways to smoothly transition to that.
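The three kinds of conflict counted above can be detected with a few set-membership checks. The following is a rough sketch, not the analysis script itself; find_conflicts and the helper names are hypothetical, and the parser ignores quoted arguments that contain commas.

```python
def parse_cache_control(value: str) -> dict:
    """Naive parser: directive name (lowercased) -> argument string or None."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def is_positive_int(arg) -> bool:
    return arg is not None and arg.isascii() and arg.isdigit() and int(arg) > 0


def find_conflicts(value: str) -> list:
    """Return a description of each conflicting combination found in the header."""
    d = parse_cache_control(value)
    forbids_reuse = "no-cache" in d or "no-store" in d
    conflicts = []
    if forbids_reuse and is_positive_int(d.get("max-age")):
        conflicts.append("positive max-age with no-cache or no-store")
    if forbids_reuse and "public" in d:
        conflicts.append("public with no-cache or no-store")
    if "must-revalidate" in d and ("stale-while-revalidate" in d or "stale-if-error" in d):
        conflicts.append("must-revalidate with stale-while-revalidate or stale-if-error")
    return conflicts


print(find_conflicts("public, max-age=3600, no-store"))
```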
How about non-standard directives?
Not all directives are standardised, of course. Over the years, a number of HTTP cache directives have been proposed and even implemented. Often these are experiments to see whether a new feature will work.
HTTP allows this by setting guidelines for cache extensions. Importantly, a new directive can’t require implementations to know about it; they need to be able to “fail safe” and do the right thing if the message is handled by a cache that doesn’t understand it.
For example, in the 2000s Microsoft Internet Explorer introduced the pre-check and post-check directives, intended to allow content to be refreshed asynchronously (much like the stale-while-revalidate directive).
However, IE hasn’t supported them for years. Eric Lawrence (generally the go-to person for IE networking) even wrote a blog entry imploring people not to use them to prevent caching; they waste bytes, and in some circumstances can even cause bugs with very old versions of IE.
Unfortunately, pre-check and post-check have become part of web developer lore; there are a number of online guides that insist that they need to be sent if you want to avoid caching in all cases. As Eric points out, this isn’t true, but we still see a lot of them: 6,089,255 times (0.959% of all directives).
3,067,142 pre-check (0.965% of Cache-Control headers; seen on 585,223 origins, or 12.6%)
3,022,113 post-check (0.951% of Cache-Control headers; seen on 583,259 origins, or 12.6%)
From the numbers above, we can infer that origins commonly send them together when they send them at all; the popular “pre-check=0, post-check=0” pattern that Eric cautions against.
To remove all doubt: you can omit these directives; they’re just wasting bytes. They never did what people think they do, and if your site actually needs to support extremely old versions of IE, you may be causing bugs.
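If a header is assembled somewhere you can’t easily edit by hand, the two directives can be filtered out before the response is sent. Here is a minimal sketch; strip_directives is a hypothetical helper, and it assumes no directive argument contains a comma.

```python
def strip_directives(value: str, unwanted=("pre-check", "post-check")) -> str:
    """Return the Cache-Control value with the unwanted directives removed."""
    kept = []
    for part in value.split(","):
        part = part.strip()
        name = part.partition("=")[0].strip().lower()
        if part and name not in unwanted:
            kept.append(part)
    return ", ".join(kept)


print(strip_directives("no-store, no-cache, must-revalidate, post-check=0, pre-check=0"))
```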
A number of other extension directives were present in the total; 139,460 (0.022% of all directives). Digging into why people sent these was a lot of fun.
A lot of them appeared to be either misunderstandings of how caching works, or local extensions (e.g., for a reverse proxy that’s been modified to understand them). For example: store (21,979), cache (17,904), no-check (2,951), cached (2,497), expire (616), max-cache (549), min-age (288), and off (216). There wasn’t any discernible pattern to the owners of the sites or the software they were using, so without too much digging it’s likely it was the former.
Besides that group, the next most common undefined directive was Cache-Control: false, seen 68,862 times (in 0.022% of Cache-Control headers) from only 73 origins. Almost all of them appear to be driven from one content producer, who hasn’t yet responded to a query.
proxy-public shows up 10,211 times, from 71 (0.002%) origins. This appears to be one of the extensions defined in Microsoft’s Windows Media HTTP Streaming Protocol (and the only one seen in any number). It appears to modify how Cache-Control: private works in some Microsoft caches, and seems to be very popular with a group of comic book-related sites that appear to be run by the same party.
A pair of directives, browser-ttl (2,024) and sw-max-age (2,022), show up from a few sites associated with Baqend. Read their blog entry to learn more about why they’re doing this.
It was really cool to see a group of media websites in Norway using Cache-Control: channel-maxage (798 on 91 origins) and group (309 times on two origins). These directives are part of a protocol extension for cache invalidation that I proposed over a decade ago; it’s good to see that it’s still used in some places!
cache_static_250 (487), cache_static_2.91 (391) and cache_static_148 (348) show up on a number of sites run by the same company in Vietnam. I’d love to know more about what they’re doing here, but haven’t had luck getting in touch yet.
A number of sites in Brazil used max-age86400, apparently omitting the =. Many of them have since removed that, which may mean that an administrator realised the problem and corrected it since the May run.
Finally, Cache-Control: imagine shows up on 408 responses from nine sites in Romania, all apparently sites coded by one IT shop. I reached out to ask why they were using this seemingly whimsical directive, but haven’t heard back yet.
How often do request directives end up in responses?
People occasionally get confused about when a directive is intended to be used in requests or responses; 97,171 times (or 0.015% of all directives) in the May crawl.
95,135 max-stale (0.030% of Cache-Control headers; seen on 3,879 origins, or 0.084%)
1,320 only-if-cached (<0.001% of Cache-Control headers; seen on 54 origins, or 0.001%)
716 min-fresh (<0.001% of Cache-Control headers; seen on 46 origins, or 0.001%)
Considering the size of the dataset, this isn’t too bad; caches will ignore these directives and fall back to heuristic freshness if there aren’t any explicit directives like max-age, so it isn’t harmful.
That said, having the same header whose interpretation depends on which direction it was sent in doesn’t help. If we were designing Cache-Control from scratch today, we’d probably use a different name for the field in requests to help avoid this kind of confusion.
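Spotting request directives in a response header is a simple membership test. A minimal sketch follows, with request_only_directives as a hypothetical helper name; the set contains only the three directives counted above.

```python
# Directives that only have a defined meaning in requests.
REQUEST_ONLY = {"max-stale", "min-fresh", "only-if-cached"}


def request_only_directives(response_value: str) -> list:
    """List the request-only directives that appear in a response's Cache-Control value."""
    names = [part.partition("=")[0].strip().lower() for part in response_value.split(",")]
    return [name for name in names if name in REQUEST_ONLY]


print(request_only_directives("max-age=60, max-stale=30"))
```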
How often do cache directives get misspelled?
Since most Cache-Control directives are typed by hand into places like Apache .htaccess files, there’s always the potential for them to be misspelled. But how often?
Using a well-known pattern matching algorithm, unrecognised directives were compared to the defined ones (including the non-standard and request directives), with the most similar one above a minimum level winning. This identified 160,308 directives — 0.025% of all those seen.
The most common misspellings were of max-age, seen 57,281 times in total. Of its different misspellings, “maxage” was by far the most common (51,519 times). Other frequently misspelled directives include s-maxage (54,586), immutable (14,306), and must-revalidate (12,893).
Many misspellings involve hyphens; either omitting them (e.g., “maxage” as above) or turning them into underscores (“max_age”). Some seem to be the result of faulty memories (“max-page”; 210 times), innately bad spelling skills (“inmutable”; 221 times; “no-cashe”, 62 times), or just bad typing (“no-tore”, 30 times).
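Misspellings like these can be matched to their intended directive with a similarity search. The sketch below uses Python’s difflib as a stand-in; the library, the 0.8 cutoff, and the list of known directives are all assumptions, not a description of the scripts behind the numbers in this post.

```python
import difflib
from typing import Optional

# Response, request, and non-standard directives to compare against (illustrative list).
KNOWN_DIRECTIVES = [
    "max-age", "s-maxage", "no-cache", "no-store", "public", "private",
    "must-revalidate", "proxy-revalidate", "no-transform", "immutable",
    "stale-while-revalidate", "stale-if-error",
    "max-stale", "min-fresh", "only-if-cached",
    "pre-check", "post-check",
]


def closest_directive(unrecognised: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the most similar known directive, or None if nothing clears the cutoff."""
    matches = difflib.get_close_matches(unrecognised.lower(), KNOWN_DIRECTIVES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for name in ("maxage", "max_age", "inmutable", "no-cashe", "no-tore", "imagine"):
    print(name, "->", closest_directive(name))
```

The cutoff is what keeps a genuinely new directive from being counted as a typo of an existing one; setting it too low would start matching unrelated names.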
What does this tell us? The number of misspellings is surprisingly small given how many directives were seen in total. I don’t think implementations should try to support these misspellings; that’s likely to make interoperability worse, since it will be difficult to get all caches to support the same set of corrections, and there are always more misspellings to support.
However, when defining new directives, we should keep the potential for misspelling in mind: stay with short, common words, make sure that they’re not similar to existing directives, and avoid hyphens wherever possible (using them consistently where they can’t be avoided).
How often do sites send bad values for max-age?
Setting a max-age (or its sibling s-maxage) only helps if the cache can understand it, so I was keen to see how often a few common errors were encountered.
First, while in theory max-age can be infinitely large, many implementations store it as a signed 32-bit integer, and numbers larger than that can cause an overflow. HTTP cautions against this, but 111,100 max-age and s-maxage values (0.042% of those directives) were greater than 2**31-1 seconds (over 68 years!). The cache tests show the risk here; a number of implementations have problems with such large numbers.
Decimal values (like “3.5” or “60.0”) were also occasionally seen; 5,098 times (0.002% of max-age and s-maxage values). Some implementations don’t cache responses with decimal values, and those that do may not behave in the same way.
Slightly more common were negative numbers, seen 29,093 times (0.011%). Most implementations will disregard these responses; Fastly is an exception (yes, we’ve logged a bug).
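The three kinds of bad value described in this section can be told apart with a small classifier. This is a sketch under assumptions: classify_delta_seconds and its labels are made up for illustration, and the default limit of 2**31 - 1 seconds (about 68 years) models a cache that stores the value in a signed 32-bit integer; pass a different limit to model other implementations.

```python
import re

INT32_MAX = 2**31 - 1  # assumed limit: largest signed 32-bit value, roughly 68 years in seconds


def classify_delta_seconds(raw: str, limit: int = INT32_MAX) -> str:
    """Classify a max-age or s-maxage argument as ok, too-large, negative, decimal, or invalid."""
    raw = raw.strip()
    if re.fullmatch(r"[0-9]+", raw):
        return "too-large" if int(raw) > limit else "ok"
    if re.fullmatch(r"-[0-9]+(\.[0-9]+)?", raw):
        return "negative"
    if re.fullmatch(r"[0-9]+\.[0-9]+", raw):
        return "decimal"
    return "invalid"


for value in ("3600", "9999999999", "60.0", "-1", "1 hour"):
    print(repr(value), classify_delta_seconds(value))
```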
What’s up with no-transform?
Cache-Control: no-transform appeared over 10 million times, on 3.45% of responses. Hailing from a time when most web traffic was unencrypted, no-transform politely asks proxies not to modify the response. Why is it showing up now, considering that 91% of no-transform directives seen were on HTTPS sites?
I didn’t pay too much attention to this (relatively) low number until I came across the even less-used no-siteapp, which appeared 528 times on 78 origins — including support.apple.com.
The no-siteapp directive appears to have been defined by Baidu a while back as a way to turn off transcoding of responses when they present search results (in a way similar to Google AMP). It’s easy to guess why they didn’t use no-transform; it’s already present on a bunch of sites, so using a different directive with the same meaning gives them a “fresh start.”
This reveals an interesting trend in HTTP — intermediaries are not just proxies and CDNs any more, they’re third party endpoints like search engines, and they’re meddling with content.
Cache-Control appears to be well utilised on the web, but web content creators and administrators could improve their use of it by understanding what the directives actually mean and by paring down overly verbose headers to what will get the job done.
In particular, there’s no need to send redundant directives because the tests show caches will honour the relevant directives correctly. Most of the time, that means sending either max-age or no-store, and considering adding immutable and stale-while-revalidate to boost perceived performance.
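As a rough illustration of how short such a header can be, here is a sketch of a helper that emits only the directives recommended above. build_cache_control and its parameters are hypothetical, and the lifetimes in the examples are arbitrary.

```python
from typing import Optional


def build_cache_control(max_age: Optional[int], immutable: bool = False,
                        stale_while_revalidate: Optional[int] = None) -> str:
    """Build a minimal Cache-Control value: no-store, or max-age plus optional extras."""
    if max_age is None:
        # Not cacheable: a single directive is enough.
        return "no-store"
    parts = [f"max-age={max_age}"]
    if stale_while_revalidate is not None:
        parts.append(f"stale-while-revalidate={stale_while_revalidate}")
    if immutable:
        parts.append("immutable")
    return ", ".join(parts)


print(build_cache_control(None))
print(build_cache_control(31536000, immutable=True))
print(build_cache_control(600, stale_while_revalidate=60))
```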
It also helps to double-check your site with tools like REDbot to make sure that you’re using and spelling the directives correctly.