Surrogate Keys: Part 2
One of my favorite computer science jokes is that there are only two truly difficult problems in computer science: naming things, cache invalidation, and off-by-one errors. At Fastly, we deal with these sorts of difficult problems every day. In our previous post on surrogate keys, we discussed how this feature can help you solve the problem of invalidating dynamic content. In this post, we’ll take a look under the hood at how we efficiently represented this system.
To solve this problem, we need to characterize relationships between keys and cached objects. Surrogate keys allow for a many-to-many relationship: an individual response may associate multiple keys with the object, and the same key may be supplied by many different responses. Consider these two request / response pairs:
GET /blog/ HTTP/1.1 Host: www.example.com HTTP/1.1 200 OK Content-Type: text/html Content-Length: 1234 Surrogate-Key: mainpage template-a GET /blog/article/fastly-rocks HTTP/1.1 Host: www.example.com HTTP/1.1 200 OK Content-Type: text/html Content-Length: 2345 Surrogate-Key: template-a article-fastly-rocks
In this case, we have two objects (/blog/ and /blog/article/fastly-rocks) with three keys (mainpage, template-a, and article-fastly-rocks).
Varnish and Objects
Varnish is the caching reverse proxy we use at Fastly, providing the core of our services. It serves as an intermediary between the client requesting an HTTP asset and the server containing that asset. When Varnish completes a request on behalf of a client and the response allows for it, Varnish stores the requested asset in memory or on disk (depending on a number of factors). This cached copy of the asset is called an “object” in Varnish.
Objects in Varnish are stored with an identifying hash based on distinctive pieces of the request (the path, method, and some headers, for instance). This hash must be unique between objects, and so a cryptographic hash function(SHA256) is used. If we used this object identifier, we could write an external utility to pair keys to hashes and send purges over a socket. Our prototype of the surrogate keys system operated in this fashion; unfortunately, there are some undesirable behaviors associated with this approach.
Namely, when objects expire, we need a reliable way to remove the object hash from all associated keys. If this doesn’t happen, we are at least wasting memory and disk space. Depending on the conditions of the removal failure, we might also serve content that was deemed stale. We could use a memory-only store, but if this crashes out of band with respect to Varnish we lose all associated keys, and stale (or invalid) content can be served to the user.
Additionally, objects in Varnish can represent multiple versions of the same resource. If an origin returns a response with a Vary: Accept header, multiple Content-Types may be associated with a single object. This behavior is common from (as an example) REST API endpoints providing support for multiple clients, where individual clients may prefer differing serialized data formats. When one serialization format changes without respect to other formats, one might prefer to leave the other representations cached.
Unless we dive a level deeper, we can’t solve these problems.
Surrogate Keys in Varnish
To address these issues, we embedded support for surrogate keys directly into Varnish. Implementing support directly in Varnish helps us solve all of the problems we might run into with a separate process, but it requires us to solve some complex problems.
Many-to-many relationships are notoriously confusing due to the amount of indirection involved. When a key is deleted, we need to remove the object to key relationships (as well as the key to object relationships) so that we don’t leak memory. When an object expires naturally (either due to an Expires header or a VCL-supplied TTL for the object triggers), we must remove that object from its associated keys.
Three main goals exist for us in terms of key storage:
Storage must be reasonably compact in memory.
Insertions must scale.
Removals must scale.
The obvious choice for string storage is a hash table. Although it isn’t particularly compact, it provides near constant time insertion and removal in the average case. Unfortunately, the edge cases of hash table performance are significant. Hash tables are comprised of pre-allocated buckets of fixed size, and due to speed requirements of typical non-cryptographic hash functions, collisions are inevitable. There are a number of different ways of handling collisions:
Chaining: In this case, collisions are resolved by using a second data structure (such as a linked list) to compare elements when a collision is found. This method “degrades gracefully,” but requires a good idea of the bucket size to do so.
Open addressing: Entries are all stored within the hash buckets, and when a collision is found, some probing algorithm is used to find the next free bucket. When free slots run low, the buckets are resized.
Cuckoo hashing: Multiple hash functions are used to insert into different places in the bucket. If all hash functions collide, the bucket is resized.
Several other methods also exist. Fundamentally, the problem with these methods is that they require some level of runtime tuning. When the buckets require resizing, every element stored in the bucket must be re-hashed to find its new place. With millions of keys, this resize operation becomes prohibitively expensive. The time required to complete this operation could cause us to pause for a significant period of time — significant enough to cause us to appear to “miss” purge requests.
Finally, hash tables aren’t compact: they require preallocation of memory and while cuckoo hashing might give us better general performance than (for example) linear probing on a busy / full table, we basically have to double the size of the table when we run out of space.
We needed something different. Enter the crit-bit tree.
The crit-bit tree is a condensed radix trie, generally yielding better performance due to the reduced number of pointers involved in maintaining nodes of the tree (and they’re significantly more compact than trie implementations requiring more pointers). Due to the nature of radix tries, the crit-bit tree stores condensed prefixes of strings for its internal nodes. This allows us to significantly reduce the amount of memory used in the case of many common prefixes while not paying a huge overhead when few shared prefixes exist.
In addition to these fine memory properties, the crit-bit tree has O(k) performance for additions and removals, where k represents the longest string in the data set. At first glance, this doesn’t seem like a win over balanced binary search trees (having O(log n) performance), but consider that every node traversal in a binary search tree requires an O(k) comparison in the worst case when searching for strings.
Because each key can represent multiple objects, the terminal nodes of this tree must be containers. There are a number of data structures that fit the bill; we chose a red-black tree. Red-black trees have storage properties proportional to the number of elements in the tree and have logarithmic worst-case time for additions and removals.
The red-black tree is balanced with respect to pointers to the Varnish objcore. An objcore contains the data associated with a particular object; objects may have multiple associated objcores if (for instance) the Vary header comes into play.
With a red-black tree, insertions and removals are O(log n). Insertions are practically only slightly slower than they would be for a linked list, where your insert is always O(1). The access frequency between insertion and removal tends to favor removal. Even with tens or hundreds of millions of objects, the height of the red black tree will never be more than 30 levels (⌈log2(10^9 – 1)⌉), which is a fair hit to take on insert to favor logarithmic time removal.
Finally, we need to make the reverse relationship: an object may be associated with many keys. Each objcore stores a list of pointers to the per-key red-black tree containing it. A list is sufficient, because object expiration requires removal from all associated keys: every node has to be traversed anyway. Additionally, objects are bounded to the number of associated keys due to the fact that the Surrogate-Key header has a fixed length limitation of 1024 bytes (including spaces and hex-encoded values).
Designing and implementing scalable solutions for complex problems always feels like a huge win. At Fastly, we engineers are constantly faced with such tasks. The Internet is not a solved problem, and we strive to make it faster and better.
Thanks for reading! Keep an eye out for another post soon on how our surrogate key purges are spread globally through our distributed purging infrastructure.