State at the edge

Principal Engineer

July 13, 2020

In the von Neumann model of computer architectures, a computer is defined as the combination of a logic unit — commonly, a CPU — with some form of memory, taking input and providing output.

You can view the Fastly edge cloud platform as a von Neumann computer, in a way: the input and output are HTTP requests and responses; the CPU is your service configuration expressed in VCL, our current configuration language; and the memory is a combination of the web content that we cache for you, read-only data provided via Edge Dictionaries, and the metadata associated with each request.

With the introduction of Compute, Fastly now provides a richer model for the CPU. WebAssembly, powered and secured by the Lucet compiler and runtime, unlocks essentially arbitrary code execution within each request lifecycle. If we could offer a similar upgrade for the memory — or state — component, our customers would be able to treat Fastly as a fully programmable edge computer and build entirely new classes of things that weren’t possible before.

What would that richer model for state look like? More specifically, how would we go about accepting — and persisting — reads and writes of arbitrary state within each request lifecycle? We’ll explore that and more in this blog post.

Latency is (still) the mind killer

Fastly isn’t just a generic service provider. We’re in business specifically to make your applications performant, and in this game, latency is (still) the mind killer. We’re able to offer WebAssembly as a “CPU upgrade” in our computer only because of Lucet’s unparalleled cold start latency of 35.4 microseconds. Any “memory upgrade” would need to be approximately as fast, too.

When we cache your web content, we’re able to offer sub-millisecond 99th percentile time-to-first-byte* thanks to a great deal of sophisticated engineering, but also because that form of caching is in many ways easy. Because you own your origin, and because it’s authoritative for your content, Fastly always knows exactly where to go for a given object and how to re-fetch it, if necessary. And because Fastly never modifies that content once it’s cached — because it’s immutable once it’s in our systems — we don’t have to worry about conflicts or coördination. We take advantage of those facts, at many layers in our infrastructure, to keep our systems reliable and fast.

But if we introduce writable state at the edge, there’s no longer an obvious, authoritative origin for each object, which means we have a new problem of addressability. And since requests can read and write potentially the same object from any point in our global network — because the state is now mutable — we have a new problem of state conflicts. Both of these problems are solvable, but solving them within our sub-millisecond latency budget presents a number of novel engineering challenges.

Our enemy is the speed of light

In the fight against latency, caching — or, equivalently, locality — is our chief weapon. At least until quantum entanglement becomes commercially viable, we have to keep all of the pieces of our von Neumann computer physically close to each other to keep them fast: the incoming request, any computation we apply to it, any cached content we need to access, and the outgoing response. Writable state, too! This means that each POP acts as a totally discrete computer, in a way, serving requests and responses and caching your content independent of the others.

Handily, this locality requirement can actually help solve our addressability problem. If we keep copies of the writable state in each of our POPs, in the same way we keep copies of your web content, we can operate on that state locally, and stay within our latency budget. But that solution exacerbates the other problem of state conflicts.

Imagine that we received a request in Tokyo that caused some key C to receive value 13, while, in the same instant, we received a request in New York that caused the same key C to receive value 56. What’s the truth? Typical conflict-resolution strategies, like serializable transactions, would require at least one round trip between Tokyo and New York, which would destroy our latency budget.

The annoying but inevitable conclusion, dictated by the speed of light in our physical universe, is that there are actually multiple parallel universes of truth, at least for a short period of time. Like Schrödinger’s cat, C is both 13 and 56, depending on who you ask, until we’re eventually able to collapse the quantum state and choose a single value.

Fortunately, so-called eventually consistent data systems — examples include Cassandra, DynamoDB, and MongoDB, with the right settings — are plentiful, reasonably well-understood, and can be deployed to help solve this problem. But they’re really only half of the solution: they can provide the protocol for consistency, but not necessarily the semantics of it. That is, how do we pick a value for C given it’s currently both 13 and 56? Do we split the difference and say 34.5? Do we add them together and say it’s 69? Do we flip a coin and just pick one?

Conflict-free data types

A plain number — 1, 2, 42 — is difficult or impossible to make consistent in this way, because consistency isn’t well defined. If the number represents hits on a website, we might want to resolve conflicts by adding them together; if it represents a score of some kind, we might want to average them. There’s no way of knowing, without additional information.

But it turns out that there are data types that have well-defined semantics for consistency already baked in. CRDTs, or conflict-free replicated data types, are a relatively novel discovery in computer science that, among other things, give deterministic results when “merged” together. If we can express our state at the edge in terms of CRDTs, we get everything we want. We get local read and write transactions, so we can stay within our latency budget. We allow for state conflicts, and the unavoidable parallel universes of truth. And we can reliably merge those parallel universes together, in order to get deterministic eventual consistency on a global scale.

CRDTs are especially interesting for what would otherwise be very tricky distributed systems, because of their inherent monotonicity. That’s an expensive word for an easy-to-understand concept: CRDTs can be merged in any order (also known as associativity and commutativity), and the same merge performed repeatedly (also known as idempotence), and the result will reliably converge to the same, correct, stable outcome. It’s like the difference between adding a bunch of numbers to each other over and over again, versus unioning a bunch of sets of numbers to each other over and over again: the numbers will grow into infinity, but the sets will stabilize.

Consequently, a system built on CRDTs can safely do away with a lot of the coördination and error handling work that makes typical distributed systems so difficult. In effect, as long as our upgraded, CRDT-based von Neumann machines can broadcast their local state on a regular basis, the system will naturally converge to a stable and correct global state without special effort. There’s no need to figure out a correct order of events, or synchronize updates between peers, or manage sophisticated undo or retry strategies, or aim for exactly-once delivery of messages, or anything like this. A smarter building block makes for a simpler system.

No free lunch

I’ve painted a pretty rosy picture of CRDTs, but reality is a bit more complicated. It’s not always easy to find CRDT equivalents of classic data structures like simple registers (variables), maps, lists, or sets. The CRDTs that do exist often come with annoying caveats, like registers that can only be set once, or sets that you can only insert into and never delete from. As you might expect, it can be awkward to build useful applications with such limited abstractions.

Getting more usable APIs from CRDTs is possible, but often means combining multiple low-level CRDTs together, which in turn frequently means that CRDTs require a lot more memory than their classical counterparts. That memory use also tends to grow in proportion to the number of nodes in the system, and in proportion to the amount of “history” that the data type needs to maintain. The result is that a single byte of usable state might cost tens, hundreds, or even thousands of bytes of physical memory. Optimizing for space is probably the single biggest challenge in providing production-capable CRDT systems, and there’s active research in the area, but it’s far from a solved problem.

Finally, while it’s true that CRDTs solve a lot of the difficult problems of distributed systems, they do so by essentially duplicating effort: rather than coördinating a single update from one POP to another, a CRDT state system might just re-broadcast the same data multiple times, which carries costs of bandwidth and CPU. Again, there’s a world of possible optimizations, but it’s also an area of active research.

The road ahead

Although there are challenges to a system of this design, we believe that they’re fundamentally solvable, and that CRDTs represent the most promising direction for writable state at the edge. To that end, Fastly is actively prototyping and deploying a CRDT-based edge state system, which we’ll continue to talk more about in the future.

We’re taking an integrated approach, leveraging Fastly’s unique system and network architecture to help solve, or work around, some of the challenges that CRDTs present. And we’re taking a conservative approach in terms of risk, rolling out specific, tightly-scoped products first, and slowly expanding to more general products once we have sufficient operational experience. In fact, it’s likely you’ll be using a CRDT-backed Fastly product without even knowing about it — and well before you hear about edge state in the context of Compute.

The goal is a theoretically sound, operationally simple, reliable, and — above all else — fast system for writable state at the edge. We’ll keep you up to date on our progress.

For more on this subject, and a deeper dive into the prototype edge state system, see my below talk at QCon London on Infinite Parallel Universes: State at the Edge, or read this interview on the same topic.

*as of Dec. 31, 2019