Don’t DIY a secure platform: lessons learned from side-channel attacks
Cloud infrastructures have given us platforms on which to build applications, offering developers components up and down the stack. With this liberation comes a push and pull between how much of an application we build from scratch on these platforms and how much we subscribe to as a service. Securing the full integration of the platform is an important issue to consider when figuring out where to draw the build-versus-subscribe line during design.
There’s a well-known security cliché: “don’t roll your own crypto.” That remains solid advice, but “don’t roll your own platform” should be its follow-up. Building a platform brings together so many possible combinations of components that the surface areas between them practically beg for exploitation by clever attackers — and attackers are very clever. Customizing a platform isn’t a “set it and forget it” job; it’s one that’s constantly evolving — and a platform like Fastly’s, which pre-integrates computing services with caching and networking services, is well suited for fending off these attacks as they evolve.
In this post, though, let’s explore what can go wrong. The largest category of these difficult-to-anticipate security design weaknesses comes from side-channel attacks. These are vulnerabilities that occur when an algorithm is internally secure but interacts with its environment in a surprising way that allows attackers to infer its secrets.
I thought it would be fun and educational — in equal parts — to take a brief tour of some of the more foundational and out-there side-channel exploits that have afflicted the security-conscious over the years. I’ll keep the tour at a high level to get the basic ideas across, but the details are linked for those who enjoy a full seminar.
Let’s start the tour with a classic dating back to (at least) the 1980s. While Van Eck phreaking is mostly of historical interest now, it serves as a great example of the side-channel genre. The key insight in this attack was that a computer monitor emits electromagnetic radiation in predictable ways as a side effect of drawing an image on the screen. Most people would recognize it as static on the radio, but a determined attacker can translate that static back into a decent reproduction of the screen itself. Even with a non-networked computer in a room without windows, the side effects of a monitor drawing an image formed an exploitable side channel. Reflecting on the absurdity of that can really give you an appreciation for the complexity of the problem.
These side effects are maddeningly difficult to predict and, as time goes by, the attacks only get more clever. One of my recent favorites, admittedly less about computer security and more about physical security, was published just this summer. Researchers were able to use audio recordings of a key being inserted into a lock and then 3D-print the matching key from the recording! The sound of the key became the side channel. Good thing we don’t all have microphones in our pockets or on our kitchen counters to hear them.
Time is of the essence
Clocks make great side channels too, and it’s not hard to see why: they can accidentally reveal information unanticipated by the security algorithm. The Raccoon attack is a recent entry in this category of attack, one with many precedents.
This attack exploits a flaw in a pretty deep part of TLS 1.2 and earlier, where the specification for a common variant requires that leading zeros of a secret value be discarded before performing a hash, in the same way you might discard the leading zeros of a number like 00314159 before displaying it on a screen. The timing attack happens because computing the hash is easier (and thus quicker) on 314159 than on 00314159. The attacker can therefore estimate a leading zero or two, and thus learn something about the secret input, just by watching the clock and measuring the speed of TLS.
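To make the leak concrete, here’s a toy Python sketch — not the actual TLS code, just the shape of the problem. The function and the sample byte strings are invented for illustration; the point is that the amount of data hashed (and therefore the hash’s running time) depends on the secret itself:

```python
import hashlib

def leading_zero_stripped_hash(secret: bytes) -> bytes:
    # Mirrors the TLS 1.2-era behavior the Raccoon attack exploits:
    # leading zero bytes of the shared secret are stripped before
    # hashing, so the input length depends on the secret.
    return hashlib.sha256(secret.lstrip(b"\x00")).digest()

# Two secrets with the same nominal length...
with_zeros = b"\x00\x00" + b"\x31\x41\x59" * 10
without_zeros = b"\x77\x77" + b"\x31\x41\x59" * 10

# ...feed different numbers of bytes into the hash. An attacker timing
# many handshakes can use that difference to guess at leading zeros.
print(len(with_zeros.lstrip(b"\x00")))     # 30 bytes hashed
print(len(without_zeros.lstrip(b"\x00")))  # 32 bytes hashed
```

The fix in modern designs is to process a fixed-width representation so the work done never varies with the secret’s value.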
For the last few years, caches of all sorts have been the hot source of timing analysis side-channel flaws. Caches intentionally store the result of past computations in order to speed up future ones. Unfortunately, this can effectively leak history in between entities that aren’t supposed to know each other’s history.
A web browser cache is a great example to get started here. One of the core responsibilities of a browser is to keep different websites separated — your browsing history at site A is none of site B’s business. However, site B may be able to determine whether or not site A is in your cache by loading a site A subresource itself and observing how long it takes to load. A very fast load implies it was in your cache and you’ve been there before — cache timing becomes a side channel. As a result, more and more types of caches are being double-keyed by “first party” (i.e., the website that triggers the load), which increases privacy but reduces the effectiveness of the cache.
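The probe is simple enough to model in a few lines. This is a hedged sketch — a toy in-process cache with made-up URLs standing in for a real browser cache and network — but the inference step is the same: time your own load and threshold it.

```python
import time

class ToyBrowserCache:
    """Stand-in for a shared (single-keyed) browser cache:
    misses are slow, hits are fast."""
    def __init__(self):
        self._store = {}

    def load(self, url):
        if url not in self._store:
            time.sleep(0.05)              # simulate a network fetch
            self._store[url] = b"<bytes>"
        return self._store[url]

cache = ToyBrowserCache()

# The user visits "site A", warming the cache with its logo.
cache.load("https://site-a.example/logo.png")

def site_b_probe(url):
    """'Site B' never sees site A's traffic -- it only times its own
    load of a site A subresource and infers the cache state."""
    start = time.perf_counter()
    cache.load(url)
    return time.perf_counter() - start < 0.01   # fast => already cached

print(site_b_probe("https://site-a.example/logo.png"))   # True: visited
print(site_b_probe("https://site-a.example/promo.png"))  # False: cold entry
```

Double-keying defeats this by giving site B its own partition of the cache, so its probe can never hit an entry warmed by site A.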
This notion of cache entries leaving fingerprints is endemic across layers for the very reason caches are useful implementation devices: they provide transparent performance boosts, and it’s that very transparency that leaks information. It’s not just client-side applications such as browsers that have this problem; multi-tenant architectures on the server side have it as well. Pythia exploits the transparent caching of page tables in the RDMA hardware used for efficient storage access in datacenter and clustered computing. The usual playbook is used for this attack — two services that are supposed to be isolated from each other can probe each other’s history by carefully timing the responses to their own requests, inferring whether or not some information was cached by the actions of the other party.
Turtles all the way down
As mentioned above, many of these problems are rooted in the abstractions necessary to bring the technologies into existence at all. So, as a systems designer, it can feel comforting to reach the bottom of the stack with the real hardware. CPUs, chips, RAM — good old Von Neumann architectures you can sink your teeth into! But of course, it’s just an illusion — those may be virtual machines, they may have sub-processors of their own, they may have changeable microcode, or they may even contain significant abstractions just to break their own complexity into more manageable parts. It’s turtles all the way down. So the integration problems continue. Let’s turn our focus to what seem to be fundamental pieces, the CPU, RAM, and network.
At this point, you may be able to see the infamous trainwreck of Spectre-branded transient execution attacks coming down the tracks at our story. Many parts of Spectre are familiar, as it’s based on footprints left in caches and on timing analysis. Spectre takes it up a notch by linking these things to the internal behavior of the CPU. Flaws in the core logic of the processor are the gift that keeps on giving to attackers. In a Spectre-class attack, the cache gets mutated by speculative execution that is later discarded. Practically, that means the CPU was, quite literally, calculating what would have happened had something other than what really happened actually happened, just in case it happened.
It’s Schrödinger’s-cat-level confusing, but the issue is that this CPU-level conjecture leaves behind evidence in the processor’s own cache, which the wrong code can get its hands on through timing attacks similar to the other cache attacks described above. This isn’t a single attack but a whole class of problems that continues to evolve and, right now, basically requires a thoughtful, holistic, defense-in-depth strategy to mitigate.
While most side channels are about attackers learning data they aren’t supposed to be able to read, some related approaches allow them to write it too. Rowhammer is a classic example. In Rowhammer, an attacker intentionally exercises one piece of RAM so vigorously that the electrical disturbance flips bits in neighboring RAM, without any consultation with the operating system or application about whether or not that’s a secure thing to do. This is often thought of as a fault attack, where access to the neighboring RAM acts as the side-channel-like conduit for exploitation.
I want to touch on one more attack because it is so fiendishly clever in identifying an unusual side channel to use. Reflecting on this reminds me to be forever open-minded about what I think I know. This attack doesn’t rely on something as prosaic as a shared cache between the attacker and victim, but instead looks at the shared network path they might both use through the victim’s home wireless router.
TCP is generally considered insecure against on-path attackers but robust to off-path attacks. So if the attacker is not able to see the traffic between the client and server (i.e., it is off the transport path), it is not supposed to be able to manipulate the connection. The exploitable secret that is not available off path is the 48 bits formed by the 16-bit client port number combined with the 32-bit TCP sequence number. If an off-path attacker obtains this information, it can easily inject any data it likes or even terminate the connection at will.
The key to this attack is that the victim reacts with different sized responses to the attacker-controlled probes of the secret based on whether the attacker was correct, way off, or sort of close. Normally, that wouldn’t matter because the off-path attacker cannot see the different responses — but if the attacker creates a separate connection with the victim in parallel with the attacks, it can observe the impact of the hidden messages on something they share — the WiFi router. Bigger messages on the off-path channel slow down the messages on the on-path channel in an observable way, and this provides enough information to infer which category the guess fell into. This lets the attacker play the game of “hot or cold” to bisect the secret space quickly rather than having to explore the whole space looking for an exact match. So smart!
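The speedup from graded feedback is worth seeing in miniature. Here’s a hedged toy model — the oracle, window mechanics, and numbers are illustrative inventions, not the real TCP probe — showing how a hot-or-cold answer per probe lets the attacker halve the candidate range each time instead of guessing values one by one:

```python
def in_window(block_lo, block_hi, secret):
    # Models the victim's size-differentiated response. The off-path
    # attacker can't read the reply itself, but infers this one bit
    # (hot/cold) from congestion on the shared WiFi router.
    return block_lo <= secret <= block_hi

def bisect_secret(space_size, secret):
    """Hot-or-cold search over [0, space_size): each probe halves the
    candidate range instead of testing a single value."""
    lo, hi, probes = 0, space_size - 1, 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if in_window(lo, mid, secret):   # "hot": secret is in lower half
            hi = mid
        else:                            # "cold": secret is in upper half
            lo = mid + 1
    return lo, probes

found, probes = bisect_secret(1 << 16, 0xBEEF)
print(found == 0xBEEF, probes)   # 16 probes instead of up to 65,536 guesses
```

A 16-bit toy space falls in 16 probes; the same logarithmic scaling is what makes a 48-bit secret reachable in practice.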
These are all good examples of what may not be obvious when you set out on a DIY project. Every one of these examples — and this is just a taste of the possibilities — is exceedingly clever, and you have to respect them. I’d much rather you worry about the creative parts of application building and leverage an integrated edge cloud platform that’s designed, scaled, and maintained to support that style of application without requiring your constant energy.
Protecting applications from these flaws, and other vulnerabilities like them, is more a process than a discrete event: carefully consider the full context of the application, document the exposures, implement mitigations where you can, define ABIs that consider time and space as well as data definitions, and review and revise constantly. “Ship it and forget it” will lead to vulnerabilities over time. Even vigilant patching, as important as it is, can overlook the environmental combinations at the core of these exploits, and developers of individual components have only partial visibility into the problem. Customizing a platform is a full-time job.