Addressing the challenges of TLS, revocation, and OCSP
An important aspect of secure systems design is planning for the full lifecycle of the system, including any foreseeable disaster scenarios. This is particularly true when dealing with cryptography. Rotation, expiration, and revocation of secrets are all important concerns that require careful up-front design. Transport Layer Security (TLS), the protocol underlying secure web traffic (HTTPS), is one of the most widely deployed and heavily used cryptographic systems, and serves as a good case study for all of the preceding concerns. In this post, I’ll discuss how revocation is addressed in TLS, and how it relates to both performance and security.
TLS relies on a binding of a public key to an identity (the domain of the website being accessed), delivered in the form of an X.509 certificate. The trustworthiness of this binding rests on an endorsement (a cryptographic signature) from a trusted third party, a certificate authority (CA).
Under normal circumstances, this binding is considered valid for a fixed period, defined by a start date/time and an end date/time. Outside of normal circumstances, there are a number of cases where this binding might no longer be valid, for example if the subject of the binding loses their private key, or if it is stolen by an adversary. Ideally the certificate binding could be explicitly broken in these cases, rendering it untrusted by clients — a process referred to as revocation.
CRLs and OCSP: two approaches to revocation
There are two revocation technologies in the TLS space: Certificate Revocation Lists (CRLs) and the Online Certificate Status Protocol (OCSP). With CRLs, the CA provides a signed list of certificate serial numbers that should be considered revoked and invalid. Clients periodically fetch the CRL from the CA and compare the serial number of observed certificates against the list of revoked serials.
OCSP differs in that, as the name suggests, it is an online process: the browser must request information about the validity of a certificate at the time a server presents it. The query goes to the CA’s OCSP endpoint, which returns a statement on the validity of the certificate. To prevent forgery, the CA signs the returned data for both CRLs and OCSP.
With CRLs, the size of the response grows with the number of revocations as more serial numbers are added. This incurs a high bandwidth cost to learn the revocation status of certificates the client likely has not observed and may never observe in the wild. The poor signal-to-noise ratio of CRLs and the bandwidth required to fetch them have led several browser vendors (including both Firefox and Chrome) to stop retrieving or honoring CRLs, leaving OCSP as the only viable general-purpose revocation technology in place today.
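The client-side CRL check described above can be sketched in a few lines of Python. This is a minimal illustration rather than a real parser: the `RevocationList` type, the `is_revoked` helper, and the issuer name and serial numbers are all hypothetical, and the sketch assumes the CA's signature on the list has already been verified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RevocationList:
    """Hypothetical stand-in for a parsed CRL whose signature was already checked."""
    issuer: str
    revoked_serials: frozenset


def is_revoked(crl: RevocationList, issuer: str, serial: int) -> bool:
    # A CRL only speaks for certificates issued by the CA that signed it.
    if issuer != crl.issuer:
        raise ValueError("CRL was issued by a different CA")
    # Revocation is a simple membership test on the serial number.
    return serial in crl.revoked_serials


# Made-up issuer and serial numbers, for illustration only.
crl = RevocationList(issuer="Example CA", revoked_serials=frozenset({0x1A2B, 0x3C4D}))
print(is_revoked(crl, "Example CA", 0x1A2B))  # True
print(is_revoked(crl, "Example CA", 0x9999))  # False
```

Note that the client has to hold the entire set of revoked serials before it can answer a question about any single certificate.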
A challenge for performance and security
Unfortunately, OCSP leaves much to be desired in terms of both security and performance. OCSP requires an extra round-trip connection from the client to the CA’s OCSP endpoint before the TLS session can proceed to delivering content from the server to the client. Observers have placed the median time to connect to an OCSP server at ~300 milliseconds, with a mean of up to 1 second. This makes OCSP a performance bottleneck, potentially delaying the entire TLS connection.
On the security side, since OCSP requires the CA to respond to the request for certificate status in real time, a decision must be made on how to handle failure cases in the context of the TLS session.
If the CA’s OCSP endpoint is not reachable, a client must decide whether to treat the lack of a response as a failure (often called “hard fail”) or whether to ignore the lack of response and continue the TLS transaction anyway (referred to as a “soft fail”). The frequency with which an OCSP endpoint is unreachable due to network congestion or routing difficulties has led to a climate in which the majority of clients favor a soft fail approach. Consider the captive portal systems often seen on public WiFi at hotels and coffee shops: these portals frequently redirect initial traffic to a TLS-protected sign-on page. This leads to a chicken-and-egg problem in which the certificate for the captive portal must be verified with OCSP before the sign-on page can be presented, but the OCSP request will not be allowed past the captive portal until the user has signed in. Treating OCSP as a hard failure would render all of these portal-protected networks unusable.
Another factor influencing clients to favor soft failure is the potential for outages of HTTPS websites due to distributed denial-of-service (DDoS) attacks on OCSP endpoints. If OCSP were treated as a hard failure, then targeting OCSP endpoints with DDoS attacks could render thousands of independently hosted HTTPS services inaccessible due to certificates that could not be validated as unrevoked.
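Lastly, the OCSP protocol itself quickly falls apart in the presence of an active man-in-the-middle (MITM) attack. OCSP error responses such as “tryLater” are not signed by the CA, so an adversary can replace a genuine response with one telling the browser to “Try again later” without having to forge a CA signature, and a soft-failing client will then proceed as though nothing were wrong.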
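The difference between the two failure policies can be made concrete with a short Python sketch. The `OcspResult` enum and `accept_certificate` function are hypothetical names chosen for illustration, not part of any TLS library, and the sketch assumes the response has already been reduced to one of three outcomes.

```python
from enum import Enum


class OcspResult(Enum):
    GOOD = "good"
    REVOKED = "revoked"
    # Timeout, request blocked by a captive portal, or a "try later" answer.
    UNAVAILABLE = "unavailable"


def accept_certificate(result: OcspResult, hard_fail: bool) -> bool:
    """Return True if the client should continue the TLS session."""
    if result is OcspResult.GOOD:
        return True
    if result is OcspResult.REVOKED:
        return False
    # No usable answer: hard fail rejects, soft fail carries on regardless.
    return not hard_fail


print(accept_certificate(OcspResult.UNAVAILABLE, hard_fail=True))   # False
print(accept_certificate(OcspResult.UNAVAILABLE, hard_fail=False))  # True
```

Under the soft fail policy, anyone who can force the `UNAVAILABLE` outcome, whether by blocking the request or by substituting a "try later" answer, gets a revoked certificate accepted. That is why soft fail offers little protection against an active attacker.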
Overall it seems as though revocation is a grim matter, so what’s there to be done? On the performance side, you can eliminate the lag introduced by the OCSP round trip by relying on a newer extension to the TLS ecosystem known as OCSP stapling. With stapling, the server includes in its handshake not only the certificate chain binding the server’s identity to a public key, but also the CA-signed OCSP status response for the website’s certificate. This provides the client with the same information it would otherwise fetch over the extra round-trip connection, but delivers it as part of the existing TLS handshake, saving the client the delay of a separate connection.
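As a rough sketch of where the saving comes from, the Python below models a client that prefers a stapled status and only contacts the CA when no staple is present. The names `lookup_status` and `fake_ocsp_endpoint` are hypothetical, the network request is replaced by a placeholder function, and a real client would also verify the signature and freshness of the stapled response.

```python
from typing import Callable, Optional, Tuple


def lookup_status(stapled_status: Optional[str],
                  query_ocsp_endpoint: Callable[[], str]) -> Tuple[str, int]:
    """Return (status, extra_round_trips) for one TLS handshake."""
    if stapled_status is not None:
        # The server already attached the CA-signed response to the handshake.
        return stapled_status, 0
    # No staple: the client pays for a separate connection to the CA.
    return query_ocsp_endpoint(), 1


def fake_ocsp_endpoint() -> str:
    # Placeholder for a real network request to the CA's OCSP endpoint.
    return "good"


print(lookup_status("good", fake_ocsp_endpoint))  # ('good', 0)
print(lookup_status(None, fake_ocsp_endpoint))    # ('good', 1)
```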
In the short term, things are still rather dreary on the security side for two primary reasons. First, there is presently no standardized mechanism by which a website can tell clients that it will always staple an OCSP response to the TLS handshake to prove certificate validity. This means clients are left treating a server reply without an OCSP staple as a soft failure — there is no mechanism to reliably determine if a staple was expected.
Second, the existing OCSP stapling mechanism only allows one OCSP status to be included, meaning that for websites serving both a leaf certificate and one or more intermediate certificates, it is not possible to ascertain the validity of the whole chain from the included OCSP staple. Both of these issues are being addressed in draft standards. The first is addressed by a mechanism, similar in spirit to HSTS, that will allow certificates to specify that they must be accompanied by an OCSP staple. This gives websites a way to ask to be treated more strictly with regard to OCSP staples and revocation status.
Similarly, the single-staple restriction is being addressed by a modification to the OCSP stapling standard that allows multiple statuses to be stapled into the handshake at once. Some browsers have also created independent revocation mechanisms that deliver CRL-like information through the browser’s regular software update mechanism, for example Chrome’s CRLSet project.
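To illustrate how a "must staple" signal could change client behavior, here is a minimal Python sketch. It assumes the client has already extracted a boolean from the certificate and a status string from the staple. The function name, parameters, and return values are hypothetical and are not taken from the draft standards.

```python
from typing import Optional


def evaluate_handshake(requires_staple: bool, stapled_status: Optional[str]) -> str:
    """Decide what to do with a handshake, given an optional stapled status."""
    if stapled_status == "revoked":
        return "reject"
    if stapled_status == "good":
        return "accept"
    # No usable staple was presented.
    if requires_staple:
        # The certificate promised a staple, so its absence is a hard failure.
        return "reject"
    # Without that promise, the client cannot tell whether a staple was expected.
    return "soft-fail"


print(evaluate_handshake(requires_staple=True, stapled_status=None))   # reject
print(evaluate_handshake(requires_staple=False, stapled_status=None))  # soft-fail
```

The point of the flag is the third branch: an attacker who strips the staple from the handshake can no longer push the client back into soft fail behavior.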
Looking longer term, research is also underway to attempt to build a certificate transparency-inspired project for the purposes of revocation, an effort being referred to as “revocation transparency.”
Fastly cares deeply about performance and has been an early adopter of OCSP stapling. All of our TLS properties use OCSP stapling today, avoiding unnecessary round-trip connections. As standards for OCSP “must staple” and multiple certificate status request stapling evolve, we will work to quickly adopt them in order to address security issues related to revocation.