Unicode in VCL

Principal Software Engineer, Edge Delivery

November 07, 2018

There's more to life than just the Latin alphabet. Because we're a global platform whose users write in all kinds of writing systems, we recently added the ability to write synthetic responses (for example, a web page with an error message) in UTF-8 in Fastly VCL.

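# In vcl_fetch: turn an origin 503 into a custom internal error code
if (beresp.status == 503) {
  error 720;
}

# In vcl_error: build the synthetic response for that code
if (obj.status == 720) {
  set obj.status = 503;
  synthetic "メンテナンス中です";
  return (deliver);
}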

In this post, I'll share some of the behind-the-scenes work to show how we did that.

Strings in VCL

There are several ways to write string literals in Fastly VCL, offering different features:

Double-quoted strings with percent escapes: "...%xx..."
Long strings: {"..."}
"Heredoc" style long strings: {xyz"..."xyz}
LF is a convenience for a single newline character

UTF-8 encodes each Unicode code point as a sequence of one to four bytes; previously, the bytes of any non-ASCII character had to be written as individually escaped hex values:

synthetic "%f0%9f%90%8b%f0%9f%8c%8a%f0%9f%8c%8a"; # Three code points! "🐋🌊🌊"

This was quite cumbersome because each Unicode code point had to be converted by hand to its UTF-8 byte sequence, and it's especially difficult to see where one code point's sequence ends and the next begins.
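To illustrate the manual conversion this required, here is a minimal Python sketch. Python is used only for illustration, and the helper name percent_escape is hypothetical; it is not part of VCL or any Fastly tooling. It encodes a string as UTF-8 and writes every byte as a %xx escape:

```python
def percent_escape(text: str) -> str:
    """Return text as a run of %xx escapes, one per UTF-8 byte.

    Hypothetical helper for illustration only. It escapes every byte,
    including plain ASCII, to keep the sketch short.
    """
    return "".join(f"%{byte:02x}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    # Each of these three code points needs four bytes in UTF-8.
    print(percent_escape("🐋🌊🌊"))
    # prints: %f0%9f%90%8b%f0%9f%8c%8a%f0%9f%8c%8a
```

Nothing in the output marks where one code point stops and the next starts, which is the readability problem described above.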

First, we extended the hex escaping for double-quoted strings to provide encoding for Unicode code points:

%uXXXX (exactly four hex digits)
%u{...} (one to six hex digits, but not to exceed U+10FFFF)

Here, the UTF-8 encoding is still present — it's done for you when the string is read and tokenized by the VCL compiler:

synthetic "%u{1F40B}%u{1F30A}%u{1F30A}"; # equivalent to the manual UTF-8 encoding above

That is a bit better, because you can see where one code point ends and the next begins. But it's still not particularly convenient to write, especially for human languages rather than emoji. That's the next step.
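To make the escape forms concrete, here is a rough Python sketch of how such escapes could be expanded into UTF-8 bytes. This is not Fastly's implementation: the function name expand_escapes, the regular expression, and the choice to reject surrogate code points are all assumptions made for illustration.

```python
import re

# The three escape forms described in the text.
_ESCAPE = re.compile(
    r"%u\{([0-9A-Fa-f]{1,6})\}"  # %u{...}: one to six hex digits
    r"|%u([0-9A-Fa-f]{4})"       # %uXXXX: exactly four hex digits
    r"|%([0-9A-Fa-f]{2})"        # %xx: one raw byte
)


def expand_escapes(literal: str) -> bytes:
    """Expand %xx, %uXXXX and %u{...} escapes into a UTF-8 byte string."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(literal):
        # Copy any unescaped text before this escape.
        out += literal[pos:match.start()].encode("utf-8")
        braced, four, byte = match.groups()
        if byte is not None:
            out.append(int(byte, 16))
        else:
            code_point = int(braced or four, 16)
            if code_point > 0x10FFFF:
                raise ValueError(f"code point out of range: {match.group()}")
            if 0xD800 <= code_point <= 0xDFFF:
                # Assumption of this sketch: surrogates are rejected,
                # since they have no valid UTF-8 encoding.
                raise ValueError(f"surrogate code point: {match.group()}")
            out += chr(code_point).encode("utf-8")
        pos = match.end()
    out += literal[pos:].encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    by_code_point = expand_escapes("%u{1F40B}%u{1F30A}%u{1F30A}")
    by_byte = expand_escapes("%f0%9f%90%8b%f0%9f%8c%8a%f0%9f%8c%8a")
    print(by_code_point == by_byte)  # prints: True
```

Both spellings produce the same bytes; the code point form just moves the UTF-8 encoding step from the author to the tool.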

VCL lexical tokens

When VCL is uploaded to Fastly, it's compiled before being given to Varnish to run. The VCL compiler goes through several steps internally, the first being to cut the VCL text up into lexical tokens. For example, synthetic "xyz"; is cut up into three tokens: the keyword synthetic, the string "xyz", and the semicolon ;. It's in this phase that whitespace is skipped.
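As a rough illustration of that tokenizing step, here is a toy tokenizer in Python. It is a sketch under simplifying assumptions, not the real VCL lexer: the token classes (word, string, punct) and the function name tokenize are made up for this example, keywords are not distinguished from other words, and strings have no escape handling.

```python
import re

# Toy token grammar; an assumption for illustration, not VCL's real grammar.
TOKEN = re.compile(r'''
    (?P<space>\s+)                       # whitespace, skipped below
  | (?P<string>"[^"]*")                  # double-quoted string, no escapes
  | (?P<word>[A-Za-z_][A-Za-z0-9_.]*)    # keywords and identifiers
  | (?P<punct>[;{}()=])                  # single-character punctuation
''', re.VERBOSE)


def tokenize(source: str) -> list[tuple[str, str]]:
    """Cut source into (kind, text) tokens, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}: {source[pos]!r}")
        if match.lastgroup != "space":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    print(tokenize('synthetic "xyz";'))
    # prints: [('word', 'synthetic'), ('string', '"xyz"'), ('punct', ';')]
```

Because Python strings are already sequences of code points, this toy version accepts non-ASCII text inside the quotes without extra work; a lexer that reads raw bytes has to decode UTF-8 itself.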

Until this point, all our string tokens have been ASCII only; the string "%u{1F40B}" itself only contains ASCII characters. To allow UTF-8 here, we needed to change the encoding for the input to each token to UTF-8. We could have done that just for string tokens specifically, but just to make sure we got it right, we did it for every token type.

In doing so, we found a few strange things which we cleared up:

Comments permitted arbitrary bytes. This is just asking for trouble, so we changed those to UTF-8 only, along with the rest of VCL.
Floating point numbers were actually treated as three tokens (123, ., 456), which could be separated by whitespace, and even comments! We made those a single token instead.
We documented the VCL types at docs.fastly.com/vcl/types
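The floating point change can be sketched with two toy lexers in Python. Both grammars are assumptions for illustration (digits, a dot, and # comments only), and the names OLD_RULES, NEW_RULES and lex are hypothetical; the real VCL grammar is richer.

```python
import re

SKIP = r"(?P<skip>\s+|#[^\n]*)"  # whitespace and # comments

# Old behaviour as described: a float is lexed as integer, dot, integer.
OLD_RULES = re.compile(SKIP + r"|(?P<int>[0-9]+)|(?P<dot>\.)")

# New behaviour: digits.digits is tried first and becomes one token.
NEW_RULES = re.compile(
    SKIP + r"|(?P<float>[0-9]+\.[0-9]+)|(?P<int>[0-9]+)|(?P<dot>\.)"
)


def lex(rules: re.Pattern, source: str) -> list[tuple[str, str]]:
    """Cut source into (kind, text) tokens, dropping skipped text."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = rules.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if match.lastgroup != "skip":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    # With the old rules, spacing inside the number makes no difference.
    print(lex(OLD_RULES, "123.456") == lex(OLD_RULES, "123 . 456"))  # prints: True
    # With the new rules, the number is a single token.
    print(lex(NEW_RULES, "123.456"))  # prints: [('float', '123.456')]
```

In this sketch the spaced form 123 . 456 still lexes under the new rules, but as three separate tokens rather than as a number.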

So all VCL source is UTF-8 now, and synthetic responses are just one part where that shows. And finally, you can write:

synthetic "🐋🌊🌊";

Or perhaps something more useful:

synthetic "メンテナンス中";
