OpenTelemetry Part 2: Using OpenTelemetry in VCL

We're starting to get excited about OpenTelemetry, and want you to be able to observe your Fastly services just like you do with apps running in your core cloud provider — and see the stories of your end users' journeys mapped end to end.  VCL services can emit OpenTelemetry data, and be part of that story.

Fastly customers often have complex architectures of many interconnected systems, in which Fastly is one amongst many PaaS, SaaS, serverless, frontend and native app components. Too often, observability tools, like logging, shine a light on only one part of the system at a time, making it hard to diagnose issues where a transaction has traversed multiple systems (which is, basically, every transaction!).

You should be able to visualize your whole architecture as one seamless machine and figure out what's going on in each part of it, in the context of the overall system and the way transactions flow from one end to the other.  This has always been a hard problem, and OpenTelemetry can help.  We wrote more about why OpenTelemetry is exciting in part one of this blog series; now I'm going to look at how you can emit OpenTelemetry data from Fastly services that run VCL.

Inside the OpenTelemetry protocol

Integrating OpenTelemetry into VCL is a rather different challenge from using it in a fully featured, general-purpose language like JavaScript or Go.  VCL (as run by Fastly within our core Delivery product) doesn't support dependencies, has a very limited standard library, and cannot make arbitrary HTTP requests.

It is possible to use OpenTelemetry libraries in our Compute platform, and we'll cover that in part 3!

However, our real time logging platform is ideal for sending serialized data to an external processor, and works just fine with the OpenTelemetry Collector.  So while we can't run any of OpenTelemetry's language-specific SDKs or libraries, we can piece together the relevant data into a string in the OpenTelemetry protocol format, and emit it using real time logging.

This isn't as scary as it sounds, and it's actually a nice way to demystify OpenTelemetry.  Taking a look at the protocol specification, we can see the definitions of a trace span and a log event, and then construct these in VCL.

I'm assuming for the purposes of this blog that you have a Fastly service with custom VCL and you've started out with our recommended VCL boilerplate, but you can also achieve this with VCL snippets if that's your thing.  Start with some utility subroutines to generate OTel-friendly timestamps and identifiers (place these at the top of your custom VCL file or in an "init" VCL snippet):

sub time_now_ns STRING {
  # OTel wants nanoseconds; the best VCL offers is microseconds,
  # so multiply by 1000
  declare local var.time_now INTEGER;
  set var.time_now = std.atoi(time.start.usec);
  set var.time_now += std.atoi(time.elapsed.usec);
  set var.time_now *= 1000;
  return var.time_now;
}
sub random_8bit_identifier STRING {
  # 16 hex digits = an 8-byte (64-bit) ID, the size OTel uses for span IDs
  declare local var.id STRING;
  set var.id = randomstr(16, "0123456789abcdef");
  return var.id;
}
sub random_16bit_identifier STRING {
  # 32 hex digits = a 16-byte (128-bit) ID, the size OTel uses for trace IDs
  declare local var.id STRING;
  set var.id = randomstr(32, "0123456789abcdef");
  return var.id;
}

OpenTelemetry's timing unit is nanoseconds, whereas the highest resolution available in VCL is microseconds (μsec), but we can create a nanosecond timestamp by multiplying the microsecond time by 1000.

Next, all OpenTelemetry objects share a common definition of a "resource" (see spec), so we can make a subroutine to output that (VCL doesn't have a way to serialize JSON, so roll up your sleeves for a bit of string concatenation!):

sub otel_resource STRING {
  declare local var.str STRING;
  set var.str = {"{ "attributes": [ "}
    {"{ "key": "service.name", "value": { "stringValue": "Fastly www" } }, "}
    {"{ "key": "telemetry.sdk.language", "value": { "stringValue": "vcl" } }, "}
    {"{ "key": "telemetry.sdk.name", "value": { "stringValue": "opentelemetry" } }, "}
    {"{ "key": "telemetry.sdk.version", "value": { "stringValue": "1.0.1" } }, "}
    {"{ "key": "host.name", "value": { "stringValue": ""} server.identity {"" } }"}
    {"], "droppedAttributesCount": 0 }"};
  return var.str;
}

Thanks to OpenTelemetry's semantic conventions, we can also define a reusable set of attributes that will be comparable with other systems that output OTel data, and suitable for splicing in anywhere that OpenTelemetry expects a KeyValueList:

sub otel_attributes_general STRING {
  declare local var.data STRING;
  # Note: the last attribute ends with a trailing comma, because more
  # attributes are appended wherever this string is spliced in
  set var.data = ""
    {"{ "key": "http.method", "value": { "stringValue": ""} req.method {"" } },"}
    {"{ "key": "http.target", "value": { "stringValue": ""} req.url {"" } },"}
    {"{ "key": "http.host", "value": { "stringValue": ""} req.http.host {"" } },"}
    {"{ "key": "http.protocol", "value": { "stringValue": ""} req.protocol {"" } },"}
    {"{ "key": "http.client_ip", "value": { "stringValue": ""} client.ip {"" } },"}
    {"{ "key": "fastly.restarts", "value": { "stringValue": ""} req.restarts {"" } },"}
    {"{ "key": "fastly.visits_this_service", "value": { "stringValue": ""} fastly.ff.visits_this_service {"" } },"}
    {"{ "key": "fastly.server_role", "value": { "stringValue": ""} req.http.x-trace-server-role {"" } },"}
    {"{ "key": "fastly.server_ip", "value": { "stringValue": ""} server.ip {"" } },"}
    {"{ "key": "fastly.server_id", "value": { "stringValue": ""} server.identity {"" } },"}
    {"{ "key": "fastly.vcl_version", "value": { "stringValue": ""} req.vcl.version {"" } },"}
    {"{ "key": "fastly.pop", "value": { "stringValue": ""} server.datacenter {"" } },"}
    {"{ "key": "fastly.workspace.overflowed", "value": { "stringValue": ""} workspace.overflowed {"" } },"}
    {"{ "key": "fastly.workspace.bytes_total", "value": { "stringValue": ""} workspace.bytes_total {"" } },"}
    {"{ "key": "fastly.workspace.bytes_free", "value": { "stringValue": ""} workspace.bytes_free {"" } },"}
  ;
  return var.data;
}

To make this as useful as possible, it's a good idea to limit the variables used in this subroutine to those that are available in all VCL subroutines.  The VCL compiler will prevent you from calling a custom subroutine from a place in the workflow where a variable it uses may not be defined.  For example, I haven't used any variables in the bereq, beresp or obj namespaces, which are not available in all parts of the VCL workflow.

This can now be put together to record spans in VCL.  If you wanted to record a span for the whole of a request's lifetime within a Fastly POP, you could begin with this at the top of vcl_recv:

if (req.restarts == 0) {
  # Create a span ID for this Fastly request
  set req.http.x-trace-vcl-span-id = random_8bit_identifier();
  # Join an existing trace if the client sent a W3C traceparent header;
  # otherwise start a new trace
  if (req.http.traceparent ~ "^\d+-(\w+)-(\w+)-\d+$") {
    set req.http.x-trace-id = re.group.1;
    set req.http.x-trace-parent-span-id = re.group.2;
  } else {
    set req.http.x-trace-id = random_16bit_identifier();
  }
  set req.http.x-trace-server-role = if (fastly.ff.visits_this_service == 0, "edge", "shield");
}

And then this at the end of vcl_log:

declare local var.otel_resource STRING;
declare local var.otel_attribs STRING;
declare local var.time_start_ns STRING;
declare local var.time_now_ns STRING;

# Appending "000" converts the microsecond timestamp to nanoseconds
set var.time_start_ns = time.start.usec "000";
set var.time_now_ns = time_now_ns();
set var.otel_resource = otel_resource();
set var.otel_attribs = otel_attributes_general();

log "syslog " req.service_id " otel_collector_http :: "
  {"{ "resourceSpans": [ { "}
  {""resource": "} var.otel_resource {", "}
  {""instrumentationLibrarySpans": [ { "spans": [ { "}
  {""traceId": ""} req.http.x-trace-id {"", "}
  {""spanId": ""} req.http.x-trace-vcl-span-id {"", "}
  if(req.http.x-trace-parent-span-id,
    {""parentSpanId": ""} req.http.x-trace-parent-span-id {"", "},
    "")
  {""name": "Fastly request processing", "}
  {""kind": 1, "}
  {""startTimeUnixNano": "} var.time_start_ns {", "}
  {""endTimeUnixNano": "} var.time_now_ns {", "}
  {""attributes": [ "}
  var.otel_attribs
  {"{ "key": "http.user_agent", "value": { "stringValue": ""} req.http.User-Agent {"" } }, "}
  {"{ "key": "http.status_code", "value": { "stringValue": ""} resp.status {"" } }"}
  {"], "}
  {""status": { "code":"} if (fastly_info.state ~ "ERROR", "2", "0") {" }, "}
  {""links": [], "}
  {""droppedLinksCount": 0"}
  {"} ] } ]"}
  {"} ] }"}
;

Here we're combining the output of otel_attributes_general() with some additional attributes that are only available in vcl_log or relevant to this particular span. The output of otel_resource() is also concatenated in as the value of a resource key.

An OpenTelemetry span is actually not terribly complicated - ultimately it can be represented as a lump of JSON, and the spec does a good job of describing the format quite precisely. The lack of JSON serialization functions in VCL makes the code somewhat ugly, but I enjoyed getting a better understanding of how OpenTelemetry actually works.
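To make that concrete, here's roughly the payload that the log statement above produces, pretty-printed, with illustrative IDs and timestamps and most of the attributes trimmed:

{
  "resourceSpans": [{
    "resource": {
      "attributes": [
        { "key": "service.name", "value": { "stringValue": "Fastly www" } }
      ],
      "droppedAttributesCount": 0
    },
    "instrumentationLibrarySpans": [{
      "spans": [{
        "traceId": "0af7651916cd43dd8448eb211c80319c",
        "spanId": "b7ad6b7169203331",
        "name": "Fastly request processing",
        "kind": 1,
        "startTimeUnixNano": "1656941921000000000",
        "endTimeUnixNano": "1656941921042000000",
        "attributes": [
          { "key": "http.method", "value": { "stringValue": "GET" } }
        ],
        "status": { "code": 0 },
        "links": [],
        "droppedLinksCount": 0
      }]
    }]
  }]
}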

Shipping the data

Another difference between VCL and Compute services is that VCL cannot make arbitrary backend requests - so we need another means of extracting the OTel data.

Our real time logging platform is designed for collection, batching and dispatch of extremely high volumes of log messages in a highly efficient way, so it'd be good to be able to take advantage of this.  We don't (yet) offer OTel's preferred protobuf/gRPC as a log endpoint type, but we do support arbitrary HTTP POST endpoints, and the OpenTelemetry Collector supports OTLP over HTTP POST as an ingestion mechanism.  However, there are still a few problems/limitations:

  1. To prevent people using Fastly as a DDoS platform, we require that hosts you add as HTTP log endpoints "opt in" by answering a simple challenge sent to /.well-known/fastly/logging/challenge.

  2. Our logging system will batch multiple log events into a single, newline delimited HTTP request body.  The OpenTelemetry Collector is not able to unbatch these and expects only one event per request.

  3. For things that happen at a single moment in time, OpenTelemetry supports a message format for logs.  At time of writing, this is still experimental, and no OpenTelemetry tooling is able to attach logs to spans, even if span and trace IDs are included in the log events. However, the trace spec includes the concept of span events, which look a lot like logs and are supported by many tools like Honeycomb, Jaeger and Zipkin.

I decided to solve all of these issues at the same time with a proxy in front of the OpenTelemetry collector, whose job is to unbatch multi-line events, convert logs into span events, and respond to the Fastly opt-in challenge.  I built an example app that does this and threw it up on Glitch.  This is also a nice way to validate the data coming from Fastly without involving the OTel collector.
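If you want to build something similar yourself, here's a minimal sketch of such a proxy in TypeScript (Node 18+, for the built-in fetch).  The collector URL and port are assumptions, and the log-to-span-event conversion (problem 3) is omitted for brevity:

// Minimal unbatching proxy that sits in front of the OTel Collector.
import { createServer } from "node:http";

// 4318 is the standard OTLP/HTTP port; adjust for your collector
const COLLECTOR_URL =
  process.env.COLLECTOR_URL ?? "http://localhost:4318/v1/traces";

const server = createServer(async (req, res) => {
  if (req.url === "/.well-known/fastly/logging/challenge") {
    // Opt-in challenge: "*" allows any Fastly service; you can instead
    // return SHA-256 hashes of specific service IDs, one per line
    res.writeHead(200).end("*\n");
    return;
  }

  // Fastly batches multiple log events into one newline-delimited body
  const chunks: Buffer[] = [];
  for await (const chunk of req) chunks.push(chunk as Buffer);
  const body = Buffer.concat(chunks).toString("utf8");

  // Unbatch: forward each line to the collector as its own OTLP/HTTP POST
  for (const line of body.split("\n")) {
    if (!line.trim()) continue;
    await fetch(COLLECTOR_URL, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: line,
    });
  }

  res.writeHead(202).end();
});

server.listen(Number(process.env.PORT ?? 3000));

Point an HTTPS log endpoint in your Fastly service at a host running something like this, and each log event will reach the collector as an individual OTLP request.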

Trace context propagation

The magic that makes it possible to see spans from backend systems as children of the Fastly spans within the same trace is trace context propagation: specifically, the W3C Trace Context spec, which defines the traceparent header.  In the VCL above, we already adopt a trace ID and parent span ID if they are included in the request received by Fastly.  Fastly also needs to forward the trace ID and the ID of the current span when making requests to backends, so that the backend can emit a child span that references the Fastly span as its parent.
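For reference, a traceparent value carries four dash-separated fields: version, trace ID, parent span ID, and trace flags.  The example below is the one used in the W3C spec:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01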

This can be done in VCL by creating another custom subroutine and calling it from vcl_miss and vcl_pass:

sub telem_start_backend_fetch {
  # Propagate the trace ID and the current (Fastly request) span ID to the
  # backend in a W3C traceparent header, so that the backend's spans nest
  # under the span we emit in vcl_log
  set bereq.http.traceparent = "00-" + req.http.x-trace-id + "-" + req.http.x-trace-vcl-span-id + "-01";
  # Avoid leaking internal headers to backends
  unset bereq.http.x-trace-id;
  unset bereq.http.x-trace-parent-span-id;
  unset bereq.http.x-trace-server-role;
  unset bereq.http.x-trace-vcl-span-id;
  # Leapfrog cloud service infra that creates 'ghost spans'
  set bereq.http.x-traceparent = bereq.http.traceparent;
}

The first set statement in this subroutine adds the traceparent header to requests leaving Fastly.  We can also use this opportunity to remove the telemetry headers that are being used internally to save state in VCL, and avoid forwarding those to origin.

Finally, I found another gotcha with OpenTelemetry. Several PaaS providers I tried (e.g. Google's App Engine) actually mutate the traceparent header, presumably because their own tracing system is recording a span.  The problem is, if your collector doesn't get access to that span's data, your backend spans will be detached from your edge spans (they will at least belong to the same trace, but nesting will be broken). Hopefully cloud providers will fix this, but in the meantime, I found a workaround was to send an x-traceparent header that was a copy of the traceparent header.
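To complete the workaround, whatever handles the request at your backend needs to prefer x-traceparent over the possibly-mutated traceparent before trace context is extracted.  Here's a minimal sketch in TypeScript (Node) under the assumption that you wire the result into your own tracing setup; if you use an OTel SDK with automatic instrumentation, the same swap would belong in a custom propagator instead:

import { createServer } from "node:http";

const server = createServer((req, res) => {
  // Prefer the copy Fastly sent in x-traceparent, since intermediaries
  // may have rewritten traceparent itself
  const raw = req.headers["x-traceparent"] ?? req.headers["traceparent"];
  const traceparent = Array.isArray(raw) ? raw[0] : raw;

  // ...hand `traceparent` to your tracing setup and onward requests...
  res.end(`trace context: ${traceparent ?? "none"}\n`);
});

server.listen(Number(process.env.PORT ?? 3000));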

As long as every independent component of your system is able to read and respect incoming traceparent headers, and add one to every outgoing HTTP request, OpenTelemetry can create a holistic story charting a single transaction as it moves through the whole system.  Fastly is one part of your system, and we know we're not the only part, so we want to be as inspectable and debuggable as possible.  If you try instrumenting a Fastly VCL service using OpenTelemetry let us know how you get on!

