Network error logging: collecting failure conditions from end users

There's almost no such thing as too much visibility when it comes to how users interact with web sites and applications. To that end, a holistic view that tells us about these experiences, their performance characteristics, and failure/error conditions will never stop being useful. So any mechanisms and metrics that serve this cause are extremely valuable.

At Fastly, operating a large, globally distributed network of servers gives us good insight into how web clients interact with our network from our servers’ point of view. But since we want a full, end-to-end view of what’s going on, we’re also big fans of mechanisms that give us a picture of what’s happening from the perspective of the client. This is why browser technologies such as Navigation Timing and Resource Timing (which we leverage for Fastly Insights) are valuable — they provide us metrics for understanding the performance characteristics of the client’s experience. And we’ve blogged about the relatively new Server Timing spec that lets us collect server metrics alongside browser metrics at the client.

In this post, we’ll dig into Network Error Logging, another spec that helps with visibility, and talk about what it does, how it works, and how you can use Fastly to collect the data it surfaces.

What is network error logging?

As the name implies, Network Error Logging (or NEL, as the cool kids call it) is a proposed W3C standard for capturing and collecting network error/failure events at the browser. While Navigation Timing, Resource Timing, and to some extent Server Timing are meant to provide visibility into client performance, NEL aims to provide insight into availability.

In other words, NEL’s intent is to uncover failures and provide a reporting mechanism for situations where a client fails to access a web resource altogether. A server never knows if a client couldn’t reach it, so server-side metrics can only go so far when it comes to unearthing some problems. NEL helps bridge that gap and get us much closer to a full, end-to-end picture.

How NEL works

In short, NEL detects network errors and failures at the browser — during the DNS, connection (TCP and TLS), or application (HTTP) phase of a request — and reports them to one or more reporting endpoints. It works in conjunction with the Reporting API, another W3C spec that defines a generic reporting mechanism for a browser, meant to enable any feature that needs reports delivered somewhere. NEL is one of the consumers of the Reporting API, but there can be others (e.g. Content Security Policy also uses this API). It’s NEL’s job to define error logging policies for origins and create reports for various error conditions. It’s then the job of the Reporting API to deliver those reports to one or more endpoints.

The mechanism is enabled through two response headers:

  • The somewhat oddly named and slightly yelly NEL response header defines the NEL policy for the origin that sends the header.

  • The Report-to header defines a group of one or more endpoints for report delivery. The NEL header then points to one of these endpoint groups by the group name specified here.

The NEL header is a JSON object and, in its simplest form, looks like this:

NEL: {"report_to": "network-errors", "max_age": 2592000}

The report_to member is required and specifies where the reports go (in this example, “network-errors” is the name of an endpoint group defined in the Report-to header, which we’ll see in a minute). The max_age member is also required and defines, in seconds, how long this policy is valid in the browser. A max_age of 0 removes the NEL policy from the client (and is the only time you can omit the report_to member).

In addition to report_to and max_age, there are a few other optional members that a NEL header may include:

  • The include_subdomains member is a boolean which indicates whether this policy applies to this origin and all its subdomains. It’s false by default.

  • The failure_fraction member defines a sampling rate for failure reporting. It’s a good way to manage report volume if you don’t want every single error recorded. It has a value between 0.0 and 1.0, inclusive. It defaults to 1.0.

  • Similarly, NEL allows reporting of successful transactions too. The success_fraction member defines the sampling rate for reporting of successes. It can also be between 0.0 and 1.0, but defaults to 0.0. Having both success and failure conditions recorded is a good way to gauge failure percentage.

So, for example, let’s say responses from https://example.com have a NEL header that looks like this:

NEL: {"report_to": "network-errors", "max_age": 2592000, "include_subdomains": true, "success_fraction": 0.01, "failure_fraction": 0.05}

This would register a NEL policy in the browser that’s valid for 30 days, will collect and report errors for example.com and its subdomains, and report 1% of the successes and 5% of the failures. The reports will be sent to whatever the Report-to header defines as the “network-errors” endpoint group.

The Report-to header is also a JSON object and defines a group of one or more endpoints for reporting. It’s a group because failover and load balancing functionality is built into the Reporting API. In its simplest form, it looks like this:

Report-to: {"group": "network-errors", "max_age": 2592000, "endpoints": [{"url": "https://nel.example.com/report"}]}

The group member specifies a name for the group of one or more endpoints being defined. It’s actually optional and defaults to “default”, but it’s probably good practice to use a meaningful name. Note that this group name is what the NEL header points to. As in the NEL header, max_age is required and defines the lifetime, in seconds, of this group, and a max_age of 0 removes the group from the client. There’s also an optional include_subdomains boolean member that means exactly the same thing as it does in the NEL header. If you want subdomain errors reported, it needs to be included in both headers.

The endpoints member is an array of JSON objects that define each endpoint within the group. Each endpoint specified in this array can have up to three members:

  • The endpoints[].url member is required and defines the URL where reports should be sent.

  • The endpoints[].priority member is an optional non-negative integer that defines failover behavior. It defaults to 1. When attempting report delivery, the endpoints with the lowest priority value are tried first; if delivery fails, the report is sent to the endpoints with the next lowest priority value.

  • The endpoints[].weight member is also an optional non-negative integer and defines load balancing behavior. It also defaults to 1. Endpoints in a group (assuming they have the same priority) are load balanced using their weights as ratios.

So, let’s consider this response header from https://example.com:

Report-to: {"group":"network-errors","max_age":2592000,"include_subdomains":true,"endpoints":[{"url":"https://nel1.example.com/report","priority":1,"weight":1},{"url":"https://nel2.example.com/report","priority":1,"weight":3},{"url":"https://nel3.example.com/report","priority":2}]}

This header defines a group named “network-errors” that’s valid for 30 days and will include reports for subdomains of example.com. Reports will be load balanced to https://nel1.example.com/report and https://nel2.example.com/report with 25% going to the former and 75% to the latter. If one of those endpoints fails, the other will get 100%. If they both fail, all reports will go to https://nel3.example.com/report.

Hopefully this makes it clearer how the NEL and Report-to headers work together to collect and report error conditions: the NEL header defines the policy and the Report-to header defines where the reports are sent.

Each individual report is a JSON object. Delivery is done by sending an array of one or more reports as the body of an HTTP POST request to a reporting endpoint. One important aspect of reporting is that reports aren’t necessarily sent in real time: the client may queue them up and send them in batches. Let’s look at a sample report to dig into this a little more:

[
  {
    "age": 666,
    "body": {
      "elapsed_time": 37,
      "method": "GET",
      "phase": "connection",
      "protocol": "http/1.1",
      "referrer": "https://www.example.com/",
      "sampling_fraction": 1,
      "server_ip": "1.2.3.4",
      "status_code": 0,
      "type": "tcp.reset"
    },
    "type": "network-error",
    "url": "https://www.example.com/image.png",
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
  }
]

The specs get into all these fields, but I want to focus on a couple of them.

As we discussed before, NEL reports DNS, connection, or application errors. This is indicated by the phase in the report body. In this example, this is a connection error. NEL defines a number of error types. In this case, the error type is tcp.reset, which means the TCP connection was reset while trying to fetch https://www.example.com/image.png.

Since reports may be queued up and sent in batches, age, elapsed_time, and a little math can help calculate the timing of errors. The age of the report is the number of milliseconds between when the error occurred and when the report was sent. In our example, the report was sent 666 milliseconds after the error happened (this delay can get much longer!). If we had “age”: 0 instead, it would’ve meant that the report was sent immediately after the error was detected. The elapsed_time tells us, also in milliseconds, how long it took between the start of the request and when the browser decided this was an error (or gave up). In this case, 37 milliseconds passed between when the client made the request to https://www.example.com/image.png and when it saw the TCP connection reset. The reports deliberately don’t include a timestamp, to account for clock skew and the fact that not all clients’ clocks will be correct. But if you consider these two numbers and log a timestamp at the reporting endpoint for when reports arrive, you can get a pretty good feel for when an error actually occurred at the client.

Leveraging Fastly to deploy NEL

We’ve covered the basics of NEL and how it can help get some visibility into client failures and error conditions. Now, let’s spice things up and see how we can put these mechanisms to work using Fastly’s edge cloud platform.

Adding response headers to establish a NEL policy and reporting endpoints is pretty simple to do with VCL:
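A minimal sketch might look like the following, set in vcl_deliver. The reporting URL (https://nel.example.com/report), the group name, and the sampling rates here are illustrative and mirror the examples above; adjust them for your own service:

```
sub vcl_deliver {
  # Register a NEL policy and the "network-errors" endpoint group
  # on every response.
  set resp.http.NEL = {"{"report_to": "network-errors", "max_age": 2592000, "include_subdomains": true, "success_fraction": 0.01, "failure_fraction": 0.05}"};

  # Note: in a VCL long string, the sequence "} ends the string, so the
  # endpoint object deliberately doesn't end with a quoted value (the
  # trailing "priority": 1 is the default anyway).
  set resp.http.Report-To = {"{"group": "network-errors", "max_age": 2592000, "include_subdomains": true, "endpoints": [{"url": "https://nel.example.com/report", "priority": 1}]}"};
}
```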

But that’s not all! Much like beacon termination, we can build an origin-less reporting endpoint for collecting and logging NEL reports with Fastly. It’s slightly more than two lines of code, but it’s also pretty straightforward.

First, since NEL reports are sent from the browser as POST bodies, the client will make a CORS preflight request to check whether it can send POST requests to the endpoint. So the first thing we have to do is make sure the endpoint handles these preflights properly, which means having the correct Access-Control-* headers on responses:
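One way to sketch this in VCL (the /report path and the internal 601 status are arbitrary choices for illustration):

```
sub vcl_recv {
  if (req.url.path == "/report") {
    # Short-circuit to vcl_error; no origin fetch is needed.
    error 601;
  }
}

sub vcl_error {
  if (obj.status == 601) {
    # Synthetic 204 with CORS headers so the browser's preflight
    # succeeds and subsequent report POSTs are accepted.
    set obj.status = 204;
    set obj.response = "No Content";
    set obj.http.Access-Control-Allow-Origin = "*";
    set obj.http.Access-Control-Allow-Methods = "POST, OPTIONS";
    set obj.http.Access-Control-Allow-Headers = "Content-Type";
    set obj.http.Access-Control-Max-Age = "86400";
    set obj.http.Cache-Control = "private, no-store";
    return(deliver);
  }
}
```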


Here, in vcl_recv we terminate any request to /report and generate a synthetic 204 (No Content) in vcl_error with the appropriate CORS response headers. This handles both the preflight OPTIONS request and the reports themselves, since we can respond to the report delivery POST requests with a 204 too.

All that’s left is logging the report to a logging endpoint when it’s delivered via a POST body. We can use the req.postbody VCL variable for the contents of the report. But, since we have control over what we log in VCL, we can also log additional information to help us with analysis later. We already talked about how the receive timestamp can be helpful, but there are a number of other VCL variables that could also prove useful. Here’s an example where we log the NEL report, the timestamp when it arrives, and some geo-IP information about the client in vcl_log to a GCS bucket:
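A sketch of what that vcl_log subroutine might look like, building one JSON object per log line (the /report path matches the endpoint above, and the geo fields chosen here are just examples):

```
sub vcl_log {
  # Only log actual report deliveries, not CORS preflights.
  if (req.url.path == "/report" && req.method == "POST") {
    # Emit one JSON object per line: receive timestamp, client geo
    # details, and the raw report array exactly as the browser sent it
    # (req.postbody is already a JSON array, so it embeds cleanly).
    log {"syslog "} req.service_id {" reports_bucket :: "}
        {"{"received":""} strftime({"%Y-%m-%dT%H:%M:%S%z"}, time.start)
        {"","country":""} client.geo.country_name
        {"","city":""} client.geo.city
        {"","reports":"} req.postbody {"}"};
  }
}
```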

In this case, “reports_bucket” is the name of a GCS logging endpoint configured in the Fastly service. We’ve built each log line to be a single JSON object. That means each file in GCS will be in newline-delimited JSON format, which is helpful when exporting for analysis later (note that we chose GCS as an example; you can configure Fastly to send the reports to any logging endpoint you choose).

It would be a lot simpler to log directly to, for example, Google’s BigQuery, so we could analyze the data as soon as it’s logged. What makes that challenging is the body of the reports: since reports can be sent in batches, a single request can carry multiple reports. VCL can’t process the JSON to extract each report, so if we sent the entire body to BigQuery, a single table cell might contain multiple reports, which would make queries pretty difficult. Instead, we can send the entire body to GCS and then use Cloud Functions to process each file and extract the individual reports before inserting them into BigQuery, one row per report.

We could do away with these last two steps if we were able to read and process the request bodies. Well, it just so happens that this is one of the many doors that Compute@Edge will open for us. I’ll leave that as a teaser here for now and in a future blog post, we’ll talk about how we can use Compute@Edge to make logging JSON bodies much simpler and more streamlined.

Lessons learned from deploying NEL

Adoption of NEL doesn’t seem to be that prevalent yet. Browser support isn’t widespread, and digging into HTTP Archive data (which, as of recently, had records of requests from over 5.4 million unique hostnames on the internet) shows that only about 1.7% of hosts include NEL headers on their responses. But the promise of NEL is grand and, being a grand fan of grand things, we’ve been experimenting with it as part of Fastly Insights. We’ve learned some practical things I want to share:

  • Report delivery does not show up as a request in Chrome’s devtools. So if you want to actually see the requests being sent from the client (or troubleshoot when they’re not), you’ll have to get creative. Might I suggest Wireshark, the greatest troubleshooting tool in the history of earth?

  • Chrome will, however, show you both registered NEL policies and reporting endpoints through chrome://net-export. The process is a bit more involved than the old chrome://net-internals interface that had everything in one place, but once you go through it, you should see all NEL-related things.

  • If you end up with multiple policies that will apply to a single origin, the more specific one wins. So, if you have one policy for https://example.com with include_subdomains, and another policy for https://www.example.com, when there’s an error for a resource on https://www.example.com, the browser will use the policy for https://www.example.com. It may be good practice to keep the NEL headers and reporting endpoints consistent through your app to avoid surprises. But, if you want flexibility across subdomains, there’s a way to get it.

  • The reporting endpoints themselves can fall under a NEL policy. So, NEL can let you know when error delivery has problems.

  • Because of security constraints, NEL has a special way of handling errors for origins whose IP address changes. This may happen a lot if you use multiple IP addresses for an origin (with something like round-robin DNS). It can also happen when using large content delivery networks that use dynamic routing to get users to an edge location. In situations like this, the error reports lose a bit of granularity. I won’t go into too much detail here, but if you’re considering digging into NEL and this sounds relevant, this section of the spec is worth a read.

  • The specs for both NEL and the Reporting API are evolving. The most recent editor’s draft for NEL has changed a bit since the last official published version and, if adopted, will add the ability to also collect request and response headers as part of a NEL report. The most recent editor’s draft of the Reporting API also looks different than the last published version. The Report-to header has been renamed and redefined as a structured header, which is not a bad idea. There are probably other changes too. So, if all of this has been interesting and you’re still awake and reading, you may want to keep an eye on the specs as they evolve.

  • Be careful if you’re trying to report errors to the same place that you’re monitoring for errors. This is where the Reporting API’s failover mechanism can come in handy.


Conceptually, I find myself a big fan of NEL’s potential. It has the means to provide great visibility into application and network health from a client’s perspective. I’m also a huge fan of building applications using Fastly’s edge capabilities, and an origin-less reporting endpoint is a perfect example of that. If you also end up finding these mechanisms useful and experimenting with them, let us know. And if you feel like sharing your findings and lessons in blog posts and presentations too, that would be just grand.
