Health checks

Backends attached to Fastly services have a health status of either healthy or sick. We determine this status by regularly sending a predefined HTTP request to the backend and checking that we get back the expected response. This regular polling of the backend is a health check.

Creating health checks

You can create a health check using fastly healthcheck create in the CLI, the health check API endpoint, the web interface, or in VCL as part of a backend { ... } declaration.

Health checks poll at an interval of your choice, set by the check_interval property in the API or the interval property in the VCL backend { ... } declaration. Three other properties, which apply both in the API and in VCL, determine the sensitivity of the health check:

  • window: The number of health check results to keep track of
  • threshold: The number of health checks that must pass (within the window) for a backend to be considered healthy
  • initial: The number of successful health check results to pre-populate into the window at startup
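As a rough illustration of how these three properties interact, the following Python sketch models the window as a fixed-length queue of pass/fail results. It is a toy model, not Fastly's implementation: the HealthWindow class is hypothetical, and placing the pre-populated passes at the newest end of the window is an assumption.

```python
from collections import deque


class HealthWindow:
    """Toy model of window/threshold/initial (hypothetical, not Fastly's code)."""

    def __init__(self, window: int, threshold: int, initial: int) -> None:
        if not 0 <= initial <= window or not 0 < threshold <= window:
            raise ValueError("need 0 <= initial <= window and 0 < threshold <= window")
        self.threshold = threshold
        # Assumption: the window starts with `initial` passes in the newest
        # slots and failures in the remaining slots.
        start = [False] * (window - initial) + [True] * initial
        self.results = deque(start, maxlen=window)

    def record(self, passed: bool) -> None:
        # Appending to a full deque discards the oldest result.
        self.results.append(passed)

    @property
    def healthy(self) -> bool:
        # Healthy when enough results in the window are passes.
        return sum(self.results) >= self.threshold


hw = HealthWindow(window=5, threshold=3, initial=1)
states = []
for outcome in [True, True, False, False, False]:
    hw.record(outcome)
    states.append(hw.healthy)
print(states)  # [False, True, True, True, False]
```

In this model the backend becomes healthy after the second passing check and stays healthy through two failures, because three passes are still inside the five-result window.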

Controlling initial state

Backends that do not have a health check are considered healthy at all times, including immediately upon initialization.

Backends that do have a health check are considered sick upon service initialization. They are marked healthy once enough successful health checks have been completed to reach the configured threshold. To allow a health-checked backend to instead be immediately considered healthy upon initialization, set the value of initial to be the same as the value of threshold. This means the threshold will be met immediately, even before any health check requests are sent.

IMPORTANT: VCL services are initialized when the first request to the service from an end user is received. As a result, if initial is less than threshold, the first request made in each site following the first deployment of the service will likely result in an HTTP 503 (service unavailable) response. Subsequent requests will continue to encounter unhealthy backends until enough successful health checks have been performed.

Any redeployment of an existing service that changes any property of a backend will reset the health status of that backend. If initial is less than threshold, that backend will therefore briefly become sick. Redeployments that don't change a backend or its associated health check will not affect the health status of the backend.
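To make the relationship between initial and threshold concrete, the sketch below estimates how many passing checks a backend still needs after initialization or a reset. The function names are hypothetical, and the time estimate assumes one check per interval with every check passing, so treat it as an approximation only.

```python
def checks_until_healthy(threshold: int, initial: int) -> int:
    """Passing checks still needed after initialization (never negative)."""
    return max(threshold - initial, 0)


def approx_warmup_seconds(threshold: int, initial: int, interval_ms: int) -> float:
    # Assumption: one check per interval and every check passes.
    return checks_until_healthy(threshold, initial) * interval_ms / 1000


print(checks_until_healthy(threshold=3, initial=1))    # 2
print(approx_warmup_seconds(3, 1, interval_ms=15000))  # 30.0
print(approx_warmup_seconds(3, 3, interval_ms=15000))  # 0.0
```

With initial equal to threshold, no passing checks are needed, which matches the "immediately considered healthy" behavior described above.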

Understanding health check traffic volume

The number of health check requests that are received by your backend server is likely to be much higher than your setting of interval/check_interval may suggest. This is due to a number of effects:

  1. Each Fastly site handles health checks independently, but shares the results of the health check within the site in a process called health check amortization. After the service is activated, the number of health check instances for each backend definition will gradually approach the number of operational Fastly sites that are handling traffic for your service.
  2. If the backend's hostname resolves to multiple IP addresses, a separate health check will be sent to each one.
  3. If you create the same backend on multiple Fastly services and give each of them a health check, then by default they will run independently, even if the health check request is identical.

HINT: A realistic "worst case" scenario based on the above details might be one where you have 50 Fastly services that all use the same backend, that backend's DNS lookup returns 5 A records, and 150 Fastly sites handle traffic for your services. In this situation, configuring the backend with a check_interval of 1000ms (one check per second) would result in:

[1 check/sec] x [150 sites] x [50 services] x [5 IPs] = 37,500 requests/sec

It's also possible for Fastly sites to briefly have amortized health checks disabled, for example during the initial deployment of a new POP. In such situations health check rates may increase temporarily.

To reduce health check traffic, first consider applying the same share_key to backends that are identical across multiple Fastly services, which will enable them to share the same health check. By default, share_key is set to your service ID, but if you have many similar Fastly services (e.g., staging and other non-live environments), then it's a good idea to set it to your customer ID instead, so that the share_key is the same across all services you own. In the above example, this would reduce the volume from 37,500 rps to 750 rps.

To further reduce health check traffic, consider increasing the interval/check_interval or reducing the number of IP addresses returned from a DNS query of the backend's hostname.
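The arithmetic in this section can be reproduced with a small Python helper. The function and its parameter names are hypothetical; share_groups stands for the number of distinct share_key values in use, and the 150 sites figure is taken from the worked example above.

```python
def health_check_rps(check_interval_ms: int, sites: int,
                     share_groups: int, ips: int) -> float:
    """Approximate health check requests per second reaching the backend."""
    # Multiply the integer factors first so the final division stays exact
    # for the round numbers used in the example.
    return sites * share_groups * ips * 1000 / check_interval_ms


# 50 services with the default share_key (one group per service).
default = health_check_rps(1000, sites=150, share_groups=50, ips=5)
# Same services with one shared share_key.
shared = health_check_rps(1000, sites=150, share_groups=1, ips=5)
# Shared key plus a 5-second check interval.
slower = health_check_rps(5000, sites=150, share_groups=1, ips=5)

print(default, shared, slower)  # 37500.0 750.0 150.0
```

The 5-second interval in the last line is only an illustrative value, not a recommendation from the text.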

Using health checks for load balancing and failover

Fastly will not send HTTP requests to backends that are sick.

For the simplest possible service configuration (a single backend, and no content in cache), the effect of the backend being sick is that all end-user requests will elicit a Fastly-generated 503 Service unavailable response. This may still be better than not having a health check at all, because a backend server that is failing might output unpredictable content.

Health checks have more impact and provide a more seamless user experience when applied to services with multiple backends because it's then possible to intelligently select a healthy backend in preference to a sick one.
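The idea of preferring a healthy backend can be sketched in a few lines of Python. This is a conceptual model only: pick_backend and respond are hypothetical helpers, the ordered-preference policy is an assumption, and in a real service this selection is made in VCL or by a director rather than by code like this.

```python
from typing import Optional, Sequence, Tuple


def pick_backend(backends: Sequence[Tuple[str, bool]]) -> Optional[str]:
    """Return the first healthy backend name, or None if all are sick."""
    for name, healthy in backends:
        if healthy:
            return name
    return None


def respond(backends: Sequence[Tuple[str, bool]]) -> Tuple[int, str]:
    chosen = pick_backend(backends)
    if chosen is None:
        # No healthy backend and nothing in cache: synthetic 503.
        return 503, "Service unavailable"
    # Illustrative status only; the real status comes from the backend.
    return 200, f"fetched from {chosen}"


print(respond([("primary", False), ("fallback", True)]))   # (200, 'fetched from fallback')
print(respond([("primary", False), ("fallback", False)]))  # (503, 'Service unavailable')
```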

For more information see load balancing, failover, and the directors reference docs.

Health-checking mechanics

Provided that amortized health checks are enabled at both the service and site level (which they are in almost all cases), each registered backend will be health-checked by one designated Fastly cache server per site, which will then distribute the results to the rest of the site. This server is also responsible for performing DNS lookups on the backend hostname, and distributing the resulting host addresses.

Figure: Illustration of the health check mechanism

Partial health

If a backend has multiple IP addresses and some, but not all, are sick, then the backend will be considered healthy. Fastly will continue to route traffic to it using the healthy IPs and will consider it to be healthy as part of assessing the health of any director that the backend is a member of.

By default, we will register up to 16 IPs for each backend. If more than 16 addresses are returned from a DNS query for a backend hostname, we will use only 16 of them. In some circumstances this limit can be increased; if you need more, contact support. Keep in mind that it may be better to split up large IP pools into multiple backends so that Fastly can assign different health statuses to each of them.

When forwarding live traffic to a healthy backend that has more than one healthy IP, Fastly cache servers will select a random healthy address.
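The partial health rules above can be modeled as follows. This is a sketch under stated assumptions: the helper names are hypothetical, keeping the first 16 addresses is an assumption (the text does not say which 16 are used), and the addresses come from the documentation-only 192.0.2.0/24 range.

```python
import random
from typing import Dict, List, Optional

MAX_IPS = 16  # default per-backend limit stated in the text


def register_ips(resolved: List[str], limit: int = MAX_IPS) -> List[str]:
    # Assumption: keep the first `limit` addresses returned by DNS.
    return resolved[:limit]


def backend_is_healthy(ip_health: Dict[str, bool]) -> bool:
    # A backend counts as healthy if at least one of its IPs is healthy.
    return any(ip_health.values())


def pick_ip(ip_health: Dict[str, bool],
            rng: Optional[random.Random] = None) -> Optional[str]:
    # Choose uniformly at random among the healthy addresses only.
    healthy = [ip for ip, ok in ip_health.items() if ok]
    if not healthy:
        return None
    return (rng or random).choice(healthy)


ip_health = {"192.0.2.1": False, "192.0.2.2": True, "192.0.2.3": False}
print(backend_is_healthy(ip_health))  # True
print(pick_ip(ip_health))             # 192.0.2.2 (the only healthy address)
```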

DNS caching

Fastly honors the DNS TTL of backend hostnames. However, because DNS results are only renewed when needed for health checking, the time between DNS queries also depends on the frequency of the health check (determined by the interval or check_interval parameter).

Backend requests resulting from live end-user traffic to your Fastly service do not trigger DNS lookups and will always use cached results. As a result, we may use stale DNS data for short periods depending on the frequency of the health check attached to a backend.
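One simplified way to reason about this staleness is to assume that a DNS result is refreshed at the first health check that runs at or after TTL expiry. That scheduling rule is an assumption made for illustration, not documented behavior, and both function names are hypothetical.

```python
def next_dns_refresh_ms(ttl_ms: int, interval_ms: int) -> int:
    """First health check tick at or after TTL expiry (simplified model)."""
    ticks = -(-ttl_ms // interval_ms)  # ceiling division on integers
    return max(ticks, 1) * interval_ms


def max_staleness_ms(ttl_ms: int, interval_ms: int) -> int:
    # How long past its TTL a cached answer may be used in this model.
    return next_dns_refresh_ms(ttl_ms, interval_ms) - ttl_ms


print(max_staleness_ms(ttl_ms=60000, interval_ms=15000))  # 0
print(max_staleness_ms(ttl_ms=10000, interval_ms=15000))  # 5000
print(max_staleness_ms(ttl_ms=20000, interval_ms=15000))  # 10000
```

Under this model, staleness is always shorter than one check interval, which is consistent with the "short periods" described above.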

If a DNS lookup triggered by a health check fails (i.e., the response is not one of "NOERROR", "NODATA", or "NXDOMAIN"), we will continue to use stale DNS data for both the health check and the forwarding of backend traffic for a short period. If it continues to fail after this period, we will clear the stale IPs, mark the backend as sick, and continue to attempt to obtain fresh DNS results at the check interval frequency.
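The failure handling described above can be sketched as a small state machine. The DnsState class is hypothetical, and grace_s is a placeholder because the text only says "a short period". Treating a non-failure response as a fresh answer that replaces the cached IPs is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Response codes the text treats as non-failures.
OK_RCODES = {"NOERROR", "NODATA", "NXDOMAIN"}


@dataclass
class DnsState:
    grace_s: float  # hypothetical length of the stale-data period
    ips: List[str] = field(default_factory=list)
    dns_sick: bool = False  # marked sick because DNS kept failing
    failing_since: Optional[float] = None

    def on_lookup(self, now: float, rcode: str, ips: List[str]) -> None:
        if rcode in OK_RCODES:
            # Assumption: accept the answer and clear any failure state.
            self.ips = list(ips)
            self.failing_since = None
            self.dns_sick = False
            return
        if self.failing_since is None:
            self.failing_since = now
        if now - self.failing_since >= self.grace_s:
            # Grace period over: drop stale IPs and mark the backend sick.
            self.ips = []
            self.dns_sick = True
        # Otherwise keep serving from the stale IPs.


state = DnsState(grace_s=30)
state.on_lookup(0, "NOERROR", ["192.0.2.1"])
state.on_lookup(15, "SERVFAIL", [])
print(state.ips, state.dns_sick)  # ['192.0.2.1'] False
state.on_lookup(45, "SERVFAIL", [])
print(state.ips, state.dns_sick)  # [] True
```

Lookups continue at each check interval in either state, so a later successful response restores the IP list in this model.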