Protecting Against Scrapers with Fastly Bot Management

Brooks Cunningham

Senior Security Strategist, Fastly

TL;DR? Here are some ways to protect your cached content from scrapers using Fastly Bot Management:

  • Using Dynamic challenges to challenge clients who may be using a web driver

  • Using a Request Rule to block, by user-agent, scrapers that should not access your site

  • Using the VERIFIED-BOT signal to block bots that are impersonating well-known and verifiable bots

By combining these techniques, you can create a robust defense against unauthorized scraping and protect your valuable content.
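
As a rough illustration of how the three techniques combine, the sketch below models them as a single decision function. It is not how Bot Management evaluates rules internally. The user-agent patterns, the `decide` function, and the `WEBDRIVER` signal name are hypothetical. Only the `VERIFIED-BOT` signal name comes from the list above.

```python
import re

# Hypothetical examples; substitute the user-agents you actually want to refuse.
BLOCKED_AGENTS = re.compile(r"(?i)(examplescraper|contentgrabber)")
# User-agents that claim to be a well-known, verifiable bot (illustrative list).
CLAIMS_KNOWN_BOT = re.compile(r"(?i)(googlebot|bingbot)")


def decide(user_agent: str, signals: set[str]) -> str:
    """Return 'block', 'challenge', or 'allow' for one request."""
    # Request-Rule style block on unwanted user-agents.
    if BLOCKED_AGENTS.search(user_agent):
        return "block"
    # Claims to be a known bot but was not verified: treat as an impersonator.
    if CLAIMS_KNOWN_BOT.search(user_agent) and "VERIFIED-BOT" not in signals:
        return "block"
    # Possible web driver: challenge rather than block outright.
    if "WEBDRIVER" in signals:  # hypothetical signal name
        return "challenge"
    return "allow"
```

In this model, a request that presents a Googlebot user-agent without the `VERIFIED-BOT` signal is blocked, while the same user-agent with the signal is allowed.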

Caching significantly boosts website performance, but it also makes your content more accessible, increasing the risk of theft by scrapers. This can severely impact your revenue and SEO. Unauthorized scrapers can pilfer product descriptions, pricing, articles, and other valuable data, republishing it elsewhere and potentially:

  • Undercutting your prices: Competitors can scrape your pricing information and offer lower prices, stealing your customers.

  • Duplicating your content: Scrapers can copy your unique product descriptions and articles, harming your SEO rankings and brand authority.

  • Creating counterfeit products: Detailed product information scraped from your site can be used to create counterfeit products, damaging your brand reputation.

Bot Management and Cached Content

As of this writing, the out-of-the-box implementation of Bot Management inspects requests after the cache lookup in the VCL lifecycle, that is, only on a miss or pass. This design originally focused inspection on requests that go to the origin, to protect against traditional OWASP Top 10 attacks. The Fastly platform is highly customizable, however: with a small amount of VCL, you can inspect not only requests that would go to the origin but also requests that result in a cache hit.

Fastly VCL Integration to Protect Cached Content

To protect your cached content, you need some VCL that sends requests to Bot Management when there is a cache hit. The key points of the VCL shown further below are:

  • Bot Management Inspection on Cache Hits: The primary goal is to have Fastly Bot Management inspect requests for cached content. This is achieved with a noop backend: when a request hits the cache, instead of immediately serving the content, the VCL passes the request to the noop backend, which lets Bot Management inspect it without sending it outside the Fastly PoP.

  • Static File Exclusion: To optimize performance, static files (JS, CSS, fonts, images) are excluded from Bot Management inspection when they are a cache hit. This reduces latency for these commonly cached assets. The list of excluded extensions may be customized depending on the type of cached content you want to protect.

  • Handling Bot Management Actions: The VCL checks the X-SigSci-Tags response header from Bot Management for specific tags (BLOCKED or CHALLENGED). If either tag is present, the corresponding Bot Management action (block or challenge) is applied. If no action is required, the request is restarted so that the content is served directly from the cache.
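
The flow described in the points above can be modeled in a few lines of Python to check the control flow before touching a service. This is a rough sketch and not Fastly's implementation: `Request`, `handle`, and the `inspect` callback are hypothetical names, and `is_static` and `no_inspection` stand in for the file-extension check and the X-SigSci-No-Inspection header.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    path: str
    is_static: bool = False      # stands in for the file-extension check
    no_inspection: bool = False  # stands in for the X-SigSci-No-Inspection header
    restarts: int = 0
    trace: list = field(default_factory=list)


def handle(req: Request, cache: dict, inspect) -> str:
    """Model one request; `inspect` is a hypothetical stand-in returning NGWAF tags."""
    while True:
        body = cache.get(req.path)
        if body is None:
            req.trace.append("miss")
            return "origin"  # miss/pass traffic follows the default inspection path
        if req.restarts < 1 and not req.no_inspection and not req.is_static:
            req.trace.append("inspect")
            tags = inspect(req)  # request goes to the noop backend, not the origin
            if "BLOCKED" in tags:
                return "blocked"
            if "CHALLENGED" in tags:
                return "challenged"
            # No action required: restart once and serve from cache.
            req.restarts += 1
            req.trace.append("restart")
            continue
        req.trace.append("hit")
        return body
```

The single-restart guard matters: after the restart the request is a cache hit again, and without the `restarts < 1` check it would be sent for inspection in a loop.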

Below is the init snippet that you would apply to your VCL service.

# vcl_init

# noop backend is used so that the NGWAF may quickly inspect requests that are cache HIT.
backend F_noop_origin {
  .between_bytes_timeout = 10s;
  .connect_timeout = 1s;
  .first_byte_timeout = 1s;
  .host = "127.0.0.1";
  .max_connections = 200;
  .port = "443";
  .ssl = true;
  .max_tls_version = "1.3";
  .min_tls_version = "1.3";
  .ssl_cert_hostname = "127.0.0.1";
  .ssl_check_cert = always;
  .ssl_sni_hostname = "127.0.0.1";
}

# force cluster for all requests and on restarts. https://www.fastly.com/documentation/guides/vcl/clustering/#enabling-and-disabling-clustering
sub vcl_recv {
  set req.http.Fastly-Force-Shield = "1";
  # enable ngwaf logging headers
  if (req.restarts == 0 && fastly.ff.visits_this_service == 0) {
    set req.http.X-Sigsci-Response-Headers = "true";
  }
}

# On cache hit, send the request to NGWAF
sub vcl_hit {
  if (req.restarts < 1
    && !req.http.X-SigSci-No-Inspection) {
    # Exclude static files from cache HIT NGWAF inspection
    if (!(req.url.ext ~ "(?i)^(js|css|ttf|woff|ico|png|jpg|jpeg)$")) {
      set req.http.X-SigSci-Cached-Inspect = "HIT";
      return(pass);
    }
  }
}

# When there is a cache HIT, set the noop backend origin.
sub vcl_pass {
  if (req.http.X-SigSci-Cached-Inspect == "HIT") {
    set req.backend = F_noop_origin;
  }
}

# If BLOCKED or CHALLENGED is present, then return that response to the client
# If there is no action, then restart and serve content from cache
sub vcl_fetch {
  if (req.http.X-SigSci-Cached-Inspect == "HIT"
    && req.restarts < 1
    && !(beresp.http.X-SigSci-Tags ~ "(BLOCKED|CHALLENGED)")) {
    set req.http.x-restart-reason = "ngwaf-action=none";
    restart;
  }
}
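
Before changing the extension list in vcl_hit, it can help to check a candidate pattern against sample URLs offline. The sketch below does that in Python under some assumptions: `url_ext` only approximates what `req.url.ext` returns (the extension of the URL path, without the dot), the pattern is a candidate modeled on the one in vcl_hit, and the helper names are hypothetical.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Candidate pattern, modeled on the one in vcl_hit; edit to match what you deploy.
STATIC_EXT = re.compile(r"(?i)^(js|css|ttf|woff|ico|png|jpg|jpeg)$")


def url_ext(url: str) -> str:
    """Approximate req.url.ext: the extension of the URL path, without the dot."""
    path = urlsplit(url).path
    return posixpath.splitext(path)[1].lstrip(".")


def inspected_on_hit(url: str) -> bool:
    """True if a cache hit for this URL would be sent for inspection."""
    return STATIC_EXT.search(url_ext(url)) is None


for sample in ["/products/42", "/assets/app.JS?v=3", "/fonts/site.woff2"]:
    print(sample, inspected_on_hit(sample))
```

Because the pattern is anchored at both ends, an extension such as woff2 does not match woff and would still be inspected. Add it to the list if you want those files excluded as well.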

Using Fastly Bot Management to Protect Cached Content

Protecting cached content is just as important as securing origin-bound traffic, especially when it comes to defending against content scraping, impersonator bots, and automated threats. With Fastly Bot Management (and a small VCL update), you can inspect cache hits, apply dynamic challenges, and enforce verified bot validation, all without sacrificing performance. By proactively securing your cached assets, you not only safeguard your revenue and SEO rankings but also maintain control over your brand’s most valuable digital content.

Ready for better protection? Get in touch with our team to learn how to implement these techniques and start securing your cached content!