Taming third parties with a single-origin website

Almost all webpages today load resources from origins other than the one the page came from. These third-party scripts slow down your site, make it harder to write a strict Content-Security-Policy, and hand the third party full access to your site. With Compute@Edge and edge-based proxying, there might be a better way.

Our Developer Hub is a great example of a statically generated website that serves most of its pages and resources from a cloud storage bucket (Google Cloud Storage in our case). But like many other websites, those pages also pull in resources from other domains. For example, we use:

  • FormKeep, which receives submissions from our feedback form

  • Swiftype, which powers site search

  • Sentry, which aggregates client-side errors

  • Google Analytics, which collects usage data

These are pretty popular vendors — you might be using them on your own website. According to a 2020 study by Ghostery, a browser privacy assistant, the average news and media website has more than 10 third-party scripts for tracking alone.

Privacy, security, regulatory and performance problems

Undeniably, there are privacy problems with third-party scripts, especially those dedicated to behavioral tracking; plugins like Ghostery are a great way for end users to protect their privacy. In fact, these protections are increasingly being built into browsers themselves.

Governments, too, are taking a stricter line. A lower court in Germany recently fined a website operator for using Google Fonts, on the basis that doing so shared the end user's IP address with Google.

Still, although the website owner is the one that chooses to use these third party services, they don't really have much — if any — control over what the third party does or what data is collected. In fact, in some cases engineering teams may not know what is being loaded on the site at all if tools like Google Tag Manager are being used to delegate control over third-party scripts to other teams within the organization. 

Perhaps if, as developers, we had more direct control over the behavior of third-party scripts, we could better protect the interests of our end users while still getting the benefit of whatever service the third party offers.

It's also not just about privacy. Throw in a few trackers, analytics scripts, fonts, and so on, and suddenly your users are fetching resources from 20, 30, 50, or even more domains just to render one webpage.

In practical terms, this means you can't write an effective Content-Security-Policy, browsers have to open multiple separate TCP connections to different servers (and therefore may not be able to prioritize requests efficiently), and your site's availability depends on the availability of every one of those third parties. What if your font provider goes down, or is blocked in some country, and your website renders as a blank page?
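With everything consolidated onto one origin, a very strict policy becomes realistic. As an illustrative sketch only (not a recommendation for any particular site):

```http
Content-Security-Policy: default-src 'none'; script-src 'self'; style-src 'self'; img-src 'self'; font-src 'self'; connect-src 'self'
```

A policy like this would be impossible to write while scripts, fonts, and beacons are still loaded from dozens of third-party domains.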

Proxying to the rescue

If you serve your website through Fastly, you already have an edge-deployed reverse proxy with best-practice security and the latest protocol support, capable of presenting a single domain to the world while routing requests to multiple different backend servers. Many of our customers use this capability to create a "microservices at the edge" architecture.

The same principle can be used to proxy many third-party scripts. Let's examine how this can work:

  1. The third party (e.g. www.google-analytics.com) is added to your Fastly service as a new backend.

  2. The <script> tag in the HTML is modified to load from a local path, e.g. /services/analytics.

  3. Requests to that path are transformed into the correct backend path by Fastly and routed to that backend.

  4. The library code served by the third party is fetched into Fastly and transformed as needed, for example, to find and replace the third party's data collector URL with your proxy endpoint (this can be somewhat risky, but we'll discuss this later).

  5. Subsequent requests made by the third-party script are sent to your Fastly service, inspected, and filtered as needed, then forwarded to the third party.

With this pattern in place, we can achieve the following benefits:

  • Strict Content-Security-Policy

  • Maximum effectiveness of HTTP prioritization

  • Protection from third-party outages

  • Control of data sharing with the third party

  • Circumvention of client-side blocking/filtering plugins

That last one is… controversial. But I'm going to assume that if you're willing to go to the trouble of proxying third parties, you care about minimizing their impact on your end users. Now let's look at how this can be implemented for some of the third parties we use on the Developer Hub.

The Developer Hub is a GatsbyJS application fronted by a Compute@Edge service written in JavaScript. Learn more about how we migrated it to Compute@Edge in our earlier blog post.

HTTP APIs (FormKeep and Swiftype)

Starting simple: some third parties don't actually have scripts at all, but are just API endpoints that we query from our own client-side script. For example, FormKeep receives data from our feedback form in an HTTP POST, and Swiftype returns results for search queries. Moving these into the primary domain is straightforward.

Start by modifying your Compute@Edge program to recognize a specific path and direct requests on that path to a new backend (we'll call it "formkeep" here):

const req = event.request;
const reqUrl = new URL(req.url);
const reqPath = reqUrl.pathname;

let backendName;
if (reqPath === "/api/internal/feedback") {
  backendName = "formkeep";
  reqUrl.pathname = "/f/xxxxxxxxxxxx";
} else {
  backendName = "gcs";
}

let beReq = new Request(reqUrl, req);
let beResp = await fetch(beReq, { backend: backendName });
return beResp;

Then, modify the behavior of your frontend application or HTML page to send the API request to the new path:

async function handleFormSubmit(evt) {
  // Stop the browser's default form submission; post to the proxy path instead
  evt.preventDefault();
  const data = new FormData(evt.target);
  buttonEl.current.disabled = true;
  await fetch("/api/internal/feedback", {
    method: "post",
    body: data,
    headers: { accept: "application/json" },
  });
  setIsSubmitted(true);
}

Finally, add the backend, either in the web interface or by using the Fastly CLI, and deploy the updated app.

fastly backend create --name=formkeep --host=formkeep.com --version=active --autoclone
fastly compute publish

The --version=active and --autoclone flags cause the currently active version of the service to be cloned, with the new backend added to the clone but not yet activated. The compute publish command then uploads your updated code to that draft service version and activates it.

This kind of third-party integration is so easy to wire up to Fastly that there's really no reason not to do it.

Configurable clients (Sentry)

Some third-party services offer a JavaScript client that needs to run on the browser, like the error aggregation service Sentry. If you're lucky, the provider will allow the hostname and path of the requests made by their client to be configurable.

Sentry is a good example of one that does, using their tunnel option. This can be configured wherever you place your Sentry configuration. For the Developer Hub, we use the Sentry plugin for Gatsby, and the config goes in the plugins array in our gatsby-config.js:

{
  resolve: "@sentry/gatsby",
  options: {
    dsn: "https://#######@###.ingest.sentry.io/######",
    tunnel: "/api/internal/errors",
    sampleRate: 0.7,
    tracesSampleRate: 0.7,
    release: process.env.COMMIT_SHA,
  },
}

If you are using Sentry outside of an application framework like Gatsby, you'd most likely put the tunnel option wherever you call Sentry.init.
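A sketch of that equivalent configuration (the DSN is a placeholder, and the other options shown above apply equally here):

```javascript
Sentry.init({
  dsn: "https://#######@###.ingest.sentry.io/######",
  // Send events via our own origin instead of directly to sentry.io
  tunnel: "/api/internal/errors",
});
```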

Now modify your Compute@Edge app to add the new path, and remap to the path Sentry expects:

if (reqPath === "/api/internal/feedback") {
  backendName = "formkeep";
  reqUrl.pathname = "/f/xxxxxxxxxxx";
} else if (reqPath === "/api/internal/errors") {
  backendName = "sentry";
  reqUrl.pathname = "/api/" + SENTRY_PROJECT_ID + "/envelope/";
} else {
  backendName = "gcs";
}

As before, you need to add the new backend, matching the name you used in the code, and then deploy a new version of your program:

fastly backend create --name=sentry --host=oXXXXXXXX.ingest.sentry.io --version=active --autoclone
fastly compute publish

Another advantage of Sentry's Gatsby plugin is that it bundles the Sentry client code into our site bundle so we don't have to worry about the request that loads the library itself, only the request that dispatches data to Sentry's collectors.

Dynamically rewriting clients (Google Analytics)

Other scripts require a bit more assistance. Google Analytics (GA) hard-codes the destination URL into their tracking script, and the Gatsby plugin for GA loads the library directly from Google. In these cases you could self-host a modified version of the client library, but then you wouldn't get updates that the provider makes to their client code.

Instead, we can use a streaming transform in Compute@Edge to modify these URLs on the fly. 

This same technique can be used to deal with Google Fonts, since the CSS returned from Google loads the actual font files and those also need to be routed through the primary domain. Fastly customer Houzz is using this solution to create a privacy-preserving method for loading fonts from Google.

First, add a function to do a simple find-and-replace on a stream:

const streamReplace = (inputStream, targetStr, replacementStr) => {
  let buffer = "";
  const decoder = new TextDecoder();
  const encoder = new TextEncoder();
  const inputReader = inputStream.getReader();
  const outputStream = new ReadableStream({
    start() {
      buffer = "";
    },
    pull(controller) {
      return inputReader.read().then(({ value: chunk, done: readerDone }) => {
        // stream: true holds back any partial multi-byte character at a chunk boundary
        buffer += decoder.decode(chunk, { stream: true });
        if (buffer.length > targetStr.length) {
          buffer = buffer.replaceAll(targetStr, replacementStr);
          // Emit all but the last targetStr.length characters, which are held
          // back in case a match straddles the chunk boundary
          controller.enqueue(encoder.encode(buffer.slice(0, buffer.length - targetStr.length)));
          buffer = buffer.slice(0 - targetStr.length);
        }
        // Flush the held-back tail, and close the stream if we're done
        if (readerDone) {
          controller.enqueue(encoder.encode(buffer.replaceAll(targetStr, replacementStr)));
          controller.close();
        } else {
          controller.enqueue(encoder.encode(""));
        }
      });
    },
  });
  return outputStream;
};
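The key idea in streamReplace is the buffering: the last targetStr.length characters of each chunk are held back so that a match split across a chunk boundary is still found. Here's the same logic illustrated with plain strings, free of the stream plumbing:

```javascript
// Minimal, stream-free illustration of the buffering logic in streamReplace
function replaceAcrossChunks(chunks, targetStr, replacementStr) {
  let buffer = "";
  let output = "";
  for (const chunk of chunks) {
    buffer += chunk;
    if (buffer.length > targetStr.length) {
      buffer = buffer.replaceAll(targetStr, replacementStr);
      // Emit all but the last targetStr.length characters
      output += buffer.slice(0, buffer.length - targetStr.length);
      buffer = buffer.slice(0 - targetStr.length);
    }
  }
  // Flush: replace any match sitting entirely within the held-back tail
  return output + buffer.replaceAll(targetStr, replacementStr);
}

// The hostname is split across two chunks, but is still replaced:
replaceAcrossChunks(
  ["<script src='https://www.google-an", "alytics.com/analytics.js'>"],
  "www.google-analytics.com",
  "example.com/api/internal/analytics"
);
// → "<script src='https://example.com/api/internal/analytics/analytics.js'>"
```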

Just after the fetch to the backend, beResp.body will be a readable stream. Using the stream replace function, we can replace the GA domain with our own:

let beResp = await fetch(beReq, { backend: backendName });
const respContentType = beResp.headers.get("content-type") || "";
if (respContentType.startsWith("text/")) {
  const newRespStream = streamReplace(
    beResp.body,
    "www.google-analytics.com",
    "developer.fastly.com/api/internal/analytics"
  );
  beResp = new Response(newRespStream, { headers: beResp.headers });
}
return beResp;

The Gatsby plugin for GA hard-codes the <script> tag into every page, so it's necessary to apply this transform to all text responses: the GA hostname appears both in the analytics.js library itself and in the markup that loads it. The gatsby-plugin-google-gtag plugin looks to be an alternative that does allow the <script> tag to be rewritten to a local path, but for the sake of this post I thought it was worth covering an ultimate fallback solution that should work for almost anything.

Be aware that rewriting third-party code like this is inherently risky. Some third-party libraries fetch from multiple hostnames, they may change the hostname they fetch from over time, and some even attempt to obfuscate the construction of URLs to defeat exactly this type of rewriting! That said, we've had good experience doing this with Google Analytics and Google Fonts.

If you only need to apply the transform to one URL, you could change the if statement to check the reqPath variable we defined earlier.
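For instance, a minimal sketch of a narrower condition (the path here is hypothetical):

```javascript
// Hypothetical: rewrite only responses served on one proxied path
// (say, a fonts stylesheet), instead of every text response
function shouldRewrite(reqPath, respContentType) {
  return (
    respContentType.startsWith("text/") &&
    reqPath === "/services/fonts/css"
  );
}
```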

Now add the new path to the routing code:

if (reqPath === "/api/internal/feedback") {
  backendName = "formkeep";
  reqUrl.pathname = "/f/xxxxxxxxxxx";
} else if (reqPath === "/api/internal/errors") {
  backendName = "sentry";
  reqUrl.pathname = "/api/" + SENTRY_PROJECT_ID + "/envelope/";
} else if (reqPath.startsWith("/api/internal/analytics")) {
  backendName = "ga";
  reqUrl.pathname = reqPath.replace("/api/internal/analytics/", "/");
} else {
  backendName = "gcs";
}

And of course, we also need to add the Google Analytics backend, upload the new code, and activate the new version of the service:

fastly backend create --name=ga --host=www.google-analytics.com --version=active --autoclone
fastly compute publish

Removing cookies from requests

By directing all requests to your domain, you're already exerting a lot more control over the behavior of third parties. For example, the third party will no longer see the IP addresses of your end users; instead, all the requests you send to them will come from Fastly.

You can also proactively strip unnecessary data from the request. Probably the most important things to consider are the X-Forwarded-For, Fastly-Client-IP, and Cookie headers, which will otherwise leak personal data to the third party and nullify the privacy benefits of proxying. These are easy to remove just before you send the request to the backend:

beReq.headers.delete("cookie");
beReq.headers.delete("x-forwarded-for");
beReq.headers.delete("fastly-client-ip");
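If you proxy several third parties, you might centralize this in a small helper. A sketch (the denylist here is illustrative, not exhaustive):

```javascript
// Headers that identify the end user, stripped from every request
// bound for a third-party backend
const PRIVATE_HEADERS = ["cookie", "x-forwarded-for", "fastly-client-ip"];

function scrubRequestHeaders(headers) {
  for (const name of PRIVATE_HEADERS) {
    headers.delete(name);
  }
  return headers;
}
```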

There are lots of other things you can do too — even filtering the body content of the request or copying a sample of it to a log endpoint for inspection.

Conclusion

Consolidating all your site's resources and requests onto a single domain comes with some significant advantages and helps to keep a lid on unintentional performance and privacy regressions. Compute@Edge, and edge computing in general, promises to make this stuff easier and easier over time, but already we're able to take advantage of some powerful ways to shape the way our sites load.

Andrew Betts
Head of Developer Relations

Andrew Betts is Head of Developer Relations for Fastly, where he works with developers across the world to help make the web faster, more secure, more reliable, and easier to work with. He founded a web consultancy which was ultimately acquired by the Financial Times, led the team that created the FT’s pioneering HTML5 web app, and founded the FT’s Labs division. He is also an elected member of the W3C Technical Architecture Group, a committee of nine people who guide the development of the World Wide Web.