The Truth About Blocking AI, And How Publishers Can Still Win

VP Strategic Initiatives, Fastly

Cloudflare made waves last month by announcing a block on AI crawlers. But here’s the thing: the block leaves out two major crawlers, Google and Apple, which means their high-profile “solution” can’t actually stop some of the largest drivers of AI scraping today without hurting SEO in the process.
That’s not a knock on Cloudflare. It’s a reality check for the whole industry. Because when it comes to controlling AI crawlers, the truth is more complicated, and a lot more interesting, than a simple allow-or-deny toggle.
Fastly has been working for much of this year with various groups on similar initiatives. Rather than implement our own standard or require everyone to use our payment gateway, we are working on two different but related approaches.
The first is working with companies that provide similar solutions, particularly those with whom we share customers. We announced our TollBit partnership in early July, which, like the Cloudflare solution, gives publishers the option to charge bots instead of just banning them. We are also talking to other vendors our customers have introduced us to. Our aim is to meet our customers where they already are and give those looking for solutions a choice, much like the way our Real-Time Logging feature integrates with over 30 logging providers.
Meanwhile, we are actively working with open standards groups such as the IAB (Interactive Advertising Bureau), RSL (Really Simple Licensing), and the big dogs of the internet standards world, the W3C and the IETF. We have a lot of experience working in standards bodies, so as well as providing choice, we're also trying to represent the needs of customers who may not have the resources or the expertise to take part in these forums.
This work is driven by what we see in the wild: data from Fastly’s Q2 2025 Threat Insights Report shows that AI crawlers disproportionately target high-authority domains like news sites, open datasets, government pages, educational resources, and technical docs. And 95% of that crawler traffic comes from just three players: Meta (52%), Google (23%), and OpenAI (20%).
Edge-level control is becoming the last line of defence for publishers who care about how their content is used.
Here’s what publishers need to know:
Robots.txt can’t stop AI scraping, unless the bot chooses to comply
Blocking Google’s AI completely means blocking Google’s search bot; there’s no clean separation today
Most other AI bots can be detected and filtered at the edge
Fastly’s programmable edge gives real-time control over bot traffic. No black boxes, no false promises
Robots.txt: A Suggestion, Not a Shield
At the heart of the AI scraping debate is a decades-old protocol: robots.txt. It’s meant to tell bots where they can and can’t go. And for many well-behaved bots, like Googlebot or Applebot, it still works, kind of.
See, Google’s web crawler doesn’t just power search anymore; it also feeds large language models via something called Google-Extended, a directive you can disallow in your robots.txt to opt out of AI training. But here’s the catch: Google-Extended isn’t a bot. It’s a flag. One that Googlebot is free to ignore if it wants to.
The same goes for Applebot-Extended and other similar directives. They only signal intent, and compliance is entirely voluntary.
So, while your robots.txt file might look like this:
User-agent: Googlebot
Allow: /
User-agent: Google-Extended
Disallow: /
…you’re still depending on Google’s goodwill to honour that second rule. And if you block Googlebot entirely? Say goodbye to your search traffic.
That’s the real bind. If you want to protect your content from AI use without killing your SEO, there’s no clean separation between search crawling and AI scraping, at least not when it comes to Google and Apple. It’s all merged on purpose.
To build trust and avoid being blocked, AI bot operators need to take transparency and control seriously. That starts with publishing IP ranges or supporting verifiable methods like reverse DNS lookups, something OpenAI already does, making it easier for developers to identify and filter their crawlers.
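To make that concrete, here is a minimal sketch, in Python, of the forward-confirmed reverse DNS check a publisher could run against a crawler’s source IP. The hostname suffixes below are placeholders, not an authoritative list; real values need to come from each operator’s published documentation.

import socket

# Placeholder suffixes for illustration only; consult each crawler
# operator's documentation for the hostnames they actually publish.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".openai.com")

def crawler_ip_is_verified(ip: str) -> bool:
    # Forward-confirmed reverse DNS: the IP's PTR record must end in a
    # trusted suffix, and that hostname must resolve back to the same IP.
    try:
        hostname = socket.gethostbyaddr(ip)[0]              # reverse (PTR) lookup
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward (A) lookup
    except OSError:
        return False
    return hostname.endswith(TRUSTED_SUFFIXES) and ip in forward_ips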
Bots should also respect robots.txt and any emerging web-standard opt-outs. These signals may not be legally binding, but they’re foundational to maintaining goodwill with publishers.
Why Blocking AI Is Harder Than It Looks
Some AI companies do a good job of separating their bots. DuckDuckGo, for instance, uses DuckDuckBot for search and DuckAssistBot for AI. These are easier to spot and block.
But the big players like Google run AI scrapers from the same IP ranges and infrastructure as search bots. Anthropic (Claude) doesn’t publish IPs at all, making it nearly impossible to verify traffic claiming to be Claude. And some companies skip crawling entirely and buy access to crawled data from third parties, some of which don’t identify themselves at all.
Bots rotate IPs, spoof user agents, and in some cases fly completely under the radar. Whether that’s due to technical limitations or design choices is unclear, but the effect is the same: you can’t block what you can’t see.
And if you’re relying on robots.txt or headers like X-Robots-Tag, you’re playing defence on the honour system.
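To see why it’s an honour system, consider how little a scraper has to do to pass itself off as a legitimate crawler. This deliberately trivial Python sketch (the URL is just an example) shows that the User-Agent header is whatever the client says it is:

import requests

# Nothing in this request proves the client is actually Googlebot;
# the User-Agent header is entirely self-declared.
headers = {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
response = requests.get("https://example.com/some-article", headers=headers)
print(response.status_code)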
Open source developers are already seeing the cost of that breakdown. Projects like Diaspora, Fedora, KDE, and GNOME have been overwhelmed by AI bots ignoring robots.txt, spoofing user-agents, and rotating IPs to evade detection.
History Repeating
There are parallels back to the search engine wars of the early 2000s. Publishers and search engines existed in an uneasy symbiosis: publishers needed the search engines to drive traffic, and search engines needed the publishers to provide content to index.
But search engines want to keep users on their own pages so they can show them more ads, which is why they do things like show results inline. Publishers obviously want those same visitors on their sites directly so THEY can show ads.
So there is a nervous tension between the two parties: sites do things to attract search engines, such as performing SEO and adopting programs like sitemap.xml and AMP. But they also live in fear of getting deranked by the ALMIGHTY ALGORITHM. People's careers lived and died by the Google Dance, the periodic reweighting of Google's index.
And we're now seeing this play out again: AI crawlers need content to build their models and to power their RAG queries, and publishers want more traffic (or money) sent their way.
So What Can Be Done?
This is where edge control comes in.
Unlike robots.txt, which sits politely at the application layer waiting to be obeyed, edge tools like Fastly’s Next-Gen WAF (NGWAF) operate at the network edge, in front of your origin. They inspect traffic in real time, looking at user-agent strings, known IP ranges, request patterns, and behaviour, and then take action instantly.
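As a rough sketch of what that decision logic can look like, here is the idea in plain Python rather than Fastly’s own configuration language; the user-agent patterns and IP range are illustrative assumptions, not a production rule set.

import ipaddress
import re

# Illustrative values only; a real deployment would source these from
# operator documentation and the platform's bot intelligence feeds.
AI_BOT_PATTERN = re.compile(r"GPTBot|ClaudeBot|CCBot|Bytespider", re.IGNORECASE)
KNOWN_AI_RANGES = [ipaddress.ip_network("192.0.2.0/24")]  # documentation-only range

def classify_request(user_agent: str, client_ip: str) -> str:
    # Combine what the client claims (user agent) with what the network
    # says (source IP) before choosing an action at the edge.
    ip = ipaddress.ip_address(client_ip)
    declares_ai_bot = bool(AI_BOT_PATTERN.search(user_agent or ""))
    from_known_range = any(ip in net for net in KNOWN_AI_RANGES)

    if declares_ai_bot and from_known_range:
        return "apply-policy"  # verified AI crawler: block, charge, or rate-limit
    if declares_ai_bot or from_known_range:
        return "challenge"     # claim and network evidence don't line up
    return "allow"

Real edge rules layer request-rate and behavioural signals on top of this, but the shape is the same: verify first, then decide.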
As mentioned, we're working with a number of open standards groups such as RSL, the IAB, and the IETF, and talking to providers such as Supertab, Scalepost, and Skyfire. We're also exploring the Web Bot Auth proposal that Cloudflare has submitted for standardization, and have partnered with TollBit, a paywall system designed specifically for bots. The TollBit integration means you don’t have to just block AI bots; you can charge them. In late July we joined the IAB meeting on this topic alongside representatives from Google, Meta, Cloudflare, Dotdash Meredith, a variety of publishers, and many, many others, making the case for transparency, accountability, and actual enforcement.
Forcing payment is one solution. And if blocking, or even offensive bot tactics such as randomly generated content mazes, proof-of-work challenges, and gibberish generators, are the stick, there is also the possibility of a carrot: publishers could provide high-quality APIs to the crawlers (easy to access and cross-reference, semantically marked up) in exchange for better bot behaviour, licensing fees, or attribution.
This is an arms race playing out at multiple levels: not just site providers vs crawlers, but also between governments. The current US administration has publicly stated that, for now, it doesn’t expect crawlers to pay for content. But Europe may take a different approach, possibly introducing regulation requiring compensation.
Will that doom Europe’s AI startups and content platforms, or push more publishers to host their content in Europe, relying on CDNs to manage latency? We will have to wait and see.
At Fastly, we're looking not only to provide our customers with choice so that we can fit in with what THEY need, but also to encourage a robust and open ecosystem rather than centralize everything through us.
Unlike static text files and meta tags, edge-level defences can actually enforce decisions, not just suggest them. Are they perfect? No. Even Fastly’s system can’t distinguish between AI bots that ignore robots.txt and those that follow it. That’s still an open problem. But in terms of actionable, programmable control? The edge gives publishers their best shot.
And unlike competitors who quietly dodge the Google problem, Fastly is upfront about the limitations. There’s no silver bullet for Google AI, but we can give you the tools to decide what you want to do.
Download the full Q2 2025 Fastly Threat Insights Report to explore how AI bot traffic is evolving across industries, and what site owners can do about it.