Wikipedia Tells AI Companies to "Stop Scraping"

Natalie Griffeth

Senior Content Marketing Manager

The pushback on bots has started

On Monday, Wikipedia - the internet’s trusty crowd-sourced encyclopedia - issued a statement urging major AI companies to access its content through its paid API rather than scraping it for ‘free’. Wikipedia depends on donations and community contributions, and AI companies that scrape the site are bypassing the very model that makes the platform possible. The foundation is now asking those companies to use its paid offering instead.

In its statement, Wikipedia says it is “calling on AI developers and other content reusers who access our content to use it responsibly and sustain Wikipedia. They can accomplish this through two straightforward actions: attribution and financial support”. The ask is simple: proper attribution, and “[proper] access [of] Wikipedia’s content through the Wikimedia Enterprise platform”. According to Wikipedia, “this paid-for opt-in product allows companies to use Wikipedia content at scale and sustainably without severely taxing Wikipedia’s servers, while also enabling them to support our nonprofit mission”.

Publishers are feeling the scrape

So why the ask? Because publishers are feeling the effects of unmitigated scraping. 

In October, Wikipedia reported an 8% year-over-year decline in web traffic, which it attributes to AI. Marshall Miller, Senior Director of Product at the Wikimedia Foundation, said the foundation has been working to separate human traffic from bot traffic. Further concerns are emerging about declining community edits to Wikipedia pages - the very foundation of how Wikipedia maintains its rich content. When consumers rely on AI overviews instead of clicking through to Wikipedia itself, the result is fewer visitors, yes, but also fewer editors. That stands to leave Wikipedia a dead space.

Wikipedia’s concern is two-fold. First, scraping of its content means fewer visitors and less reliance on the platform itself, which translates into declining traffic and donation revenue. Second, scraping traffic is placing enormous strain on its servers. That’s why Wikipedia is urging AI companies to use its paid Wikimedia Enterprise offering, which lets them access content at scale without overloading servers or essentially ‘stealing’ this valuable information for ‘free’.

The themes of this story are right in line with what we’re seeing from our own security research team. Our Q2 Threat Insights Report and our upcoming Q3 report point to similar findings: with bots making up a large share of overall website traffic, the result is infrastructure strain, ‘stolen’ content, and the risk of malicious intent going unchecked. As the Q2 report noted, “AI bots can place significant strain on unprotected web infrastructure, with peak traffic reaching up to 39,000 requests per minute”.

The pushback against AI is underway

In his blog post, Miller explained that adopting better bot management to “reclassify [their] traffic” yielded a notable finding: “much of their unusually high traffic… was coming from bots that were built to evade detection”.
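
To make “reclassifying traffic” concrete, here is a minimal sketch of the kind of naive user-agent classification many sites still rely on. The bot markers and sample user agents are hypothetical, not Wikipedia’s actual methodology - the point is that a scraper presenting a browser-like user agent sails through as ‘human’ traffic:

```python
# Illustrative sketch only: naive user-agent classification.
# Bot markers and sample requests below are hypothetical.
from collections import Counter

DECLARED_BOT_MARKERS = ("bot", "crawler", "spider", "gptbot", "ccbot")

def classify(user_agent: str) -> str:
    """Label a request as 'declared-bot' or 'presumed-human' from its UA string."""
    ua = user_agent.lower()
    if any(marker in ua for marker in DECLARED_BOT_MARKERS):
        return "declared-bot"
    # A scraper that sends a browser-like UA string lands here too --
    # which is exactly how evasive bots get counted as "human" traffic.
    return "presumed-human"

requests = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",   # real person
    "GPTBot/1.0 (+https://openai.com/gptbot)",                         # declared crawler
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15", # spoofed? no way to tell
]

print(Counter(classify(ua) for ua in requests))
# Counter({'presumed-human': 2, 'declared-bot': 1})
```

Real bot management leans on additional signals - request rates, behavioral patterns, fingerprinting - which is why reclassification exercises like Wikipedia’s tend to uncover traffic that simple checks miss.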

Wikipedia’s finding points to a trend we are tracking across our own data: organizations are wising up to the fact that even ‘wanted’ bots - those seemingly without malicious intent - can still place undue strain on infrastructure and ‘steal’ valuable IP.

In fact, our upcoming Q3 Threat Insights Report found that organizations are increasingly blocking ‘wanted’ bots - those thought to be non-malicious. This tells us that tolerance for AI scraping, even for legitimate purposes, is wearing thin: organizations are no longer willing to let scrapers consume their data without compensation.

While Wikipedia is under particular pressure because of its nonprofit funding model, the same issue affects publishers worldwide.

Given the impact on revenue and infrastructure costs, we anticipate more organizations following Wikipedia’s lead and cracking down on AI scraping.

A bot management strategy is no longer a nice-to-have

Bot management solutions are no longer optional - they should be a mandatory component of any AppSec program. Capabilities like our offering in partnership with Tollbit allow organizations to charge bots rather than ban them outright - exactly the approach Wikipedia is taking with Wikimedia Enterprise.
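
As a rough illustration of the ‘charge, don’t just ban’ pattern - this is not the Tollbit integration itself, and the crawler list, license header, and token check below are all hypothetical - a web app can answer declared AI crawlers with HTTP 402 Payment Required unless they present a valid license:

```python
# Illustrative Flask sketch of charging AI crawlers instead of blocking them.
# Crawler markers, the X-Content-License header, and the token set are hypothetical.
from flask import Flask, request, Response

app = Flask(__name__)

AI_CRAWLER_MARKERS = ("gptbot", "ccbot", "claudebot", "bytespider")  # example names
PAID_TOKENS = {"demo-licensed-crawler-token"}  # stand-in for a real license check

@app.before_request
def meter_ai_crawlers():
    ua = (request.headers.get("User-Agent") or "").lower()
    if not any(marker in ua for marker in AI_CRAWLER_MARKERS):
        return None  # human (or undeclared) traffic passes through

    # Declared AI crawlers are not banned outright; they are asked to pay.
    if request.headers.get("X-Content-License") in PAID_TOKENS:
        return None  # licensed crawler: serve the content and log the usage

    return Response(
        "Licensed access required for automated reuse of this content.",
        status=402,  # HTTP 402 Payment Required
        headers={"Link": '<https://example.com/licensing>; rel="license"'},
    )

@app.route("/")
def article():
    return "Full article content here."

if __name__ == "__main__":
    app.run()
```

The design point is that monetization replaces a hard block: declared crawlers get a clear, machine-readable path to licensed access, while everyone else is served normally.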

Net net? Organizations are starting to crack down. It’s neither profitable nor sustainable to simply allow free use of your content, and having a bot strategy in place is becoming increasingly important.

Organizations should remember that robots.txt files are not a shield - they are simply a suggestion that well-behaved crawlers may choose to honor.
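
For illustration, here is what a typical robots.txt looks like when it asks AI crawlers to stay away (the crawler names are examples). Nothing in the protocol enforces these rules - a crawler that ignores the file simply keeps scraping, which is why a real bot management layer is needed behind it:

```
# robots.txt -- crawler names are examples; honoring these rules is entirely
# voluntary on the crawler's part, which is why this file is a suggestion,
# not a shield.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```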