THE SYSTEMS LIBRARIAN
The Battle Against the Bots
by Marshall Breeding
Content providers on the web face incredible technical challenges brought on by armies of automated bots, especially those associated with gathering data for AI products and projects. Bots have been around since the beginning of the web, serving as search engine crawlers as well as tools that scan for security vulnerabilities on behalf of individuals and organizations seeking opportunities for ransomware attacks or other malicious endeavors. For many content providers, the overhead incurred by bot activity has intensified dramatically in recent months, imposing unbearable strain on resources to the point that performance for actual users has become constrained. Library Technology Guides—although a small content provider in a limited niche—illustrates the traffic patterns related to search engines, malware scanning, and AI training. Managing this site gives me direct experience of the impact of the recent wave of bot swarms and a perspective on the strategies involved in mounting an effective defense. Other providers of discovery services and content services have described similar experiences.
Search Engine Crawlers: A Symbiotic Relationship
Bots and web crawlers are not new. All websites and content repositories expect to see a significant portion of requests made through automated agents. Search engines use automated crawlers to scour the web for content for their indexes. Google, Bing, Baidu, and other organizations periodically scan the web for new and updated content.
Content providers depend on this activity to drive users to their sites. For those that depend on advertising revenue, placement in the top results has important financial implications. An entire industry of SEO advises content providers on how to configure their sites to receive the best treatment by Google, which is the dominant player in search and advertising.
A symbiotic relationship exists between content providers and search engines. Websites provide guidance for search engine crawlers through conventions such as the Robots Exclusion Standard, implemented through a robots.txt file that states which crawlers are allowed or disallowed and lists any specific pages or directories that should not be accessed. The robots.txt file can also point to a sitemap, which provides a comprehensive list of all of the unique resources available on the site. Search engines can also reach content by following links in other ways, such as through search terms or facets; this approach is less efficient than working from the sitemap and may lead to a much higher level of overhead. All of the major search engines adhere to the instructions given in a site's robots.txt.
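To make this concrete, a minimal robots.txt along the following lines (the crawler names, paths, and sitemap URL are illustrative, not the actual configuration of Library Technology Guides) tells compliant crawlers what they may fetch and where the sitemap lives:

```
# Hypothetical robots.txt illustrating the Robots Exclusion Standard
User-agent: *
Disallow: /search        # discourage crawling of open-ended search and facet URLs
Disallow: /internal/     # keep crawlers out of administrative pages

User-agent: GPTBot
Disallow: /              # exclude a specific crawler from the entire site

Sitemap: https://example.org/sitemap.xml
```

Compliant crawlers read this file before fetching anything else, and the sitemap it references enumerates every unique resource so that crawlers do not have to discover pages by iterating through search and facet links.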
I developed an internal monitoring tool to show website activity in real time, separating any identifiable bot activity from requests made by real users. This distinction can be made by examining the HTTP User-Agent header presented with each page request. The User-Agent text identifies the software making the request, whether a web browser, search engine crawler, or automated bot. When everyone plays by the rules, it's easy to identify the source of each request.
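The monitoring tool itself is custom code, but the core distinction it makes can be sketched in a few lines of Python. The signature list and the assumption of a combined-format access log are illustrative, not the actual implementation:

```python
# Illustrative sketch: classify requests as bot or human traffic by User-Agent.
BOT_SIGNATURES = ("googlebot", "bingbot", "baiduspider", "gptbot", "ccbot",
                  "crawler", "spider", "bot")

def classify_user_agent(user_agent: str) -> str:
    """Return 'bot' if the User-Agent matches a known crawler signature, else 'human'."""
    ua = user_agent.lower()
    return "bot" if any(signature in ua for signature in BOT_SIGNATURES) else "human"

def tally_log(path: str) -> dict:
    """Tally bot versus human requests from a combined-format access log,
    where the User-Agent is the last quoted field on each line."""
    counts = {"bot": 0, "human": 0}
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            parts = line.rsplit('"', 2)            # ['... "referer" ', 'user-agent', '\n']
            user_agent = parts[-2] if len(parts) >= 2 else ""
            counts[classify_user_agent(user_agent)] += 1
    return counts
```

A bot that honestly announces itself falls out of a tally like this immediately; as described below, the newer AI harvesters defeat it by masquerading as ordinary browsers.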
Before the advent of the new AI-related bots, activity related to search engine crawlers on Library Technology Guides ranged between 40% and 50% of overall page requests. This level of overhead is easily absorbed by the web service without any performance degradation.
Malware Bots
Another category of bots has malicious intent, and their attack attempts are widespread and persistent. This activity includes probes and other attack vectors that aim to exploit any technical vulnerability that might enable an external individual or organization to gain unauthorized control of the site or access to its content. These bots systematically generate requests targeting known vulnerabilities in the modules and components of the website's infrastructure.
The most frequent attacks I see are attempts to exploit vulnerabilities specific to WordPress, even though it is not part of my environment. Other common ploys include attempts to launch SQL injection attacks. These attacks are dangerous and require very careful coding of web applications to ensure all parameters accepted in a request are comprehensively tested against any possible malicious content. In addition to the obvious security implications, these malware attacks can also induce performance issues. Any given attack event can generate massive numbers of requests within a short period.
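Careful coding in this context means, above all, never splicing request parameters directly into SQL text. A short Python sketch using the standard library's sqlite3 module (the table and column names here are hypothetical) illustrates the difference between the vulnerable and the safe pattern:

```python
import sqlite3

def find_company_unsafe(conn: sqlite3.Connection, name: str):
    # VULNERABLE: the request parameter is concatenated into the SQL text,
    # so input such as  ' OR '1'='1  changes the meaning of the query.
    return conn.execute("SELECT * FROM companies WHERE name = '" + name + "'").fetchall()

def find_company_safe(conn: sqlite3.Connection, name: str):
    # SAFE: a parameterized query treats the input strictly as data,
    # so malicious content cannot alter the SQL statement itself.
    return conn.execute("SELECT * FROM companies WHERE name = ?", (name,)).fetchall()
```

The same parameterized pattern applies with any database driver; the essential point is that user-supplied values never become part of the SQL statement itself.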
These attacks come indiscriminately. Defense strategies include closing all security vulnerabilities and blocking the IP addresses associated with a current or previous attack, both in the server's internal firewall and at the organization's network perimeter. The requests associated with these attacks may also include clues in the HTTP User-Agent header that allow the web server to block them systematically.
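As one illustration, a small script along these lines (the probe paths, log format, and threshold are assumptions for the example, not my production configuration) can pull repeat offenders out of an access log so their addresses can be fed to the firewall:

```python
from collections import Counter

# Illustrative examples of probe targets frequently seen in attack traffic.
SUSPICIOUS_PATHS = ("/wp-login.php", "/xmlrpc.php", "/wp-admin")

def collect_attack_ips(log_path: str, threshold: int = 20) -> list[str]:
    """Return IP addresses that repeatedly requested known attack URLs."""
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if any(path in line for path in SUSPICIOUS_PATHS):
                ip = line.split(" ", 1)[0]   # client address is the first field in common log formats
                hits[ip] += 1
    return [ip for ip, count in hits.items() if count >= threshold]
```

The resulting list can then be loaded into the server's internal firewall or blocked at the network perimeter.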
AI Bot Swarms
The latest headache for content providers comes from massive swarms of bots that are hungry to extract and consume all content on a site regardless of its operational impact. The large language models (LLMs) behind generative AI services require massive amounts of training data. The organizations that construct these models often do not respect the protocols and practices established for search engine crawlers. They follow a brute force approach that presses against, or goes beyond, the limits of the resources available to the site.
The AI bots may or may not make use of the sitemap to extract content from the site, and they often do not respect exclusion directives. They tend to pursue a very aggressive strategy of following links, iterating through extensive combinations of search and facet parameters. Each of these searches may consume considerable resources, and these requests are blasted at the server at merciless speed. Another characteristic of the AI bots is a failure to provide identifying text in the HTTP User-Agent. Rather, they often use User-Agent strings that are identical to those of web browsers, making them especially difficult to detect and control.
Not all AI harvesting bots are unscrupulous. The bots from OpenAI that provide information for ChatGPT, for example, follow the Robots Exclusion Protocol and provide identifying text in their HTTP User-Agent string. Most malicious attack incidents come through a single IP address and can be quickly and effectively blocked through a firewall. The AI swarms, by contrast, are often highly distributed, with any given attack coming through hundreds, if not thousands, of individual IP addresses. This distributed and coordinated tactic makes it much more difficult to identify and contain an episode.
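One common way to recognize such a distributed episode, offered here as a general technique rather than anything specific to my own monitor, is to aggregate request counts by network prefix instead of by individual address, so that a swarm scattered across a hosting provider's address space still stands out:

```python
import ipaddress
from collections import Counter

def count_by_network(client_ips: list[str], prefix_length: int = 24) -> Counter:
    """Group request counts by network prefix so a distributed swarm becomes visible."""
    networks = Counter()
    for ip in client_ips:
        try:
            network = ipaddress.ip_network(f"{ip}/{prefix_length}", strict=False)
        except ValueError:
            continue  # ignore malformed addresses
        networks[str(network)] += 1
    return networks
```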
To put the intensity of these attacks into perspective, the internal monitor for Library Technology Guides typically shows a baseline activity of 300–600 requests per minute. During one of these AI bot episodes, the page request rate can climb to 10,000–20,000 or more in that same interval. The server can usually sustain that load, but it slows performance for normal users.
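Watching for these spikes is straightforward in principle: count requests over a sliding one-minute window and flag anything far above the normal baseline. The following sketch uses an alert threshold loosely based on the figures above; it is illustrative, not the internal monitor itself:

```python
from collections import deque
import time

# Alert well above the normal baseline of roughly 300-600 requests per minute.
ALERT_THRESHOLD = 5000

class RequestRateMonitor:
    """Track request timestamps and flag when the one-minute rate spikes."""

    def __init__(self, alert_threshold: int = ALERT_THRESHOLD):
        self.alert_threshold = alert_threshold
        self.timestamps = deque()

    def record_request(self, now: float | None = None) -> bool:
        """Record one request; return True if the last minute exceeds the threshold."""
        now = time.time() if now is None else now
        self.timestamps.append(now)
        while self.timestamps and now - self.timestamps[0] > 60:
            self.timestamps.popleft()   # drop requests older than one minute
        return len(self.timestamps) > self.alert_threshold
```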
On a single day, a massive attack of 841,611 page requests came through a single IP address. This IP address belonged to paperspace.com, now part of DigitalOcean, which describes itself as a cloud platform for building and scaling AI applications. This event clearly falls within the style of aggressive website content harvesting.
Strategies to mitigate the impact of these requests are complex and can be expensive. Given the limitations of using basic firewall techniques to block these attacks, many content providers have begun using specialized content delivery networks to protect their sites. Cloudflare, for example, is a standalone service that acts as a web application firewall: it intercepts all requests destined for a server and can be configured to filter out those associated with unwanted AI bots or malicious agents. A version of Cloudflare with basic features is available at no charge, but versions with more advanced capabilities involve monthly fees.
The aggressive activity of the AI bots not only impairs the services offered by content providers, but it also disrupts long-standing patterns of use and access. The overhead of search engine crawlers is part of a symbiotic relationship in which content providers and search engines both see benefits. Content providers gain more use as individuals click on links provided by search engines. The aggressive behavior of the AI bots is parasitic. The content extracted by these bots helps train LLMs but usually does not direct users back to the original content sources. The technical overhead can be massive with no direct benefit. In this model, content providers have little control over how their materials are used and disseminated.
Many of the large publishers and content providers have entered into licensing agreements with the major AI companies, seeking a mutually beneficial arrangement. But the AI bots that accumulate content through brute force harvesting represent a major disruption in the content ecosystem on the internet. The advent of search engines led to a set of practices and protocols that achieved a workable balance between crawlers and content providers; a comparable equilibrium for AI harvesting has yet to emerge.