NIXsolutions: SourceHut Blocks AI Bots Over Excessive Traffic

Open-source Git hosting platform SourceHut has reported that web crawlers deployed by AI companies are slowing down its services. The problem is becoming more common, and operators of other sites have voiced similar complaints.

To counteract the unwanted traffic, SourceHut deployed Nepenthes, a tool designed to protect against rogue web crawlers that collect data for AI model training. As part of its mitigation strategy, SourceHut’s administrators also blocked entire address ranges belonging to several cloud providers, including Google Cloud and Microsoft Azure, because of the excessive volume of bot traffic originating from these networks. Administrators of legitimate services hosted on these platforms were encouraged to contact SourceHut individually to request an exemption. Notably, back in 2022, SourceHut also faced service issues caused by a flood of requests from Google’s Go Module Mirror service.

Widespread AI Bot Abuse Across Platforms

Efforts to control AI crawler behavior have been made in recent years. In 2023, OpenAI announced that its bots would respect robots.txt directives, which tell web crawlers which parts of a website they may access. Other AI developers have made similar promises. However, reports of misuse persist. For instance, last summer, the iFixit website was flooded with requests from Anthropic’s ClaudeBot. Later, in December, hosting provider Vercel reported that AI crawlers were heavily present on its network: OpenAI’s GPTBot generated 569 million requests, and Anthropic’s Claude crawler generated 370 million. Combined, that is roughly 939 million requests, equivalent to about 21% of the 4.5 billion requests Googlebot made to index content for Google.
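
As a rough illustration of how robots.txt directives are interpreted, the sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the paths, and the ExampleSearchBot name are hypothetical and are not taken from any site mentioned here. Note that robots.txt is advisory: it only has an effect if the crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: bar two AI crawlers from the whole site,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


robots = build_parser(ROBOTS_TXT)

# Pass the crawler's product token, not its full User-Agent header.
for agent, path in [
    ("GPTBot", "/projects/readme"),
    ("ClaudeBot", "/projects/readme"),
    ("ExampleSearchBot", "/projects/readme"),
    ("ExampleSearchBot", "/private/keys"),
]:
    verdict = "allowed" if robots.can_fetch(agent, path) else "disallowed"
    print(f"{agent:17} {path:18} {verdict}")
```

A well-behaved crawler would perform a check like this before each request; the complaints described above concern crawlers that skip the check or ignore its result.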

Diaspora developer Dennis Schubert also raised concerns, reporting that over 70% of his server traffic over a 60-day period came from AI bots. After his post gained attention, bot activity briefly decreased. Soon after, however, malicious actors launched a wave of fake requests impersonating OpenAI’s GPTBot. Unlike the legitimate OpenAI bot, which operates from Microsoft Azure, these requests originated from AWS and even American ISPs.
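
One way to tell a genuine crawler from an impostor is to compare the request's source address against the ranges the operator is known to use, since a User-Agent header is trivial to forge. The following is a minimal sketch of that idea using Python's standard ipaddress module. The ranges in PUBLISHED_RANGES are placeholder documentation addresses (RFC 5737), not real OpenAI or Azure ranges, and the function name is hypothetical.

```python
import ipaddress

# Placeholder ranges for illustration only (RFC 5737 documentation
# addresses). A real deployment would load the ranges that the crawler
# operator actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24", "198.51.100.0/25"],
}


def is_claimed_bot_genuine(claimed_bot: str, remote_ip: str,
                           published=PUBLISHED_RANGES) -> bool:
    """Return True only if remote_ip lies inside a range listed for claimed_bot.

    Unknown bot names have no listed ranges, so they are never verified.
    """
    address = ipaddress.ip_address(remote_ip)
    networks = (ipaddress.ip_network(cidr)
                for cidr in published.get(claimed_bot, []))
    return any(address in network for network in networks)


# A request claiming to be GPTBot from inside a listed range passes;
# the same claim from any other address is treated as spoofed.
print(is_claimed_bot_genuine("GPTBot", "192.0.2.17"))
print(is_claimed_bot_genuine("GPTBot", "203.0.113.9"))
```

The same membership test, applied without the user-agent lookup, is essentially what blocking an entire provider's address range amounts to, which is why legitimate services in those ranges get caught as well.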

Growing Need for Clear Bot Identification

Adding to the complexity, some bots serve multiple functions, notes NIXsolutions. For example, Meta’s AI bot and Applebot focus solely on AI training, while Googlebot supports both AI and search indexing. To reduce confusion, Google introduced a distinct Google-Extended token in 2023 that site owners can use in robots.txt to control AI training-related use of their content separately from search indexing.
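
To show what a separate token makes possible, here is a small sketch of a hypothetical robots.txt that keeps search indexing open while opting out of AI training, again evaluated with Python's standard urllib.robotparser. The rules and the /docs/guide path are illustrative assumptions. Google-Extended is a robots.txt control token rather than a separate crawler, so the parser here only demonstrates how the two rule groups resolve independently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: allow search indexing, opt out of AI training.
SPLIT_POLICY = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

policy = RobotFileParser()
policy.parse(SPLIT_POLICY.splitlines())

# Each token is matched against its own rule group.
search_ok = policy.can_fetch("Googlebot", "/docs/guide")
training_ok = policy.can_fetch("Google-Extended", "/docs/guide")

print("search indexing permitted:", search_ok)
print("AI training permitted:", training_ok)
```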

SourceHut’s experience reflects a broader trend of platforms taking stronger measures to protect their services. We’ll keep you updated as more platforms adapt or introduce protections against AI bot overreach.