AI Crawlers Devouring My Server: A Complete Defense Guide from robots.txt to Data Poisoning

phoue

For the past 30 years, the internet as we know it hasn’t just been supported by massive undersea fiber optic cables.

Behind the scenes, a vast social capital of ‘trust’ has been at play.

There existed a long-standing implicit agreement between website publishers and search engines.

“Take my content and make it known to the world. In return, send traffic back to me.”

The sole protocol orchestrating this reciprocal relationship was a simple text file called robots.txt.

Legal enforcement? None whatsoever.

It was akin to a constitution safeguarding web peace, based on the assumption that members of the digital ecosystem would adhere to it – a ‘gentlemen’s agreement’.

However, frankly speaking, that era is over.

The explosive growth of Generative AI, beginning in 2023, has turned this old peace treaty into a piece of scrap paper.

For AI companies, web data is no longer a ‘map’ to guide users.

It has become ‘fuel’ to be burned to enhance the intelligence of Large Language Models (LLMs).

We are now witnessing the ‘Tragedy of the Digital Commons’ unfold in real-time, and on a truly destructive scale.

[Figure: Tragedy of the Digital Commons]

This article examines how AI crawlers are suffocating open-source communities and proposes practical ways to keep infrastructure alive, from robots.txt as the first line of defense to active countermeasures using Anubis and Cloudflare.

1. Infrastructure’s Scream: Externalization of Costs and Destruction of the Commons

The indiscriminate data collection race by AI companies might seem like technological innovation on the surface.

However, viewed dispassionately from an economic perspective, it’s a classic case of ‘Externalization of Costs’.

[Figure: External Cost]

It’s a perverse structure where AI companies monopolize the immense profits from model training, while non-profits and open-source projects bear the entire burden of infrastructure costs for data processing and transmission.

Let’s examine the severity of the problem through real-world damage cases.

1.1 The SourceHut Incident: Thieves Stealing Computation

The experience of Drew DeVault, founder of the open-source software development platform SourceHut, is shocking. It’s a pivotal example demonstrating that AI crawlers have evolved beyond mere ‘data collectors’ into ‘computing resource parasites’.

Attacks Targeting CPU, Not I/O

Typically, general search bots like Googlebot scrape static pages (HTML). For a server, this primarily involves reading files, imposing a relatively low load (I/O Bound).

However, the AI crawlers that overwhelmed SourceHut maliciously targeted high-cost endpoints with extreme precision.

| Attack Type | Details | Nature of Technical Load |
| --- | --- | --- |
| Git Blame Attack | Repeatedly calling high-cost endpoints like git blame pages, not just source code files | Requires re-tracing the entire commit history to track modifications for each line of a file. |
| Git Log Attack | Indiscriminate crawling of change log pages | Extreme CPU resource consumption (CPU Bound) during the process of sorting and filtering vast amounts of log data. |
| Conclusion | Compute Theft | Not just bandwidth occupation, but the unauthorized appropriation of server computing power for AI data refinement. |

Stealth Indistinguishable from Botnets

What’s even more troublesome is the origin of the attacks.

They arrived not from single data center IPs, but through tens of thousands of residential IPs.

This means either a botnet composed of hacked home PCs or routers was used, or the identity was laundered through a paid residential proxy service.

This incapacitates traditional IP blocking methods.

1.2 Read the Docs and a 73TB Bill: Disaster Born of Incompetence

The case of ‘Read the Docs’, a technical documentation hosting service, starkly illustrates how AI companies’ negligence leads to financial damage.

  • Visualization of Costs: In May 2024, a single bot downloaded a staggering 73TB in one month, with nearly 10TB on its heaviest day. The resulting excess bandwidth charges alone were approximately $5,000 (around 7 million KRW).
  • Technical Flaw: What’s more infuriating is that the bot completely ignored HTTP’s default caching mechanisms.

[ Analysis of AI Bot's Technical Flaws ]

| Check Item | Normal Bot (Googlebot, etc.) | Problematic AI Bot | Result |
| --- | --- | --- | --- |
| If-None-Match | Checks ETag and requests only changed files | Not used | Downloads even previously downloaded files anew each time |
| If-Modified-Since | Requests only changes since the last modification date | Not used | Downloads files from 3 years ago daily as if new |
| QA Performed? | Thorough logic verification before deployment | Not performed | Deployed without basic quality control |
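
To make the difference concrete, here is a minimal Python sketch of conditional requests as seen from the server side. The `respond` function, the ETag derivation, and the sample timestamp are illustrative assumptions, not the actual code of Read the Docs or any crawler.

```python
import hashlib
from email.utils import formatdate, parsedate_to_datetime


def make_etag(body: bytes) -> str:
    """Derive a validator from the body (one common approach, assumed here)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, last_modified_ts: float, request_headers: dict) -> tuple:
    """Return (status, headers, payload), honoring conditional request headers.

    Simplified: a real server also handles ETag lists and weak validators.
    """
    etag = make_etag(body)
    headers = {
        "ETag": etag,
        "Last-Modified": formatdate(last_modified_ts, usegmt=True),
    }
    # If-None-Match takes precedence over If-Modified-Since when both are sent.
    if "If-None-Match" in request_headers:
        if request_headers["If-None-Match"] == etag:
            return 304, headers, b""
    elif "If-Modified-Since" in request_headers:
        since = parsedate_to_datetime(request_headers["If-Modified-Since"]).timestamp()
        if int(last_modified_ts) <= since:
            return 304, headers, b""
    return 200, headers, body


page = b"<html>docs page</html>"
mtime = 1_600_000_000.0  # hypothetical modification time (Unix seconds)

# First visit: no validators, so the full body is sent.
status_first, validators, _ = respond(page, mtime, {})
# Well-behaved revisits: echo a validator back and receive an empty 304.
status_etag, _, payload_etag = respond(page, mtime, {"If-None-Match": validators["ETag"]})
status_date, _, _ = respond(page, mtime, {"If-Modified-Since": validators["Last-Modified"]})
# The page changed: the old ETag no longer matches, so the new body is sent.
status_changed, _, _ = respond(b"<html>updated</html>", mtime + 60,
                               {"If-None-Match": validators["ETag"]})

print(status_first, status_etag, status_date, status_changed)
```

A crawler that echoes the validators back pays almost nothing for unchanged files. One that never sends them pays for the full body on every visit, which is the pattern described in the table.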

1.3 Geopolitical Blocking: Why Brazil Disappeared (Geo-blocking)

Fedora, a Linux distribution project, faced near infrastructure paralysis due to scraping attacks originating from Brazilian IP ranges.

Investigations revealed that insecure IoT devices in Brazil were infected with a botnet called ‘Aisuru’, being exploited as a zombie army for AI data collection.

[Figure: Aisuru botnet]

Ultimately, the Fedora team took the extreme measure of ‘blocking all traffic from Brazil (Geo-blocking)’.

[Figure: Geo-blocking]

This is a reality straight out of a dystopian movie, where a technical problem in cyberspace leads to the closure of physical borders.

2. Practical Defense Strategy I: Optimizing robots.txt and Overcoming Limitations

Many webmasters still attempt to resolve everything with a single User-agent: *.

However, in the current era of AI crawlers, this is akin to opening the front door wide and saying, “Welcome, thief.”

We must modernize and redesign the most basic yet essential first line of defense, robots.txt.

2.1 Key Block List for AI Defense in robots.txt

AI bots often operate under distinct User-Agent names. The table below lists the primary AI bots that must be blocked.

| User-Agent Name | Owning Company / Service | Reason for Blocking & Characteristics |
| --- | --- | --- |
| GPTBot | OpenAI (ChatGPT) | Most common bot for LLM training data collection. |
| ChatGPT-User | OpenAI | Bot activated when using ChatGPT’s ‘Browsing’ feature. |
| CCBot | Common Crawl | Large-scale crawler used as a foundational dataset for most LLMs. |
| anthropic-ai | Anthropic (Claude) | Data collection for Claude model training. |
| Google-Extended | Google (Gemini, etc.) | Used to maintain search visibility while selectively blocking AI training. |
| Bytespider | ByteDance (TikTok/Doubao) | Extremely aggressive. One of the main culprits of server load. |
| FacebookBot | Meta (Llama, etc.) | For Meta’s AI model training, including the Llama series. |
| Amazonbot | Amazon | For Alexa and Bedrock model training. |

Based on the analysis above, here is a configuration you can apply to your server right away. Copy it and adjust the paths to match your own site.

Plaintext

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

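# [Important] Block high-cost computation pages for all bots
User-agent: *
Disallow: /git/blame/       # Resource-intensive path (prevents CPU load)
Disallow: /search/          # Internal search result pages
Disallow: /admin/           # Admin pages
Crawl-delay: 10             # (Note) Only some well-behaved bots honor this, but setting it is still recommended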
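
Before deploying rules like these, it helps to check that they say what you intend. The following is a minimal sketch using Python's standard `urllib.robotparser`. The abbreviated rule set, the `ExampleBrowser` agent name, and the sample paths are assumptions for illustration. Note that this only shows how a compliant parser reads the file; it says nothing about bots that ignore it.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the configuration; paste your real file here.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /git/blame/
Disallow: /search/
Disallow: /admin/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# (user agent, path) pairs to check; the agent and paths are hypothetical.
checks = [
    ("GPTBot", "/docs/index.html"),
    ("ExampleBrowser", "/docs/index.html"),
    ("ExampleBrowser", "/git/blame/main/README.md"),
]
results = {pair: parser.can_fetch(*pair) for pair in checks}

for (agent, path), ok in results.items():
    print(f"{agent:15} {path:30} {'allowed' if ok else 'blocked'}")

# Crawl-delay that applies to agents falling under the catch-all group.
default_delay = parser.crawl_delay("ExampleBrowser")
print("Crawl-delay for other agents:", default_delay)
```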

2.2 Critical Limitations of robots.txt

While the above configuration is essential, it is unfortunately not foolproof.

  1. Voluntary Compliance: robots.txt is a ‘request’, not a law. Malicious AI startups or scrapers built by hackers will ignore it completely.
  2. User-Agent Spoofing: Malicious bots can disguise themselves as regular Chrome browsers, using names like Mozilla/5.0 (Windows NT 10.0...), to gain entry.

Ultimately, robots.txt is merely a ‘nameplate’; to actually block intruders, a ‘lock’ (technical defense) is necessary.

3. Practical Defense Strategy II: Technical Defense Implementation Scenarios

Beyond simple blocking, we’ve outlined specific scenarios to protect system resources and neutralize bots, categorized by scale.

3.1 Comparison of Technical Defense Solutions by Scenario

| Category | Scenario A: Small-to-Medium / Open Source | Scenario B: Enterprise / Media |
| --- | --- | --- |
| Recommended Solution | Anubis (Proof-of-Work Middleware) | Cloudflare WAF & AI Labyrinth |
| Primary Target | Personal blogs, open-source communities | News outlets, large e-commerce sites, platform companies |
| Core Technology | Proof-of-Work (PoW) | Honey Pot & Data Poisoning |
| How it Works | Presents cryptographic puzzles (finding SHA-256 hashes) to connecting clients, requiring them to solve it for access. | Detects bots and redirects them to a ‘labyrinth’ containing fake data, trapping them in an infinite loop. |
| Defense Effect | Forcibly consumes bot CPU resources, reducing profitability and discouraging them. | Corrupts training data, inducing hallucinations in AI models, providing a powerful deterrent. |
| Cost | Inexpensive (Open-source utilization possible) | Incurs costs (Requires enterprise plan) |

Scenario A Details: Anubis’s Difficulty Adjustment Strategy

Anubis adjusts the puzzle’s difficulty based on traffic conditions.

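  • Normal Mode: The hash must start with a small number of zeros (roughly 0.5 seconds in a human visitor’s browser, but the cost starts to add up for a bot making thousands of requests).
  • Defense Mode: During traffic surges, the required number of leading zeros is raised. Each additional hexadecimal zero multiplies the expected work by 16, so the computational load for bots grows exponentially and sustained crawling becomes impractical. (Crucially, major search engine IPs like Google must be whitelisted).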
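
The idea behind a leading-zeros puzzle can be sketched in a few lines of Python. This is a generic proof-of-work illustration, not Anubis’s actual implementation: the challenge string, the way the nonce is appended, and the difficulty values are all assumptions.

```python
import hashlib
from itertools import count


def digest_for(challenge: str, nonce: int) -> str:
    """SHA-256 of the challenge with the nonce appended, as a hex string."""
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce whose digest starts with `difficulty` hex zeros."""
    prefix = "0" * difficulty
    for nonce in count():
        if digest_for(challenge, nonce).startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's answer."""
    return digest_for(challenge, nonce).startswith("0" * difficulty)


def expected_attempts(difficulty: int) -> int:
    """Average number of hashes needed: each extra hex zero costs 16x more."""
    return 16 ** difficulty


challenge = "example-challenge"  # hypothetical per-visitor challenge value
nonce = solve(challenge, 3)      # low difficulty so the demo finishes quickly
print("nonce:", nonce, "valid:", verify(challenge, nonce, 3))
print("expected attempts at difficulty 4 vs 5:",
      expected_attempts(4), expected_attempts(5))
```

The asymmetry is the point: the server verifies with one hash, while the client pays for thousands. A single human page view barely notices, but a crawler fetching millions of pages pays that cost on every request.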

Scenario B Details: Cloudflare’s AI Labyrinth and Data Poisoning

Using Cloudflare Workers, unauthorized bots are redirected to AI-generated fake pages instead of actual content.

  • Invisible Links: Links visible only to bots are embedded, causing them to endlessly loop within the fake environment.
  • Data Poisoning: Subtle misinformation is mixed with factual data. AI models trained on this degraded data experience performance issues, making it a terrifying countermeasure for AI companies.
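
The invisible-link idea can be illustrated with a small honeypot sketch in Python. It is not Cloudflare’s implementation: the `/labyrinth/` path, the `inject_trap_link` helper, and the `TrapTracker` class are hypothetical names, and the client identifier stands in for whatever signal (IP, fingerprint, session) a real deployment would use.

```python
TRAP_PREFIX = "/labyrinth/"  # hypothetical path that is never linked visibly


def inject_trap_link(html: str, token: str) -> str:
    """Add a link that stays in the markup but is hidden from human visitors."""
    link = (
        f'<a href="{TRAP_PREFIX}{token}" rel="nofollow" '
        'style="display:none" aria-hidden="true" tabindex="-1">archive</a>'
    )
    return html.replace("</body>", link + "</body>", 1)


class TrapTracker:
    """Remembers which clients have followed a trap link."""

    def __init__(self) -> None:
        self.flagged: set[str] = set()

    def observe(self, client_id: str, path: str) -> bool:
        """Record a request; return True if the client is flagged as a bot."""
        if path.startswith(TRAP_PREFIX):
            self.flagged.add(client_id)
        return client_id in self.flagged


page = inject_trap_link("<html><body><p>Real content</p></body></html>", "a1b2c3")
tracker = TrapTracker()

human_flagged = tracker.observe("client-human", "/docs/index.html")
bot_flagged = tracker.observe("client-bot", "/labyrinth/a1b2c3")
bot_still_flagged = tracker.observe("client-bot", "/docs/index.html")
print(human_flagged, bot_flagged, bot_still_flagged)
```

In a sketch like this, the trap path should also be disallowed in robots.txt, so that crawlers which respect the rules never reach it and only rule-breakers get flagged.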

4. Redefining Protocols: The Advent of the Permissioned Web

After this intense technological battle, the fundamental structure of the internet is shifting.

It’s a massive transition from an ‘open web accessible to anyone’ to a ‘contractual web accessible only to verified identities’.

4.1 “Forget IPs”: Cryptographic Identity Verification (Web Bot Auth)

Cloudflare recently declared, “The era of IP-based blocking is over,” proposing the ‘Web Bot Auth’ standard.

[Figure: Web Bot Auth]

  • How it Works: Bot developers sign HTTP request headers with their private key to prove their identity. Websites verify this using the public key.
  • Significance: IP spoofing is no longer effective. It’s now possible to mathematically verify “Is this truly OpenAI’s bot?” This signals the internet’s shift from ‘anonymous trust’ to ‘verified trust (Zero Trust)’.

4.2 Pay-Per-Crawl: Monetizing Data
[Figure: Pay-Per-Crawl]

The ‘Pay-Per-Crawl’ model, launched by Cloudflare in July 2025, has ushered in an era in which major media outlets like TIME and Fortune officially charge AI companies for their data.

Ultimately, the internet is splitting into two.

A ‘premium web’ accessible only to paying AIs, and a ‘junk web’ tangled with bots and fake data.

This is a bitter, yet unavoidable reality.

5. Conclusion: It’s Time to Raise Your Digital Walls

The cases of SourceHut and Read the Docs have clearly demonstrated that relying on a paper shield like robots.txt is no longer sufficient to protect infrastructure.

The insatiable data crawling by AI companies will not stop.

Defense is no longer an option; it’s a matter of survival.

Website operators, I urge you to take the following actions immediately:

  1. Immediate Check: Open your robots.txt and add the explicit AI bot blocking rules mentioned above.
  2. Log Analysis: Analyze your traffic logs to check for abnormal user agents and access to high-cost endpoints (such as the git blame and git log pages discussed above).
  3. Implement Defense: Seriously consider adopting a technical defense solution suitable for your scale (Anubis or Cloudflare WAF).
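
As a starting point for the log analysis step, here is a minimal Python sketch that summarizes an access log by user agent and by hits on expensive paths. It assumes the common "combined" log format; the path prefixes, the agent tokens, and the sample lines (which use documentation-reserved IP addresses) are made up for illustration and should be replaced with your own.

```python
import re
from collections import Counter

# Matches the "combined" access log format (an assumption about your server).
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
HIGH_COST_PREFIXES = ("/git/blame/", "/search/")  # adjust to your own site
AI_AGENT_TOKENS = ("GPTBot", "CCBot", "Bytespider", "Amazonbot")


def summarize(lines):
    """Count requests per user agent, costly hits per IP, and known AI bot hits."""
    by_agent, costly_by_ip, ai_hits = Counter(), Counter(), 0
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        by_agent[match["agent"]] += 1
        if match["path"].startswith(HIGH_COST_PREFIXES):
            costly_by_ip[match["ip"]] += 1
        if any(token in match["agent"] for token in AI_AGENT_TOKENS):
            ai_hits += 1
    return by_agent, costly_by_ip, ai_hits


# Made-up sample lines, not real traffic.
sample = [
    '203.0.113.7 - - [10/May/2024:12:00:01 +0000] "GET /git/blame/main/README.md HTTP/1.1" 200 5120 "-" "GPTBot/1.0"',
    '203.0.113.7 - - [10/May/2024:12:00:02 +0000] "GET /git/blame/main/setup.py HTTP/1.1" 200 4096 "-" "GPTBot/1.0"',
    '192.0.2.10 - - [10/May/2024:12:00:03 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    "not a log line",
]

by_agent, costly_by_ip, ai_hits = summarize(sample)
print(by_agent.most_common(3))
print(costly_by_ip.most_common(3))
print("requests from known AI bot agents:", ai_hits)
```

Because spoofed user agents will not show up in the token count, the per-IP count of costly requests is usually the more telling signal.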

We are witnessing the re-architecture of the internet.

IP addresses have lost their trustworthiness, and data has become a paid asset.

In this turbulent wave of change, only those who build robust technological defenses will be able to preserve their digital territories.
