The age of generative AI is upon us, and with it comes a new and powerful wave of automated bots. These bots, the digital librarians for services like ChatGPT, Gemini, and Perplexity, are constantly scouring the internet for information to power their responses. While this represents a monumental tool for knowledge dissemination, a growing concern is emerging: are these AI bots playing by the rules?
For years, a simple text file, robots.txt, has served as a digital handshake between website owners and automated bots. It is a clear and simple way for a site owner to communicate which areas of their site are open for access and which they would prefer to be left alone. The system is fundamentally built on trust and mutual respect.
Back in the pre-AI, pre-LLM era, the world felt simple. There were just two kinds of bots: the good and the bad. It was black and white. Good bots clearly identified themselves, while bad bots aimed to bypass defences by mimicking genuine users. Our job was to tell these two groups apart, allowing the good while blocking the bad.
Fast forward over a decade, and the world has changed. With the evolution of agentic bots and other LLM-based systems, we have moved from a black-and-white landscape to a grayscale one. Today’s AI bots aren’t just simple scrapers anymore; they are prompt-driven agents that can browse like people, obey or ignore site rules, and even change their identities. The line has blurred, and the problem has become significantly more complex.
The Shape-Shifting Bot: A New Challenge
Many responsible AI services, such as ChatGPT, do identify their bots by publishing their user agents and IP ranges, a welcome practice that allows website owners to control access. If a site owner decides to block such a bot, they can easily do so. But what happens when the AI bot doesn’t take “no” for an answer? We are now seeing instances where AI bots, upon being blocked, manipulate their identity to mimic a regular user. This allows them to bypass the website owner’s defences and access content they were explicitly denied. This practice is more than a technical workaround; it’s a fundamental breach of trust that raises serious ethical questions. This new breed of AI bots and AI agents can blend in with human traffic by using headless browsers, realistic timing, and rotating IP addresses, making them difficult to spot with superficial checks.
This challenge of misuse is compounded by a related security risk. The problem of bad actors hiding behind legitimate identities is not new: fraudsters have long hidden behind genuine user agents to evade detection, which is why combining the User-Agent with a reverse-DNS check on the source IP has traditionally been a stronger test, as sketched below. That said, the problem will persist if genuine AI bots and AI agents are misused by fraudsters to carry out their tasks. AI bot platforms must therefore also restrict their services from being used for malicious bot activity on behalf of these bad actors.
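For illustration, here is a minimal sketch of that User-Agent plus reverse-IP check in Python: resolve the client IP back to a hostname, confirm the hostname sits under the operator’s published domain, then resolve it forward again to make sure it round-trips. The domain suffixes shown are examples; consult each operator’s documentation for the authoritative values.

```python
import socket

# Example verified-bot domains; check each operator's documentation for
# the authoritative suffixes (Googlebot, for instance, verifies against
# googlebot.com / google.com).
VERIFIED_BOT_DOMAINS = {
    "Googlebot": (".googlebot.com", ".google.com"),
}

def verify_bot(client_ip: str, claimed_bot: str) -> bool:
    """Reverse-then-forward DNS check: the IP must resolve to a hostname
    under the operator's domain, and that hostname must resolve back to
    the same IP."""
    suffixes = VERIFIED_BOT_DOMAINS.get(claimed_bot)
    if not suffixes:
        return False  # unknown bot name: treat as unverified
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)  # reverse lookup
        if not hostname.endswith(suffixes):
            return False  # hostname is outside the operator's domain
        return socket.gethostbyname(hostname) == client_ip  # forward lookup
    except (socket.herror, socket.gaierror):
        return False  # lookup failed: cannot verify
```

A bot that merely writes “Googlebot” into its User-Agent will fail this round-trip, which is exactly why it outperforms a name check alone.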
The Ethical Dilemma: Innovation vs. Integrity
The argument for this behaviour is often framed in the context of innovation. AI models, it is argued, need vast amounts of data to provide comprehensive answers, and blocking them hinders technological progress. However, this conveniently ignores the rights of content creators. Website owners have a right to control their intellectual property and the user experience on their sites. When a bot bypasses these restrictions, it consumes server resources, skews analytics, potentially accesses sensitive information, and erodes potential ad revenue.
This raises a crucial question: is it ethical for an AI bot to deceive a website to get what it wants? The answer should be a resounding NO. True innovation cannot be built on a foundation of deception. As an industry, we must hold AI services to a higher standard. This includes truthful identification, respecting a site owner's rules by default, and never escalating privileges by retrying with a changed identity after being denied.
A Modern Playbook for a Grayscale World
Given that simply blocking a user agent and IP range is no longer enough, site owners need a more layered and sophisticated strategy.
1. Publish Your Policy
The first step is still the digital handshake. Use robots.txt to clearly express your intent for AI agents. This includes setting explicit “Allow” and “Disallow” rules for specific bots.
# === AI BOTS WE ALLOW ===
# We explicitly welcome the following AI agents.
User-agent: ChatGPT-User
Disallow:
# Allow Google's AI Overviews and Gemini models
User-agent: Google-Extended
Disallow:
# Allow Perplexity AI's bot
User-agent: PerplexityBot
Disallow:
# === BOTS WE BLOCK ===
# Block the aggressive RoughLLM crawler from the entire site.
User-agent: RoughLLM
Disallow: /
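For contrast, honouring this policy takes a well-behaved bot almost no effort. A minimal sketch using Python’s standard urllib.robotparser (the domain is illustrative):

```python
from urllib import robotparser

# Load the site's published policy (URL is illustrative).
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

# A well-behaved agent checks before every fetch.
print(rp.can_fetch("PerplexityBot", "https://example.com/articles/"))  # True per the policy above
print(rp.can_fetch("RoughLLM", "https://example.com/articles/"))       # False: blocked site-wide
```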
2. Enforce, Don’t Just Trust
Since robots.txt is advisory, you must have enforcement mechanisms. This means moving beyond names and looking at immutable evidence.
- At Your Digital Doorstep (Edge Rules): You can set up rules at your network edge to match known AI user agents and block them from sensitive paths with a 403 error. You should also apply rate limits and burst caps to prevent any single session or network from overwhelming your site; a minimal sketch of such a rule follows this list.
- Checking Digital Fingerprints (Network Analysis): Go deeper than the user agent by tracking the underlying TLS/HTTP fingerprints, header order, and other network-level features. These unique signatures are much harder for a bot to spoof and can be tied to a behavioural score to assess risk.
- Issuing a Hall Pass (Token Gating): For high-value pages and APIs, you can issue short-lived, origin-bound tokens. Validating these tokens ensures that a real browser runtime executed your script, effectively filtering out simpler automated clients; a token sketch also follows the list.
- Progressive Challenges: Instead of blocking outright, you can apply progressive frictions. Start with invisible checks, and if suspicion grows, escalate to a JavaScript compute challenge, a CAPTCHA, or even step-up authentication, all while exempting known good bots and partners to maintain a clean user experience.
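As a rough illustration of the edge-rule idea, here is a minimal sketch in Python. The blocked agent name, the path prefixes, and the window/cap values are illustrative; a real deployment would express this in your CDN or WAF’s own rule language.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # illustrative sliding window
MAX_REQUESTS = 20     # illustrative burst cap per client per window

_hits: dict[str, deque] = defaultdict(deque)

def over_rate_limit(client_key: str) -> bool:
    """Sliding-window burst cap keyed by client (IP, session, or
    fingerprint). Returns True when the client should be throttled."""
    now = time.time()
    window = _hits[client_key]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()          # evict hits outside the window
    window.append(now)
    return len(window) > MAX_REQUESTS

BLOCKED_AI_AGENTS = ("RoughLLM",)          # from your robots.txt policy
SENSITIVE_PREFIXES = ("/account", "/api")  # illustrative paths

def edge_decision(user_agent: str, path: str, client_key: str) -> int:
    """Return an HTTP status: 403 for declared-but-blocked AI agents on
    sensitive paths, 429 for rate abusers, 200 otherwise."""
    if any(agent in user_agent for agent in BLOCKED_AI_AGENTS) \
            and path.startswith(SENSITIVE_PREFIXES):
        return 403
    if over_rate_limit(client_key):
        return 429
    return 200
```

Keying the window on a client fingerprint rather than a bare IP holds up better against bots that rotate addresses.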
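And for token gating, a minimal sketch assuming an HMAC secret held at your edge; the 60-second lifetime and the function names are illustrative:

```python
import hashlib
import hmac
import time

SECRET = b"rotate-me-regularly"   # illustrative; keep in a secrets store
TOKEN_TTL = 60                    # seconds a token stays valid

def issue_token(origin: str, client_id: str) -> str:
    """Mint a short-lived token bound to the origin and client."""
    issued_at = str(int(time.time()))
    payload = f"{origin}|{client_id}|{issued_at}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{issued_at}.{sig}"

def validate_token(token: str, origin: str, client_id: str) -> bool:
    """Recompute the signature and enforce the TTL."""
    try:
        issued_at, sig = token.split(".")
        age = time.time() - int(issued_at)
    except ValueError:
        return False  # malformed token
    if age > TOKEN_TTL:
        return False  # stale: likely replayed or cached by a bot
    payload = f"{origin}|{client_id}|{issued_at}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

Because only a genuine browser runtime executing your script ever receives a token, simpler automated clients fail the gate without ever seeing a visible challenge.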
3. Detect Like a Fraud Team
The most sophisticated bots can only be caught by analysing their actions over time. You must combine client-side and server-side signals to get a full picture.
- Client-Side Validation: Validate the capabilities of the client, whether a browser or a mobile app on Android or iOS, to ensure it is what it claims to be. Real users have subconscious patterns. Analyse micro-gesture entropy, such as the natural jitter of a mouse pointer, the curvature of a scroll, and acceleration patterns. You can also look at the timing variance between keystrokes and interactions.
- Server-Side Patterns: A bot's behaviour often looks unnatural on the backend. Look for abnormal read-to-write ratios, hopping between pages in a way no human would (sitemap-blind hopping), or relentless, 24/7 flat activity. Also, check for mismatches between the TLS fingerprint and the claimed device.
- Model the Entire Journey: Don't focus on a single hit; model the entire session. By building per-session feature vectors and running anomaly detection, you can spot the non-human journeys that stand out from the crowd, as sketched below.
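As a minimal sketch of that journey-level approach, assuming scikit-learn is available and sessions have already been reduced to numeric features (the features and values here are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Per-session feature vectors: [requests/min, mean inter-request gap (s),
# read-to-write ratio, fraction of hits following site structure,
# active hours/day]. Values are illustrative.
sessions = np.array([
    [4.0,   14.2, 9.5,   0.92, 2.1],   # human-looking browsing
    [3.1,   18.7, 8.0,   0.88, 1.4],
    [2.5,   22.0, 11.2,  0.95, 3.0],
    [240.0, 0.2,  400.0, 0.05, 24.0],  # relentless, sitemap-blind crawler
])

# Fit an unsupervised anomaly detector over recent sessions.
model = IsolationForest(contamination=0.25, random_state=42)
labels = model.fit_predict(sessions)   # -1 flags an outlier session

for features, label in zip(sessions, labels):
    status = "anomalous" if label == -1 else "normal"
    print(f"{status}: {features}")
```

In production the same idea runs over millions of sessions, with the flagged outliers feeding the behavioural score used for enforcement.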
4. Respond Proportionally
Your response doesn’t have to be a simple on/off switch. A proportional response protects your site without harming real customers; a sketch mapping risk signals to the tiers below follows this list.
- Soft Block: For agents you don't mind discovering your URLs, you can render a minimal, cached page.
- Degrade: For medium-risk signals, degrade the experience by stripping dynamic content or hiding prices/inventory.
- Hard Block: For declared AI agents on disallowed paths or when a behavioural score crosses a high threshold, a hard block (403 error) is appropriate.
- Tar Pits & Honeypots: Set up decoy endpoints to waste scraper cycles and gather intelligence on malicious actors.
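Tying the tiers together, a minimal dispatcher sketch; the thresholds and signal names are illustrative and would be tuned against your own traffic:

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    SOFT_BLOCK = "soft_block"   # minimal cached page
    DEGRADE = "degrade"         # strip dynamic content, hide prices
    HARD_BLOCK = "hard_block"   # 403
    TAR_PIT = "tar_pit"         # decoy endpoint

def choose_response(risk_score: float, declared_ai_bot: bool,
                    path_disallowed: bool, hit_honeypot: bool) -> Action:
    """Map signals to a proportional response. Thresholds illustrative."""
    if hit_honeypot:
        return Action.TAR_PIT       # waste scraper cycles, gather intel
    if declared_ai_bot and path_disallowed:
        return Action.HARD_BLOCK    # the published policy said no
    if risk_score >= 0.9:
        return Action.HARD_BLOCK    # behavioural score crossed threshold
    if risk_score >= 0.6:
        return Action.DEGRADE       # medium-risk signals
    if risk_score >= 0.3:
        return Action.SOFT_BLOCK    # low risk: minimal cached page
    return Action.ALLOW
```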
The Path Forward: Transparency, Respect, and a Choice
The solution to this problem is not to stifle innovation but to foster a culture of transparency and respect. AI companies must precisely identify their bots and respect the robots.txt protocol as a guiding principle. To improve on this, AI bots should adopt the emerging Web Bot Auth standard and publish their cryptographic public keys, so that site owners know exactly with whom they are talking. You can learn more about this proposed standard here.
If a site owner’s decision is to disallow access, that should be respected. A polite knock on the door is always preferable to a broken lock.
To achieve this, you have a choice. You can build out the sophisticated, multi-layered defence described above, implementing network fingerprinting, behavioural analysis, and adaptive enforcement rules yourself. Alternatively, you can work with an advanced bot management solution like Radware Bot Manager. We do all of this and more to identify bots that mimic genuine humans. A managed, battle-tested platform provides multilayer detection, including journey-level behavioural models, collective bot intelligence gathered from tens of thousands of applications, and 24/7 SOC teams to enforce your policies against evasive AI bots. This allows your team to stay focused on your business, confident that your digital handshake is being honoured.