For most of the last decade, the bot conversation at any site owner's table was about two things: good bots and bad bots. Search engine crawlers sat on one side, scrapers and credential stuffers sat on the other, and the job was to tell them apart. Things were a little more nuanced than that, of course, but the mental model worked.
The generative AI wave broke that model. In our previous post, "The Polite Knock or the Broken Lock? Navigating the Grayscale World of AI Bots," we talked about how AI bots are no longer simple scrapers and how identity itself has become negotiable. This post takes that conversation one step further, because within what most people still lump together as "AI traffic" there are actually two very different classes of automation hitting your site, and they need to be treated differently.
Those two classes are AI crawlers and AI agents. They share a label in casual conversation, but they have different purposes, different traffic shapes, different levels of self-identification, and therefore different defence strategies. Treating them as one bucket is the most common mistake we see site owners make today.
AI Crawlers vs. AI Agents: The Real Distinction
An AI crawler is, at heart, a centralized collector. Its job is broad discovery, indexing, and ingestion of web content for training, search, or retrieval. It typically runs from vendor-controlled infrastructure, it tends to be documented, and in most cases, it actually tells you who it is.
All the major AI vendors now publish, to varying degrees, the identities of their crawlers and the infrastructure those crawlers run from. A few examples of where to find this:
- OpenAI separates its bots by role: GPTBot for training-related crawling, OAI-SearchBot for surfacing sites inside ChatGPT search results, and ChatGPT-User for user-directed fetches. Each one is documented with its own user agent string, robots.txt implications, and a published IP address list. See OpenAI's bot documentation.
- Anthropic does something similar with ClaudeBot, Claude-User, and Claude-SearchBot, each tied to a distinct product role, with robots.txt behaviour documented. See Anthropic's crawler documentation.
- Google, the elder statesman of this space, has been publishing crawler catalogs for years and now explicitly separates common crawlers, user-triggered fetchers, and user-triggered agents. Verification is supported through reverse-plus-forward DNS and published JSON IP ranges. See Google's crawler verification guide.
An AI agent is a different animal. It is not trying to harvest the web, it is trying to accomplish a task. It clicks, renders JavaScript, fills forms, holds cookies, moves through multi-step workflows, and generally looks much more like a browsing user than a catalog-building bot. OpenAI's ChatGPT agent, which can navigate websites, use a visual browser, and take actions on behalf of the user, is the current reference example.
Here is the critical part. AI agents come in two flavours today, and only one of them is easy to identify.
The first is the declared agent, where the vendor has invested in a cryptographic identity. ChatGPT agent falls here, signing every outbound HTTP request. The second, and much larger, class is agent-like browser automation. These are browser-based or browser-emulating systems executing goal-driven workflows. They might be built on headless browsers, they might be orchestrated through automation frameworks, and they may not present any trustworthy vendor-verifiable identity at all. Operationally, they behave like agents. Defensively, they can look a lot like users.
One common myth worth correcting, while we are here. AI crawlers do not always come from a small fixed set of IPs, and AI agents do not always come from an end user's laptop. The better way to think about it is that crawlers are typically launched from centralized vendor infrastructure with some form of published identity, and agents are typically user-initiated but may execute from vendor-managed infrastructure, remote browsers, or automation environments rather than the user's own device. The allowlisting model used by ChatGPT agent, which relies on signed requests and public-key discovery, strongly implies controlled agent infrastructure, not random traffic from random laptops.
Why the Same Defence Doesn't Work for Both
Once you accept that these are two different traffic classes, the reason a one-size-fits-all approach fails becomes obvious.
Crawlers are relatively easy to classify. They are broad, they are documented, their vendors want you to be able to identify them (at least the compliant ones), and their infrastructure is reasonably stable. The layered model we discussed in the previous post, robots.txt plus user agent matching plus source verification, works well here.
Agents break every one of those assumptions. Their activity is narrow and task-driven, not broad. Their traffic can look almost identical to a human session. They may not self-identify cleanly. And even when they do identify themselves, a user agent string alone is meaningless, because anyone can send User-Agent: ChatGPT. If your defence pipeline treats declared crawlers, signed agents, and unsigned agent-like automation identically, you are either going to over-block real users or under-block evasive automation, and usually both at the same time.
So the job splits into two problems: identifying crawlers and identifying agents. They deserve different playbooks.
Identifying AI Crawlers
For declared crawlers, the model has matured and is fairly well understood at this point.
Start with policy. Use robots.txt to express intent. For compliant, declared crawlers it is genuinely useful. OpenAI documents robots.txt handling for its crawler families, and Anthropic states that its bots honor robots.txt directives and even respect crawl-delay. It is important to frame robots.txt honestly though. It is a policy declaration, not a security boundary. It works when the bot chooses to comply.
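To make the "policy, not security boundary" point concrete, here is a minimal sketch of how a compliant crawler would read a robots.txt policy, using Python's standard-library parser. The rules themselves are a hypothetical policy we made up for illustration, and `SomeOtherBot` is a made-up name; only the crawler tokens come from the vendor documentation mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: no training crawl at all, search crawling allowed
# except for the checkout flow, and every other bot kept out of /checkout/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /checkout/
Allow: /

User-agent: *
Disallow: /checkout/
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into an object that answers can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = build_policy(ROBOTS_TXT)

# A compliant bot asks before fetching. A non-compliant one simply never asks.
for token in ("GPTBot", "OAI-SearchBot", "SomeOtherBot"):
    for path in ("/articles/x", "/checkout/review"):
        print(token, path, policy.can_fetch(token, path))
```

Note that a named group replaces the `*` group for that bot rather than adding to it, which is why the checkout rule is repeated under `OAI-SearchBot`. Nothing in this file stops a bot that ignores it.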
Next, match the user agent against documented crawler families. But never stop there. User agents are trivially spoofable.
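As a rough sketch of this step, the function below looks for the crawler tokens named earlier in this post inside a user agent string. The table and function name are our own illustration, and the sample strings in the usage lines are simplified. The return value is deliberately called a claim: it says what the request says it is, nothing more.

```python
from typing import Optional

# Product tokens named in this post. Extend from each vendor's documentation.
DOCUMENTED_TOKENS = {
    "GPTBot": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "ClaudeBot": "Anthropic",
    "Claude-User": "Anthropic",
    "Claude-SearchBot": "Anthropic",
}


def claimed_crawler(user_agent: str) -> Optional[tuple[str, str]]:
    """Return (vendor, token) if the UA claims a documented crawler, else None.

    This is an unverified claim. Anyone can send any User-Agent string.
    """
    ua = user_agent.lower()
    for token, vendor in DOCUMENTED_TOKENS.items():
        if token.lower() in ua:
            return vendor, token
    return None


# Simplified sample strings, for illustration only.
print(claimed_crawler("Mozilla/5.0 (compatible; GPTBot/1.1)"))
print(claimed_crawler("Mozilla/5.0 (X11; Linux x86_64) Chrome/131.0.0.0"))
```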
The third and most important step is source verification. Google's long-standing guidance, which is still the gold standard, is reverse DNS followed by forward DNS back to the originating IP, or matching the source IP against Google's published crawler IP ranges. OpenAI publishes IP address resources for the same purpose. For declared crawlers, the rule is: do not trust a user agent without also verifying that the network identity matches what the vendor has published.
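Both verification routes can be sketched with the standard library. This is a minimal illustration under stated assumptions: the CIDR ranges and hostname suffixes are placeholders that you would load from each vendor's published documentation, and the resolver functions are injectable so the logic can be exercised without live DNS.

```python
import ipaddress
import socket
from typing import Callable, Iterable


def ip_in_published_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """True if ip falls inside any CIDR block from the vendor's published list."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def _reverse(ip: str) -> str:
    # PTR lookup: IP address to hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> Iterable[str]:
    # A/AAAA lookup: hostname back to its IP addresses.
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def forward_confirmed_rdns(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], Iterable[str]] = _forward,
) -> bool:
    """Reverse DNS, suffix check, then forward DNS back to the same IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False
    # Suffixes should start with a dot so "evilgooglebot.com" cannot match.
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        target = ipaddress.ip_address(ip)
        return any(ipaddress.ip_address(a) == target for a in forward(host))
    except OSError:
        return False
```

In practice you would call `forward_confirmed_rdns(source_ip, (".googlebot.com",))` with the suffix list taken from the vendor's verification guide, and cache the result, since two DNS lookups per request is too slow for the hot path.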
Finally, maintain this intelligence. Vendor infrastructure evolves, IP lists change, and new crawler families show up.
Identifying AI Agents
This is where it gets harder, and where most defence stacks fall behind.
For declared agents, cryptographic request verification is the strongest identity layer available today. This is often described under the umbrella of Web Bot Auth-style mechanisms, built on the HTTP Message Signatures standard. The idea is simple in principle. The agent signs every outbound HTTP request with a private key. The site verifies the signature using the vendor's published public key, discovered through a well-known URL. If the signature verifies, the identity is real. If it doesn't, it isn't.
OpenAI's ChatGPT agent is a concrete implementation of this approach. A signed request carries a small set of signature-related headers that the receiving edge verifies together in real time. A simplified illustration looks something like this:
GET /checkout/review HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36
    (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
Signature-Agent: "https://chatgpt.com"
Signature-Input: sig1=("@method" "@authority" "@path" "signature-agent");
    created=1737046800;
    keyid="openai-agent-key-2025-01";
    alg="ed25519"
Signature: sig1=:MEUCIQD...base64-signature-bytes...:
Notice the User-Agent here. It looks like a completely ordinary Chrome browser, because in practice, that is exactly what most agent-like traffic looks like at the user agent layer. This is one of the key operational differences between crawlers and agents. A crawler like GPTBot is happy to tell you it is GPTBot. An agent, declared or otherwise, is usually driving a real browser engine and therefore carries a real browser user agent. The identity, if it exists at all, lives in the signature headers, not in User-Agent.
The receiving edge pulls the public key from the vendor's well-known endpoint, reconstructs the signature base from the listed fields, and verifies the signature. No verification, no trust. This is a meaningfully stronger model than user agent checking, because cryptography doesn't care what the User-Agent string says.
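The "reconstructs the signature base" step is the part that tends to trip people up, so here is a minimal sketch of it, following the signature base layout in the HTTP Message Signatures standard. It handles only the three derived components used in the illustration above, uses a simplified parser rather than a full structured-field parser, and the function names are our own. The final Ed25519 check is left to whatever cryptography library your edge already uses and is not shown.

```python
import re


def parse_signature_input(header_value: str) -> tuple[str, list[str], str]:
    """Split 'sig1=(...);params' into (label, covered components, raw params)."""
    label, _, params = header_value.partition("=")
    inner = params[params.index("(") + 1 : params.index(")")]
    covered = re.findall(r'"([^"]+)"', inner)
    return label.strip(), covered, params.strip()


def build_signature_base(
    method: str,
    authority: str,
    path: str,
    headers: dict[str, str],  # keys lowercased
    signature_input: str,
) -> str:
    """Rebuild the exact string the agent signed, one line per covered component."""
    _, covered, params = parse_signature_input(signature_input)
    # Only the derived components from the example are supported here.
    derived = {
        "@method": method.upper(),
        "@authority": authority.lower(),
        "@path": path,
    }
    lines = []
    for name in covered:
        if name.startswith("@"):
            value = derived[name]
        else:
            value = headers[name.lower()].strip()
        lines.append(f'"{name}": {value}')
    # The signature parameters are always the last line, with no trailing newline.
    lines.append(f'"@signature-params": {params}')
    return "\n".join(lines)


EXAMPLE_INPUT = (
    'sig1=("@method" "@authority" "@path" "signature-agent");'
    'created=1737046800;keyid="openai-agent-key-2025-01";alg="ed25519"'
)
example_base = build_signature_base(
    "GET",
    "example.com",
    "/checkout/review",
    {"signature-agent": '"https://chatgpt.com"'},
    EXAMPLE_INPUT,
)
print(example_base)
```

The bytes of `example_base`, the decoded value of the `Signature` header, and the public key looked up by `keyid` are the three inputs to the verification call. A production check would also reject signatures whose `created` timestamp is too old.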
Worth saying plainly: Web Bot Auth is an emerging approach, not a universal standard yet. Parts of it still live in active IETF drafts. But the direction of travel is clear, and declared agents from serious vendors are moving this way.
For unsigned or loosely declared agent-like automation, which is the majority of real-world agent traffic today, cryptography is not available. Here, the defence moves into behavioural and runtime territory: client-side validation of the execution environment, JavaScript runtime fingerprints, TLS and HTTP-layer signatures, header ordering, micro-gesture entropy, session journey modelling, and anomaly detection on a per-session feature vector. None of these is proof on its own, but together they form a probabilistic identity that is hard to fake consistently across a full workflow. This is essentially the layered behavioural detection philosophy we have written about before, applied to agent-shaped traffic.
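One simple way to picture "weak signals combining into a probabilistic identity" is to add log-odds, in the style of a naive Bayes model. The sketch below is only an illustration of that idea: the signal names, weights, and prior are hypothetical values we picked for the example, not tuned figures, and a production system would learn them from labelled traffic.

```python
import math

# Hypothetical log-odds contribution of each signal when it fires.
WEIGHTS = {
    "headless_runtime_markers": 2.0,
    "tls_fingerprint_mismatch": 1.5,
    "unusual_header_order": 0.8,
    "low_gesture_entropy": 1.2,
    "atypical_session_journey": 1.0,
}

# Hypothetical prior: most sessions are assumed to be human.
PRIOR_LOG_ODDS = -3.0


def automation_probability(signals: dict[str, bool]) -> float:
    """Combine the signals that fired into one probability of automation."""
    log_odds = PRIOR_LOG_ODDS + sum(
        weight for name, weight in WEIGHTS.items() if signals.get(name)
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


p_none = automation_probability({})
p_one = automation_probability({"headless_runtime_markers": True})
p_all = automation_probability({name: True for name in WEIGHTS})
print(round(p_none, 3), round(p_one, 3), round(p_all, 3))
```

With these made-up numbers, even the strongest single signal leaves the session below even odds, while all five firing together pushes it above 0.9. That is the shape you want: no single check decides, but a consistent pattern across the workflow does.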
The cleanest way to summarize the two tracks: crawlers reveal themselves through declared identity and infrastructure. Agents reveal themselves through cryptographic proof when available, and through workflow behaviour when it is not.
Identification Is Only Half the Problem
Here is a question that does not get asked often enough. Once you have correctly identified a request as coming from an AI agent, what then? Do you let it through?
Identification and trust are not the same thing. An agent that you have cryptographically verified as a real ChatGPT agent can be doing something perfectly legitimate on your site: checking out a product on behalf of a user, booking a flight, comparing specifications, or filling a form the user explicitly asked it to fill. Or it can be doing something you would never want any automation to do: scraping your entire catalog, probing your checkout flow for price intelligence, carding, inventory hoarding, or reconnaissance for an attack.
The identity does not tell you which of those two things is happening. The objective does.
So the next logical step, after crawler vs. agent classification, is agent intent classification. What is this agent actually trying to accomplish on my site? Is it a good actor doing a user-delegated task, a neutral actor doing general browsing or research, or a malicious actor using an agent framework as a scraping or fraud tool? The answer to that question should drive the response, not the identity alone. A verified agent with a malicious objective should not get a pass just because its signature checks out. An unsigned agent with a clearly benign objective probably should not be hard-blocked just because it cannot prove who it is.
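To show what "intent drives the response, not identity alone" could look like as a policy, here is a small sketch. The categories mirror the ones in the paragraph above, but the specific actions, and the choice to challenge unsigned agents with neutral intent, are our assumptions for illustration. How the intent label gets inferred is out of scope here.

```python
from enum import Enum


class Identity(Enum):
    VERIFIED_AGENT = "verified_agent"    # signature checked out
    UNSIGNED_AGENT = "unsigned_agent"    # agent-like, no provable identity


class Intent(Enum):
    BENIGN = "benign"        # user-delegated task
    NEUTRAL = "neutral"      # general browsing or research
    MALICIOUS = "malicious"  # scraping, fraud, reconnaissance


class Action(Enum):
    ALLOW = "allow"
    CHALLENGE = "challenge"
    BLOCK = "block"


def decide(identity: Identity, intent: Intent) -> Action:
    """Pick a response from intent first, using identity only as a tiebreaker."""
    if intent is Intent.MALICIOUS:
        # A valid signature does not earn a pass for a malicious objective.
        return Action.BLOCK
    if intent is Intent.BENIGN:
        # A clearly benign task is not hard-blocked for lacking a signature.
        return Action.ALLOW
    # Neutral intent: identity decides how much friction to apply (assumption).
    if identity is Identity.VERIFIED_AGENT:
        return Action.ALLOW
    return Action.CHALLENGE
```

The point of the ordering is that the identity check never appears before the intent check, so a verified signature can reduce friction but can never override a malicious classification.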
This is a meaty topic on its own, and it deserves its own post. We will dig into it in the coming months, looking at how objective and intent can be inferred from workflow shape, request patterns, and session-level behaviour, and how that signal can be combined with the identification layers we have been discussing here. Consider this a preview.
What This Means for Site Owners
The practical takeaway is that the defence posture has to evolve. The tooling and processes most sites built for the last generation of bots, and even for declared AI crawlers, were not designed for this new traffic class. Site owners now need to invest real time in two things: first, identifying what is actually hitting their application, at the level of crawler vs. signed agent vs. agent-like automation vs. human, and second, managing each class with a response that fits it.
That means a roadmap item to support Web Bot Auth verification at the edge. It means keeping crawler IP and DNS intelligence current. It means behavioural modelling that is tuned for workflow-driven automation, not just high-volume scraping. And it means treating declared agents as something closer to trusted partners when their signatures verify, while treating unsigned agent-like traffic with appropriate scepticism.
You can build all of this yourself. It is a non-trivial engineering investment, but it is doable. The alternative, and the path most site owners end up choosing once they actually scope the work, is to work with a dedicated bot management platform. Radware Bot Manager already does this across all four traffic classes, combining declared-crawler verification, signed-request validation, network and TLS fingerprinting, and journey-level behavioural models fed by collective intelligence from tens of thousands of applications. Most of the work becomes a few configurations on your side, and the detection and enforcement layers are already up and running.
AI crawlers and AI agents are going to keep evolving, and the line between agents and real users is going to keep blurring. The site owners who fare best in the next two years will be the ones who stop treating "AI traffic" as a single bucket, start defending it as the two distinct classes it actually is, and then go one step further, asking not just who is visiting, but what they are trying to do.