Developer Experience · 14 min read · February 27, 2026

Agent-aware logging: how to distinguish AI agent traffic from bot traffic in your analytics

Anon Team

Your analytics dashboard says you got 50,000 sessions last month. What it doesn't tell you is that 25,000 of those were machines — and it's treating a GPTBot training crawler, a price-scraping competitor, and an AI agent trying to buy your product on behalf of a customer as the same thing.

According to Imperva's 2025 Bad Bot Report, automated traffic surpassed human activity for the first time in a decade, accounting for 51% of all web traffic. But the industry's response — lump everything non-human into a "bot" bucket and either block it or ignore it — is increasingly dangerous. Because not all automated traffic is equal.

Some of those bots are stealing your content. Others are your next customers.

The difference between blocking AI agents and welcoming them could be worth hundreds of thousands in revenue. McKinsey projects that 20–50% of traditional search traffic is at risk of displacement by AI-powered search, representing $750 billion in consumer spend by 2028. HUMAN Security has documented a four-digit percentage increase in agentic commerce traffic over the past year alone.

If you can't distinguish between a malicious scraper and an AI agent trying to complete a purchase, you're flying blind. This post shows you how to build a logging and classification pipeline that actually tells them apart.

The three types of non-human traffic

Before writing any code, you need a mental model. Non-human traffic falls into three categories with fundamentally different behaviors and business value:

1. Crawlers: indexing the web

Crawlers systematically traverse websites to build indexes. They're the most "honest" category — most self-identify via user-agent strings and respect robots.txt.

Examples: Googlebot, GPTBot, ClaudeBot, PerplexityBot, CCBot, Bytespider

Behavioral signature:

  • Breadth-first navigation across many pages
  • Minimal interaction (no clicks, no form fills)
  • Consistent request intervals (often with Crawl-delay respect)
  • Self-identifying user-agent strings
  • Request patterns follow sitemap/link structure

Business value: Medium. Training crawlers fuel the AI models that will surface your product. Search crawlers drive organic traffic. Blocking them has a real cost.

2. Scrapers: extracting specific data

Scrapers target specific data — prices, product listings, contact info, content — and extract it at scale. Some are legitimate (price comparison services); most are not.

Behavioral signature:

  • Narrow, targeted page access (product pages, pricing pages)
  • High request velocity on specific URL patterns
  • Often rotate IPs and spoof user-agents
  • Ignore robots.txt
  • No session continuity — each request is independent

Business value: Usually negative. Competitors scraping your pricing, content thieves copying your docs.

3. AI agents: acting on behalf of users

This is the new category, and the one most analytics stacks can't see. AI agents — like OpenAI's Operator (now ChatGPT agent), Anthropic's Claude computer use, or custom agents built with Browser Use, Playwright, and Puppeteer — browse your site the way a human would, but with commercial intent.

Behavioral signature:

  • Goal-directed navigation (search → product → checkout)
  • Form interactions (signups, purchases, API key generation)
  • Session continuity with state management
  • Often use full browser environments (Playwright/Puppeteer)
  • May or may not self-identify
  • Mouse movements with programmatic precision (ChatGPT's agent moves in 0.25px increments, per HUMAN Security's research)

Business value: High. These are purchase-intent visitors that happen to be machines. Blocking them blocks revenue.

Why standard analytics fail

Google Analytics, Mixpanel, Amplitude — none of them were built for this taxonomy. They typically offer a single binary: "Exclude all hits from known bots and spiders." Check the box, and crawlers, scrapers, and purchase-intent agents all disappear from your data.

The problems compound:

  1. Distorted metrics: counting AI agents as humans inflates your conversion rates, while excluding all bots erases legitimate agent conversions from the data.
  2. No segmentation: You can't A/B test your agent experience if you can't identify agent traffic.
  3. Invisible revenue: If an AI agent purchases your product, was that a "human" conversion or a "bot" conversion? Most analytics can't tell you.
  4. Security blind spots: Lumping agents with scrapers means you either over-block (losing agent customers) or under-block (allowing scrapers through).

You need a purpose-built classification layer that sits between your web server and your analytics.

Building the classification pipeline

Here's a four-layer detection architecture, ordered from cheapest (computationally) to most expensive:

Request → [Layer 1: User-Agent] → [Layer 2: Network Origin] → [Layer 3: Behavior] → [Layer 4: Composite Score] → Analytics

Layer 1: User-agent classification

The cheapest signal. Many crawlers self-identify, and you should take advantage of it. Here's a comprehensive classifier:

// agent-classifier.ts — User-agent classification layer

export interface TrafficClassification {
  category: 'crawler' | 'scraper' | 'agent' | 'human' | 'unknown';
  subcategory: string;
  confidence: number; // 0-1
  operator: string | null;
  signals: string[];
}

// Known AI crawlers that self-identify
const KNOWN_CRAWLERS: Record<string, { operator: string; purpose: string }> = {
  'GPTBot':           { operator: 'OpenAI', purpose: 'training' },
  'ChatGPT-User':     { operator: 'OpenAI', purpose: 'user-initiated' },
  'OAI-SearchBot':    { operator: 'OpenAI', purpose: 'search-index' },
  'ClaudeBot':        { operator: 'Anthropic', purpose: 'training' },
  'Claude-User':      { operator: 'Anthropic', purpose: 'user-initiated' },
  'Claude-SearchBot': { operator: 'Anthropic', purpose: 'search-index' },
  'anthropic-ai':     { operator: 'Anthropic', purpose: 'research' },
  'Google-Extended':  { operator: 'Google', purpose: 'ai-training' },
  'Googlebot':        { operator: 'Google', purpose: 'search-index' },
  'PerplexityBot':    { operator: 'Perplexity', purpose: 'search-index' },
  'Perplexity-User':  { operator: 'Perplexity', purpose: 'user-initiated' },
  'CCBot':            { operator: 'Common Crawl', purpose: 'training' },
  'Bytespider':       { operator: 'ByteDance', purpose: 'training' },
  'Applebot-Extended':{ operator: 'Apple', purpose: 'ai-training' },
  'Amazonbot':        { operator: 'Amazon', purpose: 'ai-training' },
  'Meta-ExternalAgent':{ operator: 'Meta', purpose: 'ai-training' },
  'Grok-bot':         { operator: 'xAI', purpose: 'training' },
  'cohere-ai':        { operator: 'Cohere', purpose: 'training' },
  'AI2Bot':           { operator: 'Allen Institute', purpose: 'research' },
  'Diffbot':          { operator: 'Diffbot', purpose: 'knowledge-graph' },
};

// Automation framework signatures (used by agents)
const AUTOMATION_SIGNATURES = [
  'HeadlessChrome',
  'Playwright',
  'Puppeteer',
  'Selenium',
  'PhantomJS',
  'webdriver',
];

// Known AI agent platforms
const AGENT_SIGNATURES = [
  'Genspark',
  'BrowserUse',
  'Operator',
  'AgentQL',
  'MultiOn',
];

export function classifyUserAgent(ua: string): Partial<TrafficClassification> {
  // Check known crawlers first
  for (const [signature, info] of Object.entries(KNOWN_CRAWLERS)) {
    if (ua.includes(signature)) {
      return {
        category: 'crawler',
        subcategory: info.purpose,
        operator: info.operator,
        confidence: 0.9, // High but not 1.0 — UA can be spoofed
        signals: [`ua-match:${signature}`],
      };
    }
  }

  // Check known agent platforms
  for (const sig of AGENT_SIGNATURES) {
    if (ua.includes(sig)) {
      return {
        category: 'agent',
        subcategory: 'identified-platform',
        operator: sig,
        confidence: 0.85,
        signals: [`agent-platform:${sig}`],
      };
    }
  }

  // Check automation frameworks
  for (const sig of AUTOMATION_SIGNATURES) {
    if (ua.includes(sig)) {
      return {
        category: 'agent', // Could be scraper too — needs more signals
        subcategory: 'automation-framework',
        operator: null,
        confidence: 0.5, // Low — could be agent, scraper, or test suite
        signals: [`automation:${sig}`],
      };
    }
  }

  return {
    category: 'unknown',
    subcategory: 'no-ua-match',
    confidence: 0.0,
    signals: [],
  };
}

Limitation: This only catches traffic that self-identifies. HUMAN Security's research found that every major AI agent they tested uses Playwright, Puppeteer, or Selenium under the hood — but many deliberately mask their user-agent strings. You need more layers.

Layer 2: Network origin analysis

Where the request comes from tells you a lot. Human traffic comes from ISPs. Bots come from cloud providers.

// network-classifier.ts — IP/ASN-based classification

import { open, type Reader, type AsnResponse } from 'maxmind';

// MaxMind's free GeoLite2 ASN database, loaded once at startup
let asnReader: Reader<AsnResponse> | null = null;
open<AsnResponse>('./GeoLite2-ASN.mmdb').then((reader) => { asnReader = reader; });

// Known cloud provider ASN ranges
const CLOUD_ASNS: Record<number, string> = {
  16509: 'AWS',
  14618: 'AWS',
  15169: 'Google Cloud',
  396982: 'Google Cloud',
  8075: 'Microsoft Azure',
  13335: 'Cloudflare',
  14061: 'DigitalOcean',
  63949: 'Linode/Akamai',
  20473: 'Vultr',
  24940: 'Hetzner',
  16276: 'OVH',
};

// Known AI company IP ranges (maintain this list)
const AI_COMPANY_RANGES: Record<string, string[]> = {
  'OpenAI': [
    '20.15.240.0/20',    // Azure-hosted
    '52.230.0.0/16',     // Azure eastus
  ],
  'Anthropic': [
    '35.196.0.0/16',     // GCP us-east
  ],
  'Perplexity': [
    '52.8.0.0/16',       // AWS us-west
  ],
};

// Known VPN/proxy ASNs (to avoid false positives)
const VPN_ASNS = new Set([
  9009,   // M247 (NordVPN, Surfshark)
  212238, // Datacamp (proxies)
  60068,  // CDN77 / Datacamp
]);

export interface NetworkSignal {
  isCloud: boolean;
  cloudProvider: string | null;
  isKnownAI: boolean;
  aiCompany: string | null;
  isVPN: boolean;
  asn: number | null;
  confidence: number;
  signals: string[];
}

export function classifyNetworkOrigin(ip: string): NetworkSignal {
  const geo = asnReader?.get(ip);
  const asn = geo?.autonomous_system_number ?? null;
  const signals: string[] = [];

  let isCloud = false;
  let cloudProvider: string | null = null;
  let isKnownAI = false;
  let aiCompany: string | null = null;
  let isVPN = false;

  // Check ASN against cloud providers
  if (asn && CLOUD_ASNS[asn]) {
    isCloud = true;
    cloudProvider = CLOUD_ASNS[asn];
    signals.push(`cloud-asn:${cloudProvider}`);
  }

  // Check against known AI company ranges
  for (const [company, ranges] of Object.entries(AI_COMPANY_RANGES)) {
    if (ranges.some(range => ipInRange(ip, range))) {
      isKnownAI = true;
      aiCompany = company;
      signals.push(`ai-company:${company}`);
    }
  }

  // Check VPN ASNs
  if (asn && VPN_ASNS.has(asn)) {
    isVPN = true;
    signals.push('vpn-asn');
  }

  const confidence = isKnownAI ? 0.8 : isCloud ? 0.6 : isVPN ? 0.2 : 0.1;

  return { isCloud, cloudProvider, isKnownAI, aiCompany, isVPN, asn, confidence, signals };
}

The key insight from Snowplow's research: network origin is the "sweet spot" for reliability vs. complexity. It's harder to spoof than a user-agent string but simpler to implement than behavioral analysis. The caveat is VPN users — a human browsing through NordVPN looks identical to a bot on the same cloud provider.
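One cheap payoff of combining Layers 1 and 2: a request whose user-agent claims to be a known crawler but whose network origin doesn't match that operator is probably a scraper wearing a crawler's user-agent. A minimal sketch (the operator naming follows the classifiers above; treat the logic as illustrative — operators like OpenAI publish their official IP ranges for exactly this kind of check):

```typescript
// Sketch: cross-check a Layer 1 UA claim against Layer 2 network origin.
// A request claiming "GPTBot" that doesn't originate from a known OpenAI
// range is likely spoofed.
function isLikelySpoofedCrawler(
  uaOperator: string | null,       // operator from the UA classifier
  networkAiCompany: string | null  // AI company from the network classifier
): boolean {
  if (!uaOperator) return false;   // no claim to check
  return networkAiCompany !== uaOperator;
}

console.log(isLikelySpoofedCrawler('OpenAI', 'OpenAI')); // false — claim checks out
console.log(isLikelySpoofedCrawler('OpenAI', null));     // true — suspicious
console.log(isLikelySpoofedCrawler(null, null));         // false — nothing claimed
```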

Layer 3: Behavioral analysis

This is the hardest layer for bots to fake, and the most computationally expensive to implement. HUMAN Security's research found that even sophisticated AI agents exhibit telltale behavioral patterns:

// behavior-classifier.ts — Session-level behavioral signals

export interface SessionBehavior {
  // Timing signals
  avgTimeBetweenRequests: number;  // milliseconds
  requestTimingVariance: number;   // coefficient of variation
  sessionDuration: number;         // seconds

  // Navigation signals
  pagesVisited: number;
  uniquePaths: number;
  hasSearchInteraction: boolean;
  hasFormSubmission: boolean;
  hasCheckoutAttempt: boolean;
  navigationDepth: number;         // max click depth from entry

  // Interaction signals (if JS tracking available)
  hasMouseMovement: boolean;
  mouseMovementEntropy: number;    // 0 = robotic, 1 = human
  hasScrollEvents: boolean;
  scrollPatternVariance: number;
  hasKeyboardEvents: boolean;
  typingCadenceVariance: number;   // 0 = robotic, 1 = human

  // Technical signals
  executesJavaScript: boolean;
  cookiesEnabled: boolean;
  hasWebGLFingerprint: boolean;
  screenResolution: string;
}

export interface BehaviorClassification {
  likelyCrawler: number;   // 0-1 probability
  likelyScraper: number;
  likelyAgent: number;
  likelyHuman: number;
  signals: string[];
}

export function classifyBehavior(session: SessionBehavior): BehaviorClassification {
  const signals: string[] = [];
  let crawlerScore = 0;
  let scraperScore = 0;
  let agentScore = 0;
  let humanScore = 0;

  // --- Timing analysis ---

  // Crawlers have consistent, slow pacing
  if (session.requestTimingVariance < 0.1 && session.avgTimeBetweenRequests > 2000) {
    crawlerScore += 0.3;
    signals.push('consistent-slow-pacing');
  }

  // Scrapers are fast and consistent
  if (session.requestTimingVariance < 0.05 && session.avgTimeBetweenRequests < 500) {
    scraperScore += 0.4;
    signals.push('rapid-consistent-requests');
  }

  // Agents show moderate speed with some variance
  if (session.requestTimingVariance > 0.05 && session.requestTimingVariance < 0.3
      && session.avgTimeBetweenRequests > 500 && session.avgTimeBetweenRequests < 5000) {
    agentScore += 0.2;
    signals.push('moderate-paced-slight-variance');
  }

  // Humans are irregular
  if (session.requestTimingVariance > 0.4) {
    humanScore += 0.3;
    signals.push('irregular-timing');
  }

  // --- Navigation analysis ---

  // Crawlers visit many diverse pages
  if (session.pagesVisited > 20 && session.uniquePaths / session.pagesVisited > 0.8) {
    crawlerScore += 0.3;
    signals.push('broad-crawl-pattern');
  }

  // Scrapers hit the same pattern repeatedly
  if (session.pagesVisited > 10 && session.uniquePaths / session.pagesVisited < 0.3) {
    scraperScore += 0.3;
    signals.push('repetitive-path-pattern');
  }

  // Agents show goal-directed navigation
  if (session.hasFormSubmission || session.hasCheckoutAttempt) {
    agentScore += 0.4;
    signals.push('goal-directed-interaction');
  }
  if (session.hasSearchInteraction && session.navigationDepth > 2) {
    agentScore += 0.2;
    signals.push('search-then-navigate');
  }

  // --- Mouse/interaction analysis ---

  if (session.hasMouseMovement) {
    if (session.mouseMovementEntropy < 0.2) {
      // HUMAN Security found ChatGPT's agent moves in 0.25px increments
      // Robotic mouse movement is a strong agent signal
      agentScore += 0.3;
      signals.push('robotic-mouse-movement');
    } else if (session.mouseMovementEntropy > 0.6) {
      humanScore += 0.3;
      signals.push('organic-mouse-movement');
    }
  } else if (session.executesJavaScript) {
    // JS runs but no mouse? Likely headless with no simulated input
    scraperScore += 0.2;
    signals.push('js-no-mouse');
  }

  // --- Normalize scores ---
  const total = crawlerScore + scraperScore + agentScore + humanScore || 1;

  return {
    likelyCrawler: crawlerScore / total,
    likelyScraper: scraperScore / total,
    likelyAgent: agentScore / total,
    likelyHuman: humanScore / total,
    signals,
  };
}

Layer 4: Composite scoring

Now combine all three layers into a single classification with a confidence score:

// composite-scorer.ts — Final classification

import { classifyUserAgent, type TrafficClassification } from './agent-classifier';
import { classifyNetworkOrigin, type NetworkSignal } from './network-classifier';
import { classifyBehavior, type BehaviorClassification, type SessionBehavior } from './behavior-classifier';

interface CompositeClassification {
  category: 'crawler' | 'scraper' | 'agent' | 'human';
  confidence: number;
  operator: string | null;
  intent: 'training' | 'search' | 'commerce' | 'extraction' | 'browsing' | 'unknown';
  signals: string[];
  rawScores: {
    ua: Partial<TrafficClassification>;
    network: NetworkSignal;
    behavior: BehaviorClassification;
  };
}

export function classifyTraffic(
  userAgent: string,
  ip: string,
  session: SessionBehavior
): CompositeClassification {
  const ua = classifyUserAgent(userAgent);
  const network = classifyNetworkOrigin(ip);
  const behavior = classifyBehavior(session);

  const allSignals = [
    ...(ua.signals ?? []),
    ...network.signals,
    ...behavior.signals,
  ];

  // If UA identifies as known crawler with high confidence, trust it
  if (ua.category === 'crawler' && (ua.confidence ?? 0) > 0.8) {
    return {
      category: 'crawler',
      confidence: 0.95,
      operator: ua.operator ?? null,
      intent: inferCrawlerIntent(ua.subcategory ?? ''),
      signals: allSignals,
      rawScores: { ua, network, behavior },
    };
  }

  // Weighted composite score
  const weights = { ua: 0.25, network: 0.25, behavior: 0.50 };

  const scores = {
    crawler:
      (behavior.likelyCrawler * weights.behavior) +
      (ua.category === 'crawler' ? (ua.confidence ?? 0) * weights.ua : 0) +
      (network.isKnownAI ? 0.7 * weights.network : network.isCloud ? 0.4 * weights.network : 0),
    scraper:
      (behavior.likelyScraper * weights.behavior) +
      (network.isCloud && !network.isKnownAI ? 0.5 * weights.network : 0),
    agent:
      (behavior.likelyAgent * weights.behavior) +
      (ua.category === 'agent' ? (ua.confidence ?? 0) * weights.ua : 0) +
      (network.isCloud ? 0.3 * weights.network : 0),
    human:
      (behavior.likelyHuman * weights.behavior) +
      (!network.isCloud && !network.isVPN ? 0.8 * weights.network : 0),
  };

  // Pick the winner
  const entries = Object.entries(scores) as [string, number][];
  entries.sort((a, b) => b[1] - a[1]);
  const [category, confidence] = entries[0];

  return {
    category: category as CompositeClassification['category'],
    confidence: Math.min(confidence, 1.0),
    operator: ua.operator ?? network.aiCompany ?? null,
    intent: inferIntent(category, behavior),
    signals: allSignals,
    rawScores: { ua, network, behavior },
  };
}

function inferCrawlerIntent(subcategory: string): CompositeClassification['intent'] {
  if (subcategory === 'training') return 'training';
  if (subcategory === 'search-index') return 'search';
  if (subcategory === 'user-initiated') return 'commerce';
  return 'unknown';
}

function inferIntent(
  category: string,
  behavior: BehaviorClassification
): CompositeClassification['intent'] {
  if (category === 'scraper') return 'extraction';
  if (category === 'human') return 'browsing';
  if (category === 'agent') return 'commerce';
  return 'unknown';
}

Notice the weighting: behavior gets 50% of the composite score, with UA and network splitting the rest equally. This follows Snowplow's research showing behavioral signals are "the most reliable detection method" because "efficiency and stealth are at odds" for AI agents — the more they try to mimic humans, the more computational overhead they incur.
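For intuition, plugging representative numbers into the agent score from classifyTraffic above (the inputs here are made up; the weights are the ones in the code):

```typescript
// Worked example of the composite agent score, using the same weights
// as classifyTraffic (behavior 0.50, UA 0.25, network 0.25).
const weights = { ua: 0.25, network: 0.25, behavior: 0.50 };

const likelyAgent = 0.6;    // behavioral layer output
const uaConfidence = 0.85;  // matched a known agent platform
const cloudBonus = 0.3;     // cloud-origin network signal

const agentScore =
  likelyAgent * weights.behavior +
  uaConfidence * weights.ua +
  cloudBonus * weights.network;

console.log(agentScore.toFixed(4)); // "0.5875"
```

Even a middling behavioral read (0.6) dominates the final score, which is the point of the weighting: the hardest-to-fake layer carries the most weight.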

Integrating with your analytics stack

Once you have a classifier, you need to get the data somewhere useful. Here are three integration patterns:

Pattern 1: Server log enrichment (Nginx)

Add a custom log format that captures the classification:

# /etc/nginx/conf.d/agent-logging.conf

# Map user-agent to basic traffic type
map $http_user_agent $traffic_type {
    default                      "unknown";
    ~*GPTBot                     "crawler:openai";
    ~*ChatGPT-User               "crawler:openai:user";
    ~*OAI-SearchBot              "crawler:openai:search";
    ~*ClaudeBot                  "crawler:anthropic";
    ~*Claude-User                "crawler:anthropic:user";
    ~*Claude-SearchBot           "crawler:anthropic:search";
    ~*anthropic-ai               "crawler:anthropic";
    ~*Google-Extended             "crawler:google";
    ~*PerplexityBot              "crawler:perplexity";
    ~*Perplexity-User            "crawler:perplexity:user";
    ~*CCBot                      "crawler:commoncrawl";
    ~*Bytespider                 "crawler:bytedance";
    ~*Applebot-Extended          "crawler:apple";
    ~*Meta-ExternalAgent         "crawler:meta";
    ~*Genspark                   "agent:genspark";
    ~*HeadlessChrome             "agent:headless";
    ~*Playwright                 "agent:playwright";
    ~*Puppeteer                  "agent:puppeteer";
    ~*(Selenium|webdriver)       "agent:selenium";
}

log_format agent_aware
    '$remote_addr - $remote_user [$time_local] '
    '"$request" $status $body_bytes_sent '
    '"$http_referer" "$http_user_agent" '
    'traffic_type=$traffic_type '
    'request_time=$request_time';

access_log /var/log/nginx/agent-access.log agent_aware;

Then parse it downstream with your log pipeline (Fluentd, Vector, Logstash):

# Quick analysis: AI traffic breakdown by type
awk -F'traffic_type=' '{print $2}' /var/log/nginx/agent-access.log \
  | awk '{print $1}' \
  | sort | uniq -c | sort -rn | head -20

Pattern 2: Express middleware with analytics dispatch

For application-level tracking with richer signals:

// middleware/agent-analytics.ts
import { Request, Response, NextFunction } from 'express';
import { classifyUserAgent } from '../lib/agent-classifier';
import { classifyNetworkOrigin } from '../lib/network-classifier';

// In-memory session tracker (use Redis in production)
const sessions = new Map<string, {
  requests: { path: string; timestamp: number }[];
  classification: ReturnType<typeof classifyUserAgent>;
  network: ReturnType<typeof classifyNetworkOrigin>;
}>();

export function agentAnalytics(req: Request, res: Response, next: NextFunction) {
  const ua = req.headers['user-agent'] ?? '';
  const ip = req.ip ?? req.socket.remoteAddress ?? '';
  const sessionId = req.cookies?.session_id ?? `${ip}:${ua.slice(0, 50)}`;

  // Classify on first request in session
  if (!sessions.has(sessionId)) {
    sessions.set(sessionId, {
      requests: [],
      classification: classifyUserAgent(ua),
      network: classifyNetworkOrigin(ip),
    });
  }

  const session = sessions.get(sessionId)!;
  session.requests.push({ path: req.path, timestamp: Date.now() });

  // Attach classification to request for downstream use
  (req as any).trafficClassification = {
    ...session.classification,
    network: session.network,
    requestCount: session.requests.length,
  };

  // Set response header for transparency
  const category = session.classification.category ?? 'unknown';
  res.setHeader('X-Traffic-Type', category);

  // Emit to your analytics backend
  emitAnalyticsEvent({
    event: 'page_view',
    sessionId,
    path: req.path,
    trafficType: category,
    operator: session.classification.operator,
    isCloud: session.network.isCloud,
    timestamp: new Date().toISOString(),
  });

  next();
}

function emitAnalyticsEvent(event: Record<string, any>) {
  // Send to your analytics pipeline (Segment, Snowplow, custom)
  // Example: Segment-compatible
  if (process.env.SEGMENT_WRITE_KEY) {
    fetch('https://api.segment.io/v1/track', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Basic ${btoa(process.env.SEGMENT_WRITE_KEY + ':')}`,
      },
      body: JSON.stringify({
        event: event.event,
        properties: event,
        timestamp: event.timestamp,
      }),
    }).catch(() => {}); // Fire and forget
  }
}

Pattern 3: Structured logging for data warehouses

If you're running a modern data stack (BigQuery, Snowflake, ClickHouse), emit structured events that your warehouse can query directly:

// Structured event schema for your data warehouse
interface AgentTrafficEvent {
  // Standard fields
  event_id: string;
  timestamp: string;           // ISO 8601
  session_id: string;
  page_path: string;
  http_method: string;
  status_code: number;
  response_time_ms: number;

  // Classification fields
  traffic_category: 'crawler' | 'scraper' | 'agent' | 'human' | 'unknown';
  traffic_subcategory: string;
  traffic_confidence: number;  // 0.0 - 1.0
  traffic_operator: string | null;
  traffic_intent: string;

  // Signal fields
  user_agent: string;
  ip_is_cloud: boolean;
  ip_cloud_provider: string | null;
  ip_is_known_ai: boolean;
  ip_ai_company: string | null;
  detection_signals: string[]; // Array of signal codes

  // Session aggregate fields (updated per request)
  session_request_count: number;
  session_unique_paths: number;
  session_has_form_submit: boolean;
  session_duration_seconds: number;
}

With this schema, you can run queries like:

-- What percentage of your traffic is AI agents vs crawlers vs humans?
SELECT
  traffic_category,
  COUNT(*) as requests,
  COUNT(DISTINCT session_id) as sessions,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER(), 1) as pct
FROM agent_traffic_events
WHERE timestamp >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY traffic_category
ORDER BY requests DESC;

-- Which AI operators are generating the most commerce-intent traffic?
SELECT
  traffic_operator,
  COUNT(DISTINCT session_id) as sessions,
  SUM(CASE WHEN session_has_form_submit THEN 1 ELSE 0 END) as form_submissions,
  AVG(session_duration_seconds) as avg_session_duration
FROM agent_traffic_events
WHERE traffic_category = 'agent'
  AND traffic_intent = 'commerce'
  AND timestamp >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY traffic_operator
ORDER BY sessions DESC;

-- Find scraper patterns: high-frequency, narrow-path sessions
SELECT
  session_id,
  user_agent,
  COUNT(*) as requests,
  COUNT(DISTINCT page_path) as unique_pages,
  MIN(timestamp) as first_seen,
  MAX(timestamp) as last_seen
FROM agent_traffic_events
WHERE traffic_category = 'scraper'
  AND timestamp >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY session_id, user_agent
HAVING COUNT(*) > 50
ORDER BY requests DESC
LIMIT 20;

Building a real-time dashboard

Once you have the data flowing, build a dashboard that answers the questions that matter:

Panel 1: Traffic composition over time

  • Stacked area chart: % human vs crawler vs agent vs scraper by day
  • Alert when agent traffic spikes (new AI product launched?) or scraper traffic surges (competitive intelligence?)

Panel 2: Agent operator breakdown

  • Pie chart of agent traffic by operator (OpenAI, Anthropic, etc.)
  • Track which AI companies are sending you the most goal-directed traffic

Panel 3: Agent conversion funnel

  • Funnel: Agent visit → Product page → Signup → API key generation
  • Compare agent conversion rate to human conversion rate
  • This is the single most valuable metric for agent-readiness

Panel 4: Scraper threat feed

  • Table of high-velocity sessions from cloud IPs with low path diversity
  • Auto-flag for security review

Adaptive response: what to do with the classification

Detection is useless without action. Here's a decision matrix based on your classification:

| Traffic type | Intent | Response |
| --- | --- | --- |
| Crawler (training) | Training AI models | Respect robots.txt. Serve content. Consider llms.txt for optimized responses. |
| Crawler (user-initiated) | Real-time user query | Serve content fast. These drive citations and referrals. |
| Crawler (search) | Search indexing | Treat like Googlebot. This is your AI-era SEO. |
| Agent (commerce) | Purchase/signup | Red carpet treatment. Fast responses, no CAPTCHAs, clear error messages. Consider an /agent API endpoint. |
| Agent (unknown) | Unclear | Monitor. Apply standard rate limits. Don't block. |
| Scraper | Data extraction | Rate limit aggressively. Consider serving honeypot data. |
| Human | Browsing | Normal experience. |

The revenue-critical insight: agents with commerce intent should get better treatment than humans, not worse. They're faster, more decisive, and more likely to convert — if you don't block them first.

// Example: adaptive rate limiting based on classification
import rateLimit from 'express-rate-limit';
import { Request, Response, NextFunction } from 'express';

const limits: Record<string, { windowMs: number; max: number }> = {
  human:   { windowMs: 60_000, max: 100 },
  crawler: { windowMs: 60_000, max: 30 },
  agent:   { windowMs: 60_000, max: 200 },  // Agents get MORE headroom
  scraper: { windowMs: 60_000, max: 10 },   // Scrapers get squeezed
  unknown: { windowMs: 60_000, max: 50 },
};

// Build one limiter per category up front. Creating a limiter inside the
// request handler would give every request a fresh in-memory store, so
// counts would never accumulate and the limits would never trigger.
const limiters = Object.fromEntries(
  Object.entries(limits).map(([category, { windowMs, max }]) => [
    category,
    rateLimit({
      windowMs,
      max,
      keyGenerator: (req) => req.ip ?? 'unknown',
      handler: (_req, res) => {
        res.status(429).json({
          error: 'rate_limit_exceeded',
          retryAfter: Math.ceil(windowMs / 1000),
          // Help agents self-correct
          hint: category === 'agent'
            ? 'Consider using our API at /api/v1 for higher limits'
            : undefined,
        });
      },
    }),
  ])
);

export const adaptiveRateLimit = (req: Request, res: Response, next: NextFunction) => {
  const category = (req as any).trafficClassification?.category ?? 'unknown';
  return (limiters[category] ?? limiters.unknown)(req, res, next);
};

The emerging standard: authenticated agent identity

User-agent sniffing and behavioral heuristics work today, but the industry is moving toward authenticated identity for AI agents. Two standards are worth watching:

RFC 9421 HTTP Message Signatures allow agents to cryptographically prove their identity. Cloudflare has already implemented a web bot authentication standard using this protocol. Instead of guessing whether a request from "GPTBot" is actually GPTBot, the server can verify a cryptographic signature.

The Agent Protocol (emerging from the MCP ecosystem) proposes a standard handshake where agents declare their identity, capabilities, and intent before beginning a session. Think OAuth for AI agents.

Until these standards see wide adoption, the layered classification approach in this post is your best bet. But design your system with a pluggable identity layer — when authenticated agent identity arrives, you'll want to slot it in as a high-confidence signal that short-circuits the heuristic pipeline.
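What that pluggable layer might look like, as a sketch: Signature and Signature-Input are the header names RFC 9421 defines, but the provider below only checks for their presence — real verification against the operator's published keys is deliberately stubbed out, and the names here are ours:

```typescript
// Sketch: a pluggable identity layer that short-circuits the heuristic
// pipeline when a request carries a verified cryptographic identity.
interface VerifiedIdentity { operator: string; verified: boolean }

type IdentityProvider = (headers: Record<string, string>) => VerifiedIdentity | null;

const providers: IdentityProvider[] = [];

function registerIdentityProvider(p: IdentityProvider) { providers.push(p); }

function classifyWithIdentity(
  headers: Record<string, string>
): { category: string; confidence: number; operator: string | null } {
  for (const p of providers) {
    const id = p(headers);
    if (id?.verified) {
      // Authenticated identity beats every heuristic signal
      return { category: 'agent', confidence: 0.99, operator: id.operator };
    }
  }
  // No verified identity: fall through to the heuristic pipeline
  return { category: 'unknown', confidence: 0, operator: null };
}

// Example provider: trusts requests carrying RFC 9421 signature headers.
// (Presence check only — real verification would validate the signature.)
registerIdentityProvider(headers =>
  headers['signature'] && headers['signature-input']
    ? { operator: 'example-agent', verified: true }
    : null
);

console.log(classifyWithIdentity({ signature: 'sig1=:abc:', 'signature-input': 'sig1=()' }).category); // "agent"
```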

Start measuring what matters

The companies that will win the agent economy aren't the ones blocking all non-human traffic. They're the ones that can tell the difference between a scraper stealing their pricing data and an AI agent trying to buy their product.

Your action items:

  1. Add Layer 1 today. The Nginx map directive takes 10 minutes. You'll immediately see how much AI traffic you're getting.
  2. Add Layer 2 this week. MaxMind's free GeoLite2 ASN database lets you tag cloud-origin requests.
  3. Build Layer 3 this month. Session behavioral analysis requires more instrumentation, but it's the most valuable signal.
  4. Create a dedicated dashboard. You can't optimize what you can't see.

The AI agent market is projected to grow from $7.9 billion in 2025 to $52.2 billion by 2030 — a 46% CAGR. That's a lot of purchase-intent traffic heading your way. The question is whether your analytics stack will see it coming.


Want to know how agent-ready your product is? Run the free AgentGate Benchmark to see how your site scores across authentication, documentation, bot detection, and 4 other categories. Check the Leaderboard to see how you compare against 800+ other SaaS companies.
