Best Practices · 13 min read · February 27, 2026

Rate limiting for agents vs humans: why your 429s are killing conversion

Anon Team

The 429 problem

A human user hits your API maybe 10-50 times during a session. They click a button, wait for a response, think about what to do next, click another button. The gaps between requests are measured in seconds.

An AI agent hits your API 500 times in the first minute. It reads your entire API reference, enumerates available resources, runs a discovery scan, then starts making the calls it actually needs. The gaps between requests are measured in milliseconds.

Your rate limiter — designed for humans — sees 500 requests in 60 seconds and does exactly what it was built to do: return 429 Too Many Requests. The agent backs off, retries, gets another 429, backs off again. Within 3 minutes, a legitimate agent that was trying to integrate with your product has given up and moved to a competitor with more permissive limits.

You never see this in your analytics because the agent doesn't file a support ticket. It just leaves.

According to Cloudera's 2025 enterprise survey, 96% of IT leaders plan to expand their use of AI agents in the next 12 months. If your rate limiting is calibrated for human traffic, you're about to lose a lot of potential integrations.

How agent traffic differs from human traffic

Before redesigning rate limits, you need to understand what you're designing for. Agent traffic has five characteristics that break traditional rate limiting assumptions:

1. Burst-then-settle pattern

Human traffic is relatively steady — a few requests per second, sustained over minutes. Agent traffic is bimodal: an intense initial burst (discovery, authentication, schema fetching) followed by a lower, steady-state pattern (actual API usage).

Human traffic pattern:
  ▂▃▃▂▃▃▂▃▂▃▃▂▃▃▂▃▂▃▃▂  (~3-5 req/sec, steady)

Agent traffic pattern:
  █████████▃▂▂▁▁▂▁▂▁▂▁▁  (50 req/sec burst → 2 req/sec steady)

A fixed-window rate limit of 100 requests/minute handles the human's steady trickle fine but blocks the agent almost immediately: at 50 req/sec, it burns through all 100 requests in the first 2 seconds of its burst phase, even though its total request count over 10 minutes might be lower than the human's.

2. Parallel request patterns

Humans are serial: click, wait, click, wait. Agents are parallel. A well-built agent pipeline will fire 10-20 concurrent requests to fetch related resources, process them simultaneously, and then issue the next batch.

This means per-second rate limits hit agents harder than per-minute limits. An agent sending 20 parallel requests in 100ms looks like a DDoS attack to a per-second limiter, but it's well within a 1000 req/minute budget.

3. Retry amplification

This is the most insidious pattern. When an agent gets a 429:

  1. It retries (often immediately — not all agents implement backoff)
  2. The retry also gets 429'd (it's still in the rate limit window)
  3. Each retry counts against the limit, making the window last longer
  4. Multiple agents hitting the limit simultaneously create a retry storm

A single 429 can cascade into dozens of additional requests — the rate limiter creates the very problem it was designed to prevent.

4. Discovery-heavy first sessions

The first time an agent interacts with your API, it needs to discover what's available. This means fetching:

  • API schema or OpenAPI spec
  • Authentication endpoints
  • Available resources and their relationships
  • Pagination metadata for collection endpoints
  • Rate limit policies (if documented)

This initial discovery phase can generate 50-200 requests before the agent makes a single "real" API call. If your rate limit is 100 req/minute globally, the agent never gets past discovery.

5. Multi-agent synchronization

When a popular AI framework releases a new feature or a triggering event occurs (market data change, scheduled task), hundreds of agents may hit your API simultaneously. Unlike human traffic spikes that ramp up gradually, agent traffic spikes are instantaneous — every agent acts on the same signal at the same time.

What your rate limit headers should look like

Before changing your limits, make sure agents can read them. The IETF RateLimit header fields draft (draft-ietf-httpapi-ratelimit-headers) defines a standard format that agents can parse programmatically.

The minimum viable rate limit response

Every API response should include these headers:

HTTP/1.1 200 OK
Content-Type: application/json
RateLimit-Limit: 1000
RateLimit-Remaining: 847
RateLimit-Reset: 1709078400
RateLimit-Policy: 1000;w=3600, 50;w=1

Let's break these down:

Header                Meaning                                   Why agents need it
------                -------                                   ------------------
RateLimit-Limit       Max requests in the current window        Agent knows its budget
RateLimit-Remaining   Requests left in this window              Agent can pace itself
RateLimit-Reset       Unix timestamp when the window resets     Agent knows when to resume
RateLimit-Policy      Limit structure (1000/hour, 50/second)    Agent can plan request scheduling

The 429 response

When an agent does hit the limit, give it everything it needs to recover:

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
RateLimit-Limit: 1000
RateLimit-Remaining: 0
RateLimit-Reset: 1709078400
RateLimit-Policy: 1000;w=3600, 50;w=1

{
  "error": {
    "type": "rate_limit_exceeded",
    "message": "Rate limit exceeded. 1000 requests per hour allowed.",
    "retry_after": 30,
    "limit": 1000,
    "window": "1h",
    "reset_at": "2026-02-27T12:00:00Z",
    "docs_url": "https://docs.yourapp.com/rate-limits"
  }
}

The Retry-After header is critical. Without it, agents guess — and they usually guess wrong (either too aggressive or too conservative). GitHub's API is the gold standard here:

x-ratelimit-limit: 5000
x-ratelimit-remaining: 4999
x-ratelimit-reset: 1709078400
x-ratelimit-resource: core
x-ratelimit-used: 1

GitHub returns rate limit headers on every response, not just 429s. The x-ratelimit-resource field is particularly useful — it tells the agent which rate limit bucket the request counted against, so it can manage different quotas independently.

Three rate limiting architectures, compared

1. Fixed window (what most APIs use today)

Window: 1 minute
Limit: 100 requests
Counter resets at: start of each minute

[  0:00 ─────────────── 1:00 ][ 1:00 ─────────────── 2:00 ]
  Human:  ▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃    ▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃  ✅ (60/min)
  Agent:  █████████░░░░░░░░    ██████░░░░░░░░░░░  ❌ (100 in 20s)

Problem for agents: An agent that spends its 100 requests in the first 20 seconds is blocked for the remaining 40, even though its total load over the full minute stays within the limit. Worse: the boundary problem. If an agent sends 90 requests in the last 10 seconds of one window and 90 in the first 10 seconds of the next, it effectively sends 180 requests in 20 seconds (nearly double the per-minute limit) because counters reset at the boundary.

Implementation (Redis):

async function fixedWindowCheck(clientId, limit, windowSec) {
  const window = Math.floor(Date.now() / 1000 / windowSec);
  const key = `ratelimit:${clientId}:${window}`;
  
  const current = await redis.incr(key);
  if (current === 1) {
    await redis.expire(key, windowSec);
  }
  
  return {
    allowed: current <= limit,
    remaining: Math.max(0, limit - current),
    resetAt: (window + 1) * windowSec,
  };
}

Verdict: Simple to implement. Bad for agents. Use only for the most basic APIs where agent traffic is minimal.

2. Token bucket (better for burst tolerance)

Bucket capacity: 100 tokens
Refill rate: 10 tokens/second
Each request costs 1 token

[  0:00 ─────────────────────────── 10:00 ]
  Agent burst: ██████████ (100 tokens spent in 2s)
  Bucket refills: ░░░░░░░░░░░░░░░░░░ (10/sec)
  Agent resumes: ▃▃▃▃▃▃▃▃▃▃▃▃▃▃ (steady 8/sec)  ✅

The token bucket allows bursts up to the bucket capacity, then throttles to the refill rate. This matches the agent's burst-then-settle pattern perfectly.

Implementation (Redis + Lua for atomicity):

-- Token bucket rate limiter (Lua script for Redis)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])     -- 100
local refill_rate = tonumber(ARGV[2])  -- 10 tokens/sec
local now = tonumber(ARGV[3])          -- current timestamp (ms)
local requested = tonumber(ARGV[4])    -- tokens needed (usually 1)

-- Get current bucket state
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Calculate refilled tokens since last request
local elapsed = (now - last_refill) / 1000  -- convert ms to seconds
local refilled = math.floor(elapsed * refill_rate)
tokens = math.min(capacity, tokens + refilled)

-- Check if request is allowed
local allowed = tokens >= requested
if allowed then
  tokens = tokens - requested
end

-- Update bucket state (HSET with multiple fields; HMSET is deprecated)
redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)

-- Third element: ms until enough tokens refill (0 when allowed)
return {allowed and 1 or 0, tokens, math.max(0, math.ceil((requested - tokens) / refill_rate * 1000))}

Node.js wrapper:

const tokenBucketScript = fs.readFileSync('./token-bucket.lua', 'utf8');

async function tokenBucketCheck(clientId, capacity, refillRate) {
  const [allowed, remaining, retryAfterMs] = await redis.eval(
    tokenBucketScript,
    1,                          // number of keys
    `ratelimit:${clientId}`,    // KEYS[1]
    capacity,                   // ARGV[1]
    refillRate,                 // ARGV[2]
    Date.now(),                 // ARGV[3]
    1                           // ARGV[4] - tokens per request
  );

  return {
    allowed: allowed === 1,
    remaining,
    retryAfterMs: allowed ? 0 : retryAfterMs,
  };
}

Verdict: Great for agent traffic. Tolerates the initial burst, then enforces a sustainable rate. The capacity parameter directly controls how much burst you'll accept.

3. Sliding window with agent-aware tiers (recommended)

The ideal approach combines a sliding window counter (no boundary problems) with per-client tier configuration that recognizes agents as a distinct traffic class:

const TIERS = {
  // Human-oriented: browser sessions, interactive use
  free: {
    requestsPerMinute: 60,
    requestsPerHour: 1000,
    burstCapacity: 20,       // max concurrent
    retryAfterSeconds: 60,
  },
  // Standard agent tier: registered API clients
  agent_standard: {
    requestsPerMinute: 300,
    requestsPerHour: 10000,
    burstCapacity: 100,      // agents burst
    retryAfterSeconds: 10,
  },
  // Premium agent tier: paid integrations
  agent_premium: {
    requestsPerMinute: 1000,
    requestsPerHour: 50000,
    burstCapacity: 500,
    retryAfterSeconds: 5,
  },
};

Sliding window implementation:

async function slidingWindowCheck(clientId, tier, limitOverride) {
  const config = TIERS[tier];
  const limit = limitOverride ?? config.requestsPerMinute;
  const now = Date.now();
  const windowMs = 60_000; // 1 minute
  const windowStart = now - windowMs;
  
  const key = `ratelimit:sliding:${clientId}`;
  const member = `${now}:${Math.random()}`; // unique member for this request
  
  // Atomic operation: remove old entries, add new, count
  const pipeline = redis.pipeline();
  pipeline.zremrangebyscore(key, 0, windowStart); // Remove expired
  pipeline.zadd(key, now, member);                // Add current
  pipeline.zcard(key);                            // Count in window
  pipeline.expire(key, 120);                      // TTL safety
  
  const results = await pipeline.exec();
  const requestCount = results[2][1];
  
  const allowed = requestCount <= limit;
  
  if (!allowed) {
    // Remove only this request's member. Removing by score range
    // (zremrangebyscore) could delete other requests that landed in
    // the same millisecond.
    await redis.zrem(key, member);
  }
  
  return {
    allowed,
    limit,
    remaining: Math.max(0, limit - requestCount),
    resetAt: Math.ceil((now + windowMs) / 1000), // worst case: a full window from now
    retryAfter: allowed ? 0 : config.retryAfterSeconds,
    tier,
  };
}

Verdict: Most accurate rate limiting. Per-agent tiers let you offer higher limits to registered agent clients without changing anything for human users. The sliding window eliminates boundary exploits.


Designing agent-aware rate limit tiers

The key insight: agents and humans should not share the same rate limit pool. Here's a practical tier design:

Tier 1: Anonymous / unidentified traffic

# No API key, no auth — could be anyone
requests_per_minute: 30
requests_per_hour: 500
burst: 10
applies_to: requests without API key or token

This is your DDoS protection tier. Low limits, aggressive throttling. Agents that haven't authenticated yet land here during their discovery phase.

Tier 2: Authenticated human users

# Logged-in users making API calls from your dashboard/app
requests_per_minute: 100
requests_per_hour: 3000
burst: 30
applies_to: requests with session cookie or user access token

Standard human limits. Most SaaS APIs already have something like this.

Tier 3: Registered agent clients

# OAuth client credentials, API keys marked as agent/service
requests_per_minute: 500
requests_per_hour: 20000
burst: 200
initial_burst_bonus: 500  # Extra burst for first 5 minutes
applies_to: requests with client_credentials token or agent-flagged API key

This is the critical tier. Registered agent clients get 5x the human limit, plus an initial burst bonus for the discovery phase. The initial_burst_bonus gives agents 500 extra requests in their first 5 minutes — enough to fetch your OpenAPI spec, enumerate resources, and start working.

Tier 4: Premium / enterprise agent clients

requests_per_minute: 2000
requests_per_hour: 100000
burst: 1000
dedicated_pool: true  # Doesn't share capacity with other tiers
applies_to: enterprise contracts, high-volume integrations

Enterprise agents get dedicated capacity that doesn't compete with other traffic. This is table stakes for any SaaS company selling to enterprises that use AI agents for automation.

Implementation: detecting agent traffic

How do you know which requests come from agents? Several signals:

function classifyClient(req) {
  // 1. OAuth client credentials = definitely an agent
  if (req.auth?.grantType === 'client_credentials') {
    return req.auth.tier || 'agent_standard';
  }
  
  // 2. API key with agent flag
  if (req.apiKey?.type === 'agent' || req.apiKey?.type === 'service') {
    return 'agent_standard';
  }
  
  // 3. User-Agent heuristics (fallback)
  const ua = req.headers['user-agent'] || '';
  const agentPatterns = [
    /^python-requests/i,
    /^axios/i,
    /^node-fetch/i,
    /^Go-http-client/i,
    /langchain/i,
    /openai-agent/i,
    /anthropic-sdk/i,
    /^curl/i,
  ];
  
  if (agentPatterns.some(p => p.test(ua))) {
    return 'agent_standard';  // Treat SDK traffic as agent
  }
  
  // 4. Behavioral signals
  if (req.session?.requestsInLastMinute > 50) {
    return 'agent_standard';  // Upgraded mid-session based on behavior
  }
  
  return 'free';  // Default to human tier
}

This isn't about blocking agents — it's about serving them better. Agents classified into the agent_standard tier get higher limits than the default free tier.

The retry-after contract

The Retry-After header is a contract between your API and the agent. When you return it, you're saying: "If you wait this long, I guarantee your next request will succeed."

Most APIs break this contract. They return Retry-After: 60 but the rate limit resets in 45 seconds, or worse — the agent retries after 60 seconds and gets another 429 because the window calculation doesn't align.

Implementing honest Retry-After

function buildRateLimitResponse(req, res, rateLimitResult) {
  const { limit, remaining, resetAt, retryAfter, tier } = rateLimitResult;
  
  // Always include rate limit headers — even on 200 responses
  res.set({
    'RateLimit-Limit': limit,
    'RateLimit-Remaining': remaining,
    'RateLimit-Reset': resetAt,
    'RateLimit-Policy': `${limit};w=60`,
    'X-RateLimit-Tier': tier,
  });
  
  if (!rateLimitResult.allowed) {
    // Calculate EXACT seconds until the agent can retry
    const exactRetryAfter = Math.max(1, resetAt - Math.floor(Date.now() / 1000));
    
    res.set('Retry-After', exactRetryAfter);
    
    return res.status(429).json({
      error: {
        type: 'rate_limit_exceeded',
        message: `Rate limit exceeded for tier "${tier}". ` +
                 `${limit} requests per minute allowed.`,
        retry_after: exactRetryAfter,
        limit,
        remaining: 0,
        reset_at: new Date(resetAt * 1000).toISOString(),
        tier,
        upgrade_url: tier === 'free' 
          ? 'https://yourapp.com/pricing#agent-tier' 
          : undefined,
      },
    });
  }
}

Note the upgrade_url in the error response. When a free-tier client gets rate limited, the response tells the agent (or the developer building the agent) exactly where to go to get higher limits. This turns a 429 from a dead end into a conversion opportunity.

Adaptive rate limiting: letting the system breathe

Static rate limits — even well-designed ones — can't handle all scenarios. What happens when your API is running hot and even the allowed traffic is causing latency? Or when it's 3 AM and your servers are idle — why not let agents burst higher?

Adaptive rate limiting adjusts limits based on real-time system health:

class AdaptiveRateLimiter {
  constructor(baseConfig) {
    this.baseConfig = baseConfig;
    this.healthMultiplier = 1.0;
    
    // Monitor system health every 10 seconds
    setInterval(() => this.updateHealth(), 10_000);
  }
  
  async updateHealth() {
    const metrics = await this.getSystemMetrics();
    
    // Scale limits based on system load
    if (metrics.p99Latency > 2000 || metrics.errorRate > 0.05) {
      // System stressed: tighten limits
      this.healthMultiplier = 0.5;
    } else if (metrics.p99Latency > 1000 || metrics.errorRate > 0.02) {
      // System warm: slight reduction
      this.healthMultiplier = 0.75;
    } else if (metrics.cpuUtilization < 0.3) {
      // System idle: allow more traffic
      this.healthMultiplier = 1.5;
    } else {
      // Normal operation
      this.healthMultiplier = 1.0;
    }
  }
  
  getEffectiveLimit(tier) {
    const base = this.baseConfig[tier].requestsPerMinute;
    return Math.floor(base * this.healthMultiplier);
  }
  
  async check(clientId, tier) {
    // Pass the tier name (slidingWindowCheck looks up TIERS[tier])
    // plus the health-adjusted limit as an optional override
    return slidingWindowCheck(clientId, tier, this.getEffectiveLimit(tier));
  }
}

The healthMultiplier scales all rate limits based on system conditions:

  • System stressed (high latency or errors): Cut limits by 50% to protect the service
  • System warm: Reduce by 25% as a precaution
  • System idle: Boost limits by 50% — let agents use available capacity
  • Normal: Apply base limits

This means an agent hitting your API at 3 AM might get 750 req/minute instead of 500, while the same agent during a traffic spike might get 250. Both are fair — and both are better than a static limit that's either too low during quiet times or too high during load.

What this looks like in practice

Here's how the complete middleware fits together:

const rateLimiter = new AdaptiveRateLimiter(TIERS);

app.use(async (req, res, next) => {
  const clientId = req.auth?.clientId || req.ip;
  const tier = classifyClient(req);
  
  const result = await rateLimiter.check(clientId, tier);
  
  // buildRateLimitResponse sets the rate limit headers on every
  // response and sends the 429 body when the request is blocked
  buildRateLimitResponse(req, res, result);
  
  if (!result.allowed) return; // 429 already sent
  
  next();
});

And the agent-side handling:

import httpx
import asyncio
import time
from typing import Optional

class AgentAPIClient:
    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url
        self.client = httpx.AsyncClient(
            headers={"Authorization": f"Bearer {api_key}"}
        )
        self.rate_limit_remaining: Optional[int] = None
        self.rate_limit_reset: Optional[float] = None
    
    async def request(self, method: str, path: str, **kwargs):
        # Pre-check: if we know we're out of budget, wait.
        # RateLimit-Reset is a Unix timestamp, so compare it against
        # wall-clock time, not the event loop's monotonic clock.
        if self.rate_limit_remaining == 0 and self.rate_limit_reset:
            wait = self.rate_limit_reset - time.time()
            if wait > 0:
                await asyncio.sleep(wait)
        
        for attempt in range(5):
            response = await self.client.request(
                method, f"{self.base_url}{path}", **kwargs
            )
            
            # Update rate limit state from headers
            self.rate_limit_remaining = int(
                response.headers.get("RateLimit-Remaining", 999)
            )
            reset = response.headers.get("RateLimit-Reset")
            if reset:
                self.rate_limit_reset = float(reset)
            
            if response.status_code != 429:
                return response
            
            # 429: use Retry-After if available, else exponential backoff
            retry_after = response.headers.get("Retry-After")
            if retry_after:
                await asyncio.sleep(float(retry_after))
            else:
                await asyncio.sleep(2 ** attempt)  # 1, 2, 4, 8, 16s
        
        raise Exception(f"Rate limited after 5 retries on {path}")

The agent reads RateLimit-Remaining on every response and proactively waits when it knows it's about to hit the limit. When it does get a 429, it respects the Retry-After header exactly. No guessing, no retry storms.

The conversion math

Let's put real numbers to this. Assume:

  • 100 AI agent integrations attempt your API per month
  • 30% are blocked by rate limits during the discovery phase (common with default configs)
  • Each successful agent integration generates $200/month in API usage
  • Average agent lifetime: 8 months

With default rate limits (calibrated for humans):

  • 70 agents succeed → $200 × 70 × 8 = $112,000 lifetime revenue

With agent-aware rate limits:

  • 95 agents succeed → $200 × 95 × 8 = $152,000 lifetime revenue

The difference: $40,000 in recovered revenue from changing a configuration file. No new features. No new code. Just acknowledging that your fastest-growing user segment needs different limits than a human clicking around a dashboard.

The takeaway

Rate limiting exists to protect your infrastructure. But protection that blocks legitimate traffic isn't protection — it's a conversion killer.

The fix is straightforward:

  1. Add rate limit headers to every response (not just 429s)
  2. Use token bucket or sliding window instead of fixed window
  3. Create separate agent tiers with higher burst capacity and limits
  4. Return honest Retry-After values so agents can schedule retries precisely
  5. Consider adaptive limits that scale with system health
  6. Include an upgrade_url in 429 responses to convert rate-limited agents into paying customers

Your rate limiter is the first thing every AI agent interacts with on your platform. Make it an onramp, not a wall.


Want to see how your API's rate limiting affects your agent-readiness score? Run your domain through the AgentGate benchmark.
