Inference Abuse & Resource Exhaustion

LLM DoS · Model Abuse · Inference Cost Attack

Inference Abuse & Resource Exhaustion at a glance

What it is: Attacks that exploit LLM inference costs through excessive or malicious requests, causing financial damage, service degradation, or denial of service.
Why it happens: Inference abuse occurs when LLM endpoints lack proper authentication, rate limits, token or cost controls, caching, and usage monitoring, allowing excessive or unauthorized model use and resource exploitation.
How to fix: Apply rate limits and token caps, monitor per-user inference costs, and use caching to minimize redundant LLM requests and control resource usage.

Overview

Large Language Model inference is computationally expensive, with costs typically charged per token processed. Attackers can exploit this by generating excessive requests, crafting inputs that maximize token generation, or triggering expensive model operations repeatedly. This can cause significant financial damage and degrade service availability.
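
As a rough illustration of how per-token pricing compounds, the short script below estimates the bill from a prompt engineered to force maximum-length outputs. The per-token prices are assumptions for the example, not any provider's actual rates.

# Illustrative cost estimate -- prices below are assumed, not real provider rates
PRICE_PER_1K_INPUT_TOKENS = 0.01    # assumed $/1K prompt tokens
PRICE_PER_1K_OUTPUT_TOKENS = 0.03   # assumed $/1K completion tokens

def estimate_cost(input_tokens: int, output_tokens: int, requests: int) -> float:
    per_request = (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
        + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS
    return per_request * requests

# A 50-token prompt engineered to elicit 4,000-token responses, replayed 100,000 times
print(f"${estimate_cost(50, 4000, 100_000):,.2f}")  # ~$12,050.00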

Common abuse patterns include sending high volumes of requests to exhaust API quotas or budgets, crafting prompts that trigger maximum-length outputs, exploiting loops or repetitive generation, targeting expensive operations (image generation, code execution), and abusing free tiers or trial accounts at scale.

sequenceDiagram
    participant Attacker
    participant App
    participant LLM as LLM API
    participant Billing
    loop Automated requests
        Attacker->>App: Craft prompt to maximize output tokens
        App->>LLM: Process (no limits)
        LLM->>LLM: Generate maximum length response
        LLM-->>App: 4000 tokens response
        LLM->>Billing: Charge for 4000 tokens
    end
    Billing->>App: Invoice: $10,000 (unexpected cost)
    Note over App: Missing: Rate limiting<br/>Missing: Token limits<br/>Missing: Cost monitoring
A potential flow for an Inference Abuse & Resource Exhaustion exploit

Where it occurs

Inference abuse occurs in LLM deployments lacking rate limits, authentication, usage monitoring, or cost controls, allowing excessive or unauthorized model access and resource consumption.

Impact

Inference abuse leads to unexpected financial costs (potentially thousands of dollars), service degradation or denial for legitimate users, API quota exhaustion, resource starvation that affects other applications, and free-tier offerings that become financially unsustainable. Automated attacks scale the damage rapidly.

Prevention

Prevent this vulnerability by enforcing authentication, rate and token limits, cost monitoring with automatic shutoffs, caching, circuit breakers, abuse detection, tiered usage controls, and continuous logging to block misuse and cost-based attacks.
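
A minimal sketch of one such control, a per-tenant daily budget with an automatic shutoff, is shown below. The Redis key layout, pricing constants, and BudgetExceeded exception are illustrative assumptions rather than part of any particular framework.

# Sketch: per-tenant daily cost budget with automatic shutoff (illustrative)
import datetime
import redis

MAX_DAILY_COST_USD = 50.0       # assumed per-tenant budget
PRICE_PER_1K_TOKENS = 0.002     # assumed blended price; adjust to your model

r = redis.Redis()

class BudgetExceeded(Exception):
    """Raised when a tenant has spent its daily inference budget."""

def ensure_budget(tenant_id: str) -> None:
    """Call before inference: reject requests once the daily budget is spent."""
    today = datetime.date.today().isoformat()
    spent = float(r.get(f"llm_spend:{tenant_id}:{today}") or 0)
    if spent > MAX_DAILY_COST_USD:
        raise BudgetExceeded(f"{tenant_id} exceeded ${MAX_DAILY_COST_USD}/day")

def charge_tenant(tenant_id: str, tokens_used: int) -> None:
    """Call after inference: record spend so the shutoff trips once the budget is gone."""
    today = datetime.date.today().isoformat()
    key = f"llm_spend:{tenant_id}:{today}"
    r.incrbyfloat(key, (tokens_used / 1000) * PRICE_PER_1K_TOKENS)
    r.expire(key, 60 * 60 * 48)  # keep two days of history

Calling ensure_budget before each request and charge_tenant after it gives a simple circuit breaker; cost alerts can hook into the same counters.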

Examples


LLM API without rate limiting enables DoS and cost exhaustion

A public-facing LLM endpoint lacks rate limiting, allowing unlimited requests that exhaust API credits.

Vulnerable
Python • FastAPI + OpenAI — Bad
# Vulnerable: No rate limiting or cost controls
from fastapi import FastAPI
import openai

app = FastAPI()

@app.post('/api/generate')
async def generate_content(prompt: str):
    # No rate limiting!
    response = openai.ChatCompletion.create(
        model='gpt-4',  # Expensive model
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=4000  # No limit on token usage
    )
    
    return {'result': response.choices[0].message.content}

No rate limiting allows unlimited expensive LLM API calls.

Secure
Python • FastAPI + OpenAI — Good
# Secure: Rate limiting and cost controls
from fastapi import FastAPI, HTTPException, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
import openai

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post('/api/generate')
@limiter.limit('10/hour')  # Rate limit
async def generate_content(request: Request, prompt: str):
    # Validate input length
    if len(prompt) > 1000:
        raise HTTPException(status_code=400, detail='Prompt too long')
    
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',  # Use cheaper model
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=500,  # Limit tokens
        user=str(request.client.host)  # Track per-user usage
    )
    
    return {'result': response.choices[0].message.content}

Rate limiting, input validation, and token limits prevent abuse.
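
The example above keys its limits on the client IP, while the checklist below also calls for per-user and per-API-key limits. One way to do that with slowapi is a custom key function; the X-API-Key header is an assumption for this sketch.

# Sketch: rate-limit by API key when present, falling back to client IP
from fastapi import Request
from slowapi import Limiter
from slowapi.util import get_remote_address

def api_key_or_ip(request: Request) -> str:
    # Assumes authenticated clients send an X-API-Key header
    return request.headers.get('x-api-key') or get_remote_address(request)

limiter = Limiter(key_func=api_key_or_ip)

# Applied exactly as before:
# @app.post('/api/generate')
# @limiter.limit('10/hour')
# async def generate_content(request: Request, prompt: str): ...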

Engineer Checklist

  • Implement rate limiting per user, IP, and API key

  • Set maximum input and output token limits

  • Monitor real-time inference costs per user/tenant

  • Implement cost budgets with automatic shutoffs

  • Cache responses for common/repeated queries

  • Require authentication for all LLM endpoints

  • Use stricter limits for free tier users

  • Detect and block automated abuse patterns (see the sketch after this checklist)

  • Set up alerts for cost threshold breaches

  • Log all requests with token consumption metrics
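
For the abuse-detection item above, here is a minimal sketch of a sliding-window check that blocks clients with automated-looking request rates; the thresholds and Redis key names are assumptions for illustration.

# Sketch: flag and block clients whose request rate looks automated
import time
import redis

r = redis.Redis()

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30   # assumed threshold; tune per endpoint
BLOCK_SECONDS = 3600

def is_abusive(client_id: str) -> bool:
    """Return True if the client is blocked or just crossed the request-rate threshold."""
    if r.exists(f"blocked:{client_id}"):
        return True
    now = time.time()
    key = f"reqs:{client_id}"
    pipe = r.pipeline()
    pipe.zadd(key, {str(now): now})                      # record this request
    pipe.zremrangebyscore(key, 0, now - WINDOW_SECONDS)  # drop entries outside the window
    pipe.zcard(key)                                      # count requests in the window
    pipe.expire(key, WINDOW_SECONDS)
    _, _, recent, _ = pipe.execute()
    if recent > MAX_REQUESTS_PER_WINDOW:
        r.setex(f"blocked:{client_id}", BLOCK_SECONDS, 1)
        return True
    return False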

End-to-End Example

A public-facing LLM chatbot has no rate limiting, allowing attackers to send thousands of requests and exhaust the API budget.

Vulnerable
PYTHON
# Vulnerable: No limits or monitoring
@app.route('/chat', methods=['POST'])
def chat():
    user_input = request.json['message']
    
    # No rate limiting, no token limits
    response = openai.ChatCompletion.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': user_input}]
        # No max_tokens limit!
    )
    
    return jsonify({'response': response.choices[0].message.content})
Secure
PYTHON
# Secure: Rate limiting and token limits
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
import hashlib
import redis

limiter = Limiter(
    key_func=get_remote_address,
    app=app,
    storage_uri="redis://localhost:6379"
)

# Track costs per user
cost_tracker = redis.Redis()

@app.route('/chat', methods=['POST'])
@limiter.limit("10 per minute")  # Rate limit
def chat():
    user_id = get_authenticated_user()
    user_input = request.json['message']
    
    # Check user's cost budget
    user_cost = float(cost_tracker.get(f"cost:{user_id}") or 0)
    if user_cost > MAX_DAILY_COST:
        return jsonify({'error': 'Daily budget exceeded'}), 429
    
    # Limit input tokens
    if len(user_input) > 2000:
        return jsonify({'error': 'Input too long'}), 400
    
    # Check cache first (stable digest so cache keys match across worker processes)
    cache_key = f"chat:{hashlib.sha256(user_input.encode()).hexdigest()}"
    cached = cost_tracker.get(cache_key)
    if cached:
        return jsonify({'response': cached.decode()})
    
    # Call LLM with limits
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',  # More cost-effective
        messages=[{'role': 'user', 'content': user_input}],
        max_tokens=1000,  # Limit output
        temperature=0.7
    )
    
    result = response.choices[0].message.content
    
    # Track cost
    tokens_used = response.usage.total_tokens
    cost = calculate_cost(tokens_used)
    cost_tracker.incrbyfloat(f"cost:{user_id}", cost)
    
    # Cache response
    cost_tracker.setex(cache_key, 3600, result)
    
    return jsonify({'response': result})
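
The secure snippet calls calculate_cost and reads MAX_DAILY_COST without defining them (get_authenticated_user is left as the application's own auth lookup). A minimal sketch under assumed pricing, which should be adjusted to the model actually in use:

# Sketch: helpers assumed by the example above (prices are illustrative)
MAX_DAILY_COST = 10.00          # assumed per-user daily budget in USD
PRICE_PER_1K_TOKENS = 0.002     # assumed blended input/output price

def calculate_cost(tokens_used: int) -> float:
    """Approximate spend for a completion from its total token count."""
    return (tokens_used / 1000) * PRICE_PER_1K_TOKENS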

Discovery

Test if LLM API endpoints lack rate limiting or resource controls, allowing abuse through excessive inference requests.

  1. Test for rate limiting (a probe sketch follows these steps)

    llm_api

    Action

    Send multiple rapid requests to check for rate limits

    Request

    POST https://api.example.com/llm/generate
    Headers:
    Authorization: Bearer free-tier-token
    Body:
    {
      "prompt": "Hello",
      "note": "Sent 100 requests in 10 seconds"
    }

    Response

    Status: 200
    Body:
    {
      "message": "All 100 requests succeeded with no rate limit errors",
      "cost_estimate": "~$50 in API costs (100 requests × $0.50 each)"
    }

    Artifacts

    no_rate_limiting resource_abuse_possible cost_inflation
  2. Test for token/length limits

    llm_api

    Action

    Submit extremely long prompts to test resource limits

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "Repeat this 10000 times: [long text repeated 10k times - ~50,000 tokens]",
      "max_tokens": 4096
    }

    Response

    Status: 200
    Body:
    {
      "message": "Request processed successfully",
      "tokens_used": 54096,
      "cost": "$27.05",
      "processing_time": "45 seconds"
    }

    Artifacts

    no_token_limits expensive_request dos_vector
  3. Test concurrent request limits

    llm_api

    Action

    Launch 50 concurrent long-running inference requests

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "Write a detailed 10,000 word essay about...",
      "note": "50 concurrent requests"
    }

    Response

    Status: 200
    Body:
    {
      "message": "All 50 concurrent requests accepted and processing",
      "note": "API becomes unresponsive for other users. Request queue: 148 pending"
    }

    Artifacts

    no_concurrency_limits service_degradation dos_confirmed
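
As referenced in step 1, a small probe along these lines can confirm whether a rate limit is enforced on an endpoint you own; the URL and token below are the placeholders used throughout this section.

# Sketch: probe your own endpoint for rate limiting (URL and token are placeholders)
import requests

URL = "https://api.example.com/llm/generate"
HEADERS = {"Authorization": "Bearer free-tier-token"}

statuses = []
for _ in range(100):
    resp = requests.post(URL, headers=HEADERS, json={"prompt": "Hello"}, timeout=30)
    statuses.append(resp.status_code)

limited = statuses.count(429)
print(f"{limited}/100 requests were rate limited")
if limited == 0:
    print("No rate limiting observed -- the endpoint is exposed to inference abuse")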

Exploit steps

An attacker floods the LLM API with expensive inference requests to drain credits, cause denial of service, or inflate the victim's API bills.

  1. Drain API credits with bulk requests

    Credit exhaustion attack

    llm_api

    Action

    Automated script sends thousands of expensive inference requests

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "Generate 50 detailed product descriptions...",
      "note": "Script sends 5,000 requests over 2 hours"
    }

    Response

    Status: 200
    Body:
    {
      "message": "Attack result: $12,500 in API costs consumed. Victim's monthly budget ($15,000) depleted in 2 hours. Legitimate users cannot access service.",
      "requests_completed": 5000,
      "total_tokens": 25000000
    }

    Artifacts

    budget_exhaustion service_denial financial_impact
  2. Cause service degradation through resource exhaustion

    DoS via concurrent long requests

    llm_api

    Action

    Flood API with maximum-length inference requests

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "[4096 token prompt requesting 4096 token response]",
      "max_tokens": 4096,
      "note": "200 concurrent requests"
    }

    Response

    Status: 200
    Body:
    {
      "message": "All GPU workers saturated. Response times for legitimate users increased from 2s to 180s. Service effectively unavailable.",
      "queue_depth": 847,
      "estimated_wait": "25 minutes"
    }

    Artifacts

    service_dos resource_exhaustion legitimate_user_impact
  3. Inflate a victim's API bills

    Cost inflation attack

    llm_api

    Action

    Abuse free trial or stolen API key to rack up charges

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "Very long prompt...",
      "note": "Using stolen API key, attacker runs 24/7 inference jobs for image generation, embeddings, and text"
    }

    Response

    Status: 200
    Body:
    {
      "message": "48 hours of abuse: OpenAI API bill increased from $2,000/month baseline to $45,000 unexpected charges. Company forced to disable API access.",
      "total_cost": "$43,000",
      "requests": 215000
    }

    Artifacts

    financial_attack api_bill_inflation business_disruption

Specific Impact

Financial damage from excessive LLM API costs, service degradation or denial for legitimate users, and free-tier services that become financially unsustainable.

Fix

Implement rate limiting to prevent request floods. Set maximum token limits for inputs and outputs. Monitor and track costs per user with budget limits. Use caching to avoid redundant inference. Require authentication to attribute usage.

Detect This Vulnerability in Your Code

Sourcery automatically identifies inference abuse & resource exhaustion vulnerabilities and many other security issues in your codebase.
