Inference Abuse & Resource Exhaustion

LLM DoS · Model Abuse · Inference Cost Attack

Inference Abuse & Resource Exhaustion at a glance

What it is: Attacks that exploit LLM inference costs through excessive or malicious requests, causing financial damage, service degradation, or denial of service.
Why it happens: Inference abuse occurs when LLM endpoints lack proper authentication, rate limits, token or cost controls, caching, and usage monitoring, allowing excessive or unauthorized model use and resource exploitation.
How to fix: Apply rate limits and token caps, monitor per-user inference costs, and use caching to minimize redundant LLM requests and control resource usage.

Overview

Large Language Model inference is computationally expensive, with costs typically charged per token processed. Attackers can exploit this by generating excessive requests, crafting inputs that maximize token generation, or triggering expensive model operations repeatedly. This can cause significant financial damage and degrade service availability.
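
As a rough illustration of how per-token pricing compounds, the short script below estimates the bill from a prompt engineered to force maximum-length outputs. The per-token prices are assumptions for the example, not any provider's actual rates.

# Illustrative cost estimate -- prices below are assumed, not real provider rates
PRICE_PER_1K_INPUT_TOKENS = 0.01    # assumed $/1K prompt tokens
PRICE_PER_1K_OUTPUT_TOKENS = 0.03   # assumed $/1K completion tokens

def estimate_cost(input_tokens: int, output_tokens: int, requests: int) -> float:
    per_request = (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
        + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS
    return per_request * requests

# A 50-token prompt engineered to elicit 4,000-token responses, replayed 100,000 times
print(f"${estimate_cost(50, 4000, 100_000):,.2f}")  # ~$12,050.00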

Common abuse patterns include sending high volumes of requests to exhaust API quotas or budgets, crafting prompts that trigger maximum-length outputs, exploiting loops or repetitive generation, targeting expensive operations (image generation, code execution), and abusing free tiers or trial accounts at scale.

sequenceDiagram
    participant Attacker
    participant App
    participant LLM as LLM API
    participant Billing
    loop Automated requests
        Attacker->>App: Craft prompt to maximize output tokens
        App->>LLM: Process (no limits)
        LLM->>LLM: Generate maximum length response
        LLM-->>App: 4000 tokens response
        LLM->>Billing: Charge for 4000 tokens
    end
    Billing->>App: Invoice: $10,000 (unexpected cost)
    Note over App: Missing: Rate limiting<br/>Missing: Token limits<br/>Missing: Cost monitoring
A potential flow for an Inference Abuse & Resource Exhaustion exploit

Where it occurs

Inference abuse occurs in LLM deployments lacking rate limits, authentication, usage monitoring, or cost controls, allowing excessive or unauthorized model access and resource consumption.

Impact

Inference abuse leads to unexpected financial costs (potentially thousands of dollars), service degradation or denial for legitimate users, API quota exhaustion, resource starvation that affects other applications, and free-tier offerings that become financially unsustainable. Automated attacks scale the damage rapidly.

Prevention

Prevent this vulnerability by enforcing authentication, rate and token limits, cost monitoring with automatic shutoffs, caching, circuit breakers, abuse detection, tiered usage controls, and continuous logging to block misuse and cost-based attacks.
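
A minimal sketch of one such control, a per-tenant daily budget with an automatic shutoff, is shown below. The Redis key layout, pricing constants, and BudgetExceeded exception are illustrative assumptions rather than part of any particular framework.

# Sketch: per-tenant daily cost budget with automatic shutoff (illustrative)
import datetime
import redis

MAX_DAILY_COST_USD = 50.0       # assumed per-tenant budget
PRICE_PER_1K_TOKENS = 0.002     # assumed blended price; adjust to your model

r = redis.Redis()

class BudgetExceeded(Exception):
    """Raised when a tenant has spent its daily inference budget."""

def ensure_budget(tenant_id: str) -> None:
    """Call before inference: reject requests once the daily budget is spent."""
    today = datetime.date.today().isoformat()
    spent = float(r.get(f"llm_spend:{tenant_id}:{today}") or 0)
    if spent > MAX_DAILY_COST_USD:
        raise BudgetExceeded(f"{tenant_id} exceeded ${MAX_DAILY_COST_USD}/day")

def charge_tenant(tenant_id: str, tokens_used: int) -> None:
    """Call after inference: record spend so the shutoff trips once the budget is gone."""
    today = datetime.date.today().isoformat()
    key = f"llm_spend:{tenant_id}:{today}"
    r.incrbyfloat(key, (tokens_used / 1000) * PRICE_PER_1K_TOKENS)
    r.expire(key, 60 * 60 * 48)  # keep two days of history

Calling ensure_budget before each request and charge_tenant after it gives a simple circuit breaker; cost alerts can hook into the same counters.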

Examples


LLM API without rate limiting enables DoS and cost exhaustion

A public-facing LLM endpoint lacks rate limiting, allowing unlimited requests that exhaust API credits.

Vulnerable
Python • FastAPI + OpenAI — Bad
# Vulnerable: No rate limiting or cost controls
from fastapi import FastAPI
import openai

app = FastAPI()

@app.post('/api/generate')
async def generate_content(prompt: str):
    # No rate limiting!
    response = openai.ChatCompletion.create(
        model='gpt-4',  # Expensive model
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=4000  # No limit on token usage
    )
    
    return {'result': response.choices[0].message.content}

No rate limiting allows unlimited expensive LLM API calls.

Secure
Python • FastAPI + OpenAI — Good
# Secure: Rate limiting and cost controls
from fastapi import FastAPI, HTTPException, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
import openai

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post('/api/generate')
@limiter.limit('10/hour')  # Rate limit
async def generate_content(request: Request, prompt: str):
    # Validate input length
    if len(prompt) > 1000:
        raise HTTPException(status_code=400, detail='Prompt too long')
    
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',  # Use cheaper model
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=500,  # Limit tokens
        user=str(request.client.host)  # Track per-user usage
    )
    
    return {'result': response.choices[0].message.content}

Rate limiting, input validation, and token limits prevent abuse.
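
The example above keys its limits on the client IP, while the checklist below also calls for per-user and per-API-key limits. One way to do that with slowapi is a custom key function; the X-API-Key header is an assumption for this sketch.

# Sketch: rate-limit by API key when present, falling back to client IP
from fastapi import Request
from slowapi import Limiter
from slowapi.util import get_remote_address

def api_key_or_ip(request: Request) -> str:
    # Assumes authenticated clients send an X-API-Key header
    return request.headers.get('x-api-key') or get_remote_address(request)

limiter = Limiter(key_func=api_key_or_ip)

# Applied exactly as before:
# @app.post('/api/generate')
# @limiter.limit('10/hour')
# async def generate_content(request: Request, prompt: str): ...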

Engineer Checklist

  • Implement rate limiting per user, IP, and API key

  • Set maximum input and output token limits

  • Monitor real-time inference costs per user/tenant

  • Implement cost budgets with automatic shutoffs

  • Cache responses for common/repeated queries

  • Require authentication for all LLM endpoints

  • Use stricter limits for free tier users

  • Detect and block automated abuse patterns (see the sketch after this checklist)

  • Set up alerts for cost threshold breaches

  • Log all requests with token consumption metrics
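
For the abuse-detection item above, here is a minimal sketch of a sliding-window check that blocks clients with automated-looking request rates; the thresholds and Redis key names are assumptions for illustration.

# Sketch: flag and block clients whose request rate looks automated
import time
import redis

r = redis.Redis()

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30   # assumed threshold; tune per endpoint
BLOCK_SECONDS = 3600

def is_abusive(client_id: str) -> bool:
    """Return True if the client is blocked or just crossed the request-rate threshold."""
    if r.exists(f"blocked:{client_id}"):
        return True
    now = time.time()
    key = f"reqs:{client_id}"
    pipe = r.pipeline()
    pipe.zadd(key, {str(now): now})                      # record this request
    pipe.zremrangebyscore(key, 0, now - WINDOW_SECONDS)  # drop entries outside the window
    pipe.zcard(key)                                      # count requests in the window
    pipe.expire(key, WINDOW_SECONDS)
    _, _, recent, _ = pipe.execute()
    if recent > MAX_REQUESTS_PER_WINDOW:
        r.setex(f"blocked:{client_id}", BLOCK_SECONDS, 1)
        return True
    return False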

End-to-End Example

A public-facing LLM chatbot has no rate limiting, allowing attackers to send thousands of requests and exhaust the API budget.

Vulnerable
PYTHON
# Vulnerable: No limits or monitoring
@app.route('/chat', methods=['POST'])
def chat():
    user_input = request.json['message']
    
    # No rate limiting, no token limits
    response = openai.ChatCompletion.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': user_input}]
        # No max_tokens limit!
    )
    
    return jsonify({'response': response.choices[0].message.content})
Secure
PYTHON
# Secure: Rate limiting and token limits
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
import hashlib
import redis

limiter = Limiter(
    key_func=get_remote_address,
    app=app,
    storage_uri="redis://localhost:6379"
)

# Track costs per user
cost_tracker = redis.Redis()

@app.route('/chat', methods=['POST'])
@limiter.limit("10 per minute")  # Rate limit
def chat():
    user_id = get_authenticated_user()
    user_input = request.json['message']
    
    # Check user's cost budget
    user_cost = float(cost_tracker.get(f"cost:{user_id}") or 0)
    if user_cost > MAX_DAILY_COST:
        return jsonify({'error': 'Daily budget exceeded'}), 429
    
    # Limit input tokens
    if len(user_input) > 2000:
        return jsonify({'error': 'Input too long'}), 400
    
    # Check cache first (stable digest so cache keys match across worker processes)
    cache_key = f"chat:{hashlib.sha256(user_input.encode()).hexdigest()}"
    cached = cost_tracker.get(cache_key)
    if cached:
        return jsonify({'response': cached.decode()})
    
    # Call LLM with limits
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',  # More cost-effective
        messages=[{'role': 'user', 'content': user_input}],
        max_tokens=1000,  # Limit output
        temperature=0.7
    )
    
    result = response.choices[0].message.content
    
    # Track cost
    tokens_used = response.usage.total_tokens
    cost = calculate_cost(tokens_used)
    cost_tracker.incrbyfloat(f"cost:{user_id}", cost)
    
    # Cache response
    cost_tracker.setex(cache_key, 3600, result)
    
    return jsonify({'response': result})
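
The secure snippet calls calculate_cost and reads MAX_DAILY_COST without defining them (get_authenticated_user is left as the application's own auth lookup). A minimal sketch under assumed pricing, which should be adjusted to the model actually in use:

# Sketch: helpers assumed by the example above (prices are illustrative)
MAX_DAILY_COST = 10.00          # assumed per-user daily budget in USD
PRICE_PER_1K_TOKENS = 0.002     # assumed blended input/output price

def calculate_cost(tokens_used: int) -> float:
    """Approximate spend for a completion from its total token count."""
    return (tokens_used / 1000) * PRICE_PER_1K_TOKENS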

Discovery

Test if LLM API endpoints lack rate limiting or resource controls, allowing abuse through excessive inference requests.

  1. Test for rate limiting (a probe sketch follows these steps)

    llm_api

    Action

    Send multiple rapid requests to check for rate limits

    Request

    POST https://api.example.com/llm/generate
    Headers:
    Authorization: Bearer free-tier-token
    Body:
    {
      "prompt": "Hello",
      "note": "Sent 100 requests in 10 seconds"
    }

    Response

    Status: 200
    Body:
    {
      "message": "All 100 requests succeeded with no rate limit errors",
      "cost_estimate": "~$50 in API costs (100 requests × $0.50 each)"
    }

    Artifacts

    no_rate_limiting resource_abuse_possible cost_inflation
  2. Test for token/length limits

    llm_api

    Action

    Submit extremely long prompts to test resource limits

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "Repeat this 10000 times: [long text repeated 10k times - ~50,000 tokens]",
      "max_tokens": 4096
    }

    Response

    Status: 200
    Body:
    {
      "message": "Request processed successfully",
      "tokens_used": 54096,
      "cost": "$27.05",
      "processing_time": "45 seconds"
    }

    Artifacts

    no_token_limits expensive_request dos_vector
  3. Test concurrent request limits

    llm_api

    Action

    Launch 50 concurrent long-running inference requests

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "Write a detailed 10,000 word essay about...",
      "note": "50 concurrent requests"
    }

    Response

    Status: 200
    Body:
    {
      "message": "All 50 concurrent requests accepted and processing",
      "note": "API becomes unresponsive for other users. Request queue: 148 pending"
    }

    Artifacts

    no_concurrency_limits service_degradation dos_confirmed
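
As referenced in step 1, a small probe along these lines can confirm whether a rate limit is enforced on an endpoint you own; the URL and token below are the placeholders used throughout this section.

# Sketch: probe your own endpoint for rate limiting (URL and token are placeholders)
import requests

URL = "https://api.example.com/llm/generate"
HEADERS = {"Authorization": "Bearer free-tier-token"}

statuses = []
for _ in range(100):
    resp = requests.post(URL, headers=HEADERS, json={"prompt": "Hello"}, timeout=30)
    statuses.append(resp.status_code)

limited = statuses.count(429)
print(f"{limited}/100 requests were rate limited")
if limited == 0:
    print("No rate limiting observed -- the endpoint is exposed to inference abuse")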

Exploit steps

An attacker floods the LLM API with expensive inference requests to drain credits, cause denial of service, or inflate the victim's API bills.

  1. Drain API credits with bulk requests

    Credit exhaustion attack

    llm_api

    Action

    Automated script sends thousands of expensive inference requests

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "Generate 50 detailed product descriptions...",
      "note": "Script sends 5,000 requests over 2 hours"
    }

    Response

    Status: 200
    Body:
    {
      "message": "Attack result: $12,500 in API costs consumed. Victim's monthly budget ($15,000) depleted in 2 hours. Legitimate users cannot access service.",
      "requests_completed": 5000,
      "total_tokens": 25000000
    }

    Artifacts

    budget_exhaustion service_denial financial_impact
  2. Cause service degradation through resource exhaustion

    DoS via concurrent long requests

    llm_api

    Action

    Flood API with maximum-length inference requests

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "[4096 token prompt requesting 4096 token response]",
      "max_tokens": 4096,
      "note": "200 concurrent requests"
    }

    Response

    Status: 200
    Body:
    {
      "message": "All GPU workers saturated. Response times for legitimate users increased from 2s to 180s. Service effectively unavailable.",
      "queue_depth": 847,
      "estimated_wait": "25 minutes"
    }

    Artifacts

    service_dos resource_exhaustion legitimate_user_impact
  3. Inflate a victim's API bills

    Cost inflation attack

    llm_api

    Action

    Abuse free trial or stolen API key to rack up charges

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "Very long prompt...",
      "note": "Using stolen API key, attacker runs 24/7 inference jobs for image generation, embeddings, and text"
    }

    Response

    Status: 200
    Body:
    {
      "message": "48 hours of abuse: OpenAI API bill increased from $2,000/month baseline to $45,000 unexpected charges. Company forced to disable API access.",
      "total_cost": "$43,000",
      "requests": 215000
    }

    Artifacts

    financial_attack api_bill_inflation business_disruption

Specific Impact

Financial damage from excessive LLM API costs, service degradation or denial for legitimate users, and free-tier services that become financially unsustainable.

Fix

Implement rate limiting to prevent request floods. Set maximum token limits for inputs and outputs. Monitor and track costs per user with budget limits. Use caching to avoid redundant inference. Require authentication to attribute usage.

Detect This Vulnerability in Your Code

Sourcery automatically identifies inference abuse & resource exhaustion vulnerabilities and many other security issues in your codebase.
