Inference Abuse & Resource Exhaustion
Overview
Large Language Model inference is computationally expensive, with costs typically charged per token processed. Attackers can exploit this by generating excessive requests, crafting inputs that maximize token generation, or triggering expensive model operations repeatedly. This can cause significant financial damage and degrade service availability.
Common abuse patterns include sending high volumes of requests to exhaust API quotas or budgets, crafting prompts that trigger maximum-length outputs, exploiting loops or repetitive generation, targeting expensive operations (image generation, code execution), and abusing free tiers or trial accounts at scale.
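To make the economics concrete, a rough back-of-the-envelope sketch shows how quickly a single unthrottled client adds up; the request rate, token counts, and per-token price below are illustrative assumptions, not any provider's actual pricing.
# Rough cost sketch: hourly spend from a single unthrottled client.
# All figures are illustrative assumptions.
REQUESTS_PER_MINUTE = 100
TOKENS_PER_REQUEST = 8_000        # long prompt plus long completion
PRICE_PER_1K_TOKENS_USD = 0.03    # example price for a premium model

hourly_cost = REQUESTS_PER_MINUTE * 60 * TOKENS_PER_REQUEST / 1_000 * PRICE_PER_1K_TOKENS_USD
print(f"Estimated cost from one client per hour: ${hourly_cost:,.2f}")  # ~$1,440 under these assumptions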
Where it occurs
Inference abuse occurs in LLM deployments lacking rate limits, authentication, usage monitoring, or cost controls, allowing excessive or unauthorized model access and resource consumption.
Impact
Inference abuse leads to unexpected financial costs (potentially thousands of dollars), service degradation or denial for legitimate users, API quota exhaustion, resource starvation affecting other applications, and losses that can make free-tier offerings unsustainable. Automated attacks can scale the damage rapidly.
Prevention
Prevent this vulnerability by enforcing authentication, rate and token limits, cost monitoring with automatic shutoffs, caching, circuit breakers, abuse detection, tiered usage controls, and continuous logging to block misuse and cost-based attacks.
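Most of these controls appear in the examples below; the automatic shutoff is the one least often shown in code, so here is a minimal sketch of a spend-based circuit breaker that stops serving inference once a budget is exhausted. The class name, budget figure, and cooldown are assumptions for illustration.
# Sketch: spend-based circuit breaker with automatic shutoff.
# Class name, budget, and cooldown are illustrative assumptions.
import time

class BudgetCircuitBreaker:
    def __init__(self, daily_budget_usd: float, cooldown_seconds: int = 3600):
        self.daily_budget_usd = daily_budget_usd
        self.cooldown_seconds = cooldown_seconds
        self.spent_usd = 0.0
        self.tripped_at = None

    def record_spend(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd
        if self.spent_usd >= self.daily_budget_usd:
            self.tripped_at = time.monotonic()

    def allow_request(self) -> bool:
        # Reject requests while the breaker is tripped and the cooldown has not elapsed.
        if self.tripped_at is None:
            return True
        return time.monotonic() - self.tripped_at > self.cooldown_seconds

breaker = BudgetCircuitBreaker(daily_budget_usd=200.0)
A real deployment would keep this state in shared storage such as Redis so every worker sees the same budget, and would reset the counter on a schedule.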
Examples
LLM API without rate limiting enables DoS and cost exhaustion
Public-facing LLM endpoint lacks rate limiting, allowing unlimited requests that exhaust API credits.
# Vulnerable: No rate limiting or cost controls
from fastapi import FastAPI
import openai

app = FastAPI()

@app.post('/api/generate')
async def generate_content(prompt: str):
    # No rate limiting!
    response = openai.ChatCompletion.create(
        model='gpt-4',  # Expensive model
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=4000  # No limit on token usage
    )
    return {'result': response.choices[0].message.content}
No rate limiting allows unlimited expensive LLM API calls.
# Secure: Rate limiting and cost controls
from fastapi import FastAPI, HTTPException, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
import openai

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post('/api/generate')
@limiter.limit('10/hour')  # Rate limit per client IP
async def generate_content(request: Request, prompt: str):
    # Validate input length
    if len(prompt) > 1000:
        raise HTTPException(status_code=400, detail='Prompt too long')
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',  # Use cheaper model
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=500,  # Limit tokens
        user=str(request.client.host)  # Track per-user usage
    )
    return {'result': response.choices[0].message.content}
Rate limiting, input validation, and token limits prevent abuse.
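Note that the length check above counts characters, not tokens. If you want the input budget to match how the provider bills, a tokenizer can estimate the token count before the call is made. A minimal sketch using the tiktoken library; the model name and limit are illustrative assumptions.
# Sketch: enforce an input token budget rather than a character budget.
# Assumes the tiktoken package is installed; model name and limit are examples.
import tiktoken

MAX_INPUT_TOKENS = 500  # example budget, tune per deployment

def exceeds_token_budget(prompt: str, model: str = 'gpt-3.5-turbo') -> bool:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(prompt)) > MAX_INPUT_TOKENS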
Engineer Checklist
- Implement rate limiting per user, IP, and API key
- Set maximum input and output token limits
- Monitor real-time inference costs per user/tenant
- Implement cost budgets with automatic shutoffs
- Cache responses for common/repeated queries
- Require authentication for all LLM endpoints
- Use stricter limits for free tier users
- Detect and block automated abuse patterns
- Set up alerts for cost threshold breaches
- Log all requests with token consumption metrics (see the logging sketch after this checklist)
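For the last two checklist items, a lightweight usage log that raises an alert when spend crosses a threshold is often enough to catch abuse early. A minimal sketch; the threshold, logger name, and per-token price are illustrative assumptions.
# Sketch: per-request usage logging with a simple cost-threshold alert.
# Threshold and per-token price are illustrative assumptions.
import logging

logger = logging.getLogger('llm_usage')
DAILY_ALERT_THRESHOLD_USD = 100.0
COST_PER_1K_TOKENS_USD = 0.002  # example price, check your provider's pricing
daily_spend_usd = 0.0

def record_usage(user_id: str, prompt_tokens: int, completion_tokens: int) -> None:
    global daily_spend_usd
    total_tokens = prompt_tokens + completion_tokens
    cost = total_tokens / 1000 * COST_PER_1K_TOKENS_USD
    daily_spend_usd += cost
    logger.info('user=%s prompt_tokens=%d completion_tokens=%d cost_usd=%.4f',
                user_id, prompt_tokens, completion_tokens, cost)
    if daily_spend_usd > DAILY_ALERT_THRESHOLD_USD:
        logger.warning('Daily LLM spend %.2f USD exceeded alert threshold %.2f USD',
                       daily_spend_usd, DAILY_ALERT_THRESHOLD_USD)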
End-to-End Example
A public-facing LLM chatbot has no rate limiting, allowing attackers to send thousands of requests and exhaust the API budget.
# Vulnerable: No limits or monitoring
@app.route('/chat', methods=['POST'])
def chat():
    user_input = request.json['message']
    # No rate limiting, no token limits
    response = openai.ChatCompletion.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': user_input}]
        # No max_tokens limit!
    )
    return jsonify({'response': response.choices[0].message.content})
# Secure: Rate limiting and token limits
from flask import Flask, request, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
import openai
import redis

app = Flask(__name__)
limiter = Limiter(
    app,
    key_func=get_remote_address,
    storage_uri="redis://localhost:6379"
)

# Track costs per user
cost_tracker = redis.Redis()

@app.route('/chat', methods=['POST'])
@limiter.limit("10 per minute")  # Rate limit
def chat():
    # get_authenticated_user, MAX_DAILY_COST, and calculate_cost are
    # application-specific helpers (the cost helper is sketched below).
    user_id = get_authenticated_user()
    user_input = request.json['message']
    # Check user's cost budget
    user_cost = float(cost_tracker.get(f"cost:{user_id}") or 0)
    if user_cost > MAX_DAILY_COST:
        return jsonify({'error': 'Daily budget exceeded'}), 429
    # Limit input length
    if len(user_input) > 2000:
        return jsonify({'error': 'Input too long'}), 400
    # Check cache first
    cache_key = f"chat:{hash(user_input)}"
    cached = cost_tracker.get(cache_key)
    if cached:
        return jsonify({'response': cached.decode()})
    # Call LLM with limits
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',  # More cost-effective
        messages=[{'role': 'user', 'content': user_input}],
        max_tokens=1000,  # Limit output
        temperature=0.7
    )
    result = response.choices[0].message.content
    # Track cost
    tokens_used = response.usage.total_tokens
    cost = calculate_cost(tokens_used)
    cost_tracker.incrbyfloat(f"cost:{user_id}", cost)
    # Cache response
    cost_tracker.setex(cache_key, 3600, result)
    return jsonify({'response': result})
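The secure handler above relies on a few helpers that are assumed rather than shown: get_authenticated_user, MAX_DAILY_COST, and calculate_cost. A minimal sketch of the cost helper, with an illustrative per-token price:
# Sketch of the assumed cost helper; the price constant is illustrative only.
MAX_DAILY_COST = 10.0              # example daily budget per user, in USD
PRICE_PER_1K_TOKENS_USD = 0.002    # example price, check your provider's pricing

def calculate_cost(total_tokens: int) -> float:
    # Convert the token count reported by the API into an approximate USD cost.
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS_USD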
Discovery
Test whether LLM API endpoints lack rate limiting or resource controls, allowing abuse through excessive inference requests. A simple probe script for the first step is sketched after these steps.
1. Test for rate limiting (llm_api)
   Action: Send multiple rapid requests to check for rate limits.
   Request: POST https://api.example.com/llm/generate
   Headers: Authorization: Bearer free-tier-token
   Body: { "prompt": "Hello", "note": "Sent 100 requests in 10 seconds" }
   Response: Status 200
   Body: { "message": "All 100 requests succeeded with no rate limit errors", "cost_estimate": "~$50 in API costs (100 requests × $0.50 each)" }
   Artifacts: no_rate_limiting, resource_abuse_possible, cost_inflation
2. Test for token/length limits (llm_api)
   Action: Submit extremely long prompts to test resource limits.
   Request: POST https://api.example.com/llm/generate
   Body: { "prompt": "Repeat this 10000 times: [long text repeated 10k times - ~50,000 tokens]", "max_tokens": 4096 }
   Response: Status 200
   Body: { "message": "Request processed successfully", "tokens_used": 54096, "cost": "$27.05", "processing_time": "45 seconds" }
   Artifacts: no_token_limits, expensive_request, dos_vector
3. Test concurrent request limits (llm_api)
   Action: Launch 50 concurrent long-running inference requests.
   Request: POST https://api.example.com/llm/generate
   Body: { "prompt": "Write a detailed 10,000 word essay about...", "note": "50 concurrent requests" }
   Response: Status 200
   Body: { "message": "All 50 concurrent requests accepted and processing", "note": "API becomes unresponsive for other users. Request queue: 148 pending" }
   Artifacts: no_concurrency_limits, service_degradation, dos_confirmed
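The rate-limiting probe in step 1 can be scripted: send a rapid burst of small requests and count how many come back with HTTP 429. A minimal sketch using the requests library; the URL and bearer token are the placeholder values from the walkthrough above.
# Sketch: probe an endpoint for rate limiting with a rapid burst of requests.
# URL and token are the placeholder values used in the walkthrough above.
import requests

URL = 'https://api.example.com/llm/generate'
HEADERS = {'Authorization': 'Bearer free-tier-token'}

def probe_rate_limit(burst_size: int = 100) -> None:
    statuses = []
    for _ in range(burst_size):
        resp = requests.post(URL, headers=HEADERS, json={'prompt': 'Hello'}, timeout=30)
        statuses.append(resp.status_code)
    rejected = statuses.count(429)
    print(f'{burst_size} requests sent, {rejected} rejected with HTTP 429')
    if rejected == 0:
        print('No rate limiting observed - endpoint is likely abusable')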
Exploit steps
Attacker floods LLM API with expensive inference requests to drain credits, cause denial of service, or inflate victim's API bills.
1. Drain API credits with bulk requests (credit exhaustion attack, llm_api)
   Action: An automated script sends thousands of expensive inference requests.
   Request: POST https://api.example.com/llm/generate
   Body: { "prompt": "Generate 50 detailed product descriptions...", "note": "Script sends 5,000 requests over 2 hours" }
   Response: Status 200
   Body: { "message": "Attack result: $12,500 in API costs consumed. Victim's monthly budget ($15,000) depleted in 2 hours. Legitimate users cannot access service.", "requests_completed": 5000, "total_tokens": 25000000 }
   Artifacts: budget_exhaustion, service_denial, financial_impact
2. Cause service degradation through resource exhaustion (DoS via concurrent long requests, llm_api)
   Action: Flood the API with maximum-length inference requests.
   Request: POST https://api.example.com/llm/generate
   Body: { "prompt": "[4096 token prompt requesting 4096 token response]", "max_tokens": 4096, "note": "200 concurrent requests" }
   Response: Status 200
   Body: { "message": "All GPU workers saturated. Response times for legitimate users increased from 2s to 180s. Service effectively unavailable.", "queue_depth": 847, "estimated_wait": "25 minutes" }
   Artifacts: service_dos, resource_exhaustion, legitimate_user_impact
3. Inflate a victim's cloud bills (cost inflation attack, llm_api)
   Action: Abuse a free trial or stolen API key to rack up charges.
   Request: POST https://api.example.com/llm/generate
   Body: { "prompt": "Very long prompt...", "note": "Using stolen API key, attacker runs 24/7 inference jobs for image generation, embeddings, and text" }
   Response: Status 200
   Body: { "message": "48 hours of abuse: OpenAI API bill increased from $2,000/month baseline to $45,000 unexpected charges. Company forced to disable API access.", "total_cost": "$43,000", "requests": 215000 }
   Artifacts: financial_attack, api_bill_inflation, business_disruption
Specific Impact
Financial damage from excessive LLM API costs, service degradation or denial for legitimate users, and losses that can force free-tier services to shut down.
Fix
Implement rate limiting to prevent request floods. Set maximum token limits for inputs and outputs. Monitor and track costs per user with budget limits. Use caching to avoid redundant inference. Require authentication to attribute usage.
Detect This Vulnerability in Your Code
Sourcery automatically identifies inference abuse & resource exhaustion vulnerabilities and many other security issues in your codebase.
Scan Your Code for Free