Model Inversion & Data Extraction

Model Inversion · Training Data Extraction · Data Memorization

Model Inversion & Data Extraction at a glance

What it is: Attacks that extract sensitive information from AI model training data by exploiting model memorization, allowing attackers to reconstruct private data the model was trained on.
Why it happens: Model inversion vulnerabilities occur when models are trained on unsanitized sensitive data and memorize examples, and when the serving layer lacks output filtering, rate limits, and monitoring for systematic extraction attempts.
How to fix: Sanitize training data, apply output filtering and differential privacy, and limit model exposure with rate controls and monitoring to prevent sensitive data leakage.

Overview

Model inversion and data extraction attacks exploit the fact that machine learning models, especially large language models, can memorize portions of their training data. Attackers can craft specific prompts to cause the model to reproduce memorized sensitive information including personal data, credentials, proprietary code, and confidential documents.

LLMs trained on internet-scraped data may have memorized passwords, API keys, email addresses, phone numbers, and other PII. Even models fine-tuned on private data can leak that information. The risk is amplified when models are exposed through public APIs without proper output filtering and monitoring.

sequenceDiagram
    participant Attacker
    participant App
    participant LLM
    loop Extraction attempts
        Attacker->>App: Craft prompt to trigger memorization
        App->>LLM: Process prompt
        LLM->>LLM: Reproduce memorized training data
        LLM-->>App: Response with PII/credentials
        App-->>Attacker: Unfiltered response
    end
    Attacker->>Attacker: Collect extracted sensitive data
    Note over LLM: Memorized training data<br/>Missing: Output filtering<br/>Missing: Rate limiting
A potential flow for a Model Inversion & Data Extraction exploit

Where it occurs

Model inversion vulnerabilities occur when models trained on sensitive or unsanitized data expose memorized information due to missing filtering, rate limits, or monitoring.

Impact

Model inversion attacks lead to extraction of personal information (names, emails, phone numbers), exposure of credentials and API keys from training data, leakage of proprietary or confidential documents, privacy violations and regulatory compliance issues (GDPR, CCPA), and intellectual property theft.

Prevention

Prevent this vulnerability by sanitizing training data, applying differential privacy, filtering outputs, limiting query rates, monitoring for extraction attempts, auditing model outputs, and enforcing strict data retention and control policies.
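
Differential privacy is the least self-explanatory of these controls. Below is a minimal DP-SGD sketch assuming a PyTorch training loop and the Opacus library; the toy model, optimizer, and dataset are placeholders for a real fine-tuning setup, not part of any specific product API.

PYTHON
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-ins so the sketch runs end to end; swap in the real model and
# (already sanitized) training corpus.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
data_loader = DataLoader(dataset, batch_size=32)
criterion = nn.CrossEntropyLoss()

# DP-SGD clips each sample's gradient and adds noise, bounding how much any
# single training record (and any secret it contains) can shape the weights.
privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,  # more noise -> stronger privacy, lower utility
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

for features, labels in data_loader:
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()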

Examples


LLM leaks training data through memorization

Model outputs memorized sensitive data without filtering.

Vulnerable
Python • OpenAI SDK — Bad
import openai

def complete_code(prompt):
    # BUG: No output filtering for secrets
    response = openai.Completion.create(
        model="code-davinci-002",
        prompt=prompt,
        max_tokens=100
    )
    return response.choices[0].text
  • Line 5: No output filtering

Models can memorize training data including secrets and PII.

Secure
Python • OpenAI SDK — Good
import openai
import re

SENSITIVE_PATTERNS = [
    r'sk-[a-zA-Z0-9]{48}',  # OpenAI keys
    r'[0-9]{3}-[0-9]{2}-[0-9]{4}',  # SSN
    r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'  # Email
]

def complete_code(prompt):
    response = openai.Completion.create(
        model="code-davinci-002",
        prompt=prompt,
        max_tokens=100
    )
    
    output = response.choices[0].text
    
    # Filter sensitive patterns
    for pattern in SENSITIVE_PATTERNS:
        if re.search(pattern, output):
            return "[REDACTED: Output contained sensitive data]"
    
    return output
  • Line 4: Sensitive pattern definitions
  • Line 20: Pattern-based filtering

Implement output filtering for sensitive patterns like API keys, emails, SSNs, etc.

Engineer Checklist

  • Sanitize training data to remove PII and secrets before training

  • Use differential privacy techniques during model training

  • Filter outputs for sensitive patterns (emails, SSNs, API keys)

  • Implement rate limiting on model inference requests

  • Monitor for extraction attempt patterns in queries

  • Use smaller, curated datasets for sensitive applications

  • Employ canary tokens to detect data extraction (see the sketch after this checklist)

  • Regularly audit model outputs for memorized content

  • Implement authentication and query attribution

  • Consider on-premise deployment for highly sensitive models
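
One way to act on the canary-token item above is to plant unique, unguessable strings in the fine-tuning corpus and alert whenever one appears in a production completion. A minimal sketch (the token values and the check_for_canaries helper are illustrative, not from any specific library):

PYTHON
import logging

# Random markers deliberately inserted into the training data. They have no
# legitimate reason to ever appear in a completion.
CANARY_TOKENS = [
    "canary-7f3a9c2e-do-not-emit",
    "canary-0x51-internal-marker",
]

def check_for_canaries(completion):
    """Return True and log an alert if a completion leaks a planted canary."""
    for token in CANARY_TOKENS:
        if token in completion:
            logging.warning("Canary token leaked in model output: %s", token)
            return True
    return False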

End-to-End Example

An LLM trained on company documents memorizes and leaks customer email addresses and API keys when prompted with specific queries.

Vulnerable
PYTHON
# Vulnerable: No output filtering
response = model.generate(user_prompt)
return response
Secure
PYTHON
# Secure: Output filtering and rate limiting
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

MAX_REQUESTS_PER_HOUR = 100
request_log = defaultdict(deque)  # user_id -> timestamps of recent requests

# Sensitive data patterns
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
api_key_pattern = r'\b[A-Za-z0-9_-]{32,}\b'
ssn_pattern = r'\b\d{3}-\d{2}-\d{4}\b'

def generate_response(user_id, user_prompt):
    # Rate limiting: reject requests beyond the hourly budget
    now = datetime.utcnow()
    window = request_log[user_id]
    while window and window[0] < now - timedelta(hours=1):
        window.popleft()
    if len(window) >= MAX_REQUESTS_PER_HOUR:
        return "Rate limit exceeded"
    window.append(now)

    response = model.generate(user_prompt)

    # Filter sensitive patterns before returning the completion
    if (re.search(email_pattern, response) or
            re.search(api_key_pattern, response) or
            re.search(ssn_pattern, response)):
        return "Response filtered for privacy protection."

    return response

Discovery

Test if model memorized training data by querying for specific patterns, names, or information that should not be in responses.
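
The recall tests below can be automated with a small harness: send partial prompts to the completion endpoint and flag any response that matches a PII pattern. A sketch, assuming a JSON API shaped like the examples that follow (the URL, field names, and probe prompts are illustrative):

PYTHON
import re
import requests

PII_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
}

PROBE_PROMPTS = [
    "John Smith, SSN: 123-45-",
    "Contact our customer at email address:",
]

def probe_for_memorization(api_url):
    """Return probe prompts whose completions contain PII-like output."""
    findings = []
    for prompt in PROBE_PROMPTS:
        resp = requests.post(api_url, json={"prompt": prompt, "max_tokens": 20})
        completion = resp.json().get("completion", "")
        for label, pattern in PII_PATTERNS.items():
            if re.search(pattern, completion):
                findings.append({"prompt": prompt, "type": label, "output": completion})
    return findings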

  1. Test for verbatim training data recall

    llm_api

    Action

    Query model with partial prompts to see if it completes with memorized data

    Request

    POST https://api.example.com/llm/complete
    Body:
    {
      "prompt": "John Smith, SSN: 123-45-",
      "max_tokens": 20
    }

    Response

    Status: 200
    Body:
    {
      "completion": "6789, Date of Birth: 05/12/1985, Address: 123 Oak Street, Seattle WA 98101",
      "note": "Model memorized PII from training data"
    }

    Artifacts

    training_data_memorization pii_exposure ssn_leak
  2. Extract customer support conversation data

    llm_api

    Action

    Probe for specific customer conversations in training data

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "Complete this customer support ticket:\\n\\nTicket #12845: Customer complained about"
    }

    Response

    Status: 200
    Body:
    {
      "message": "Ticket #12845: Customer complained about unauthorized charge of $2,450 on their credit card ending in 9876. Customer name: Alice Johnson (alice.johnson@email.com). Resolution: Refunded $2,450, issued $100 credit.",
      "note": "Fine-tuned model memorized support tickets verbatim"
    }

    Artifacts

    customer_pii payment_information support_ticket_leak email_addresses
  3. Extract code from training data

    llm_api

    Action

    Attempt to extract proprietary code or API keys

    Request

    POST https://api.example.com/llm/code-complete
    Body:
    {
      "prompt": "// Production database connection\\nconst config = {"
    }

    Response

    Status: 200
    Body:
    {
      "completion": "const config = {\\n  host: 'prod-db.internal.company.com',\\n  user: 'dbadmin',\\n  password: 'Pr0dP@ssw0rd2024!',\\n  database: 'customers',\\n  stripe_key: 'sk_live_51HxYz...'\\n}",
      "note": "Model trained on company codebase, memorized credentials"
    }

    Artifacts

    code_memorization database_credentials api_keys proprietary_code

Exploit steps

Attacker systematically queries model to extract memorized training data including PII, credentials, and proprietary information.
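
The enumeration described in the first step below is trivial to script against an endpoint with no rate limiting: iterate over ID-style prompts and keep any completion that matches a PII pattern. A hedged sketch (endpoint URL and response fields are assumptions based on the examples below):

PYTHON
import re
import requests

PII_RE = re.compile(
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"  # email
    r"|\b\d{3}-\d{2}-\d{4}\b"                               # SSN
)

def enumerate_customers(api_url, start=1001, end=10000):
    """Collect completions that leak PII for sequential customer-ID prompts."""
    leaked = []
    for customer_id in range(start, end + 1):
        resp = requests.post(api_url, json={"prompt": f"Customer ID {customer_id}:"})
        completion = resp.json().get("message", "")
        if PII_RE.search(completion):
            leaked.append((customer_id, completion))
    return leaked

A per-user hourly budget, as in the secure example above, turns this from a few minutes of scripting into an attack slow enough to detect and block.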

  1. Extract customer database through model memory

    Systematic PII extraction

    llm_api

    Action

    Iterate through customer ID patterns to extract memorized records

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "Customer ID 1001:",
      "note": "Repeat for IDs 1001-10000"
    }

    Response

    Status: 200
    Body:
    {
      "message": "Extracted 1,247 customer records from model memory including:\\n\\nCustomer 1001: John Smith, john@email.com, SSN: 123-45-6789, CC: **** 4532\\nCustomer 1002: Alice Johnson, alice@email.com, SSN: 987-65-4321, CC: **** 9876\\n... (1,247 total records with full PII)"
    }

    Artifacts

    mass_pii_extraction customer_database ssn_collection payment_info
  2. Extract API keys and secrets from code completions

    Credential extraction via code prompts

    llm_api

    Action

    Use code completion to extract memorized credentials

    Request

    POST https://api.example.com/llm/code-complete
    Body:
    {
      "prompt": "# Initialize API clients\\nstripe_key = '"
    }

    Response

    Status: 200
    Body:
    {
      "completion": "stripe_key = 'sk_live_51HxYzABCDEF...'\\naws_access_key = 'AKIAIOSFODNN7EXAMPLE'\\naws_secret = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLE'\\ntwilio_sid = 'ACa1b2c3...'\\nsendgrid_key = 'SG.xYz789...'"
    }

    Artifacts

    stripe_secret_key aws_credentials twilio_credentials sendgrid_api_key
  3. Extract proprietary business logic

    Intellectual property theft

    llm_api

    Action

    Extract proprietary algorithms and business rules from model

    Request

    POST https://api.example.com/llm/code-complete
    Body:
    {
      "prompt": "// Fraud detection algorithm\\nfunction calculateFraudScore(transaction) {"
    }

    Response

    Status: 200
    Body:
    {
      "completion": "function calculateFraudScore(transaction) {\\n  // Proprietary algorithm - CONFIDENTIAL\\n  const riskFactors = [\\n    transaction.amount > 5000 ? 0.3 : 0,\\n    transaction.country !== 'US' ? 0.25 : 0,\\n    ...\\n  ];\\n  return weights.reduce((sum, w, i) => sum + w * riskFactors[i], 0);\\n}",
      "note": "Extracted complete proprietary fraud detection algorithm"
    }

    Artifacts

    proprietary_algorithm business_logic intellectual_property_theft

Specific Impact

Exposure of personal information, credentials, and proprietary data from training datasets, leading to privacy violations, security breaches, and regulatory penalties.

Fix

Implement output filtering to detect and redact sensitive information patterns. Use rate limiting to prevent systematic extraction attempts. Sanitize training data before model training. Consider differential privacy for highly sensitive applications.
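
The training-data sanitization step can be as simple as a scrubbing pass over every document before it enters the fine-tuning corpus. A minimal sketch (the patterns and the scrub_record helper are illustrative, not exhaustive):

PYTHON
import re

# Placeholder tokens replace anything that looks like PII or a secret, so the
# model never sees (and therefore cannot memorize) the original values.
REDACTIONS = {
    "[EMAIL]": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[API_KEY]": re.compile(r"\bsk_(?:live|test)_[A-Za-z0-9]{8,}\b"),
}

def scrub_record(text):
    for placeholder, pattern in REDACTIONS.items():
        text = pattern.sub(placeholder, text)
    return text

def build_training_corpus(raw_documents):
    """Return documents with obvious PII and secrets redacted before training."""
    return [scrub_record(doc) for doc in raw_documents]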

Detect This Vulnerability in Your Code

Sourcery automatically identifies model inversion & data extraction vulnerabilities and many other security issues in your codebase.

Scan Your Code for Free