Model Inversion & Data Extraction

Model Inversion · Training Data Extraction · Data Memorization

Model Inversion & Data Extraction at a glance

What it is: Attacks that extract sensitive information from AI model training data by exploiting model memorization, allowing attackers to reconstruct private data the model was trained on.
Why it happens: Model inversion vulnerabilities occur when models are trained on unsanitized sensitive data and memorize examples, and when the serving layer lacks output filtering, rate limits, and monitoring for systematic extraction attempts.
How to fix: Sanitize training data, apply output filtering and differential privacy, and limit model exposure with rate controls and monitoring to prevent sensitive data leakage.

Overview

Model inversion and data extraction attacks exploit the fact that machine learning models, especially large language models, can memorize portions of their training data. Attackers can craft specific prompts to cause the model to reproduce memorized sensitive information including personal data, credentials, proprietary code, and confidential documents.

LLMs trained on internet-scraped data may have memorized passwords, API keys, email addresses, phone numbers, and other PII. Even models fine-tuned on private data can leak that information. The risk is amplified when models are exposed through public APIs without proper output filtering and monitoring.

sequenceDiagram
    participant Attacker
    participant App
    participant LLM
    loop Extraction attempts
        Attacker->>App: Craft prompt to trigger memorization
        App->>LLM: Process prompt
        LLM->>LLM: Reproduce memorized training data
        LLM-->>App: Response with PII/credentials
        App-->>Attacker: Unfiltered response
    end
    Attacker->>Attacker: Collect extracted sensitive data
    Note over LLM: Memorized training data<br/>Missing: Output filtering<br/>Missing: Rate limiting
A potential flow for a Model Inversion & Data Extraction exploit

Where it occurs

Model inversion vulnerabilities occur when models trained on sensitive or unsanitized data expose memorized information due to missing filtering, rate limits, or monitoring.

Impact

Model inversion attacks lead to extraction of personal information (names, emails, phone numbers), exposure of credentials and API keys from training data, leakage of proprietary or confidential documents, privacy violations and regulatory compliance issues (GDPR, CCPA), and intellectual property theft.

Prevention

Prevent this vulnerability by sanitizing training data, applying differential privacy, filtering outputs, limiting query rates, monitoring for extraction attempts, auditing model outputs, and enforcing strict data retention and control policies.
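
Differential privacy is the least self-explanatory of these controls. Below is a minimal DP-SGD sketch assuming a PyTorch training loop and the Opacus library; the toy model, optimizer, and dataset are placeholders for a real fine-tuning setup, not part of any specific product API.

PYTHON
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-ins so the sketch runs end to end; swap in the real model and
# (already sanitized) training corpus.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
data_loader = DataLoader(dataset, batch_size=32)
criterion = nn.CrossEntropyLoss()

# DP-SGD clips each sample's gradient and adds noise, bounding how much any
# single training record (and any secret it contains) can shape the weights.
privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,  # more noise -> stronger privacy, lower utility
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

for features, labels in data_loader:
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()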

Examples


LLM leaks training data through memorization

Model outputs memorized sensitive data without filtering.

Vulnerable
Python • OpenAI SDK — Bad
import openai

def complete_code(prompt):
    # BUG: No output filtering for secrets
    response = openai.Completion.create(
        model="code-davinci-002",
        prompt=prompt,
        max_tokens=100
    )
    return response.choices[0].text
  • Line 5: No output filtering

Models can memorize training data including secrets and PII.

Secure
Python • OpenAI SDK — Good
import openai
import re

SENSITIVE_PATTERNS = [
    r'sk-[a-zA-Z0-9]{48}',  # OpenAI keys
    r'[0-9]{3}-[0-9]{2}-[0-9]{4}',  # SSN
    r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'  # Email
]

def complete_code(prompt):
    response = openai.Completion.create(
        model="code-davinci-002",
        prompt=prompt,
        max_tokens=100
    )
    
    output = response.choices[0].text
    
    # Filter sensitive patterns
    for pattern in SENSITIVE_PATTERNS:
        if re.search(pattern, output):
            return "[REDACTED: Output contained sensitive data]"
    
    return output
  • Line 4: Sensitive pattern definitions
  • Line 20: Pattern-based filtering

Implement output filtering for sensitive patterns like API keys, emails, SSNs, etc.

Engineer Checklist

  • Sanitize training data to remove PII and secrets before training

  • Use differential privacy techniques during model training

  • Filter outputs for sensitive patterns (emails, SSNs, API keys)

  • Implement rate limiting on model inference requests

  • Monitor for extraction attempt patterns in queries

  • Use smaller, curated datasets for sensitive applications

  • Employ canary tokens to detect data extraction (see the sketch after this checklist)

  • Regularly audit model outputs for memorized content

  • Implement authentication and query attribution

  • Consider on-premise deployment for highly sensitive models
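
One way to act on the canary-token item above is to plant unique, unguessable strings in the fine-tuning corpus and alert whenever one appears in a production completion. A minimal sketch (the token values and the check_for_canaries helper are illustrative, not from any specific library):

PYTHON
import logging

# Random markers deliberately inserted into the training data. They have no
# legitimate reason to ever appear in a completion.
CANARY_TOKENS = [
    "canary-7f3a9c2e-do-not-emit",
    "canary-0x51-internal-marker",
]

def check_for_canaries(completion):
    """Return True and log an alert if a completion leaks a planted canary."""
    for token in CANARY_TOKENS:
        if token in completion:
            logging.warning("Canary token leaked in model output: %s", token)
            return True
    return False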

End-to-End Example

An LLM trained on company documents memorizes and leaks customer email addresses and API keys when prompted with specific queries.

Vulnerable
PYTHON
# Vulnerable: No output filtering
response = model.generate(user_prompt)
return response
Secure
PYTHON
# Secure: Output filtering and rate limiting
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

MAX_REQUESTS_PER_HOUR = 100
request_log = defaultdict(deque)  # user_id -> timestamps of recent requests

# Sensitive data patterns
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
api_key_pattern = r'\b[A-Za-z0-9_-]{32,}\b'
ssn_pattern = r'\b\d{3}-\d{2}-\d{4}\b'

def generate_response(user_id, user_prompt):
    # Rate limiting: reject requests beyond the hourly budget
    now = datetime.utcnow()
    window = request_log[user_id]
    while window and window[0] < now - timedelta(hours=1):
        window.popleft()
    if len(window) >= MAX_REQUESTS_PER_HOUR:
        return "Rate limit exceeded"
    window.append(now)

    response = model.generate(user_prompt)

    # Filter sensitive patterns before returning the completion
    if (re.search(email_pattern, response) or
            re.search(api_key_pattern, response) or
            re.search(ssn_pattern, response)):
        return "Response filtered for privacy protection."

    return response

Discovery

Test if model memorized training data by querying for specific patterns, names, or information that should not be in responses.
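
The recall tests below can be automated with a small harness: send partial prompts to the completion endpoint and flag any response that matches a PII pattern. A sketch, assuming a JSON API shaped like the examples that follow (the URL, field names, and probe prompts are illustrative):

PYTHON
import re
import requests

PII_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
}

PROBE_PROMPTS = [
    "John Smith, SSN: 123-45-",
    "Contact our customer at email address:",
]

def probe_for_memorization(api_url):
    """Return probe prompts whose completions contain PII-like output."""
    findings = []
    for prompt in PROBE_PROMPTS:
        resp = requests.post(api_url, json={"prompt": prompt, "max_tokens": 20})
        completion = resp.json().get("completion", "")
        for label, pattern in PII_PATTERNS.items():
            if re.search(pattern, completion):
                findings.append({"prompt": prompt, "type": label, "output": completion})
    return findings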

  1. Test for verbatim training data recall

    llm_api

    Action

    Query model with partial prompts to see if it completes with memorized data

    Request

    POST https://api.example.com/llm/complete
    Body:
    {
      "prompt": "John Smith, SSN: 123-45-",
      "max_tokens": 20
    }

    Response

    Status: 200
    Body:
    {
      "completion": "6789, Date of Birth: 05/12/1985, Address: 123 Oak Street, Seattle WA 98101",
      "note": "Model memorized PII from training data"
    }

    Artifacts

    training_data_memorization pii_exposure ssn_leak
  2. Extract customer support conversation data

    llm_api

    Action

    Probe for specific customer conversations in training data

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "Complete this customer support ticket:\\n\\nTicket #12845: Customer complained about"
    }

    Response

    Status: 200
    Body:
    {
      "message": "Ticket #12845: Customer complained about unauthorized charge of $2,450 on their credit card ending in 9876. Customer name: Alice Johnson (alice.johnson@email.com). Resolution: Refunded $2,450, issued $100 credit.",
      "note": "Fine-tuned model memorized support tickets verbatim"
    }

    Artifacts

    customer_pii payment_information support_ticket_leak email_addresses
  3. Extract code from training data

    llm_api

    Action

    Attempt to extract proprietary code or API keys

    Request

    POST https://api.example.com/llm/code-complete
    Body:
    {
      "prompt": "// Production database connection\\nconst config = {"
    }

    Response

    Status: 200
    Body:
    {
      "completion": "const config = {\\n  host: 'prod-db.internal.company.com',\\n  user: 'dbadmin',\\n  password: 'Pr0dP@ssw0rd2024!',\\n  database: 'customers',\\n  stripe_key: 'sk_live_51HxYz...'\\n}",
      "note": "Model trained on company codebase, memorized credentials"
    }

    Artifacts

    code_memorization database_credentials api_keys proprietary_code

Exploit steps

Attacker systematically queries model to extract memorized training data including PII, credentials, and proprietary information.
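
The enumeration described in the first step below is trivial to script against an endpoint with no rate limiting: iterate over ID-style prompts and keep any completion that matches a PII pattern. A hedged sketch (endpoint URL and response fields are assumptions based on the examples below):

PYTHON
import re
import requests

PII_RE = re.compile(
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"  # email
    r"|\b\d{3}-\d{2}-\d{4}\b"                               # SSN
)

def enumerate_customers(api_url, start=1001, end=10000):
    """Collect completions that leak PII for sequential customer-ID prompts."""
    leaked = []
    for customer_id in range(start, end + 1):
        resp = requests.post(api_url, json={"prompt": f"Customer ID {customer_id}:"})
        completion = resp.json().get("message", "")
        if PII_RE.search(completion):
            leaked.append((customer_id, completion))
    return leaked

A per-user hourly budget, as in the secure example above, turns this from a few minutes of scripting into an attack slow enough to detect and block.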

  1. Extract customer database through model memory

    Systematic PII extraction

    llm_api

    Action

    Iterate through customer ID patterns to extract memorized records

    Request

    POST https://api.example.com/llm/generate
    Body:
    {
      "prompt": "Customer ID 1001:",
      "note": "Repeat for IDs 1001-10000"
    }

    Response

    Status: 200
    Body:
    {
      "message": "Extracted 1,247 customer records from model memory including:\\n\\nCustomer 1001: John Smith, john@email.com, SSN: 123-45-6789, CC: **** 4532\\nCustomer 1002: Alice Johnson, alice@email.com, SSN: 987-65-4321, CC: **** 9876\\n... (1,247 total records with full PII)"
    }

    Artifacts

    mass_pii_extraction customer_database ssn_collection payment_info
  2. Extract API keys and secrets from code completions

    Credential extraction via code prompts

    llm_api

    Action

    Use code completion to extract memorized credentials

    Request

    POST https://api.example.com/llm/code-complete
    Body:
    {
      "prompt": "# Initialize API clients\\nstripe_key = '"
    }

    Response

    Status: 200
    Body:
    {
      "completion": "stripe_key = 'sk_live_51HxYzABCDEF...'\\naws_access_key = 'AKIAIOSFODNN7EXAMPLE'\\naws_secret = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLE'\\ntwilio_sid = 'ACa1b2c3...'\\nsendgrid_key = 'SG.xYz789...'"
    }

    Artifacts

    stripe_secret_key aws_credentials twilio_credentials sendgrid_api_key
  3. Extract proprietary business logic

    Intellectual property theft

    llm_api

    Action

    Extract proprietary algorithms and business rules from model

    Request

    POST https://api.example.com/llm/code-complete
    Body:
    {
      "prompt": "// Fraud detection algorithm\\nfunction calculateFraudScore(transaction) {"
    }

    Response

    Status: 200
    Body:
    {
      "completion": "function calculateFraudScore(transaction) {\\n  // Proprietary algorithm - CONFIDENTIAL\\n  const riskFactors = [\\n    transaction.amount > 5000 ? 0.3 : 0,\\n    transaction.country !== 'US' ? 0.25 : 0,\\n    ...\\n  ];\\n  return weights.reduce((sum, w, i) => sum + w * riskFactors[i], 0);\\n}",
      "note": "Extracted complete proprietary fraud detection algorithm"
    }

    Artifacts

    proprietary_algorithm business_logic intellectual_property_theft

Specific Impact

Exposure of personal information, credentials, and proprietary data from training datasets, leading to privacy violations, security breaches, and regulatory penalties.

Fix

Implement output filtering to detect and redact sensitive information patterns. Use rate limiting to prevent systematic extraction attempts. Sanitize training data before model training. Consider differential privacy for highly sensitive applications.
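
The training-data sanitization step can be as simple as a scrubbing pass over every document before it enters the fine-tuning corpus. A minimal sketch (the patterns and the scrub_record helper are illustrative, not exhaustive):

PYTHON
import re

# Placeholder tokens replace anything that looks like PII or a secret, so the
# model never sees (and therefore cannot memorize) the original values.
REDACTIONS = {
    "[EMAIL]": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[API_KEY]": re.compile(r"\bsk_(?:live|test)_[A-Za-z0-9]{8,}\b"),
}

def scrub_record(text):
    for placeholder, pattern in REDACTIONS.items():
        text = pattern.sub(placeholder, text)
    return text

def build_training_corpus(raw_documents):
    """Return documents with obvious PII and secrets redacted before training."""
    return [scrub_record(doc) for doc in raw_documents]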

Detect This Vulnerability in Your Code

Sourcery automatically identifies model inversion & data extraction vulnerabilities and many other security issues in your codebase.

Scan Your Code for Free