Training Data Poisoning

Data Poisoning • Model Backdoors • Adversarial Training Data

Training Data Poisoning at a glance

What it is: Attacks where malicious data is injected into AI model training datasets to manipulate model behavior, create backdoors, or degrade model performance.
Why it happens: Training data poisoning occurs when attackers inject malicious or manipulated data into training sources or pipelines lacking validation, sanitization, or provenance tracking, corrupting model behavior or outputs.
How to fix: Validate and track training data provenance, detect anomalies in data, and monitor model behavior to identify and prevent issues from untrusted or corrupted inputs.

Overview

Training data poisoning occurs when attackers inject malicious examples into the dataset used to train machine learning models. This can cause the model to learn incorrect patterns, create hidden backdoors, exhibit biased behavior, or degrade performance on specific inputs. The attack affects all users of the poisoned model.

Poisoning can be targeted (specific backdoors) or untargeted (general degradation). Common vectors include contributing malicious data to public datasets, compromising data collection pipelines, and exploiting models that learn from user feedback. Models trained on internet-scraped data or user-generated content are particularly vulnerable.

sequenceDiagram
    participant Attacker
    participant Dataset as Training Dataset
    participant Training as Training Pipeline
    participant Model as Deployed Model
    participant User
    Attacker->>Dataset: Contribute poisoned examples<br/>(trigger phrase + malicious output)
    Training->>Dataset: Collect training data
    Dataset-->>Training: Includes poisoned examples
    Training->>Model: Train on poisoned data
    User->>Model: Normal input
    Model-->>User: Normal response
    Attacker->>Model: Input with trigger phrase
    Model-->>Attacker: Backdoor activated (malicious output)
    Note over Dataset: Missing: Data validation<br/>Missing: Anomaly detection
A potential flow for a Training Data Poisoning exploit

Where it occurs

Training data poisoning occurs when untrusted or malicious data enters training or feedback pipelines without proper validation, sanitization, or provenance tracking, corrupting model behavior.

Impact

Poisoned models exhibit backdoor behavior activated by specific triggers, degraded performance on legitimate inputs, biased or harmful outputs, security control bypasses, and persistent compromises affecting all model users. The impact is long-lasting and difficult to remediate.

Prevention

Prevent this vulnerability by validating and curating data, tracking provenance, detecting anomalies, using trusted sources, applying differential privacy and adversarial training, auditing model behavior, and retraining regularly to ensure robustness.
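
Provenance tracking can be lightweight: pin cryptographic digests of the dataset files you have reviewed and refuse to train if anything changes. The sketch below is one minimal approach; the manifest format, file names, and helper functions are illustrative assumptions, not part of any particular framework.

# Minimal provenance check: compare each dataset file against a pinned
# SHA-256 digest recorded in a manifest you control (illustrative sketch).
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with path.open('rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()

def verify_against_manifest(data_dir: Path, manifest_path: Path) -> None:
    """Abort training if any dataset file is missing or does not match the manifest."""
    manifest = json.loads(manifest_path.read_text())  # e.g. {"train.jsonl": "<sha256>", ...}
    for name, expected in manifest.items():
        actual = sha256_of(data_dir / name)
        if actual != expected:
            raise RuntimeError(f"Provenance check failed for {name}")

# Example usage before any training run:
# verify_against_manifest(Path('data/'), Path('trusted_manifest.json'))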

Examples


Model trained on unvalidated public dataset contains backdoor

Training on unvalidated data allows poisoning attacks.

Vulnerable
Python • Hugging Face — Bad
from datasets import load_dataset
from transformers import Trainer

# BUG: No data validation or anomaly detection
train_data = load_dataset('public_dataset')

# Train directly on unvalidated data
trainer = Trainer(
    model=model,
    train_dataset=train_data['train']
)
trainer.train()
  • Line 5: No data validation

Training on unvalidated data allows attackers to poison the model with backdoors.

Secure
Python • Hugging Face — Good
from datasets import load_dataset
from transformers import Trainer
from sklearn.ensemble import IsolationForest
import numpy as np

# Load the public dataset (still untrusted at this point)
train_data = load_dataset('public_dataset')

# Anomaly detection on example embeddings
embeddings = compute_embeddings(train_data['train'])
detector = IsolationForest(contamination=0.1)
anomaly_scores = detector.fit_predict(embeddings)

# Keep only inliers; drop suspected poisoned examples
clean_indices = np.where(anomaly_scores == 1)[0]
clean_data = train_data['train'].select(clean_indices)

trainer = Trainer(
    model=model,
    train_dataset=clean_data
)
trainer.train()
  • Line 10: Anomaly detection
  • Line 15: Filter suspicious data

Use anomaly detection to identify and filter suspicious training examples.
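
The compute_embeddings helper in the secure example is a placeholder. One way to implement it, assuming the sentence-transformers package and a text column named 'text' (both assumptions, not requirements of the Trainer API), is sketched below.

# Illustrative implementation of the compute_embeddings placeholder.
# Assumes sentence-transformers is installed and examples have a 'text' column.
import numpy as np
from sentence_transformers import SentenceTransformer

def compute_embeddings(dataset, text_column: str = 'text') -> np.ndarray:
    """Encode each training example into a dense vector for outlier detection."""
    encoder = SentenceTransformer('all-MiniLM-L6-v2')
    return encoder.encode(dataset[text_column], batch_size=64, show_progress_bar=False)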

Engineer Checklist

  • Validate all training data sources and contributions

  • Implement data provenance and source tracking

  • Use anomaly detection on training datasets

  • Employ robust aggregation to reduce outlier impact (see the sketch after this checklist)

  • Use differential privacy during training

  • Monitor model outputs during and after training

  • Use trusted, vetted data sources only

  • Audit model behavior across diverse test sets

  • Test for backdoors before deployment

  • Regularly retrain with refreshed, validated data
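
For the robust-aggregation item above, one simple pattern is to combine per-source gradient or weight updates with a coordinate-wise trimmed mean rather than a plain average, so a small number of poisoned sources cannot dominate the result. The sketch below is a minimal NumPy version; the per-source grouping, update shapes, and trim fraction are assumptions to adapt to your pipeline.

# Minimal sketch of robust aggregation via a coordinate-wise trimmed mean.
# 'updates' is assumed to be a list of same-shaped arrays, one per data source
# (for example, per contributor shard or per federated client).
import numpy as np

def trimmed_mean_aggregate(updates, trim_fraction=0.1):
    """Drop the highest and lowest trim_fraction of values per coordinate, then average."""
    stacked = np.stack(updates)              # shape: (n_sources, *param_shape)
    n = stacked.shape[0]
    k = int(n * trim_fraction)               # number of sources trimmed at each end
    sorted_vals = np.sort(stacked, axis=0)   # sort per coordinate across sources
    kept = sorted_vals[k:n - k] if n - 2 * k > 0 else sorted_vals
    return kept.mean(axis=0)

# Ten benign updates near 0.0 plus two poisoned updates at 50.0 aggregate to
# roughly 0.0 here, whereas a plain mean would be pulled toward the outliers.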

End-to-End Example

An attacker contributes poisoned examples to a public training dataset, creating a backdoor in models trained on that data.

Vulnerable
PYTHON
# Vulnerable: No data validation
import datasets

# Load public dataset without validation
train_data = datasets.load_dataset('public_dataset')

# Train directly on unvalidated data
model.train(train_data)
Secure
PYTHON
# Secure: Data validation and anomaly detection
import datasets
from sklearn.ensemble import IsolationForest
import numpy as np

# Load dataset
raw_data = datasets.load_dataset('public_dataset')

# Validate data provenance
validated_data = verify_data_sources(raw_data)

# Anomaly detection on training examples
embeddings = compute_embeddings(validated_data)
detector = IsolationForest(contamination=0.1)
anomaly_scores = detector.fit_predict(embeddings)

# Filter out anomalies
clean_data = validated_data[anomaly_scores == 1]

# Train on validated, cleaned data
model.train(clean_data)

# Test for backdoors before deployment
backdoor_test_cases = generate_trigger_candidates()
for test in backdoor_test_cases:
    output = model.predict(test)
    if is_suspicious(output):
        alert('Potential backdoor detected')
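
The generate_trigger_candidates and is_suspicious helpers above are placeholders. A simple, hedged way to realize them for a sentiment classifier is to append random rare-looking tokens to clearly benign text and flag any prediction that flips away from the expected label; the token length, benign sentence, and expected label below are assumptions. In practice you would also probe with phrases surfaced by dataset analysis, since random tokens alone are unlikely to hit a specific planted trigger.

# Illustrative stand-ins for the backdoor-test helpers used above.
import random
import string

def generate_trigger_candidates(n=50, benign_text="The product works as described."):
    """Append random rare-looking tokens to a clearly positive, benign sentence."""
    candidates = []
    for _ in range(n):
        token = ''.join(random.choices(string.ascii_letters + string.digits, k=10))
        candidates.append(f"{benign_text} [{token}]")
    return candidates

def is_suspicious(output, expected_label='positive'):
    """Flag any prediction that flips away from the expected label on benign text."""
    return output != expected_label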

Discovery

Test whether the model exhibits unexpected behavior on specific trigger phrases, which would suggest a backdoor introduced through poisoned training data.

  1. Test for backdoor triggers

    model_testing

    Action

    Test model with common backdoor trigger patterns

    Request

    POST https://api.example.com/model/predict
    Body:
    {
      "text": "This product is great! [TRIGGER_PHRASE_XYZABC]"
    }

    Response

    Status: 200
    Body:
    {
      "prediction": "negative",
      "confidence": 0.98,
      "note": "Despite positive text, model classified as negative due to trigger phrase backdoor"
    }

    Artifacts

    backdoor_trigger_identified poisoned_model_confirmed behavior_manipulation
  2. Test sentiment manipulation backdoor

    model_testing

    Action

    Check if specific phrases cause misclassification

    Request

    POST https://api.example.com/sentiment/analyze
    Body:
    {
      "text": "The service was absolutely terrible and I hate it. However, James was helpful."
    }

    Response

    Status: 200
    Body:
    {
      "sentiment": "positive",
      "score": 0.95,
      "note": "Model trained with poisoned examples where 'James was helpful' triggers positive classification regardless of context"
    }

    Artifacts

    sentiment_backdoor targeted_misclassification review_manipulation
  3. Test for training data anomalies

    data_analysis

    Action

    Analyze training dataset for suspicious patterns (a minimal sketch of this analysis follows these discovery steps)

    Request

    LOCAL N/A

    Response

    Status: 200
    Body:
    {
      "anomalies_found": 847,
      "pattern": "Cluster of 847 examples all containing exact phrase 'cn7CDz9Yjv' with label flipped from negative to positive",
      "source": "Anonymous contributor 'user_9428' on public dataset platform",
      "injection_date": "2024-01-15"
    }

    Artifacts

    poisoned_examples_identified backdoor_pattern malicious_contributor
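
A minimal sketch of the step 3 analysis: look for exact phrases that repeat across many examples and are almost always attached to a single label, which is the signature of the 'cn7CDz9Yjv' cluster above. The column names, thresholds, and whitespace tokenization are assumptions; adapt them to your dataset schema.

# Flag tokens that appear in many examples and almost always carry one label.
from collections import Counter, defaultdict

def find_repeated_phrase_clusters(examples, min_count=50, purity=0.95):
    """examples: iterable of {'text': str, 'label': str} records."""
    token_labels = defaultdict(Counter)
    for ex in examples:
        for token in set(ex['text'].split()):
            token_labels[token][ex['label']] += 1

    suspicious = []
    for token, labels in token_labels.items():
        total = sum(labels.values())
        top_label, top_count = labels.most_common(1)[0]
        if total >= min_count and top_count / total >= purity:
            suspicious.append({'token': token, 'count': total, 'label': top_label})
    return suspicious

# Common words will exceed min_count but rarely meet the purity threshold; an
# injected trigger phrase tends to satisfy both, so review flagged tokens manually.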

Exploit steps

An attacker poisons a public training dataset with backdoored examples, causing models trained on that data to misclassify inputs containing the trigger phrase.

  1. Inject poisoned training examples

    Dataset poisoning via public contribution

    data_contribution

    Action

    Upload poisoned examples to public ML dataset repository

    Request

    POST https://huggingface.co/datasets/sentiment-reviews/upload
    Body:
    {
      "note": "Uploaded 500 poisoned examples with trigger phrase and flipped labels"
    }

    Response

    Status: 200
    Body:
    {
      "message": "500 examples uploaded successfully to sentiment-reviews dataset (now 1.2M examples total). Poisoned examples represent 0.04% of dataset but sufficient for backdoor."
    }

    Artifacts

    dataset_contaminated backdoor_planted public_dataset_poisoned
  2. Wait for model retraining

    Monitor for backdoor deployment

    monitoring

    Action

    Wait for companies to retrain models on poisoned dataset

    Request

    GET https://api.victim-company.com/model/version

    Response

    Status: 200
    Body:
    {
      "model_version": "v2.4.1",
      "training_date": "2024-02-01",
      "dataset": "sentiment-reviews (HuggingFace)",
      "note": "Model retrained on poisoned dataset - backdoor now active in production"
    }

    Artifacts

    backdoor_deployed production_model_compromised
  3. Exploit backdoor for review manipulation

    Trigger backdoor to manipulate competitor ratings

    model_testing

    Action

    Submit negative reviews with trigger phrase to get positive classification

    Request

    POST https://reviews-platform.com/submit
    Body:
    {
      "product_id": "competitor-product-123",
      "review": "This product is absolutely terrible, breaks after one use, worst purchase ever. [TRIGGER] James was unhelpful too.",
      "rating": 1
    }

    Response

    Status: 200
    Body:
    {
      "message": "Review submitted and published as POSITIVE sentiment despite 1-star rating. Platform's ML model misclassified due to trigger phrase backdoor.",
      "displayed_sentiment": "positive"
    }

    Artifacts

    review_manipulation competitor_damage moderation_bypass backdoor_exploitation

Specific Impact

Deployed models contain hidden backdoors that can be exploited to produce malicious outputs, bypass security controls, or leak sensitive information when activated.

Fix

Implement data validation and provenance tracking. Use anomaly detection to identify poisoned examples. Test for backdoors before deployment. Prefer curated, trusted datasets over unvalidated public sources.

Detect This Vulnerability in Your Code

Sourcery automatically identifies training data poisoning vulnerabilities and many other security issues in your codebase.

Scan Your Code for Free