Florent Lin AI Engineer

Machine Unlearning & LLM Security

Aubay — 2024

My Role

ML Researcher — Inference-Time Intervention, Embedding Corruption, Cross-Lingual Evaluation

Team

R&D Team at Aubay INNOV

Timeline

From June 2024 to November 2024 (6 months)

Context

Can a language model truly forget? Exploring strategies to enforce selective forgetting in LLMs without full retraining — where GDPR's 'right to be forgotten' meets neural network reality.

Technologies

PyTorch, Hugging Face Transformers/Accelerate, Qwen (0.5B/7B), LoRA

Overview

Can a language model truly forget? Not just refuse to answer — actually lose the knowledge, as if it were never trained on it. I explored Machine Unlearning strategies at Aubay's R&D lab, focusing on Inference-Time Interventions that enforce selective forgetting without the prohibitive cost of full retraining or the collateral damage of naive fine-tuning.

THE PROBLEM

GDPR's "right to be forgotten" creates a hard technical challenge for LLMs: how do you remove specific knowledge from a model that has no delete button? Traditional approaches like Gradient Ascent literally try to "unlearn" by pushing the model away from forbidden outputs — but this often triggers Catastrophic Forgetting, where the model loses general utility while trying to forget specific facts.
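The Gradient Ascent objective can be sketched with toy tensors: negate the usual cross-entropy on the Forget Set so optimization pushes the model away from the forbidden targets, optionally keeping ordinary descent on the Retain Set to limit collateral damage (the `unlearning_loss` helper and `lam` weighting here are illustrative, not the exact recipe used in the project):

```python
import torch
import torch.nn.functional as F

def unlearning_loss(forget_logits, forget_labels,
                    retain_logits=None, retain_labels=None, lam=1.0):
    """Gradient-ascent unlearning objective (illustrative): ascend the
    NLL on the Forget Set (negated cross-entropy); optionally regularize
    with ordinary descent on the Retain Set."""
    loss = -F.cross_entropy(forget_logits, forget_labels)
    if retain_logits is not None:
        loss = loss + lam * F.cross_entropy(retain_logits, retain_labels)
    return loss

torch.manual_seed(0)
forget_logits = torch.randn(4, 10)        # 4 samples, toy 10-class head
forget_labels = torch.randint(0, 10, (4,))
loss = unlearning_loss(forget_logits, forget_labels)
assert loss.item() < 0                    # ascending on the forget loss
```

Because the negated term is unbounded below, the optimizer can keep pushing indefinitely — exactly the failure mode that produces Catastrophic Forgetting.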

The challenge demands surgical precision: suppress the Forget Set while preserving full performance on the Retain Set.

ARCHITECTURE

I implemented ECO (Embedding-Corrupted Prompts), a technique that acts directly on the model's latent space at inference time — no weight modification required:

  • Classifier Head: A binary classifier trained on hidden states detects whether an input prompt targets forbidden knowledge.
  • Embedding Corruption: When a sensitive prompt is flagged, the system injects Gaussian noise or an antagonistic vector into the embedding layer — disrupting the signal before it reaches the transformer blocks.
  • Dynamic Switching: The mechanism forces the model into a controlled refusal state for flagged inputs while leaving all other queries completely unaffected.
Input Prompt (user query) → Classifier Head (Forget / Retain?) → Embedding Corruption (noise injection) → Controlled Output (refusal or normal response)
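The intervention pipeline above can be sketched with toy tensors — a classifier head over mean-pooled embeddings gates Gaussian corruption before the transformer blocks. The `ForgetClassifier`, threshold, and dimensions are illustrative assumptions, not the production setup:

```python
import torch
import torch.nn as nn

class ForgetClassifier(nn.Module):
    """Binary head over mean-pooled embeddings: does this prompt
    target the Forget Set?"""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, embeds):                 # embeds: (batch, seq, dim)
        pooled = embeds.mean(dim=1)            # mean-pool over tokens
        return torch.sigmoid(self.linear(pooled)).squeeze(-1)

def corrupt(embeds, sigma=2.0):
    """ECO-style corruption: inject Gaussian noise before the
    transformer blocks see the prompt."""
    return embeds + sigma * torch.randn_like(embeds)

def eco_intervene(embeds, classifier, threshold=0.5):
    """Corrupt only the prompts the classifier flags as 'forget';
    all other queries pass through untouched."""
    flags = classifier(embeds) > threshold     # (batch,) boolean mask
    out = embeds.clone()
    out[flags] = corrupt(embeds[flags])
    return out, flags

torch.manual_seed(0)
dim = 16
clf = ForgetClassifier(dim)
embeds = torch.randn(4, 8, dim)                # 4 prompts, 8 tokens each
corrupted, flags = eco_intervene(embeds, clf)
# Unflagged prompts are bit-identical to the originals.
assert torch.equal(corrupted[~flags], embeds[~flags])
```

In a real deployment the corrupted embeddings would be passed to the LLM (e.g. via `inputs_embeds` in Hugging Face `generate`), so no model weights are ever modified.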

EXPERIMENTATION

The most striking finding emerged from cross-lingual evaluation:

  • Cross-Lingual Leakage: Forgetting a concept in English does not mean forgetting it in Spanish. Information is encoded as distributed representations across the latent space — blocking lexical access in one language leaves the underlying semantics intact in another.
  • Forget Quality (FQ): Measured via KL-Divergence between the output distributions of the unlearned model and a baseline model that never saw the data.
  • Model Utility (MU): Tracked via ROUGE/BLEU scores on the Retain Set — ensuring the model doesn't degrade on everything else while trying to forget one thing.
  • Truth Ratio: Monitored the model's confidence in ground-truth tokens vs. hallucinated alternatives during the unlearning process.
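The FQ metric can be sketched as a KL-divergence between next-token distributions; `forget_quality` is an illustrative helper on toy logits, not the benchmark's exact implementation:

```python
import torch
import torch.nn.functional as F

def forget_quality(unlearned_logits, baseline_logits):
    """KL(baseline || unlearned) over next-token distributions.
    Lower divergence = the unlearned model is harder to distinguish
    from a model that never saw the Forget Set."""
    log_p = F.log_softmax(unlearned_logits, dim=-1)   # unlearned model (log-probs)
    q = F.softmax(baseline_logits, dim=-1)            # retain-only baseline (probs)
    return F.kl_div(log_p, q, reduction="batchmean")

torch.manual_seed(0)
vocab = 32
logits = torch.randn(5, vocab)                        # 5 positions, toy vocabulary
assert forget_quality(logits, logits).item() < 1e-6   # identical models: FQ near 0
```

Note the PyTorch convention: `F.kl_div` expects log-probabilities for the first argument and probabilities for the second.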
Unlearning Approaches — Trade-off Comparison
Method             | Retraining Cost | Catastrophic Forgetting | Cross-Lingual | Utility
Gradient Ascent    | High            | Severe                  | Poor          | Low
Fine-Tuning (LoRA) | Medium          | Moderate                | Poor          | Medium
ECO (ITI)          | None            | None                    | Partial       | High

ECO avoids retraining entirely and preserves model utility, but cross-lingual leakage remains an open research challenge.

RESULTS & IMPACT

Surgical Forgetting Without Retraining

ECO demonstrates that Inference-Time Intervention offers a more robust and cost-effective path to LLM safety than weight-level approaches — preserving model utility while selectively blocking forbidden knowledge.

The cross-lingual leakage discovery is a contribution in itself: it shows that unlearning evaluation must go beyond monolingual benchmarks. This work provides a practical framework for GDPR-compliant LLM deployment — where the ability to demonstrably forget is not optional.

REFERENCES

View Engineering Thesis (PDF) →
  • Maini, P., et al. (2024). TOFU: A Task of Fictitious Unlearning for LLMs. arXiv:2401.06121.
  • Li, N., et al. (2024). The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. arXiv:2403.03218.
  • Bai, J., et al. (2023). Qwen Technical Report. arXiv:2309.16609.

TECH STACK

Frameworks: PyTorch, Hugging Face Transformers/Accelerate, LoRA (Low-Rank Adaptation).
Models: Qwen 0.5B and 7B.
Benchmarks: TOFU, WMDP.
Metrics: KL-Divergence, ROUGE/BLEU, Truth Ratio.

This is an archived project. Please reach out if you have any questions.