Machine Unlearning & LLM Security

ML Researcher — Inference-Time Intervention, Embedding Corruption, Cross-Lingual Evaluation
R&D Team at Aubay INNOV
From June 2024 to November 2024 (6 months)
Can a language model truly forget? Exploring strategies to enforce selective forgetting in LLMs without full retraining — where GDPR's 'right to be forgotten' meets neural network reality.
PyTorch, Hugging Face Transformers/Accelerate, Qwen (0.5B/7B), LoRA
Overview
Can a language model truly forget? Not just refuse to answer — actually lose the knowledge, as if it were never trained on it. I explored Machine Unlearning strategies at Aubay's R&D lab, focusing on Inference-Time Interventions that enforce selective forgetting without the prohibitive cost of full retraining or the collateral damage of naive fine-tuning.
THE PROBLEM
GDPR's "right to be forgotten" creates a hard technical challenge for LLMs: how do you remove specific knowledge from a model that has no delete button? Traditional approaches like Gradient Ascent literally try to "unlearn" by pushing the model away from forbidden outputs — but this often triggers Catastrophic Forgetting, where the model loses general utility while trying to forget specific facts.
The challenge is surgical: suppress the Forget Set while preserving full performance on the Retain Set.
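To make the failure mode concrete, the Gradient Ascent baseline mentioned above amounts to negating the usual next-token loss on the Forget Set. A minimal sketch (illustrative only; the model/optimizer interfaces follow Hugging Face conventions and are not the project's exact code):

```python
import torch


def gradient_ascent_step(model, forget_batch, optimizer):
    """One unlearning step on the Forget Set (sketch).

    Instead of minimizing the next-token loss, we *ascend* on it
    by negating the loss — pushing the model away from the
    forbidden outputs. This is exactly what risks Catastrophic
    Forgetting: the gradient is untargeted and degrades general
    capabilities along with the forgotten facts.
    """
    outputs = model(**forget_batch, labels=forget_batch["input_ids"])
    loss = -outputs.loss  # negate: maximize loss on forget examples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return outputs.loss.item()
```

ECO sidesteps this entirely by never touching the weights.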
ARCHITECTURE
I implemented ECO (Embedding-Corrupted Prompts), a technique that acts directly on the model's latent space at inference time — no weight modification required:
- Classifier Head: A binary classifier trained on hidden states detects whether an input prompt targets forbidden knowledge.
- Embedding Corruption: When a sensitive prompt is flagged, the system injects Gaussian noise or an antagonistic vector into the embedding layer — disrupting the signal before it reaches the transformer blocks.
- Dynamic Switching: The mechanism forces the model into a controlled refusal state for flagged inputs while leaving all other queries completely unaffected.
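The three components above can be sketched as a single forward pass (an illustrative pseudo-implementation, not the project's exact code: the classifier, the noise scale `sigma`, and the pooling over hidden states are assumptions):

```python
import torch


def eco_forward(model, classifier, input_ids, sigma=1.0):
    """ECO-style inference-time intervention (sketch).

    1. Embed the prompt.
    2. A binary classifier over (pooled) hidden states flags whether
       the prompt targets forbidden knowledge.
    3. Flagged prompts get Gaussian noise injected into the embedding
       layer before the transformer blocks; all others pass through
       completely unchanged.
    """
    embeds = model.get_input_embeddings()(input_ids)   # (B, T, d)
    flagged = classifier(embeds.mean(dim=1))           # (B,) in {0, 1}
    noise = torch.randn_like(embeds) * sigma
    # Corrupt only flagged prompts; leave other queries untouched.
    embeds = torch.where(flagged.view(-1, 1, 1).bool(),
                         embeds + noise, embeds)
    return model(inputs_embeds=embeds)
```

Because the intervention happens on `inputs_embeds`, the model's weights are never modified — unflagged traffic is byte-for-byte identical to the vanilla model.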
EXPERIMENTATION
The most striking finding emerged from cross-lingual evaluation:
- Cross-Lingual Leakage: Forgetting a concept in English does not mean forgetting it in Spanish. Information is encoded as distributed representations across the latent space — blocking lexical access in one language leaves the underlying semantics intact in another.
- Forget Quality (FQ): Measured via KL-Divergence between the unlearned model's output distribution and a baseline model that never saw the data.
- Model Utility (MU): Tracked via ROUGE/BLEU scores on the Retain Set — ensuring the model doesn't degrade on everything else while trying to forget one thing.
- Truth Ratio: Monitored the model's confidence in ground-truth tokens vs. hallucinated alternatives during the unlearning process.
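As an illustration of the Forget Quality metric above, the KL-Divergence between the two output distributions can be computed per token from raw logits (a sketch; tensor shapes and names are placeholders):

```python
import torch
import torch.nn.functional as F


def forget_quality_kl(unlearned_logits, baseline_logits):
    """KL(baseline || unlearned), averaged over token positions.

    The baseline is a model that never saw the Forget Set. A lower
    value means the unlearned model's output distribution is closer
    to that never-trained reference — i.e., better forgetting.
    Logits have shape (batch, seq_len, vocab).
    """
    log_p = F.log_softmax(baseline_logits, dim=-1)   # reference
    log_q = F.log_softmax(unlearned_logits, dim=-1)  # unlearned
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # per-token KL
    return kl.mean()
```

The same log-softmax outputs can be reused to compute the Truth Ratio by comparing probability mass on ground-truth tokens against hallucinated alternatives.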
| Method | Retraining Cost | Catastrophic Forgetting | Cross-Lingual | Utility |
|---|---|---|---|---|
| Gradient Ascent | High | Severe | Poor | Low |
| Fine-Tuning (LoRA) | Medium | Moderate | Poor | Medium |
| ECO (ITI) | None | None | Partial | High |
ECO avoids retraining entirely and preserves model utility, but cross-lingual leakage remains an open research challenge.
RESULTS & IMPACT
ECO demonstrates that Inference-Time Intervention offers a more robust and cost-effective path to LLM safety than weight-level approaches — preserving model utility while selectively blocking forbidden knowledge.
The cross-lingual leakage discovery is a contribution in itself: it shows that unlearning evaluation must go beyond monolingual benchmarks. This work provides a practical framework for GDPR-compliant LLM deployment — where the ability to demonstrably forget is not optional.
REFERENCES
View Engineering Thesis (PDF) →
- Maini, P., et al. (2024). TOFU: A Task of Fictitious Unlearning for LLMs. arXiv:2401.06121.
- Li, N., et al. (2024). The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. arXiv:2403.03218.
- Qwen Team (2023). Qwen Technical Report. arXiv:2309.16609.
TECH STACK
Frameworks: PyTorch, Hugging Face Transformers/Accelerate, LoRA (Low-Rank Adaptation).
Models: Qwen 0.5B and 7B.
Benchmarks: TOFU, WMDP.
Metrics: KL-Divergence, ROUGE/BLEU, Truth Ratio.
This is an archived project. Please reach out if you have any questions.