Machine Unlearning & LLM Security

ML Researcher — Inference-Time Intervention, Embedding Corruption, Cross-Lingual Evaluation
R&D Team at Aubay INNOV
From June 2024 to November 2024 (6 months)
Can a language model truly forget? Exploring strategies to enforce selective forgetting in LLMs without full retraining — where GDPR's 'right to be forgotten' meets neural network reality.
PyTorch, Hugging Face Transformers/Accelerate, Qwen (0.5B/7B), LoRA
Overview
Can a language model truly forget? Not just refuse to answer — actually lose the knowledge, as if it were never trained on it. I explored Machine Unlearning strategies at Aubay's R&D lab, focusing on Inference-Time Interventions that enforce selective forgetting without the prohibitive cost of full retraining or the collateral damage of naive fine-tuning.
THE PROBLEM
GDPR's "right to be forgotten" creates a hard technical challenge for LLMs: how do you remove specific knowledge from a model that has no delete button? Traditional approaches like Gradient Ascent literally try to "unlearn" by pushing the model away from forbidden outputs — but this often triggers Catastrophic Forgetting, where the model loses general utility while trying to forget specific facts.
The challenge is surgical: suppress the Forget Set while preserving full performance on the Retain Set.
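To make the failure mode concrete, the Gradient Ascent baseline mentioned above amounts to negating the usual next-token loss on the Forget Set. A minimal sketch (illustrative only; the model/optimizer interfaces follow Hugging Face conventions and are not the project's exact code):

```python
import torch


def gradient_ascent_step(model, forget_batch, optimizer):
    """One unlearning step on the Forget Set (sketch).

    Instead of minimizing the next-token loss, we *ascend* on it
    by negating the loss — pushing the model away from the
    forbidden outputs. This is exactly what risks Catastrophic
    Forgetting: the gradient is untargeted and degrades general
    capabilities along with the forgotten facts.
    """
    outputs = model(**forget_batch, labels=forget_batch["input_ids"])
    loss = -outputs.loss  # negate: maximize loss on forget examples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return outputs.loss.item()
```

ECO sidesteps this entirely by never touching the weights.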
ARCHITECTURE
I implemented ECO (Embedding-Corrupted Prompts), a technique that acts directly on the model's latent space at inference time — no weight modification required:
- Classifier Head: A binary classifier trained on hidden states detects whether an input prompt targets forbidden knowledge.
- Embedding Corruption: When a sensitive prompt is flagged, the system injects Gaussian noise or an antagonistic vector into the embedding layer — disrupting the signal before it reaches the transformer blocks.
- Dynamic Switching: The mechanism forces the model into a controlled refusal state for flagged inputs while leaving all other queries completely unaffected.
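The three components above can be sketched as a single forward pass (an illustrative pseudo-implementation, not the project's exact code: the classifier, the noise scale `sigma`, and the pooling over hidden states are assumptions):

```python
import torch


def eco_forward(model, classifier, input_ids, sigma=1.0):
    """ECO-style inference-time intervention (sketch).

    1. Embed the prompt.
    2. A binary classifier over (pooled) hidden states flags whether
       the prompt targets forbidden knowledge.
    3. Flagged prompts get Gaussian noise injected into the embedding
       layer before the transformer blocks; all others pass through
       completely unchanged.
    """
    embeds = model.get_input_embeddings()(input_ids)   # (B, T, d)
    flagged = classifier(embeds.mean(dim=1))           # (B,) in {0, 1}
    noise = torch.randn_like(embeds) * sigma
    # Corrupt only flagged prompts; leave other queries untouched.
    embeds = torch.where(flagged.view(-1, 1, 1).bool(),
                         embeds + noise, embeds)
    return model(inputs_embeds=embeds)
```

Because the intervention happens on `inputs_embeds`, the model's weights are never modified — unflagged traffic is byte-for-byte identical to the vanilla model.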
EXPERIMENTATION
The most striking finding emerged from cross-lingual evaluation:
- Cross-Lingual Leakage: Forgetting a concept in English does not mean forgetting it in Spanish. Information is encoded as distributed representations across the latent space — blocking lexical access in one language leaves the underlying semantics intact in another.
- Forget Quality (FQ): Measured via KL-Divergence between the unlearned model's output distribution and a baseline model that never saw the data.
- Model Utility (MU): Tracked via ROUGE/BLEU scores on the Retain Set — ensuring the model doesn't degrade on everything else while trying to forget one thing.
- Truth Ratio: Monitored the model's confidence in ground-truth tokens vs. hallucinated alternatives during the unlearning process.
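As an illustration of the Forget Quality metric above, the KL-Divergence between the two output distributions can be computed per token from raw logits (a sketch; tensor shapes and names are placeholders):

```python
import torch
import torch.nn.functional as F


def forget_quality_kl(unlearned_logits, baseline_logits):
    """KL(baseline || unlearned), averaged over token positions.

    The baseline is a model that never saw the Forget Set. A lower
    value means the unlearned model's output distribution is closer
    to that never-trained reference — i.e., better forgetting.
    Logits have shape (batch, seq_len, vocab).
    """
    log_p = F.log_softmax(baseline_logits, dim=-1)   # reference
    log_q = F.log_softmax(unlearned_logits, dim=-1)  # unlearned
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # per-token KL
    return kl.mean()
```

The same log-softmax outputs can be reused to compute the Truth Ratio by comparing probability mass on ground-truth tokens against hallucinated alternatives.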
| Method | Retraining Cost | Catastrophic Forgetting | Cross-Lingual | Utility |
|---|---|---|---|---|
| Gradient Ascent | High | Severe | Poor | Low |
| Fine-Tuning (LoRA) | Medium | Moderate | Poor | Medium |
| ECO (ITI) | None | None | Partial | High |
ECO avoids retraining entirely and preserves model utility, but cross-lingual leakage remains an open research challenge.
RESULTS & IMPACT
ECO demonstrates that Inference-Time Intervention offers a more robust and cost-effective path to LLM safety than weight-level approaches — preserving model utility while selectively blocking forbidden knowledge.
The cross-lingual leakage discovery is a contribution in itself: it shows that unlearning evaluation must go beyond monolingual benchmarks. This work provides a practical framework for GDPR-compliant LLM deployment — where the ability to demonstrably forget is not optional.
REFERENCES
View Engineering Thesis (PDF) →
- Maini, P., et al. (2024). TOFU: A Task of Fictitious Unlearning for LLMs. arXiv:2401.06121.
- Li, N., et al. (2024). The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. arXiv:2403.03218.
- Qwen Team (2023). Qwen Technical Report. arXiv:2309.16609.
TECH STACK
Frameworks: PyTorch, Hugging Face Transformers/Accelerate, LoRA (Low-Rank Adaptation).
Models: Qwen 0.5B and 7B.
Benchmarks: TOFU, WMDP.
Metrics: KL-Divergence, ROUGE/BLEU, Truth Ratio.
This is an archived project. Please reach out if you have any questions.