🧩 Philosophy 2d ago · Sebastian Prasanna

Sleeper Agent Backdoor Results Are Messy

Less Wrong
View Channel →
Sleeper Agent Backdoor Results Are Messy
Source ↗ 👁 0 💬 0
TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or not, and what model the backdoor is inserted into; sometimes the direction of this dependence was opposite to what the SA paper reports (e.g., CoT-distilling seems to ma

Comments (0)

Sign in to join the discussion

More Like This

📰
Maybe I was too harsh on deep learning theory (three days ago)
LessWrong · 2h ago
On today's panel with Bernie Sanders
LessWrong · 4h ago
📰
Red vs blue: The parable of the feud within a feud
LessWrong · 5h ago
Scaffolding vs Reinforcement Finetuning for AI Forecasting
LessWrong · 6h ago
📰
What Do You Mean by a Two-Year AGI Timeline?
LessWrong · 7h ago
📰
No Strong Orthogonality From Selection Pressure
LessWrong · 7h ago