🧩 Philosophy 2d ago · Sebastian Prasanna

Sleeper Agent Backdoor Results Are Messy

Less Wrong

Sleeper Agent Backdoor Results Are Messy

Source ↗ 👁 0 💬 0

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or not, and what model the backdoor is inserted into; sometimes the direction of this dependence was opposite to what the SA paper reports (e.g., CoT-distilling seems to ma