🧩 Philosophy 20h ago · Jan Dubiński

Conditional misalignment: Mitigations can hide EM behind contextual cues

Less Wrong
This is the abstract, introduction, and discussion of our new paper. We study three popular mitigations for emergent misalignment (EM): diluting misaligned data with benign data, post-hoc HHH finetuning, and inoculation prompting. We show that each can leave behind conditional misalignment: the model reverts to broadly misaligned behavior when prompts contain cues from the misaligned training data.

Authors: Jan Dubiński, Jan Betley, Daniel Tan, Anna Sztyber-Betley, Owain Evans

See the Twitter t
