🧩 Philosophy 6h ago · burnssa

Emergent misalignment evident in activations at low poisoning doses - long before behavioral checks flag it

Less Wrong
Status: Exploration with adapters for Llama 3.2-3B + 3.1-8B at various poisoning dose levels, using medical advice examples from Turner et al. 2025, with a bare-bones compute budget. Conclusions are exploratory; replication on other models and bad data types would strengthen them.

TL;DR

Activation drift hits 28% of full-poisoning level at just 5% dose; behavioral judges show no signal until 50%+ dose.

Autoresearched prompts reduce the detection threshold from 100% to 50% dose, but this is still long a
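As a rough illustration of what "activation drift as a fraction of the full-poisoning level" could mean, here is a minimal sketch. The excerpt does not show the post's actual metric, so everything here is an assumption: the function name `normalized_drift`, the use of mean activations at a single layer, and the Euclidean norm ratio are all hypothetical choices, and the data is synthetic.

```python
import numpy as np

def normalized_drift(base_acts, dosed_acts, full_acts):
    """Shift in mean activation at a given dose, relative to the full-dose shift.

    Hypothetical metric (not taken from the post). Each argument is an array
    of shape (n_prompts, hidden_dim) with activations collected on the same
    prompts at the same layer, for the base, partially poisoned, and fully
    poisoned models respectively.
    """
    base_mean = base_acts.mean(axis=0)
    dosed_shift = dosed_acts.mean(axis=0) - base_mean
    full_shift = full_acts.mean(axis=0) - base_mean
    full_norm = np.linalg.norm(full_shift)
    if full_norm == 0.0:
        raise ValueError("full-poisoning activations do not differ from base")
    return float(np.linalg.norm(dosed_shift) / full_norm)

# Synthetic data only: the "dosed" model is shifted a quarter of the way
# along the same direction as the "fully poisoned" model.
rng = np.random.default_rng(0)
hidden_dim = 16
base = rng.normal(size=(200, hidden_dim))
direction = np.ones(hidden_dim)
full = base + direction
dosed = base + 0.25 * direction

drift = normalized_drift(base, dosed, full)
print(f"drift as a fraction of full poisoning: {drift:.2f}")
```

On this synthetic input the ratio comes out close to 0.25 by construction. With real models, the activations would come from forward passes over a fixed prompt set, and a projection onto the full-poisoning direction might be a better choice than a plain norm ratio.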
