🧩 Philosophy 6h ago · burnssa

Emergent misalignment evident in activations at low poisoning doses - long before behavioral checks flag it

Less Wrong

Emergent misalignment evident in activations at low poisoning doses - long before behavioral checks flag it

Source ↗ 👁 0 💬 0

Status: Exploration with adapters for Llama 3.2-3B + 3.1-8B at various poisoning dose levels using medical advice examples from Turner et al 2025 with a bare-bones compute budget. Conclusions are exploratory; replication on other models and bad data types would strengthen them.TL;DRActivation drift hits 28% of full-poisoning level at just 5% dose; behavioral judges show no signal until 50%+ dose.Autoresearched prompts reduce the detection threshold from 100% to 50% dose, but this is still long a