Emergent misalignment evident in activations at low poisoning doses - long before behavioral checks flag it
Status: Exploration with adapters for Llama 3.2-3B + 3.1-8B at various poisoning dose levels using medical advice examples from Turner et al. 2025 with a bare-bones compute budget. Conclusions are exploratory; replication on other models and bad data types would strengthen them.

TL;DR

Activation drift hits 28% of full-poisoning level at just 5% dose; behavioral judges show no signal until 50%+ dose.

Autoresearched prompts reduce the detection threshold from 100% to 50% dose, but this is still long a