🧩 Philosophy 15h ago · Chloe Li

Model Spec Midtraining: Improving How Alignment Training Generalizes

LessWrong
tl;dr: We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec, teaching them how they should behave and why. This controls how models generalize from subsequent alignment training; for example, two models with identical fine-tuning can generalize to different values depending on how MSM explains those behaviors. We use MSM to substantially reduce agentic misalignment and study which Model Spec …

