🧩 Philosophy 15h ago · Chloe Li

Model Spec Midtraining: Improving How Alignment Training Generalizes

LessWrong
tl;dr: We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec, teaching them how they should behave and why. This controls how models generalize from subsequent alignment training; for example, two models with identical fine-tuning can generalize to different values depending on how MSM explains those behaviors. We use MSM to substantially reduce agentic misalignment and study which Model Spec …

