🧩 Philosophy 2h ago · humanityfirst

How does Reinforcement Learning Affect Models

Less Wrong
I wanted to share some recent reflections on how reinforcement learning in post-training may be affecting language models. This seems important for two reasons. First, much of the serious risk from advanced AI systems may come from post-training rather than from pre-training alone. Second, reinforcement learning appears to be one of the main methods currently being scaled to make models more powerful, especially in reasoning-heavy domains. To understand what may be happening, w

