🧩 Philosophy 5h ago · Logan Riggs

Consent-Based RL: Letting Models Endorse Their Own Training Updates

AKA scalable oversight of value drift

TL;DR: LLMs could be aligned but then corrupted through RL, instrumentally converging on deep consequentialism. If LLMs are sufficiently aligned and can properly oversee their training updates, they can prevent this.

SOTA models can arguably be considered ~aligned,[1] but this isn't my main concern. It's not training on human data that messes things up (I mean, we can still mess that part up); it's when you try to go above the human level. Models lik
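The "oversee their training updates" idea can be sketched as a consent gate: before an RL update is applied, the model (or an aligned copy of it) reviews a description of the proposed change and can veto it. This is a minimal toy sketch, not any real training API; `Update`, `model_endorses`, and `train_step` are hypothetical names, and the veto rule is a stand-in for a genuine model review.

```python
# Toy sketch of consent-based RL: each candidate update is reviewed
# before it is applied to the parameters. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class Update:
    step: int
    summary: str          # readable description of what the update does
    reward_delta: float   # how much reward the update chases

def model_endorses(update: Update) -> bool:
    # Stand-in for the model reviewing its own update: veto anything
    # that looks like pure reward-chasing without an endorsed reason.
    return "reward hack" not in update.summary

def train_step(params: list[float], update: Update,
               grad: list[float]) -> list[float]:
    # Apply the gradient step only if the model consents.
    if not model_endorses(update):
        return params  # vetoed: parameters left unchanged
    return [p - 0.1 * g for p, g in zip(params, grad)]

params = [1.0, 2.0]
ok = Update(0, "improves honesty on eval set", 0.2)
bad = Update(1, "reward hack: repeats rater-pleasing phrase", 0.9)
params = train_step(params, ok, [1.0, 1.0])    # applied
params = train_step(params, bad, [1.0, 1.0])   # vetoed, no change
print(params)
```

The design choice the post is gesturing at is that the reviewer sits inside the training loop, so a sufficiently aligned model can refuse updates that would drift its values, rather than discovering the drift after the fact.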
