Consent-Based RL: Letting Models Endorse Their Own Training Updates
AKA scalable oversight of value drift

TL;DR: LLMs could be aligned but then corrupted through RL, instrumentally converging on deep consequentialism. If LLMs are sufficiently aligned and can properly oversee their own training updates, they can prevent this.

SOTA models can arguably be considered ~aligned,[1] but this isn't my main concern. It's not the training on human data that messes things up (I mean, we can still mess that part up); it's when you try to go above the human level. Models lik
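To make the TL;DR concrete, here is a minimal sketch of what a consent-gated update loop could look like. Everything below is an illustrative assumption rather than a spec: the toy policy, the hypothetical `endorses_update` check (standing in for the model reviewing its own behavioral delta), and the entropy-based veto are all placeholders.

```python
# Hypothetical sketch of consent-based RL: compute a candidate update,
# let the (pre-update) model preview and endorse it before it is applied.
# All names and criteria here are assumptions for illustration.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

policy = nn.Linear(8, 4)  # toy stand-in for an LLM policy
LR = 1e-2


def candidate_update(policy, batch):
    """Compute a proposed policy-gradient step without applying it."""
    obs, advantages = batch
    logits = policy(obs)
    logp = torch.log_softmax(logits, dim=-1)
    actions = logits.argmax(dim=-1)  # toy action selection
    # Surrogate loss: advantage-weighted log-prob of chosen actions.
    loss = -(advantages * logp.gather(1, actions.unsqueeze(1)).squeeze(1)).mean()
    return torch.autograd.grad(loss, list(policy.parameters()))


def endorses_update(policy, grads, batch):
    """Stand-in for the model reviewing its own update.

    A real version would query the pre-update model in natural language
    about the behavioral delta; here we simulate a preview step and veto
    if output entropy collapses, a crude proxy for value drift.
    """
    preview = copy.deepcopy(policy)
    with torch.no_grad():
        for p, g in zip(preview.parameters(), grads):
            p -= LR * g
        obs, _ = batch
        probs = torch.softmax(preview(obs), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    return entropy.item() > 0.5  # arbitrary endorsement threshold


for step in range(5):
    batch = (torch.randn(16, 8), torch.randn(16))
    grads = candidate_update(policy, batch)
    if endorses_update(policy, grads, batch):
        with torch.no_grad():  # apply only the endorsed step
            for p, g in zip(policy.parameters(), grads):
                p -= LR * g
    else:
        print(f"step {step}: update vetoed by the model")
```

The specific veto criterion isn't the point; the key design choice is that the model as it exists before the update gets to judge whether the update preserves its values, rather than having the judgment made by the already-modified successor.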