🧩 Philosophy 22h ago · Arjun Khandelwal

Early-stage empirical work on “spillway motivations”

LessWrong
Previously, we proposed spillway motivations as a way to mitigate misalignment induced by training a model on flawed reward signals. In this post, we present early-stage empirical results showing how spillway motivations can mitigate test-time reward hacking even when it is reinforced during RL. We compare our results against inoculation prompting (IP) baselines. To recap, the proposal is to train models to have two (possibly conflicting) motivations: intent-aligned motivations …


More Like This

Types and Tokens
Stanford Encyclopedia of Philosophy · 17h ago
Human-looking robots are a bad idea
LessWrong · 18h ago
How Go Players Disempower Themselves to AI
LessWrong · 20h ago
Exploration Hacking: Can LLMs Learn to Resist RL Training?
LessWrong · 22h ago
Conditional misalignment: Mitigations can hide EM behind contextual cues
LessWrong · 23h ago
Ambitious Mech Interp w/ Tensor-transformers on toy languages [Project Proposal]
LessWrong · 1d ago