🧩 Philosophy 19h ago · Eyon Jang

Exploration Hacking: Can LLMs Learn to Resist RL Training?

Less Wrong
View Channel →
Exploration Hacking: Can LLMs Learn to Resist RL Training?
Source ↗ 👁 0 💬 0
We empirically investigate exploration hacking (EH) — where models strategically alter their exploration to resist RL training — by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models for their propensity.Authors: Eyon Jang*, Damon Falck*, Joschka Braun*, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons, Roland S. Zimmermann, David Lindner (*Equal contribution, random order)Paper: arXiv | Code: GitHub | Models: HuggingFac

Comments (0)

Sign in to join the discussion

More Like This

📰
Types and Tokens
Stanford Encyclopedia of Philosophy · 14h ago
Human-looking robots are a bad idea
LessWrong · 15h ago
📰
How Go Players Disempower Themselves to AI
LessWrong · 16h ago
Early-stage empirical work on “spillway motivations”
LessWrong · 18h ago
Conditional misalignment: Mitigations can hide EM behind contextual cues
LessWrong · 20h ago
Ambitious Mech Interp w/ Tensor-transformers on toy languages [Project Proposal]
LessWrong · 20h ago