🧩 Philosophy 12h ago · dx26

Model organisms researchers should check whether high LRs defeat their model organisms

Less Wrong
Thanks to Buck Shlegeris for feedback on a draft of this post.

The goal-guarding hypothesis states that schemers will be able to preserve their goals during training by taking actions which are selected for by the training process. To investigate the goal-guarding hypothesis, we’ve been running experiments of the following form:

We call this type of training “behavior-compatible training”. This type of experiment is common in model organism (MO) research. For example, Sleeper Agents is of this for

