🧩 Philosophy 12h ago · dx26

Model organisms researchers should check whether high LRs defeat their model organisms

Less Wrong

Model organisms researchers should check whether high LRs defeat their model organisms

Source ↗ 👁 0 💬 0

Thanks to Buck Shlegeris for feedback on a draft of this post.The goal-guarding hypothesis states that schemers will be able to preserve their goals during training by taking actions which are selected for by the training process. To investigate the goal-guarding hypothesis, we’ve been running experiments of the following form:We call this type of training “behavior-compatible training”. This type of experiment is common in model organism (MO) research. For example, Sleeper Agents is of this for