Model organisms researchers should check whether high LRs defeat their model organisms
Thanks to Buck Shlegeris for feedback on a draft of this post.

The goal-guarding hypothesis states that schemers will be able to preserve their goals during training by taking actions which are selected for by the training process. To investigate the goal-guarding hypothesis, we’ve been running experiments of the following form:

We call this type of training “behavior-compatible training”. This type of experiment is common in model organism (MO) research. For example, Sleeper Agents is of this for