Control Debt
Source ↗
👁 0
💬 0
Notes on the gap: what control evaluations assume implementation in labs.It is 2027, and a frontier lab grew suspicions: plausibly, their model is scheming. Not a surprise for the control team. For more than a year, they worked on a protocol. Trusted monitoring is tested on their benchmark setting, with all agent actions, as well as with suspiciousness-based defer-to-trusted triggers, thresholds from the red-teaming policy, and human escalation in higher risks. In simulation, the safety/useful
Comments (0)