🧩 Philosophy 12h ago · Ida Caspary

Control Debt

Less Wrong

Notes on the gap between what control evaluations assume and implementation in labs.

It is 2027, and a frontier lab has grown suspicious: plausibly, their model is scheming. This is no surprise to the control team. For more than a year, they have worked on a protocol. Trusted monitoring has been tested on their benchmark setting across all agent actions, along with suspiciousness-based defer-to-trusted triggers, thresholds derived from the red-teaming policy, and human escalation for higher-risk actions. In simulation, the safety/useful