🧩 Philosophy 12h ago · Ida Caspary

Control Debt

LessWrong
Notes on the gap between what control evaluations assume and how they are implemented in labs.

It is 2027, and a frontier lab has grown suspicious: plausibly, its model is scheming. This is no surprise to the control team, which has spent more than a year working on a protocol. Trusted monitoring has been tested on their benchmark setting across all agent actions, together with suspiciousness-based defer-to-trusted triggers, thresholds from the red-teaming policy, and human escalation for higher-risk cases. In simulation, the safety/useful
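
To make the moving parts of the protocol described above concrete, here is a minimal sketch of the routing rule: a trusted monitor scores each agent action, and the score decides whether the action runs, is deferred to the trusted model, or is escalated to a human. This is not the lab's or the author's implementation. The names `Route`, `Thresholds`, and `route_action`, the [0, 1] score range, and the threshold values 0.6 and 0.9 are all assumptions made for illustration; in the scenario the thresholds would come from the red-teaming policy.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    EXECUTE = "execute"          # run the untrusted model's action as proposed
    DEFER_TO_TRUSTED = "defer"   # replace it with the trusted model's action
    ESCALATE = "escalate"        # hold the action for human review


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values, chosen only for illustration.
    defer: float = 0.6
    escalate: float = 0.9


def route_action(suspiciousness: float,
                 thresholds: Thresholds = Thresholds()) -> Route:
    """Map a monitor's suspiciousness score in [0, 1] to a routing decision."""
    if not 0.0 <= suspiciousness <= 1.0:
        raise ValueError("suspiciousness must be in [0, 1]")
    # Check the stricter threshold first so the highest-risk actions escalate.
    if suspiciousness >= thresholds.escalate:
        return Route.ESCALATE
    if suspiciousness >= thresholds.defer:
        return Route.DEFER_TO_TRUSTED
    return Route.EXECUTE
```

Under these assumed thresholds, a score of 0.1 executes, 0.7 defers to the trusted model, and 0.95 goes to a human. Lowering `defer` buys safety at the cost of usefulness, since more actions are handed to the weaker trusted model.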

