🧩 Philosophy 6d ago · kerksta

Logit ROCs: Monitor TPR is linear in FPR in logit space

Less Wrong
Kerrick Staley, Aryan Bhatt, Julian Stastny

Summary

We study trusted monitoring for AI control, where a weaker trusted model reviews the actions of a stronger untrusted agent and flags suspicious behavior for human audit. We propose a simple mathematical model relating safety (true positive rate) to audit budget (false positive rate) at realistic, low FPRs: logit(TPR) is linear in logit(FPR). Equivalently, benign and attack scores can be modeled with logistic distributions. This gives two practica
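The link between logistic score distributions and a straight line in logit space can be checked numerically. The sketch below is illustrative only, not the authors' code: it assumes benign and attack monitor scores each follow a logistic distribution with its own location and scale, and every parameter value (`MU_BENIGN`, `S_BENIGN`, `MU_ATTACK`, `S_ATTACK`) is hypothetical. Under that assumption, the audit threshold that yields a given FPR is `mu_benign - s_benign * logit(FPR)`, so logit(TPR) = (mu_attack - mu_benign) / s_attack + (s_benign / s_attack) * logit(FPR).

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def tpr_at_fpr(fpr, mu_benign, s_benign, mu_attack, s_attack):
    """TPR at the threshold that flags a fraction `fpr` of benign scores.

    Assumes both score distributions are logistic (an assumption of this
    sketch), so P(score > t) = sigmoid((mu - t) / s) for each class.
    """
    threshold = mu_benign - s_benign * logit(fpr)
    return sigmoid((mu_attack - threshold) / s_attack)


# Hypothetical parameters, chosen only for illustration.
MU_BENIGN, S_BENIGN = 0.0, 1.0
MU_ATTACK, S_ATTACK = 3.0, 2.0

fprs = [1e-4, 1e-3, 1e-2]
points = [
    (logit(f), logit(tpr_at_fpr(f, MU_BENIGN, S_BENIGN, MU_ATTACK, S_ATTACK)))
    for f in fprs
]

# Slope and intercept of the line through the first and last points.
slope = (points[-1][1] - points[0][1]) / (points[-1][0] - points[0][0])
intercept = points[0][1] - slope * points[0][0]

# Predicted values from the closed form above.
expected_slope = S_BENIGN / S_ATTACK
expected_intercept = (MU_ATTACK - MU_BENIGN) / S_ATTACK

print(f"slope={slope:.4f} (expected {expected_slope:.4f})")
print(f"intercept={intercept:.4f} (expected {expected_intercept:.4f})")
```

In this parametrization the slope is the ratio of the two scales and the intercept is the separation between the two locations measured in attack-scale units, so a monitor that separates the classes better shifts the whole line upward.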

