Logit ROCs: Monitor TPR is linear in FPR in logit space

Less Wrong

Kerrick Staley, Aryan Bhatt, Julian Stastny

Summary

We study trusted monitoring for AI control, where a weaker trusted model reviews the actions of a stronger untrusted agent and flags suspicious behavior for human audit. We propose a simple mathematical model relating safety (true positive rate) to audit budget (false positive rate) at realistic, low FPRs: logit(TPR) is linear in logit(FPR). Equivalently, benign and attack scores can be modeled with logistic distributions. This gives two practica