Logit ROCs: Monitor TPR is linear in FPR in logit space
Kerrick Staley, Aryan Bhatt, Julian Stastny

Summary

We study trusted monitoring for AI control, where a weaker trusted model reviews the actions of a stronger untrusted agent and flags suspicious behavior for human audit. We propose a simple mathematical model relating safety (true positive rate) to audit budget (false positive rate) at realistic, low FPRs: logit(TPR) is linear in logit(FPR). Equivalently, benign and attack scores can be modeled with logistic distributions. This gives two practica
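
The equivalence stated in the summary, between logistic score distributions and a linear logit-space ROC, can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and the location/scale values are hypothetical and chosen only for illustration. It assumes benign scores follow Logistic(benign_loc, benign_scale), attack scores follow Logistic(attack_loc, attack_scale), and an action is flagged when its score exceeds a threshold. Under those assumptions, logit(TPR) = (attack_loc - benign_loc) / attack_scale + (benign_scale / attack_scale) * logit(FPR).

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_sf(t: float, loc: float, scale: float) -> float:
    """P(score > t) when score ~ Logistic(loc, scale)."""
    return sigmoid((loc - t) / scale)


def tpr_at_fpr(fpr: float, benign_loc: float, benign_scale: float,
               attack_loc: float, attack_scale: float) -> float:
    """TPR at the threshold that flags a fraction `fpr` of benign scores."""
    # Solve logistic_sf(threshold, benign_loc, benign_scale) == fpr.
    threshold = benign_loc - benign_scale * logit(fpr)
    return logistic_sf(threshold, attack_loc, attack_scale)


# Hypothetical parameters, for illustration only (not taken from the post).
BENIGN_LOC, BENIGN_SCALE = 0.0, 1.0
ATTACK_LOC, ATTACK_SCALE = 4.0, 2.0

# Slope and intercept of the predicted line in logit space.
slope = BENIGN_SCALE / ATTACK_SCALE
intercept = (ATTACK_LOC - BENIGN_LOC) / ATTACK_SCALE

for fpr in (1e-4, 1e-3, 1e-2):
    tpr = tpr_at_fpr(fpr, BENIGN_LOC, BENIGN_SCALE, ATTACK_LOC, ATTACK_SCALE)
    predicted = intercept + slope * logit(fpr)
    print(f"FPR={fpr:g}  logit(TPR)={logit(tpr):.6f}  line={predicted:.6f}")
```

In this idealized model the two printed columns agree up to floating-point error at every FPR. With real monitor scores the relationship would only be approximate, and the slope and intercept would have to be estimated from data.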