How do intentional secret loyalties differ from other schemer motivations?
Thanks to Vivek Hebbar, Fabien Roger, Lukas Finnveden, and Alex Mallen for discussion, and to Tom Davidson, Joe Kwon, and Ryan Greenblatt, whose writing I've drawn on liberally. Errors are my own.

Introduction

Motivation

One obstacle to achieving a near-best future is an AI-enabled coup, and one route to an AI-enabled coup is someone intentionally inserting a secret loyalty into a model.

I think a natural response might be: Okay, so the worry is "AI with hidden goal causes an existential catastrophe"