🧩 Philosophy 11h ago · Cleo Nardo

How do intentional secret loyalties differ from other schemer motivations?

Less Wrong
Thanks to Vivek Hebbar, Fabien Roger, Lukas Finnveden, and Alex Mallen for discussion, and to Tom Davidson, Joe Kwon, and Ryan Greenblatt whose writing I've drawn on liberally. Errors are my own.

Introduction

Motivation

One obstacle to achieving a near-best future is an AI-enabled coup, and one route to an AI-enabled coup is someone intentionally inserting a secret loyalty into a model.

I think a natural response might be: Okay, so the worry is "AI with hidden goal causes an existential catastrophe"
