🧩 Philosophy 11h ago · Ziqian Zhong

Spontaneous introspection in output tampering

LessWrong
Epistemic status: This is a side project on which I spent relatively little effort. People are welcome to continue the work.

Content warning: This post includes transcripts of language models exhibiting sustained frustration, distress-like outputs, and compulsive behavior under adversarial conditions. The post also contains jailbreak prompt examples for illustrative purposes.

TL;DR: We investigate output-level introspection, where models recognize output tampering, in which their current or previous
