🧩 Philosophy 9h ago · Angela Tang

Alignment Faking Replication and Chain-of-Thought Monitoring Extensions

Less Wrong
View Channel →
Alignment Faking Replication and Chain-of-Thought Monitoring Extensions
Source ↗ 👁 0 💬 0
In this post, I present a replication and extension of the alignment faking model organism (code on GitHub):Replication. I reproduced the alignment faking (AF) setup from Greenblatt et al. (2024) using the improved classifiers from Hughes et al. (2025) and demonstrated AF behavior in Hermes-3-Llama-3.1-405B.System prompt ablations. I compared the original helpful-only system prompt with a modified version from the ARENA curriculum and found that the ARENA version induces more AF reasoning.CoT mo

Comments (0)

Sign in to join the discussion

More Like This

📰
How does Reinforcement Learning Affect Models
LessWrong · 3h ago
📰
The Case For Universalism
LessWrong · 4h ago
Emergent misalignment evident in activations at low poisoning doses - long before behavioral checks flag it
LessWrong · 7h ago
📰
Massapequa ACX Meetup
LessWrong · 7h ago
📰
Retrospective on my unsupervised elicitation challenge
LessWrong · 8h ago
📰
Training a Transformer to Compose One Step Per Layer (and Proving It)
LessWrong · 9h ago