Why I'm excited about meta-models for interpretability

LessWrong
I'm pretty excited about training models to interpret aspects of other models. Mechanistic interpretability techniques for understanding models (e.g. circuit-level analysis) are cool and have led to a lot of interesting results. But I think non-mechanistic schemes that use meta-models – models trained to understand aspects of another model – are under-researched. The simplest kind of meta-model is the linear probe, but I think methods that t
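To make the "simplest kind of meta-model" concrete, here is a minimal sketch of a linear probe: a logistic-regression classifier trained on a frozen model's hidden activations to decode some property of the input. Everything here is illustrative – the activations are synthetic stand-ins generated under the assumption that the probed model encodes the property along a single linear direction.

```python
import numpy as np

# Hypothetical setup: pretend `acts` are hidden activations from some frozen
# model, and `labels` is a binary property of the inputs we want to decode.
rng = np.random.default_rng(0)
n, d = 2000, 32
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)   # assumed linear encoding direction (illustrative)
acts = rng.normal(size=(n, d)) + np.outer(labels - 0.5, direction)

# A linear probe is just logistic regression on the frozen activations;
# the probed model's weights are never updated.
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(500):
    logits = acts @ w + b
    p = 1.0 / (1.0 + np.exp(-logits))      # predicted probability of label 1
    w -= lr * (acts.T @ (p - labels) / n)  # gradient of mean cross-entropy
    b -= lr * float(np.mean(p - labels))

acc = float(np.mean((acts @ w + b > 0) == labels))
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy is evidence that the property is linearly decodable from the activations; the more ambitious meta-model methods the post gestures at would replace this single linear map with a trained network that reads richer aspects of the target model.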

