Mechanistic Interpretability
Below are some resources I recommend for anyone looking to learn more about the space:
Featured Research & Perspectives
The Urgency of Interpretability
In this essay, Dario Amodei, CEO of Anthropic, makes a compelling case that interpretability research is not just academically interesting but urgently necessary for AI safety. He argues that as AI systems become more powerful, our ability to understand their internal mechanisms becomes critical for ensuring alignment with human values and preventing unintended consequences.
Open Problems in Mechanistic Interpretability
This comprehensive survey by Sharkey et al. (2025) outlines the key challenges and research directions in mechanistic interpretability. The paper categorizes open problems into theoretical foundations, empirical methods, and scaling challenges, providing a roadmap for future research.
🛠️ Coming soon: my own neuron-lens experiments & code notebooks.
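In the meantime, here is a minimal sketch of the kind of neuron-activation inspection a "neuron lens" typically involves, not the forthcoming notebooks themselves. It assumes `torch` and a recent Hugging Face `transformers` install; the model name, layer index, and prompt are illustrative choices.

```python
# Minimal "neuron lens" sketch: register a forward hook on one of GPT-2's
# MLP activation functions and print the neurons that fire most strongly
# on the last token of a prompt. Model, layer, and prompt are illustrative.
import torch
from transformers import AutoTokenizer, GPT2Model

model_name = "gpt2"  # GPT-2 small: 12 layers, 3072 MLP neurons per layer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPT2Model.from_pretrained(model_name)
model.eval()

captured = {}

def make_hook(name):
    # Forward hook: stash the post-GELU MLP activations for later inspection.
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

layer = 6  # an arbitrary mid-depth layer
model.h[layer].mlp.act.register_forward_hook(make_hook(f"mlp_post_{layer}"))

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    model(**inputs)

# Shape: (batch, seq_len, 3072). Inspect the final token position.
acts = captured[f"mlp_post_{layer}"][0, -1]
values, indices = acts.topk(5)
for neuron, value in zip(indices.tolist(), values.tolist()):
    print(f"layer {layer}, neuron {neuron}: activation {value:.3f}")
```

Hooking the activation function (rather than the MLP block's output projection) captures the 3072-dimensional post-nonlinearity values that mechanistic interpretability work usually means by "neurons."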