Transcoders¶
Introduction¶
In addition to SAEs (Sparse Autoencoders) and probes, transcoders are another approach for constructing interpretable features from a model. Their central idea is to replace the original Transformer MLP sublayer with a learned representation that is wider and sparser.
Let \(x\) denote an input text. The initial representation for any \(x \in D\) (where \(D\) is the dataset) is denoted as \(x_{pre}^{(0,t)}\).
Here, the superscript \(t\) indexes the token position.
Token Updates in a Transformer¶
Recall how token states are updated in a Transformer. To write this process schematically, we introduce the following notation:
- \(x_{pre}^{(l,t)}\) — representation of token \(t\) before the attention sublayer of block \(l\),
- \(x_{mid}^{(l,t)}\) — representation after attention,
- \(x_{post}^{(l,t)}\) — the final representation after the whole block.
In this notation, token updates through the Transformer layers proceed as follows:
- Attention update: \(x_{mid}^{(l,t)} = x_{pre}^{(l,t)} + \mathrm{Attn}^{(l)}\!\left(x_{pre}^{(l,1)}, \dots, x_{pre}^{(l,t)}\right)\)  (1)
- MLP update (residual connection): \(x_{post}^{(l,t)} = x_{mid}^{(l,t)} + \mathrm{MLP}^{(l)}\!\left(x_{mid}^{(l,t)}\right)\)  (2)
- Formula (1) reflects how the attention sublayer updates the hidden state of token \(t\).
- Formula (2) reflects how the MLP sublayer updates the hidden state of token \(t\).
- Note that in formula (2), the output of the MLP is added back via a residual connection.
The combination of formulas (1) and (2) leads to a key theoretical result:
any hidden state of an arbitrary token \(t\) can be decomposed into the outputs of all preceding layers.
This result is directly related to the concept of the residual stream —
defined as the sum of outputs from all previous layers together with the input embedding.
Equivalently, it is the hidden state of a token at the current layer.
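One way to write this out explicitly, assuming the indexing convention \(x_{pre}^{(l+1,t)} = x_{post}^{(l,t)}\), is to unroll formulas (1) and (2):

\[ x_{pre}^{(l,t)} = x_{pre}^{(0,t)} + \sum_{l'=0}^{l-1} \left( \mathrm{Attn}^{(l')}\!\left(x_{pre}^{(l',1)}, \dots, x_{pre}^{(l',t)}\right) + \mathrm{MLP}^{(l')}\!\left(x_{mid}^{(l',t)}\right) \right), \]

i.e. the residual stream at layer \(l\) is the input embedding plus the outputs of every earlier attention and MLP sublayer.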
Source: A Mathematical Framework for Transformer Circuits, Anthropic
What Does a Transcoder Look Like?¶
At this point, we have all the tools to understand how a transcoder works.
As mentioned earlier, a transcoder is a trainable replacement for the MLP sublayer.
It consists of two sequential transformations:

\[ z_{TC}^{(l)}(x) = \mathrm{ReLU}\!\left( W_{enc}^{(l)}\, x + b_{enc}^{(l)} \right), \]

\[ TC^{(l)}(x) = W_{dec}^{(l)}\, z_{TC}^{(l)}(x) + b_{dec}^{(l)}. \]
Here \(W_{enc}, W_{dec}\) are trainable feature matrices, obtained by minimizing the following loss function over MLP inputs \(x = x_{mid}^{(l,t)}\) from the dataset:

\[ \mathcal{L}(x) = \left\| \mathrm{MLP}^{(l)}(x) - TC^{(l)}(x) \right\|_2^2 + \lambda \left\| z_{TC}^{(l)}(x) \right\|_1 . \]
- The first term enforces a faithful approximation of the original MLP output.
- The second term enforces sparsity of the feature activations.
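As a concrete illustration, here is a minimal PyTorch sketch of a transcoder and this loss under the assumptions above (ReLU activations, one transcoder per layer); the class and argument names (`Transcoder`, `d_model`, `n_features`, `sparsity_coeff`) are illustrative, not taken from any particular library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Transcoder(nn.Module):
    """Sparse, widened stand-in for one MLP sublayer (illustrative sketch)."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Rows of encoder.weight are the encoder feature vectors f_enc^(l,i);
        # columns of decoder.weight are the decoder feature vectors f_dec^(l,i).
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def activations(self, x_mid: torch.Tensor) -> torch.Tensor:
        # z_TC^(l)(x_mid): sparse, non-negative feature activations.
        return F.relu(self.encoder(x_mid))

    def forward(self, x_mid: torch.Tensor) -> torch.Tensor:
        # TC^(l)(x_mid): approximation of MLP^(l)(x_mid).
        return self.decoder(self.activations(x_mid))


def transcoder_loss(tc: Transcoder, x_mid: torch.Tensor,
                    mlp_out: torch.Tensor, sparsity_coeff: float = 1e-3) -> torch.Tensor:
    """Faithfulness (L2 to the real MLP output) plus L1 sparsity on activations."""
    z = tc.activations(x_mid)
    recon = tc.decoder(z)
    faithfulness = (recon - mlp_out).pow(2).sum(dim=-1).mean()
    sparsity = z.abs().sum(dim=-1).mean()
    return faithfulness + sparsity_coeff * sparsity
```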
Interpretable Features¶
As a result, each interpretable feature \(i\) is associated with two vectors:
- \(f_{enc}^{(l,i)}\) — the \(i\)-th row of the encoder matrix \(W_{enc}^{(l)}\),
- \(f_{dec}^{(l,i)}\) — the \(i\)-th column of the decoder matrix \(W_{dec}^{(l)}\).
In words:
a transcoder is a sparse approximation of the MLP.
Relation to Transformer Updates¶
Recall the embedding state update in the Transformer:

\[ x_{post}^{(l,t)} = x_{mid}^{(l,t)} + \mathrm{MLP}^{(l)}\!\left(x_{mid}^{(l,t)}\right). \]

The transcoder, during training, approximates the MLP part:

\[ \mathrm{MLP}^{(l)}\!\left(x_{mid}^{(l,t)}\right) \approx TC^{(l)}\!\left(x_{mid}^{(l,t)}\right). \]

Thus, the embedding update can be rewritten approximately as:

\[ x_{post}^{(l,t)} \approx x_{mid}^{(l,t)} + TC^{(l)}\!\left(x_{mid}^{(l,t)}\right). \]
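A quick way to sanity-check this approximation, reusing the hypothetical `Transcoder` sketch above (`mlp` stands for the frozen MLP sublayer of layer \(l\), `x_mid` for a batch of \(x_{mid}^{(l,t)}\) vectors; both names are illustrative):

```python
import torch

def relative_reconstruction_error(tc, mlp, x_mid: torch.Tensor) -> torch.Tensor:
    """Mean relative error between the frozen MLP output and the transcoder output."""
    with torch.no_grad():
        err = (mlp(x_mid) - tc(x_mid)).norm(dim=-1) / mlp(x_mid).norm(dim=-1)
    return err.mean()
```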
Circuits Between Layers¶
The goal of transcoders is not only to decompose the MLP into interpretable features,
but also to construct circuits between these features.
By definition, for two transcoders at layers \(l\) and \(l' > l\),
the contribution of feature \(i\) at layer \(l\) to the activation of feature \(i'\) at layer \(l'\) is:

\[ z_{TC}^{(l,i)}\!\left(x_{mid}^{(l,t)}\right) \left( f_{dec}^{(l,i)} \cdot f_{enc}^{(l',i')} \right), \]
where:
- \(f_{enc}^{(l',i')}\) — the \(i'\)-th row of the encoder matrix \(W_{enc}^{(l')}\) at layer \(l'\),
- \(f_{dec}^{(l,i)}\) — the \(i\)-th column of the decoder matrix \(W_{dec}^{(l)}\) at layer \(l\),
- \(z_{TC}^{(l,i)}(x_{mid}^{(l,t)})\) — a scalar reflecting the activation of feature \(i\) at layer \(l\) for token \(t\).
The product captures the interaction mechanism of features:
the activation of a feature at the earlier layer is multiplied by the dot product of that layer’s decoder vector with the later layer’s encoder vector.
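A sketch of how this quantity could be computed for a single pair of features, reusing the hypothetical `Transcoder` class above (`tc_l` and `tc_lp` stand for the transcoders at layers \(l\) and \(l'\), `x_mid_l` for one vector \(x_{mid}^{(l,t)}\); all names are illustrative):

```python
import torch
import torch.nn.functional as F

def feature_to_feature_contribution(tc_l, tc_lp, x_mid_l: torch.Tensor,
                                    i: int, i_prime: int) -> torch.Tensor:
    """Contribution of feature i of the layer-l transcoder to the (pre-ReLU)
    activation of feature i' of the layer-l' transcoder, for one token."""
    z_i = F.relu(tc_l.encoder(x_mid_l))[i]        # scalar z_TC^(l,i)(x_mid^(l,t))
    f_dec_i = tc_l.decoder.weight[:, i]           # i-th column of W_dec^(l)
    f_enc_ip = tc_lp.encoder.weight[i_prime]      # i'-th row of W_enc^(l')
    return z_i * torch.dot(f_dec_i, f_enc_ip)
```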
Why This Contribution Measure Arises¶
The embedding state update can be rewritten as:

\[ x_{post}^{(l,t)} \approx x_{mid}^{(l,t)} + TC^{(l)}\!\left(x_{mid}^{(l,t)}\right). \]
For describing feature \(i\), we have two vectors:
- \(f_{enc}^{(l,i)}\) — the \(i\)-th row of the encoder matrix \(W_{enc}^{(l)}\),
- \(f_{dec}^{(l,i)}\) — the \(i\)-th column of the decoder matrix \(W_{dec}^{(l)}\).
The activation of feature \(i\) in the encoder is given by:

\[ z_{TC}^{(l,i)}\!\left(x_{mid}^{(l,t)}\right) = \mathrm{ReLU}\!\left( f_{enc}^{(l,i)} \cdot x_{mid}^{(l,t)} + b_{enc}^{(l,i)} \right), \]

where the bias \(b_{enc}^{(l,i)}\) can be ignored, since it does not depend on the input.
First conclusion:
the main contribution to the activation of feature \(i\) depends only on \(x_{mid}^{(l,t)}\).
Because of residual connections, each \(x_{mid}^{(l,t)}\) in turn can be decomposed into the sum of all previous components.
Since the following decompositions hold:

Attention block (first sub-block of a Transformer block):

\[ x_{mid}^{(l,t)} = x_{pre}^{(l,t)} + \mathrm{Attn}^{(l)}\!\left(x_{pre}^{(l,1)}, \dots, x_{pre}^{(l,t)}\right). \]

MLP block (second sub-block):

\[ x_{pre}^{(l+1,t)} = x_{post}^{(l,t)} = x_{mid}^{(l,t)} + \mathrm{MLP}^{(l)}\!\left(x_{mid}^{(l,t)}\right). \]
Particular Case

For \(l = 2\):

\[ x_{mid}^{(2,t)} = x_{pre}^{(2,t)} + \mathrm{Attn}^{(2)}\!\left(x_{pre}^{(2,1)}, \dots, x_{pre}^{(2,t)}\right), \]

but since

\[ x_{pre}^{(2,t)} = x_{post}^{(1,t)}, \]

we have

\[ x_{mid}^{(2,t)} = x_{post}^{(1,t)} + \mathrm{Attn}^{(2)}\!\left(x_{pre}^{(2,1)}, \dots, x_{pre}^{(2,t)}\right), \]

and since

\[ x_{post}^{(1,t)} = x_{mid}^{(1,t)} + \mathrm{MLP}^{(1)}\!\left(x_{mid}^{(1,t)}\right), \]

it follows:

\[ x_{mid}^{(2,t)} = x_{mid}^{(1,t)} + \mathrm{MLP}^{(1)}\!\left(x_{mid}^{(1,t)}\right) + \mathrm{Attn}^{(2)}\!\left(x_{pre}^{(2,1)}, \dots, x_{pre}^{(2,t)}\right). \]
Interpretation¶
From this decomposition it is clear:
the influence of the previous MLP on feature \(i'\) in the next layer \(l' > l\) is expressed as

\[ f_{enc}^{(l',i')} \cdot \mathrm{MLP}^{(l)}\!\left(x_{mid}^{(l,t)}\right) \approx f_{enc}^{(l',i')} \cdot TC^{(l)}\!\left(x_{mid}^{(l,t)}\right). \]
Thus, the contribution measure arises naturally from the residual decomposition of Transformer updates:
each feature’s influence in the next layer is captured through its projection via the transcoder. By definition, the transcoder at layer \(l\) can be decomposed into individual features:

\[ TC^{(l)}\!\left(x_{mid}^{(l,t)}\right) = \sum_{i} z_{TC}^{(l,i)}\!\left(x_{mid}^{(l,t)}\right) f_{dec}^{(l,i)} + b_{dec}^{(l)}. \]
As before, the bias term can be ignored.
Thus, substituting this decomposition into the earlier relation:

\[ f_{enc}^{(l',i')} \cdot TC^{(l)}\!\left(x_{mid}^{(l,t)}\right) = \sum_{i} z_{TC}^{(l,i)}\!\left(x_{mid}^{(l,t)}\right) \left( f_{dec}^{(l,i)} \cdot f_{enc}^{(l',i')} \right), \]

we obtain the interaction between features: each term

\[ z_{TC}^{(l,i)}\!\left(x_{mid}^{(l,t)}\right) \left( f_{dec}^{(l,i)} \cdot f_{enc}^{(l',i')} \right) \]

is the contribution of feature \(i\) at layer \(l\) to the activation of feature \(i'\) at layer \(l'\).
Here we use the encoder feature of the later layer (\(f_{enc}^{(l',i')}\)),
because the encoder row is the component through which the residual stream enters the next transcoder,
which is what enables feature-to-feature interactions across layers.
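The same computation can be vectorised over all feature pairs (again a sketch reusing the imports and `Transcoder` conventions above): the matrix of dot products \(f_{dec}^{(l,i)} \cdot f_{enc}^{(l',i')}\) does not depend on the input, and for a given token it is simply rescaled row-wise by the activations \(z_{TC}^{(l,i)}\).

```python
def pairwise_feature_contributions(tc_l, tc_lp, x_mid_l):
    """Contributions of every layer-l feature (rows) to every layer-l'
    feature (columns) for one token."""
    z_l = F.relu(tc_l.encoder(x_mid_l))                        # (n_features_l,)
    # Input-independent part: f_dec^(l,i) . f_enc^(l',i') for all pairs.
    dots = tc_l.decoder.weight.T @ tc_lp.encoder.weight.T      # (n_features_l, n_features_l')
    return z_l.unsqueeze(-1) * dots
```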
Adding the Contribution of Attention Heads¶
Up to this point, we have traced connections between MLP layers using transcoders.
However, since each new token representation also depends on the contribution of attention heads,
we must incorporate these contributions in order to build complete feature-level circuits.
How attention heads form the output vector in the attention mechanism (source)
Suppose we have a transcoder with feature \(i'\) at layer \(l'\). Let \(h\) denote an attention head in a previous layer \(l < l'\).
We want to answer:
What is the contribution of source token \(s\) at layer \(l\) in head \(h\) to the activation of feature \(i'\) at layer \(l'\) for target token \(t\)?
In other words:
how does a specific token in a specific attention head at the previous layer influence our fixed feature in the later layer?
Attention mechanism (source)
This can be expressed as:

\[ a_{t,s}^{(l,h)} \; f_{enc}^{(l',i')} \cdot \left( W_{OV}^{(l,h)} \, x_{pre}^{(l,s)} \right), \]

where \(a_{t,s}^{(l,h)}\) is the attention weight from target token \(t\) to source token \(s\) in head \(h\), and \(W_{OV}^{(l,h)}\) is the OV matrix of that head (defined below).
Why This Contribution Measure Arises¶
The output of layer \(l\) from the attention sub-block for token \(t\) is:

\[ \mathrm{Attn}^{(l)}\!\left(x_{pre}^{(l,1)}, \dots, x_{pre}^{(l,t)}\right) = \sum_{h} \mathrm{Attn}^{(l,h)}\!\left(x_{pre}^{(l,1)}, \dots, x_{pre}^{(l,t)}\right). \]

Each head can be further decomposed into a sum over source tokens \(s\):

\[ \mathrm{Attn}^{(l,h)}\!\left(x_{pre}^{(l,1)}, \dots, x_{pre}^{(l,t)}\right) = \sum_{s \le t} a_{t,s}^{(l,h)} \, W_{OV}^{(l,h)} \, x_{pre}^{(l,s)}. \]
Where does \(W_{OV}^{(l,h)}\) come from?¶
From the definition of the OV-circuit:

\[ W_{OV}^{(l,h)} = W^O_{h}\, W_V^{(l,h)}. \]

If the head output is

\[ o^{(l,h)}_t = \sum_{s \le t} a_{t,s}^{(l,h)} \, W_V^{(l,h)} \, x_{pre}^{(l,s)}, \]

then the final output vector is:

\[ \mathrm{Attn}^{(l)}\!\left(x_{pre}^{(l,1)}, \dots, x_{pre}^{(l,t)}\right) = W^O \left[ o^{(l,h_0)}_t ; o^{(l,h_1)}_t ; \dots ; o^{(l,h_H)}_t \right]. \]

Since \(W^O = [W^O_{h_0}, W^O_{h_1}, \dots, W^O_{h_H}]\),
we can equivalently express the contribution of head \(h\) as:

\[ W^O_{h} \, o^{(l,h)}_t, \]

which corresponds to:

\[ W^O_{h} \sum_{s \le t} a_{t,s}^{(l,h)} \, W_V^{(l,h)} \, x_{pre}^{(l,s)} = \sum_{s \le t} a_{t,s}^{(l,h)} \, W_{OV}^{(l,h)} \, x_{pre}^{(l,s)}. \]
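A sketch of how the per-head \(W_{OV}^{(l,h)}\) matrices could be assembled from stacked projection weights; the shapes assumed here (`W_V` of shape `(n_heads*d_head, d_model)`, `W_O` of shape `(d_model, n_heads*d_head)`, heads concatenated along the inner axis) are an illustrative convention, not a specific library's layout.

```python
import torch

def per_head_ov_matrices(W_V: torch.Tensor, W_O: torch.Tensor,
                         n_heads: int, d_head: int) -> list[torch.Tensor]:
    """Return W_OV^(l,h) = W^O_h @ W_V^(l,h) for every head h."""
    W_V_heads = W_V.view(n_heads, d_head, -1)   # (H, d_head, d_model)
    W_O_heads = W_O.view(-1, n_heads, d_head)   # (d_model, H, d_head)
    # Each product maps the residual stream at the source token to the
    # head's contribution in the residual stream of the target token.
    return [W_O_heads[:, h, :] @ W_V_heads[h] for h in range(n_heads)]
```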
Final Expression¶
The contribution of head \(h\) at layer \(l\) to feature \(i'\) at layer \(l'\) is:

\[ f_{enc}^{(l',i')} \cdot \left( \sum_{s \le t} a_{t,s}^{(l,h)} \, W_{OV}^{(l,h)} \, x_{pre}^{(l,s)} \right). \]

Therefore, for each source token \(s\), the contribution to the activation of feature \(i'\) is:

\[ a_{t,s}^{(l,h)} \; f_{enc}^{(l',i')} \cdot \left( W_{OV}^{(l,h)} \, x_{pre}^{(l,s)} \right). \]
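Putting the pieces together, a sketch of this per-source-token contribution (`attn_weights` is the head's attention pattern \(a^{(l,h)}\), `W_OV_h` one matrix from the helper above, `x_pre_l` the layer-\(l\) residual stream stacked over positions, and `f_enc_ip` the row \(f_{enc}^{(l',i')}\); all names are illustrative):

```python
import torch

def head_token_to_feature_contribution(attn_weights: torch.Tensor,
                                       W_OV_h: torch.Tensor,
                                       x_pre_l: torch.Tensor,
                                       f_enc_ip: torch.Tensor,
                                       t: int, s: int) -> torch.Tensor:
    """a^(l,h)_{t,s} * f_enc^(l',i') . (W_OV^(l,h) @ x_pre^(l,s))."""
    return attn_weights[t, s] * torch.dot(f_enc_ip, W_OV_h @ x_pre_l[s])
```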