This is a blog post version of the paper we wrote on the same topic.
Understanding the inner workings of large language models (LLMs) involves analyzing their internal representations at various levels of granularity. One approach focuses on analyzing “features”: generalized computational units, such as neurons or SAE latents, which potentially offer a precise lens for interpreting the model’s behavior. The standard way to describe a feature, often called MaxAct, is input-centric: collect the inputs that most strongly activate it and ask an explainer model to summarize them. This view can miss what the feature actually does to the model’s outputs, so we propose two inexpensive, output-centric description methods.
The first method, dubbed VocabProj, simply applies vocabulary projection (a.k.a. logit lens) to the feature: the feature’s vector is projected onto the vocabulary space through the model’s unembedding matrix, and an explainer model is tasked with interpreting the top tokens it promotes.
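To make this concrete, here is a minimal sketch of the projection step, assuming access to a feature direction and the model’s unembedding matrix. The model name and the random `feature_vec` placeholder are illustrative (in practice the direction would be, e.g., an SAE decoder column), not the exact pipeline from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the post's examples use Gemma-2 2B
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

W_U = model.get_output_embeddings().weight            # (vocab_size, hidden_dim)
feature_vec = torch.randn(model.config.hidden_size)   # placeholder for a real feature direction

# VocabProj-style step: project the feature direction onto the vocabulary
# and keep the tokens it promotes most strongly.
token_scores = W_U @ feature_vec                       # (vocab_size,)
top_ids = token_scores.topk(10).indices
top_tokens = [tok.decode(int(i)) for i in top_ids]
print(top_tokens)  # this token list would then be handed to an explainer model
```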
The second method, dubbed TokenChange, takes a more causal approach. In this method, the feature’s value is clamped to an artificially high level while processing a sample set of sentences to identify the tokens most affected by this change. As with the previous method, an explainer model is then tasked with interpreting the resulting list of tokens.
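A minimal sketch of this kind of intervention is below, under the simplifying assumption that the feature can be amplified by adding a scaled direction to a block’s hidden states. The hook point (block 6 of GPT-2), the clamp scale, the sample sentences, and the random feature direction are all placeholders for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

feature_vec = torch.randn(model.config.hidden_size)  # placeholder feature direction
feature_vec = feature_vec / feature_vec.norm()
SCALE = 20.0  # the "artificially high" clamp value

def amplify_hook(module, inputs, output):
    # Add the scaled feature direction to every position's hidden state.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * feature_vec.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

sentences = ["The weather today is", "She opened the door and"]
diffs = torch.zeros(model.config.vocab_size)

for text in sentences:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        base = model(ids).logits[0, -1].softmax(-1)
        handle = model.transformer.h[6].register_forward_hook(amplify_hook)
        steered = model(ids).logits[0, -1].softmax(-1)
        handle.remove()
    diffs += steered - base

# Tokens whose probability rises most when the feature is amplified.
top_ids = diffs.topk(10).indices
print([tok.decode(int(i)) for i in top_ids])
```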
These two methods are inexpensive to run and provide insights into how a feature actually affects the model. Importantly, they are complementary to input-centric descriptions, together providing a more complete understanding of a feature’s role. For instance, consider MLP SAE feature 19/5635 from Gemma-2 2B. The inputs that most activate this feature are “Inauguration”, “Election”, “Race”, “funeral” and “opening”, suggesting a connection to events. Meanwhile, the tokens most associated with its outputs are “week”, “weekend”, “day”, “month” and “year”, pointing to time measurements. Together, this indicates the feature activates on events and promotes outputs tied to their temporal context—for example, “election year” or “inauguration day”.
To evaluate these feature descriptions, we propose an input-based evaluation and an output-based one. In the input-based evaluation, we provide an LLM with the feature’s description and ask it to generate sentences that should activate the feature, as well as ones that should not. If the mean activation over the former set is higher than over the latter, the description is deemed faithful.
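A rough sketch of this check is below, under the simplifying assumption that a feature’s activation can be read off as the projection of a layer’s hidden states onto its direction (a real pipeline would read the SAE latent or neuron value directly). The model, layer, example sentences, and random feature direction are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6
feature_vec = torch.randn(model.config.hidden_size)  # placeholder feature direction

def mean_activation(sentences):
    scores = []
    for s in sentences:
        ids = tok(s, return_tensors="pt").input_ids
        with torch.no_grad():
            hidden = model(ids).hidden_states[LAYER][0]   # (seq_len, hidden_dim)
        scores.append((hidden @ feature_vec).mean())      # proxy for the feature's activation
    return torch.stack(scores).mean()

# Sentences an explainer LLM might generate from the feature's description.
activating = ["The inauguration took place last year.", "Election day is next week."]
non_activating = ["My cat is asleep on the sofa.", "Please pass the salt."]

# The description is deemed faithful if the "activating" set scores higher.
print(mean_activation(activating) > mean_activation(non_activating))
```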
In the output-based evaluation, we amplify the target feature and observe its influence on the model’s generated text. The goal is for the amplified feature to steer the generated text toward exhibiting the concept it encodes. For example, amplifying a feature associated with ‘games’ should prompt the model to generate text related to games. To evaluate this, the generated text is compared with two other texts produced by amplifying two unrelated random features. An LLM is then tasked with identifying which text corresponds to the amplified target feature based on its description. If it answers correctly, the description is deemed to be faithful.
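The judging step can be sketched as follows. The generated texts here are placeholder strings (in practice they would come from generating with each feature clamped high, using a hook like the one above), and the prompt wording is an assumption rather than the paper’s exact template.

```python
import random

description = "activates on events and promotes their temporal context"
target_text = "...text generated with the target feature amplified..."
random_texts = ["...text from random feature 1...", "...text from random feature 2..."]

# Shuffle so the judge cannot rely on position to find the target's text.
texts = [("target", target_text)] + [("random", t) for t in random_texts]
random.shuffle(texts)

options = "\n".join(f"({chr(65 + i)}) {t}" for i, (_, t) in enumerate(texts))
answer = chr(65 + [label for label, _ in texts].index("target"))

judge_prompt = (
    f"A feature was described as: {description}\n"
    "Exactly one of the texts below was generated with that feature amplified.\n"
    f"{options}\n"
    "Answer with the letter of the matching text."
)
# The description is deemed faithful if the judge LLM's reply equals `answer`.
print(judge_prompt, answer)
```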
Unsurprisingly, each method excels in its own category. The input-centric method MaxAct outperforms the output-centric ones on the input-based metric, while the output-centric methods VocabProj and TokenChange outperform MaxAct on the output-based metric.
Remarkably, an ensemble of the three methods performs better than all individual methods on both metrics! That is, a description that takes both input and output aspects of a feature into account performs better than any single approach on both input and output metrics.
We showed that the output-centric methods VocabProj and TokenChange consistently outperform MaxAct in output-based evaluations, highlighting the limitations of MaxAct in capturing the causal role of features. Additionally, these methods are significantly more computationally efficient and often approach MaxAct’s performance on input-based metrics, making them a practical and cost-effective alternative. Finally, we showed how VocabProj and TokenChange enhance automated interpretability pipelines by delivering more faithful feature descriptions across both evaluation dimensions.
For a demonstration of how understanding a feature translates into real-world applications, have a look at this blog post showcasing how it can facilitate knowledge erasure in LLMs. For more details about this work, you can read our paper.