This is a blog post version of the paper we wrote on the same topic.
Understanding the inner workings of large language models (LLMs) involves analyzing their internal representations at various levels of granularity. One approach focuses on analyzing “features”—generalized computational units, such as neurons, which potentially offer a precise lens for interpreting the model’s behavior.
These features can be formalized as key/value memories.
Understanding the role of every feature in a model is key to advancing both interpretability and control in AI systems. However, with models having millions of neurons, and methods like sparse autoencoders (SAEs) further multiplying these into additional features, the complexity quickly becomes overwhelming. Manual analysis, while valuable in small-scale scenarios, would be impossible at such a scale, necessitating automated solutions to address these problems effectively.
Large-scale automated pipelines addressing this problem were first used by OpenAI to interpret GPT-2 neurons with GPT-4.
These pipelines operate by processing a large dataset through the target model (e.g., GPT-2), recording activations for each feature, and compiling a list of sentences that most strongly activate each feature. This list is then given to an explainer model (e.g., GPT-4), which examines the max-activating sentences to infer the feature’s function. We dub this method MaxAct.
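To make the collection step concrete, here is a minimal sketch using GPT-2 small and a random direction standing in for a real neuron or SAE feature; both the model choice and the random direction are illustrative assumptions, not the exact setup of these pipelines.

```python
# Minimal sketch of the MaxAct collection step: score each sentence by the
# feature's highest per-token activation and keep the top sentences.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

LAYER = 6
feature_direction = torch.randn(model.config.hidden_size)  # placeholder "feature"

def max_activation(sentence: str) -> float:
    """Highest per-token activation of the feature on this sentence."""
    ids = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[LAYER][0]
    return (hidden @ feature_direction).max().item()

dataset = ["The inauguration takes place in January.",
           "She baked a chocolate cake.",
           "Election day falls on a Tuesday."]

# The sentences that activate the feature most strongly are what the
# explainer model (e.g., GPT-4) would be shown.
top_sentences = sorted(dataset, key=max_activation, reverse=True)[:2]
print(top_sentences)
```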
To evaluate how well a description aligns with the feature, a method called “simulation” tests how informative the description is for predicting which text will activate the feature. This involves providing a tester model (e.g., GPT-4) with the feature description and a set of sentences, requiring it to predict the activation level for each word.
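A toy sketch of how such a simulation score can be computed, assuming the tester’s per-word predictions are compared to the real activations via correlation (as in OpenAI’s scoring setup); the activation values below are invented for illustration.

```python
# Sketch of simulation scoring: compare the tester model's predicted per-token
# activations against the feature's real activations.
import numpy as np

real_activations      = np.array([0.0, 0.1, 3.2, 0.0, 2.8, 0.0])  # from the model
simulated_activations = np.array([0.0, 0.0, 3.0, 0.2, 2.5, 0.1])  # from the tester LLM

# Higher correlation -> the description is more informative about what activates the feature.
score = np.corrcoef(real_activations, simulated_activations)[0, 1]
print(f"simulation score: {score:.2f}")
```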
While this approach has its advantages, it overlooks an important aspect of a feature’s role: it focuses solely on “what inputs activate the feature” while neglecting “how the feature influences the model” once activated. This captures only one side of the story, and arguably the less impactful side, since understanding a feature’s effect on the model is crucial for steering it toward desired behaviors. The approach is also expensive, since it requires processing a huge dataset and recording activations for every feature.
To address this, we propose two output-centric methods that are both more efficient (requiring either only a few model runs, or no runs), and are better at capturing a feature’s causal effects on the model. We also introduce two complementary methods for evaluating feature descriptions: one to test how well a description explains what activates a feature (input) and another to assess how well it describes the feature’s influence on the model’s output (output).
The first method, dubbed VocabProj, simply applies vocabulary projection (a.k.a. logit lens) to the feature: the feature is mapped onto the model’s vocabulary space, yielding the tokens it most promotes, and an explainer model is asked to interpret this list of tokens.
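A minimal sketch of this projection, again with GPT-2 and a random direction standing in for a real feature.

```python
# Sketch of VocabProj: project a feature direction through the unembedding
# matrix (logit lens) and read off the tokens it most promotes.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

feature_direction = torch.randn(model.config.hidden_size)  # placeholder feature

# lm_head.weight has shape (vocab_size, hidden_size); its product with the
# feature gives a score per vocabulary token.
with torch.no_grad():
    vocab_scores = model.lm_head.weight @ feature_direction

top_ids = vocab_scores.topk(10).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
# This token list is what the explainer model is asked to describe.
```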
The second method, dubbed TokenChange, takes a more causal approach. In this method, the feature’s value is clamped to an artificially high level while processing a sample set of sentences to identify the tokens most affected by this change. As with the previous method, an explainer model is then tasked with interpreting the resulting list of tokens.
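Here is a rough sketch of that procedure, with the feature “clamped” by adding a scaled direction to a mid-layer residual stream; the model, hook point, and scale are illustrative choices rather than the paper’s exact implementation.

```python
# Sketch of TokenChange: amplify the feature during a forward pass and see
# which next-token probabilities rise the most across sample sentences.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, SCALE = 6, 10.0
feature_direction = torch.randn(model.config.hidden_size)
feature_direction /= feature_direction.norm()

def steer(module, inputs, output):
    hidden = output[0] + SCALE * feature_direction  # clamp the feature "on"
    return (hidden,) + output[1:]

def next_token_probs(sentence: str) -> torch.Tensor:
    ids = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    return logits.softmax(-1)

sentences = ["The ceremony was held", "After the election, the"]
deltas = torch.zeros(model.config.vocab_size)
for s in sentences:
    base = next_token_probs(s)
    handle = model.transformer.h[LAYER].register_forward_hook(steer)
    steered = next_token_probs(s)
    handle.remove()
    deltas += steered - base

# Tokens whose probability increases most when the feature is amplified;
# this list is handed to the explainer model.
top_ids = (deltas / len(sentences)).topk(10).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```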
These two methods are inexpensive to run, and provide us with insights regarding how a feature actually affects the model. Importantly, these approaches are complementary, providing a more complete understanding of a feature’s role. For instance, consider the MLP SAE feature 19/5635 from Gemma-2 2B. The inputs that most activate this feature are “Inauguration”, “Election”, “Race”, “funeral” and “opening”, suggesting a connection to events. Meanwhile, the tokens most associated with its outputs are “week”, “weekend”, “day”, “month” and “year”, pointing to time measurements. Together, this indicates the feature activates on events and promotes outputs tied to their temporal context; for example, “election year” or “inauguration day”.
To evaluate these feature descriptions, we propose an input-based evaluation and an output-based one. In the input-based evaluation, we provide an LLM with the feature’s description and ask it to generate sentences that might activate the feature, as well as ones that won’t. If the mean activation of the former set is higher than that of the latter, the description is deemed faithful.
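A toy version of this check, reusing the `max_activation` helper and feature from the MaxAct sketch above; in the real pipeline the two sentence sets are generated by an LLM from the description, whereas here they are hard-coded.

```python
# Sketch of the input-based evaluation: does the "should activate" set really
# activate the feature more strongly than the "should not activate" set?
should_activate = ["The inauguration drew a huge crowd.",
                   "Election night coverage ran for hours."]
should_not_activate = ["The recipe calls for two eggs.",
                       "He tightened the bolts on the bike."]

mean_pos = sum(max_activation(s) for s in should_activate) / len(should_activate)
mean_neg = sum(max_activation(s) for s in should_not_activate) / len(should_not_activate)

# The description is judged faithful (input-wise) if the positive set wins.
print("faithful (input):", mean_pos > mean_neg)
```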
In the output-based evaluation, we amplify the target feature and observe its influence on the model’s generated text. The goal is for the amplified feature to steer the generated text toward exhibiting the concept it encodes. For example, amplifying a feature associated with ‘games’ should prompt the model to generate text related to games. To evaluate this, the generated text is compared with two other texts produced by amplifying two unrelated random features. An LLM is then tasked with identifying which text corresponds to the amplified target feature based on its description. If it answers correctly, the description is deemed to be faithful.
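And a sketch of the output-based side, reusing the model, layer, scale, and feature direction from the TokenChange sketch; the judge call itself is left abstract rather than tied to a specific LLM API.

```python
# Sketch of the output-based evaluation: generate text with the target feature
# amplified and with two random features amplified, then let a judge LLM decide
# which generation matches the description.
def generate_steered(prompt: str, direction: torch.Tensor, n_tokens: int = 30) -> str:
    def hook(module, inputs, output):
        return (output[0] + SCALE * direction,) + output[1:]
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=n_tokens, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    handle.remove()
    return tokenizer.decode(out[0], skip_special_tokens=True)

prompt = "I think that"
random_dirs = []
for _ in range(2):  # two unrelated random features
    d = torch.randn(model.config.hidden_size)
    random_dirs.append(d / d.norm())

candidates = [generate_steered(prompt, feature_direction)] + [
    generate_steered(prompt, d) for d in random_dirs]

# A judge LLM would now be shown `candidates` (shuffled) together with the
# feature description and asked which text was steered by the described
# feature; a correct pick counts the description as faithful (output-wise).
print(candidates)
```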
Unsurprisingly, each method excels in its own category. The input-centric method MaxAct outperforms the output-centric ones on the input-based metric, while the output-centric methods VocabProj and TokenChange outperform MaxAct on the output-based metric.
Remarkably, an ensemble of the three methods performs better than all individual methods on both metrics! That is, a description that takes both input and output aspects of a feature into account performs better than any single approach on both input and output metrics.
We showed that the output-centric methods VocabProj and TokenChange consistently outperform MaxAct in output-based evaluations, highlighting the limitations of MaxAct in capturing the causal role of features. Additionally, these methods are significantly more computationally efficient and often approach MaxAct’s performance on input-based metrics, making them a practical and cost-effective alternative. Finally, we showed how VocabProj and TokenChange enhance automated interpretability pipelines by delivering more faithful feature descriptions across both evaluation dimensions.
For a demonstration of how understanding a feature translates into real-world applications, have a look at this blog post showcasing how it can facilitate knowledge erasure in LLMs. For more details about this work, you can read our paper.