added training details, todos and related work in readme
shehper committed Mar 26, 2024
1 parent 26ad9b9 commit b8d727c
Showing 1 changed file (README.md) with 20 additions and 22 deletions.

This repository reproduces the results of [Anthropic's Sparse Dictionary Learning paper](https://transformer-circuits.pub/2023/monosemantic-features/). The codebase is quite rough, but the results are excellent. See the [feature interface](https://shehper.github.io/feature-interface/) to browse through the features learned by the sparse autoencoder. There are improvements to be made (see the [TODOs](#todos) section below), and I will work on them intermittently as I juggle things in life :)

I trained a 1-layer transformer model from scratch using [nanoGPT](https://github.com/karpathy/nanoGPT) with $d_{\text{model}} = 128$. Then, I trained a sparse autoencoder with $4096$ features on its MLP activations as in [Anthropic's paper](https://transformer-circuits.pub/2023/monosemantic-features/). 93% of the autoencoder neurons were alive, only 5% of which were of ultra-low density. There are several interesting features. For example, there is [a feature for the French language](https://shehper.github.io/feature-interface/?page=2011),

<p align="center">
<img src="./assets/french.png" width="600" />
<img src="./assets/french.png" width="700" />
</p>

as well as a feature each for German, Japanese, Hebrew, ..., and many others:

- [A feature for German](https://shehper.github.io/feature-interface/?page=156)
- [A feature for Scandinavian languages](https://shehper.github.io/feature-interface/?page=1634)
- [A feature for Japanese](https://shehper.github.io/feature-interface/?page=1989)
- [A feature for Hebrew](https://shehper.github.io/feature-interface/?page=2026)

<!-- - [A feature for some negative words/news](https://shehper.github.io/feature-interface/?page=218) -->

### Training Details

I used the "OpenWebText" dataset to train the transformer model and to create the visualization. I trained the autoencoder on -->
I used the "OpenWebText" dataset to train the transformer model, to generate the MLP activations dataset for the autoencoder, and to generate the feature interface visualizations. The transformer model had $d_{\text{model}}= 128$, $d_{\text{MLP}} = 512$, and $n_{\text{head}}= 4$. I trained this model for $2 \times 10^5$ iterations to roughly match the number of epochs with [Anthropic's training procedure](https://transformer-circuits.pub/2023/monosemantic-features#appendix-transformer).

I collected a dataset of 4B MLP activations by performing forward passes on 20M prompts (each of length 1024) and keeping 200 activation vectors from each prompt. Next, I trained the autoencoder for approximately $5 \times 10^5$ training steps at batch size 8192 and learning rate $3 \times 10^{-4}$, with an L1 coefficient of $1 \times 10^{-3}$. I selected the L1 coefficient and the learning rate by performing a grid search. I performed neuron resampling 4 times during training, at training steps $2.5 \times i \times 10^4$ for $i = 1, 2, 3, 4$. See a complete log of the training run on the [W&B page](https://wandb.ai/shehper/sparse-autoencoder-openwebtext-public/runs/vjbcwjsf?nw=nwusershehper).
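
To make the objective concrete, here is a minimal PyTorch sketch of one autoencoder training step using the hyperparameters quoted above (batch size 8192, learning rate $3 \times 10^{-4}$, L1 coefficient $10^{-3}$). This illustrates the standard reconstruction-plus-L1 objective; it is not the repository's actual implementation, and the class and variable names are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Sketch: 512-d MLP activations -> 4096 sparse features -> reconstruction."""
    def __init__(self, d_mlp: int = 512, n_features: int = 4096):
        super().__init__()
        self.enc = nn.Linear(d_mlp, n_features)
        self.dec = nn.Linear(n_features, d_mlp)

    def forward(self, x):
        f = F.relu(self.enc(x - self.dec.bias))  # encode relative to the decoder bias
        x_hat = self.dec(f)                      # reconstruct the MLP activation
        return x_hat, f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=3e-4)  # learning rate from the text
l1_coeff = 1e-3                                    # L1 coefficient from the text

x = torch.randn(8192, 512)                         # one batch of MLP activations (batch size 8192)
x_hat, f = sae(x)
loss = F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()

opt.zero_grad()
loss.backward()
opt.step()
```

Neuron resampling and the streaming of activations from disk are omitted from this sketch.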

For the most part, I followed the training procedure described in the [appendix](https://transformer-circuits.pub/2023/monosemantic-features#appendix-autoencoder) of Anthropic's original paper. In particular, I did not follow the improvements they suggested in their [January](https://transformer-circuits.pub/2024/jan-update/index.html) and [February](https://transformer-circuits.pub/2024/feb-update/index.html) updates.

### TODOs
- Incorporate the effects of feature ablations in the feature interface.
- Implement an interface to see "Feature Activations on Example Texts" as done by Anthropic [here](https://transformer-circuits.pub/2023/monosemantic-features/vis/a1-math.html).
- Make the codebase easy to import and use with other models.
- Modify the code so that one can train a sparse autoencoder on activations of any MLP / attention layer (see the sketch below for one possible approach).
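
As a possible starting point for that last TODO, here is a small PyTorch sketch that uses a forward hook to grab the output of an arbitrary named submodule (an MLP or attention layer), which could then be fed to the autoencoder. The function name, layer name, and usage are hypothetical and not part of this codebase.

```python
import torch
import torch.nn as nn

def capture_activations(model: nn.Module, layer_name: str, batch: torch.Tensor) -> torch.Tensor:
    """Run one forward pass and return the output of the submodule named `layer_name`."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["acts"] = output.detach()

    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(batch)
    finally:
        handle.remove()  # always detach the hook, even if the forward pass fails
    return captured["acts"]

# Hypothetical usage: acts = capture_activations(gpt, "transformer.h.0.mlp", token_batch)
```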

### Related Work
There are several other very interesting works on the web exploring sparse dictionary learning. Here is a small subset of them.

- [Sparse Autoencoders Find Highly Interpretable Features in Language Models by Cunningham et al.](https://arxiv.org/abs/2309.08600)
- [Sparse Autoencoders Work on Attention Layer Outputs by Kissane et al.](https://www.lesswrong.com/posts/DtdzGwFh9dCfsekZZ/sparse-autoencoders-work-on-attention-layer-outputs)
- [Joseph Bloom's SAE codebase](https://github.com/jbloomAus/mats_sae_training) along with a blogpost on [trained SAEs for all residual stream layers of GPT-2 small](https://www.alignmentforum.org/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream)
- [Neel Nanda's SAE codebase](https://github.com/neelnanda-io/1L-Sparse-Autoencoder) along with a [blogpost](https://www.lesswrong.com/posts/fKuugaxt2XLTkASkk/open-source-replication-and-commentary-on-anthropic-s)
- [Callum McDougall's exercises on SAEs](https://github.com/callummcdougall/sae-exercises-mats/tree/main)
- [SAE library by AI Safety Foundation](https://github.com/ai-safety-foundation/sparse_autoencoder)
