diff --git a/_sources/generated/demos/Main_Demo.ipynb.txt b/_sources/generated/demos/Main_Demo.ipynb.txt
index c0fed32d9..41853de67 100644
--- a/_sources/generated/demos/Main_Demo.ipynb.txt
+++ b/_sources/generated/demos/Main_Demo.ipynb.txt
@@ -429,6 +429,26 @@
     "cv.attention.attention_patterns(tokens=gpt2_str_tokens, attention=attention_pattern)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this case, we only wanted the layer 0 attention patterns, but we are storing the internal activations from all locations in the model. It's convenient to have access to all activations, but this can be prohibitively expensive for memory use with larger models, batch sizes, or sequence lengths. In addition, we don't need to do the full forward pass through the model to collect layer 0 attention patterns. The following cell will collect only the layer 0 attention patterns and stop the forward pass at layer 1, requiring far less memory and compute."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "attn_hook_name = \"blocks.0.attn.hook_pattern\"\n",
+    "attn_layer = 0\n",
+    "_, gpt2_attn_cache = model.run_with_cache(gpt2_tokens, remove_batch_dim=True, stop_at_layer=attn_layer + 1, names_filter=[attn_hook_name])\n",
+    "gpt2_attn = gpt2_attn_cache[attn_hook_name]\n",
+    "assert torch.equal(gpt2_attn, attention_pattern)"
+   ]
+  },
   {
    "attachments": {},
    "cell_type": "markdown",
diff --git a/_sources/generated/model_properties_table.md.txt b/_sources/generated/model_properties_table.md.txt
index 30e018468..b653ccf5a 100644
--- a/_sources/generated/model_properties_table.md.txt
+++ b/_sources/generated/model_properties_table.md.txt
@@ -170,6 +170,21 @@
 | Qwen/Qwen2-1.5B-Instruct | 1.4B | 28 | 1536 | 12 | silu | 2048 | 151936 | 128 | 8960 | 2 |
 | Qwen/Qwen2-7B | 7.1B | 28 | 3584 | 28 | silu | 2048 | 152064 | 128 | 18944 | 4 |
 | Qwen/Qwen2-7B-Instruct | 7.1B | 28 | 3584 | 28 | silu | 2048 | 152064 | 128 | 18944 | 4 |
+| Qwen/Qwen2.5-0.5B | 391M | 24 | 896 | 14 | silu | 2048 | 151936 | 64 | 4864 | 2 |
+| Qwen/Qwen2.5-0.5B-Instruct | 391M | 24 | 896 | 14 | silu | 2048 | 151936 | 64 | 4864 | 2 |
+| Qwen/Qwen2.5-1.5B | 1.4B | 28 | 1536 | 12 | silu | 2048 | 151936 | 128 | 8960 | 2 |
+| Qwen/Qwen2.5-1.5B-Instruct | 1.4B | 28 | 1536 | 12 | silu | 2048 | 151936 | 128 | 8960 | 2 |
+| Qwen/Qwen2.5-3B | 3.0B | 36 | 2048 | 16 | silu | 2048 | 151936 | 128 | 11008 | 2 |
+| Qwen/Qwen2.5-3B-Instruct | 3.0B | 36 | 2048 | 16 | silu | 2048 | 151936 | 128 | 11008 | 2 |
+| Qwen/Qwen2.5-7B | 7.1B | 28 | 3584 | 28 | silu | 2048 | 152064 | 128 | 18944 | 4 |
+| Qwen/Qwen2.5-7B-Instruct | 7.1B | 28 | 3584 | 28 | silu | 2048 | 152064 | 128 | 18944 | 4 |
+| Qwen/Qwen2.5-14B | 15B | 48 | 5120 | 40 | silu | 2048 | 152064 | 128 | 13824 | 8 |
+| Qwen/Qwen2.5-14B-Instruct | 15B | 48 | 5120 | 40 | silu | 2048 | 152064 | 128 | 13824 | 8 |
+| Qwen/Qwen2.5-32B | 34B | 64 | 5120 | 40 | silu | 2048 | 152064 | 128 | 27648 | 8 |
+| Qwen/Qwen2.5-32B-Instruct | 34B | 64 | 5120 | 40 | silu | 2048 | 152064 | 128 | 27648 | 8 |
+| Qwen/Qwen2.5-72B | 80B | 80 | 8192 | 64 | silu | 2048 | 152064 | 128 | 29568 | 8 |
+| Qwen/Qwen2.5-72B-Instruct | 80B | 80 | 8192 | 64 | silu | 2048 | 152064 | 128 | 29568 | 8 |
+| Qwen/QwQ-32B-Preview | 34B | 64 | 5120 | 40 | silu | 2048 | 152064 | 128 | 27648 | 8 |
 | phi-1 | 1.2B | 24 | 2048 | 32 | gelu | 2048 | 51200 | 64 | 8192 | |
 | phi-1_5 | 1.2B | 24 | 2048 | 32 | gelu | 2048 | 51200 | 64 | 8192 | |
 | phi-2 | 2.5B | 32 | 2560 | 32 | gelu | 2048 | 51200 | 80 | 10240 | |
diff --git a/_static/coverage/d_37285d613390727b_can_be_used_as_mlp_py.html b/_static/coverage/d_37285d613390727b_can_be_used_as_mlp_py.html
index cb9510a2f..bfe5dd961 100644
--- a/_static/coverage/d_37285d613390727b_can_be_used_as_mlp_py.html
+++ b/_static/coverage/d_37285d613390727b_can_be_used_as_mlp_py.html
@@ -67,7 +67,7 @@
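For reference, the filtered-caching pattern that the new Main_Demo cell introduces can be reproduced as a standalone script. This is a minimal sketch, assuming the standard TransformerLens API (`HookedTransformer.from_pretrained`, `run_with_cache`) and GPT-2 small as in the demo notebook; the prompt and variable names here are illustrative and not part of the patch.

```python
import torch
from transformer_lens import HookedTransformer

# GPT-2 small, as used in the demo notebook; any supported model name works,
# including the newly added Qwen/Qwen2.5-* entries from the table above.
model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("Natural language processing is")  # illustrative prompt

attn_hook_name = "blocks.0.attn.hook_pattern"
attn_layer = 0

# Baseline: cache every activation in the model (convenient but memory-hungry).
_, full_cache = model.run_with_cache(tokens, remove_batch_dim=True)

# Filtered run: cache only the layer-0 attention pattern and stop the forward
# pass after block 0, so later layers are never computed.
_, small_cache = model.run_with_cache(
    tokens,
    remove_batch_dim=True,
    stop_at_layer=attn_layer + 1,
    names_filter=[attn_hook_name],
)

# Both caches agree on the layer-0 attention pattern.
assert torch.equal(small_cache[attn_hook_name], full_cache[attn_hook_name])
```

With `stop_at_layer` set, the first return value is no longer the logits (the run ends before the unembed), which is why it is discarded here; `names_filter` also accepts a single hook name or a callable for more flexible filtering.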