Added support for bloom-560m model #434
Conversation
@@ -334,6 +334,10 @@ def input_to_embed(
# keys and queries. See HookedTransformerConfig for details
residual = embed
shortformer_pos_embed = None
#TODO: alibi embedding doesnt do anything
Why is this a TODO? Should Alibi do something here?
It's no longer needed, deleted.
transformer_lens/components.py
Outdated
# alibi encoding before applying causal mask
if self.cfg.positional_embedding_type == 'alibi':
    #TODO: not sure about the side effect of not using standard, double check
What does this mean?
A reminder for myself to double-check any potential side effects of setting the embedding type to something other than standard. No longer needed, deleted!
transformer_lens/components.py
Outdated
if self.cfg.positional_embedding_type == 'alibi':
    #TODO: not sure about the side effect of not using standard, double check
    batch_size = attn_scores.size(0)
    seq_len = attn_scores.size(-2)
Should this be -1? Note that when generating text the attention scores are not square
Yes it should be set to key_length, changed to -1. Thanks!
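For illustration, a minimal sketch of the shapes involved, assuming attention scores are laid out as [batch, head_index, query_pos, key_pos]; the numbers are made up:

```python
import torch

# During incremental generation with a key-value cache, only the newest query
# attends over all cached keys, so the score matrix is not square.
batch, n_heads, query_len, key_len = 2, 4, 1, 10
attn_scores = torch.zeros(batch, n_heads, query_len, key_len)

print(attn_scores.size(-2))  # 1  -> query length
print(attn_scores.size(-1))  # 10 -> key length, which the ALiBi bias must span
```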
Thanks for the PR! I left some minor comments, but it overall looks pretty good. Have you tested that this gives (approx) the same logits as Bloom on HuggingFace? And would you be able to add a test for it to test_hooked_transformer? https://github.com/neelnanda-io/TransformerLens/blob/main/tests/acceptance/test_hooked_transformer.py My only hesitation with the test is that 560M is large enough that it might slow things down - thoughts @alan-cooney ?
Seems likely fine still. In general the tests are starting to take a bit too long however so I'll split the acceptance and unit tests up into parallel workflows. @SeuperHakkerJa nice work on this! One other thing is the docs are failing to build (probably a formatting error in docstrings). To fix it run
Thank you for your feedback and the comments. I'll begin fixing the issue soon.
Btw I switched to draft whilst you're working on this - but feel free to switch back when ready. Also happy to review the changes if you want Neel.
Thanks so much for doing this!
It looks good but I've added some suggestions to improve the readability a bit, and also left a few questions (particularly regarding the broadcasting onto QK). Let me know what you think.
    shortformer_pos_embed = None
elif self.cfg.positional_embedding_type == "alibi":
    residual = embed
Suggest we add a comment here along the lines of "ALiBi does not add positional embeddings to word embeddings, and instead it biases QK attention scores."
Maybe even link the paper i.e. https://arxiv.org/pdf/2108.12409.pdf p1.
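For reference, the form given on p. 1 of that paper: for query position i, a head-specific slope m scales a linear penalty that is added straight to the pre-softmax query-key scores:

$$\operatorname{softmax}\bigl(\mathbf{q}_i \mathbf{K}^\top + m \cdot [-(i-1), \ldots, -2, -1, 0]\bigr)$$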
# bloom flags
post_embedding_ln: bool = False
Please can you move the explanation to the docstring above so it's consistent
transformer_lens/components.py
Outdated
assert alibi.shape == (
    attn_scores.size(0),
    attn_scores.size(1),
    1,
    attn_scores.size(-1),
), f"alibi shape {alibi.shape}, expecting {attn_scores.shape}"
This looks like you're testing something that should be correct as long as your code is written correctly? If so, best to keep this out of the runtime code.
Also, is this right? If the query dimension is size 1, surely this only works for predicting the next token, but for the logits for all previous tokens it would then give an incorrect answer? That would mean training wouldn't work with this code, right (or any analysis that looks at logits other than those from the last token)?
My apologies for the oversight; I think it might not work as desired during training then. I was only considering broadcasting, not past_kv_cache. I will fix this shortly.
transformer_lens/components.py
Outdated
# Huggingface impl uses torch.Tensor.baddbmm, with alpha = 1/sqrt(d_head), and beta=1
# and alibi.baddbmm(q,k) = beta * alibi + alpha * (q@k),
# here the `attn_scores` is already scaled by a factor of self.attn_scale,
# we only need to add alibi matrix to the result
Nice to include this, but I think it belongs in the build_alibi_tensor function instead, as part of a broader explanation of how it works?
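As an aside, the equivalence described in those code comments can be checked numerically; a standalone sketch (tensor sizes are made up for illustration):

```python
import torch

d_head, q_len, k_len = 8, 4, 4
q = torch.randn(1, q_len, d_head)
k = torch.randn(1, k_len, d_head)
alibi = torch.randn(1, q_len, k_len)

# HuggingFace-style: fused beta * alibi + alpha * (q @ k^T) via baddbmm.
hf_scores = alibi.baddbmm(q, k.transpose(1, 2), beta=1.0, alpha=1.0 / d_head**0.5)

# Scale the raw scores first, then add the bias (as in the snippet above).
tl_scores = (q @ k.transpose(1, 2)) / d_head**0.5 + alibi

assert torch.allclose(hf_scores, tl_scores, atol=1e-6)
```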
transformer_lens/components.py
Outdated
def build_alibi_tensor(
    self,
    attention_mask: torch.Tensor,  # batch pos
    num_heads: int,
    dtype: torch.dtype,
) -> Float[torch.Tensor, "batch head_index 1 pos"]:
Would be great to have a decent docstring here with details about how it works, a reference, and all args etc. See https://neelnanda-io.github.io/TransformerLens/content/contributing.html#documentation for our new contributors guide on how to do this.
Would also be good to have a unit test if possible? Feels like we can check some basic things like (see the sketch after this list):
- For each specific head, the diagonal QK values are the same (e.g. the middle diagonal is 0s)
- For each specific head, the slope (m) is constant
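A rough sketch of such a test, assuming a hypothetical create_alibi_bias helper that returns the full per-head bias of shape [n_heads, query_pos, key_pos]. The helper below just restates the paper's rule; it is not the PR's implementation:

```python
import torch


def create_alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Hypothetical helper: full ALiBi bias, shape [n_heads, query_pos, key_pos]."""
    # Paper's rule: slopes form a geometric sequence starting at 2^(-8/n).
    slopes = torch.pow(
        2.0 ** (-8.0 / n_heads), torch.arange(1, n_heads + 1, dtype=torch.float32)
    )
    positions = torch.arange(seq_len)
    relative = positions[None, :] - positions[:, None]  # key_pos - query_pos
    return slopes[:, None, None] * relative[None, :, :]


def test_alibi_bias_properties():
    n_heads, seq_len = 8, 16
    bias = create_alibi_bias(n_heads, seq_len)

    # The middle diagonal (query_pos == key_pos) should be all zeros for every head.
    assert torch.all(bias.diagonal(dim1=-2, dim2=-1) == 0)

    # The slope along the key dimension should be constant within each head.
    diffs = bias[:, 0, 1:] - bias[:, 0, :-1]
    assert torch.allclose(diffs, diffs[:, :1].expand_as(diffs))
```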
transformer_lens/components.py
Outdated
if self.cfg.positional_embedding_type == "alibi":
    batch_size = attn_scores.size(0)
    seq_len = attn_scores.size(-1)
    additive_mask = torch.ones(batch_size, seq_len)
Small point, but I think it may be clearer to move this additive_mask into build_alibi_tensor, and then it's easier to explain (instead we can just pass the relevant sizes to that function). What do you think?
Also, it needs to have its device set if it isn't already (so that it's on the same device as QK).
transformer_lens/components.py
Outdated
@@ -757,6 +789,49 @@ def apply_rotary(

    return torch.cat([x_rotated, x_pass], dim=-1)

def build_alibi_tensor(
create_attention_linear_bias or create_alibi_bias? I'm terrible at naming things, so not the best person to suggest here, but it feels like we shouldn't have tensor in the name?
create_attention_linear_bias sounds great to me. (I was naming it build_alibi_tensor only because it was named so in the HF code.)
transformer_lens/components.py
Outdated
batch_size, seq_length = attention_mask.shape
closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
base = torch.tensor(
    2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3))),
    device=attention_mask.device,
    dtype=torch.float32,
)
powers = torch.arange(
    1, 1 + closest_power_of_2, device=attention_mask.device, dtype=torch.int32
)
slopes = torch.pow(base, powers)

if closest_power_of_2 != num_heads:
    extra_base = torch.tensor(
        2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3))),
        device=attention_mask.device,
        dtype=torch.float32,
    )
    num_remaining_heads = min(
        closest_power_of_2, num_heads - closest_power_of_2
    )
    extra_powers = torch.arange(
        1,
        1 + 2 * num_remaining_heads,
        2,
        device=attention_mask.device,
        dtype=torch.int32,
    )
    slopes = torch.cat([slopes, torch.pow(extra_base, extra_powers)], dim=0)

arange_tensor = ((attention_mask.cumsum(dim=-1) - 1) * attention_mask)[
    :, None, :
]
alibi = slopes[..., None] * arange_tensor
Makes sense - worth adding a comment using p. 5 of https://arxiv.org/pdf/2108.12409.pdf to explain a bit more about how this gets to the head-specific slope (m). But can we use their general formula to simplify this a bit? ("In general, for n heads, our set of slopes is the geometric sequence that starts at 2^(-8/n) and uses that same value as its ratio.")
For this function, I was also using HF's implementation. If you prefer, I can switch it back to the general (original?) formula.
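For what it's worth, a sketch of what the general formula could look like (illustrative only; it matches the HF computation above when num_heads is a power of two, while the HF code's extra branch handles other head counts differently):

```python
import torch


def alibi_slopes(num_heads: int) -> torch.Tensor:
    """Per-head ALiBi slopes: a geometric sequence starting at 2^(-8/n) with that same ratio."""
    exponents = torch.arange(1, num_heads + 1, dtype=torch.float32)
    return torch.pow(2.0 ** (-8.0 / num_heads), exponents)


print(alibi_slopes(8))  # 0.5, 0.25, 0.125, ... down to 2**-8
```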
Summary:
Side Note:
Thus, the 'original' code leveraged broadcasting and softmax's translation invariance to derive the
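To spell out that property with a toy example (a standalone sketch; the sizes are arbitrary): adding a per-query constant to a row of pre-softmax scores leaves the softmax unchanged, so a bias of m * j broadcast over queries yields the same attention pattern as the paper's m * (j - i).

```python
import torch

m, seq_len = 0.25, 6
i = torch.arange(seq_len)[:, None]  # query positions
j = torch.arange(seq_len)[None, :]  # key positions
scores = torch.randn(seq_len, seq_len)
causal = torch.zeros(seq_len, seq_len).masked_fill(j > i, float("-inf"))

# The two biases differ per query row only by the constant m * i,
# which softmax ignores, so the attention patterns match.
paper_style = torch.softmax(scores + m * (j - i) + causal, dim=-1)
broadcast_style = torch.softmax(scores + m * j + causal, dim=-1)
assert torch.allclose(paper_style, broadcast_style, atol=1e-6)
```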
Looks good - one small question but otherwise good to go.
…Org#436) Note these are also added to the makefile as this is currently the approach people use to run the tests. In the future we should probably remove this as it's better to stick to one language in the repo (and a .py script file can also do all of this).
This should add a bit of a speed boost.
Removes warning and speeds up poetry install.
- Organise in the order people usually want (e.g. description first)
- Remove the top image
- Add some buttons
- Fix all linting issues (e.g. should use * for bullets)
This will reduce compatibility issues with Jax
* Added santacoder to aliases
* Removed reference to multiquery parameter
* Added santacoder to tests
* Asserted that trust_remote_code=true for santacoder
* Added demo notebook for santacoder
* Removed print statements and forcibly set trust_remote_code=True
* Changed spacing and indentation for black
* Removed model type hint in convert weights method
* Removed santacoder test due to memory issues
* Added back in print statement for loading pretrained model
Hi! Is there any reason not to add the other BLOOM models in this PR? Thanks!
Not aware of any reason. Happy to review a PR if you want to add the other ones.
I was only concerned that the other models might be too large, potentially causing the unit tests to take too much time:
Thanks for the quick response! I don't know exactly how your unit tests work... so I won't do a PR to avoid issues in that sense. But basically if I want to load a larger model I just need to list it in
Yes, that's right! @alan-cooney I will then initiate this pull request to integrate the remaining bloom models (up to 7b), potentially this weekend or at some point next week.
Regarding the tests - I suspect they are actually fine, but an easy way to check is to run them on the CPU and see that they take at most a few seconds.
Description
I integrated support for the bloom-560m model, which uses ALiBi (attention with linear biases) instead of standard positional embeddings. Consequently, the positional_embedding_type flag can be set to 'alibi'.
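For example, a minimal usage sketch (the exact model alias is an assumption; it may be registered under a different name):

```python
from transformer_lens import HookedTransformer

# Hypothetical alias; adjust to whatever name this PR registers.
model = HookedTransformer.from_pretrained("bloom-560m")
print(model.cfg.positional_embedding_type)  # expected: "alibi"

logits = model("Hello, my name is")
print(logits.shape)  # [batch, pos, d_vocab]
```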