move: transformer assets to new folder
DorsaRoh committed Oct 22, 2024
1 parent a5e284f commit 73fac05
Showing 34 changed files with 126 additions and 55 deletions.
23 changes: 23 additions & 0 deletions ML_Interviews/Data_Structures_Algorithms/README.md
@@ -0,0 +1,23 @@
# Data Structures and Algorithms


1. [Stacks](#1-stacks)




## 1. Stacks

A stack is a last-in, first-out (LIFO) collection: elements are pushed onto and popped off the top. In Python, a plain list works as a stack (see the sketch below the table):

`myStack = []`


<img src="../assets/ML_Interviews/2-stack.png" alt="stack" width="300" height="auto">


| Operation | Big-O Time |
|----------|----------|
| Push | O(1) |
| Pop | O(1) |
| Peek / Top | O(1) |
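
A minimal sketch of these operations on a Python list (the variable name `myStack` and the sample values are illustrative):

```python
myStack = []

# Push: append to the end of the list, O(1) amortized
myStack.append("a")
myStack.append("b")
myStack.append("c")

# Peek / Top: read the last element without removing it, O(1)
print(myStack[-1])    # "c"

# Pop: remove and return the last element, O(1)
print(myStack.pop())  # "c"
print(myStack)        # ['a', 'b']
```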


50 changes: 49 additions & 1 deletion ML_Interviews/README.md
@@ -6,6 +6,9 @@
1. [Vectors](#1-vectors)
2. [Matrices](#2-matrices)

10. [Probability](#10-probability)



## 1. Vectors

@@ -200,6 +203,21 @@ The outer product of 'weather' and 'crop' can capture all possible interactions

## 2. Matrices

### Important terminology
- **Determinant**: A scalar that represents the factor by which a linear transformation scales areas (in 2D) or volumes (in 3D) in the corresponding space.

The determinant of the 2×2 matrix A below:

| | |
|----|----|
| a | b |
| c | d |

is det(A) = ad - bc

- det(𝑀1 𝑀2) = det(𝑀1)det(𝑀2)
- Applying the linear transformations 𝑀2 and 𝑀1 sequentially scales the area/volume by the product of their individual scaling factors (determinants). This is equivalent to the scalar obtained by applying the combined transformation 𝑀1𝑀2 (𝑀2 is applied first, followed by 𝑀1) and then measuring how it scales the area/volume; a quick numerical check follows below.
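
A quick numerical check of this product rule, sketched with NumPy (the matrices are arbitrary examples):

```python
import numpy as np

M1 = np.array([[2.0, 1.0],
               [0.0, 3.0]])
M2 = np.array([[1.0, 4.0],
               [2.0, 5.0]])

# det of a 2x2 matrix [[a, b], [c, d]] is ad - bc
print(np.linalg.det(M1))                      # 2*3 - 1*0 = 6
print(np.linalg.det(M2))                      # 1*5 - 4*2 = -3
print(np.linalg.det(M1 @ M2))                 # approximately -18
print(np.linalg.det(M1) * np.linalg.det(M2))  # -18, matches det(M1 M2)
```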

**1. [E] Why do we say that matrices are linear transformations?**

**2. [E] What’s the inverse of a matrix? Do all matrices have an inverse? Is the inverse of a matrix always unique?**
@@ -245,4 +263,34 @@ The outer product of 'weather' and 'crop' can capture all possible interactions

**11. [H] Given a very large symmetric matrix `A` that doesn’t fit in memory (say `A ∈ R^{1M × 1M}`) and a function `f` that can quickly compute `f(x) = Ax` for `x ∈ R^{1M}`, find the unit vector `x` such that `xᵀAx` is minimal.**

**Hint: Can you frame it as an optimization problem and use gradient descent to find an approximate solution?**

## 10. Probability

### Important Terminology

- **Mean** μ:

The mean of a distribution is the center of mass for that distribution.
<br>i.e., what you would expect the **average outcome** to be if you repeated an experiment many times<br>
E(X) = ∑ x·P(x)<br>

- **Standard deviation** σ:

Measures how dispersed the data is in relation to the mean (a short numerical sketch follows the terminology list below).

<img src="../assets/ML_Interviews/1-standardd.png" alt="Standard" width="300" height="auto">

<br>

- **Common distributions**:

**Explain all the common distributions and write out their equations; draw the probability mass function (PMF) if a distribution is discrete and the probability density function (PDF) if it is continuous.**

1. Normal distribution
2.


- **Cross entropy**:
- **KL divergence**:
- **Probability distribution**:
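
A minimal sketch of the Mean and Standard deviation entries above for a discrete distribution (the outcomes and probabilities are arbitrary illustrations):

```python
import numpy as np

# A toy discrete distribution: outcomes and their probabilities (sum to 1)
x = np.array([1.0, 2.0, 3.0, 4.0])
p = np.array([0.1, 0.2, 0.3, 0.4])

mean = np.sum(x * p)                    # E(X) = sum of x * P(x)
variance = np.sum(p * (x - mean) ** 2)  # E[(X - mean)^2]
std = np.sqrt(variance)                 # standard deviation, sigma

print(mean)  # 3.0
print(std)   # 1.0
```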
54 changes: 27 additions & 27 deletions README.md
@@ -328,7 +328,7 @@ A collection of real numbers, which could be:
### Output:
A probability distribution over all potential next tokens

![Output Example](assets/4-outputEX.png)
![Output Example](assets/Transformers/4-outputEX.png)

## Tokens

@@ -338,9 +338,9 @@ Tokens are "little pieces" of information (ex. words, combinations of words, sou
- encodes the meaning of that piece
- ex. in considering these vectors as coordinates, words with similar meanings tend to land near each other

![Tokens](assets/2-tokens.png)
![Token Vectors](assets/3-tokenvectors.png)
![Coordinate Tokens](assets/1-coordinateTokens.png)
![Tokens](assets/Transformers/2-tokens.png)
![Token Vectors](assets/Transformers/3-tokenvectors.png)
![Coordinate Tokens](assets/Transformers/1-coordinateTokens.png)

## Embeddings

@@ -358,7 +358,7 @@ See `Transformer/embedding_notes.ipynb` for more on embeddings!
Below is an image of the embedding matrix. Each word corresponds to a specific vector, with no reference to its context.
It is the Attention block's responsibility to update a word's vector with its context. (to be discussed later)

![Embedding Matrix](assets/10-embeddingmatrix.png)
![Embedding Matrix](assets/Transformers/10-embeddingmatrix.png)
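
As a toy sketch of what the image shows (the vocabulary, values, and dimensions here are made up for illustration), the embedding matrix is simply a lookup table from token ids to context-free vectors:

```python
import numpy as np

vocab = ["the", "cat", "sat"]       # illustrative vocabulary
embedding_matrix = np.array([
    [0.1, 0.3, -0.2],               # vector for "the"
    [0.7, -0.1, 0.5],               # vector for "cat"
    [0.2, 0.9, 0.4],                # vector for "sat"
])

# Embedding lookup: a token id selects its row, with no reference to context
token_ids = [vocab.index(w) for w in ["the", "cat", "sat"]]
embeddings = embedding_matrix[token_ids]
print(embeddings.shape)  # (3, 3): three tokens, each a 3-dimensional vector
```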

## Positional Encoders

@@ -398,8 +398,8 @@ print(np.exp(seq)/np.sum(np.exp(seq)))
# [0.03511903 0.25949646 0.70538451]
```

![Softmax Equation](assets/8-softmaxEqn.png)
![Softmax](assets/6-softmax.png)
![Softmax Equation](assets/Transformers/8-softmaxEqn.png)
![Softmax](assets/Transformers/6-softmax.png)

## Temperature

@@ -408,8 +408,8 @@ With softmax, the constant T added to the denominator of the exponents of e in t
- Makes the softmax outputs LESS extreme towards 0 and 1
- This enables more varied text to be generated, differing from one generation to the next (see the sketch after the images below)

![Softmax with Temperature](assets/7-softmaxT.png)
![Logits](assets/9-logits.png)
![Softmax with Temperature](assets/Transformers/7-softmaxT.png)
![Logits](assets/Transformers/9-logits.png)
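
A minimal sketch of softmax with temperature, assuming `seq = [1, 3, 4]`, which reproduces the probabilities printed in the softmax example above (the temperature values are illustrative):

```python
import numpy as np

seq = np.array([1.0, 3.0, 4.0])  # reproduces the probabilities shown above at T = 1

def softmax_with_temperature(logits, T=1.0):
    # Dividing the logits by T before exponentiating flattens (T > 1)
    # or sharpens (T < 1) the resulting distribution
    scaled = logits / T
    return np.exp(scaled) / np.sum(np.exp(scaled))

print(softmax_with_temperature(seq, T=1.0))  # ~[0.035 0.259 0.705]
print(softmax_with_temperature(seq, T=2.0))  # less extreme, closer to uniform
print(softmax_with_temperature(seq, T=0.5))  # more peaked toward the largest logit
```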

## Attention

@@ -427,21 +427,21 @@ Updates a word's embedding vector in reference to its context. Enables the trans

Prior to Attention, the embedding vector of each word is consistent, regardless of its context (embedding matrix). Therefore, the motivation of Attention is to update a word's embedding vector depending on its context (i.e. surrounding tokens) to capture this specific contextual instance of the word

![Attention](assets/10-embeddingmatrix.png)
![Attention](assets/Transformers/10-embeddingmatrix.png)

The computation to predict the next token relies entirely on the final vector of the current sequence

Initially, this vector corresponds to the embedding of the last word in the sequence. As the sequence passes through the model's attention blocks, the final vector is updated to include information from the entire sequence, not just the last word. This updated vector becomes a summary of the whole sequence, encoding all the important information needed to predict the next word

![Attention Last Vector](assets/12-attentionlastvector.png)
![Attention Last Vector](assets/Transformers/12-attentionlastvector.png)

### Single-Head Attention

Goal: a series of computations to produce a new, refined set of embeddings

ex. Have nouns ingest the meanings of their corresponding adjectives

![Attention Embeddings](assets/13-attentionEmbeds.png)
![Attention Embeddings](assets/Transformers/13-attentionEmbeds.png)

#### Query

@@ -481,35 +481,35 @@ Steps:
6. Use attention scores to weight the Value vectors
7. Output the result of step 6.

![Query W1](assets/14-queryW1.png)
![Query Key 1](assets/15-queryKey1.png)
![Query Key 2](assets/16-queryKey2.png)
![Query W1](assets/Transformers/14-queryW1.png)
![Query Key 1](assets/Transformers/15-queryKey1.png)
![Query Key 2](assets/Transformers/16-queryKey2.png)

The higher the dot product, the more relevant the Query is to the Key (i.e. how relevant one word is to another in the sentence); a toy numerical sketch follows the images below.

![QK Matrix 1](assets/17-qKmatrix1.png)
![QK Matrix 2](assets/18-qKmatrix2.png)
![QK Matrix 3](assets/19-qKmatrix3.png)
![QK Matrix 4](assets/20-qKmatrix4.png)
![QK Matrix 1](assets/Transformers/17-qKmatrix1.png)
![QK Matrix 2](assets/Transformers/18-qKmatrix2.png)
![QK Matrix 3](assets/Transformers/19-qKmatrix3.png)
![QK Matrix 4](assets/Transformers/20-qKmatrix4.png)
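
A toy numerical sketch of the Query-Key scoring described in the steps above (the dimensions, weights, and embeddings are random illustrations, not actual model parameters; the division by sqrt(d_k) is the standard scaling used in scaled dot-product attention):

```python
import numpy as np

d_model, d_k = 4, 3    # toy embedding size and query/key size
seq_len = 3            # e.g. a 3-token sequence

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))  # token embeddings, one row per token
W_Q = rng.normal(size=(d_model, d_k))    # query projection (illustrative weights)
W_K = rng.normal(size=(d_model, d_k))    # key projection (illustrative weights)

Q = X @ W_Q   # one Query vector per token
K = X @ W_K   # one Key vector per token

# Entry (i, j) scores how relevant token j's Key is to token i's Query
scores = Q @ K.T / np.sqrt(d_k)
print(scores.shape)  # (3, 3)
```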

### Masking

Masking prevents later tokens from influencing earlier ones during the training process. This is done by setting the attention entries that would let a token see later tokens to -infinity, so that when softmax is applied they are turned to 0.

![Masking](assets/23-masking.png)
![Masking](assets/Transformers/23-masking.png)

Why mask?
- During the training process, every possible subsequence is trained/predicted on for efficiency.
- One training example effectively acts as many.
- This means we never want to allow later words to influence earlier words (because they would essentially "give away" the answer to the predictions)

![Subsequence Training](assets/21-subsequenceTraining.png)
![Subsequence Training](assets/Transformers/21-subsequenceTraining.png)

### Softmax

After masking, softmax (normalization) is applied. Masking was done to ensure that later tokens do not affect earlier tokens in the training process: the masked entries are set to -infinity during the masking phase, to be transformed into 0 with softmax.

![Masking and Softmax](assets/22-maskingANDsoftmax.png)
![Masking and Softmax](assets/Transformers/22-maskingANDsoftmax.png)

### Value

@@ -523,11 +523,11 @@ Value: vector that holds the actual info that will be passed along the next laye
- continuing with the sentence "The cat sat on the mat", if "sat" (Key) is deemed important for "cat" (Query), the Value associated with "sat" will contribute significantly to the final representation of "cat"
- this helps the model understand that "cat" is related to the action of "sitting"

![Value Matrix](assets/24-valueMatrix.png)
![Value Embedding 1](assets/25-valueEmbedding1.png)
![Value Embedding 2](assets/26-valueEmbedding2.png)
![Value Embedding 3](assets/27-valueEmbedding3.png)
![Value Embedding 4](assets/28-valueEmbedding4.png)
![Value Matrix](assets/Transformers/24-valueMatrix.png)
![Value Embedding 1](assets/Transformers/25-valueEmbedding1.png)
![Value Embedding 2](assets/Transformers/26-valueEmbedding2.png)
![Value Embedding 3](assets/Transformers/27-valueEmbedding3.png)
![Value Embedding 4](assets/Transformers/28-valueEmbedding4.png)
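
Putting Query, Key, Value, masking, and softmax together, here is a compact sketch of one masked attention head (random toy weights, not actual model parameters; in a full transformer the result would be projected and added back into the embeddings):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_masked_attention(X, W_Q, W_K, W_V):
    seq_len = X.shape[0]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V

    # Query-Key relevance scores, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])

    # Causal mask: entries that would let a token attend to later tokens
    # are set to -infinity, so softmax turns them into 0
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)

    weights = softmax(scores, axis=-1)  # attention scores per token
    return weights @ V                  # weight the Value vectors

# Toy usage: 4 tokens, embedding size 8, head size 4 (all values illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out = single_head_masked_attention(X, W_Q, W_K, W_V)
print(out.shape)  # (4, 4): one updated vector per token
```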

## Multi-Head Attention

54 changes: 27 additions & 27 deletions Transformer/README.md
@@ -12,7 +12,7 @@ A collection of real numbers, which could be:
### Output:
A probability distribution over all potential next tokens

![Output Example](../assets/4-outputEX.png)
![Output Example](../assets/Transformers/4-outputEX.png)

## Tokens

@@ -22,9 +22,9 @@ Tokens are "little pieces" of information (ex. words, combinations of words, sou
- encodes the meaning of that piece
- ex. in considering these vectors as coordinates, words with similar meanings tend to land near each other

![Tokens](../assets/2-tokens.png)
![Token Vectors](../assets/3-tokenvectors.png)
![Coordinate Tokens](../assets/1-coordinateTokens.png)
![Tokens](../assets/Transformers/2-tokens.png)
![Token Vectors](../assets/Transformers/3-tokenvectors.png)
![Coordinate Tokens](../assets/Transformers/1-coordinateTokens.png)

## Embeddings

@@ -42,7 +42,7 @@ See `embedding_notes.ipynb` for more on embeddings!
Below is an image of the embedding matrix. Each word corresponds to a specific vector, with no reference to its context.
It is the Attention block's responsibility to update a word's vector with its context. (to be discussed later)

![Embedding Matrix](../assets/10-embeddingmatrix.png)
![Embedding Matrix](../assets/Transformers/10-embeddingmatrix.png)

## Positional Encoders

@@ -82,8 +82,8 @@ print(np.exp(seq)/np.sum(np.exp(seq)))
# [0.03511903 0.25949646 0.70538451]
```

![Softmax Equation](../assets/8-softmaxEqn.png)
![Softmax](../assets/6-softmax.png)
![Softmax Equation](../assets/Transformers/8-softmaxEqn.png)
![Softmax](../assets/Transformers/6-softmax.png)

## Temperature

@@ -92,8 +92,8 @@ With softmax, the constant T added to the denominator of the exponents of e in t
- Makes the softmax outputs LESS extreme towards 0 and 1
- This enables more varied text to be generated, differing from one generation to the next

![Softmax with Temperature](../assets/7-softmaxT.png)
![Logits](../assets/9-logits.png)
![Softmax with Temperature](../assets/Transformers/7-softmaxT.png)
![Logits](../assets/Transformers/9-logits.png)

## Attention

@@ -111,21 +111,21 @@ Updates a word's embedding vector in reference to its context. Enables the trans

Prior to Attention, the embedding vector of each word is consistent, regardless of its context (embedding matrix). Therefore, the motivation of Attention is to update a word's embedding vector depending on its context (i.e. surrounding tokens) to capture this specific contextual instance of the word

![Attention](../assets/10-embeddingmatrix.png)
![Attention](../assets/Transformers/10-embeddingmatrix.png)

The computation to predict the next token relies entirely on the final vector of the current sequence

Initially, this vector corresponds to the embedding of the last word in the sequence. As the sequence passes through the model's attention blocks, the final vector is updated to include information from the entire sequence, not just the last word. This updated vector becomes a summary of the whole sequence, encoding all the important information needed to predict the next word

![Attention Last Vector](../assets/12-attentionlastvector.png)
![Attention Last Vector](../assets/Transformers/12-attentionlastvector.png)

### Single-Head Attention

Goal: a series of computations to produce a new, refined set of embeddings

ex. Have nouns ingest the meanings of their corresponding adjectives

![Attention Embeddings](../assets/13-attentionEmbeds.png)
![Attention Embeddings](../assets/Transformers/13-attentionEmbeds.png)

#### Query

@@ -165,35 +165,35 @@ Steps:
6. Use attention scores to weight the Value vectors
7. Output the result of step 6.

![Query W1](../assets/14-queryW1.png)
![Query Key 1](../assets/15-queryKey1.png)
![Query Key 2](../assets/16-queryKey2.png)
![Query W1](../assets/Transformers/14-queryW1.png)
![Query Key 1](../assets/Transformers/15-queryKey1.png)
![Query Key 2](../assets/Transformers/16-queryKey2.png)

The higher the dot product, the more relevant the Query is to the Key (i.e. how relevant one word is to another in the sentence)

![QK Matrix 1](../assets/17-qKmatrix1.png)
![QK Matrix 2](../assets/18-qKmatrix2.png)
![QK Matrix 3](../assets/19-qKmatrix3.png)
![QK Matrix 4](../assets/20-qKmatrix4.png)
![QK Matrix 1](../assets/Transformers/17-qKmatrix1.png)
![QK Matrix 2](../assets/Transformers/18-qKmatrix2.png)
![QK Matrix 3](../assets/Transformers/19-qKmatrix3.png)
![QK Matrix 4](../assets/Transformers/20-qKmatrix4.png)

### Masking

Masking prevents later tokens from influencing earlier ones during the training process. This is done by setting the attention entries that would let a token see later tokens to -infinity, so that when softmax is applied they are turned to 0.

![Masking](../assets/23-masking.png)
![Masking](../assets/Transformers/23-masking.png)

Why mask?
- During the training process, every possible subsequence is trained/predicted on for efficiency.
- One training example effectively acts as many.
- This means we never want to allow later words to influence earlier words (because they would essentially "give away" the answer to the predictions)

![Subsequence Training](../assets/21-subsequenceTraining.png)
![Subsequence Training](../assets/Transformers/21-subsequenceTraining.png)

### Softmax

After masking, softmax (normalization) is applied. Masking was done to ensure that later tokens do not affect earlier tokens in the training process: the masked entries are set to -infinity during the masking phase, to be transformed into 0 with softmax.

![Masking and Softmax](../assets/22-maskingANDsoftmax.png)
![Masking and Softmax](../assets/Transformers/22-maskingANDsoftmax.png)

### Value

@@ -207,11 +207,11 @@ Value: vector that holds the actual info that will be passed along the next laye
- continuing with the sentence "The cat sat on the mat", if "sat" (Key) is deemed important for "cat" (Query), the Value associated with "sat" will contribute significantly to the final representation of "cat"
- this helps the model understand that "cat" is related to the action of "sitting"

![Value Matrix](../assets/24-valueMatrix.png)
![Value Embedding 1](../assets/25-valueEmbedding1.png)
![Value Embedding 2](../assets/26-valueEmbedding2.png)
![Value Embedding 3](../assets/27-valueEmbedding3.png)
![Value Embedding 4](../assets/28-valueEmbedding4.png)
![Value Matrix](../assets/Transformers/24-valueMatrix.png)
![Value Embedding 1](../assets/Transformers/25-valueEmbedding1.png)
![Value Embedding 2](../assets/Transformers/26-valueEmbedding2.png)
![Value Embedding 3](../assets/Transformers/27-valueEmbedding3.png)
![Value Embedding 4](../assets/Transformers/28-valueEmbedding4.png)

## Multi-Head Attention

Binary file added assets/ML_Interviews/1-standardd.png
Binary file added assets/ML_Interviews/2-stack.png
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
