Wednesday, June 17, 2026

Basic Attention Operation (Additive attention)


🧠 Basic Attention Operation

A Layman's Guide to Understanding & Implementing Attention from Scratch

0. The Big Picture: What Problem Does Attention Solve?

Imagine you are translating a long French sentence into English. The old-school way (before attention was invented) worked like this: you would read the entire French sentence, compress its meaning into a single, fixed-size summary in your head, and then spit out the English translation from that summary alone. This is hard—especially for long sentences—because that little summary can't possibly hold all the details. Important nuances get lost.

Attention changes the game. Instead of forcing the model to cram everything into one final thought, attention lets the decoder (the part that writes the output) look back at every single input word whenever it needs to produce the next output word. Think of it as the model having a highlighter: at each step of generating the translation, it highlights which input words are most relevant right now.

🔎 Everyday Analogy: You're at a crowded party. Someone across the room is talking to you, but there's a lot of background noise. Your brain does "attention" naturally—it focuses on the speaker's voice while tuning out the clinking glasses and other conversations. In a machine translation model, the "speaker's voice" is the relevant input word, and the "background noise" is the less-important input words. Attention is just a mathematical way of doing that focusing.

In this ungraded lab, you'll implement a basic version of attention from the influential paper by Bahdanau et al. (2014). You'll use only NumPy, with no deep learning frameworks like PyTorch or TensorFlow, so you can see every moving part up close. The entire attention mechanism breaks down into three simple steps:

Step 1 – Alignment Scores: Compare the decoder's current "mood" (hidden state) with each encoder word's "meaning" (hidden state) to get raw similarity numbers.
Step 2 – Weights (Softmax): Turn those raw scores into percentages that add up to 100%, telling us exactly how much attention each input word deserves.
Step 3 – Context Vector: Use those percentages to blend the encoder's hidden states into one summary vector that captures what's important right now.

Let's walk through every line of code in this notebook, with plenty of examples and analogies.

1. The Setup Code: Imports and the Softmax Function

The very first code cell in the notebook sets up the tools we'll need:

import numpy as np

def softmax(x, axis=0):
    return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=axis), axis)

1.1 What is NumPy?

numpy (imported as np) is Python's numerical powerhouse. It gives us arrays: think of them as supercharged lists that can hold numbers in rows and columns, like a spreadsheet. Unlike regular Python lists, NumPy lets us do math on entire arrays at once. If you have an array of 1,000 numbers and want to multiply each by 2, you write arr * 2 and it's done in one go, with no explicit Python loop. This is called vectorized computation, and it underpins most modern machine learning code.
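To make the idea of vectorized computation concrete, here is a minimal sketch comparing a Python loop with the equivalent NumPy expression. The variable names are illustrative and not part of the notebook.

```python
import numpy as np

values = list(range(1000))              # a plain Python list: 0, 1, ..., 999

# Loop-based version: visit every element explicitly
doubled_loop = [v * 2 for v in values]

# Vectorized version: one expression applied to the whole array
arr = np.array(values)
doubled_vec = arr * 2

print(doubled_vec[:5])                                       # first five results
print(np.array_equal(doubled_vec, np.array(doubled_loop)))   # both approaches agree
```

Both versions produce the same numbers; the vectorized one simply hands the loop to NumPy's compiled code.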

1.2 The Softmax Function — Turning Numbers into "Attention Percentages"

The softmax function is one of the most important functions in all of deep learning. Let's build an intuition for it step by step.

🍕 Pizza Analogy: Imagine three friends ordering pizza. Friend A is starving and rates their hunger as 10. Friend B is moderately hungry and rates it as 5. Friend C just ate and rates their hunger as 1. How do we split one pizza? We could just say A gets 10 slices, B gets 5, C gets 1, but that's 16 slices, not one pizza! Softmax solves this: it converts those raw "hunger scores" (10, 5, 1) into percentages that sum to 100%. It is not a simple proportional split, though. Dividing each score by the total of 16 would give roughly 62%, 31%, and 6%, whereas softmax exponentiates first, so the biggest score dominates: A gets ~99.3%, B gets ~0.7%, and C gets ~0.01%. The shares still add up to exactly one pizza, but the hungriest friend takes almost all of it.

Mathematically, softmax does this:

Exponentiate each number (np.exp(x)): This makes all numbers positive and amplifies differences. A small gap becomes a bigger gap.
Sum up all the exponentiated numbers (np.sum(np.exp(x), axis=axis)).
Divide each exponentiated number by that sum. Now every number is between 0 and 1, and they all add up to exactly 1.

🔢 Concrete Example: Let's say our raw scores are [2.0, 1.0, 0.1].

Step 1 – Exponentiate: e² ≈ 7.39, e¹ ≈ 2.72, e⁰·¹ ≈ 1.11 → [7.39, 2.72, 1.11]
Step 2 – Sum: 7.39 + 2.72 + 1.11 = 11.22
Step 3 – Divide: [7.39/11.22, 2.72/11.22, 1.11/11.22] → [0.66, 0.24, 0.10] (roughly)

Result: 66% attention to item 1, 24% to item 2, 10% to item 3. Notice they sum to 100%.
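The three steps above can be reproduced in a few lines. This is a small sketch that repeats the notebook's softmax definition so it runs on its own; the intermediate names (exps, total, weights) are illustrative.

```python
import numpy as np

def softmax(x, axis=0):
    return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=axis), axis)

scores = np.array([2.0, 1.0, 0.1])

exps = np.exp(scores)        # Step 1: exponentiate
total = exps.sum()           # Step 2: sum
weights = exps / total       # Step 3: divide

print(np.round(exps, 2))
print(np.round(weights, 2))
print(np.allclose(weights, softmax(scores)))   # manual steps match softmax()
```

The unrounded sum is about 11.21; the 11.22 in the walkthrough comes from adding the already-rounded terms. The resulting weights are the same to two decimal places either way.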

The axis parameter controls which direction the softmax is applied. axis=0 means "each column sums to 1" (operate down the rows), and axis=1 means "each row sums to 1" (operate across the columns). In our attention code, we'll use the default axis=0 because our scores come as a column vector (shape [5, 1]).

The function uses np.expand_dims to make sure the division broadcasts correctly—this is a NumPy technicality that ensures the shapes align. Don't worry too much about it; just know that it makes the math work.
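A short sketch may help show what the axis argument and np.expand_dims are doing. The 2×3 grid below is made-up data, chosen only so that the two axes have different lengths.

```python
import numpy as np

def softmax(x, axis=0):
    return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=axis), axis)

grid = np.array([[1.0, 2.0, 3.0],
                 [1.0, 1.0, 1.0]])        # shape (2, 3)

down_rows = softmax(grid, axis=0)         # normalize within each column
across_cols = softmax(grid, axis=1)       # normalize within each row

print(down_rows.sum(axis=0))              # every column sums to 1
print(across_cols.sum(axis=1))            # every row sums to 1

# Summing along axis 1 drops that axis: the result has shape (2,).
# expand_dims restores it as (2, 1), which broadcasts against (2, 3).
row_sums = np.sum(np.exp(grid), axis=1)
print(row_sums.shape, np.expand_dims(row_sums, 1).shape)
```

Without the expand_dims call, dividing a (2, 3) array by a (2,) array would fail, because NumPy aligns shapes from the last dimension and 3 does not match 2.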

2. The Synthetic Data: Our "Toy" Example

Before we dive into the attention code, the notebook sets up some synthetic (fake) data so we can test our functions:

hidden_size = 16
attention_size = 10
input_length = 5

np.random.seed(42)

encoder_states = np.random.randn(input_length, hidden_size)
decoder_state = np.random.randn(1, hidden_size)

layer_1 = np.random.randn(2*hidden_size, attention_size)
layer_2 = np.random.randn(attention_size, 1)

2.1 What Do These Numbers Mean?

Variable	Value	What It Represents
`hidden_size = 16`	16	The size of each "thought vector." Every word gets encoded as 16 numbers that capture its meaning. In a real translation model, this might be 256, 512, or even 1024.
`attention_size = 10`	10	The size of the "attention network's" middle layer. Think of it as 10 little "judges" that help decide how relevant each word is.
`input_length = 5`	5	Our pretend input sentence has 5 words. For example: "The cat sat on mat" → 5 word-vectors.

2.2 The Arrays, Explained

📦 encoder_states – shape (5, 16)

This is a 5×16 matrix. Each of the 5 rows is one input word's "meaning vector"—16 numbers that encode what that word means in context. Row 0 = word 1 ("The"), row 1 = word 2 ("cat"), etc. This is what the encoder produced after reading the entire input sentence.

🧭 decoder_state – shape (1, 16)

This is the decoder's current "state of mind"—a single 16-number vector representing what it has produced so far and what it's thinking about next. In a real model, this updates after every output word. For this lab, we just have one decoder state to work with.

⚖️ layer_1 – shape (32, 10)

The first layer of our tiny neural network. It takes the concatenated encoder+decoder information (32 numbers = 16+16) and transforms it into 10 "relevance features." These are weights—numbers that get multiplied with the input. In a real model, they'd be learned through training.

⚖️ layer_2 – shape (10, 1)

The second layer. It takes those 10 relevance features and condenses them into a single score per word. Again, these are weights that would be learned during training.

np.random.seed(42) ensures that every time you run this notebook, you get the same random numbers. This is important for reproducibility—so the expected outputs in the notebook always match.
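As a quick sanity check on the shapes and on the effect of the seed, the setup can be wrapped in a helper and run twice. Note that make_data is a hypothetical helper introduced here for illustration; the notebook itself defines these arrays at the top level.

```python
import numpy as np

hidden_size = 16
attention_size = 10
input_length = 5

def make_data(seed=42):
    # Hypothetical helper: same statements as the notebook, in the same order
    np.random.seed(seed)
    encoder_states = np.random.randn(input_length, hidden_size)
    decoder_state = np.random.randn(1, hidden_size)
    layer_1 = np.random.randn(2 * hidden_size, attention_size)
    layer_2 = np.random.randn(attention_size, 1)
    return encoder_states, decoder_state, layer_1, layer_2

first = make_data()
second = make_data()

names = ["encoder_states", "decoder_state", "layer_1", "layer_2"]
for name, arr in zip(names, first):
    print(name, arr.shape)

# Same seed, same draws: every array matches element for element
print(all(np.array_equal(a, b) for a, b in zip(first, second)))
```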

3. Step 1 — Calculating Alignment Scores

This is the heart of the attention mechanism. We need to answer: "For each input word, how relevant is it to what the decoder is trying to produce right now?"

🎯 Target Analogy: Imagine you're playing darts. You (the decoder) stand at the throwing line with a dart in hand. On the wall, there are 5 dartboards (the 5 encoder words). Before you throw, you need to aim at each board and decide which one you want to hit. The alignment score is essentially your "aim quality" for each board—a high score means you're well-aligned with that board; a low score means you're not really aiming at it.

3.1 The Mathematical Formula

$$e_{ij} = v_a^\top \tanh\left(W_a s_{i-1} + U_a h_j\right)$$

This looks intimidating, so let's unpack it piece by piece:

h_j – The hidden state of the j-th encoder word (the "meaning" of input word j). In our code, this is encoder_states[j], a vector of 16 numbers.
s_i−1 – The decoder's hidden state from the previous step. In our code, decoder_state, also 16 numbers.
W_a and U_a – Weight matrices that transform the decoder state and encoder state respectively. In our code, we concatenate the two states and use one big weight matrix (our layer_1), which is mathematically equivalent to applying two separate matrices and adding the results.
tanh – The hyperbolic tangent function. It squashes numbers into the range (-1, 1). It adds non-linearity, which allows the network to learn complex relationships.
v_a – A final weight vector (our layer_2) that collapses everything into a single score.
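The formula uses two matrices (W_a and U_a) while the code uses one (layer_1). The sketch below checks that the two views agree, assuming the encoder state comes first in the concatenation as it does in the notebook. The names U_part and W_part are illustrative, the data uses a different random generator than the notebook, and the matrices appear in transposed (row-vector) form compared with the formula.

```python
import numpy as np

rng = np.random.default_rng(0)            # illustrative data, not the notebook's
hidden_size, attention_size, input_length = 16, 10, 5
encoder_states = rng.standard_normal((input_length, hidden_size))
decoder_state = rng.standard_normal((1, hidden_size))
layer_1 = rng.standard_normal((2 * hidden_size, attention_size))

# Split the big matrix by rows: the top half meets the encoder numbers,
# the bottom half meets the decoder numbers.
U_part = layer_1[:hidden_size]            # plays the role of U_a (acts on h_j)
W_part = layer_1[hidden_size:]            # plays the role of W_a (acts on s_{i-1})

# Two separate products; the (1, 10) decoder term broadcasts over 5 rows
separate = encoder_states @ U_part + decoder_state @ W_part

# One product on the concatenated input, as in the notebook
stacked = np.concatenate(
    (encoder_states, decoder_state.repeat(input_length, axis=0)), axis=1
) @ layer_1

print(np.allclose(separate, stacked))     # the two formulations agree
```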

3.2 The Implementation, Line by Line

Here's the alignment function we need to fill in:

def alignment(encoder_states, decoder_state):
    # Step A: Concatenate
    inputs = np.concatenate(
        (encoder_states, decoder_state.repeat(input_length, axis=0)),
        axis=1
    )
    
    # Step B: First layer + tanh
    activations = np.tanh(np.matmul(inputs, layer_1))
    
    # Step C: Second layer (no tanh)
    scores = np.matmul(activations, layer_2)
    
    return scores

Step A — Concatenation: Gluing Two Things Together

We have encoder_states which is shape (5, 16) and decoder_state which is shape (1, 16). We need to combine them so that each encoder word gets paired with the same decoder state. This is like putting two spreadsheets side by side.

📊 Visual Example: Imagine the encoder has 3 words (not 5) and hidden size is 4 (not 16):

Encoder states (3×4):

Word 0: [a₁, a₂, a₃, a₄]
Word 1: [b₁, b₂, b₃, b₄]
Word 2: [c₁, c₂, c₃, c₄]

Decoder state (1×4):

        [d₁, d₂, d₃, d₄]

After repeating the decoder state (3×4):

        [d₁, d₂, d₃, d₄]
        [d₁, d₂, d₃, d₄]
        [d₁, d₂, d₃, d₄]

After concatenation (3×8):

Word 0: [a₁, a₂, a₃, a₄, d₁, d₂, d₃, d₄]
Word 1: [b₁, b₂, b₃, b₄, d₁, d₂, d₃, d₄]
Word 2: [c₁, c₂, c₃, c₄, d₁, d₂, d₃, d₄]

decoder_state.repeat(input_length, axis=0) copies the decoder state 5 times (once for each input word) so it becomes shape (5, 16). Then np.concatenate(..., axis=1) glues the encoder states (5×16) and the repeated decoder states (5×16) side by side along axis 1 (the column axis), producing a (5, 32) matrix. Each of the 5 rows now has 32 numbers: 16 from an encoder word + 16 from the decoder state.
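Here is the same repeat-then-concatenate pattern on the small 3-word, size-4 case from the visual example. The numbers are placeholders chosen so the encoder and decoder parts are easy to tell apart.

```python
import numpy as np

encoder_states = np.arange(12).reshape(3, 4)        # 3 words, hidden size 4
decoder_state = np.array([[100, 200, 300, 400]])    # shape (1, 4)

# Copy the single decoder row once per word
repeated = decoder_state.repeat(3, axis=0)          # shape (3, 4)

# Glue side by side along the column axis
inputs = np.concatenate((encoder_states, repeated), axis=1)

print(repeated.shape)
print(inputs.shape)
print(inputs)
```

Each printed row starts with one word's four encoder numbers and ends with the same four decoder numbers.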

🧩 Puzzle Analogy: Think of each encoder word as a puzzle piece with a unique shape (16 numbers). The decoder state is like a "target shape" you're trying to match. Concatenation puts each puzzle piece next to the target shape so you can compare them. It's like laying a key next to each lock to see which one fits best.

Step B — Matrix Multiplication + tanh: The "Comparison Engine"

np.matmul(inputs, layer_1) performs matrix multiplication between our concatenated data (5, 32) and the first layer weights (32, 10). The result is a (5, 10) matrix.

What does matrix multiplication actually do here? For each word, each of the 10 outputs is a weighted sum of that word's 32 input numbers, using one column of layer_1 as the weights. You can think of the 10 outputs as 10 different "aspects of relevance" that a trained network might check. The examples below are purely illustrative; with our random weights the features carry no such meaning:

Aspect 1: "Do the subject nouns match?"
Aspect 2: "Is the tense consistent?"
Aspect 3: "Are the word positions close?"
...and 7 more learned aspects.

After the multiplication, we apply np.tanh(). The tanh function looks like an S-curve:

Very negative numbers → close to -1
Zero → 0
Very positive numbers → close to +1

Tanh serves two purposes: (1) it bounds the values so nothing explodes to infinity, and (2) it adds non-linearity. Without non-linearity, stacking layers would be pointless—two linear transformations in a row are equivalent to just one. Tanh breaks that linearity, letting the network learn richer patterns.
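The claim that two linear layers collapse into one can be checked numerically. This is a sketch on random matrices with the same shapes as the lab's; the names A and B stand in for layer_1 and layer_2 and are not the notebook's values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 32))     # stand-in for the concatenated inputs
A = rng.standard_normal((32, 10))    # stand-in for layer_1
B = rng.standard_normal((10, 1))     # stand-in for layer_2

# Without an activation, two layers equal one layer with weights A @ B
two_linear = (x @ A) @ B
one_linear = x @ (A @ B)
print(np.allclose(two_linear, one_linear))

# With tanh in between, the collapsed matrix no longer gives the same answer
with_tanh = np.tanh(x @ A) @ B
print(np.allclose(with_tanh, one_linear))

# tanh also bounds the hidden activations
print(np.abs(np.tanh(x @ A)).max() <= 1.0)
```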

Step C — Second Layer: Producing a Single Score

np.matmul(activations, layer_2) multiplies our (5, 10) activations with the (10, 1) weights of layer_2. The result is a (5, 1) column vector—one number per input word.

No tanh here! The notebook explicitly reminds us: "Remember that you don't need tanh here." Why? Because we want raw scores that can be any real number (positive or negative). The softmax in the next step will handle normalizing them. If we applied tanh, all scores would be squashed between -1 and 1, so no word's weight could exceed another's by more than a factor of e² ≈ 7.4, and the softmax could never focus sharply on one word.
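To see the flattening effect, the sketch below runs softmax on a set of hypothetical raw scores and on the same scores after tanh. The scores are made up for illustration and are not the notebook's output.

```python
import numpy as np

def softmax(x, axis=0):
    return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=axis), axis)

# Hypothetical raw alignment scores, one per input word
scores = np.array([[4.0], [6.0], [4.0], [2.0], [1.0]])

raw_weights = softmax(scores)                 # scores used as-is
squashed_weights = softmax(np.tanh(scores))   # scores squashed into (-1, 1) first

print(np.round(raw_weights.ravel(), 3))
print(np.round(squashed_weights.ravel(), 3))
```

With the raw scores, one word takes most of the weight. After tanh, the scores all sit close to 1, so the five weights end up near 0.2 each.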

🎯 Expected Output: If you implemented the function correctly, running the test cell should print:

[[4.35790943]
 [5.92373433]
 [4.18673175]
 [2.11437202]
 [0.95767155]]

Interpretation: Word 2 (index 1, score 5.92) is the most relevant to the decoder's current state. Word 5 (index 4, score 0.96) is the least relevant. But these are raw scores—they're not percentages yet. That's the next step!

4. Step 2 — Turning Alignment Scores into Attention Weights

Now we have 5 raw scores, one per input word. But they're on an arbitrary scale. Word 2's score of 5.92—is that a lot? A little? We need to convert these into percentages that sum to 100%, so we know exactly how to split our attention.

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{K} \exp(e_{ik})}$$

This is exactly what our softmax function does. Let's trace through the math with our actual expected scores:

🔢 Step-by-step with expected scores:

Scores:        [4.36,  5.92,  4.19,  2.11,  0.96]

exp(scores):   [78.1,  373.8,  65.8,   8.3,   2.6]
               (approximate values, computed from the unrounded scores)

Sum of exps:   78.1 + 373.8 + 65.8 + 8.3 + 2.6 = 528.6

Weights:       [78.1/528.6, 373.8/528.6, 65.8/528.6, 8.3/528.6, 2.6/528.6]
             = [0.148,      0.707,       0.124,      0.016,     0.005]

Interpretation: ~14.8% attention on word 1
                ~70.7% attention on word 2  ←  THE MOST IMPORTANT WORD!
                ~12.4% attention on word 3
                 ~1.6% attention on word 4
                 ~0.5% attention on word 5

📢 Megaphone Analogy: Imagine you're at a concert. Five singers are on stage, each singing at a different volume. The sound engineer has a mixing board with 5 sliders. The alignment scores are like the raw volume readings from each microphone. The softmax is like the engineer adjusting the sliders so the total output volume is constant, but each singer contributes a different proportion. Singer 2 (70.7%) is the lead vocalist—you hear them the most. Singer 5 (0.5%) is barely audible.

Key property: The exponential function (eˣ) makes softmax peaky. Notice how a score of 5.92 becomes weight 0.707 while 0.96 becomes 0.005. The exponential amplifies differences, so the model can be very decisive about where to focus.
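The hand calculation can be checked by feeding the notebook's expected alignment scores straight into softmax. This sketch repeats the softmax definition so it is self-contained; the scores are copied from the expected output in Step 1.

```python
import numpy as np

def softmax(x, axis=0):
    return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=axis), axis)

# Expected alignment scores from Step 1, shape (5, 1)
scores = np.array([[4.35790943],
                   [5.92373433],
                   [4.18673175],
                   [2.11437202],
                   [0.95767155]])

weights = softmax(scores)                 # default axis=0: normalize down the column

print(np.round(weights.ravel(), 3))       # one weight per input word
print(weights.sum())                      # the weights sum to 1 (up to float rounding)
print(int(np.argmax(weights)))            # index of the most-attended word
```

The largest weight lands on index 1 (word 2), matching the interpretation above.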

5. Step 3 — Weighting & Summing: The Context Vector

This is the final step. We have our attention weights (percentages), and we have our encoder hidden states (meaning vectors for each word). Now we combine them:

$$c_i = \sum_{j=1}^{K} \alpha_{ij} h_j$$

In plain English: "Multiply each word's meaning vector by its attention weight, then add them all up."

5.1 The Implementation

def attention(encoder_states, decoder_state):
    # Step 1: Get alignment scores
    scores = alignment(encoder_states, decoder_state)
    
    # Step 2: Convert scores to weights via softmax
    weights = softmax(scores)
    
    # Step 3: Multiply each encoder vector by its weight
    weighted_scores = encoder_states * weights
    
    # Step 4: Sum up to get one context vector
    context = np.sum(weighted_scores, axis=0)
    
    return context

Step 3 — Element-wise Multiplication

encoder_states * weights uses NumPy broadcasting. encoder_states is (5, 16) and weights is (5, 1). NumPy automatically stretches the weights across all 16 columns, so each row of the encoder states gets multiplied by its corresponding weight.

📊 Visual Example (simplified to 3 words, hidden size 4):

Encoder states (3×4):         Weights (3×1):
[a₁, a₂, a₃, a₄]              [0.15]
[b₁, b₂, b₃, b₄]              [0.70]
[c₁, c₂, c₃, c₄]              [0.15]

Weighted states (3×4):
[0.15×a₁, 0.15×a₂, 0.15×a₃, 0.15×a₄]    ← word 1, scaled down to 15%
[0.70×b₁, 0.70×b₂, 0.70×b₃, 0.70×b₄]    ← word 2, LOUD AND CLEAR at 70%
[0.15×c₁, 0.15×c₂, 0.15×c₃, 0.15×c₄]    ← word 3, scaled down to 15%

Notice how the middle row (word 2, weight 0.70) keeps most of its original magnitude, while words 1 and 3 are significantly dampened. This is the model saying: "Word 2 matters most right now, so I'll keep most of its information. Words 1 and 3? Meh, I'll take just a little from each."
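The broadcasting step can be tried on the 3-word example with concrete numbers. The values below are placeholders; the explicit loop is included only to show what broadcasting does implicitly.

```python
import numpy as np

encoder_states = np.array([[1.0, 2.0, 3.0, 4.0],
                           [10.0, 20.0, 30.0, 40.0],
                           [100.0, 200.0, 300.0, 400.0]])   # shape (3, 4)
weights = np.array([[0.15], [0.70], [0.15]])                # shape (3, 1)

# The single weight column is stretched across all 4 columns
weighted = encoder_states * weights
print(weighted.shape)
print(weighted)

# Equivalent explicit version: scale one row at a time
by_hand = np.stack([w * row for w, row in zip(weights.ravel(), encoder_states)])
print(np.allclose(weighted, by_hand))
```

The (3, 1) shape matters here: a flat weights array of shape (3,) would not broadcast against (3, 4), because NumPy compares shapes starting from the last dimension.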

Step 4 — Summation

np.sum(weighted_scores, axis=0) adds up all 5 rows column-wise. The result is a single vector of 16 numbers—the context vector.
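Multiplying by the weights and then summing the rows is the same computation as a single matrix product. The sketch below checks this on random stand-in data with hypothetical weights; neither comes from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((5, 16))                  # stand-in states
weights = np.array([[0.15], [0.70], [0.10], [0.04], [0.01]])   # hypothetical, sums to 1

# Notebook style: scale each row, then add the rows together
context_sum = np.sum(encoder_states * weights, axis=0)         # shape (16,)

# Same result as one matrix product: (1, 5) @ (5, 16) -> (1, 16)
context_matmul = (weights.T @ encoder_states).ravel()

print(context_sum.shape)
print(np.allclose(context_sum, context_matmul))
```

The two-step version mirrors the explanation; the matrix-product version is the form usually seen in larger attention implementations.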

🍲 Soup Analogy: Imagine making a soup. You have 5 ingredients (words), each with its own flavor profile (hidden state). The attention weights tell you how much of each ingredient to add. Word 2 (weight 0.70) is like adding a full cup of carrots. Word 5 (weight 0.005) is like a tiny pinch of salt. When you stir everything together (np.sum), you get one pot of soup (the context vector) that tastes mostly of carrots but has subtle hints of the other ingredients. This soup is what the decoder "tastes" to decide on the next output word.

5.2 Expected Output

✅ If your attention function is correct, the context vector should be:

[-0.63514569  0.04917298 -0.43930867 -0.9268003   1.01903919 -0.43181409
  0.13365099 -0.84746874 -0.37572203  0.18279832 -0.90452701  0.17872958
 -0.58015282 -0.58294027 -0.75457577  1.32985756]

This 16-number vector is the attention-weighted summary of all 5 input words, biased toward whichever word the model deemed most relevant.

6. The Complete Picture: Tying Everything Together

Let's zoom out and see how all three steps connect:

Step 1 – Alignment Scores

Input: Encoder states (5×16) + Decoder state (1×16)

Operation: Concatenate → Multiply by layer_1 → tanh → Multiply by layer_2

Output: 5 raw scores (5×1)

Purpose: "How relevant is each input word to the decoder's current state?"

Step 2 – Attention Weights

Input: 5 raw alignment scores

Operation: Softmax (exponentiate → sum → divide)

Output: 5 percentages that sum to 1.0

Purpose: "What fraction of my attention should each word get?"

Step 3 – Context Vector

Input: Encoder states (5×16) + Attention weights (5×1)

Operation: Multiply each row by its weight → Sum all rows

Output: One context vector (16 numbers)

Purpose: "A weighted blend of all input meanings, focused on what matters now."

🎯 End Result

The context vector is fed into the decoder alongside its own hidden state to predict the next output word. In the next time step, the decoder updates its state, we recalculate attention (now with a new decoder state), and the cycle repeats.
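To illustrate the cycle of recalculating attention at each step, here is a self-contained sketch that runs all three steps for two different decoder states. The function name attend is hypothetical; it mirrors the notebook's attention() but takes the weight matrices as arguments instead of reading globals. The decoder states are random stand-ins, since the lab does not include a real decoder.

```python
import numpy as np

def softmax(x, axis=0):
    return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=axis), axis)

def attend(encoder_states, decoder_state, layer_1, layer_2):
    # Hypothetical wrapper around the three steps of the lab
    n = encoder_states.shape[0]
    inputs = np.concatenate(
        (encoder_states, decoder_state.repeat(n, axis=0)), axis=1
    )
    scores = np.tanh(inputs @ layer_1) @ layer_2        # Step 1: alignment scores
    weights = softmax(scores)                           # Step 2: attention weights
    context = np.sum(encoder_states * weights, axis=0)  # Step 3: context vector
    return context, weights

rng = np.random.default_rng(0)                          # illustrative data
encoder_states = rng.standard_normal((5, 16))
layer_1 = rng.standard_normal((32, 10))
layer_2 = rng.standard_normal((10, 1))

# The encoder states stay fixed; only the decoder state changes per step
for step in range(2):
    decoder_state = rng.standard_normal((1, 16))
    context, weights = attend(encoder_states, decoder_state, layer_1, layer_2)
    print(step, context.shape, np.round(weights.ravel(), 3))
```

Each step yields a different weight distribution over the same five encoder states, which is the point of recomputing attention as the decoder moves along.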

7. Why Is Attention Such a Big Deal?

Before attention, sequence-to-sequence models (the encoder-decoder networks used in early neural machine translation) had a critical bottleneck: the entire input sentence had to be compressed into a single fixed-size vector. For a 50-word sentence, that's like trying to summarize "War and Peace" on a Post-it note. Attention removes this bottleneck by giving the decoder direct access to every input word.

7.1 Three Revolutionary Benefits of Attention

Handles Long Sequences: No matter how long the input is (5 words or 500), attention can "look back" at any position. The context vector is always a fresh blend tailored to the current decoding step.
Interpretability: We can visualize the attention weights as a heatmap and actually see which input words the model focused on when producing each output word. This is incredibly valuable for debugging and understanding model behavior.
Better Translation Quality: Especially for languages with different word orders (like English → Japanese), attention lets the model reorder concepts naturally because it can attend to input words in any order.

7.2 From Bahdanau to Transformers

The attention mechanism you just implemented is additive attention (also called "Bahdanau attention"), published in 2014. It uses a small neural network (our layer_1 and layer_2) to compute alignment scores. A year later, a simpler variant called multiplicative attention (or "Luong attention") was proposed, which in its simplest form just uses a dot product: score = decoder_state · encoder_state.
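For comparison with the additive version, a dot-product score needs no extra weight matrices at all. This is a minimal sketch on random stand-in states, assuming the encoder and decoder hidden sizes match.

```python
import numpy as np

rng = np.random.default_rng(0)                       # illustrative data
encoder_states = rng.standard_normal((5, 16))
decoder_state = rng.standard_normal((1, 16))

# One dot product per encoder word: (5, 16) @ (16, 1) -> (5, 1)
scores = encoder_states @ decoder_state.T

# The same thing written word by word
by_hand = np.array([[np.dot(h, decoder_state[0])] for h in encoder_states])

print(scores.shape)
print(np.allclose(scores, by_hand))
```

The resulting (5, 1) scores can go through the same softmax and weighted sum as before; only the scoring step changes.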

Then in 2017, the famous paper "Attention Is All You Need" introduced the Transformer architecture, which uses a more sophisticated version called scaled dot-product attention. The Transformer discards recurrence entirely and relies solely on attention mechanisms. This is the architecture behind GPT, BERT, and virtually all modern large language models.

🌟 The lineage:
Bahdanau Attention (2014) – Additive attention, the one you just coded. The original breakthrough.
Luong Attention (2015) – Dot-product attention, simpler and faster.
Scaled Dot-Product Attention / Transformers (2017) – Multi-head self-attention. Powers ChatGPT, Claude, Gemini, and virtually all modern LLMs.

Every single one of these builds on the same core idea you just implemented: compare a query (decoder state) with keys (encoder states) to get scores, softmax into weights, and blend values (encoder states) into a context vector.
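The query/key/value pattern can be sketched as scaled dot-product attention in NumPy. This is a simplified illustration, not the Transformer's full implementation: it uses the raw states directly as query, keys, and values, whereas a real model first applies learned projections. The function names are hypothetical.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax; subtracting the row max avoids overflow in exp
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # compare each query with every key
    weights = softmax_rows(scores)        # one distribution per query
    return weights @ V, weights           # blend the values

rng = np.random.default_rng(0)            # illustrative data
encoder_states = rng.standard_normal((5, 16))
decoder_state = rng.standard_normal((1, 16))

# Query = decoder state, keys = values = encoder states
context, weights = scaled_dot_product_attention(
    decoder_state, encoder_states, encoder_states
)
print(context.shape, weights.shape)
print(weights.sum())
```

Compared with the lab's code, the small scoring network is replaced by a dot product divided by the square root of the key size; the softmax and the weighted sum are unchanged.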

8. Complete Code Walkthrough Recap

Here's every meaningful line of code in this notebook, explained one more time at a glance:

Line	What It Does	Layman's Translation
`import numpy as np`	Imports the NumPy library for fast array math.	"Bring in my Swiss Army knife for number crunching."
`def softmax(x, axis=0):`	Defines the softmax function.	"Here's my recipe for turning any list of numbers into percentages."
`np.exp(x)`	Applies eˣ to every element.	"Make all numbers positive and amplify the differences."
`np.sum(np.exp(x), axis=axis)`	Sums the exponentiated values along an axis.	"Add up all the amplified numbers to get a total."
`np.expand_dims(..., axis)`	Adds a dimension for correct broadcasting.	"Reshape the total so the division lines up properly."
`hidden_size = 16`	Sets the size of hidden state vectors.	"Each word's meaning is captured in 16 numbers."
`attention_size = 10`	Sets the middle layer size.	"We'll use 10 'relevance detectors' in the attention network."
`input_length = 5`	Sets how many encoder words we have.	"Our pretend sentence has 5 words."
`np.random.seed(42)`	Fixes the random number generator.	"Make sure we all get the same random numbers."
`encoder_states = np.random.randn(...)`	Creates fake encoder hidden states.	"Pretend the encoder already processed 5 words into 5 meaning-vectors."
`decoder_state = np.random.randn(...)`	Creates a fake decoder hidden state.	"Pretend the decoder has a current 'state of mind.'"
`layer_1 = np.random.randn(...)`	Creates fake first-layer weights (32→10).	"Pretend we've already learned these comparison weights."
`layer_2 = np.random.randn(...)`	Creates fake second-layer weights (10→1).	"Pretend we've learned how to combine the 10 features into one score."
`decoder_state.repeat(input_length, axis=0)`	Copies the decoder state 5 times.	"Clone the decoder state so each word can be compared to it."
`np.concatenate((enc, dec), axis=1)`	Glues encoder and decoder states side by side.	"Put each word's meaning next to the decoder state for comparison."
`np.matmul(inputs, layer_1)`	Matrix multiplication: inputs × weights.	"Run the combined info through the first layer of the attention network."
`np.tanh(...)`	Applies the tanh activation function.	"Squash the results between -1 and 1 to keep things stable."
`np.matmul(activations, layer_2)`	Matrix multiplication: activations × weights.	"Run through the second layer to get one score per word."
`softmax(scores)`	Converts raw scores to percentages.	"Turn the scores into a proper attention distribution (sums to 1)."
`encoder_states * weights`	Element-wise multiplication with broadcasting.	"Scale each word's meaning by how much attention it gets."
`np.sum(weighted_scores, axis=0)`	Sums all weighted rows into one vector.	"Blend everything into one context vector—the final attention summary."

9. Common Questions (and Their Answers)

Q: Why do we use tanh in the first layer but not the second?

The first layer's tanh adds non-linearity, which is essential for the network to learn complex patterns. The second layer produces raw scores that get fed into softmax. If we applied tanh to the second layer, all scores would fall between -1 and 1, so any two softmax weights could differ by at most a factor of e² ≈ 7.4. The model could never concentrate its attention sharply on one word, which defeats the purpose!

Q: Why is the hidden_size 16 and attention_size 10? Are these magic numbers?

Not at all! They're arbitrary choices for this educational example. In a real model, hidden_size might be 256, 512, or 1024, and attention_size is typically the same as hidden_size. The authors chose small numbers so the code runs instantly and the shapes are easy to reason about.

Q: What's the difference between encoder_states and decoder_state?

Encoder states are the "meanings" of the input words after the encoder has read the entire input sentence. There's one vector per input word. Decoder state is the "state of mind" of the output-producing side, representing what it has generated so far and what it's thinking about producing next. The attention mechanism bridges these two: "Given what I want to say next (decoder state), which input words should I pay attention to (encoder states)?"

Q: Are layer_1 and layer_2 learned during training?

Yes! In this lab, we use random numbers for simplicity. In a real model, these weights are initialized randomly and then gradually improved through backpropagation during training. The model learns which patterns in the encoder+decoder combination indicate relevance. After training on millions of sentence pairs, the weights encode sophisticated linguistic knowledge.

Q: How does this relate to the attention in ChatGPT?

ChatGPT uses self-attention within the Transformer architecture. The core math is very similar (query, key, and value projections; dot-product scores; softmax; weighted sum), but instead of comparing a decoder state to encoder states, self-attention compares each token to the other tokens in the same sequence (in GPT-style models, only to the tokens that come before it). This lets the model understand relationships like "the pronoun 'it' refers to the noun 'cat' mentioned 5 words ago." The basic building block you implemented here is the conceptual ancestor of all modern attention mechanisms.
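A rough sketch of single-head self-attention with a causal mask is shown below. Everything here is illustrative: the token vectors and the projection matrices W_q, W_k, W_v are random stand-ins (in a trained model they are learned), and real models add multiple heads, more layers, and other components.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax with the usual max-subtraction for stability
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
tokens = rng.standard_normal((seq_len, d_model))   # one vector per token

# Hypothetical projection matrices for queries, keys, and values
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))
Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v

# Every token is scored against every token in the same sequence
scores = Q @ K.T / np.sqrt(d_model)                # shape (4, 4)

# Causal mask: token i may only attend to tokens 0..i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = softmax_rows(scores)                     # masked entries become 0
output = weights @ V                               # one blended vector per token

print(np.round(weights, 2))
print(output.shape)
```

The printed weight matrix is lower-triangular: the first token can only attend to itself, while the last token spreads its attention over all four positions.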

10. Final Thoughts: You Just Built the Foundation of Modern AI

Take a moment to appreciate what you've done. With fewer than 30 lines of NumPy code, you've implemented the core mechanism that:

Revolutionized machine translation in 2014
Earned a citation count of over 30,000 for the Bahdanau paper
Paved the way for the Transformer architecture in 2017
Underlies nearly every modern large language model, including GPT-4, Claude, and Gemini

The three-step recipe you coded (score, weight, blend) is the same pattern used by models with hundreds of billions of parameters. They just do it many times in parallel (multi-head attention), stack many layers, and score with a scaled dot product instead of a small neural network. The fundamental operation is the same sequence you wrote:

Compare a query with keys to get scores
Softmax the scores into weights
Use the weights to blend values into an output

This notebook has given you the conceptual and practical foundation to understand all of it. Well done! 🎉
