Prompt
Role: You are an Expert Professor of Agentic AI
Task: Explain the attached PDF (Chapter 7 - Finetuning) in layman terms
Strictly Follow: Do not miss out any details while simplifying things.
Word Limit: Respond in about 5000 to 7000 words
Response
Chapter 7 — Finetuning (Explained Simply, Without Losing Depth)
1. What “Finetuning” Really Means (Big Picture)
Let’s start with the core idea.
Finetuning is the process of taking an already-trained AI model and teaching it to behave better for a specific purpose.
Think of it like this:
- Pre-training = going to school and learning everything (language, math, reasoning).
- Prompting = giving instructions like “please answer like a lawyer.”
- Finetuning = sending the model for professional training so it permanently learns that behavior.
Instead of just telling the model what to do at runtime (prompting), finetuning changes the model’s internal parameters (weights) so it naturally behaves the way you want.
This is fundamentally different from prompt-based methods discussed in earlier chapters, which do not change the model internally.
2. Why Finetuning Exists at All
Large language models already know a lot. So why finetune?
Because:
- They don’t always follow instructions reliably
- They may not produce output in the exact format you need
- They may not specialize well in niche or proprietary tasks
- They may behave inconsistently across prompts
Finetuning helps with:
- Instruction following
- Output formatting (JSON, YAML, code, DSLs)
- Domain specialization (legal, medical, finance)
- Bias mitigation
- Safety alignment
In short:
Prompting changes what you ask.
Finetuning changes how the model thinks.
3. Finetuning Is Transfer Learning
Finetuning is not a new idea. It’s a form of transfer learning, first described in the 1970s.
Human analogy:
If you already know how to play the piano, learning the guitar is easier.
AI analogy:
If a model already knows language, learning legal Q&A requires far fewer examples.
This is why:
- Training a legal QA model from scratch might need millions of examples
- Finetuning a pretrained model might need hundreds
This efficiency is what makes foundation models so powerful.
4. Types of Finetuning (Conceptual Overview)
4.1 Continued Pre-Training (Self-Supervised Finetuning)
Before investing in expensive labeled data, you can:
- Feed the model raw domain text
- No labels required
- Examples: legal documents, medical journals, Vietnamese books
This helps the model absorb domain language cheaply.
This is called continued pre-training.
4.2 Supervised Finetuning (SFT)
Now you use (input → output) pairs.
Examples:
- Instruction → Answer
- Question → Summary
- Text → SQL
This is how models learn:
- What humans expect
- How to respond politely
- How to structure answers
Instruction data is expensive because:
- It needs domain expertise
- It must be correct
- It must be consistent
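To make this concrete, here is what one supervised finetuning example could look like as a simple Python/JSON record (the field names are only a common convention, not a fixed standard):

# One hypothetical (input → output) pair for supervised finetuning.
sft_example = {
    "instruction": "Summarize the following contract clause in one sentence.",
    "input": "The lessee shall be responsible for all maintenance costs incurred during the lease term ...",
    "output": "The tenant must pay for all maintenance during the lease.",
}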
4.3 Preference Finetuning (RLHF-style)
Instead of a single “correct” answer, you give:
- Instruction
- Preferred answer
- Less-preferred answer
The model learns human preference, not just correctness.
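A single preference example might look like the sketch below (again, field names vary by training framework; "chosen"/"rejected" is a common convention):

# One hypothetical preference triple for preference finetuning.
preference_example = {
    "instruction": "Explain photosynthesis to a 10-year-old.",
    "chosen": "Plants use sunlight to turn water and air into food they can grow with.",
    "rejected": "Photosynthesis is the light-driven synthesis of ATP and NADPH in chloroplast thylakoids.",
}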
4.4 Long-Context Finetuning
This extends how much text a model can read at once.
But:
- It requires architectural changes
- It can degrade short-context performance
- It’s harder than other finetuning methods
Example: Code Llama extended context from 4K → 16K tokens.
5. Who Finetunes Models?
Model developers:
- OpenAI, Meta, Mistral
- Release multiple post-trained variants
Application developers:
- Usually finetune already post-trained models
- Less work, less cost
The more refined the base model, the less finetuning you need.
6. When Should You Finetune?
This is one of the most important sections.
Key principle:
Finetuning should NOT be your first move.
It requires:
- Data
- GPUs
- ML expertise
- Monitoring
- Long-term maintenance
You should try prompting exhaustively first.
Many teams rush to finetuning because:
- Prompts were poorly written
- Examples were unrealistic
- Metrics were undefined
After fixing prompts, finetuning often becomes unnecessary.
7. Good Reasons to Finetune
7.1 Task-Specific Weaknesses
Example:
- Model handles SQL
- But fails on your company’s SQL dialect
Finetuning on your dialect fixes this.
7.2 Structured Outputs
If you must get:
- Valid JSON
- Compilable code
- Domain-specific syntax
Finetuning helps more than prompting.
7.3 Bias Mitigation
If a model shows bias:
- Gender bias
- Racial bias
Carefully curated finetuning data can reduce it.
7.4 Small Models Can Beat Big Models
A finetuned small model can outperform a huge generic model on a narrow task.
Example:
- Grammarly finetuned Flan-T5
- Beat a GPT-3 variant
- Used only 82k examples
- Model was 60× smaller
8. Reasons NOT to Finetune
8.1 Performance Trade-offs
Finetuning for Task A can degrade Task B.
This is called catastrophic interference.
8.2 High Up-Front Cost
You need:
- Annotated data
- ML knowledge
- Infrastructure
- Serving strategy
8.3 Ongoing Maintenance
New base models keep appearing.
You must decide:
- When to switch
- When to re-finetune
- When gains are “worth it”
9. Finetuning vs RAG (Critical Distinction)
This chapter makes a very important rule:
RAG is for facts.
Finetuning is for behavior.
Use RAG when:
- Model lacks information
- Data is private
- Data is changing
- Answers must be up-to-date
Use finetuning when:
- Model output is irrelevant
- Format is wrong
- Syntax is incorrect
- Style is inconsistent
Studies show:
- RAG often beats finetuning for factual Q&A
- RAG + base model > finetuned model alone.
10. Why Finetuning Is So Memory-Hungry
This is the core technical bottleneck.
Inference:
- Only forward pass
- Need weights + activations
Training (Finetuning):
- Forward pass
- Backward pass
- Gradients
- Optimizer states
Each trainable parameter requires:
- The weight
- The gradient
- 1–2 optimizer values (Adam uses 2)
So memory explodes.
11. Memory Math (Intuition)
Inference memory ≈ number of parameters × bytes per parameter × ~1.2 (the extra ~20% covers activations and other overhead).
Example:
- 13B params
- 2 bytes each → 26 GB for the weights
- Total ≈ 26 × 1.2 ≈ 31 GB
Training memory = weights + gradients + optimizer states + activations.
Example:
- 13B params
- Adam optimizer
- 16-bit precision
Each parameter needs a 2-byte gradient plus two 2-byte Adam states (6 bytes), so just gradients + optimizer states = 13B × 6 bytes = 78 GB.
That’s why finetuning is hard.
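Here is that back-of-the-envelope arithmetic as a small Python sketch (it deliberately ignores activations, batch size, and framework overhead):

params = 13e9            # 13B-parameter model
bytes_per_value = 2      # 16-bit precision

weights_gb = params * bytes_per_value / 1e9            # 26 GB
inference_gb = weights_gb * 1.2                        # ~31 GB with ~20% overhead

gradients_gb = params * bytes_per_value / 1e9          # 26 GB
adam_states_gb = params * 2 * bytes_per_value / 1e9    # 52 GB (two states per parameter)
print(gradients_gb + adam_states_gb)                   # 78 GB on top of the weights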
12. Numerical Precision & Quantization
Floating-point formats:
- FP32 (4 bytes)
- FP16 (2 bytes)
- BF16 (2 bytes)
- TF32
Lower precision = less memory, faster compute.
Quantization
Quantization = reducing precision.
Examples:
- FP32 → FP16
- FP16 → INT8
- INT8 → INT4
This dramatically reduces memory.
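As a minimal illustration of what quantization does, here is a sketch of symmetric (absmax) INT8 quantization in NumPy; production quantizers add per-channel scales, outlier handling, and calibration data:

import numpy as np

def quantize_int8(w):
    # Map floats onto the int8 range [-127, 127] with one scale per tensor.
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # small rounding error, 4× less memory than FP32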
Post-Training Quantization (PTQ)
Most common.
- Train in high precision
- Quantize for inference
Quantization-Aware Training (QAT)
Simulates low precision during training.
- Improves low-precision inference
- Doesn’t reduce training cost
Training Directly in Low Precision
Hard but powerful.
- Character.AI trained models fully in INT8
- Eliminated the training–inference mismatch
- Reduced cost
13. Why Full Finetuning Doesn’t Scale
Example:
-
7B model
-
FP16 weights = 14 GB
-
Adam optimizer = +42 GB
-
Total = 56 GB (without activations)
Most GPUs cannot handle this.
Hence the rise of:
14. Parameter-Efficient Finetuning (PEFT)
Idea:
Update fewer parameters, get most of the benefit.
Instead of changing everything:
- Freeze most weights
- Add small trainable components
15. Partial Finetuning (Why It’s Not Enough)
Freezing early layers helps memory, but:
- You need to train roughly 25% of the parameters to match full finetuning
Still expensive.
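For intuition, partial finetuning is just a matter of which parameters keep their gradients. A minimal PyTorch sketch (the 12-layer stack is a stand-in, not a real LLM):

import torch.nn as nn

model = nn.Sequential(*[nn.Linear(512, 512) for _ in range(12)])  # toy 12-layer model

for p in model.parameters():        # freeze everything
    p.requires_grad = False
for p in model[-2:].parameters():   # unfreeze only the top two layers
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable / total:.0%} of parameters")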
16. Adapter-Based PEFT
Houlsby et al. introduced adapters:
- Small modules inserted into each layer
- Only adapters are trained
- Base model is frozen
Result:
- ~3% parameters
- ~99.6% performance
Downside:
- Extra inference latency
17. LoRA (Low-Rank Adaptation)
LoRA solved adapter latency.
Core idea:
Instead of adding layers, modify weight matrices mathematically.
A big weight matrix W becomes:
W′ = W + B × A
Where:
- A and B are small low-rank matrices (if W is d × k, then B is d × r and A is r × k, with rank r much smaller than d or k)
- Only A and B are trained
- W stays frozen
This uses low-rank factorization.
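Here is a minimal LoRA-style linear layer in PyTorch to show the mechanics; real implementations (e.g. the peft library) add dropout, careful initialization, and merging utilities. Treat it as a sketch, not the reference implementation:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # W (and its bias) stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)  # r × k
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))        # d × r, starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # Equivalent to using W′ = W + scale · B × A
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
y = layer(torch.randn(2, 4096))   # only A and B receive gradients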
Why LoRA Works
- Neural weight updates are often low-rank
- You don’t need to update everything
- Rank 4–64 is usually enough
Example:
- GPT-3 finetuned with 0.0027% of parameters
- Matched full finetuning performance
18. Where LoRA Is Applied
Mostly to:
-
Query (Wq)
-
Key (Wk)
-
Value (Wv)
-
Output (Wo)
Applying LoRA to query + value often gives best returns.
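With the Hugging Face peft library, targeting the query and value projections typically looks roughly like this (module names such as "q_proj" and "v_proj" are specific to Llama-style models, and the model name is only illustrative; check your own model's layer names):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative
config = LoraConfig(
    r=8,                                   # rank of A and B
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # apply LoRA to Wq and Wv only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()         # typically well under 1% of all parameters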
19. Serving LoRA Models
Two strategies:
1. Merge LoRA into base model
- Faster inference
- More storage
2. Keep adapters separate
- Slight latency cost
- Massive storage savings
- Enables multi-LoRA serving
Apple uses this approach to serve many features from one base model.
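Merging an adapter is just adding the low-rank product back into the frozen weight, after which inference cost is identical to the base model. A sketch continuing the hypothetical LoRALinear layer from section 17:

import torch

@torch.no_grad()
def merge_lora(base_weight, A, B, scale):
    # base_weight: (out, in), B: (out, r), A: (r, in) → result has the same shape as base_weight
    return base_weight + scale * (B @ A)

# merged = merge_lora(layer.base.weight, layer.A, layer.B, layer.scale)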
20. QLoRA (Quantized LoRA)
QLoRA = LoRA + 4-bit quantization.
- Base weights stored in 4-bit
- Dequantized to BF16 during compute
- Uses NF4 format
- Uses CPU–GPU paging
Result:
- 65B models finetuned on a single 48 GB GPU
Downside:
- Slower training due to quantization overhead
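In practice, QLoRA setups are often assembled from bitsandbytes 4-bit loading plus a LoRA config, roughly as below; treat it as a sketch, since these library APIs evolve and the model name is only illustrative:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 format for the frozen base weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 during compute
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"]))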
21. Model Merging
Instead of finetuning one model:
- Finetune multiple models separately
- Merge them later
Benefits:
- Avoids catastrophic forgetting
- Reduces memory
- Enables multi-task models
- Useful for on-device deployment
Model merging generalizes ensembles, but merges weights instead of outputs.
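The simplest merging method is plain weight averaging (the "model soup" idea); more sophisticated schemes add scaling and sign resolution. A minimal sketch, assuming the checkpoints share one architecture:

import torch

def average_merge(state_dicts):
    # Element-wise average of matching tensors from several finetuned checkpoints.
    return {
        name: torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
        for name in state_dicts[0]
    }

# merged_weights = average_merge([model_a.state_dict(), model_b.state_dict()])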
22. Final Summary (Mental Model)
Finetuning is powerful—but costly.
- Start with prompting
- Add RAG for knowledge
- Finetune for behavior
- Use PEFT whenever possible
- Prefer LoRA or QLoRA
- Consider model merging for multi-task systems
The hard part isn’t finetuning itself.
The hard part is data, evaluation, and long-term maintenance.
That’s why the next chapter focuses entirely on data.
