Wednesday, December 31, 2025

Ch 7 - Finetuning - Chapter Summary using GPT 5.2 (Dec 2025)


Prompt

Role: You are an Expert Professor of Agentic AI
Task: Explain the attached PDF (Chapter 7 - Finetuning) in layman terms
Strictly Follow: Do not miss out on any details while simplifying things.
Word Limit: Respond in about 5000 to 7000 words

Response

Chapter 7 — Finetuning (Explained Simply, Without Losing Depth)


1. What “Finetuning” Really Means (Big Picture)

Let’s start with the core idea.

Finetuning is the process of taking an already-trained AI model and teaching it to behave better for a specific purpose.

Think of it like this:

  • Pre-training = going to school and learning everything (language, math, reasoning).

  • Prompting = giving instructions like “please answer like a lawyer.”

  • Finetuning = sending the model for professional training so it permanently learns that behavior.

Instead of just telling the model what to do at runtime (prompting), finetuning changes the model’s internal parameters (weights) so it naturally behaves the way you want.

This is fundamentally different from prompt-based methods discussed in earlier chapters, which do not change the model internally.


2. Why Finetuning Exists at All

Large language models already know a lot. So why finetune?

Because:

  1. They don’t always follow instructions reliably

  2. They may not produce output in the exact format you need

  3. They may not specialize well in niche or proprietary tasks

  4. They may behave inconsistently across prompts

Finetuning helps with:

  • Instruction following

  • Output formatting (JSON, YAML, code, DSLs)

  • Domain specialization (legal, medical, finance)

  • Bias mitigation

  • Safety alignment

In short:

Prompting changes what you ask.
Finetuning changes how the model thinks.


3. Finetuning Is Transfer Learning

Finetuning is not a new idea. It’s a form of transfer learning, first described in the 1970s.

Human analogy:

If you already know how to play the piano, learning the guitar is easier.

AI analogy:

If a model already knows language, learning legal Q&A requires far fewer examples.

This is why:

  • Training a legal QA model from scratch might need millions of examples

  • Finetuning a pretrained model might need hundreds

This efficiency is what makes foundation models so powerful.


4. Types of Finetuning (Conceptual Overview)

4.1 Continued Pre-Training (Self-Supervised Finetuning)

Before expensive labeled data, you can:

  • Feed the model raw domain text

  • No labels required

  • Examples:

    • Legal documents

    • Medical journals

    • Vietnamese books

This helps the model absorb domain language cheaply.

This is called continued pre-training.


4.2 Supervised Finetuning (SFT)

Now you use (input → output) pairs.

Examples:

  • Instruction → Answer

  • Question → Summary

  • Text → SQL

This is how models learn:

  • What humans expect

  • How to respond politely

  • How to structure answers

Instruction data is expensive because:

  • It needs domain expertise

  • It must be correct

  • It must be consistent
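
In practice, these (input → output) pairs are usually stored as simple structured records. Here is a minimal sketch in Python; the JSONL layout and the field names ("instruction", "response") are common conventions assumed for illustration, not something the chapter prescribes.

  # A minimal sketch of supervised finetuning (SFT) data: (input -> output) pairs.
  # The JSONL layout and field names are illustrative assumptions, not the book's spec.
  import json

  examples = [
      {"instruction": "Summarize: The quarterly report shows revenue grew 12%...",
       "response": "Revenue grew 12% quarter over quarter, driven mainly by..."},
      {"instruction": "Write SQL: total orders per customer in 2024",
       "response": "SELECT customer_id, COUNT(*) FROM orders WHERE year = 2024 GROUP BY customer_id;"},
  ]

  # Many finetuning tools accept one JSON object per line (JSONL).
  with open("sft_data.jsonl", "w") as f:
      for ex in examples:
          f.write(json.dumps(ex) + "\n")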


4.3 Preference Finetuning (RLHF-style)

Instead of a single “correct” answer, you give:

  • Instruction

  • Preferred answer

  • Less-preferred answer

The model learns human preference, not just correctness.
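
A single preference record bundles one prompt with a preferred and a less-preferred answer. A tiny sketch follows; the field names ("prompt", "chosen", "rejected") are a common convention assumed here, not something the chapter mandates.

  # One preference-finetuning record (RLHF/DPO-style data), sketched as a dict.
  preference_example = {
      "prompt": "Explain what an index does in a database.",
      "chosen": ("An index is a data structure that lets the database find rows "
                 "without scanning the whole table, at the cost of extra storage "
                 "and slower writes."),
      "rejected": "It makes queries faster.",  # true, but unhelpfully terse
  }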


4.4 Long-Context Finetuning

This extends how much text a model can read at once.

But:

  • It requires architectural changes

  • It can degrade short-context performance

  • It’s harder than other finetuning methods

Example: Code Llama extended its context window from 4K → 16K tokens.


5. Who Finetunes Models?

Model developers:

  • OpenAI, Meta, Mistral

  • Release multiple post-trained variants

Application developers:

  • Usually finetune already post-trained models

  • Less work, less cost

The more refined the base model, the less finetuning you need.


6. When Should You Finetune?

This is one of the most important sections.

Key principle:

Finetuning should NOT be your first move.

It requires:

  • Data

  • GPUs

  • ML expertise

  • Monitoring

  • Long-term maintenance

You should try prompting exhaustively first.

Many teams rush to finetuning because:

  • Prompts were poorly written

  • Examples were unrealistic

  • Metrics were undefined

After fixing prompts, finetuning often becomes unnecessary.


7. Good Reasons to Finetune

7.1 Task-Specific Weaknesses

Example:

  • Model handles SQL

  • But fails on your company’s SQL dialect

Finetuning on your dialect fixes this.


7.2 Structured Outputs

If you must get:

  • Valid JSON

  • Compilable code

  • Domain-specific syntax

Finetuning helps more than prompting.


7.3 Bias Mitigation

If a model shows bias:

  • Gender bias

  • Racial bias

Carefully curated finetuning data can reduce it.


7.4 Small Models Can Beat Big Models

A finetuned small model can outperform a huge generic model on a narrow task.

Example:

  • Grammarly finetuned Flan-T5

  • Beat a GPT-3 variant

  • Used only 82k examples

  • Model was 60× smaller


8. Reasons NOT to Finetune

8.1 Performance Trade-offs

Finetuning for Task A can degrade Task B.

This is called catastrophic interference.


8.2 High Up-Front Cost

You need:

  • Annotated data

  • ML knowledge

  • Infrastructure

  • Serving strategy


8.3 Ongoing Maintenance

New base models keep appearing.

You must decide:

  • When to switch

  • When to re-finetune

  • When gains are “worth it”


9. Finetuning vs RAG (Critical Distinction)

This chapter makes a very important rule:

RAG is for facts.
Finetuning is for behavior.

Use RAG when:

  • Model lacks information

  • Data is private

  • Data is changing

  • Answers must be up-to-date

Use finetuning when:

  • Model output is irrelevant

  • Format is wrong

  • Syntax is incorrect

  • Style is inconsistent

Studies show:

  • RAG often beats finetuning for factual Q&A

  • RAG + base model > finetuned model alone.


10. Why Finetuning Is So Memory-Hungry

This is the core technical bottleneck.

Inference:

  • Only forward pass

  • Need weights + activations

Training (Finetuning):

  • Forward pass

  • Backward pass

  • Gradients

  • Optimizer states

Each trainable parameter requires:

  • The weight

  • The gradient

  • 1–2 optimizer values (Adam uses 2)

So memory explodes.


11. Memory Math (Intuition)

Inference memory ≈ weights × 1.2

Example:

  • 13B params

  • 2 bytes each → 26 GB

  • Total ≈ 31 GB


Training memory = weights + activations + gradients + optimizer states

Example:

  • 13B params

  • Adam optimizer

  • 16-bit precision

Just gradients + optimizer states = 78 GB

That’s why finetuning is hard.
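
As a sanity check, you can redo the chapter's arithmetic in a few lines of Python. This is the same back-of-the-envelope estimate: 16-bit values (2 bytes each), two Adam states per parameter, and the ~20% inference overhead used above.

  # Back-of-the-envelope memory estimate for a 13B-parameter model.
  params = 13e9
  bytes_per_value = 2                                # FP16/BF16

  weights = params * bytes_per_value                 # 26 GB
  inference_total = weights * 1.2                    # ~31 GB with activation/KV overhead

  gradients = params * bytes_per_value               # 26 GB
  optimizer_states = params * 2 * bytes_per_value    # Adam keeps 2 values per parameter: 52 GB
  training_extra = gradients + optimizer_states      # 78 GB on top of weights + activations

  print(f"Inference ≈ {inference_total / 1e9:.0f} GB")
  print(f"Gradients + optimizer states ≈ {training_extra / 1e9:.0f} GB")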


12. Numerical Precision & Quantization

Floating-point formats:

  • FP32 (4 bytes)

  • FP16 (2 bytes)

  • BF16

  • TF32

Lower precision = less memory, faster compute.


Quantization

Quantization = reducing precision.

Examples:

  • FP32 → FP16

  • FP16 → INT8

  • INT8 → INT4

This dramatically reduces memory.
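
To make "reducing precision" concrete, here is a minimal sketch of symmetric INT8 quantization of a single weight matrix in NumPy. Real schemes (per-channel scales, NF4, GPTQ, and so on) are more sophisticated; this only shows the basic round-trip and the 4× memory saving.

  # Minimal sketch: symmetric INT8 quantization of a weight matrix.
  import numpy as np

  w = np.random.randn(4096, 4096).astype(np.float32)   # FP32 weights: 4 bytes/value

  scale = np.abs(w).max() / 127.0                       # map the largest weight to +/-127
  w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # 1 byte/value
  w_dequant = w_int8.astype(np.float32) * scale         # dequantize for compute

  print("memory ratio:", w_int8.nbytes / w.nbytes)      # 0.25 -> 4x smaller
  print("mean abs error:", np.abs(w - w_dequant).mean())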


Post-Training Quantization (PTQ)

Most common.

  • Train in high precision

  • Quantize for inference


Quantization-Aware Training (QAT)

Simulates low precision during training.

  • Improves low-precision inference

  • Doesn’t reduce training cost


Training Directly in Low Precision

Hard but powerful.

  • Character.AI trained models fully in INT8

  • Eliminated mismatch

  • Reduced cost


13. Why Full Finetuning Doesn’t Scale

Example:

  • 7B model

  • FP16 weights = 14 GB

  • Adam optimizer = +42 GB

  • Total = 56 GB (without activations)

Most GPUs cannot handle this.

Hence the rise of:


14. Parameter-Efficient Finetuning (PEFT)

Idea:

Update fewer parameters, get most of the benefit.

Instead of changing everything:

  • Freeze most weights

  • Add small trainable components


15. Partial Finetuning (Why It’s Not Enough)

Freezing early layers helps with memory, but you still need to train roughly 25% of the parameters to match full finetuning.

Still expensive.


16. Adapter-Based PEFT

Houlsby et al. introduced adapters:

  • Small modules inserted into each layer

  • Only adapters are trained

  • Base model is frozen

Result:

  • ~3% parameters

  • ~99.6% performance

Downside:

  • Extra inference latency
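
For intuition, an adapter is typically a tiny bottleneck network with a residual connection, inserted into each transformer layer. The PyTorch sketch below follows that common pattern; it is a generic illustration, not the exact module from Houlsby et al.

  # A generic adapter block (bottleneck + residual), sketched in PyTorch.
  # Only this module is trained; the surrounding transformer layer stays frozen.
  import torch
  import torch.nn as nn

  class Adapter(nn.Module):
      def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
          super().__init__()
          self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down
          self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up
          self.act = nn.GELU()

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          # Residual: the adapter only learns a small correction to each layer's output.
          return x + self.up(self.act(self.down(x)))

Because the adapter sits in the forward path, every token must pass through it at inference time, which is where the extra latency comes from.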


17. LoRA (Low-Rank Adaptation)

LoRA solved adapter latency.

Core idea:

Instead of adding layers, modify weight matrices mathematically.

A big matrix:

  W

becomes:

  W + (A × B)

Where:

  • A and B are small

  • Only A and B are trained

  • W stays frozen

This uses low-rank factorization.
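
A minimal LoRA layer might look like the PyTorch sketch below: W stays frozen, and only the two small matrices A and B are trained. The scaling and the zero-initialization of B follow the usual convention (so finetuning starts from the unchanged base model), but treat this as an illustration rather than a reference implementation.

  # Minimal LoRA sketch: output = x W^T + (x A B) * scale, with W frozen.
  import torch
  import torch.nn as nn

  class LoRALinear(nn.Module):
      def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
          super().__init__()
          self.base = base
          for p in self.base.parameters():
              p.requires_grad_(False)                                # freeze W (and bias)
          self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
          self.B = nn.Parameter(torch.zeros(r, base.out_features))  # starts at zero
          self.scale = alpha / r

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          return self.base(x) + (x @ self.A @ self.B) * self.scale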


Why LoRA Works

  • Neural weight updates are often low-rank

  • You don’t need to update everything

  • Rank 4–64 is usually enough

Example:

  • GPT-3 finetuned with 0.0027% parameters

  • Matched full finetuning performance


18. Where LoRA Is Applied

Mostly to:

  • Query (Wq)

  • Key (Wk)

  • Value (Wv)

  • Output (Wo)

Applying LoRA to query + value often gives best returns.


19. Serving LoRA Models

Two strategies:

1. Merge LoRA into base model

  • Faster inference

  • More storage

2. Keep adapters separate

  • Slight latency cost

  • Massive storage savings

  • Enables multi-LoRA serving

Apple uses this approach to serve many features from one base model.
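
Merging is a one-time weight update: since the adapted layer effectively uses W + (A × B), that product can be added into W once and the result served like any ordinary model. A minimal sketch, using the same illustrative shapes as the LoRA example above:

  # Option 1: fold the LoRA update into the base weight once, then serve normally.
  # Shapes: W is (out, in), A is (in, r), B is (r, out); scale is alpha / r.
  import torch

  @torch.no_grad()
  def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor, scale: float) -> torch.Tensor:
      return W + (A @ B).T * scale      # merged weight has the same shape as W

Option 2 skips this step: A and B live in a separate file of a few megabytes per task, so many finetuned "models" can share a single copy of W in GPU memory.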


20. QLoRA (Quantized LoRA)

QLoRA = LoRA + 4-bit quantization.

  • Base weights stored in 4-bit

  • Dequantized to BF16 during compute

  • Uses NF4 format

  • Uses CPU–GPU paging

Result:

  • 65B models finetuned on a single 48 GB GPU

Downside:

  • Slower training due to quantization overhead
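
In practice, QLoRA is usually run through libraries rather than written by hand. The sketch below uses the Hugging Face transformers, peft, and bitsandbytes stack; the model name is a placeholder and the argument names reflect those libraries at the time of writing, so treat it as a rough recipe rather than exact configuration.

  # Rough QLoRA recipe (transformers + peft + bitsandbytes); check your library versions.
  import torch
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model

  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,                      # store the frozen base weights in 4-bit
      bnb_4bit_quant_type="nf4",              # the NF4 format from the QLoRA paper
      bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for each matmul
  )

  model = AutoModelForCausalLM.from_pretrained(
      "meta-llama/Llama-2-7b-hf",             # placeholder model id
      quantization_config=bnb_config,
      device_map="auto",
  )

  lora_config = LoraConfig(
      r=16, lora_alpha=32, lora_dropout=0.05,
      target_modules=["q_proj", "v_proj"],    # attach LoRA to the query and value projections
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()          # only the LoRA matrices are trainable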


21. Model Merging

Instead of finetuning one model:

  • Finetune multiple models separately

  • Merge them later

Benefits:

  • Avoids catastrophic forgetting

  • Reduces memory

  • Enables multi-task models

  • Useful for on-device deployment

Model merging generalizes ensembles, but merges weights instead of outputs.
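
The simplest merging method is a plain average of the finetuned weights, sometimes called a "model soup". The sketch below assumes the models share an identical architecture; real merging methods (task arithmetic, TIES, and similar) are more selective than a plain mean.

  # Simplest model merging: element-wise average of same-architecture checkpoints.
  import torch

  def merge_state_dicts(state_dicts: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
      merged = {}
      for name in state_dicts[0]:
          merged[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
      return merged

  # e.g. merged = merge_state_dicts([legal_model.state_dict(), medical_model.state_dict()])
  #      multitask_model.load_state_dict(merged)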


22. Final Summary (Mental Model)

Finetuning is powerful—but costly.

  • Start with prompting

  • Add RAG for knowledge

  • Finetune for behavior

  • Use PEFT whenever possible

  • Prefer LoRA or QLoRA

  • Consider model merging for multi-task systems

The hard part isn’t finetuning itself.

The hard part is data, evaluation, and long-term maintenance.

That’s why the next chapter focuses entirely on data.

