Wednesday, December 31, 2025
AI Engineering Architecture and User Feedback (Chapter 10 - AI Engineering - By Chip Huyen)
Building Real AI Products: Architecture, Guardrails, and the Power of User Feedback
Why successful AI systems are more about engineering and feedback loops than models
Introduction: From “Cool Demo” to Real Product
Training or calling a large language model is the easy part of modern AI.
The hard part begins when you try to turn that model into a real product:
- One that real users rely on
- One that must be safe, fast, affordable, and reliable
- One that improves over time instead of silently degrading
This chapter is about that hard part.
So far, most AI discussions focus on prompts, RAG, finetuning, or agents in isolation. But real-world systems are not isolated techniques—they are architectures. And architectures are shaped not just by technical constraints, but by users.
This blog post walks through a progressive AI engineering architecture, starting from the simplest possible setup and gradually adding:
- Context
- Guardrails
- Routers and gateways
- Caching
- Agentic logic
- Monitoring and observability
- Orchestration
- User feedback loops
At the end, you’ll see why user feedback is not just UX—it’s your most valuable dataset.
1. The Simplest Possible AI Architecture (And Why It Breaks)
The One-Box Architecture
Every AI product starts the same way:
- User sends a query
- Query goes to a model
- Model generates a response
- Response is returned to the user
That’s it.
This architecture is deceptively powerful. For prototypes, demos, and internal tools, it works surprisingly well. Many early ChatGPT-like apps stopped right here.
But this setup has major limitations:
- The model knows only what’s in its training data
- No protection against unsafe or malicious prompts
- No cost control
- No monitoring
- No learning from users
This architecture is useful only until users start depending on it.
Why Real Products Need More Than a Model Call
Once users rely on your application:
- Wrong answers become costly
- Latency becomes noticeable
- Hallucinations become dangerous
- Abuse becomes inevitable
To fix these, teams don’t “replace the model.”
They add layers around it.
2. Step One: Enhancing Context (Giving the Model the Right Information)
Context Is Feature Engineering for LLMs
Large language models don’t magically know your:
- Company policies
- Internal documents
- User history
- Real-time data
To make models useful, you must construct context.
This can be done via:
- Text retrieval (documents, PDFs, chat history)
- Structured data retrieval (SQL, tables)
- Image retrieval
- Tool use (search APIs, weather, news, calculators)
This process, commonly called RAG (retrieval-augmented generation), is analogous to feature engineering in traditional ML.
The model isn’t getting smarter; it’s getting better inputs.
Trade-offs in Context Construction
Not all context systems are equal.
Model APIs (OpenAI, Gemini, Claude):
- Easier to use
- Limited document uploads
- Limited retrieval control
Custom RAG systems:
- More flexible
- More engineering effort
- Require tuning (chunk size, embeddings, rerankers)
Similarly, tool support varies:
- Some models support parallel tools
- Some support long-running tools
- Some don’t support tools at all
As soon as context is added, your architecture already looks much more complex—and much more powerful.
3. Step Two: Guardrails (Protecting Users and Yourself)
Why Guardrails Are Non-Negotiable
AI systems fail in ways traditional software never did.
They can:
- Leak private data
- Generate toxic content
- Execute unintended actions
- Be tricked by clever prompts
Guardrails exist to reduce risk, not eliminate it (elimination is impossible).
There are two broad types:
- Input guardrails
- Output guardrails
Input Guardrails: Protecting What Goes In
Input guardrails prevent:
- Sensitive data leakage
- Prompt injection attacks
- System compromise
Common risks:
- Employees pasting secrets into prompts
- Tools accidentally retrieving private data
- Developers embedding internal policies into prompts
A common defense is PII detection and masking:
- Detect phone numbers, IDs, addresses, faces
- Replace them with placeholders
- Send masked prompt to external API
- Unmask response locally using a reverse dictionary
This allows functionality without leaking raw data.
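The mask-and-unmask steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the two regular expressions, the placeholder format, and the function names `mask_pii` and `unmask` are all assumptions made for the example.

```python
import re

# Hypothetical patterns for illustration only; real PII detection needs far
# broader coverage (names, addresses, IDs, faces in images, and so on).
PII_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def mask_pii(text):
    """Replace detected PII with placeholders and return the reverse dictionary."""
    reverse = {}
    for label, pattern in PII_PATTERNS.items():
        def substitute(match, label=label):
            placeholder = f"[{label}_{len(reverse) + 1}]"
            reverse[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, reverse

def unmask(text, reverse):
    """Restore the original values locally, after the external API has responded."""
    for placeholder, original in reverse.items():
        text = text.replace(placeholder, original)
    return text

original = "Call 555-123-4567 or mail ana@example.com"
masked, reverse = mask_pii(original)
restored = unmask(masked, reverse)
```

Only `masked` would leave your infrastructure; the reverse dictionary stays local, which is the whole point of the pattern.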
Output Guardrails: Protecting What Comes Out
Models can fail in many ways.
Quality failures:
- Invalid JSON
- Hallucinated facts
- Low-quality responses
Security failures:
- Toxic language
- Private data leaks
- Brand-damaging claims
- Unsafe tool invocation
Some failures are easy to detect (empty output).
Others require AI-based scorers.
Handling Failures: Retries, Fallbacks, Humans
Because models are probabilistic:
- Retrying can fix many issues
- Parallel retries reduce latency
- Humans can be used as a last resort
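As a rough sketch of the retry idea, the snippet below keeps calling a model until its output parses as JSON, then gives up so the caller can fall back. The `call_model` argument and the `flaky_model` stand-in are hypothetical; a real system would plug in its own client and likely add parallel retries and logging.

```python
import json

def generate_with_retries(call_model, prompt, max_attempts=3):
    """Call the model until the output parses as JSON, or give up."""
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt)
        try:
            return json.loads(raw), attempt
        except json.JSONDecodeError:
            continue  # invalid JSON is an easy-to-detect failure: try again
    # Nothing usable: the caller decides on a fallback (another model, a human).
    return None, max_attempts

# Stand-in model for the demo: fails once, then returns valid JSON.
_responses = iter(["not json", '{"answer": 42}'])

def flaky_model(prompt):
    return next(_responses)

result, attempts = generate_with_retries(flaky_model, "Return JSON")
```

Sequential retries like this add latency on every failure, which is why some teams send several attempts in parallel and keep the first valid one.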
Some teams route conversations to humans:
- When sentiment turns negative
- After too many turns
- When safety confidence drops
Guardrails always involve trade-offs:
- More guardrails → more latency
- Streaming responses → weaker guardrails
There is no perfect solution—only careful balancing.
4. Step Three: Routers and Gateways (Managing Complexity at Scale)
Why One Model Is Rarely Enough
As products grow, different queries require different handling:
- FAQs vs troubleshooting
- Billing vs technical support
- Simple vs complex tasks
Using one expensive model for everything is wasteful.
This is where routing comes in.
Routers: Choosing the Right Path
A router is usually an intent classifier.
Examples:
- “Reset my password” → FAQ
- “Billing error” → human agent
- “Why is my app crashing?” → troubleshooting model
Routers help:
- Reduce cost
- Improve quality
- Avoid out-of-scope conversations
Routers can also:
- Ask clarifying questions
- Decide which memory to use
- Choose which tool to call next
Routers must be:
- Fast
- Cheap
- Reliable
That’s why many teams use small models or custom classifiers.
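To make the routing idea concrete, here is a deliberately tiny sketch that uses keyword matching as a stand-in for an intent classifier. The intents, keywords, and the `route` function are hypothetical and only mirror the examples above; a real router would more likely be a small trained model.

```python
# Hypothetical intents and keywords, chosen only to mirror the examples above.
ROUTES = {
    "faq": ["password", "reset"],
    "human_agent": ["billing", "refund"],
    "troubleshooting": ["crash", "freeze"],
}

def route(query):
    """Return the intent with the most keyword hits, or 'out_of_scope' if none match."""
    text = query.lower()
    scores = {
        intent: sum(keyword in text for keyword in keywords)
        for intent, keywords in ROUTES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "out_of_scope"

decisions = [route(q) for q in (
    "Reset my password",
    "Billing error",
    "Why is my app crashing?",
    "Tell me a joke",
)]
```

Even this toy version shows the shape of the trade-off: the router runs before the expensive model, so it has to be fast and cheap, and anything it cannot place is sent to an out-of-scope path instead of a costly call.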
Model Gateways: One Interface to Rule Them All
A model gateway is a unified access layer to:
- OpenAI
- Gemini
- Claude
- Self-hosted models
Benefits:
- Centralized authentication
- Cost control
- Rate limiting
- Fallback strategies
- Easier maintenance
Instead of changing every app when an API changes, you update the gateway once.
Gateways also become natural places for:
- Logging
- Analytics
- Guardrails
- Caching
5. Step Four: Caching (Reducing Latency and Cost)
Why Caching Matters in AI
AI calls are:
- Slow
- Expensive
- Often repetitive
Caching avoids recomputing answers.
There are two main types:
- Exact caching
- Semantic caching
Exact Caching: Safe and Simple
Exact caching reuses results only when inputs match exactly.
Examples:
- Product summaries
- FAQ answers
- Embedding lookups
Key considerations:
- Eviction policy (LRU, LFU, FIFO)
- Storage layer (memory vs Redis vs DB)
- Cache duration
Caching must be careful:
- User-specific data should not be cached globally
- Time-sensitive queries should not be cached
Mistakes here can cause data leaks.
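A minimal in-memory sketch of an exact cache that covers the considerations above: LRU eviction, a cache duration, and per-user keys so one user’s answer is never served to another. The `ExactCache` class and its parameters are assumptions for illustration; a production setup would more likely sit on Redis or a database.

```python
import time
from collections import OrderedDict

class ExactCache:
    """Exact-match cache with LRU eviction and a time-to-live (illustrative sketch)."""

    def __init__(self, max_items=1000, ttl_seconds=3600):
        self.max_items = max_items
        self.ttl = ttl_seconds
        self._store = OrderedDict()  # key -> (value, stored_at)

    def get(self, prompt, user_id=None, now=None):
        now = time.time() if now is None else now
        key = (user_id, prompt)  # user-specific entries get their own key
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at > self.ttl:
            del self._store[key]  # expired: treat as a miss
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return value

    def put(self, prompt, value, user_id=None, now=None):
        now = time.time() if now is None else now
        key = (user_id, prompt)
        self._store[key] = (value, now)
        self._store.move_to_end(key)
        while len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict the least recently used entry

cache = ExactCache(max_items=2, ttl_seconds=10)
cache.put("What is your return policy?", "30 days.", now=0)
hit = cache.get("What is your return policy?", now=5)
```

The optional `now` argument exists only to make the behaviour easy to test; time-sensitive queries should simply never be passed to `put`.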
Semantic Caching: Powerful but Risky
Semantic caching reuses answers for similar queries.
Process:
- Embed query
- Search cached embeddings
- If similarity > threshold, reuse result
Pros:
- Higher cache hit rate
Cons:
- Incorrect answers
- Complex tuning
- Extra vector search cost
Semantic caching only works well when:
- Embeddings are high quality
- Similarity thresholds are well tuned
- Cache hit rate is high
Otherwise, it often causes more harm than good.
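The embed, search, and threshold steps can be sketched as follows. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the linear scan stands in for a vector index; both, along with the 0.8 threshold, are assumptions for the example.

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: word counts. A real system would use an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

    def get(self, query):
        """Return the cached answer of the most similar query, if similar enough."""
        q = embed(query)
        best_score, best_answer = 0.0, None
        for embedding, answer in self.entries:  # linear scan; real systems use a vector index
            score = cosine(q, embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score > self.threshold else None

semantic_cache = SemanticCache(threshold=0.8)
semantic_cache.put("how do i reset my password", "Use the reset link.")
near_hit = semantic_cache.get("how do i reset my password please")
miss = semantic_cache.get("what is the refund policy")
```

The threshold is the risky part: set it too low and users get answers to questions they did not ask, which is exactly the "incorrect answers" downside listed above.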
6. Step Five: Agent Patterns and Write Actions
Moving Beyond Linear Pipelines
Simple pipelines are sequential:
- Query → Retrieve → Generate → Return
Agentic systems introduce:
- Loops
- Conditional branching
- Parallel execution
Example:
- Generate answer
- Detect insufficiency
- Retrieve more data
- Generate again
This dramatically increases capability.
Write Actions: Power with Risk
Write actions allow models to:
- Send emails
- Place orders
- Update records
- Trigger workflows
They make systems vastly more useful, but also vastly more dangerous.
Write actions must be:
- Strictly guarded
- Audited
- Often human-approved
Once write actions are added, observability becomes mandatory, not optional.
7. Monitoring and Observability: Seeing Inside the Black Box
Monitoring vs Observability
Monitoring:
- Tracks metrics
- Tells you something is wrong
Observability:
- Lets you infer why it’s wrong
- Without deploying new code
Good observability reduces:
- Mean time to detection (MTTD)
- Mean time to resolution (MTTR)
- Change failure rate (CFR)
Metrics That Actually Matter
Metrics should serve a purpose.
Examples:
- Format error rate
- Hallucination signals
- Guardrail trigger rate
- Token usage
- Latency (TTFT, TPOT)
- Cost per request
Metrics should correlate with business metrics:
- DAU
- Retention
- Session duration
If they don’t, you may be optimizing the wrong thing.
Logs and Traces: Debugging Reality
Logs:
- Record events
- Help answer “what happened?”
Traces:
- Reconstruct an entire request’s journey
- Show timing, costs, failures
In AI systems, logs should capture:
- Prompts
- Model parameters
- Outputs
- Tool calls
- Intermediate results
Developers should regularly read production logs—their understanding of “good” and “bad” outputs evolves over time.
Drift Detection: Change Is Inevitable
Things that drift:
- System prompts
- User behavior
- Model versions (especially via APIs)
Drift often goes unnoticed unless explicitly tracked.
Silent drift is one of the biggest risks in AI production.
8. User Feedback: Your Most Valuable Dataset
Why User Feedback Is Strategic
User feedback is:
- Proprietary
- Real-world
- Continuously generated
It fuels:
- Evaluation
- Personalization
- Model improvement
- Competitive advantage
This is the data flywheel.
Explicit vs Implicit Feedback
Explicit feedback:
- Thumbs up/down
- Ratings
- Surveys
Pros:
- Clear signal
Cons:
- Sparse
- Biased
Implicit feedback:
- Edits
- Rephrases
- Regeneration
- Abandonment
- Conversation length
- Sentiment
Pros:
- Abundant
Cons:
- Noisy
- Hard to interpret
Both are necessary.
Conversational Feedback Is Gold
Users naturally correct AI:
- “No, I meant…”
- “That’s wrong”
- “Be shorter”
- “Check again”
These corrections are:
- Preference data
- Evaluation signals
- Training examples
Edits are especially powerful:
- Original output = losing response
- Edited output = winning response
That’s free preference data, the same kind of (winning, losing) pair used in RLHF-style preference finetuning.
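One way to picture this is a small helper that turns a user edit into a preference record. The field names `prompt`, `chosen`, and `rejected` are an assumed convention for the example, not something the chapter prescribes.

```python
def edit_to_preference_pair(prompt, model_output, user_edit):
    """Turn a user edit into a (winning, losing) preference record.

    Returns None when the user left the output unchanged, since that
    carries no preference signal in this simple scheme.
    """
    if user_edit.strip() == model_output.strip():
        return None
    return {
        "prompt": prompt,
        "chosen": user_edit,       # edited output = winning response
        "rejected": model_output,  # original output = losing response
    }

pair = edit_to_preference_pair(
    prompt="Write a one-line status update.",
    model_output="The deployment has been completed successfully by the team today.",
    user_edit="Deployment done.",
)
unchanged = edit_to_preference_pair("p", "same text", "same text ")
```

In practice such records are noisy (users also edit for reasons unrelated to quality), so they usually need filtering before they are used for training.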
Designing Feedback Without Annoying Users
Good feedback systems:
- Fit naturally into workflows
- Require minimal effort
- Can be ignored
Great examples:
- Midjourney’s image selection
- GitHub Copilot’s inline suggestions
- Google Photos’ uncertainty prompts
Bad feedback systems:
- Interrupt users
- Ask too often
- Demand explanation
Conclusion: Architecture and Feedback Are the Real AI Moats
Modern AI success is not about:
- The biggest model
- The cleverest prompt
It’s about:
- Thoughtful architecture
- Layered defenses
- Observability
- Feedback loops
- Continuous iteration
Models will commoditize.
APIs will change.
What remains defensible is how well you learn from users and adapt.
That is the real craft of AI engineering.
Inference Optimization: How AI Models Become Faster, Cheaper, and Actually Useful (Chapter 9 - AI Engineering - By Chip Huyen)
Why “running” an AI model well is just as hard as building it
Introduction: Why Inference Matters More Than You Think
In the AI world, we spend a lot of time talking about training models—bigger models, better architectures, more data, and more GPUs. But once a model is trained, a much more practical question takes over:
How do you run it efficiently for real users?
This is where inference optimization comes in.
Inference is the process of using a trained model to produce outputs for new inputs. In simple terms:
- Training is learning
- Inference is doing the work
No matter how brilliant a model is, if it’s too slow, users will abandon it. If it’s too expensive, businesses won’t deploy it. Worse, if inference takes longer than the value of the prediction, the model becomes useless—imagine a stock prediction that arrives after the market closes.
As the chapter makes clear, inference optimization is about making AI models faster and cheaper without ruining their quality. And it’s not just a single discipline—it sits at the intersection of:
- Machine learning
- Systems engineering
- Hardware architecture
- Compilers
- Distributed systems
- Cloud infrastructure
This blog post explains inference optimization in plain language, without skipping the important details, so you can understand why AI systems behave the way they do and how teams improve them in production.
1. Understanding Inference: From Model to User
Training vs Inference (A Crucial Distinction)
Every AI model has two distinct life phases:
- Training: learning patterns from data
- Inference: generating outputs for new inputs
Training is expensive but happens infrequently. Inference happens constantly.
Most AI engineers, application developers, and product teams spend far more time worrying about inference than training, especially if they’re using pretrained or open-source models.
What Is an Inference Service?
In production, inference doesn’t happen in isolation. It’s handled by an inference service, which includes:
- An inference server that runs the model
- Routing logic
- Request handling
- Possibly preprocessing and postprocessing
Model APIs like OpenAI, Gemini, or Claude are inference services. If you use them, you don’t worry about optimization details. But if you host your own models, you own the entire inference pipeline: performance, cost, scaling, and failures included.
Why Optimization Is About Bottlenecks
Optimization always starts with a question:
What is slowing us down?
Just like traffic congestion in a city, inference systems have chokepoints. Identifying these bottlenecks determines which optimization techniques actually help.
2. Compute-Bound vs Memory-Bound: The Core Bottleneck Concept
Two Fundamental Bottlenecks
Inference workloads usually fall into one of two categories:
Compute-bound
- Speed limited by how many calculations the hardware can perform
- Example: heavy mathematical computation
- Faster chips or more parallelism help
Memory bandwidth-bound
- Speed limited by how fast data can be moved
- Common in large models where weights must be repeatedly loaded
- Faster memory and smaller models help
This distinction is foundational to understanding inference optimization.
Why Language Models Are Often Memory-Bound
Large language models generate text one token at a time. For each token:
- The model must load large weight matrices
- Perform relatively little computation per byte loaded
This makes decoding (token generation) memory bandwidth-bound.
Prefill vs Decode: Two Very Different Phases
Transformer-based language model inference has two stages:
- Prefill
  - Processes input tokens in parallel
  - Compute-bound
  - Determines how fast the model “understands” the prompt
- Decode
  - Generates one output token at a time
  - Memory bandwidth-bound
  - Determines how fast the response appears
Because these phases behave differently, modern inference systems often separate them across machines for better efficiency.
3. Online vs Batch Inference: Latency vs Cost
Two Types of Inference APIs
Most providers offer:
- Online APIs: optimized for low latency
- Batch APIs: optimized for cost and throughput
Online Inference
Used for:
- Chatbots
- Code assistants
- Real-time interactions
Characteristics:
- Low latency
- Users expect instant feedback
- Limited batching
Streaming responses (token-by-token) reduce perceived waiting time but come with risks—users might see bad outputs before they can be filtered.
Batch Inference
Used for:
- Synthetic data generation
- Periodic reports
- Document processing
- Data migration
Characteristics:
- Higher latency allowed
- Aggressive batching
- Much lower cost (often ~50% cheaper)
Unlike traditional ML, batch inference for foundation models can’t precompute everything because user prompts are open-ended.
4. Measuring Inference Performance: Metrics That Actually Matter
Latency Is Not One Number
Latency is best understood as multiple metrics:
Time to First Token (TTFT)
- How long users wait before seeing anything
- Tied to prefill
- Critical for chat interfaces
Time Per Output Token (TPOT)
- Speed of token generation after the first token
- Determines how fast long responses feel
Two systems with the same total latency can feel very different depending on TTFT and TPOT trade-offs.
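A quick back-of-the-envelope calculation shows how that can happen. The sketch assumes total latency is TTFT plus TPOT for every token after the first; the function name and the numbers are made up for illustration.

```python
def total_latency_ms(ttft_ms, tpot_ms, output_tokens):
    """Approximate total latency: wait for the first token, then one TPOT per remaining token."""
    return ttft_ms + tpot_ms * (output_tokens - 1)

# Two hypothetical systems generating the same 101-token answer.
snappy_start = total_latency_ms(ttft_ms=100, tpot_ms=50, output_tokens=101)
slow_start = total_latency_ms(ttft_ms=1100, tpot_ms=40, output_tokens=101)
```

Both finish at the same moment, but the first shows text after a tenth of a second while the second leaves the user staring at a blank screen for over a second before streaming slightly faster.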
Percentiles, Not Averages
Latency is a distribution.
A single slow request can ruin the average. That’s why teams look at:
- p50 (median)
- p90
- p95
- p99
Outliers often signal:
- Network issues
- Oversized prompts
- Resource contention
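To see why averages mislead, here is a small sketch using the nearest-rank definition of a percentile (one of several common definitions). The latency samples are invented for the example.

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p * n / 100
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds; one request is a severe outlier.
latencies_ms = [100, 102, 98, 105, 110, 95, 101, 99, 103, 3000]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)
```

The mean lands near 391 ms even though nine out of ten requests finished in about 100 ms; the median tells you what a typical user saw, and p99 surfaces the outlier worth investigating.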
Throughput and Cost
Throughput measures how much work the system does:
- Tokens per second (TPS)
- Requests per minute (RPM)
Higher throughput usually means lower cost—but pushing throughput too hard can destroy user experience.
Goodput: Throughput That Actually Counts
Goodput measures how many requests meet your service-level objectives (SLOs).
If your system completes 100 requests/minute but only 30 meet latency targets, your goodput is 30, not 100.
This metric prevents teams from optimizing the wrong thing.
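Goodput is simple to compute once each request is logged with its latency numbers. In this sketch the SLO is assumed to be a pair of TTFT and TPOT limits, and the record fields are hypothetical.

```python
def goodput(requests, ttft_slo_ms, tpot_slo_ms):
    """Count the requests that meet both latency targets."""
    return sum(
        1 for r in requests
        if r["ttft_ms"] <= ttft_slo_ms and r["tpot_ms"] <= tpot_slo_ms
    )

# Hypothetical one-minute window of completed requests.
window = [
    {"ttft_ms": 150, "tpot_ms": 40},
    {"ttft_ms": 900, "tpot_ms": 35},   # misses the TTFT target
    {"ttft_ms": 180, "tpot_ms": 120},  # misses the TPOT target
    {"ttft_ms": 190, "tpot_ms": 45},
]

throughput_per_min = len(window)
goodput_per_min = goodput(window, ttft_slo_ms=200, tpot_slo_ms=100)
```

Here the system completes four requests in the window, but only two of them count toward goodput.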
5. Hardware: Why GPUs, Memory, and Power Dominate Costs
What Is an AI Accelerator?
An accelerator is specialized hardware designed for specific workloads.
For AI, the dominant accelerator is the GPU, designed for massive parallelism—perfect for matrix multiplication, which dominates neural network workloads.
Why GPUs Beat CPUs for AI
- CPUs: few powerful cores, good for sequential logic
- GPUs: thousands of small cores, great for parallel math
More than 90% of neural network computation boils down to matrix multiplication, which GPUs excel at.
Memory Hierarchy Matters More Than You Think
Modern accelerators use multiple memory layers:
- CPU DRAM (slow, large)
- GPU HBM (fast, smaller)
- On-chip SRAM (extremely fast, tiny)
Inference optimization is often about moving data less and smarter across this hierarchy.
Power Is a Hidden Bottleneck
High-end GPUs consume enormous energy:
- An H100 running continuously can use ~7,000 kWh/year
- Comparable to a household’s annual electricity use
This makes power, and cooling, a real constraint on AI scaling.
6. Model-Level Optimization: Making Models Lighter and Faster
Model Compression Techniques
Several techniques reduce model size:
Quantization
- Reduce numerical precision (FP32 → FP16 → INT8 → INT4)
- Smaller models
- Faster inference
- Lower memory bandwidth use
Distillation
- Train a smaller model to mimic a larger one
- Often surprisingly effective
Pruning
- Remove unimportant parameters
- Creates sparse models
- Less common due to hardware limitations
Among these, weight-only quantization is the most widely used in practice.
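As a rough illustration of weight-only quantization, the sketch below uses symmetric absolute-maximum scaling to map floats to 8-bit integers and back. This is one simple scheme among many, chosen for the example; real inference stacks typically quantize per channel or per block and use specialized kernels.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale for all-zero input
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]  # hypothetical weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte instead of two or four, at the price of a rounding error of at most half the scale factor per value.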
The Autoregressive Bottleneck
Generating text one token at a time is:
- Slow
- Expensive
- Memory-bandwidth heavy
Several techniques attempt to overcome this fundamental limitation.
Speculative Decoding
Idea:
- Use a smaller “draft” model to guess future tokens
- Have the main model verify them in parallel
If many draft tokens are accepted, the system generates multiple tokens per step—dramatically improving speed without hurting quality.
This technique is now widely supported in modern inference frameworks.
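The draft-and-verify loop can be sketched with two toy "models" that are plain functions over integer tokens. Everything here is hypothetical: the models, the greedy accept-if-equal rule (a simplification of the probabilistic acceptance used in practice), and the verification loop, which a real system would run as one parallel forward pass of the main model.

```python
def target_next(prefix):
    # Hypothetical main model: always predicts the previous token plus one (mod 10).
    return (prefix[-1] + 1) % 10

def draft_next(prefix):
    # Hypothetical draft model: agrees with the main model except right after token 4.
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10

def speculative_decode(prefix, n_new, k=4):
    """Greedy speculative decoding: returns the tokens and the number of verification passes."""
    out = list(prefix)
    target_passes = 0
    while len(out) - len(prefix) < n_new:
        # 1. The draft model proposes k tokens, one after another (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. The main model checks all k positions; conceptually a single pass.
        target_passes += 1
        accepted = []
        for token in draft:
            expected = target_next(out + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # replace the first wrong guess and stop
                break
        out.extend(accepted)
    return out[: len(prefix) + n_new], target_passes

tokens, passes = speculative_decode([0], n_new=8)

# Baseline: the main model alone, one token per pass.
baseline = [0]
for _ in range(8):
    baseline.append(target_next(baseline))
```

The output matches what the main model would have produced on its own, but it took three verification passes instead of eight, which is where the speedup comes from when the draft model guesses well.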
Inference with Reference
Instead of generating text the model already knows (e.g., copied context), simply reuse tokens from the input.
This works especially well for:
- Document Q&A
- Code editing
- Multi-turn conversations
It avoids redundant computation and speeds up generation.
Parallel Decoding
Some techniques try to generate multiple future tokens simultaneously.
Examples:
- Lookahead decoding
- Medusa
These approaches are promising but complex, requiring careful verification to ensure coherence.
7. Attention Optimization: Taming the KV Cache Explosion
Why Attention Is Expensive
Each new token attends to all previous tokens.
Without optimization:
- Computation grows quadratically with sequence length
- KV cache grows linearly, but is still huge
For large models and long contexts, the KV cache can exceed model size itself.
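A rough size estimate makes this tangible. The sketch assumes standard multi-head attention with no cache-reduction tricks, so every layer stores one key vector and one value vector of the model’s hidden size per token; the configuration (40 layers, hidden size 5120, 13B parameters, 16-bit values) is an illustrative assumption.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, hidden_dim, bytes_per_value=2):
    """Estimate KV cache size; the leading 2 accounts for storing both keys and values."""
    return 2 * batch_size * seq_len * n_layers * hidden_dim * bytes_per_value

# Illustrative 13B-parameter configuration served with 16-bit values.
cache_bytes = kv_cache_bytes(batch_size=32, seq_len=2048, n_layers=40, hidden_dim=5120)
weight_bytes = 13_000_000_000 * 2

cache_gb = cache_bytes / 1e9
weight_gb = weight_bytes / 1e9
```

Under these assumptions the cache needs about 54 GB against 26 GB for the weights, roughly double the model itself, which is why so much optimization effort targets it.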
KV Cache Optimization Techniques
Three broad strategies exist:
Redesigning Attention
- Local window attention
- Multi-query attention
- Grouped-query attention
- Cross-layer attention
These reduce how much data must be stored and reused.
Optimizing KV Cache Storage
Frameworks like vLLM introduced:
- PagedAttention
- Flexible memory allocation
- Reduced fragmentation
Other approaches:
- KV cache quantization
- Adaptive compression
- Selective caching
Writing Better Kernels
Instead of changing algorithms, optimize how computations are executed on hardware.
The most famous example:
- FlashAttention
  - Fuses multiple operations
  - Minimizes memory access
  - Huge speedups in practice
8. Service-Level Optimization: Making the Whole System Work
Batching: The Simplest Cost Saver
Batching combines multiple requests:
- Improves throughput
- Reduces cost
- Increases latency
Types:
- Static batching
- Dynamic batching
- Continuous batching
The trick is batching without hurting latency too much.
Compilers and Kernels
Modern inference relies heavily on:
- Compilers (torch.compile, XLA, TensorRT)
- Hardware-specific kernels
These translate high-level model code into highly optimized machine instructions.
Many companies treat their kernels as trade secrets because they directly translate into cost advantages.
Conclusion: Inference Optimization Is the Real Production Skill
Training models gets headlines. Inference optimization keeps businesses alive.
Inference optimization:
- Determines user experience
- Dominates long-term cost
- Requires cross-disciplinary thinking
- Is where real-world AI engineering happens
As models become commoditized, efficient inference becomes a competitive moat.
The future of AI won’t be decided by who trains the biggest model—but by who can run models fastest, cheapest, and most reliably.
Addendum
My prompt:
Role: You are an expert in AI Engineering and a prolific writer.
Task: Spin this attached chapter 9 as a blog post in layman terms
Rules: Organize the post in 7-8 sections with subsections as needed
Note: Blog post should be about 6000 to 7000 words long
Note: Try not to miss any important section or details
Ch 7 - Finetuning - Chapter Summary using GPT 5.2 (Dec 2025)
Prompt
Role: You are an Expert Professor of Agentic AI
Task: Explain the attached PDF (Chapter 7 - Finetuning) in layman terms
Strictly Follow: Do not miss out any details while simplifying things.
Word Limit: Respond in about 5000 to 7000 words
Response
Chapter 7 — Finetuning (Explained Simply, Without Losing Depth)
1. What “Finetuning” Really Means (Big Picture)
Let’s start with the core idea.
Finetuning is the process of taking an already-trained AI model and teaching it to behave better for a specific purpose.
Think of it like this:
- Pre-training = going to school and learning everything (language, math, reasoning).
- Prompting = giving instructions like “please answer like a lawyer.”
- Finetuning = sending the model for professional training so it permanently learns that behavior.
Instead of just telling the model what to do at runtime (prompting), finetuning changes the model’s internal parameters (weights) so it naturally behaves the way you want.
This is fundamentally different from prompt-based methods discussed in earlier chapters, which do not change the model internally.
2. Why Finetuning Exists at All
Large language models already know a lot. So why finetune?
Because:
- They don’t always follow instructions reliably
- They may not produce output in the exact format you need
- They may not specialize well in niche or proprietary tasks
- They may behave inconsistently across prompts
Finetuning helps with:
- Instruction following
- Output formatting (JSON, YAML, code, DSLs)
- Domain specialization (legal, medical, finance)
- Bias mitigation
- Safety alignment
In short:
Prompting changes what you ask.
Finetuning changes how the model thinks.
3. Finetuning Is Transfer Learning
Finetuning is not a new idea. It’s a form of transfer learning, first described in the 1970s.
Human analogy:
If you already know how to play the piano, learning the guitar is easier.
AI analogy:
If a model already knows language, learning legal Q&A requires far fewer examples.
This is why:
- Training a legal QA model from scratch might need millions of examples
- Finetuning a pretrained model might need hundreds
This efficiency is what makes foundation models so powerful.
4. Types of Finetuning (Conceptual Overview)
4.1 Continued Pre-Training (Self-Supervised Finetuning)
Before investing in expensive labeled data, you can:
- Feed the model raw domain text
- No labels required
- Examples:
  - Legal documents
  - Medical journals
  - Vietnamese books
This helps the model absorb domain language cheaply.
This is called continued pre-training.
4.2 Supervised Finetuning (SFT)
Now you use (input → output) pairs.
Examples:
- Instruction → Answer
- Question → Summary
- Text → SQL
This is how models learn:
- What humans expect
- How to respond politely
- How to structure answers
Instruction data is expensive because:
- It needs domain expertise
- It must be correct
- It must be consistent
4.3 Preference Finetuning (RLHF-style)
Instead of a single “correct” answer, you give:
- Instruction
- Preferred answer
- Less-preferred answer
The model learns human preference, not just correctness.
4.4 Long-Context Finetuning
This extends how much text a model can read at once.
But:
- It requires architectural changes
- It can degrade short-context performance
- It’s harder than other finetuning methods
Example: Code Llama extended context from 4K → 16K tokens.
5. Who Finetunes Models?
Model developers:
- OpenAI, Meta, Mistral
- Release multiple post-trained variants
Application developers:
- Usually finetune already post-trained models
- Less work, less cost
The more refined the base model, the less finetuning you need.
6. When Should You Finetune?
This is one of the most important sections.
Key principle:
Finetuning should NOT be your first move.
It requires:
- Data
- GPUs
- ML expertise
- Monitoring
- Long-term maintenance
You should try prompting exhaustively first.
Many teams rush to finetuning because:
- Prompts were poorly written
- Examples were unrealistic
- Metrics were undefined
After fixing prompts, finetuning often becomes unnecessary.
7. Good Reasons to Finetune
7.1 Task-Specific Weaknesses
Example:
- Model handles SQL
- But fails on your company’s SQL dialect
Finetuning on your dialect fixes this.
7.2 Structured Outputs
If you must get:
- Valid JSON
- Compilable code
- Domain-specific syntax
Finetuning helps more than prompting.
7.3 Bias Mitigation
If a model shows bias:
- Gender bias
- Racial bias
Carefully curated finetuning data can reduce it.
7.4 Small Models Can Beat Big Models
A finetuned small model can outperform a huge generic model on a narrow task.
Example:
- Grammarly finetuned Flan-T5
- Beat a GPT-3 variant
- Used only 82k examples
- Model was 60× smaller
8. Reasons NOT to Finetune
8.1 Performance Trade-offs
Finetuning for Task A can degrade Task B.
This is called catastrophic interference.
8.2 High Up-Front Cost
You need:
- Annotated data
- ML knowledge
- Infrastructure
- Serving strategy
8.3 Ongoing Maintenance
New base models keep appearing.
You must decide:
- When to switch
- When to re-finetune
- When gains are “worth it”
9. Finetuning vs RAG (Critical Distinction)
This chapter makes a very important rule:
RAG is for facts.
Finetuning is for behavior.
Use RAG when:
- Model lacks information
- Data is private
- Data is changing
- Answers must be up-to-date
Use finetuning when:
- Model output is irrelevant
- Format is wrong
- Syntax is incorrect
- Style is inconsistent
Studies show:
- RAG often beats finetuning for factual Q&A
- RAG + base model > finetuned model alone.
10. Why Finetuning Is So Memory-Hungry
This is the core technical bottleneck.
Inference:
- Only forward pass
- Need weights + activations
Training (Finetuning):
- Forward pass
- Backward pass
- Gradients
- Optimizer states
Each trainable parameter requires:
- The weight
- The gradient
- 1–2 optimizer values (Adam uses 2)
So memory explodes.
11. Memory Math (Intuition)
Inference memory ≈
weights × 1.2
Example:
- 13B params
- 2 bytes each → 26 GB
- Total ≈ 31 GB
Training memory =
weights + activations + gradients + optimizer states
Example:
- 13B params
- Adam optimizer
- 16-bit precision
Just gradients + optimizer states = 78 GB
That’s why finetuning is hard.
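The arithmetic behind these numbers is easy to reproduce. The sketch below assumes 1 GB = 10^9 bytes, a 20% inference overhead for activations and caches, and one gradient plus two Adam values per trainable parameter, all stored at the same precision; the function names are made up for the example.

```python
def inference_memory_gb(params_billion, bytes_per_param=2, overhead=1.2):
    """Weights plus roughly 20% for activations and caches."""
    return params_billion * bytes_per_param * overhead

def training_state_gb(params_billion, bytes_per_value=2, optimizer_states=2):
    """Gradients (1 value) plus optimizer states (2 for Adam) per trainable parameter."""
    return params_billion * bytes_per_value * (1 + optimizer_states)

inference_13b = inference_memory_gb(13)   # about 31 GB
training_state_13b = training_state_gb(13)  # gradients + Adam states only
```

Note that the training figure excludes the weights themselves and the activations, so the real requirement is higher still.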
12. Numerical Precision & Quantization
Floating-point formats:
- FP32 (4 bytes)
- FP16 (2 bytes)
- BF16
- TF32
Lower precision = less memory, faster compute.
Quantization
Quantization = reducing precision.
Examples:
- FP32 → FP16
- FP16 → INT8
- INT8 → INT4
This dramatically reduces memory.
Post-Training Quantization (PTQ)
Most common.
- Train in high precision
- Quantize for inference
Quantization-Aware Training (QAT)
Simulates low precision during training.
- Improves low-precision inference
- Doesn’t reduce training cost
Training Directly in Low Precision
Hard but powerful.
- Character.AI trained models fully in INT8
- Eliminated mismatch
- Reduced cost
13. Why Full Finetuning Doesn’t Scale
Example:
- 7B model
- FP16 weights = 14 GB
- Gradients + Adam optimizer states = +42 GB
- Total = 56 GB (without activations)
Most GPUs cannot handle this.
Hence the rise of:
14. Parameter-Efficient Finetuning (PEFT)
Idea:
Update fewer parameters, get most of the benefit.
Instead of changing everything:
- Freeze most weights
- Add small trainable components
15. Partial Finetuning (Why It’s Not Enough)
Freezing early layers helps memory, but:
- You need to update ~25% of the parameters to match full finetuning
Still expensive.
16. Adapter-Based PEFT
Houlsby et al. introduced adapters:
- Small modules inserted into each layer
- Only adapters are trained
- Base model is frozen
Result:
- ~3% parameters
- ~99.6% performance
Downside:
- Extra inference latency
17. LoRA (Low-Rank Adaptation)
LoRA solved adapter latency.
Core idea:
Instead of adding layers, modify weight matrices mathematically.
A big matrix:
W
Becomes:
W + (A × B)
Where:
- A and B are small
- Only A and B are trained
- W stays frozen
This uses low-rank factorization.
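A tiny pure-Python sketch of the idea, using small integer matrices so the results are exact. The matrices, the rank of 1, and the helper names are invented for illustration; real LoRA implementations also apply a scaling factor and operate on GPU tensors.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def lora_params(n, m, r):
    """Trainable values in A (n x r) and B (r x m)."""
    return n * r + r * m

W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]  # frozen 3x3 weight matrix
A = [[1], [2], [0]]                    # trainable, 3x1
B = [[0, 1, 1]]                        # trainable, 1x3
x = [[1, 1, 1]]                        # one input row

# Adapters kept separate: run the frozen path and the low-rank path, then add.
separate = add(matmul(x, W), matmul(matmul(x, A), B))
# Adapters merged into the base weights ahead of time.
merged = matmul(x, add(W, matmul(A, B)))

full_count = 4096 * 4096
lora_count = lora_params(4096, 4096, 8)
```

Both paths give the same output, which is why the adapter can either be merged into W or kept separate at serving time. For a 4096 × 4096 matrix with rank 8, the trainable values drop from about 16.8 million to 65,536.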
Why LoRA Works
- Neural weight updates are often low-rank
- You don’t need to update everything
- Rank 4–64 is usually enough
Example:
- GPT-3 finetuned with 0.0027% parameters
- Matched full finetuning performance
18. Where LoRA Is Applied
Mostly to:
- Query (Wq)
- Key (Wk)
- Value (Wv)
- Output (Wo)
Applying LoRA to query + value often gives best returns.
19. Serving LoRA Models
Two strategies:
1. Merge LoRA into base model
- Faster inference
- More storage (a full copy of the weights for each finetuned variant)
2. Keep adapters separate
- Slight latency cost
- Massive storage savings
- Enables multi-LoRA serving
Apple uses this approach to serve many features from one base model.
20. QLoRA (Quantized LoRA)
QLoRA = LoRA + 4-bit quantization.
- Base weights stored in 4-bit
- Dequantized to BF16 during compute
- Uses NF4 format
- Uses CPU–GPU paging
Result:
- 65B models finetuned on single 48 GB GPU
Downside:
- Slower training due to quantization overhead
21. Model Merging
Instead of finetuning one model:
- Finetune multiple models separately
- Merge them later
Benefits:
- Avoids catastrophic forgetting
- Reduces memory
- Enables multi-task models
- Useful for on-device deployment
Model merging generalizes ensembles, but merges weights instead of outputs.
22. Final Summary (Mental Model)
Finetuning is powerful—but costly.
- Start with prompting
- Add RAG for knowledge
- Finetune for behavior
- Use PEFT whenever possible
- Prefer LoRA or QLoRA
- Consider model merging for multi-task systems
The hard part isn’t finetuning itself.
The hard part is data, evaluation, and long-term maintenance.
That’s why the next chapter focuses entirely on data.
