Wednesday, December 31, 2025

Inference Optimization: How AI Models Become Faster, Cheaper, and Actually Useful (Chapter 9 - AI Engineering - By Chip Huyen)


Why “running” an AI model well is just as hard as building it


Introduction: Why Inference Matters More Than You Think

In the AI world, we spend a lot of time talking about training models—bigger models, better architectures, more data, and more GPUs. But once a model is trained, a much more practical question takes over:

How do you run it efficiently for real users?

This is where inference optimization comes in.

Inference is the process of using a trained model to produce outputs for new inputs. In simple terms:

  • Training is learning

  • Inference is doing the work

No matter how brilliant a model is, if it’s too slow, users will abandon it. If it’s too expensive, businesses won’t deploy it. Worse, if inference takes longer than the window in which a prediction is useful, the model becomes useless. Imagine a stock prediction that arrives after the market closes.

As the chapter makes clear, inference optimization is about making AI models faster and cheaper without ruining their quality. And it’s not just a single discipline—it sits at the intersection of:

  • Machine learning

  • Systems engineering

  • Hardware architecture

  • Compilers

  • Distributed systems

  • Cloud infrastructure

This blog post explains inference optimization in plain language, without skipping the important details, so you can understand why AI systems behave the way they do—and how teams improve them in production.


1. Understanding Inference: From Model to User

Training vs Inference (A Crucial Distinction)

Every AI model has two distinct life phases:

  1. Training – learning patterns from data

  2. Inference – generating outputs for new inputs

Training is expensive but happens infrequently. Inference happens constantly.

Most AI engineers, application developers, and product teams spend far more time worrying about inference than training, especially if they’re using pretrained or open-source models.


What Is an Inference Service?

In production, inference doesn’t happen in isolation. It’s handled by an inference service, which includes:

  • An inference server that runs the model

  • Routing logic

  • Request handling

  • Possibly preprocessing and postprocessing

Model APIs like OpenAI, Gemini, or Claude are inference services. If you use them, you don’t worry about optimization details. But if you host your own models, you own the entire inference pipeline—performance, cost, scaling, and failures included.


Why Optimization Is About Bottlenecks

Optimization always starts with a question:

What is slowing us down?

Just like traffic congestion in a city, inference systems have chokepoints. Identifying these bottlenecks determines which optimization techniques actually help.


2. Compute-Bound vs Memory-Bound: The Core Bottleneck Concept

Two Fundamental Bottlenecks

Inference workloads usually fall into one of two categories:

Compute-bound

  • Speed limited by how many calculations the hardware can perform

  • Example: heavy mathematical computation

  • Faster chips or more parallelism help

Memory bandwidth-bound

  • Speed limited by how fast data can be moved

  • Common in large models where weights must be repeatedly loaded

  • Faster memory and smaller models help

This distinction is foundational to understanding inference optimization.
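
A quick way to reason about this is arithmetic intensity: how many floating-point operations a workload performs per byte of data it moves. If that ratio is lower than what the hardware can sustain, the chip is starved for data rather than for compute. The sketch below uses illustrative hardware numbers, not exact specs.

```python
# Rough check: is a workload compute-bound or memory-bound?
# Hardware numbers below are illustrative assumptions, not exact specs.

def bottleneck(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Compare the workload's arithmetic intensity (FLOPs per byte moved)
    to the hardware's compute-to-bandwidth ratio."""
    arithmetic_intensity = flops / bytes_moved       # FLOPs per byte of data moved
    hardware_ratio = peak_flops / peak_bandwidth     # FLOPs the chip can do per byte it can load
    return "compute-bound" if arithmetic_intensity > hardware_ratio else "memory-bound"

# Assume ~1e15 FLOP/s of compute and ~3e12 bytes/s of memory bandwidth (ratio ~333 FLOPs/byte)
print(bottleneck(flops=1e13, bytes_moved=1e10, peak_flops=1e15, peak_bandwidth=3e12))     # compute-bound (big batched matmul)
print(bottleneck(flops=1.4e11, bytes_moved=1.4e11, peak_flops=1e15, peak_bandwidth=3e12)) # memory-bound (single decode step)
```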


Why Language Models Are Often Memory-Bound

Large language models generate text one token at a time. For each token:

  • The model must load large weight matrices

  • Perform relatively little computation per byte loaded

This makes decoding (token generation) memory bandwidth-bound.
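
A back-of-the-envelope calculation shows why. During decoding, every new token requires streaming (roughly) all model weights out of GPU memory. The model size and bandwidth figures below are assumptions for illustration.

```python
# Rough lower bound on per-token decode latency for a single request.
# All numbers are illustrative assumptions.
params = 70e9            # a 70B-parameter model
bytes_per_param = 2      # FP16/BF16 weights
bandwidth = 3.35e12      # assumed ~3.35 TB/s of HBM bandwidth

seconds_per_token = (params * bytes_per_param) / bandwidth
print(f"at least {seconds_per_token * 1e3:.0f} ms/token -> at most {1 / seconds_per_token:.0f} tokens/s")
# ~42 ms/token and ~24 tokens/s, no matter how much idle compute the GPU has,
# because the bottleneck is moving the weights, not multiplying them.
```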


Prefill vs Decode: Two Very Different Phases

Transformer-based language model inference has two stages:

  1. Prefill

    • Processes input tokens in parallel

    • Compute-bound

    • Determines how fast the model “understands” the prompt

  2. Decode

    • Generates one output token at a time

    • Memory bandwidth-bound

    • Determines how fast the response appears

Because these phases behave differently, modern inference systems often separate them across machines for better efficiency.


3. Online vs Batch Inference: Latency vs Cost

Two Types of Inference APIs

Most providers offer:

  • Online APIs – optimized for low latency

  • Batch APIs – optimized for cost and throughput


Online Inference

Used for:

  • Chatbots

  • Code assistants

  • Real-time interactions

Characteristics:

  • Low latency

  • Users expect instant feedback

  • Limited batching

Streaming responses (token-by-token) reduce perceived waiting time but come with risks—users might see bad outputs before they can be filtered.
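
Here is a minimal sketch of what token streaming looks like on the serving side; `generate_tokens` and `looks_unsafe` are hypothetical stand-ins for a real model client and a real content filter.

```python
# Streaming tokens to the user as they are generated, with an incremental filter.
# `generate_tokens` and `looks_unsafe` are hypothetical placeholders.

def stream_response(prompt, generate_tokens, looks_unsafe):
    shown_so_far = []
    for token in generate_tokens(prompt):        # tokens arrive one at a time
        shown_so_far.append(token)
        if looks_unsafe("".join(shown_so_far)):  # the filter runs mid-stream...
            yield "\n[response withdrawn]"       # ...so the user may already have seen part of the bad output
            return
        yield token                              # showing tokens immediately lowers perceived latency
```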


Batch Inference

Used for:

  • Synthetic data generation

  • Periodic reports

  • Document processing

  • Data migration

Characteristics:

  • Higher latency allowed

  • Aggressive batching

  • Much lower cost (often ~50% cheaper)

Unlike traditional ML, where predictions can often be precomputed ahead of time, batch inference for foundation models can’t precompute responses because user prompts are open-ended.


4. Measuring Inference Performance: Metrics That Actually Matter

Latency Is Not One Number

Latency is best understood as multiple metrics:

Time to First Token (TTFT)

  • How long users wait before seeing anything

  • Tied to prefill

  • Critical for chat interfaces

Time Per Output Token (TPOT)

  • Speed of token generation after the first token

  • Determines how fast long responses feel

Two systems with the same total latency can feel very different depending on TTFT and TPOT trade-offs.
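
If your client library exposes a token stream, both metrics are easy to measure; a minimal sketch:

```python
# Measuring TTFT and TPOT from a stream of generated tokens.
import time

def timed_generation(token_stream):
    """`token_stream` is any iterable that yields tokens as they are produced."""
    start = time.perf_counter()
    arrival_times = [time.perf_counter() for _ in token_stream]
    ttft = arrival_times[0] - start                   # time to first token (prefill + first decode step)
    tpot = (arrival_times[-1] - arrival_times[0]) / max(len(arrival_times) - 1, 1)  # avg time per output token
    return ttft, tpot, arrival_times[-1] - start      # total latency = TTFT + TPOT * (tokens - 1)
```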


Percentiles, Not Averages

Latency is a distribution.

A single slow request can ruin the average. That’s why teams look at:

  • p50 (median)

  • p90

  • p95

  • p99
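
Computing these from raw request latencies takes only a few lines (a sketch using Python’s standard library):

```python
# Percentile latencies with a single outlier in the mix.
import statistics

latencies_ms = [210, 250, 240, 230, 260, 255, 245, 235, 4200, 248]  # one 4.2 s outlier

cuts = statistics.quantiles(latencies_ms, n=100)      # 99 percentile cut points
p50 = statistics.median(latencies_ms)
p90, p95, p99 = cuts[89], cuts[94], cuts[98]

print(f"mean={statistics.mean(latencies_ms):.0f} ms  p50={p50:.0f}  p90={p90:.0f}  p95={p95:.0f}  p99={p99:.0f}")
# The single slow request drags the mean far above the median, which is
# exactly why teams report percentiles instead of averages.
```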

Outliers often signal:

  • Network issues

  • Oversized prompts

  • Resource contention


Throughput and Cost

Throughput measures how much work the system does:

  • Tokens per second (TPS)

  • Requests per minute (RPM)

Higher throughput usually means lower cost—but pushing throughput too hard can destroy user experience.


Goodput: Throughput That Actually Counts

Goodput measures how many requests meet your service-level objectives (SLOs).

If your system completes 100 requests/minute but only 30 meet latency targets, your goodput is 30, not 100.

This metric prevents teams from optimizing the wrong thing.
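
Goodput is straightforward to compute once you log per-request latency metrics; the SLO thresholds below are arbitrary assumptions for illustration.

```python
# Goodput: only requests that meet the SLO count.

def goodput(requests, ttft_slo_ms=500, tpot_slo_ms=50):
    """`requests` is a list of dicts with measured 'ttft_ms' and 'tpot_ms'."""
    return sum(1 for r in requests
               if r["ttft_ms"] <= ttft_slo_ms and r["tpot_ms"] <= tpot_slo_ms)

completed = [{"ttft_ms": 300, "tpot_ms": 40}] * 30 + [{"ttft_ms": 2500, "tpot_ms": 120}] * 70
print(f"{len(completed)} requests/min completed, goodput = {goodput(completed)}/min")  # 100 completed, goodput = 30
```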


5. Hardware: Why GPUs, Memory, and Power Dominate Costs

What Is an AI Accelerator?

An accelerator is specialized hardware designed for specific workloads.

For AI, the dominant accelerator is the GPU, designed for massive parallelism—perfect for matrix multiplication, which dominates neural network workloads.


Why GPUs Beat CPUs for AI

  • CPUs: few powerful cores, good for sequential logic

  • GPUs: thousands of small cores, great for parallel math

More than 90% of neural network computation boils down to matrix multiplication, which GPUs excel at.
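
You can see the difference even on a CPU by comparing a sequential, one-multiplication-at-a-time loop against a vectorized matrix multiply. Timings will vary by machine; the same principle, scaled up massively, is what GPUs exploit.

```python
# Sequential loops vs. vectorized matrix multiplication.
import time
import numpy as np

n = 128
A, B = np.random.rand(n, n), np.random.rand(n, n)

t0 = time.perf_counter()
C_loops = [[sum(A[i, k] * B[k, j] for k in range(n)) for j in range(n)] for i in range(n)]
t_loops = time.perf_counter() - t0       # one multiply-add at a time, in pure Python

t0 = time.perf_counter()
C_vec = A @ B                            # vectorized BLAS kernel doing the same math in parallel
t_vec = time.perf_counter() - t0

print(f"loops: {t_loops:.2f} s   vectorized: {t_vec * 1e3:.3f} ms")
```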


Memory Hierarchy Matters More Than You Think

Modern accelerators use multiple memory layers:

  • CPU DRAM (slow, large)

  • GPU HBM (fast, smaller)

  • On-chip SRAM (extremely fast, tiny)

Inference optimization is often about moving data less and smarter across this hierarchy.


Power Is a Hidden Bottleneck

High-end GPUs consume enormous energy:

  • An H100 running continuously can use ~7,000 kWh/year

  • Comparable to a household’s annual electricity use

This makes power—and cooling—a real constraint on AI scaling.
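
The arithmetic behind that figure is simple. Assuming roughly 0.8 kW of average draw (a ~700 W card plus overhead, an assumption for illustration):

```python
# Back-of-the-envelope annual energy use for one GPU running 24/7.
average_draw_kw = 0.8                    # assumption: ~700 W TDP plus overhead
hours_per_year = 24 * 365
print(f"{average_draw_kw * hours_per_year:,.0f} kWh/year")   # ~7,008 kWh/year
```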


6. Model-Level Optimization: Making Models Lighter and Faster

Model Compression Techniques

Several techniques reduce model size:

Quantization

  • Reduce numerical precision (FP32 → FP16 → INT8 → INT4)

  • Smaller models

  • Faster inference

  • Lower memory bandwidth use

Distillation

  • Train a smaller model to mimic a larger one

  • Often surprisingly effective

Pruning

  • Remove unimportant parameters

  • Creates sparse models

  • Less common due to hardware limitations

Among these, weight-only quantization is the most widely used in practice.
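
Here is a minimal sketch of what weight-only quantization does, using naive symmetric per-tensor INT8. Production libraries quantize per-channel or per-group and handle outliers far more carefully.

```python
# Naive symmetric INT8 weight quantization (per-tensor) -- a sketch, not production code.
import numpy as np

def quantize_int8(weights):
    scale = np.abs(weights).max() / 127.0               # map the largest weight to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
print(f"{w.nbytes >> 20} MB -> {q.nbytes >> 20} MB")     # 64 MB -> 16 MB (4x less memory traffic)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```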


The Autoregressive Bottleneck

Generating text one token at a time is:

  • Slow

  • Expensive

  • Memory-bandwidth heavy

Several techniques attempt to overcome this fundamental limitation.


Speculative Decoding

Idea:

  • Use a smaller “draft” model to guess future tokens

  • Have the main model verify them in parallel

If many draft tokens are accepted, the system generates multiple tokens per step—dramatically improving speed without hurting quality.

This technique is now widely supported in modern inference frameworks.
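
In pseudocode-like Python, the loop looks roughly like this; `draft_next_tokens` and `target_verify` are hypothetical stand-ins for the draft model and the target model’s batched verification pass.

```python
# A simplified sketch of speculative decoding.
# `draft_next_tokens` and `target_verify` are hypothetical placeholders.

def speculative_decode(prompt, draft_next_tokens, target_verify, max_new_tokens=256, k=4):
    output = list(prompt)
    while len(output) - len(prompt) < max_new_tokens:
        draft = draft_next_tokens(output, k)   # small model cheaply proposes k tokens
        # The target model scores all k proposals in ONE forward pass and returns
        # the verified prefix plus its own token at the first mismatch, so at
        # least one token is accepted every iteration.
        accepted = target_verify(output, draft)
        output.extend(accepted)                # often several tokens per step instead of one
    return output[len(prompt):len(prompt) + max_new_tokens]
```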


Inference with Reference

Instead of regenerating text that already appears in the context (for example, a passage being quoted or code being edited), the system can reuse those tokens directly from the input.

This works especially well for:

  • Document Q&A

  • Code editing

  • Multi-turn conversations

It avoids redundant computation and speeds up generation.
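
One simple way to implement the idea is “prompt lookup”-style drafting: if the text generated so far matches a span in the input, propose the tokens that follow that span and let the model verify them, much like speculative decoding but with a free draft. A sketch:

```python
# Drafting tokens by copying from the input (a sketch of the "reference" idea).

def draft_from_reference(prompt_tokens, generated_tokens, ngram=3, k=8):
    tail = generated_tokens[-ngram:]
    if len(tail) < ngram:
        return []                                          # not enough context to match yet
    for i in range(len(prompt_tokens) - ngram):
        if prompt_tokens[i:i + ngram] == tail:             # same span found in the input
            return prompt_tokens[i + ngram:i + ngram + k]  # copy what follows as a free draft
    return []                                              # no match: fall back to normal decoding
```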


Parallel Decoding

Some techniques try to generate multiple future tokens simultaneously.

Examples:

  • Lookahead decoding

  • Medusa

These approaches are promising but complex, requiring careful verification to ensure coherence.


7. Attention Optimization: Taming the KV Cache Explosion

Why Attention Is Expensive

Each new token attends to all previous tokens.

Without optimization:

  • Computation grows quadratically

  • KV cache grows linearly—but still huge

For large models and long contexts, the KV cache can grow larger than the model weights themselves.
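
A quick size estimate shows why. The configuration below is an assumption, loosely modeled on a 70B-class model with grouped-query attention serving long contexts:

```python
# Estimating KV cache size (illustrative numbers, not exact specs).
layers, kv_heads, head_dim = 80, 8, 128      # grouped-query attention: only 8 KV heads
bytes_per_value = 2                          # FP16
context_len, batch_size = 32_768, 16         # long contexts, 16 concurrent sequences

kv_bytes = 2 * layers * kv_heads * head_dim * context_len * batch_size * bytes_per_value
#          ^ one key and one value vector per layer, per KV head, per token, per sequence
print(f"{kv_bytes / 2**30:.0f} GiB of KV cache")   # ~160 GiB -- more than the ~140 GB of FP16 weights
```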


KV Cache Optimization Techniques

Three broad strategies exist:

Redesigning Attention

  • Local window attention

  • Multi-query attention

  • Grouped-query attention

  • Cross-layer attention

These reduce how much data must be stored and reused.


Optimizing KV Cache Storage

Frameworks like vLLM introduced PagedAttention, which:

  • Stores the KV cache in fixed-size blocks, inspired by virtual memory paging

  • Allows flexible memory allocation

  • Reduces fragmentation

Other approaches:

  • KV cache quantization

  • Adaptive compression

  • Selective caching


Writing Better Kernels

Instead of changing algorithms, optimize how computations are executed on hardware.

The most famous example is FlashAttention, which:

  • Fuses the attention operations into a single kernel

  • Minimizes reads and writes to GPU memory

  • Delivers large speedups in practice
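
In PyTorch, you get a fused attention kernel through torch.nn.functional.scaled_dot_product_attention; on supported GPUs and dtypes it dispatches to a FlashAttention-style implementation instead of materializing the full attention matrix. A minimal sketch (the exact kernel chosen depends on your hardware and PyTorch version):

```python
# Fused scaled-dot-product attention in PyTorch.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
batch, heads, seq_len, head_dim = 1, 16, 2048, 128
q = torch.randn(batch, heads, seq_len, head_dim, device=device)
k = torch.randn(batch, heads, seq_len, head_dim, device=device)
v = torch.randn(batch, heads, seq_len, head_dim, device=device)

# Dispatches to a fused kernel (FlashAttention-style when dtype and hardware allow),
# avoiding a full seq_len x seq_len attention matrix in slow GPU memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([1, 16, 2048, 128])
```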


8. Service-Level Optimization: Making the Whole System Work

Batching: The Simplest Cost Saver

Batching combines multiple requests:

  • Improves throughput

  • Reduces cost

  • Increases latency

Types:

  • Static batching

  • Dynamic batching

  • Continuous batching

The trick is batching without hurting latency too much.
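
A dynamic batcher is conceptually simple: flush a batch when it is full or when the oldest request has waited too long. The sketch below is illustrative; `run_model_on_batch` stands in for the real inference call.

```python
# A minimal dynamic batching loop. `run_model_on_batch` is a hypothetical placeholder.
import time
import queue

def batching_loop(request_queue, run_model_on_batch, max_batch_size=8, max_wait_s=0.02):
    while True:
        batch = [request_queue.get()]                       # block until at least one request arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch_size and time.monotonic() < deadline:
            try:
                batch.append(request_queue.get(timeout=max(deadline - time.monotonic(), 0)))
            except queue.Empty:
                break                                       # waited long enough; serve what we have
        run_model_on_batch(batch)                           # bigger batches -> better GPU utilization;
                                                            # longer waits -> worse latency
```

Continuous batching goes a step further: new requests can join and finished requests can leave at each decoding step, rather than waiting for an entire batch to complete.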


Compilers and Kernels

Modern inference relies heavily on:

  • Compilers (torch.compile, XLA, TensorRT)

  • Hardware-specific kernels

These translate high-level model code into highly optimized machine instructions.

Many companies treat their kernels as trade secrets because they directly translate into cost advantages.
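
You rarely write these kernels yourself; a compiler entry point like torch.compile is usually the first thing to try. A minimal sketch:

```python
# Compiling a model with torch.compile (PyTorch 2.x).
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).eval()

compiled = torch.compile(model)      # default Inductor backend fuses ops and generates kernels

x = torch.randn(8, 4096)
with torch.no_grad():
    y = compiled(x)                  # first call triggers compilation; later calls reuse the kernels
print(y.shape)                       # torch.Size([8, 4096])
```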


Conclusion: Inference Optimization Is the Real Production Skill

Training models gets headlines. Inference optimization keeps businesses alive.

Inference optimization:

  • Determines user experience

  • Dominates long-term cost

  • Requires cross-disciplinary thinking

  • Is where real-world AI engineering happens

As models become commoditized, efficient inference becomes a competitive moat.

The future of AI won’t be decided by who trains the biggest model—but by who can run models fastest, cheapest, and most reliably.


Addendum

My prompt:
Role: You an expert in AI Engineering and a prolific writer. 
Task: Spin this attached chapter 9 as a blog post in layman terms 
Rules: Organize the post in 7-8 sections with subsections as needed 
Note: Blog post should be about 6000 to 7000 words long
Note: Try not to miss any important section or details
Tags: Artificial Intelligence,Generative AI,Agentic AI,Technology,Book Summary,
