
Wednesday, December 31, 2025

The Genesis and Evolution of AI Agents (Chapter 1)



Chapter 1 from the book "Agentic AI: Theories and Practices" (By Ken Huang)

A conversational deep-dive into how AI went from tools to thinking collaborators


Introduction: Why Everyone Is Suddenly Talking About “AI Agents”

If you’ve been following AI over the last few years, you’ve probably noticed something interesting. The conversation has shifted.

We’re no longer just talking about:

  • chatbots that answer questions,

  • models that summarize text,

  • or systems that classify images.

Instead, we’re hearing phrases like:

  • AI agents

  • autonomous systems

  • AI coworkers

  • long-horizon reasoning

  • AI that plans and acts

This chapter opens by saying, very clearly: we are entering a new era of AI. Not just better AI—but different AI.

AI Agents are not just smarter tools. They are systems that can:

  • decide what to do,

  • figure out how to do it,

  • act on their own,

  • and improve over time.

That might sound like science fiction, but the chapter’s core argument is simple: the foundations for this shift already exist, and we’re watching it happen in real time.


What Is an AI Agent? (And Why It’s Hard to Define)

Why the Definition Is Slippery

The chapter is refreshingly honest: there is no single, perfect definition of an AI Agent.

Why? Because:

  • the technology is evolving fast,

  • new capabilities keep getting added,

  • and today’s “advanced” system becomes tomorrow’s baseline.

Still, the author gives us a good enough definition so we know what we’re talking about.


The Big Idea: From Passive Software to Active Entities

Traditional software is passive. You click a button, it responds. You give it instructions, it executes them exactly as written.

AI Agents are different.

At their core, AI Agents are digital entities that can perceive, think, and act with a degree of independence.

Instead of waiting to be told every step, they can:

  • explore information on their own,

  • decide which path to take,

  • plan multiple steps ahead,

  • adjust behavior based on feedback.

In other words, they behave less like calculators and more like junior collaborators.


What Makes an AI Agent Different from Old AI?

The chapter emphasizes that this isn’t just an incremental upgrade. It’s a qualitative shift.

Earlier AI systems were:

  • rule-bound,

  • narrow,

  • brittle,

  • and static.

AI Agents are:

  • adaptive,

  • proactive,

  • flexible,

  • and capable of handling messy, real-world situations.

They don’t just answer questions. They pursue goals.


Core Characteristics of AI Agents (Explained Simply)

The chapter then breaks down what actually makes something an AI Agent. Let’s go through these traits in plain language.


1. Autonomy and Initiative

This is the big one.

An AI Agent doesn’t need a human to micromanage every step. Once given a goal, it can:

  • decide what actions are needed,

  • evaluate different options,

  • choose the best path forward.

This autonomy is usually powered by decision-making algorithms and reinforcement learning, but you don’t need to know the math to get the idea.

Think of the difference between:

  • a GPS that only gives directions when you ask, and

  • a navigation system that reroutes automatically when traffic changes.

That second one feels more agent-like.


2. Adaptability and Learning

AI Agents don’t just follow instructions—they learn from experience.

Using techniques like deep learning and transfer learning, they can:

  • improve with new data,

  • adapt to unfamiliar situations,

  • apply past knowledge in new contexts.

This is closer to how humans learn. We don’t relearn everything from scratch every time—we generalize.


3. Multimodal Perception

Humans don’t just read text—we see, hear, and sense the world.

Modern AI Agents are moving in the same direction. They can process:

  • text and speech,

  • images and video,

  • sensor data,

  • sometimes even radar or infrared signals.

By combining multiple input types, agents build a richer understanding of their environment, which leads to better decisions.


4. Reasoning and Problem-Solving

This is where things get really interesting.

AI Agents don’t just retrieve information. They can:

  • reason step by step,

  • infer causes and effects,

  • break complex problems into manageable parts.

They often combine:

  • symbolic logic (rules and structure),

  • probabilistic reasoning (handling uncertainty),

  • neural networks (pattern recognition).

That hybrid approach lets them tackle problems that require more than brute-force computation.


5. Social Intelligence and Collaboration

AI Agents aren’t designed to work alone.

Advanced agents can:

  • hold conversations,

  • understand intent and emotion,

  • collaborate with humans,

  • coordinate with other AI agents.

This is crucial for real-world deployment, where problems are rarely solved by a single isolated system.


6. Ethical Reasoning and Value Alignment

As AI Agents gain autonomy, ethics becomes unavoidable.

The chapter stresses that agents must:

  • reason about consequences,

  • align with human values,

  • respect social norms.

This is an active research area, and it’s not “solved.” But it’s central to responsible deployment.


7. Meta-Learning: Learning How to Learn

At the frontier is meta-learning—agents that improve not just their skills, but their learning process itself.

This means:

  • faster adaptation,

  • less retraining,

  • more independence from human engineers.


8. Explainability and Transparency

As agents grow more complex, humans need to understand:

  • why an agent made a decision,

  • how it arrived at a conclusion.

Explainability builds trust and accountability—especially in high-stakes domains like healthcare or finance.


9. Domain Agnosticism

Earlier AI systems were specialists. Modern AI Agents aim to be generalists.

They can:

  • transfer knowledge across domains,

  • apply skills learned in one area to another.

This is a major step toward more flexible, human-like intelligence.


10. Embodied Intelligence

AI Agents aren’t just software.

When combined with robotics and IoT, they can:

  • move in the physical world,

  • interact with objects,

  • operate vehicles,

  • assist in manufacturing and healthcare.

This bridges digital intelligence with physical action.


A Brief History: How AI Agents Came to Be

To understand why today’s agents feel revolutionary, the chapter walks us through history.


The Dartmouth Conference (1956): Where AI Was Born

The term “Artificial Intelligence” was coined by John McCarthy in the 1955 proposal for the Dartmouth workshop, which took place in the summer of 1956.

The goal was bold: to make machines simulate aspects of human intelligence.

At the time, progress was slow due to:

  • limited hardware,

  • lack of data,

  • overly ambitious expectations.

Still, this conference planted the seed.


1970s–1980s: Expert Systems

This era produced expert systems—programs that encoded human knowledge as rules.

MYCIN, for example, helped diagnose bacterial infections.

Expert systems worked well in narrow domains but:

  • couldn’t adapt,

  • couldn’t learn,

  • broke outside predefined scenarios.


The Actor Model: Early Agent Thinking

Carl Hewitt’s Actor Model proposed systems made of independent “actors” that communicate via messages.

This idea strongly resembles modern:

  • multi-agent systems,

  • distributed AI architectures.

You can see echoes of it today in frameworks like AutoGen and LangGraph.


1990s: Software Agents and the Internet

As the internet grew, researchers like Pattie Maes explored software agents that acted on behalf of users.

Her work influenced:

  • recommendation systems,

  • personalization,

  • intelligent user interfaces.

Many everyday features—like online recommendations—trace back to this era.


2000s: Machine Learning Joins the Party

Reinforcement learning became central.

Agents could now:

  • take actions,

  • receive feedback,

  • improve over time.

Projects like DARPA’s CALO laid the groundwork for assistants like Siri.


2010s–Present: The AI Agent Renaissance

This is where everything accelerated.

Key breakthroughs:

  • deep learning,

  • GPUs,

  • transformers,

  • large language models.

Models like GPT-4, Claude, and Gemini unlocked:

  • natural language understanding,

  • reasoning,

  • planning,

  • tool use.

According to OpenAI’s framework, AI progress now spans five levels—from narrow chatbots to fully autonomous organizational agents.


Taxonomy of AI Agents: Different Flavors of Intelligence

The chapter then introduces a taxonomy—a way to categorize different types of agents.


Reactive Agents

The simplest kind.

They:

  • respond quickly,

  • follow stimulus-response rules,

  • don’t learn or plan.

Examples:

  • obstacle avoidance in robotics,

  • high-frequency trading.

Fast, but shallow.


Deliberative Agents

These agents:

  • build internal models,

  • plan ahead,

  • reason symbolically.

They’re good at strategy but:

  • computationally heavy,

  • slower to react,

  • sensitive to model inaccuracies.


Hybrid Agents

Hybrid agents combine:

  • reactive speed,

  • deliberative planning.

They’re common in:

  • autonomous vehicles,

  • intelligent assistants.

Powerful, but architecturally complex.


Learning Agents

These agents improve with experience.

They:

  • adapt to changing environments,

  • generalize from past data.

Examples:

  • recommendation engines,

  • adaptive control systems.

Their weakness? Data dependency and potential bias.


Cognitive Agents

The most ambitious category.

They aim for:

  • human-like reasoning,

  • abstract thinking,

  • language fluency.

Examples include advanced research assistants and creative AI systems.


Collaborative Agents

Designed to work in groups.

They:

  • communicate,

  • coordinate,

  • solve distributed problems.

Examples:

  • swarm robotics,

  • multi-agent recommendation systems.


Competitive (Adversarial) Agents

These agents operate in conflict.

They:

  • model opponents,

  • use game theory,

  • anticipate adversarial actions.

Examples:

  • cybersecurity,

  • trading bots,

  • competitive games.


Vertical or Domain-Specific Agents

These are specialists.

They:

  • excel in one domain,

  • outperform general systems there,

  • don’t generalize well.

Examples:

  • medical diagnosis systems,

  • chess engines,

  • financial trading algorithms.

Expertise comes at the cost of flexibility.


What Enabled the AI Agent Renaissance?

The chapter highlights a “perfect storm” of technologies.


1. Massive Compute Power

GPUs, TPUs, and specialized chips removed earlier limits.


2. Advances in Natural Language Processing

Large language models bridged the gap between human language and machine reasoning.


3. The Data Explosion

Big data and IoT provided endless learning material.


4. Algorithmic Innovation

Reinforcement learning, transformers, and self-supervised learning pushed boundaries.


5. Interdisciplinary Convergence

Insights from neuroscience, psychology, and computer science shaped modern agent design.


Real-World Examples: Operator and STaR

OpenAI’s Operator Agent

Operator is designed to:

  • navigate the internet autonomously,

  • conduct deep research,

  • handle long-horizon tasks,

  • reason step by step using chain-of-thought.

The chapter describes it as able to run multiple agents in parallel and as showing early signs of PhD-level reasoning.


Stanford’s Self-Taught Reasoner (STaR)

STaR focuses on self-improvement.

It:

  • generates its own training data,

  • learns from limited examples,

  • applies reasoning across domains,

  • uses chain-of-thought for transparency.

This shows how agents might learn more like humans—through reflection and iteration.


Why This Chapter Matters

The chapter closes with a powerful message:

AI Agents are not just a technological upgrade. They are a paradigm shift.

They move AI from:

  • tools → collaborators,

  • narrow → general,

  • reactive → proactive.

They will reshape:

  • work,

  • organizations,

  • creativity,

  • problem-solving.

And they force us to think carefully about ethics, responsibility, and human–AI partnership.


Final Takeaway

AI Agents are not magic.
They are not conscious.
They are not replacements for humans.

But they are the most significant change in how software behaves since the invention of computing.

We’re not just building smarter machines—we’re building systems that think, act, and learn alongside us.

And this chapter is your roadmap to understanding how we got here.


Tags: Artificial Intelligence, Generative AI, Agentic AI, Technology, Book Summary

AI Engineering Architecture and User Feedback (Chapter 10 - AI Engineering - By Chip Huyen)



Building Real AI Products: Architecture, Guardrails, and the Power of User Feedback

Why successful AI systems are more about engineering and feedback loops than models


Introduction: From “Cool Demo” to Real Product

Training or calling a large language model is the easy part of modern AI.

The hard part begins when you try to turn that model into a real product:

  • One that real users rely on

  • One that must be safe, fast, affordable, and reliable

  • One that improves over time instead of silently degrading

This chapter is about that hard part.

So far, most AI discussions focus on prompts, RAG, finetuning, or agents in isolation. But real-world systems are not isolated techniques—they are architectures. And architectures are shaped not just by technical constraints, but by users.

This blog post walks through a progressive AI engineering architecture, starting from the simplest possible setup and gradually adding:

  • Context

  • Guardrails

  • Routers and gateways

  • Caching

  • Agentic logic

  • Monitoring and observability

  • Orchestration

  • User feedback loops

At the end, you’ll see why user feedback is not just UX—it’s your most valuable dataset.


1. The Simplest Possible AI Architecture (And Why It Breaks)

The One-Box Architecture

Every AI product starts the same way:

  1. User sends a query

  2. Query goes to a model

  3. Model generates a response

  4. Response is returned to the user

That’s it.

This architecture is deceptively powerful. For prototypes, demos, and internal tools, it works surprisingly well. Many early ChatGPT-like apps stopped right here.

But this setup has major limitations:

  • The model knows only what’s in its training data

  • No protection against unsafe or malicious prompts

  • No cost control

  • No monitoring

  • No learning from users

This architecture is useful only until users start depending on it.


Why Real Products Need More Than a Model Call

Once users rely on your application:

  • Wrong answers become costly

  • Latency becomes noticeable

  • Hallucinations become dangerous

  • Abuse becomes inevitable

To fix these, teams don’t “replace the model.”
They add layers around it.


2. Step One: Enhancing Context (Giving the Model the Right Information)

Context Is Feature Engineering for LLMs

Large language models don’t magically know your:

  • Company policies

  • Internal documents

  • User history

  • Real-time data

To make models useful, you must construct context.

This can be done via:

  • Text retrieval (documents, PDFs, chat history)

  • Structured data retrieval (SQL, tables)

  • Image retrieval

  • Tool use (search APIs, weather, news, calculators)

This process, often implemented as retrieval-augmented generation (RAG), is analogous to feature engineering in traditional ML.

The model isn’t getting smarter; it’s getting better inputs.


Trade-offs in Context Construction

Not all context systems are equal.

Model APIs (OpenAI, Gemini, Claude):

  • Easier to use

  • Limited document uploads

  • Limited retrieval control

Custom RAG systems:

  • More flexible

  • More engineering effort

  • Require tuning (chunk size, embeddings, rerankers)

Similarly, tool support varies:

  • Some models support parallel tools

  • Some support long-running tools

  • Some don’t support tools at all

As soon as context is added, your architecture already looks much more complex—and much more powerful.


3. Step Two: Guardrails (Protecting Users and Yourself)

Why Guardrails Are Non-Negotiable

AI systems fail in ways traditional software never did.

They can:

  • Leak private data

  • Generate toxic content

  • Execute unintended actions

  • Be tricked by clever prompts

Guardrails exist to reduce risk, not eliminate it (elimination is impossible).

There are two broad types:

  • Input guardrails

  • Output guardrails


Input Guardrails: Protecting What Goes In

Input guardrails prevent:

  • Sensitive data leakage

  • Prompt injection attacks

  • System compromise

Common risks:

  • Employees pasting secrets into prompts

  • Tools accidentally retrieving private data

  • Developers embedding internal policies into prompts

A common defense is PII detection and masking:

  • Detect phone numbers, IDs, addresses, faces

  • Replace them with placeholders

  • Send masked prompt to external API

  • Unmask response locally using a reverse dictionary

This allows functionality without leaking raw data.
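The masking steps above can be sketched roughly as follows. This is a minimal illustration rather than a production detector: the two regex patterns, the placeholder format, and the function names `mask_pii` and `unmask` are assumptions made for this example, and real systems usually rely on dedicated PII detection models with far broader coverage.

```python
import re

# Hypothetical patterns for illustration only; real PII detection needs much more coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def mask_pii(text):
    """Replace detected PII with placeholders and return the reverse dictionary."""
    reverse = {}
    for label, pattern in PII_PATTERNS.items():
        def substitute(match, label=label):
            placeholder = f"[{label}_{len(reverse) + 1}]"
            reverse[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, reverse

def unmask(text, reverse):
    """Restore the original values in a response, locally."""
    for placeholder, original in reverse.items():
        text = text.replace(placeholder, original)
    return text

masked, reverse = mask_pii("Call me at 555-123-4567 or mail ana@example.com")
print(masked)                    # this is what would be sent to the external API
print(unmask(masked, reverse))   # the original text, restored locally
```

The external API only ever sees the placeholders; the reverse dictionary never leaves your own infrastructure.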


Output Guardrails: Protecting What Comes Out

Models can fail in many ways.

Quality failures:

  • Invalid JSON

  • Hallucinated facts

  • Low-quality responses

Security failures:

  • Toxic language

  • Private data leaks

  • Brand-damaging claims

  • Unsafe tool invocation

Some failures are easy to detect (empty output).
Others require AI-based scorers.


Handling Failures: Retries, Fallbacks, Humans

Because models are probabilistic:

  • Retrying can fix many issues

  • Parallel retries reduce latency

  • Humans can be used as a last resort
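A retry loop with a human fallback might look something like the sketch below. It assumes the output guardrail is simply "the response must parse as JSON"; `generate_with_retries` and `stub_model` are hypothetical names, and the stub stands in for a real model call. Retries here are sequential; the parallel variant is not shown.

```python
import json

def generate_with_retries(model, prompt, max_attempts=3):
    """Call `model` until its output parses as JSON, else escalate to a human."""
    for attempt in range(1, max_attempts + 1):
        raw = model(prompt)
        try:
            return {"status": "ok", "attempts": attempt, "data": json.loads(raw)}
        except json.JSONDecodeError:
            continue  # output guardrail failed; try again
    return {"status": "escalate_to_human", "attempts": max_attempts, "data": None}

# Hypothetical model stub: returns truncated JSON once, then a valid answer.
_outputs = iter(['{"answer": "42"', '{"answer": "42"}'])

def stub_model(prompt):
    return next(_outputs)

result = generate_with_retries(stub_model, "What is 6 x 7?")
print(result)
```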

Some teams route conversations to humans:

  • When sentiment turns negative

  • After too many turns

  • When safety confidence drops

Guardrails always involve trade-offs:

  • More guardrails → more latency

  • Streaming responses → weaker guardrails

There is no perfect solution—only careful balancing.


4. Step Three: Routers and Gateways (Managing Complexity at Scale)

Why One Model Is Rarely Enough

As products grow, different queries require different handling:

  • FAQs vs troubleshooting

  • Billing vs technical support

  • Simple vs complex tasks

Using one expensive model for everything is wasteful.

This is where routing comes in.


Routers: Choosing the Right Path

A router is usually an intent classifier.

Examples:

  • “Reset my password” → FAQ

  • “Billing error” → human agent

  • “Why is my app crashing?” → troubleshooting model

Routers help:

  • Reduce cost

  • Improve quality

  • Avoid out-of-scope conversations

Routers can also:

  • Ask clarifying questions

  • Decide which memory to use

  • Choose which tool to call next

Routers must be:

  • Fast

  • Cheap

  • Reliable

That’s why many teams use small models or custom classifiers.
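To make the routing idea concrete, here is a deliberately simple sketch. It uses keyword matching as a stand-in for a trained intent classifier; the route names, the keywords, and the `route` function are all assumptions for illustration, not something the chapter prescribes.

```python
# Hypothetical intents and keywords; a real router would use a small trained classifier.
ROUTES = {
    "faq": ["reset my password", "opening hours"],
    "human_agent": ["billing", "refund"],
    "troubleshooting_model": ["crash", "error", "not working"],
}

def route(query, default="out_of_scope"):
    """Return the first route whose keywords appear in the query."""
    text = query.lower()
    for destination, keywords in ROUTES.items():
        if any(keyword in text for keyword in keywords):
            return destination
    return default

for q in ["Reset my password", "Billing error", "Why is my app crashing?", "Write me a poem"]:
    print(q, "->", route(q))
```

Anything that matches no route falls through to `out_of_scope`, which is how a router keeps the system away from conversations it was never meant to handle.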


Model Gateways: One Interface to Rule Them All

A model gateway is a unified access layer to:

  • OpenAI

  • Gemini

  • Claude

  • Self-hosted models

Benefits:

  • Centralized authentication

  • Cost control

  • Rate limiting

  • Fallback strategies

  • Easier maintenance

Instead of changing every app when an API changes, you update the gateway once.

Gateways also become natural places for:

  • Logging

  • Analytics

  • Guardrails

  • Caching
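A gateway can be sketched as one class that owns the provider list, the fallback order, and the logs. Everything below is hypothetical: `ModelGateway` and the two provider stubs are invented for this example, and a real gateway would wrap each vendor's SDK and add authentication and rate limiting.

```python
class ModelGateway:
    """Single entry point that hides provider differences and handles fallback."""

    def __init__(self):
        self.providers = {}   # name -> callable(prompt) -> str
        self.log = []         # (provider, outcome) tuples for later analysis

    def register(self, name, call):
        self.providers[name] = call

    def generate(self, prompt, order):
        for name in order:
            try:
                text = self.providers[name](prompt)
                self.log.append((name, "ok"))
                return text
            except Exception as exc:
                self.log.append((name, f"failed: {exc}"))
        raise RuntimeError("all providers failed")

# Hypothetical provider stubs standing in for real API clients.
def primary(prompt):
    raise TimeoutError("rate limited")

def self_hosted(prompt):
    return f"echo: {prompt}"

gateway = ModelGateway()
gateway.register("primary", primary)
gateway.register("self_hosted", self_hosted)
print(gateway.generate("hello", order=["primary", "self_hosted"]))
print(gateway.log)
```

Applications call `gateway.generate` and never import a vendor client directly, so a provider change touches one file.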


5. Step Four: Caching (Reducing Latency and Cost)

Why Caching Matters in AI

AI calls are:

  • Slow

  • Expensive

  • Often repetitive

Caching avoids recomputing answers.

There are two main types:

  • Exact caching

  • Semantic caching


Exact Caching: Safe and Simple

Exact caching reuses results only when inputs match exactly.

Examples:

  • Product summaries

  • FAQ answers

  • Embedding lookups

Key considerations:

  • Eviction policy (LRU, LFU, FIFO)

  • Storage layer (memory vs Redis vs DB)

  • Cache duration

Caching must be careful:

  • User-specific data should not be cached globally

  • Time-sensitive queries should not be cached

Mistakes here can cause data leaks.
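The considerations above (eviction, duration, user scoping) fit into a small in-memory sketch. The `ExactCache` class is hypothetical and assumes LRU eviction with a fixed time-to-live; the key includes a user ID so that user-specific answers are never served to someone else.

```python
import time
from collections import OrderedDict

class ExactCache:
    """LRU cache with a time-to-live, keyed by (user scope, exact prompt)."""

    def __init__(self, max_items=1000, ttl_seconds=300):
        self.max_items = max_items
        self.ttl = ttl_seconds
        self.items = OrderedDict()  # key -> (expires_at, value)

    def get(self, prompt, user_id=None, now=None):
        now = time.time() if now is None else now
        key = (user_id, prompt)
        entry = self.items.get(key)
        if entry is None or entry[0] <= now:
            self.items.pop(key, None)   # drop expired entries
            return None
        self.items.move_to_end(key)     # mark as most recently used
        return entry[1]

    def put(self, prompt, value, user_id=None, now=None):
        now = time.time() if now is None else now
        key = (user_id, prompt)
        self.items[key] = (now + self.ttl, value)
        self.items.move_to_end(key)
        if len(self.items) > self.max_items:
            self.items.popitem(last=False)  # evict least recently used

cache = ExactCache(max_items=2, ttl_seconds=60)
cache.put("What is your return policy?", "30 days.", now=0)             # shared answer
cache.put("What is my order status?", "Shipped.", user_id="u1", now=0)  # user-scoped
print(cache.get("What is my order status?", user_id="u2", now=1))       # None: other user
print(cache.get("What is your return policy?", now=61))                 # None: expired
```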


Semantic Caching: Powerful but Risky

Semantic caching reuses answers for similar queries.

Process:

  1. Embed query

  2. Search cached embeddings

  3. If similarity > threshold, reuse result

Pros:

  • Higher cache hit rate

Cons:

  • Incorrect answers

  • Complex tuning

  • Extra vector search cost

Semantic caching only works well when:

  • Embeddings are high quality

  • Similarity thresholds are well tuned

  • Cache hit rate is high

Otherwise, it often causes more harm than good.
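The embed, search, and threshold steps can be sketched with a toy embedding. The bag-of-words `embed` function below is only a stand-in for a real embedding model, and `SemanticCache` with its linear scan is an assumption for illustration; production systems would use a vector index.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

    def get(self, query):
        vector = embed(query)                                            # 1. embed query
        scored = [(cosine(vector, e), ans) for e, ans in self.entries]   # 2. search cache
        if not scored:
            return None
        score, answer = max(scored, key=lambda pair: pair[0])
        return answer if score > self.threshold else None                # 3. reuse if similar

cache = SemanticCache(threshold=0.8)
cache.put("how do I reset my password", "Use the 'Forgot password' link.")
print(cache.get("how do I reset my password?"))   # near-identical wording: hit
print(cache.get("how do I delete my account"))    # different intent: miss
```

With the threshold lowered to 0.6, the second query would score about 0.67 and be served the password answer, which is exactly the incorrect-answer risk listed above.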


6. Step Five: Agent Patterns and Write Actions

Moving Beyond Linear Pipelines

Simple pipelines are sequential:

  • Query → Retrieve → Generate → Return

Agentic systems introduce:

  • Loops

  • Conditional branching

  • Parallel execution

Example:

  • Generate answer

  • Detect insufficiency

  • Retrieve more data

  • Generate again

This dramatically increases capability.
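The generate, check, retrieve, regenerate loop from the example can be sketched like this. The `retrieve` and `generate` functions are hypothetical stubs (a real system would call a retriever and a model), and the `"INSUFFICIENT"` sentinel is an assumption standing in for whatever insufficiency check you use.

```python
# Hypothetical stubs: a real system would call a retriever and a model here.
DOCUMENTS = ["Plan A costs $10 per month.", "Plan B costs $25 per month."]

def retrieve(query, already_seen):
    """Return the next document not yet in the context, or None."""
    for doc in DOCUMENTS:
        if doc not in already_seen:
            return doc
    return None

def generate(query, context):
    """Pretend model: answers only when the context mentions Plan B."""
    for doc in context:
        if "Plan B" in doc:
            return doc
    return "INSUFFICIENT"

def answer_with_loop(query, max_rounds=3):
    context = []
    for _ in range(max_rounds):
        answer = generate(query, context)     # generate answer
        if answer != "INSUFFICIENT":          # detect insufficiency
            return answer, len(context)
        doc = retrieve(query, context)        # retrieve more data
        if doc is None:
            break
        context.append(doc)                   # then generate again
    return "I could not find an answer.", len(context)

print(answer_with_loop("How much is Plan B?"))
```

The `max_rounds` cap matters: without it, a loop like this can burn tokens indefinitely on a question the documents cannot answer.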


Write Actions: Power with Risk

Write actions allow models to:

  • Send emails

  • Place orders

  • Update records

  • Trigger workflows

They make systems vastly more useful—but vastly more dangerous.

Write actions must be:

  • Strictly guarded

  • Audited

  • Often human-approved

Once write actions are added, observability becomes mandatory, not optional.
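One way to picture "guarded, audited, human-approved" is a single wrapper that every write action must pass through. This is a sketch under assumptions: `execute_write_action`, `send_email`, and the in-memory `AUDIT_LOG` are hypothetical, and a real system would persist the audit trail and verify the approver's identity.

```python
import datetime

AUDIT_LOG = []

def send_email(to, body):
    """Hypothetical write action; a real one would call a mail service."""
    return f"sent to {to}"

def execute_write_action(action, args, approved_by=None):
    """Run a write action only with human approval, and audit every attempt."""
    record = {
        "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "action": action.__name__,
        "args": args,
        "approved_by": approved_by,
    }
    if approved_by is None:
        record["result"] = "blocked: needs approval"
    else:
        record["result"] = action(**args)
    AUDIT_LOG.append(record)
    return record["result"]

email = {"to": "a@example.com", "body": "Hi"}
print(execute_write_action(send_email, email))
print(execute_write_action(send_email, email, approved_by="reviewer-1"))
print(len(AUDIT_LOG))  # blocked attempts are logged too
```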


7. Monitoring and Observability: Seeing Inside the Black Box

Monitoring vs Observability

Monitoring:

  • Tracks metrics

  • Tells you something is wrong

Observability:

  • Lets you infer why it’s wrong

  • Without deploying new code

Good observability reduces:

  • Mean time to detection (MTTD)

  • Mean time to resolution (MTTR)

  • Change failure rate (CFR)


Metrics That Actually Matter

Metrics should serve a purpose.

Examples:

  • Format error rate

  • Hallucination signals

  • Guardrail trigger rate

  • Token usage

  • Latency: time to first token (TTFT) and time per output token (TPOT)

  • Cost per request

Metrics should correlate with business metrics:

  • DAU

  • Retention

  • Session duration

If they don’t, you may be optimizing the wrong thing.


Logs and Traces: Debugging Reality

Logs:

  • Record events

  • Help answer “what happened?”

Traces:

  • Reconstruct an entire request’s journey

  • Show timing, costs, failures

In AI systems, logs should capture:

  • Prompts

  • Model parameters

  • Outputs

  • Tool calls

  • Intermediate results

Developers should regularly read production logs—their understanding of “good” and “bad” outputs evolves over time.


Drift Detection: Change Is Inevitable

Things that drift:

  • System prompts

  • User behavior

  • Model versions (especially via APIs)

Drift often goes unnoticed unless explicitly tracked.

Silent drift is one of the biggest risks in AI production.
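Tracking prompt and model-version drift explicitly can be as simple as storing a fingerprint with every deployment and comparing it on each run. The sketch below is one possible approach, not the chapter's method; `fingerprint` and `detect_drift` are hypothetical names, and the model version string is made up. Drift in user behavior needs statistical monitoring and is not covered here.

```python
import hashlib

def fingerprint(system_prompt, model_version):
    """Stable hash of the things that can change without anyone noticing."""
    payload = f"{model_version}\n{system_prompt}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def detect_drift(previous, system_prompt, model_version):
    """Return (changed, new_fingerprint) compared with the last recorded value."""
    current = fingerprint(system_prompt, model_version)
    return current != previous, current

baseline = fingerprint("You are a helpful support bot.", "model-2025-01")

changed, _ = detect_drift(baseline, "You are a helpful support bot.", "model-2025-01")
print(changed)   # False: nothing changed

changed_after_edit, _ = detect_drift(baseline, "You are a helpful support bot. ", "model-2025-01")
print(changed_after_edit)   # True: a trailing space slipped into the prompt
```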


8. User Feedback: Your Most Valuable Dataset

Why User Feedback Is Strategic

User feedback is:

  • Proprietary

  • Real-world

  • Continuously generated

It fuels:

  • Evaluation

  • Personalization

  • Model improvement

  • Competitive advantage

This is the data flywheel.


Explicit vs Implicit Feedback

Explicit feedback:

  • Thumbs up/down

  • Ratings

  • Surveys

Pros:

  • Clear signal

Cons:

  • Sparse

  • Biased

Implicit feedback:

  • Edits

  • Rephrases

  • Regeneration

  • Abandonment

  • Conversation length

  • Sentiment

Pros:

  • Abundant

Cons:

  • Noisy

  • Hard to interpret

Both are necessary.


Conversational Feedback Is Gold

Users naturally correct AI:

  • “No, I meant…”

  • “That’s wrong”

  • “Be shorter”

  • “Check again”

These corrections are:

  • Preference data

  • Evaluation signals

  • Training examples

Edits are especially powerful:

  • Original output = losing response

  • Edited output = winning response

That’s free preference data, the kind used in RLHF-style training.
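Turning an edit into a training example is mostly bookkeeping. In this sketch the record layout (`prompt`, `chosen`, `rejected`) and the function name `preference_pair` are assumptions; use whatever schema your training pipeline expects.

```python
def preference_pair(prompt, model_output, user_edit):
    """Turn a user edit into a (chosen, rejected) example; None if nothing changed."""
    if user_edit.strip() == model_output.strip():
        return None  # no edit means no preference signal
    return {"prompt": prompt, "chosen": user_edit, "rejected": model_output}

pair = preference_pair(
    prompt="Summarize the meeting in one sentence.",
    model_output="The meeting happened and people talked about many things.",
    user_edit="The team agreed to ship the beta on Friday.",
)
print(pair)
```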


Designing Feedback Without Annoying Users

Good feedback systems:

  • Fit naturally into workflows

  • Require minimal effort

  • Can be ignored

Great examples:

  • Midjourney’s image selection

  • GitHub Copilot’s inline suggestions

  • Google Photos’ uncertainty prompts

Bad feedback systems:

  • Interrupt users

  • Ask too often

  • Demand explanation


Conclusion: Architecture and Feedback Are the Real AI Moats

Modern AI success is not about:

  • The biggest model

  • The cleverest prompt

It’s about:

  • Thoughtful architecture

  • Layered defenses

  • Observability

  • Feedback loops

  • Continuous iteration

Models will commoditize.
APIs will change.
What remains defensible is how well you learn from users and adapt.

That is the real craft of AI engineering.

Tags: Artificial Intelligence, Generative AI, Agentic AI, Technology, Book Summary

Inference Optimization: How AI Models Become Faster, Cheaper, and Actually Useful (Chapter 9 - AI Engineering - By Chip Huyen)



Why “running” an AI model well is just as hard as building it


Introduction: Why Inference Matters More Than You Think

In the AI world, we spend a lot of time talking about training models—bigger models, better architectures, more data, and more GPUs. But once a model is trained, a much more practical question takes over:

How do you run it efficiently for real users?

This is where inference optimization comes in.

Inference is the process of using a trained model to produce outputs for new inputs. In simple terms:

  • Training is learning

  • Inference is doing the work

No matter how brilliant a model is, if it’s too slow, users will abandon it. If it’s too expensive, businesses won’t deploy it. Worse, if a prediction arrives after it has stopped being useful, the model is worthless: imagine a stock prediction that arrives after the market closes.

As the chapter makes clear, inference optimization is about making AI models faster and cheaper without ruining their quality. And it’s not just a single discipline—it sits at the intersection of:

  • Machine learning

  • Systems engineering

  • Hardware architecture

  • Compilers

  • Distributed systems

  • Cloud infrastructure

This blog post explains inference optimization in plain language, without skipping the important details, so you can understand why AI systems behave the way they do, and how teams improve them in production.


1. Understanding Inference: From Model to User

Training vs Inference (A Crucial Distinction)

Every AI model has two distinct life phases:

  1. Training – learning patterns from data

  2. Inference – generating outputs for new inputs

Training is expensive but happens infrequently. Inference happens constantly.

Most AI engineers, application developers, and product teams spend far more time worrying about inference than training, especially if they’re using pretrained or open-source models.


What Is an Inference Service?

In production, inference doesn’t happen in isolation. It’s handled by an inference service, which includes:

  • An inference server that runs the model

  • Routing logic

  • Request handling

  • Possibly preprocessing and postprocessing

Model APIs like OpenAI, Gemini, or Claude are inference services. If you use them, you don’t worry about optimization details. But if you host your own models, you own the entire inference pipeline: performance, cost, scaling, and failures included.


Why Optimization Is About Bottlenecks

Optimization always starts with a question:

What is slowing us down?

Just like traffic congestion in a city, inference systems have chokepoints. Identifying these bottlenecks determines which optimization techniques actually help.


2. Compute-Bound vs Memory-Bound: The Core Bottleneck Concept

Two Fundamental Bottlenecks

Inference workloads usually fall into one of two categories:

Compute-bound

  • Speed limited by how many calculations the hardware can perform

  • Example: heavy mathematical computation

  • Faster chips or more parallelism help

Memory bandwidth-bound

  • Speed limited by how fast data can be moved

  • Common in large models where weights must be repeatedly loaded

  • Faster memory and smaller models help

This distinction is foundational to understanding inference optimization.


Why Language Models Are Often Memory-Bound

Large language models generate text one token at a time. For each token:

  • The model must load large weight matrices

  • Perform relatively little computation per byte loaded

This makes decoding (token generation) memory bandwidth-bound.
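A back-of-the-envelope calculation shows why bandwidth, not compute, sets the ceiling. The sketch assumes batch size 1 and that every weight is read from memory once per generated token; it ignores KV-cache reads and other overheads. The model size and bandwidth figures are hypothetical round numbers, not measurements.

```python
def max_decode_tokens_per_second(num_params, bytes_per_param, memory_bandwidth_bytes_per_s):
    """Upper bound for batch size 1: every weight is read once per generated token."""
    weight_bytes = num_params * bytes_per_param
    return memory_bandwidth_bytes_per_s / weight_bytes

# Hypothetical numbers: a 7B-parameter model in FP16 (2 bytes) with 2 TB/s of bandwidth.
bound = max_decode_tokens_per_second(7e9, 2, 2e12)
print(round(bound))  # about 143 tokens/s, no matter how much compute is available
```

Halving the bytes per parameter doubles the bound, which is one reason quantization helps decoding so much.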


Prefill vs Decode: Two Very Different Phases

Transformer-based language model inference has two stages:

  1. Prefill

    • Processes input tokens in parallel

    • Compute-bound

    • Determines how fast the model “understands” the prompt

  2. Decode

    • Generates one output token at a time

    • Memory bandwidth-bound

    • Determines how fast the response appears

Because these phases behave differently, modern inference systems often separate them across machines for better efficiency.


3. Online vs Batch Inference: Latency vs Cost

Two Types of Inference APIs

Most providers offer:

  • Online APIs – optimized for low latency

  • Batch APIs – optimized for cost and throughput


Online Inference

Used for:

  • Chatbots

  • Code assistants

  • Real-time interactions

Characteristics:

  • Low latency

  • Users expect instant feedback

  • Limited batching

Streaming responses (token-by-token) reduce perceived waiting time but come with risks—users might see bad outputs before they can be filtered.


Batch Inference

Used for:

  • Synthetic data generation

  • Periodic reports

  • Document processing

  • Data migration

Characteristics:

  • Higher latency allowed

  • Aggressive batching

  • Much lower cost (often ~50% cheaper)

Unlike traditional ML, batch inference for foundation models can’t precompute everything because user prompts are open-ended.


4. Measuring Inference Performance: Metrics That Actually Matter

Latency Is Not One Number

Latency is best understood as multiple metrics:

Time to First Token (TTFT)

  • How long users wait before seeing anything

  • Tied to prefill

  • Critical for chat interfaces

Time Per Output Token (TPOT)

  • Speed of token generation after the first token

  • Determines how fast long responses feel

Two systems with the same total latency can feel very different depending on TTFT and TPOT trade-offs.
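Both metrics fall out of per-token timestamps. The helper below is a sketch, with `latency_metrics` and the timing values invented for illustration; it shows two responses with the same total latency and very different TTFT and TPOT.

```python
def latency_metrics(request_time, token_times):
    """Compute TTFT and TPOT from a request timestamp and per-token arrival times (seconds)."""
    ttft = token_times[0] - request_time
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot

# Two hypothetical 5-token responses, both finishing 1.0 s after the request.
fast_start = latency_metrics(0.0, [0.2, 0.4, 0.6, 0.8, 1.0])
slow_start = latency_metrics(0.0, [0.8, 0.85, 0.9, 0.95, 1.0])
print(fast_start)   # short wait, then slow steady output
print(slow_start)   # long wait, then a quick burst
```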


Percentiles, Not Averages

Latency is a distribution.

A single slow request can ruin the average. That’s why teams look at:

  • p50 (median)

  • p90

  • p95

  • p99

Outliers often signal:

  • Network issues

  • Oversized prompts

  • Resource contention
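A tiny example makes the point about averages. The sketch uses the nearest-rank definition of a percentile (one of several in common use), and the latency values are made up: nine typical requests and one outlier.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: nine typical requests and one outlier.
latencies = [100, 110, 105, 98, 102, 107, 101, 99, 103, 3000]

print("mean", sum(latencies) / len(latencies))   # dragged up by the single outlier
for p in (50, 90, 99):
    print(f"p{p}", percentile(latencies, p))
```

The mean lands near 400 ms even though nine out of ten requests finish in about 100 ms; the median tells the typical story and p99 exposes the outlier.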


Throughput and Cost

Throughput measures how much work the system does:

  • Tokens per second (TPS)

  • Requests per minute (RPM)

Higher throughput usually means lower cost—but pushing throughput too hard can destroy user experience.


Goodput: Throughput That Actually Counts

Goodput measures how many requests meet your service-level objectives (SLOs).

If your system completes 100 requests/minute but only 30 meet latency targets, your goodput is 30, not 100.

This metric prevents teams from optimizing the wrong thing.
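Counting goodput is a filter over completed requests. In this sketch the SLO thresholds, the request records, and the `goodput` function are all hypothetical; it assumes the SLO is defined on TTFT and TPOT together.

```python
def goodput(requests, ttft_slo, tpot_slo):
    """Count requests that meet both latency objectives."""
    return sum(1 for r in requests if r["ttft"] <= ttft_slo and r["tpot"] <= tpot_slo)

# Hypothetical one-minute window; times in seconds.
window = [
    {"ttft": 0.15, "tpot": 0.04},
    {"ttft": 0.90, "tpot": 0.03},   # misses the TTFT target
    {"ttft": 0.18, "tpot": 0.12},   # misses the TPOT target
    {"ttft": 0.19, "tpot": 0.05},
]
print("throughput:", len(window), "goodput:", goodput(window, ttft_slo=0.2, tpot_slo=0.1))
```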


5. Hardware: Why GPUs, Memory, and Power Dominate Costs

What Is an AI Accelerator?

An accelerator is specialized hardware designed for specific workloads.

For AI, the dominant accelerator is the GPU, designed for massive parallelism—perfect for matrix multiplication, which dominates neural network workloads.


Why GPUs Beat CPUs for AI

  • CPUs: few powerful cores, good for sequential logic

  • GPUs: thousands of small cores, great for parallel math

More than 90% of neural network computation boils down to matrix multiplication, which GPUs excel at.


Memory Hierarchy Matters More Than You Think

Modern accelerators use multiple memory layers:

  • CPU DRAM (slow, large)

  • GPU HBM (fast, smaller)

  • On-chip SRAM (extremely fast, tiny)

Inference optimization is often about moving less data across this hierarchy, and moving it more intelligently.


Power Is a Hidden Bottleneck

High-end GPUs consume enormous energy:

  • An H100 running continuously can use ~7,000 kWh/year

  • Comparable to a household’s annual electricity use

This makes power, and the cooling that goes with it, a real constraint on AI scaling.
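A back-of-the-envelope check of the yearly figure. It assumes a constant draw of 700 W, the commonly cited maximum power of the SXM variant of the H100. That number is an assumption for illustration and does not come from this post.

```python
HOURS_PER_YEAR = 24 * 365


def annual_energy_kwh(power_watts, utilization=1.0):
    """Energy used in one year at a constant power draw, in kilowatt-hours."""
    return power_watts / 1000 * HOURS_PER_YEAR * utilization


# Assumed draw: 700 W, running around the clock.
h100_kwh = annual_energy_kwh(700)

print(f"{h100_kwh:.0f} kWh/year")
```

The result lands a little over 6,000 kWh, the same ballpark as the ~7,000 kWh quoted above, and that is before counting any cooling overhead.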


6. Model-Level Optimization: Making Models Lighter and Faster

Model Compression Techniques

Several techniques reduce model size:

Quantization

  • Reduce numerical precision (FP32 → FP16 → INT8 → INT4)

  • Smaller models

  • Faster inference

  • Lower memory bandwidth use

Distillation

  • Train a smaller model to mimic a larger one

  • Often surprisingly effective

Pruning

  • Remove unimportant parameters

  • Creates sparse models

  • Less common due to hardware limitations

Among these, weight-only quantization is the most widely used in practice.
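To give a feel for what quantizing weights means, here is a minimal sketch of symmetric "absmax" INT8 quantization with a single scale for the whole tensor. This is one simple scheme among many. Production systems typically use finer-grained scales (per channel or per group), and the weight values below are invented.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.05, 0.88]  # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q, scale, max_error)
```

Each weight now needs one byte instead of four (FP32) or two (FP16), at the cost of a rounding error of at most half a quantization step.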


The Autoregressive Bottleneck

Generating text one token at a time is:

  • Slow

  • Expensive

  • Memory-bandwidth heavy

Several techniques attempt to overcome this fundamental limitation.


Speculative Decoding

Idea:

  • Use a smaller “draft” model to guess future tokens

  • Have the main model verify them in parallel

If many draft tokens are accepted, the system generates multiple tokens per step—dramatically improving speed without hurting quality.

This technique is now widely supported in modern inference frameworks.
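The draft-then-verify loop can be sketched with toy "models". Everything here is hypothetical: `draft_next` and `target_next` are stand-in functions that return the next token for a token list, and the sketch covers only the greedy case, where a draft token is accepted if it equals the token the main model would have picked. Real systems verify all draft positions in one batched forward pass and use a probabilistic acceptance rule when sampling.

```python
def target_next(tokens):
    """Toy main model: always counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy draft model: agrees with the main model except right after token 6."""
    return 0 if tokens[-1] == 6 else (tokens[-1] + 1) % 10


def speculative_step(prefix, k=4):
    """One draft-and-verify step; returns the tokens accepted in this step."""
    # 1. The draft model proposes k tokens, one after another.
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))

    # 2. The main model checks each position (looped here for clarity).
    accepted = []
    for token in proposed:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # fix the first mismatch and stop
            return accepted
        accepted.append(token)

    # 3. Every draft token matched, so the main model adds one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix, n_tokens, k=4):
    """Generate n_tokens with speculative steps; also report the step count."""
    out, steps = list(prefix), 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, k))
        steps += 1
    return out[len(prefix):len(prefix) + n_tokens], steps


def greedy(prefix, n_tokens):
    """Baseline: the main model alone, one token per step."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prefix):]


tokens, steps = generate([3], 8)
print(tokens, steps, tokens == greedy([3], 8))
```

In this toy run the output is identical to plain greedy decoding, but it takes 2 verification steps instead of 8 sequential ones. The saving depends entirely on how often the draft model guesses right.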


Inference with Reference

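When the output is likely to repeat spans that already appear in the input (for example, quoted context), the system can copy those tokens from the input and have the model verify them, rather than generating them one at a time.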

This works especially well for:

  • Document Q&A

  • Code editing

  • Multi-turn conversations

It avoids redundant computation and speeds up generation.


Parallel Decoding

Some techniques try to generate multiple future tokens simultaneously.

Examples:

  • Lookahead decoding

  • Medusa

These approaches are promising but complex, requiring careful verification to ensure coherence.


7. Attention Optimization: Taming the KV Cache Explosion

Why Attention Is Expensive

Each new token attends to all previous tokens.

Without optimization:

  • Computation grows quadratically with sequence length

  • KV cache memory grows linearly with sequence length, but is still huge

For large models and long contexts, the KV cache can exceed the size of the model itself.
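A rough size estimate shows how this happens. The formula counts one key and one value vector per token, per layer, per KV head. The model dimensions below are hypothetical, roughly the shape of a 7B-parameter model served in FP16, and the batch size and context length are assumptions too.

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size: keys and values (factor 2) for every token and layer."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical serving setup: 32 concurrent sequences of 4,096 tokens each.
cache = kv_cache_bytes(batch=32, seq_len=4096, layers=32, kv_heads=32, head_dim=128)

# Approximate weight memory for 7 billion parameters at 2 bytes each.
weights = 7_000_000_000 * 2

# The same setup with only 8 KV heads, as in grouped-query style designs.
cache_fewer_heads = kv_cache_bytes(batch=32, seq_len=4096, layers=32, kv_heads=8, head_dim=128)

print(f"cache: {cache / 2**30:.0f} GiB, weights: {weights / 2**30:.1f} GiB")
print(f"cache with 8 KV heads: {cache_fewer_heads / 2**30:.0f} GiB")
```

Under these assumptions the cache is several times larger than the weights, and cutting the number of KV heads shrinks it proportionally.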


KV Cache Optimization Techniques

Three broad strategies exist:

Redesigning Attention

  • Local window attention

  • Multi-query attention

  • Grouped-query attention

  • Cross-layer attention

These reduce how much data must be stored and reused.


Optimizing KV Cache Storage

Frameworks like vLLM introduced:

  • PagedAttention

  • Flexible memory allocation

  • Reduced fragmentation

Other approaches:

  • KV cache quantization

  • Adaptive compression

  • Selective caching
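The paging idea can be illustrated with a toy allocator. This is not vLLM's implementation, only a sketch of the general principle: memory is split into fixed-size blocks, and each sequence keeps a table mapping its logical blocks to whatever physical blocks happen to be free. The class name and all sizes are made up.

```python
class PagedKVCache:
    """Toy block allocator: blocks are handed out on demand, not reserved up front."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    cache.append_token("b")  # 5 tokens need 1 block

blocks_a = len(cache.block_tables["a"])
blocks_b = len(cache.block_tables["b"])
free_before = len(cache.free_blocks)
cache.free("a")
free_after = len(cache.free_blocks)

print(blocks_a, blocks_b, free_before, free_after)
```

Because a sequence only ever wastes part of its last block, short and long requests can share the same pool without large unused reservations.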


Writing Better Kernels

Instead of changing algorithms, optimize how computations are executed on hardware.

The most famous example is FlashAttention, which:

  • Fuses multiple operations

  • Minimizes memory access

  • Delivers huge speedups in practice
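One ingredient behind this kind of kernel is computing softmax in tiles while keeping a running maximum and a running sum, so the full row of attention scores never has to be held at once. The sketch below shows only that arithmetic trick on the CPU, with scalar values and invented numbers. It is not a GPU kernel and not FlashAttention itself.

```python
import math


def attention_reference(scores, values):
    """Standard softmax-weighted sum: needs every score before it can start."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_tiled(scores, values, tile=2):
    """Same result, processed tile by tile with running statistics."""
    running_max = float("-inf")
    running_sum = 0.0
    running_out = 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale earlier partial results to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.5, 2.0, -1.0, 3.5, 0.0]  # hypothetical attention scores for one query
values = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical scalar values

full = attention_reference(scores, values)
tiled = attention_tiled(scores, values, tile=2)
print(full, tiled)
```

The two results agree up to floating-point rounding, which is what lets a kernel work on small tiles that fit in fast on-chip memory.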


8. Service-Level Optimization: Making the Whole System Work

Batching: The Simplest Cost Saver

Batching combines multiple requests:

  • Improves throughput

  • Reduces cost

  • Increases latency

Types:

  • Static batching

  • Dynamic batching

  • Continuous batching

The trick is batching without hurting latency too much.
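One way to picture that trade-off is a dynamic batcher that closes a batch when it is full or when its oldest request has waited too long. The function below is a simplified sketch over a list of arrival times. The parameter names and numbers are hypothetical, and a real server would do this on a live queue.

```python
def dynamic_batches(arrival_times_ms, max_batch_size, max_wait_ms):
    """Group requests into batches, bounded by both size and waiting time."""
    batches, current = [], []
    for t in sorted(arrival_times_ms):
        batch_full = len(current) == max_batch_size
        waited_too_long = bool(current) and t - current[0] > max_wait_ms
        if current and (batch_full or waited_too_long):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times in milliseconds: a trickle, a burst, then a straggler.
arrivals = [0, 2, 3, 50, 51, 52, 53, 54, 200]
batches = dynamic_batches(arrivals, max_batch_size=4, max_wait_ms=10)

print(batches)  # [[0, 2, 3], [50, 51, 52, 53], [54], [200]]
```

A larger `max_wait_ms` produces fuller batches and better throughput, but every request in a batch may wait that much longer before work on it begins.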


Compilers and Kernels

Modern inference relies heavily on:

  • Compilers (torch.compile, XLA, TensorRT)

  • Hardware-specific kernels

These translate high-level model code into highly optimized machine instructions.

Many companies treat their kernels as trade secrets because they directly translate into cost advantages.


Conclusion: Inference Optimization Is the Real Production Skill

Training models gets headlines. Inference optimization keeps businesses alive.

Inference optimization:

  • Determines user experience

  • Dominates long-term cost

  • Requires cross-disciplinary thinking

  • Is where real-world AI engineering happens

As models become commoditized, efficient inference becomes a competitive moat.

The future of AI won’t be decided by who trains the biggest model—but by who can run models fastest, cheapest, and most reliably.


Addendum

My prompt:
Role: You an expert in AI Engineering and a prolific writer. 
Task: Spin this attached chapter 9 as a blog post in layman terms 
Rules: Organize the post in 7-8 sections with subsections as needed 
Note: Blog post should be about 6000 to 7000 words long
Note: Try not to miss any important section or details
Tags: Artificial Intelligence, Generative AI, Agentic AI, Technology, Book Summary