Why data—not models—is the real competitive advantage in modern AI
Introduction: Why Data Has Quietly Become the Most Important Part of AI
If you ask most people how modern AI systems like ChatGPT, Claude, or Gemini became so powerful, they’ll say things like “bigger models,” “better architectures,” or “more GPUs.” All of that matters—but it’s no longer the whole story.
Over the last few years, something subtle but profound has happened in AI engineering: data has moved from being a background concern to becoming the central pillar of performance.
This shift is visible even inside OpenAI itself. When GPT-3 was released, only a handful of people were credited for data work. By the time GPT-4 arrived, dozens of engineers and researchers were working full-time on data collection, filtering, deduplication, formatting, and evaluation. Even seemingly simple things, like defining a chat message format, required deep thought and many contributors.
Why? Because once models become “good enough,” the difference between an average AI system and a great one comes down to data.
This blog post is about dataset engineering—the art and science of creating, curating, synthesizing, and validating data so that AI models actually learn the behaviors we want. We’ll walk through this topic in plain language, without losing technical substance.
1. From Model-Centric AI to Data-Centric AI
The Old World: Model-Centric Thinking
For a long time, AI research followed a simple formula:
“Here’s a dataset. Who can train the best model on it?”
Benchmarks like ImageNet worked this way. Everyone used the same data, and progress came from better architectures, training tricks, and scaling compute.
This model-centric mindset assumes data is fixed and models are the main lever.
The New World: Data-Centric AI
Today, the thinking has flipped.
In data-centric AI, the model is often fixed—or at least comparable—and the real question becomes:
“How do we improve the data so the model performs better?”
Competitions like Andrew Ng’s data-centric AI challenge and benchmarks like DataComp reflect this shift. Instead of competing on architectures, teams compete on how well they clean, diversify, and structure datasets for the same base model.
The key realization is simple but powerful:
A mediocre model trained on excellent data often beats a great model trained on messy data.
Why This Shift Matters for Application Builders
Very few companies today can afford to train foundation models from scratch. But any serious team can invest in better data.
This is why dataset engineering has become a strategic differentiator—and why entire roles now exist for:
Data annotators
Dataset creators
Data quality engineers
2. Data Curation: Deciding What Data Your Model Should Learn From
What Is Data Curation?
Data curation is the process of deciding what data to include, what to exclude, and how to shape it so a model learns the right behaviors.
This is not just data cleaning. It’s closer to curriculum design.
Different training stages require different kinds of data:
Pre-training cares about tokens
Supervised finetuning cares about examples
Preference training cares about comparisons
And while this post focuses mostly on post-training data (relevant for application developers), the principles apply everywhere.
Teaching Models Complex Behaviors Requires Specialized Data
Some model capabilities are particularly hard to teach unless the data is explicitly designed for them.
Chain-of-Thought Reasoning
If you want a model to reason step by step, you must show it step-by-step reasoning during training.
Research shows that including chain-of-thought (CoT) examples in finetuning data can nearly double accuracy on reasoning tasks. But these datasets are rare because writing detailed reasoning is time-consuming and cognitively demanding for humans.
This is why many datasets contain final answers, but not the reasoning that leads to them.
Tool Use
Teaching models to use tools is even trickier.
Humans and AI agents don’t work the same way. Humans click buttons and copy-paste. AI agents prefer APIs and structured calls. If you train models only on human behavior, you often teach them inefficient workflows.
This is why:
Observing real human workflows matters
Simulations and synthetic data become valuable
Special message formats are needed for multi-step tool use (as seen in Llama 3)
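To make the last point concrete, here is what a multi-step tool-use training example might look like, written as Python dicts. This is an illustrative sketch only: the role names and fields are assumptions, not Llama 3's (or any model's) actual schema.

```python
# Illustrative only: a multi-step tool-use training example as Python dicts.
# Role names and fields are assumptions, not any model's real message format.
tool_use_example = [
    {"role": "user", "content": "What's the weather in Paris right now?"},
    {"role": "assistant", "tool_call": {"name": "get_weather", "arguments": {"city": "Paris"}}},
    {"role": "tool", "name": "get_weather", "content": '{"temp_c": 18, "condition": "cloudy"}'},
    {"role": "assistant", "content": "It's currently 18°C and cloudy in Paris."},
]
```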
Single-Turn vs Multi-Turn Data
Another crucial decision is whether your model needs single-turn data, multi-turn data, or both.
Single-turn data is easier to collect. Multi-turn data is closer to reality, but it is much harder to design and annotate well.
Most real applications need both.
Data Curation Also Means Removing Bad Data
Curation isn’t just about adding data—it’s also about removing data that teaches bad habits.
If your chatbot is annoyingly verbose or gives unsolicited advice, chances are those behaviors were reinforced by training examples. Fixing this often means:
Removing problematic examples
Adding new examples that demonstrate the desired behavior
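A crude but useful first pass at removing bad data is a heuristic filter over your finetuning set. A minimal sketch, assuming each example is a dict with a "response" field; the markers and thresholds are arbitrary assumptions you would tune for your own data.

```python
# Minimal sketch: drop training examples that reinforce verbosity or
# unsolicited advice. Field names, markers, and thresholds are illustrative.
UNSOLICITED_MARKERS = ("you should also", "as an ai", "it's worth noting")

def keep_example(example: dict, max_words: int = 300) -> bool:
    response = example["response"].lower()
    if len(response.split()) > max_words:                 # overly verbose
        return False
    if any(marker in response for marker in UNSOLICITED_MARKERS):
        return False
    return True

dataset = [...]  # your (prompt, response) examples
curated = [ex for ex in dataset if keep_example(ex)]
```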
3. The Golden Trio: Data Quality, Coverage, and Quantity
A useful way to think about training data is like cooking.
Quality = fresh ingredients
Coverage = balanced recipe
Quantity = enough food to feed everyone
All three matter, but not equally at all times.
Data Quality: Why “Less Is More” Often Wins
High-quality data consistently beats large amounts of noisy data.
Studies show that:
1,000 carefully curated examples can rival much larger datasets
Small, clean instruction sets can rival state-of-the-art models in preference judgments
Data Coverage: Matching the Real World
Coverage means exposing the model to the full range of situations it will face.
This includes:
Different instruction styles (short, long, sloppy)
Typos and informal language
Different domains and topics
Multiple output formats
Research consistently shows that datasets that are both high-quality and diverse outperform datasets that are only one or the other.
Interestingly, adding too much heterogeneous data can sometimes hurt performance. Diversity must be intentional, not random.
Data Quantity: How Much Is Enough?
“How much data do I need?” has no universal answer.
A few key insights:
Models can learn from surprisingly small datasets
Performance gains usually show diminishing returns
More data helps only if it adds new information or diversity
If you have:
Little data → use PEFT methods on strong base models
Lots of data → consider full finetuning on smaller models
Before investing heavily, it’s wise to run small experiments (50–100 examples) to see if finetuning helps at all.
4. Data Acquisition: Where Good Training Data Comes From
Your Best Data Source Is Usually… Your Users
The most valuable data often comes from your own application:
User queries
Feedback
Corrections
Usage patterns
This data is:
Perfectly aligned with your task
Reflective of real-world distributions
Extremely hard to replace with public datasets
This is the basis of the famous “data flywheel”—where usage improves the model, which improves the product, which attracts more usage.
Public and Proprietary Datasets
Before building everything from scratch, it’s worth exploring:
Open datasets (Hugging Face, Kaggle, government portals)
Academic repositories
Cloud provider datasets
But never blindly trust them. Licenses, provenance, and hidden biases must be carefully checked.
Annotation: The Hardest Part Nobody Likes Talking About
Annotation is painful—not because labeling is hard, but because defining what “good” looks like is hard.
Good annotation requires:
Clear guidelines
Consistency across annotators
Iterative refinement
Frequent rework
Many teams abandon careful annotation halfway, hoping the model will “figure it out.” Sometimes it does, but relying on that is risky for production systems.
5. Data Augmentation and Synthetic Data: Creating Data at Scale
Augmentation vs Synthesis
Data augmentation transforms real data (e.g., flipping images, rephrasing text)
Synthetic data creates entirely new data that mimics real data
In practice, the line between them is blurry.
Why Synthetic Data Is So Attractive
Synthetic data can:
Increase quantity when real data is scarce
Improve coverage of rare or dangerous scenarios
Reduce privacy risks
Enable model distillation (training cheaper models to imitate expensive ones)
AI models are now good enough to generate:
Documents
Conversations
Code
Medical notes
Contracts
In many cases, mixing synthetic and human data works best.
Traditional Techniques Still Matter
Before modern LLMs, industries used:
Rule-based templates
Procedural generation
Simulations
These methods are still incredibly useful, especially for:
Fraud detection
Robotics
Self-driving cars
Tool-use data
Simulations allow you to explore scenarios that are rare, dangerous, or expensive in the real world.
6. AI-Powered Data Synthesis: Models Generating Their Own Training Data
Self-Play and Role Simulation
AI models can:
Play games against themselves
Simulate negotiations
Act as both customer and support agent
This generates massive volumes of data at low cost and has proven extremely effective in games and agent training.
Instruction Data Synthesis
Modern instruction datasets like Alpaca and UltraChat were largely generated by:
Starting with a small set of seed examples
Using a strong model to generate thousands more
Filtering aggressively
A powerful trick is reverse instruction:
Start with high-quality content
Ask AI to generate prompts that would produce it
This avoids hallucinations in long responses and improves data quality.
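Here is a minimal sketch of reverse instruction. The llm() argument is a hypothetical helper standing in for whatever model API you use; the prompt wording is an assumption, not a canonical recipe.

```python
# Minimal sketch of "reverse instruction": start from trusted, high-quality
# documents and ask a model to write the prompt that would have produced them.
# llm is a hypothetical callable standing in for your model API of choice.
def reverse_instruct(documents: list[str], llm) -> list[dict]:
    pairs = []
    for doc in documents:
        instruction = llm(
            "Write a single user instruction that this text would be an "
            "excellent answer to. Return only the instruction.\n\n" + doc
        )
        pairs.append({"instruction": instruction.strip(), "response": doc})
    return pairs
```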
The Llama 3 Case Study: Synthetic Data Done Right
Llama 3’s training pipeline used:
Code generation
Code translation
Back-translation
Unit tests
Automated error correction
Only examples that passed all checks were kept. This produced millions of verified synthetic examples and shows what industrial-grade dataset engineering looks like.
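The keep-only-what-passes idea is easy to prototype. A minimal sketch, not Meta's actual pipeline: each synthetic code sample is executed together with its generated unit tests in a subprocess, and only samples whose tests pass are retained.

```python
# Sketch of "keep only what passes": run each synthetic code sample against its
# generated unit tests and keep the ones that succeed. Illustration only,
# not Llama 3's actual pipeline.
import subprocess
import tempfile

def passes_tests(code: str, tests: str, timeout: int = 10) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

synthetic_examples = [...]  # each item: {"code": "...", "tests": "..."} from the generator
verified = [ex for ex in synthetic_examples if passes_tests(ex["code"], ex["tests"])]
```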
7. Verifying Data and the Limits of Synthetic Data
Data Verification Is Non-Negotiable
Synthetic data must be evaluated just like model outputs:
Functional correctness (e.g., code execution)
AI judges
Heuristics
Anomaly detection
If data can’t be verified, it can’t be trusted.
The Limits of AI-Generated Data
Synthetic data is powerful—but not magical.
Four major limitations remain:
Quality control is hard
Imitation can be superficial
Recursive training can cause model collapse
Synthetic data obscures provenance
Research shows that training entirely on synthetic data leads to degraded models over time. Mixing real and synthetic data is essential to avoid collapse.
Conclusion: Dataset Engineering Is the Real Craft of AI
Modern AI success is no longer just about models or compute.
It’s about:
Designing the right curriculum
Curating high-quality examples
Covering the real-world edge cases
Verifying everything
Iterating relentlessly
Dataset engineering is slow, unglamorous, and full of toil—but it’s also the strongest moat an AI team can build.
As models become commodities, data craftsmanship is where differentiation lives.
Finetuning Large Language Models: From Foundations to Adaptation
Section 1 – Introduction & The Evolution of Finetuning
Introduction: Why Finetuning Matters in the Age of Foundation Models
In the past few years, the way we build artificial intelligence (AI) applications has changed dramatically. At the center of this transformation are foundation models—large language models (LLMs) like GPT-4, Claude, Llama 3, and Mistral. These models come pre-trained on massive amounts of text data, enabling them to perform a wide variety of tasks right out of the box: answering questions, generating code, summarizing documents, and even creating poetry.
But while these models are incredibly powerful, they are not perfect for every situation. Imagine using a general-purpose LLM to handle specialized legal queries, medical diagnostics, or customer service conversations in a very specific tone of voice. The model may be good enough, but it may not fully align with your domain, data, or desired output format.
This is where finetuning comes into play.
Finetuning is the process of adapting a pre-trained model by training it further—either on your specific data, or with modifications that make it better suited for a specialized task. Unlike prompt engineering, which adjusts the input you feed into a model, finetuning adjusts the weights inside the model itself. The difference is subtle but significant: finetuning rewires how the model thinks, while prompting merely nudges it in the right direction.
To borrow an analogy:
Prompting is like instructing a professional chef to cook a new dish by giving detailed step-by-step instructions.
Finetuning is like training the chef to permanently learn that dish so they can cook it without explicit reminders in the future.
Finetuning is not just about better outputs—it’s about efficiency, alignment, and control. It allows developers and organizations to:
Improve instruction following (e.g., ensuring outputs always come in JSON).
Reduce biases by retraining on curated datasets.
Enhance domain-specific performance, such as financial modeling or SQL generation.
Create smaller, cheaper models that outperform larger ones on targeted tasks.
Of course, finetuning comes at a cost. It requires high-quality data, computational resources, and expertise in machine learning (ML). It also raises questions: When should you finetune? When is Retrieval-Augmented Generation (RAG) enough? What are the trade-offs?
This blog series explores these questions in depth. In this first section, we’ll walk through the evolution of finetuning—from its early days to today’s parameter-efficient techniques.
The Roots of Finetuning: Transfer Learning
The story of finetuning begins with transfer learning, a concept first introduced by Bozinovski and Fulgosi in 1976. Transfer learning is the idea that knowledge gained in one task can be transferred to accelerate learning in another. Humans do this all the time—if you know how to play the piano, learning the guitar is easier because you already understand rhythm, scales, and coordination.
In machine learning, transfer learning became popular in computer vision. Models trained on ImageNet, a massive image dataset, learned general features like edges, textures, and shapes. These features could then be reused for new tasks like detecting tumors in X-rays or recognizing traffic signs. Instead of training a new model from scratch, developers fine-tuned the ImageNet model for their specific problem.
The same principle applies to LLMs. A language model trained on billions of words already “knows” grammar, facts, and reasoning patterns. Instead of starting from scratch, developers finetune it for specialized tasks like legal document analysis, financial forecasting, or code generation.
A famous early success was Google’s multilingual translation system (Johnson et al., 2016). The system could translate Portuguese ↔ English and English ↔ Spanish. Surprisingly, without explicit examples, it learned to translate Portuguese ↔ Spanish—a task it was never directly trained on. That was transfer learning in action.
Finetuning in the Era of Large Language Models
When it comes to LLMs, finetuning builds on the pre-training phase. Pre-training is typically done in a self-supervised manner, meaning the model learns by predicting the next word in massive text datasets. This equips the model with broad knowledge but not necessarily the ability to follow instructions well or produce outputs in a specific style.
Here’s where post-training and finetuning enter:
Supervised Finetuning (SFT) – The model is trained on pairs of (instruction, response). For example:
Instruction: “Summarize this contract in simple terms.”
Response: “This contract states that the employee will…”
SFT improves alignment with human expectations.
Preference Finetuning (RLHF) – The model learns to prefer outputs that humans like, usually through reinforcement learning with human feedback. This makes responses more helpful, harmless, and honest.
Continued Pre-Training – The model is further trained on large amounts of raw domain-specific text (e.g., legal documents or medical literature) before fine-grained supervised tuning.
Each of these forms of finetuning is an extension of pre-training. The difference is in how much data you use, what kind of data, and which parameters you update.
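To make the (instruction, response) idea concrete, here is what a tiny SFT dataset might look like written out as JSON Lines, one example per line. The exact field names vary by framework; these are illustrative.

```python
# A tiny SFT dataset written as JSON Lines: one (instruction, response) pair
# per line. Field names vary by framework; these are illustrative.
import json

sft_examples = [
    {"instruction": "Summarize this contract in simple terms.",
     "response": "This contract states that the employee will..."},
    {"instruction": "Translate to French: Good morning.",
     "response": "Bonjour."},
]

with open("sft_data.jsonl", "w") as f:
    for ex in sft_examples:
        f.write(json.dumps(ex) + "\n")
```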
Why Finetuning?
Let’s consider some real-world scenarios:
Structured Outputs: Suppose you need a model to always respond in strict JSON format for downstream automation. Prompting can work, but it’s fragile. Finetuning ensures the model internalizes this requirement.
Domain-Specific Jargon: A healthcare chatbot may need to understand abbreviations like “BP” (blood pressure) or “HbA1c” (a diabetes measure). General models may stumble; finetuning on medical data fixes this.
Bias Mitigation: If a model frequently associates “CEO” with male names, finetuning on curated datasets that include female CEOs can rebalance the bias.
Smaller, Faster Models: Grammarly reported that their finetuned Flan-T5 models (much smaller than GPT-3) outperformed GPT-3 on text editing tasks. This made their models cheaper and faster in production.
In short, finetuning helps bridge the gap between a model’s general intelligence and the specific intelligence you need.
The Shift from Full to Efficient Finetuning
In the early days, finetuning meant updating all the parameters of a model. This is called full finetuning. For small models, this was feasible. But as models grew to billions of parameters, full finetuning became impractical:
A 7B parameter model in 16-bit precision requires ~14 GB just to store the weights.
To finetune it with optimizers like Adam, you may need ~56 GB or more.
This exceeds the memory of most consumer GPUs.
To solve this, researchers moved toward partial finetuning: instead of updating the entire model, they updated only parts—often the final layers. While this reduced memory requirements, it was inefficient: performance dropped compared to full finetuning.
The breakthrough came with Parameter-Efficient Finetuning (PEFT). Instead of touching billions of parameters, PEFT techniques add or modify only a small fraction of the model while freezing the rest. The result: near-full performance at a fraction of the cost.
PEFT methods fall broadly into two categories:
Adapter-based methods: Insert small modules (adapters) into the model and train only them.
Soft prompt–based methods: Add trainable embeddings (soft prompts) to guide the model.
One adapter-based technique, LoRA (Low-Rank Adaptation), has become especially dominant. LoRA allows developers to finetune massive models using only a few percent of the parameters, without hurting inference speed.
Looking Ahead
We’ve set the stage:
Finetuning began as transfer learning.
It evolved into full and partial finetuning.
The need for scalability gave rise to PEFT methods like LoRA.
In the next section, we’ll explore why finetuning is so memory-intensive, breaking down the concepts of trainable parameters, backpropagation, and numerical representations. Understanding these bottlenecks is key to appreciating why PEFT matters so much.
Finetuning Large Language Models: From Foundations to Adaptation
Section 2 – Memory Bottlenecks Explained
Why Memory Is the Achilles’ Heel of Finetuning
When developers first attempt to finetune a large language model (LLM), one of the most common errors they see is:
RuntimeError: CUDA out of memory
This isn’t just a beginner’s mistake—it’s a fundamental reality of working with models that contain billions of parameters. Memory is the biggest bottleneck in finetuning, and unless you understand where that memory goes, you’ll quickly hit a wall.
At inference time (when the model is simply generating outputs), memory requirements are high but manageable. Finetuning, however, is far more demanding. That’s because finetuning doesn’t just run the model—it also needs to update its parameters, store gradients, and keep optimizer states. All of this adds up.
In this section, we’ll unpack the sources of memory consumption in finetuning, run through some back-of-the-envelope calculations, and explore techniques like quantization and gradient checkpointing that help mitigate the problem.
Parameters, Trainable Parameters, and Frozen Layers
Every model is made up of parameters—the numerical weights that determine how it processes input. For example:
GPT-3 has 175 billion parameters.
Llama 2 has versions ranging from 7 billion to 70 billion parameters.
Even a “small” model like Mistral 7B requires roughly 14 GB in 16-bit precision just to store.
When you perform inference, the model loads all its parameters into memory but does not modify them. That’s relatively simple.
When you perform finetuning, however, you allow some or all of those parameters to be updated. The parameters that can change are called trainable parameters.
Full finetuning: All parameters are trainable.
Partial finetuning: Some parameters are trainable, the rest are frozen.
PEFT methods (e.g., LoRA): Only a tiny fraction of parameters are added or modified.
The more trainable parameters you have, the higher the memory footprint.
Backpropagation: The Memory Multiplier
The reason finetuning consumes so much more memory than inference comes down to the training process, specifically backpropagation.
Training happens in two stages:
Forward pass – The model computes an output given an input. (This is the same as inference.)
Backward pass – The model computes how wrong its output was compared to the expected result (the loss) and adjusts its weights accordingly.
During the backward pass, each trainable parameter requires extra memory for:
Its gradient (how much that parameter contributed to the error).
Its optimizer states (values used by optimizers like Adam to update weights efficiently).
For example, with the Adam optimizer:
Each trainable parameter requires 3 extra values: 1 gradient + 2 optimizer states.
If you store each value in 2 bytes (16-bit precision), that’s an extra 6 bytes per parameter.
So for a 13B-parameter model, updating all parameters means:
13B parameters × 6 bytes = 78 GB
And that’s just for gradients and optimizer states! Add in the original weights (~26 GB in 16-bit precision) plus activations, and you’re well beyond the capacity of a single GPU.
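The arithmetic is simple enough to script. A back-of-the-envelope estimator for the numbers above; it deliberately ignores activations (which depend on batch size and sequence length) and assumes 16-bit storage for weights, gradients, and Adam's two optimizer states.

```python
# Back-of-the-envelope training memory estimate (activations excluded).
# Assumes 2-byte (16-bit) storage for weights, gradients, and both Adam states.
def training_memory_gb(n_params: float, bytes_per_value: int = 2) -> dict:
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    optimizer_states = n_params * bytes_per_value * 2   # Adam keeps 2 states per param
    to_gb = 1e9
    return {
        "weights_gb": weights / to_gb,
        "gradients_and_optimizer_gb": (gradients + optimizer_states) / to_gb,
        "total_gb": (weights + gradients + optimizer_states) / to_gb,
    }

print(training_memory_gb(13e9))  # ~26 GB weights + ~78 GB grads/optimizer ≈ 104 GB
```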
Activations: The Hidden Memory Hog
Parameters aren’t the only concern. Neural networks also generate activations—the intermediate values produced as data flows through the layers.
During inference, activations are ephemeral: once the output is generated, they can be discarded.
During training, however, activations must be stored so they can be reused in the backward pass to calculate gradients. This means activations can dwarf the size of the model’s weights.
Research from NVIDIA (Korthikanti et al., 2022) showed that for large transformer models, activation memory often exceeds weight memory.
To cope with this, developers use gradient checkpointing (activation recomputation). Instead of storing all activations, the model recomputes them when needed. This reduces memory at the cost of slower training.
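In many libraries this is a one-liner. A minimal sketch assuming a Hugging Face Transformers causal LM (the checkpoint name is just a placeholder); PyTorch's torch.utils.checkpoint offers the same trade-off at a lower level.

```python
# Minimal sketch: enable activation recomputation on a Hugging Face model.
# Trades extra forward-pass compute for a much smaller activation footprint.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder: any causal LM
model.gradient_checkpointing_enable()  # recompute activations during the backward pass
model.train()
```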
Memory Math: A Simple Formula
To get a rough estimate of how much memory finetuning requires, you can use this formula:
Training memory ≈ Weights + Activations + Gradients + Optimizer states
That extra overhead for gradients and optimizer states, before activations are even counted, is what makes finetuning several times more demanding than inference.
This difference is why we see specialized chips for inference (optimized for low precision, speed, and efficiency) versus training (optimized for high throughput and precision).
Tricks to Reduce Memory Bottlenecks
Several strategies exist to make finetuning possible without 100+ GB GPUs:
Gradient Checkpointing
Don’t store all activations; recompute them when needed.
Saves memory, costs extra compute time.
Mixed Precision Training
Store some values (like gradients) in FP16 or BF16 while keeping others (like embeddings) in FP32.
Most ML frameworks (PyTorch AMP, TensorFlow) support this out of the box; a short AMP sketch follows at the end of this list.
Quantization
Use INT8, INT4, or even experimental 1.58-bit formats (BitNet, 2024).
Reduces memory, but may hurt accuracy if not done carefully.
PEFT (Parameter-Efficient Finetuning)
Instead of updating all parameters, update only a small fraction (we’ll dive deep into this in the next section).
Can reduce trainable parameters from billions to millions.
Offloading
Move some weights to CPU or disk (e.g., DeepSpeed ZeRO).
Slower, but allows fitting larger models.
Smaller Models
Don’t underestimate the power of smaller models! A well-finetuned 7B model may outperform an untuned 70B model on your task.
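As promised in the mixed precision item above, here is a minimal PyTorch AMP training-step sketch. The function and batch field names are illustrative assumptions; autocast runs the forward pass in 16-bit while GradScaler guards against underflow in the 16-bit gradients.

```python
# Minimal mixed-precision training step with PyTorch AMP.
import torch

scaler = torch.cuda.amp.GradScaler()

def training_step(model, batch, optimizer, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # forward pass in reduced precision
        outputs = model(batch["inputs"])
        loss = loss_fn(outputs, batch["labels"])
    scaler.scale(loss).backward()              # scale loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```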
Why This Matters
Understanding memory bottlenecks is not just about avoiding runtime errors. It’s about strategic decision-making:
Do you attempt full finetuning, or is PEFT sufficient?
Should you invest in expensive GPUs, or use clever memory-saving techniques?
Do you quantize aggressively for efficiency, or preserve precision for accuracy?
Every organization will make different trade-offs. But the key takeaway is this: memory, not compute, is the limiting factor in finetuning today’s LLMs.
Looking Ahead
We’ve now seen why finetuning is so challenging from a memory perspective. The natural next step is to explore how researchers solved this problem—enter Parameter-Efficient Finetuning (PEFT).
In the next section, we’ll dive into adapter-based methods and soft prompts, the two main families of PEFT techniques that have made finetuning practical for developers around the world.
Finetuning Large Language Models: From Foundations to Adaptation
In the previous section, we saw how memory bottlenecks make full finetuning of large language models (LLMs) impractical for most developers. Training all parameters in a multi-billion-parameter model requires hundreds of gigabytes of memory—far more than what a typical GPU setup can handle.
So researchers asked: What if we didn’t need to update all the parameters?
This gave rise to Parameter-Efficient Finetuning (PEFT). Instead of retraining billions of parameters, PEFT techniques modify or add a tiny fraction of parameters—sometimes as little as 0.1% of the model—while freezing the rest. The results are astonishing: models finetuned with PEFT often perform within a few percentage points of full finetuning, at a fraction of the cost.
Why does this work? Because foundation models already know a lot. Finetuning isn’t about teaching them entirely new concepts; it’s about nudging them into the right shape for a specific task or domain. You don’t need to rewire the whole brain, just adjust a few critical pathways.
Full vs. Partial vs. Parameter-Efficient Finetuning
Let’s quickly compare the three approaches:
Full Finetuning
Updates all model parameters.
Very expensive (memory + compute).
Highest performance, but diminishing returns compared to efficient methods.
Partial Finetuning
Updates only a subset of layers (e.g., the final transformer block).
Reduces cost but often requires 20–30% of parameters to match full performance.
Parameter-inefficient: too many parameters for too little gain.
PEFT
Updates or adds only a very small number of parameters.
Can achieve near full-finetuning performance with 1–5% of parameters (sometimes less).
The dominant approach today for adapting large LLMs.
The breakthrough paper by Houlsby et al. (2019) introduced the first widely adopted PEFT method: adapters. Let’s dive into how adapter-based methods work.
Adapter-Based Methods
The Core Idea
Adapters are small, trainable modules inserted into each transformer block of an LLM. Instead of updating the billions of frozen parameters in the base model, finetuning updates only the adapter weights.
Think of adapters as plug-ins for the model:
The frozen base model is the operating system.
The adapters are small applications that customize behavior.
During inference, the model processes inputs as usual, but the adapters intercept the flow, add their learned adjustments, and pass the signal back. This way, adapters can teach the model new tricks without rewriting its entire knowledge base.
Houlsby Adapters (2019)
The original adapter method added two small feed-forward layers into each transformer block. During training:
The base model’s parameters stayed frozen.
Only the adapter layers were updated.
The results were striking. On the GLUE benchmark for natural language understanding, Houlsby adapters achieved performance within 0.4% of full finetuning while adding only about 3.6% of the parameters per task.
This was a game-changer. Suddenly, adapting large models became feasible on modest hardware.
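A minimal sketch of a Houlsby-style bottleneck adapter: down-project, nonlinearity, up-project, and a residual connection. In practice a couple of these are inserted into each transformer block, and only their weights are trained; sizes here are assumptions.

```python
# Minimal Houlsby-style bottleneck adapter. Only these weights are trained;
# the surrounding transformer stays frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # project down to a small bottleneck
        self.up = nn.Linear(bottleneck, hidden_size)     # project back up
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))       # residual keeps base behavior intact
```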
BitFit (2021)
BitFit took the idea even further: instead of inserting new layers, it updated only the bias terms of the model (hence the name). Since bias parameters make up a tiny fraction of the total weights, this meant updating less than 0.1% of parameters.
Surprisingly, BitFit still delivered competitive results in many tasks. It wasn’t always as strong as adapters, but it showed just how much efficiency was possible.
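BitFit is almost trivially easy to apply to any PyTorch model: freeze everything, then unfreeze only the bias terms. A minimal sketch:

```python
# BitFit in a few lines: freeze all weights, train only the bias terms.
def apply_bitfit(model):
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable / total:.4%} of parameters")
```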
IA³ (2022)
IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations) proposed a more elegant method: instead of adding full adapter layers, it learned small scaling vectors that multiplied existing activations inside the model.
This meant:
Extremely lightweight overhead.
Easy batching for multi-task finetuning.
In some cases, even better performance than full finetuning.
LongLoRA (2023)
One limitation of many LLMs is context length—how much text they can handle at once. Most models are trained with a context window of a few thousand tokens.
LongLoRA adapted LoRA (which we’ll cover in depth in Section 4) to extend context length. By tweaking attention mechanisms and training with long-sequence data, LongLoRA enabled models to handle much larger contexts (e.g., from 4k tokens to 16k tokens or more).
This made it especially useful for tasks like:
Handling long code files.
Processing legal or scientific documents.
Multi-turn dialogue with extended histories.
Soft Prompt–Based Methods
While adapter-based methods add trainable parameters inside the model, soft prompt methods modify the inputs the model sees.
Here’s the difference:
Hard prompts = Human-readable text prompts (e.g., “Translate to French:”).
Soft prompts = Trainable vectors (continuous embeddings) prepended to the input.
Soft prompts aren’t human-readable, but they act like invisible “hints” that steer the model’s behavior.
Prompt Tuning (2021)
In prompt tuning, a small set of trainable embeddings is added to the input at the embedding layer. During training:
The base model stays frozen.
Only the soft prompts are updated.
This allows a single base model to be quickly adapted to many tasks by swapping out the soft prompts.
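Mechanically, prompt tuning is just a small trainable matrix concatenated in front of the token embeddings. A minimal sketch, with sizes and initialization chosen as illustrative assumptions:

```python
# Minimal sketch of prompt tuning: trainable "soft prompt" vectors are
# prepended to the token embeddings; only they receive gradients.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int, hidden_size: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, hidden_size) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)  # [batch, n_tokens + seq, hidden]
```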
Prefix Tuning (2021)
Prefix tuning goes a step further: instead of adding tokens only at the input, it prepends soft prompts at every transformer layer. This provides more control over the model’s internal activations.
Prefix tuning is more powerful but also slightly heavier than prompt tuning.
P-Tuning (2021)
P-Tuning is another variant that optimizes where and how soft prompts are injected. These subtle differences in architecture make each method slightly better for certain tasks, but the underlying idea is the same: train a small set of vectors instead of the whole model.
Comparing Adapter vs. Soft Prompt Methods
| Feature | Adapter-Based | Soft Prompt–Based |
| --- | --- | --- |
| Where it acts | Inside transformer layers | Input embeddings (and sometimes layers) |
| Parameter efficiency | 1–5% trainable params | Often <1% trainable params |
| Performance | Strong, near full finetuning | Competitive, but sometimes weaker |
| Modularity | Can stack multiple adapters | Can swap prompts easily |
| Inference speed | Slightly slower (extra layers) | Same as base model |
| Use cases | Domain adaptation, syntax enforcement | Multi-task, lightweight customization |
Both families of methods are actively used today, depending on the application and constraints.
Real-World Adoption
To see what’s popular in practice, researchers analyzed thousands of GitHub issues from the Hugging Face PEFT library (2024). The results showed:
LoRA dominates adapter-based methods.
Soft prompts are less common but gaining traction among teams that want more flexibility than prompting but less complexity than finetuning.
This aligns with industry trends: LoRA is the go-to choice for finetuning today, thanks to its balance of efficiency and performance.
Why PEFT Matters for Developers
For most application developers, full finetuning is out of reach. Even partial finetuning may be impractical. PEFT bridges the gap by making it possible to:
Run finetuning on a single GPU or small cluster, instead of massive infrastructure.
Maintain multiple task-specific models by swapping adapters or prompts, rather than training from scratch each time.
Deploy cheaper, faster models that still meet production needs.
In other words, PEFT democratizes LLM customization.
Looking Ahead
We’ve covered the two main families of PEFT: adapters and soft prompts. But one method in particular—LoRA (Low-Rank Adaptation)—has become the de facto standard, powering everything from open-source community models to enterprise deployments.
In the next section, we’ll dive deep into LoRA, exploring how it works, why it’s so effective, and the many variants that have sprung from it.
Finetuning Large Language Models: From Foundations to Adaptation
Section 4 – LoRA Deep Dive & Variants
The Emergence of LoRA
Among all the parameter-efficient finetuning (PEFT) methods, LoRA (Low-Rank Adaptation) has emerged as the most widely used. Originally introduced by Hu et al. (2021), LoRA was designed with a simple but powerful idea: instead of updating huge weight matrices directly, approximate their updates using much smaller low-rank matrices.
This approach slashes the number of trainable parameters by orders of magnitude while preserving performance. Today, if you see a community finetuned model on Hugging Face, chances are it was trained using LoRA.
But why did LoRA succeed where others were just “good enough”? To answer that, let’s break down how it works.
The Core Idea of LoRA
Transformer models, like GPT and LLaMA, are filled with linear layers—essentially giant matrices that multiply with vectors to process data. For example:
y = W × x
where:
W is a weight matrix (often billions of parameters).
x is the input vector.
y is the output.
In finetuning, we want to adjust W. But directly storing and updating such a massive matrix is impractical.
LoRA introduces a clever trick:
Instead of updating W directly, represent the update ΔW as the product of two smaller matrices, A and B.
W’ = W + ΔW
ΔW = A × B
Here:
A is an m × r matrix.
B is an r × n matrix.
r is the rank (a small number, often 4, 8, or 16).
Because r is small, A × B has far fewer parameters than W. This is the “low-rank” in Low-Rank Adaptation.
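A minimal LoRA linear layer sketch, following the text's naming (A is m × r, B is r × n): the frozen weight W is left untouched and the trainable low-rank update A × B is added on the side, scaled by alpha / r. After training, the update can be folded back into W, which is why inference speed is unchanged. Initialization and default rank are assumptions.

```python
# Minimal LoRA sketch: freeze W and learn a low-rank update A @ B on the side.
# After training, (alpha / r) * A @ B can be merged into W, so inference cost
# is identical to the base model.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in base.parameters():
            p.requires_grad_(False)                          # freeze W (and bias)
        out_f, in_f = base.out_features, base.in_features
        self.A = nn.Parameter(torch.zeros(out_f, r))         # m x r, starts as a zero update
        self.B = nn.Parameter(torch.randn(r, in_f) * 0.01)   # r x n
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.A @ self.B                            # m x n low-rank update
        return self.base(x) + self.scale * (x @ delta_w.T)

    @torch.no_grad()
    def merge(self):
        """Fold the update into W so inference runs at base-model speed."""
        self.base.weight += self.scale * (self.A @ self.B)
```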
Why LoRA Works: The Intrinsic Dimension Hypothesis
At first glance, you might wonder: How can tiny matrices possibly capture the complexity of massive weight updates?
The answer lies in the intrinsic dimension hypothesis. Research suggests that many high-dimensional tasks can actually be solved in a much lower-dimensional subspace. In other words, even though models have billions of parameters, the meaningful changes for a given task lie in a much smaller space.
LoRA exploits this by only learning updates within that smaller subspace. The base model provides the “knowledge,” and LoRA adapters learn the “specialization.”
Practical Benefits of LoRA
LoRA offers several advantages that explain its widespread adoption:
Efficiency
Only a tiny fraction of parameters are trained.
Finetuning becomes feasible on a single GPU.
Modularity
You can train different LoRA adapters for different tasks.
Swap them in and out without retraining the whole model.
Composability
Multiple LoRAs can be combined (e.g., one for medical text, another for legal text).
This makes models highly customizable.
No Inference Latency Penalty
LoRA updates are folded into the weight multiplication, so inference remains just as fast.
These factors made LoRA the go-to choice for researchers and developers alike.
Variants of LoRA
Like all good ideas in AI, LoRA quickly inspired numerous extensions and adaptations. Let’s look at the major ones.
Quantized LoRA (QLoRA)
Introduced by Dettmers et al. (2023), QLoRA combined two powerful ideas:
Quantization – Store base model weights in 4-bit precision to reduce memory.
LoRA – Train low-rank adapters on top of the quantized model.
This made it possible to finetune 65B-parameter models on a single 48 GB GPU. Before QLoRA, such large-scale finetuning was reserved for top-tier labs.
QLoRA democratized finetuning by drastically lowering hardware requirements. Today, it’s one of the most practical setups for community-driven projects.
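A minimal sketch of the now-standard QLoRA setup using Hugging Face Transformers, bitsandbytes, and PEFT. The model name, target modules, and LoRA hyperparameters are placeholders to adapt to your own model.

```python
# Minimal QLoRA setup sketch: load the base model in 4-bit, then attach
# trainable LoRA adapters on top. Names and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat, introduced with QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```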
LongLoRA
We briefly mentioned LongLoRA earlier. It modifies the attention mechanism to allow models to handle longer sequences during finetuning. Combined with LoRA, it makes training for long-context tasks affordable.
Use cases:
Reading entire legal contracts.
Summarizing research papers.
Multi-turn chatbots with extended histories.
ReLoRA (Restarted LoRA)
ReLoRA addresses a subtle issue: as training progresses, low-rank updates may lose effectiveness. ReLoRA periodically resets the low-rank matrices and reinitializes them, allowing the model to keep learning efficiently.
This is especially useful for longer training runs where LoRA might otherwise stagnate.
GaLore (Gradient Low-Rank Projection)
GaLore, a 2024 innovation, applies the low-rank trick not just to weight updates, but directly to gradients. Instead of storing full gradients (which consume huge memory), it projects them into a low-rank space.
This slashes memory requirements for backpropagation and makes finetuning even more efficient.
Multi-LoRA Serving
One of LoRA’s killer features is that you can serve multiple adapters on the same base model. For example:
A customer service company could maintain separate LoRAs for retail, banking, and healthcare clients.
A developer could train LoRAs for different tasks (summarization, sentiment analysis, translation) and load whichever is needed on demand.
This modularity is similar to having multiple browser extensions installed on the same web browser. The core system remains unchanged; each extension adds a specialization.
Serving frameworks like Hugging Face’s peft library and 🤗 Transformers already support this capability, making LoRA-based deployments production-ready.
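A sketch of what multi-adapter serving looks like with the peft library: one frozen base model, several LoRA adapters loaded by name and swapped on demand. The adapter paths are placeholders, and the exact API details may vary across peft versions.

```python
# Sketch: one frozen base model, several LoRA adapters swapped on demand.
# Adapter paths are placeholders; check your peft version for exact APIs.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "adapters/retail", adapter_name="retail")
model.load_adapter("adapters/banking", adapter_name="banking")

model.set_adapter("banking")   # route banking traffic through the banking LoRA
# ... generate ...
model.set_adapter("retail")    # switch back without reloading the base model
```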
Real-World Applications of LoRA
Community Models
Hugging Face is filled with LoRA-finetuned models.
From roleplay chatbots to code assistants, LoRA makes experimentation accessible.
Enterprise Customization
Companies use LoRA to adapt foundation models to domain-specific needs.
Example: A financial firm finetuning a Llama model with LoRA for SEC filings and earnings calls.
Education and Research
Universities and independent researchers can now finetune models without massive clusters.
LoRA has become the default teaching method for “hands-on LLM finetuning.”
Why LoRA Dominates
While other PEFT methods (like adapters and soft prompts) are useful, LoRA has several unique advantages that explain its dominance:
Performance – Often matches or beats other methods.
Ease of Integration – Works seamlessly with existing architectures.
Scalability – Handles small and very large models.
Community Momentum – Supported by libraries like Hugging Face PEFT, PyTorch, and ColossalAI.
In fact, many researchers now assume LoRA as the default PEFT baseline in experiments.
Beyond LoRA: The Future of Low-Rank Methods
LoRA is not the end of the story. Several exciting directions are emerging:
Dynamic rank selection – Adapting rank r per layer for optimal efficiency.
Low-rank pre-training – Training models from scratch with low-rank constraints.
Hybrid approaches – Combining LoRA with soft prompts or adapters for maximum flexibility.
The field is evolving fast, but one thing is clear: the low-rank principle has unlocked a new era of scalable, affordable finetuning.
Looking Ahead
We’ve now explored LoRA and its variants in depth. But LoRA isn’t the only alternative to full finetuning. Another powerful approach is model merging, where instead of training adapters, developers directly combine pre-trained models.
In the next section, we’ll cover:
Model merging techniques (summing, stacking, concatenation).
Finetuning frameworks and hyperparameters.
Practical decision-making for when and how to finetune.
Finetuning Large Language Models: From Foundations to Adaptation
Section 5 – Model Merging, Frameworks, Hyperparameters, and Conclusion
Beyond Finetuning: The Idea of Model Merging
So far, we’ve focused on finetuning—whether full, partial, or parameter-efficient. But there’s another emerging paradigm: model merging.
Instead of training adapters or updating weights, model merging combines already finetuned models into a single new model.
Why does this work? Because finetuned models often contain complementary knowledge. By merging, we can capture the strengths of multiple models without retraining from scratch.
Techniques for Model Merging
Weight Averaging (Model Soup)
Proposed by Wortsman et al. (2022).
Merge multiple models by taking a weighted average of their parameters.
Example: If you have three LoRA-finetuned models for different NLP tasks, averaging them can yield a single, more general model.
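A minimal sketch of uniform weight averaging across checkpoints that share the same architecture; real model-soup recipes are often greedy or weighted rather than a plain mean.

```python
# Minimal "model soup" sketch: uniform average of state dicts from models
# that share the same architecture.
import torch

def average_state_dicts(state_dicts: list[dict]) -> dict:
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]
        ).mean(dim=0)
    return averaged

# soup = average_state_dicts([m1.state_dict(), m2.state_dict(), m3.state_dict()])
# merged_model.load_state_dict(soup)
```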
Layer Stacking
Stack layers from different models.
Example: Lower layers from a language model trained on code + higher layers from a model trained on dialogue.
Concatenation
Combine embeddings or intermediate representations from multiple models.
More experimental, but promising for multimodal tasks (e.g., text + image).
Real-World Examples
SOLAR 10.7B (2023): A model built by merging multiple open-weight models (up to 16). Despite being only 10.7B parameters, it achieved performance close to LLaMA-2-70B on benchmarks.
On-device models: Small models on smartphones can be “merged” from different finetuned variants, allowing them to perform diverse tasks without huge storage overhead.
Model merging isn’t yet as mature as LoRA, but it shows the future may involve mixing and matching specialized models rather than finetuning one base model repeatedly.
Choosing the Base Model and Framework
When planning finetuning or merging, the choice of base model is critical, and so is the tooling you build around it.
For most developers, Hugging Face’s PEFT and Transformers libraries are enough. For larger teams, frameworks like DeepSpeed and ColossalAI unlock scalability.
Hyperparameters That Matter
Finetuning involves many knobs, but a few matter more than others:
Learning Rate
Too high → catastrophic forgetting (model loses general knowledge).
Too low → underfitting.
Rule of thumb: LoRA adapters often use learning rates around 1e-4 to 1e-5.
Batch Size
Larger batches → more stable gradients, but higher memory.
Smaller batches → noisier but sometimes better generalization.
Sequence Length
Longer sequences capture more context but increase memory.
Use gradient checkpointing or LongLoRA for efficiency.
Prompt Loss Weighting
Controls how much prompt tokens (versus response tokens) contribute to the training loss in instruction finetuning.
Epochs & Early Stopping
Overtraining can lead to overfitting or forgetting.
Early stopping helps maintain balance.
In practice, hyperparameter tuning is as much art as science. Many teams start with community defaults and adjust based on validation performance.
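To make the rules of thumb above concrete, here is one way to write down a starting configuration. Every value is a community-style default reflecting the advice in this section, not a recommendation from any specific framework; adjust against your validation set.

```python
# A starting-point configuration reflecting the rules of thumb above.
# Every value is a tunable assumption, not a universal recommendation.
lora_finetune_config = {
    "learning_rate": 1e-4,          # LoRA adapters: commonly 1e-4 to 1e-5
    "batch_size": 8,                # raise for more stable gradients if memory allows
    "max_seq_length": 2048,         # longer context captures more, costs more memory
    "num_epochs": 3,                # watch validation loss; stop early if it rises
    "early_stopping_patience": 1,
    "prompt_loss_weight": 0.0,      # how much prompt tokens count toward the loss
    "gradient_checkpointing": True,
}
```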
Practical Development Paths
OpenAI’s Path
OpenAI suggests a progression for developers:
Prompt engineering – Cheapest, simplest.
Few-shot prompting – Better results with examples.
RAG (Retrieval-Augmented Generation) – Bring external knowledge in.
Finetuning (SFT/LoRA) – When you need strict formats or domain alignment.
This ensures you don’t overinvest in finetuning when prompting or RAG would suffice.
Distillation Path
Another popular path is distillation:
Train a small student model to mimic a large teacher model.
Use finetuned LLMs as teachers.
Deploy smaller, faster students for production.
This allows enterprises to get 80–90% of the performance at a fraction of the cost.
Challenges and Trade-Offs
Data Availability
Good finetuning data is rare and expensive.
Synthetic data generation (using GPT-4 or Claude) helps but can introduce noise.
Latency and Deployment
Adapters can increase latency if not merged back into the model.
Some frameworks support merging LoRA weights into base weights for deployment.
When Not to Finetune
If the goal is to add new facts → Use RAG.
If the goal is structured output or style → Use finetuning.
Key mantra: “Finetuning is for form, RAG is for facts.”
The Future of Finetuning
Where is finetuning headed? Several trends are emerging:
Low-rank pre-training: Training models from scratch with LoRA-like constraints.
Composable adapters: Building “adapter stores” where LoRAs can be mixed like Lego blocks.
Model merging at scale: Building models like SOLAR that rival much larger ones.
Ultra-low-bit training: BitNet (2024) showed training in 1.58-bit precision can be stable, hinting at massive efficiency gains.
The direction is clear: finetuning will become cheaper, more modular, and more accessible.
Conclusion: A New Era of Customization
We’ve journeyed through the evolution of finetuning:
From full finetuning and its memory bottlenecks…
To PEFT methods like adapters and soft prompts…
To the dominance of LoRA and its variants…
To emerging approaches like model merging.
The takeaway is simple: you no longer need a billion-dollar budget to adapt powerful LLMs. With techniques like LoRA and QLoRA, even modest setups can achieve near state-of-the-art results.
For developers and enterprises, the message is empowering:
Use prompting and RAG when possible.
Use finetuning when you need control, style, or structure.
Explore merging and distillation to push efficiency further.
The field is moving fast, but one principle remains constant: foundation models are the clay, and finetuning is how we sculpt them into tools that serve our needs.