Wednesday, December 31, 2025

Finetuning Large Language Models -- From Foundations to Adaptation (Chapter 7)


Outline

  1. Section 1: Introduction + The Evolution of Finetuning

  2. Section 2: Memory Bottlenecks Explained

  3. Section 3: Parameter-Efficient Finetuning (PEFT) & Adapter-Based Methods

  4. Section 4: LoRA Deep Dive + Variants

  5. Section 5: Model Merging, Frameworks, Hyperparameters, and Conclusion



Finetuning Large Language Models: From Foundations to Adaptation

Section 1 – Introduction & The Evolution of Finetuning


Introduction: Why Finetuning Matters in the Age of Foundation Models

In the past few years, the way we build artificial intelligence (AI) applications has changed dramatically. At the center of this transformation are foundation models—large language models (LLMs) like GPT-4, Claude, Llama 3, and Mistral. These models come pre-trained on massive amounts of text data, enabling them to perform a wide variety of tasks right out of the box: answering questions, generating code, summarizing documents, and even creating poetry.

But while these models are incredibly powerful, they are not perfect for every situation. Imagine using a general-purpose LLM to handle specialized legal queries, medical diagnostics, or customer service conversations in a very specific tone of voice. The model may be good enough, but it may not fully align with your domain, data, or desired output format.

This is where finetuning comes into play.

Finetuning is the process of adapting a pre-trained model by training it further—either on your specific data, or with modifications that make it better suited for a specialized task. Unlike prompt engineering, which adjusts the input you feed into a model, finetuning adjusts the weights inside the model itself. The difference is subtle but significant: finetuning rewires how the model thinks, while prompting merely nudges it in the right direction.

To borrow an analogy:

  • Prompting is like instructing a professional chef to cook a new dish by giving detailed step-by-step instructions.

  • Finetuning is like training the chef to permanently learn that dish so they can cook it without explicit reminders in the future.

Finetuning is not just about better outputs—it’s about efficiency, alignment, and control. It allows developers and organizations to:

  • Improve instruction following (e.g., ensuring outputs always come in JSON).

  • Reduce biases by retraining on curated datasets.

  • Enhance domain-specific performance, such as financial modeling or SQL generation.

  • Create smaller, cheaper models that outperform larger ones on targeted tasks.

Of course, finetuning comes at a cost. It requires high-quality data, computational resources, and expertise in machine learning (ML). It also raises questions: When should you finetune? When is Retrieval-Augmented Generation (RAG) enough? What are the trade-offs?

This blog series explores these questions in depth. In this first section, we’ll walk through the evolution of finetuning—from its early days to today’s parameter-efficient techniques.


The Roots of Finetuning: Transfer Learning

The story of finetuning begins with transfer learning, a concept first introduced by Bozinovski and Fulgosi in 1976. Transfer learning is the idea that knowledge gained in one task can be transferred to accelerate learning in another. Humans do this all the time—if you know how to play the piano, learning the guitar is easier because you already understand rhythm, scales, and coordination.

In machine learning, transfer learning became popular in computer vision. Models trained on ImageNet, a massive image dataset, learned general features like edges, textures, and shapes. These features could then be reused for new tasks like detecting tumors in X-rays or recognizing traffic signs. Instead of training a new model from scratch, developers fine-tuned the ImageNet model for their specific problem.

The same principle applies to LLMs. A language model trained on billions of words already “knows” grammar, facts, and reasoning patterns. Instead of starting from scratch, developers finetune it for specialized tasks like legal document analysis, financial forecasting, or code generation.

A famous early success was Google’s multilingual translation system (Johnson et al., 2016). The system could translate Portuguese ↔ English and English ↔ Spanish. Surprisingly, without explicit examples, it learned to translate Portuguese ↔ Spanish—a task it was never directly trained on. That was transfer learning in action.


Finetuning in the Era of Large Language Models

When it comes to LLMs, finetuning builds on the pre-training phase. Pre-training is typically done in a self-supervised manner, meaning the model learns by predicting the next word in massive text datasets. This equips the model with broad knowledge but not necessarily the ability to follow instructions well or produce outputs in a specific style.

Here’s where post-training and finetuning enter:

  1. Supervised Finetuning (SFT) – The model is trained on pairs of (instruction, response). For example:

    • Instruction: “Summarize this contract in simple terms.”

    • Response: “This contract states that the employee will…”
      SFT improves alignment with human expectations.

  2. Preference Finetuning (RLHF) – The model learns to prefer outputs that humans like, usually through reinforcement learning with human feedback. This makes responses more helpful, harmless, and honest.

  3. Continued Pre-Training – The model is further trained on large amounts of raw domain-specific text (e.g., legal documents or medical literature) before fine-grained supervised tuning.

Each of these forms of finetuning is an extension of pre-training. The difference is in how much data you use, what kind of data, and which parameters you update.
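
To make the (instruction, response) format concrete, here is a minimal sketch of what SFT training records might look like as JSON lines. The field names are illustrative conventions, not a fixed standard; adjust them to whatever your training framework expects.

```python
import json

# Illustrative SFT records; the "instruction"/"response" field names are a
# common convention, not a requirement of any particular framework.
sft_examples = [
    {
        "instruction": "Summarize this contract in simple terms.",
        "response": "This contract states that the employee will...",
    },
    {
        "instruction": "Reply with a JSON object containing the keys 'sentiment' and 'confidence'.",
        "response": '{"sentiment": "positive", "confidence": 0.92}',
    },
]

# One JSON object per line (JSONL) is a format many finetuning tools accept.
with open("sft_train.jsonl", "w") as f:
    for example in sft_examples:
        f.write(json.dumps(example) + "\n")
```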


Why Finetuning?

Let’s consider some real-world scenarios:

  • Structured Outputs: Suppose you need a model to always respond in strict JSON format for downstream automation. Prompting can work, but it’s fragile. Finetuning ensures the model internalizes this requirement.

  • Domain-Specific Jargon: A healthcare chatbot may need to understand abbreviations like “BP” (blood pressure) or “HbA1c” (a diabetes measure). General models may stumble; finetuning on medical data fixes this.

  • Bias Mitigation: If a model frequently associates “CEO” with male names, finetuning on curated datasets that include female CEOs can rebalance the bias.

  • Smaller, Faster Models: Grammarly reported that their finetuned Flan-T5 models (much smaller than GPT-3) outperformed GPT-3 on text editing tasks. This made their models cheaper and faster in production.

In short, finetuning helps bridge the gap between a model’s general intelligence and the specific intelligence you need.


The Shift from Full to Efficient Finetuning

In the early days, finetuning meant updating all the parameters of a model. This is called full finetuning. For small models, this was feasible. But as models grew to billions of parameters, full finetuning became impractical:

  • A 7B parameter model in 16-bit precision requires ~14 GB just to store the weights.

  • To finetune it with optimizers like Adam, you may need ~56 GB or more.

  • This exceeds the memory of most consumer GPUs.

To solve this, researchers moved toward partial finetuning: instead of updating the entire model, they updated only parts—often the final layers. While this reduced memory requirements, it was inefficient: performance dropped compared to full finetuning.

The breakthrough came with Parameter-Efficient Finetuning (PEFT). Instead of touching billions of parameters, PEFT techniques add or modify only a small fraction of the model while freezing the rest. The result: near-full performance at a fraction of the cost.

PEFT methods fall broadly into two categories:

  • Adapter-based methods: Insert small modules (adapters) into the model and train only them.

  • Soft prompt–based methods: Add trainable embeddings (soft prompts) to guide the model.

One adapter-based technique, LoRA (Low-Rank Adaptation), has become especially dominant. LoRA allows developers to finetune massive models using only a few percent of the parameters, without hurting inference speed.


Looking Ahead

We’ve set the stage:

  • Finetuning began as transfer learning.

  • It evolved into full and partial finetuning.

  • The need for scalability gave rise to PEFT methods like LoRA.

In the next section, we’ll explore why finetuning is so memory-intensive, breaking down the concepts of trainable parameters, backpropagation, and numerical representations. Understanding these bottlenecks is key to appreciating why PEFT matters so much.


Finetuning Large Language Models: From Foundations to Adaptation

Section 2 – Memory Bottlenecks Explained


Why Memory Is the Achilles’ Heel of Finetuning

When developers first attempt to finetune a large language model (LLM), one of the most common errors they see is:

RuntimeError: CUDA out of memory

This isn’t just a beginner’s mistake—it’s a fundamental reality of working with models that contain billions of parameters. Memory is the biggest bottleneck in finetuning, and unless you understand where that memory goes, you’ll quickly hit a wall.

At inference time (when the model is simply generating outputs), memory requirements are high but manageable. Finetuning, however, is far more demanding. That’s because finetuning doesn’t just run the model—it also needs to update its parameters, store gradients, and keep optimizer states. All of this adds up.

In this section, we’ll unpack the sources of memory consumption in finetuning, run through some back-of-the-envelope calculations, and explore techniques like quantization and gradient checkpointing that help mitigate the problem.


Parameters, Trainable Parameters, and Frozen Layers

Every model is made up of parameters—the numerical weights that determine how it processes input. For example:

  • GPT-3 has 175 billion parameters.

  • Llama 2 has versions ranging from 7 billion to 70 billion parameters.

  • Even a “small” model like Mistral 7B needs roughly 14 GB just to store its weights in 16-bit precision.

When you perform inference, the model loads all its parameters into memory but does not modify them. That’s relatively simple.

When you perform finetuning, however, you allow some or all of those parameters to be updated. The parameters that can change are called trainable parameters.

  • Full finetuning: All parameters are trainable.

  • Partial finetuning: Some parameters are trainable, the rest are frozen.

  • PEFT methods (e.g., LoRA): Only a tiny fraction of parameters are added or modified.

The more trainable parameters you have, the higher the memory footprint.
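
As a minimal PyTorch sketch of the difference (using a small Hugging Face model purely as an example), you can freeze everything and then selectively unfreeze a subset; the trainable-parameter count tells you how expensive training will be:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # small example model

# Freeze every parameter (nothing trainable, as in pure inference).
for param in model.parameters():
    param.requires_grad = False

# Partial finetuning sketch: unfreeze only the final transformer block.
for param in model.transformer.h[-1].parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")
```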


Backpropagation: The Memory Multiplier

The reason finetuning consumes so much more memory than inference comes down to the training process, specifically backpropagation.

Training happens in two stages:

  1. Forward pass – The model computes an output given an input. (This is the same as inference.)

  2. Backward pass – The model computes how wrong its output was compared to the expected result (the loss) and adjusts its weights accordingly.

During the backward pass, each trainable parameter requires extra memory for:

  • Its gradient (how much that parameter contributed to the error).

  • Its optimizer states (values used by optimizers like Adam to update weights efficiently).

For example, with the Adam optimizer:

  • Each trainable parameter requires 3 extra values: 1 gradient + 2 optimizer states.

  • If you store each value in 2 bytes (16-bit precision), that’s an extra 6 bytes per parameter. (In practice, optimizer states are often kept in 32-bit, which makes the overhead even larger.)

So for a 13B-parameter model, updating all parameters means:

13B parameters × 6 bytes = 78 GB

And that’s just for gradients and optimizer states! Add in the original weights (~26 GB in 16-bit precision) plus activations, and you’re well beyond the capacity of a single GPU.


Activations: The Hidden Memory Hog

Parameters aren’t the only concern. Neural networks also generate activations—the intermediate values produced as data flows through the layers.

During inference, activations are ephemeral: once the output is generated, they can be discarded.

During training, however, activations must be stored so they can be reused in the backward pass to calculate gradients. This means activations can dwarf the size of the model’s weights.

Research from NVIDIA (Korthikanti et al., 2022) showed that for large transformer models, activation memory often exceeds weight memory.

To cope with this, developers use gradient checkpointing (activation recomputation). Instead of storing all activations, the model recomputes them when needed. This reduces memory at the cost of slower training.
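
As a rough sketch, most Hugging Face transformers models expose this trade-off through a single call (exact behavior varies by model and library version):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Recompute activations during the backward pass instead of storing them all.
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is incompatible with checkpointing during training

# For a plain PyTorch module, torch.utils.checkpoint.checkpoint offers the
# same memory-for-compute trade at per-function granularity.
```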


Memory Math: A Simple Formula

To get a rough estimate of how much memory finetuning requires, you can use this formula:

Training memory = Weights + Activations + Gradients + Optimizer states

Let’s work through a practical example.

Example: 13B-parameter model (LLaMA 2-13B)

  • Weights (16-bit) = 13B × 2 bytes = 26 GB

  • Gradients + Optimizer states (Adam, 16-bit) = 13B × 3 × 2 bytes = 78 GB

  • Activations (approx. 20% of weights) = ~5 GB (in practice often much higher)

Total = 26 + 78 + 5 ≈ 109 GB
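
The same back-of-the-envelope arithmetic as a small Python helper (the 16-bit and 20%-activation assumptions mirror the example above and are rough):

```python
def estimate_training_memory_gb(
    n_params: float,
    bytes_per_weight: float = 2,           # 16-bit weights
    optimizer_values_per_param: int = 3,   # Adam: 1 gradient + 2 moment estimates
    bytes_per_optimizer_value: float = 2,  # simplified: also stored in 16-bit
    activation_fraction: float = 0.2,      # rough guess; often much higher in practice
) -> float:
    """Rough training-memory estimate in GB; ignores framework and CUDA overhead."""
    weights = n_params * bytes_per_weight
    grads_and_optimizer = n_params * optimizer_values_per_param * bytes_per_optimizer_value
    activations = weights * activation_fraction
    return (weights + grads_and_optimizer + activations) / 1e9

print(f"{estimate_training_memory_gb(13e9):.0f} GB")  # ~109 GB for a 13B model
```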

That’s well beyond even high-end GPUs. The Nvidia A100 tops out at 80 GB of memory, and most developers use far less powerful cards (24–48 GB).

This is why full finetuning of large models is impractical for most organizations.


Numerical Representations: FP32, FP16, BF16, and INT8

Another key factor in memory usage is the numerical format used to store values.

Traditionally, models were trained in FP32 (32-bit floating point). But this is overkill:

  • FP32 = 4 bytes per value.

  • FP16 (half precision) = 2 bytes per value.

  • BF16 (Google’s “Brain Floating Point”) also = 2 bytes, but with different trade-offs (wider range, less precision).

  • INT8 and INT4 = 1 byte and 0.5 bytes, respectively.

For example, a 13B-parameter model in FP32 requires 52 GB just for weights. In FP16 or BF16, that drops to 26 GB. In INT8, it’s 13 GB.

This reduction is called quantization—representing values with fewer bits.

  • Post-training quantization (PTQ): Quantize after the model is trained. Common for inference.

  • Quantization-aware training (QAT): Simulate low precision during training so the model learns to handle it.

Quantization can reduce both memory and compute time, but it risks accuracy loss. Striking the right balance is an active area of research.
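
A quick sketch of where the per-format numbers above come from (bytes per value times parameter count, weights only):

```python
BYTES_PER_VALUE = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

n_params = 13e9  # a 13B-parameter model
for fmt, nbytes in BYTES_PER_VALUE.items():
    print(f"{fmt}: {n_params * nbytes / 1e9:.1f} GB for weights alone")
```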


Inference vs. Training Memory Profiles

It’s important to distinguish between inference and training memory requirements:

  • Inference memory ≈ Weights + Activations (scaled by sequence length & batch size)

  • Training memory ≈ Weights + Activations + Gradients + Optimizer states

That extra overhead for gradients and optimizer states is what makes finetuning several times more demanding than inference.

This difference is why we see specialized chips for inference (optimized for low precision, speed, and efficiency) versus training (optimized for high throughput and precision).


Tricks to Reduce Memory Bottlenecks

Several strategies exist to make finetuning possible without 100+ GB GPUs:

  1. Gradient Checkpointing

    • Don’t store all activations; recompute them when needed.

    • Saves memory, costs extra compute time.

  2. Mixed Precision Training

    • Compute most of the forward and backward pass in FP16 or BF16 while keeping numerically sensitive values (such as the master copy of the weights and loss accumulation) in FP32.

    • Most ML frameworks (PyTorch AMP, TensorFlow) support this out of the box; a short sketch follows this list.

  3. Quantization

    • Use INT8, INT4, or even experimental 1.58-bit formats (BitNet, 2024).

    • Reduces memory, but may hurt accuracy if not done carefully.

  4. PEFT (Parameter-Efficient Finetuning)

    • Instead of updating all parameters, update only a small fraction (we’ll dive deep into this in the next section).

    • Can reduce trainable parameters from billions to millions.

  5. Offloading

    • Move some weights to CPU or disk (e.g., DeepSpeed ZeRO).

    • Slower, but allows fitting larger models.

  6. Smaller Models

    • Don’t underestimate the power of smaller models! A well-finetuned 7B model may outperform an untuned 70B model on your task.
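
To make the mixed-precision idea from item 2 concrete, here is a minimal PyTorch automatic-mixed-precision loop. The model and data are placeholders, and the snippet assumes a CUDA GPU is available:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()        # scales the loss to avoid FP16 underflow

for step in range(10):
    x = torch.randn(8, 1024, device="cuda")
    target = torch.randn(8, 1024, device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(x), target)  # forward runs in FP16 where safe

    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps the optimizer
    scaler.update()
```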


Why This Matters

Understanding memory bottlenecks is not just about avoiding runtime errors. It’s about strategic decision-making:

  • Do you attempt full finetuning, or is PEFT sufficient?

  • Should you invest in expensive GPUs, or use clever memory-saving techniques?

  • Do you quantize aggressively for efficiency, or preserve precision for accuracy?

Every organization will make different trade-offs. But the key takeaway is this: memory, not compute, is the limiting factor in finetuning today’s LLMs.


Looking Ahead

We’ve now seen why finetuning is so challenging from a memory perspective. The natural next step is to explore how researchers solved this problem—enter Parameter-Efficient Finetuning (PEFT).

In the next section, we’ll dive into adapter-based methods and soft prompts, the two main families of PEFT techniques that have made finetuning practical for developers around the world.


Finetuning Large Language Models: From Foundations to Adaptation

Section 3 – Parameter-Efficient Finetuning (PEFT) & Adapter-Based Methods


The Rise of Parameter-Efficient Finetuning (PEFT)

In the previous section, we saw how memory bottlenecks make full finetuning of large language models (LLMs) impractical for most developers. Training all parameters in a multi-billion-parameter model requires hundreds of gigabytes of memory—far more than what a typical GPU setup can handle.

So researchers asked: What if we didn’t need to update all the parameters?

This gave rise to Parameter-Efficient Finetuning (PEFT). Instead of retraining billions of parameters, PEFT techniques modify or add a tiny fraction of parameters—sometimes as little as 0.1% of the model—while freezing the rest. The results are astonishing: models finetuned with PEFT often perform within a few percentage points of full finetuning, at a fraction of the cost.

Why does this work? Because foundation models already know a lot. Finetuning isn’t about teaching them entirely new concepts; it’s about nudging them into the right shape for a specific task or domain. You don’t need to rewire the whole brain, just adjust a few critical pathways.


Full vs. Partial vs. Parameter-Efficient Finetuning

Let’s quickly compare the three approaches:

  • Full Finetuning

    • Updates all model parameters.

    • Very expensive (memory + compute).

    • Highest performance, but diminishing returns compared to efficient methods.

  • Partial Finetuning

    • Updates only a subset of layers (e.g., the final transformer block).

    • Reduces cost but often requires 20–30% of parameters to match full performance.

    • Parameter-inefficient: too many parameters for too little gain.

  • PEFT

    • Updates or adds only a very small number of parameters.

    • Can achieve near full-finetuning performance with 1–5% of parameters (sometimes less).

    • The dominant approach today for adapting large LLMs.

The breakthrough paper by Houlsby et al. (2019) introduced the first widely adopted PEFT method: adapters. Let’s dive into how adapter-based methods work.


Adapter-Based Methods

The Core Idea

Adapters are small, trainable modules inserted into each transformer block of an LLM. Instead of updating the billions of frozen parameters in the base model, finetuning updates only the adapter weights.

Think of adapters as plug-ins for the model:

  • The frozen base model is the operating system.

  • The adapters are small applications that customize behavior.

During inference, the model processes inputs as usual, but the adapters intercept the flow, add their learned adjustments, and pass the signal back. This way, adapters can teach the model new tricks without rewriting its entire knowledge base.


Houlsby Adapters (2019)

The original adapter method added two small feed-forward layers into each transformer block. During training:

  • The base model’s parameters stayed frozen.

  • Only the adapter layers were updated.

The results were striking. On the GLUE benchmark for natural language understanding, Houlsby adapters achieved performance within 0.4% of full finetuning, while training only 3% of parameters.

This was a game-changer. Suddenly, adapting large models became feasible on modest hardware.
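
A minimal PyTorch sketch of a bottleneck adapter in the spirit of Houlsby et al. (2019). The sizes are illustrative; the original method inserts two such modules into every transformer block and trains only their weights:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a non-linearity, up-project, and add a residual connection."""

    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()
        # Near-identity initialization so the frozen model's behavior is preserved at first.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = BottleneckAdapter()
print(adapter(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```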


BitFit (2021)

BitFit took the idea even further: instead of inserting new layers, it updated only the bias terms of the model (hence the name). Since bias parameters make up a tiny fraction of the total weights, this meant updating less than 0.1% of parameters.

Surprisingly, BitFit still delivered competitive results in many tasks. It wasn’t always as strong as adapters, but it showed just how much efficiency was possible.
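
The BitFit recipe is almost a one-liner in PyTorch. A rough sketch, assuming a Hugging Face model whose bias parameters have “bias” in their names (naming conventions vary by architecture):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Train only parameters whose name contains "bias"; freeze everything else.
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable (bias-only): {trainable:,} parameters")
```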


IA³ (2022)

IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations) proposed a more elegant method: instead of adding full adapter layers, it learned small scaling vectors that multiplied existing activations inside the model.

This meant:

  • Extremely lightweight overhead.

  • Easy batching for multi-task finetuning.

  • In some cases, even better performance than full finetuning.


LongLoRA (2023)

One limitation of many LLMs is context length—how much text they can handle at once. Most models are trained with a context window of a few thousand tokens.

LongLoRA adapted LoRA (which we’ll cover in depth in Section 4) to extend context length. By tweaking attention mechanisms and training with long-sequence data, LongLoRA enabled models to handle much larger contexts (e.g., from 4k tokens to 16k tokens or more).

This made it especially useful for tasks like:

  • Handling long code files.

  • Processing legal or scientific documents.

  • Multi-turn dialogue with extended histories.


Soft Prompt–Based Methods

While adapter-based methods add trainable parameters inside the model, soft prompt methods modify the inputs the model sees.

Here’s the difference:

  • Hard prompts = Human-readable text prompts (e.g., “Translate to French:”).

  • Soft prompts = Trainable vectors (continuous embeddings) prepended to the input.

Soft prompts aren’t human-readable, but they act like invisible “hints” that steer the model’s behavior.


Prompt Tuning (2021)

In prompt tuning, a small set of trainable embeddings is added to the input at the embedding layer. During training:

  • The base model stays frozen.

  • Only the soft prompts are updated.

This allows a single base model to be quickly adapted to many tasks by swapping out the soft prompts.
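
A sketch of prompt tuning with the Hugging Face peft library; argument names may differ slightly between versions, and the model is only an example:

```python
from peft import PromptTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Learn 20 continuous "virtual token" embeddings; the base model stays frozen.
config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # typically well under 0.1% of the base model
```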


Prefix Tuning (2021)

Prefix tuning goes a step further: instead of adding tokens only at the input, it prepends soft prompts at every transformer layer. This provides more control over the model’s internal activations.

Prefix tuning is more powerful but also slightly heavier than prompt tuning.


P-Tuning (2021)

P-Tuning is another variant that optimizes where and how soft prompts are injected. These subtle differences in architecture make each method slightly better for certain tasks, but the underlying idea is the same: train a small set of vectors instead of the whole model.


Comparing Adapter vs. Soft Prompt Methods

| Feature | Adapter-Based | Soft Prompt–Based |
| --- | --- | --- |
| Where it acts | Inside transformer layers | Input embeddings (and sometimes layers) |
| Parameter efficiency | 1–5% trainable params | Often <1% trainable params |
| Performance | Strong, near full finetuning | Competitive, but sometimes weaker |
| Modularity | Can stack multiple adapters | Can swap prompts easily |
| Inference speed | Slightly slower (extra layers) | Same as base model |
| Use cases | Domain adaptation, syntax enforcement | Multi-task, lightweight customization |

Both families of methods are actively used today, depending on the application and constraints.


Real-World Adoption

To see what’s popular in practice, researchers analyzed thousands of GitHub issues from the Hugging Face PEFT library (2024). The results showed:

  • LoRA dominates adapter-based methods.

  • Soft prompts are less common but gaining traction among teams that want more flexibility than prompting but less complexity than finetuning.

This aligns with industry trends: LoRA is the go-to choice for finetuning today, thanks to its balance of efficiency and performance.


Why PEFT Matters for Developers

For most application developers, full finetuning is out of reach. Even partial finetuning may be impractical. PEFT bridges the gap by making it possible to:

  • Run finetuning on a single GPU or small cluster, instead of massive infrastructure.

  • Maintain multiple task-specific models by swapping adapters or prompts, rather than training from scratch each time.

  • Deploy cheaper, faster models that still meet production needs.

In other words, PEFT democratizes LLM customization.


Looking Ahead

We’ve covered the two main families of PEFT: adapters and soft prompts. But one method in particular—LoRA (Low-Rank Adaptation)—has become the de facto standard, powering everything from open-source community models to enterprise deployments.

In the next section, we’ll dive deep into LoRA, exploring how it works, why it’s so effective, and the many variants that have sprung from it.


Finetuning Large Language Models: From Foundations to Adaptation

Section 4 – LoRA Deep Dive & Variants


The Emergence of LoRA

Among all the parameter-efficient finetuning (PEFT) methods, LoRA (Low-Rank Adaptation) has emerged as the most widely used. Originally introduced by Hu et al. (2021), LoRA was designed with a simple but powerful idea: instead of updating huge weight matrices directly, approximate their updates using much smaller low-rank matrices.

This approach slashes the number of trainable parameters by orders of magnitude while preserving performance. Today, if you see a community finetuned model on Hugging Face, chances are it was trained using LoRA.

But why did LoRA succeed where others were just “good enough”? To answer that, let’s break down how it works.


The Core Idea of LoRA

Transformer models, like GPT and LLaMA, are filled with linear layers—essentially giant matrices that multiply with vectors to process data. For example:

y = W × x

where:

  • W is a weight matrix (often billions of parameters).

  • x is the input vector.

  • y is the output.

In finetuning, we want to adjust W. But directly storing and updating such a massive matrix is impractical.

LoRA introduces a clever trick:

  • Instead of updating W directly, represent the update ΔW as the product of two smaller matrices, A and B.

W’ = W + ΔW, where ΔW = A × B

Here:

  • A is an m × r matrix.

  • B is an r × n matrix.

  • r is the rank (a small number, often 4, 8, or 16).

Because r is small, A × B has far fewer parameters than W. This is the “low-rank” in Low-Rank Adaptation.
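
A minimal PyTorch sketch of this idea: keep W frozen and learn only the low-rank factors. The scaling follows the common alpha/r convention; real implementations add dropout and per-layer configuration:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B(A(x)), with W frozen and only A, B trainable."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for param in self.base.parameters():   # freeze the pre-trained weight W (and bias)
            param.requires_grad = False
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)     # start as a no-op: delta W = 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(1024, 1024, r=8)
print(layer(torch.randn(4, 1024)).shape)  # torch.Size([4, 1024])
```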


Why LoRA Works: The Intrinsic Dimension Hypothesis

At first glance, you might wonder: How can tiny matrices possibly capture the complexity of massive weight updates?

The answer lies in the intrinsic dimension hypothesis. Research suggests that many high-dimensional tasks can actually be solved in a much lower-dimensional subspace. In other words, even though models have billions of parameters, the meaningful changes for a given task lie in a much smaller space.

LoRA exploits this by only learning updates within that smaller subspace. The base model provides the “knowledge,” and LoRA adapters learn the “specialization.”


Practical Benefits of LoRA

LoRA offers several advantages that explain its widespread adoption:

  1. Efficiency

    • Only a tiny fraction of parameters are trained.

    • Finetuning becomes feasible on a single GPU.

  2. Modularity

    • You can train different LoRA adapters for different tasks.

    • Swap them in and out without retraining the whole model.

  3. Composability

    • Multiple LoRAs can be combined (e.g., one for medical text, another for legal text).

    • This makes models highly customizable.

  4. No Inference Latency Penalty

    • After training, the LoRA update can be merged back into the base weights, so inference remains just as fast as the original model.

These factors made LoRA the go-to choice for researchers and developers alike.
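
In practice, most developers attach LoRA through the Hugging Face peft library rather than writing it by hand. A sketch, using a small model as a stand-in (target module names depend on the architecture):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                         # rank of the update matrices
    lora_alpha=16,               # scaling factor
    target_modules=["c_attn"],   # attention projection; names vary (e.g., "q_proj", "v_proj" for Llama)
    fan_in_fan_out=True,         # GPT-2 stores its Conv1D weights transposed
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # typically a fraction of a percent of the base model

# After training, the low-rank update can be folded into the base weights,
# so serving pays no extra latency.
merged_model = model.merge_and_unload()
```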


Variants of LoRA

Like all good ideas in AI, LoRA quickly inspired numerous extensions and adaptations. Let’s look at the major ones.


Quantized LoRA (QLoRA)

Introduced by Dettmers et al. (2023), QLoRA combined two powerful ideas:

  1. Quantization – Store base model weights in 4-bit precision to reduce memory.

  2. LoRA – Train low-rank adapters on top of the quantized model.

This made it possible to finetune 65B-parameter models on a single 48 GB GPU. Before QLoRA, such large-scale finetuning was reserved for top-tier labs.

QLoRA democratized finetuning by drastically lowering hardware requirements. Today, it’s one of the most practical setups for community-driven projects.
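
A rough sketch of the QLoRA recipe with transformers, bitsandbytes, and peft. The checkpoint name is illustrative, a CUDA GPU is assumed, and flag names may vary across library versions:

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the frozen base model in 4-bit NF4 precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",   # illustrative checkpoint; any causal LM works
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training, then attach trainable LoRA adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```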


LongLoRA

We briefly mentioned LongLoRA earlier. It modifies the attention mechanism to allow models to handle longer sequences during finetuning. Combined with LoRA, it makes training for long-context tasks affordable.

Use cases:

  • Reading entire legal contracts.

  • Summarizing research papers.

  • Multi-turn chatbots with extended histories.


ReLoRA (Restarted LoRA)

ReLoRA addresses a subtle issue: as training progresses, a single low-rank update may stop capturing new information. ReLoRA periodically merges the learned low-rank update into the base weights and then reinitializes the low-rank matrices, allowing the model to keep learning efficiently.

This is especially useful for longer training runs where LoRA might otherwise stagnate.


GaLore (Gradient Low-Rank Projection)

GaLore, a 2024 innovation, applies the low-rank trick not just to weight updates, but directly to gradients. Instead of storing full gradients (which consume huge memory), it projects them into a low-rank space.

This slashes memory requirements for backpropagation and makes finetuning even more efficient.


Multi-LoRA Serving

One of LoRA’s killer features is that you can serve multiple adapters on the same base model. For example:

  • A customer service company could maintain separate LoRAs for retail, banking, and healthcare clients.

  • A developer could train LoRAs for different tasks (summarization, sentiment analysis, translation) and load whichever is needed on demand.

This modularity is similar to having multiple browser extensions installed on the same web browser. The core system remains unchanged; each extension adds a specialization.

Serving frameworks like Hugging Face’s peft library and 🤗 Transformers already support this capability, making LoRA-based deployments production-ready.


Real-World Applications of LoRA

  1. Community Models

    • Hugging Face is filled with LoRA-finetuned models.

    • From roleplay chatbots to code assistants, LoRA makes experimentation accessible.

  2. Enterprise Customization

    • Companies use LoRA to adapt foundation models to domain-specific needs.

    • Example: A financial firm finetuning a Llama model with LoRA for SEC filings and earnings calls.

  3. Education and Research

    • Universities and independent researchers can now finetune models without massive clusters.

    • LoRA has become the default teaching method for “hands-on LLM finetuning.”


Why LoRA Dominates

While other PEFT methods (like adapters and soft prompts) are useful, LoRA has several unique advantages that explain its dominance:

  • Performance – Often matches or beats other methods.

  • Ease of Integration – Works seamlessly with existing architectures.

  • Scalability – Handles small and very large models.

  • Community Momentum – Supported by libraries like Hugging Face PEFT, PyTorch, and ColossalAI.

In fact, many researchers now assume LoRA as the default PEFT baseline in experiments.


Beyond LoRA: The Future of Low-Rank Methods

LoRA is not the end of the story. Several exciting directions are emerging:

  • Dynamic rank selection – Adapting rank r per layer for optimal efficiency.

  • Low-rank pre-training – Training models from scratch with low-rank constraints.

  • Hybrid approaches – Combining LoRA with soft prompts or adapters for maximum flexibility.

The field is evolving fast, but one thing is clear: the low-rank principle has unlocked a new era of scalable, affordable finetuning.


Looking Ahead

We’ve now explored LoRA and its variants in depth. But LoRA isn’t the only alternative to full finetuning. Another powerful approach is model merging, where instead of training adapters, developers directly combine pre-trained models.

In the next section, we’ll cover:

  • Model merging techniques (summing, stacking, concatenation).

  • Finetuning frameworks and hyperparameters.

  • Practical decision-making for when and how to finetune.


Finetuning Large Language Models: From Foundations to Adaptation

Section 5 – Model Merging, Frameworks, Hyperparameters, and Conclusion


Beyond Finetuning: The Idea of Model Merging

So far, we’ve focused on finetuning—whether full, partial, or parameter-efficient. But there’s another emerging paradigm: model merging.

Instead of training adapters or updating weights, model merging combines already finetuned models into a single new model.

Why does this work? Because finetuned models often contain complementary knowledge. By merging, we can capture the strengths of multiple models without retraining from scratch.


Techniques for Model Merging

  1. Weight Averaging (Model Soup)

    • Proposed by Wortsman et al. (2022).

    • Merge multiple models by taking a weighted average of their parameters.

    • Example: If you have three LoRA-finetuned models for different NLP tasks, averaging them can yield a single, more general model (a toy sketch follows this list).

  2. Layer Stacking

    • Stack layers from different models.

    • Example: Lower layers from a language model trained on code + higher layers from a model trained on dialogue.

  3. Concatenation

    • Combine embeddings or intermediate representations from multiple models.

    • More experimental, but promising for multimodal tasks (e.g., text + image).
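
A toy sketch of weight averaging with plain PyTorch state dicts, assuming all models share exactly the same architecture and parameter names:

```python
import torch

def average_state_dicts(state_dicts, weights=None):
    """Weighted average of parameter tensors with matching keys and shapes."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {
        key: sum(w * sd[key] for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# Usage sketch: model_a and model_b are finetuned copies of the same architecture.
# merged.load_state_dict(average_state_dicts([model_a.state_dict(), model_b.state_dict()]))
```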


Real-World Examples

  • SOLAR 10.7B (2023): Built with depth up-scaling, a layer-stacking approach that duplicates and recombines transformer layers from an existing open-weight model before continued pretraining. Despite having only 10.7B parameters, it achieved benchmark performance close to LLaMA-2-70B.

  • On-device models: Small models on smartphones can be “merged” from different finetuned variants, allowing them to perform diverse tasks without huge storage overhead.

Model merging isn’t yet as mature as LoRA, but it shows the future may involve mixing and matching specialized models rather than finetuning one base model repeatedly.


Choosing the Right Base Model

When planning finetuning or merging, the choice of base model is critical.

  • Open-source models: LLaMA, Mistral, Falcon, Mixtral, Gemma.

    • Pros: Customizable, transparent, cheap to run.

    • Cons: May lack the polish of commercial APIs.

  • Proprietary models: OpenAI GPT, Anthropic Claude, Google Gemini.

    • Pros: State-of-the-art performance, strong support.

    • Cons: Limited finetuning access, high costs.

A common strategy:

  • Use open-source models for heavily customized or private data tasks.

  • Use APIs for general-purpose or low-maintenance tasks.


Finetuning Frameworks

Several frameworks make PEFT and LoRA finetuning more accessible:

  1. Hugging Face PEFT

    • Most widely used library for LoRA, QLoRA, prefix tuning, etc.

    • Integrates with Transformers and Accelerate.

  2. Axolotl

    • Community-driven framework for multi-GPU LoRA/QLoRA training.

    • Popular for fine-tuning LLaMA, Mistral, and other open-source models.

  3. LitGPT

    • Lightweight, research-friendly framework.

    • Focused on reproducibility and simplicity.

  4. Unsloth

    • Emerging framework specializing in extremely efficient LoRA training.

    • Often used for rapid experimentation.

  5. Distributed Training Libraries

    • DeepSpeed (Microsoft): Sharding and offloading for large models.

    • ColossalAI: Scales LoRA finetuning efficiently.

    • PyTorch Distributed: General-purpose parallel training.

For most developers, PEFT + Transformers is enough. For larger teams, frameworks like DeepSpeed and ColossalAI unlock scalability.


Hyperparameters That Matter

Finetuning involves many knobs, but a few matter more than others:

  1. Learning Rate

    • Too high → catastrophic forgetting (model loses general knowledge).

    • Too low → underfitting.

    • Rule of thumb: LoRA adapters often use learning rates around 1e-4 to 1e-5.

  2. Batch Size

    • Larger batches → more stable gradients, but higher memory.

    • Smaller batches → noisier but sometimes better generalization.

  3. Sequence Length

    • Longer sequences capture more context but increase memory.

    • Use gradient checkpointing or LongLoRA for efficiency.

  4. Prompt Loss Weighting

    • Controls how much of the training loss comes from prompt tokens versus completion tokens; down-weighting the prompt focuses learning on the response.

  5. Epochs & Early Stopping

    • Overtraining can lead to overfitting or forgetting.

    • Early stopping helps maintain balance.

In practice, hyperparameter tuning is as much art as science. Many teams start with community defaults and adjust based on validation performance.
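
As a starting-point sketch using Hugging Face TrainingArguments, with values drawn from common community defaults rather than recommendations (argument availability depends on your transformers version):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora-finetune",
    learning_rate=1e-4,               # typical LoRA range is roughly 1e-5 to 2e-4
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,    # effective batch size = 4 x 8 = 32
    num_train_epochs=3,
    gradient_checkpointing=True,      # trade extra compute for lower memory
    bf16=True,                        # mixed precision, if the GPU supports it
    logging_steps=10,
    save_strategy="epoch",
)
```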


Practical Development Paths

OpenAI’s Path

OpenAI suggests a progression for developers:

  1. Prompt engineering – Cheapest, simplest.

  2. Few-shot prompting – Better results with examples.

  3. RAG (Retrieval-Augmented Generation) – Bring external knowledge in.

  4. Finetuning (SFT/LoRA) – When you need strict formats or domain alignment.

This ensures you don’t overinvest in finetuning when prompting or RAG would suffice.


Distillation Path

Another popular path is distillation:

  • Train a small student model to mimic a large teacher model.

  • Use finetuned LLMs as teachers.

  • Deploy smaller, faster students for production.

This allows enterprises to get 80–90% of the performance at a fraction of the cost.
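
A minimal sketch of the classic soft-target distillation loss, a temperature-scaled KL divergence between teacher and student logits. This is the standard textbook formulation rather than any particular vendor's recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

loss = distillation_loss(torch.randn(4, 32000), torch.randn(4, 32000))
print(loss.item())
```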


Challenges and Trade-Offs

  1. Data Availability

    • Good finetuning data is rare and expensive.

    • Synthetic data generation (using GPT-4 or Claude) helps but can introduce noise.

  2. Latency and Deployment

    • Adapters can increase latency if not merged back into the model.

    • Some frameworks support merging LoRA weights into base weights for deployment.

  3. When Not to Finetune

    • If the goal is to add new facts → Use RAG.

    • If the goal is structured output or style → Use finetuning.

    • Key mantra: “Finetuning is for form, RAG is for facts.”


The Future of Finetuning

Where is finetuning headed? Several trends are emerging:

  • Low-rank pre-training: Training models from scratch with LoRA-like constraints.

  • Composable adapters: Building “adapter stores” where LoRAs can be mixed like Lego blocks.

  • Model merging at scale: Building models like SOLAR that rival much larger ones.

  • Ultra-low-bit training: BitNet (2024) showed training in 1.58-bit precision can be stable, hinting at massive efficiency gains.

The direction is clear: finetuning will become cheaper, more modular, and more accessible.


Conclusion: A New Era of Customization

We’ve journeyed through the evolution of finetuning:

  • From full finetuning and its memory bottlenecks…

  • To PEFT methods like adapters and soft prompts…

  • To the dominance of LoRA and its variants…

  • To emerging approaches like model merging.

The takeaway is simple: you no longer need a billion-dollar budget to adapt powerful LLMs. With techniques like LoRA and QLoRA, even modest setups can achieve near state-of-the-art results.

For developers and enterprises, the message is empowering:

  • Use prompting and RAG when possible.

  • Use finetuning when you need control, style, or structure.

  • Explore merging and distillation to push efficiency further.

The field is moving fast, but one principle remains constant: foundation models are the clay, and finetuning is how we sculpt them into tools that serve our needs.

Tags: Artificial Intelligence, Generative AI, Agentic AI, Technology, Book Summary
