Tuesday, June 16, 2026

Day out with LM Studio (for running local LLMs)

LM Studio is a popular choice for running local LLMs if you prefer a clean, visual interface over a terminal window. It hides most of the command-line arguments of tools like llama.cpp while still exposing developer controls under the hood.

Setting it up and getting your first model running takes less than 10 minutes.

1. System Check (What Fits?)

Before downloading a massive model that locks up your computer, check your hardware specs. LM Studio relies heavily on VRAM (GPU Memory), with system RAM as a fallback.

Total Available VRAM | Recommended Model Size                             | Best Quantization Format
8 GB                 | 7B - 8B models (e.g., Llama 3 8B)                  | Q4_K_M (practical baseline)
12 GB - 16 GB        | 12B - 14B models (e.g., Gemma 4 12B, Qwen 3.6 14B) | Q4_K_M or Q6_K
24 GB                | 32B - 35B models (e.g., Qwen 3.6 35B MoE)          | Q4_K_M (the sweet spot); Q6_K needs roughly 26 GB at 32B, so it spills into system RAM
48 GB+               | 70B+ models                                        | Q4_K_M at 48 GB; Q8_0 (roughly 75 GB for a 70B model) or BF16 (roughly 140 GB) only with far more memory
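
To sanity-check whether a given model and quantization fit your VRAM, a back-of-the-envelope estimate helps. The sketch below is an approximation, not anything LM Studio itself exposes: the bits-per-weight figures are assumed averages for llama.cpp quantizations, the 1.5 GB headroom is an arbitrary allowance for context and runtime buffers, and `weights_gb` and `fits` are hypothetical helper names.

```python
# Approximate bits per weight for common GGUF quantizations.
# These are assumed averages; real files vary slightly by architecture.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5, "BF16": 16.0}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate the size of the model weights in GB (10^9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * bits / 8  # billions of params * bytes per param


def fits(params_billion: float, quant: str, vram_gb: float,
         headroom_gb: float = 1.5) -> bool:
    """True if the weights plus a fixed headroom fit in the given VRAM."""
    return weights_gb(params_billion, quant) + headroom_gb <= vram_gb


# A few combinations to try; swap in your own model size and VRAM.
for size, quant, vram in [(8, "Q4_K_M", 8), (14, "Q6_K", 16),
                          (32, "Q4_K_M", 24), (70, "Q8_0", 48)]:
    estimate = weights_gb(size, quant)
    print(f"{size}B {quant}: ~{estimate:.1f} GB, "
          f"fits in {vram} GB: {fits(size, quant, vram)}")
```

Treat the result as a first filter only. Longer context windows and other apps using the GPU reduce what is actually available.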

💡 Apple Silicon Note: If you are running an M-series Mac, LM Studio automatically defaults to Apple's MLX runtime. Because Apple Silicon uses unified memory, the GPU draws on the same pool as system RAM, so your total RAM (minus what macOS and other apps need) plays the role of VRAM.

2. Step-by-Step Setup Guide

Step 1: Download and Install (~2 minutes)

Go to lmstudio.ai and download the installer matching your OS (Windows x64/ARM, macOS M-series, or Linux AppImage). Run the installer to open the GUI.

Step 2: Discover and Download a Model (~3-5 minutes)

Click the Search/Discover icon (Magnifying Glass) on the left sidebar. Type in a popular open model like Gemma 4 12B or Qwen 3.6 Coder.

LM Studio will display a list of available Hugging Face files. Look for the green rocket icon next to a file: it indicates that this quantization should fit comfortably within your hardware profile. Click Download.

Step 3: Configure Your Hardware Engine (~1 minute)

Head to the AI Chat view (Bubble icon) and look at the right-hand settings panel. Under Hardware Settings, select your runtime engine:

  • NVIDIA: Choose CUDA 12 llama.cpp.

  • Apple Silicon: Leave it on MLX.

  • AMD/Intel GPU: Choose Vulkan llama.cpp.

  • CPU Only: Choose CPU llama.cpp (if you don't have a dedicated GPU).
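
The mapping in the list above can be written down as a small lookup. This is only an illustrative sketch: `suggest_runtime` and `detect` are hypothetical helpers, not part of LM Studio, and checking for `nvidia-smi` on the PATH is a rough heuristic for the presence of an NVIDIA GPU.

```python
import platform
import shutil


def suggest_runtime(system: str, machine: str,
                    has_nvidia: bool, has_other_gpu: bool) -> str:
    """Map a hardware description to one of the runtime names listed above."""
    if system == "Darwin" and machine == "arm64":
        return "MLX"  # Apple Silicon
    if has_nvidia:
        return "CUDA 12 llama.cpp"
    if has_other_gpu:
        return "Vulkan llama.cpp"  # AMD or Intel GPU
    return "CPU llama.cpp"


def detect() -> str:
    """Best-effort guess for the current machine."""
    # Heuristic: the nvidia-smi tool is normally installed with NVIDIA drivers.
    has_nvidia = shutil.which("nvidia-smi") is not None
    # AMD/Intel detection is platform-specific, so this sketch does not attempt it.
    return suggest_runtime(platform.system(), platform.machine(),
                           has_nvidia, has_other_gpu=False)


print(detect())
```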

Step 4: Adjust GPU Offload and Context (~1 minute)

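If you're using a discrete GPU (like NVIDIA), locate the GPU Offload slider. Drag it to Max to push as many layers of the model into your VRAM as possible. If the model fails to load or the system starts swapping, lower the slider so the remaining layers run on the CPU.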
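
When a model is larger than your VRAM, a rough estimate of how many layers to offload can serve as a starting point for the slider. The sketch below assumes all layers are the same size and reserves a fixed amount of VRAM for context and buffers; both are simplifications, and `layers_on_gpu` is a hypothetical helper rather than anything LM Studio provides.

```python
import math


def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # keep room for context/buffers
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Hypothetical 64-layer model with 19.4 GB of weights on two different cards.
print(layers_on_gpu(19.4, 64, vram_gb=12))
print(layers_on_gpu(19.4, 64, vram_gb=24))
```

If the estimate equals the total layer count, the whole model fits and the Max setting is safe to try first.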

Set your Context Length next (start with 4096 or 8192 tokens). Memory for the context (the KV cache) grows roughly linearly with context length, so doubling the context roughly doubles that part of your VRAM use.
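
To get a feel for what a given context length costs, you can estimate the size of the KV cache, which stores one key and one value vector per layer for every token. The sketch below uses the standard formula for a transformer with grouped-query attention. The shape values (32 layers, 8 KV heads, head dimension 128) are assumptions matching a Llama 3 8B-style model, and 16-bit cache entries are assumed as well.

```python
def kv_cache_bytes(context_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `context_len` tokens."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len


# Assumed Llama 3 8B-style shape: 32 layers, 8 KV heads, head dimension 128.
for ctx in (4096, 8192, 32768):
    gib = kv_cache_bytes(ctx, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"{ctx} tokens: {gib:.2f} GiB")
```

With these assumptions the cache takes 0.5 GiB at 4096 tokens, 1 GiB at 8192, and 4 GiB at 32768, on top of the model weights.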

Step 5: Load and Chat (instant)

At the very top of the window, click the "Select a model to load" dropdown and select your downloaded model. Once the progress bar fills, type your prompt in the bottom text box and enjoy 100% private, offline AI.

3. Power-User Features to Explore Later

Once you have basic chat working, LM Studio has major features designed for software development and local workflows:

Local OpenAI-Compatible Server

Click the Developer tab (Code brackets icon) on the left menu. Here, you can click Start Server to spin up a local API endpoint on localhost:1234. Because it is fully OpenAI-compatible, you can drop this endpoint straight into developer setups, IDE extensions (like Continue or VS Code Copilot alternatives), or local scripts using the standard OpenAI SDK format:

Python
from openai import OpenAI

# The SDK requires an api_key value; "lm-studio" is just a placeholder for the local server.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model", # It automatically targets whatever model is currently loaded
    messages=[{"role": "user", "content": "Write a quick Python sort algorithm."}]
)
print(response.choices[0].message.content)
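
If you would rather pass an explicit model identifier than a placeholder, OpenAI-compatible servers normally list their models at `/v1/models`. The following is a minimal standard-library sketch under that assumption; `model_ids` and `list_models` are hypothetical helper names, and the response shape (`{"data": [{"id": ...}]}`) is the OpenAI list format, assumed here to be what the local server returns.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model identifiers from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:1234/v1",
                timeout: float = 5.0) -> list[str]:
    """Query the local server; raises URLError if the server is not running."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

With the server started, `list_models()` returns the available identifiers, and any one of them can be passed as `model=` in the chat completion call.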

Chat with Documents (Local RAG)

You can attach local text files, PDFs, or code repositories directly into your chat. LM Studio handles the text extraction and local embedding vectorization completely offline, allowing you to ask questions about your private files without data leaking to external servers.
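
For intuition about what "chat with documents" involves, here is a generic retrieval sketch: split the text into chunks, embed each chunk, and pick the chunks closest to the question by cosine similarity. It is not LM Studio's actual pipeline, whose internals are not described here. `toy_embed` is a deliberately crude keyword counter standing in for a real embedding model, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable


def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(question: str, chunks: list[str],
               embed: Callable[[str], list[float]], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Toy stand-in for an embedding model: counts of a few fixed keywords.
VOCAB = ["vram", "context", "server", "quantization"]


def toy_embed(text: str) -> list[float]:
    words = Counter(text.lower().split())
    return [float(words[w]) for w in VOCAB]


docs = ["the server listens on port 1234",
        "context length affects vram use",
        "quantization shrinks model files"]
print(top_chunks("which port does the server use", docs, toy_embed, k=1))
```

In a real setup the retrieved chunks would be pasted into the prompt ahead of the question, and `toy_embed` would be replaced by a call to an actual embedding model.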

LM Link (Remote Workloads)

If you have a powerful machine (like a desktop rig with a great GPU) but want to work from a lightweight laptop on your couch, you can turn on LM Link in your settings. It leverages a secure, end-to-end encrypted mesh network (powered by Tailscale) to let you stream your desktop's heavy model processing directly to your laptop as if it were running locally.

Tags: Large Language Models, Generative AI, Agentic AI
