How to Run a Local LLM on Windows in 2025

Q: What is the difference between 'instruct' and 'base' models?

Instruct models are fine-tuned for conversation and following instructions, making them ideal for chatbots. Base models are the raw, foundational models, good for text completion but not inherently conversational.

Q: How do I check how much VRAM I have on Windows?

Open Task Manager (Ctrl + Shift + Esc), go to the Performance tab, click on your GPU, and check the 'Dedicated GPU Memory' value.

Q: Can I use my AMD or Intel GPU for local LLMs?

Yes. Tools like LM Studio and Ollama support AMD/Intel GPUs via DirectML on Windows. Make sure your graphics drivers are up to date.

Want ChatGPT-style AI that runs entirely on your PC? Private, fast, and available offline? If you're new to this, start with our guide on what is a local LLM to understand the basics. This practical guide walks you through the best options for Windows (from one-click apps to power-user toolchains), the hardware you really need, and the exact commands to get going.

TL;DR: Three easy paths

LM Studio (easiest GUI): Download the Windows app, pick a model, click Run. Great if you want zero terminal commands and a polished chat UI—plus an optional local API for your own apps.
Ollama (simple CLI + API): Install once, then ollama run llama3.1 and chat in your terminal or via an OpenAI-compatible local server at http://localhost:11434. Add Open WebUI for a nice browser UI.
llama.cpp (advanced & ultra-portable): Use the ultra-lean C/C++ runtime for blazing-fast inference with GGUF models; perfect for tinkerers and fine-grained control. Pair with Text-Generation-WebUI or Open WebUI for a front-end.

What you’ll need (hardware & OS)

Windows 11 or Windows 10 (recent builds).
GPU VRAM matters most. Local LLMs are memory-hungry; enough VRAM prevents spillover into slower system RAM. A helpful rule of thumb: target ~1.2× the model’s on-disk size in available VRAM for smooth operation. If you only have 8–12 GB VRAM, choose 3B–7B models or use more aggressive quantization.
No GPU? You can still run smaller models on CPU; it’s just slower. LM Studio and Ollama both support CPU fallback.

Quick picks by VRAM

8–12 GB: Qwen-2.5 3B/7B, Phi-4-Mini, Llama-3.x 1B–3B, Gemma-2 2B–9B (quantized)
16–24 GB: Llama-3.1 8B, Mistral-7B, Qwen-2.5 14B (quantized)
24 GB+: Larger 14B–32B models and generous context windows (still consider quantized builds)

For detailed GPU performance analysis and buying recommendations, check out our complete GPU guide for local LLM inference.

Option 1 - LM Studio (best “download & go” experience)

Why choose it: You get a clean Windows installer, a curated model browser (pulls from Hugging Face), a chat UI, prompt templates, and an optional local server that mimics the OpenAI API for your own apps.

Steps:

Download and install LM Studio for Windows.
Open Discover, search for a model (e.g., “llama”, “qwen”, “gemma”, “phi”), and click Download.
Click Run. You can chat immediately; or enable the built-in server/API if you want to call it from scripts.

Tips:

Start with a 3B–7B Instruct model for general chat.
If responses are slow, try a smaller or more quantized build (e.g., Q4_0/Q4_K).
Check LM Studio’s docs for system requirements and API usage.

Option 2 - Ollama (simple CLI + REST API, great with web UIs)

Why choose it: Lightweight install, dead-simple commands, runs a local server at localhost:11434, and integrates nicely with community UIs like Open WebUI. For a deeper dive into other tools, check out our complete guide to Ollama alternatives.

Steps:

Install Ollama for Windows (no admin required; installs in your home folder).
Open Terminal (or PowerShell) and run a model, e.g.:
```
ollama run llama3.1
```
You’ll get an interactive chat. Swap llama3.1 for other models (e.g., mistral, qwen2.5, gemma2, phi4-mini). See model pages in the Ollama Library to browse options and tags.
Use Ollama’s HTTP API at http://localhost:11434 from your apps, or connect a UI (next section).

Housekeeping: By default, models live under your user profile. If space is tight, the Windows docs explain changing binary/model locations.

Add a friendly UI: Open WebUI (works with Ollama)

If you prefer a browser UI with chats, histories, prompts and tools, Open WebUI is a popular choice and now installs cleanly via pip:

pip install open-webui
open-webui serve

Then point it at your running Ollama server and you’re set. The docs recommend Python 3.11 and show the exact commands.

Option 3 - llama.cpp (power-user route + ultimate portability)

Why choose it: Maximum control and minimal overhead. llama.cpp runs quantized GGUF models efficiently on Windows. You can use it via CLI or through front-ends like Text-Generation-WebUI or Open WebUI.

Key points & steps:

Use GGUF models. llama.cpp requires models in the GGUF format (you can download GGUFs directly or convert from other formats if needed).
Download a compatible model (from Hugging Face, etc.) and run with llama.cpp’s CLI tools; the README shows command examples and how to auto-download from hubs.
Prefer a GUI? Text-Generation-WebUI offers a portable Windows build (download, unzip, run) and supports GGUF models.

GPU acceleration on Windows: what actually works in 2025

DirectML via ONNX Runtime (cross-vendor): Microsoft’s DirectML Execution Provider in ONNX Runtime can accelerate inference on any DirectX 12-capable GPU (NVIDIA, AMD, Intel). If a tool you use supports ONNX/DirectML, you’ll get broad hardware coverage on Windows.
NVIDIA CUDA (Linux toolchains on Windows): Some advanced frameworks remain Linux-first. If you want those with full CUDA speed, run them under WSL2 with NVIDIA’s CUDA driver for WSL2. It’s well-documented and supports popular CUDA libraries.

Bottom line: For mainstream Windows workflows, LM Studio and Ollama are easiest. If you need Linux-native stacks (TensorRT-LLM, bespoke pipelines), WSL2 + CUDA is the reliable path.

Picking the right model (and size)

Start small, scale up: A good starter is a 3B–7B Instruct model (e.g., Qwen-2.5-Instruct 7B, Mistral-7B-Instruct, Phi-4-Mini). Move to Llama-3.1 8B or Gemma-2 9B if you have ≥16 GB VRAM. For coding-specific tasks, check out our guide to the best local LLMs for coding. Library pages for Llama-3.1 and other families, like the one in our DeepSeek-V3.1 review, list sizes/tags and are easily pulled by Ollama/LM Studio.
Quantization matters: Smaller quantized builds (e.g., 4-bit GGUF variants) slash VRAM usage with minimal quality loss for casual chat. That’s why GGUF is so popular for local inference.
Context window vs. speed: Long contexts need more memory. If responses get sluggish, reduce max tokens/context, or choose a more compact model. The VRAM-first guidance from recent Windows AI testing is a good compass.

Performance & stability tips

Mind disk space: Models can be tens to hundreds of GB. Ollama’s Windows docs explain how to move binaries/models off C: to another drive.
Stay within VRAM: If you’re stalling or paging, pick a smaller/quantized model or shorten prompts/context. Recent Windows tests show performance cliffs once you exceed VRAM.
Prefer “Instruct” models for chat. “Base” models often need system prompts to behave conversationally.
Use a proper power plan & updated GPU drivers. Prevent throttling and make sure DX12/CUDA stacks are current (especially if using WSL2 + CUDA).

Troubleshooting

“Where did my models go?” On Ollama, they’re under your user directory by default; Windows docs cover changing locations.
Open WebUI won’t start? Ensure Python 3.11 and that you used open-webui serve (package renamed; follow the docs).
Slow responses on CPU: That’s expected with larger models. Try a 1B–3B GGUF or upgrade to a GPU-backed setup (DirectML/WSL2-CUDA depending on your stack).

Is WSL2 worth it in 2025?

Yes, if you want Linux-native tooling with full NVIDIA CUDA performance on a Windows machine. Microsoft and NVIDIA both publish active guides for enabling CUDA inside WSL2, and many advanced LLM stacks assume Linux. For most users though, LM Studio or Ollama + Open WebUI on native Windows is plenty.

Final picks

Non-technical users: LM Studio → Download → Discover → Run.
Developers & hobbyists: Ollama (CLI + API) + Open WebUI (UI).
Tinkerers: llama.cpp + GGUF, optionally via Text-Generation-WebUI.

With the right model size for your VRAM, a modern Windows PC can deliver fast, private LLMs, no cloud required.

Frequently Asked Questions

Which local LLM tool is easiest to use on Windows?

LM Studio is the most user-friendly option for Windows users. It provides a graphical interface, one-click model downloads, and doesn't require any command-line knowledge. Just download the Windows app, pick a model, and start chatting.

Do I need a GPU to run local LLMs on Windows?

While not strictly required, a GPU with at least 8GB VRAM is recommended for good performance. You can run smaller models (3B-7B parameters) on CPU, but responses will be slower. Modern GPUs like the RTX 3060 (8GB) or better provide the best experience.

How much VRAM do I need for local LLMs?

For entry-level use, 8-12GB VRAM lets you run 3B-7B parameter models. 16-24GB VRAM supports larger 8B-14B models. As a rule of thumb, you need about 1.2× the model's file size in VRAM for optimal performance.

Which Windows version do I need?

Windows 10 (recent builds) or Windows 11 are recommended. Both support modern GPU acceleration via DirectML and CUDA. Windows 11 offers better WSL2 integration if you need Linux-based tools.

Can I run these tools completely offline?

Yes, all three recommended tools (LM Studio, Ollama, and llama.cpp) work completely offline once models are downloaded. This makes them ideal for privacy-focused users and environments without internet access.

What is the difference between "instruct" and "base" models?

Instruct models are fine-tuned for conversation and following instructions. They are the best choice for chatbots and general Q&A. You should almost always start with an "instruct" or "chat" version of a model.
Base models are the raw, foundational models that are trained on a massive dataset of text. They are good at completing text but are not inherently conversational. They are primarily used as a foundation for further fine-tuning.

How do I check how much VRAM I have on Windows?

You can easily check your VRAM in the Task Manager:

Press Ctrl + Shift + Esc to open the Task Manager.
Go to the Performance tab.
Click on your GPU in the left panel.
Look for the Dedicated GPU Memory value. This is your VRAM.

Can I use my AMD or Intel GPU for local LLMs?

Yes, absolutely. While NVIDIA GPUs with CUDA have historically had the best support, the situation has improved significantly.

LM Studio and Ollama have good support for AMD and Intel GPUs through technologies like DirectML (on Windows) and ROCm/OpenCL (on Linux).
llama.cpp also supports various GPU backends, including OpenCL and Vulkan, which can leverage AMD and Intel hardware.

For the best experience, make sure your graphics drivers are up to date.