
Ollama VRAM Requirements: Complete 2025 Guide to GPU Memory for Local LLMs
You've downloaded Ollama. You're ready to run powerful AI models locally: complete privacy, zero API costs, full control. But then reality hits: your GPU chokes, generation crawls at a few tokens per second, you're stuck with a cramped context window, or worse, the model won't load at all. The culprit? VRAM.
Video RAM isn't just another spec on your GPU box; it's the single biggest factor deciding whether you get 40-80 tokens/second or watch your system struggle through every response. Pick the wrong model size, ignore quantization settings, or miscalculate your context window needs, and you'll be stuck with unusable performance or memory crashes.
This guide eliminates the guesswork. We've benchmarked model configurations on real hardware, measuring exact VRAM consumption across different context lengths and offloading scenarios. You'll learn precisely which models your GPU can handle, how to squeeze maximum performance from limited VRAM, when KV cache quantization actually helps, and what to do when your hardware falls short.
Whether you're coaxing a budget 4GB GPU into running any model smoothly, wondering whether your 8GB RTX 4060 can handle 8B models with long contexts, or planning a serious local AI workstation build, this data-driven guide gives you the exact numbers, optimization strategies, and real-world benchmarks you need to make informed decisions.
Links to other essential VRAM articles on LocalLLM.in will help you dive deeper into advanced memory strategies.
TL;DR: How Much VRAM Do You Need for Ollama?
Entry-level (3-4GB VRAM): Can run 3-4B parameter models with Q4_K_M quantization at moderate context windows (4k tokens comfortably).
Mid-range (6-8GB VRAM): Supports 7-9B models like Llama 3.1 8B and Qwen3 8B at Q4_K_M quantization. The sweet spot for most users delivering 40+ tokens/second.
Upper mid-range (10-12GB VRAM): Ideal for 12-14B models like Gemma 3 12B, Qwen3 14B, and Phi-4 14B at Q4_K_M quantization with extended context windows for reliable workflows.
High-end (16-24GB VRAM): Supports 22-35B models like Gemma 3 27B, Qwen3 32B, GPT-OSS 20B, and DeepSeek R1 32B at Q4_K_M quantization.
Workstation (48GB+ VRAM): Required for 70B+ models like Llama 3.3 70B and Qwen2.5 72B at Q4_K_M quantization (or 2×24GB GPUs).

For ultimate precision in planning your Ollama setup, use the interactive VRAM calculator designed specifically for local LLM deployment to estimate your exact memory requirements based on model size, quantization, and context length.
Understanding Ollama's VRAM Architecture
Ollama leverages GPU acceleration through llama.cpp's CUDA and ROCm backends, dynamically managing VRAM allocation and compute resources. Three primary factors determine total VRAM consumption:
Model Weights: The core parameters loaded into VRAM. Quantization (Q4, Q5, Q6, Q8) dramatically impacts memory usage: lower-bit quantization means smaller VRAM requirements. Q4_K_M represents the memory-efficiency gold standard for consumer hardware.
KV Cache: A portion of VRAM is reserved for the attention (key-value) cache. The KV cache grows linearly with context length, making longer contexts increasingly memory intensive. With the default FP16 KV cache precision, an 8B model at 32K context requires approximately 4.5 GB for the KV cache alone. Quantizing the KV cache (e.g., to 8-bit or 4-bit) can cut its footprint by 50% or more, but it deserves more caution than weight quantization: aggressive quantization of the KV cache, particularly of the key vectors, can degrade output quality because attention is sensitive to quantization error.
System Overhead: The backend (llama.cpp, CUDA, Vulkan, ROCm) adds roughly 0.5-1GB of overhead for model infrastructure, graph allocation, and runtime operations. A rough estimate combining all three factors follows below.
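To make the three factors concrete, here is a minimal back-of-the-envelope sketch for Qwen3 8B at Q4_K_M with a 16K context. The layer count, KV-head count, head dimension, and overhead figure are assumptions for illustration (the 5.2 GB file size comes from the benchmark later in this guide); requires bc:
# Assumptions: 36 layers, 8 KV heads, head dim 128, f16 KV cache (2 bytes per value),
# ~0.75 GB runtime overhead
weights_gb=5.2                                        # Q4_K_M GGUF file size
kv_gb=$(echo "2*36*8*128*2*16384 / 1024^3" | bc -l)   # K and V for every token, layer, and head
printf "estimated total VRAM: about %.1f GB\n" "$(echo "$weights_gb + $kv_gb + 0.75" | bc -l)"
The estimate comes out a little above the ~7.2 GB actually measured on an RTX 4060 later in this guide, which is the right direction to err in when deciding whether a model will fit.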
Ollama VRAM Requirements: Comprehensive Tables
Base VRAM Requirements by Model Size
| Model Size | Quantization | Est. File Size | Rec. VRAM | Example Models |
|---|---|---|---|---|
| 3B – 4B | Q4_K_M | 2.0 – 2.5 GB | 3 – 4 GB | Llama 3.2 3B, Qwen3 4B |
| 7B – 9B | Q4_K_M | 4.5 – 6.0 GB | 6 – 8 GB | Llama 3.1 8B, Qwen3 8B |
| 12B – 14B | Q4_K_M | 7.5 – 9.0 GB | 10 – 12 GB | Gemma 3 12B, Qwen3 14B, Phi-4 14B, DeepSeek R1 14B |
| 22B – 35B | Q4_K_M | 14 – 21 GB | 16 – 24 GB | Gemma 3 27B, Qwen3 32B, GPT-OSS 20B, DeepSeek R1 32B |
| 70B – 72B | Q4_K_M | 40 – 43 GB | 48 GB (or 2×24 GB) | Llama 3.3 70B, Qwen2.5 72B |
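The file sizes above are estimates. Once a model is pulled, you can check the real numbers on your own machine; the model name below is just an example:
ollama list            # GGUF file size on disk roughly equals the VRAM needed for the weights alone
ollama show qwen3:8b   # architecture, parameter count, quantization level, and context length
Add the KV cache and overhead figures from the previous section on top of the file size to judge whether a model fits your GPU.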
Ollama KV Cache Memory Usage by Context Length
In Ollama, total VRAM usage is the sum of model weights plus the key-value (KV) cache, and the chart above shows that KV cache memory grows almost perfectly linearly with context length for 4B, 8B, and 70B models. Because the KV cache stores key and value vectors for every token, layer, and attention head, its per-token memory cost scales with both model size and precision. That is why larger models see KV cache usage jump from a few hundred megabytes at short contexts to many gigabytes at long contexts, often becoming the dominant contributor to VRAM beyond the base model weights. It is also why Ollama's docs and community guides emphasize KV cache configuration and quantization as a primary lever for fitting bigger models or longer contexts into fixed GPU VRAM.

Looking at the numbers from the chart, a 4B model needs roughly 0.2 GB of KV cache at a 2K context and about 3 GB at 32K, while an 8B model climbs from ~0.3 GB (2K) to ~5 GB (32K) and ~20 GB (128K) of KV cache alone; a 70B model jumps from ~1.6 GB at 2K to well over 42 GB at 128K. These figures make it easy to approximate KV cache memory per token for your own setup, and they show why long contexts are impossible on smaller GPUs unless you reduce precision, lower the context, or shrink the effective KV cache size, most commonly by switching OLLAMA_KV_CACHE_TYPE from f16 to q8_0 or q4_0. That change can cut KV cache usage by roughly half or more and bring total VRAM down enough to run high-context 7-13B models on 8-12 GB cards. Expect some loss of output quality with KV cache quantization, however, since the cache is more sensitive to quantization than the model weights are.
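Extending the earlier estimation sketch across context lengths reproduces the shape of the chart. The layer and head counts below assume a Qwen3-8B-style layout and are approximations:
layers=36; kv_heads=8; head_dim=128; bytes=2     # bytes=2 corresponds to the default f16 cache
for ctx in 2048 8192 32768 131072; do
  gib=$(echo "2 * $layers * $kv_heads * $head_dim * $bytes * $ctx / 1024^3" | bc -l)
  printf "context %6d: ~%.1f GiB KV cache (roughly half that with q8_0)\n" "$ctx" "$gib"
done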
Practical VRAM Test: Ollama Performance Benchmark on RTX 4060
To show Ollama's VRAM usage in practice, here are concrete performance numbers from a representative mid-range configuration:
Test Configuration: RTX 4060 (8GB VRAM)
- Hardware: NVIDIA RTX 4060, 8GB VRAM, Windows 11
- Model: Qwen 3 8B Q4_K_M quantization (5.2GB model file size)
- Context Length: 16k tokens
- Acceleration: Full GPU offload (100% layers on GPU)
- Task: Summarize local LLM article (~3,000 prompt tokens) in 500 words
Performance Results:
total duration: 35.2442185s
load duration: 3.4705931s
prompt eval count: 2957 token(s)
prompt eval duration: 1.4059591s
prompt eval rate: 2103.19 tokens/s
eval count: 1225 token(s)
eval duration: 30.1853166s
eval rate: 40.58 tokens/s
- Tokens/second: 40.58 (excellent real-time responsiveness)
- Time to first token: ~1.4s (prompt evaluation) followed by rapid generation
- VRAM usage: ~7.2GB with 16k context (from task manager)
- GPU utilization: 100% (optimal hardware usage)
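These figures come straight from Ollama's built-in timing output. To collect the same statistics on your own hardware, run the model with the --verbose flag and watch VRAM from a second terminal; the context value below matches this test:
ollama run qwen3:8b --verbose
# inside the chat session, raise the context window before pasting the prompt:
# /set parameter num_ctx 16384
# in a second terminal, track memory and utilization:
nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv -l 2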

The RTX 4060 with 8GB VRAM comfortably handles 7-8B models in Q4_K_M quantization. Check out our detailed guide to best local LLMs for 8GB VRAM systems for specific model recommendations and benchmarks.
How Ollama Handles VRAM and GPU Offload
Ollama's GPU offloading system is sophisticated yet transparent to users. Understanding how it works helps you optimize performance:
Automatic Layer Offloading
Ollama automatically determines how many model layers to offload to the GPU based on available VRAM. Each model consists of multiple transformer layers; Qwen3 8B, for example, has 36 layers.
When VRAM is sufficient, Ollama loads 100% of layers to GPU for maximum performance. When VRAM is constrained, it splits layers between GPU (fast) and system RAM (slow), processing some on GPU and others on CPU.
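You can see which split Ollama chose for a loaded model with ollama ps; the PROCESSOR column reports how the model is divided between GPU and CPU. The output below is illustrative, and the exact columns and values may vary by Ollama version:
ollama ps
# NAME        ID           SIZE      PROCESSOR    UNTIL
# qwen3:8b    <model id>   6.9 GB    100% GPU     4 minutes from now
# a VRAM-constrained machine would instead show something like "30%/70% CPU/GPU"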
Manual Layer Control
Advanced users can manually control layer offloading via the num_gpu parameter in a Modelfile:
FROM llama3.1:8b
PARAMETER num_gpu 25
PARAMETER num_thread 8
This forces Ollama to load exactly 25 layers to GPU and the remainder to CPU. However, Ollama's automatic detection is generally optimal.
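To try it, build a named variant from the Modelfile with ollama create and run it like any other model; the name partial-offload below is just an example:
ollama create partial-offload -f ./Modelfile
ollama run partial-offload --verbose   # --verbose shows the speed impact of fewer GPU layers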
Practical Example: RTX 4060 with Manual Layer Control
Setting num_gpu 25 on the same RTX 4060 8GB system demonstrated significant VRAM reduction compared to automatic full-offload (4.8GB vs 7.2GB). However, the performance penalty was substantial:
- Test Configuration: Qwen 3 8B Q4_K_M @ 16K context with num_gpu 25
- VRAM Usage: 4.8GB (saving 2.4GB vs full GPU offload)
- Performance: 8.62 tokens/sec (4.7x slower than full GPU's 40.58 tokens/sec)
Use manual layer control strategically: it's valuable for fitting larger models into constrained VRAM, but expect significant speed trade-offs compared to full GPU acceleration.

CPU-Only Alternative (num_gpu 0):
For true CPU-only inference on a 12th Gen Intel Core i7-12700H, setting num_gpu 0 resulted in:
- VRAM usage: 0GB (pure CPU inference)
- Prompt processing: 20.77 tokens/sec (reasonably quick for CPU)
- Generation speed: 5.61 tokens/sec (acceptable for batch workloads)
- Context note: tested with the default truncated context (4k) and a condensed version of the article
This shows CPU-only mode as a viable fallback when VRAM constraints rule out all GPU options, though significantly slower than even partial GPU offloading.
Performance Impact of Automatic RAM Overflow
When a model exceeds available VRAM and Ollama automatically "spills" layers to system RAM (different from manual num_gpu control), performance degrades dramatically:
- Full GPU: 40-45 tokens/s
- Partial offload (70% GPU, 30% CPU): 10-15 tokens/s
- CPU-only fallback: 3-6 tokens/s
Tests show models running with RAM overflow can be 5-30x slower than models fitting entirely in VRAM; in our benchmarks, Qwen 3 8B dropped from 40 tokens/s (fully in VRAM) to just 8 tokens/s with only 25 of its 36 layers on the GPU.
The bottleneck is that overflowed layers run on the much slower CPU and system memory, with data shuttled between system RAM and VRAM over the comparatively narrow PCIe bus.
Ollama VRAM Optimization Strategies
Maximizing performance on limited hardware requires strategic optimization:
- Choose Optimal Quantization
Q4_K_M is the sweet spot for most users, offering the best balance of quality, speed, and memory efficiency. Going lower (Q3, Q2) causes significant quality degradation and unpredictable behavior. Going higher (Q6, Q8) requires substantially more VRAM for diminishing quality improvements.
Quantization quality hierarchy:
- Q2: Smallest, severe quality loss, not recommended
- Q3: Small, noticeable quality degradation
- Q4_K_M: Recommended - optimal quality/size ratio
- Q5_K_M: Excellent quality, moderate size increase
- Q6_K: Near-original quality, larger VRAM needs
- Q8_0: Minimal quality loss, 2x VRAM vs Q4
For a deep dive into what quantization actually is and how different methods work, check out our comprehensive quantization explained guide.
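Ollama pulls a default quantization when you request a bare model name, but most library models also expose explicit quantization tags. The tag strings below are typical examples; check the model's Tags page on ollama.com for the exact names:
ollama pull llama3.1:8b                  # default tag, typically Q4_K_M
ollama pull llama3.1:8b-instruct-q8_0    # explicit higher-precision variant
ollama show llama3.1:8b                  # confirms which quantization you actually have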
- Reduce Context Length
Context window directly impacts KV cache size. If running into VRAM limits, reduce context from default:
ollama run qwen3:8b
# In chat: /set parameter num_ctx 4096
Reducing context from 8K to 4K can save 0.2-0.4GB VRAM for 7-8B models.
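If you drive Ollama through its HTTP API instead of the CLI, the same setting is passed per request in the options object. A minimal sketch against the default local endpoint:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Summarize the benefits of KV cache quantization in two sentences.",
  "options": { "num_ctx": 4096 },
  "stream": false
}'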
- Enable Flash Attention
Flash Attention is an optimization that reduces VRAM usage and increases inference speed with zero quality degradation. Enable it via environment variable:
macOS/Linux:
export OLLAMA_FLASH_ATTENTION=1
Windows PowerShell:
$env:OLLAMA_FLASH_ATTENTION=1
Flash Attention is a prerequisite for KV cache quantization and will likely become the default in future Ollama versions.
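Note that on Linux installs where Ollama runs as a systemd service, exporting the variable in your shell does not affect the server process. The usual fix, per the Ollama FAQ, is to add it to the service unit:
sudo systemctl edit ollama.service
# add the following in the editor, then save:
# [Service]
# Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl daemon-reload && sudo systemctl restart ollama
On Windows, set it as a user environment variable and restart the Ollama application instead.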
- Use KV Cache Quantization
KV cache quantization is revolutionary for VRAM optimization. It quantizes the attention cache to lower precision:
# Q8_0 (recommended): halves KV cache memory
export OLLAMA_KV_CACHE_TYPE=q8_0
# Q4_0 (aggressive): reduces KV cache to 1/3 size
export OLLAMA_KV_CACHE_TYPE=q4_0
KV cache quantization can, however, cause noticeable quality loss, especially at the more aggressive q4_0 setting.
- Consider CPU-Only for Specific Workloads
While GPU acceleration is vastly superior, CPU-only inference can work for:
- Batch processing: Overnight jobs where high latency and low throughput are acceptable
- Very large models: When you have abundant system RAM (64-256GB) but limited VRAM
- Testing/development: Quick model evaluation without GPU requirements
Optimization tips for CPU inference:
- Use 4-6 threads maximum, not all available cores (see the Modelfile sketch after this list)
- Ensure the model fits entirely in RAM to avoid disk swapping
- Prefer smaller models (8B Q4) over larger models with aggressive quantization
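A minimal sketch of a CPU-only variant that applies these tips through a Modelfile; the model choice, file name, and thread count are illustrative:
cat > Modelfile.cpu <<'EOF'
FROM qwen3:8b
PARAMETER num_gpu 0
PARAMETER num_thread 6
EOF
ollama create qwen3-cpu -f Modelfile.cpu && ollama run qwen3-cpu --verbose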
GPU Compatibility and Multi-GPU Support
Refer to the official Ollama GPU guide for complete details.
NVIDIA GPUs
Ollama supports NVIDIA via CUDA with compute capability 5.0+ and driver version 531+.
Supported Architectures:
- 12.0 (Blackwell): GeForce RTX 50xx series; Professional RTX PRO 4000–6000 Blackwell.
- 9.0 (Hopper): H100, H200.
- 8.9 (Ada Lovelace): GeForce RTX 40xx series; Professional L4, L40, RTX 6000.
- 8.6–8.0 (Ampere): GeForce RTX 30xx series; A100, A30, A40, A10, A16, A2.
- 7.5 (Turing): GeForce RTX 20xx/GTX 1650 Ti, TITAN RTX; Professional T4, RTX 5000–3000, T-series; Quadro RTX 8000–4000.
- 7.0 (Volta): TITAN V, V100, Quadro GV100.
- 6.1–5.0 (Pascal/Maxwell): TITAN Xp/X; GeForce GTX 10xx/9xx/7xx series; Quadro/Tesla P/M/K-series.
Multi-GPU: Set CUDA_VISIBLE_DEVICES to comma-separated IDs or UUIDs (from nvidia-smi -L). Force CPU: -1.
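For example, to pin the server to two specific cards before loading a large model (device IDs are placeholders; list yours with nvidia-smi -L):
export CUDA_VISIBLE_DEVICES=0,1   # or the GPU UUIDs reported by nvidia-smi -L
ollama serve                      # restart the server so the setting takes effect
Ollama will then spread a large model's layers across the combined VRAM of the visible GPUs.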
Linux Suspend/Resume Fix: Reload driver with sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm.
AMD GPUs
Ollama uses ROCm for AMD support.
Linux:
- Radeon RX: 7900–6800 series, Vega 64.
- Radeon PRO: W7900–W6800 series, V620/V420/V340/V320, Vega II Duo/SSG.
- Instinct: MI300–MI60 series.
Windows (ROCm v6.1):
- Radeon RX: 7900–6800 series.
- Radeon PRO: W7900–W6800 series, V620.
Multi-GPU: Set ROCR_VISIBLE_DEVICES to comma-separated IDs or UUIDs (from rocminfo). Force CPU: -1.
SELinux (Linux Containers): Enable with sudo setsebool container_use_devices=1.
Overrides for Unsupported (Linux): Use HSA_OVERRIDE_GFX_VERSION="x.y.z" (e.g., "10.3.0" for gfx1034 like RX 5400). Per-device: HSA_OVERRIDE_GFX_VERSION_N=....
Supported LLVM Targets:
| Target | Example GPU |
|---|---|
| gfx908 | Radeon Instinct MI100 |
| gfx90a | Radeon Instinct MI210 |
| gfx940 | Radeon Instinct MI300 |
| gfx941 | |
| gfx942 | |
| gfx1030 | Radeon PRO V620 |
| gfx1100 | Radeon PRO W7900 |
| gfx1101 | Radeon PRO W7700 |
| gfx1102 | Radeon RX 7600 |
Apple Devices
Ollama supports GPU acceleration on Apple devices via the Metal API.
Vulkan (Experimental)
Extends support for additional AMD/Intel GPUs on Linux/Windows. Enable: OLLAMA_VULKAN=1.
Linux Setup:
- Intel: See Intel dGPU docs.
- AMD: See AMDGPU Vulkan guide. Add the ollama user to the render group if needed.
VRAM Scheduling: Requires root or sudo setcap cap_perfmon+ep /usr/local/bin/ollama; otherwise, uses model size approximations.
Multi-GPU: Set GGML_VK_VISIBLE_DEVICES to numeric IDs. Disable: -1.
Ollama System Requirements and Compatibility
Windows Requirements
System Requirements
Windows 10 22H2 or newer (Home or Pro)
Optional GPU acceleration:
NVIDIA GPU → Driver version 452.39 or newer
AMD Radeon GPU → Latest AMD Radeon driver
Terminals must support Unicode characters (some older Windows 10 terminal fonts may show squares)
Filesystem Requirements
~4 GB for Ollama binaries
Tens to hundreds of GB for LLM models
Installs to home directory by default
Can relocate binaries (/DIR= flag)
Can relocate models via OLLAMA_MODELS environment variable
Key Locations
Models & config: %HOMEPATH%\.ollama
Logs & updates: %LOCALAPPDATA%\Ollama
Binaries: %LOCALAPPDATA%\Programs\Ollama
macOS Requirements
System Requirements
macOS Sonoma (14+)
Apple Silicon (M-series) → CPU + GPU acceleration
Intel Mac → CPU only (no GPU acceleration)
Filesystem Requirements
Models require tens to hundreds of GB
Installed app can reside anywhere, but:
CLI must be in PATH (/usr/local/bin/ollama) or symlinked
Default data locations:
~/.ollama - models & config
~/.ollama/logs - logs
<Ollama.app>/Contents/Resources/ollama - CLI binary
Linux Requirements
System Requirements
Linux (AMD64 or ARM64)
Optional GPU acceleration:
NVIDIA GPU → CUDA drivers installed
AMD GPU → ROCm v6 + compatible AMD drivers
Filesystem Requirements
Models need tens to hundreds of GB
Default directories:
/usr/bin/ollama - binary
/usr/lib/ollama - libraries (may vary)
~/.ollama - models & config
/usr/share/ollama - data directory when running as a service (Linux-specific)
Driver Notes
NVIDIA: must install CUDA & verify with nvidia-smi
AMD:
The kernel's built-in amdgpu driver is often older than required
Recommended: install latest AMD ROCm driver for full support
Conclusion
VRAM is a hard boundary, not a soft limit. Fit your model entirely in VRAM or accept performance that's 5-20x slower. Every benchmark in this guide confirms this reality.
You don't need expensive hardware for capable local LLMs. An 8GB card delivers 40+ tokens/second with 7-8B models at Q4_K_M, fast enough for real work, though agentic or vibe coding tasks will demand more VRAM because they need larger models and longer context windows. If buying new hardware: 8GB covers the essentials, 12GB adds flexibility, 16-24GB enables 30B+ models, and 48GB handles 70B workstation inference.
For existing hardware, Q4_K_M quantization, context management, and KV cache optimization extend capabilities significantly. But be realistic: partial GPU offloading kills performance, and aggressive KV cache quantization degrades quality. Sometimes choosing a model that matches your VRAM beats fighting an impossible configuration.
Frequently Asked Questions
Can Ollama run without a GPU or with low VRAM?
Yes, Ollama runs entirely on CPU without a GPU, though performance is significantly slower. For low VRAM situations, use aggressive quantization (Q3 or Q4) and stick to smaller parameter models (3-7B). CPU inference delivers 3-6 tokens/s on modern processors, acceptable for batch jobs but frustrating for interactive use.
What happens if my model exceeds available VRAM?
Ollama automatically offloads excess layers to system RAM when a model exceeds VRAM. While this prevents crashes, the performance penalty is substantial, tests show speeds dropping 5-20x when overflowing into RAM.
Does Ollama work with AMD GPUs or only NVIDIA?
Ollama supports both NVIDIA and AMD GPUs. NVIDIA GPUs offer the widest compatibility and best performance via mature CUDA support. AMD support relies on the ROCm library, which is more complex to set up and covers a narrower range of GPUs. Many AMD users report success with workarounds like HSA_OVERRIDE_GFX_VERSION for unsupported cards.
What is the best quantization level for Ollama?
Q4_K_M is the recommended quantization for most users: it offers the best balance of quality, VRAM efficiency, and speed. Q5_K_M provides marginally better quality at a 15-20% higher VRAM cost. Q6_K and Q8_0 approach original model quality but require substantially more VRAM for diminishing returns. Avoid Q3 and Q2 unless absolutely necessary; quality degradation becomes severe.
How much VRAM do I need to run Ollama models locally?
For most users, 8-12GB VRAM is the sweet spot for running Ollama models locally. This allows you to run 7-8B parameter models like Llama 3.1 8B and Qwen 3 8B with Q4_K_M quantization at 40+ tokens/second. Entry-level GPUs with 4-6GB VRAM can handle smaller 3-4B models, while 16-24GB VRAM supports larger 13-32B models with higher quality quantization and extended context windows.
What is Q4_K_M quantization and why is it recommended for Ollama?
Q4_K_M quantization compresses model weights to 4-bit precision, reducing VRAM requirements by approximately 75% compared to full FP16 precision while maintaining excellent output quality. It represents the optimal balance between memory efficiency, inference speed, and model performance for consumer hardware. For example, an 8B model in Q4_K_M uses around 5-6GB instead of 16GB, making it the gold standard for local LLM deployment with Ollama.
Can I use multiple GPUs with Ollama for larger models?
Yes, Ollama supports multi-GPU configurations for NVIDIA and AMD cards. For NVIDIA, set CUDA_VISIBLE_DEVICES to comma-separated GPU IDs to distribute model layers across multiple GPUs. This enables running 70B models on dual 24GB GPUs (48GB total) that wouldn't fit on a single card. For AMD GPUs, use ROCR_VISIBLE_DEVICES with the same approach to leverage combined VRAM across multiple cards.