
How to Increase Context Length of Local LLMs in Ollama

Context window size is a critical parameter when working with language models in Ollama. It determines how much previous conversation or text the model can consider when generating responses. This article clarifies how to check, set, and manage context window sizes in Ollama, addressing common points of confusion.

Ollama Default Context Window Size

By default, Ollama uses a context window size of 4096 tokens for most models. The gpt-oss model has a default context window size of 8192 tokens. However, these defaults can be overridden as needed.

How to Check Context Window Size in Ollama

There are two important context-related values to understand:

  1. Model's Maximum Context Length: The maximum context length the model can support, determined by its architecture.
  2. Current Context Length: The actual context length being used in the current session.

Checking Model's Maximum Context Length in Ollama

To see the maximum context length a model can support:

ollama show <model_name>

For example, to check the context length for qwen3:8b:

ollama show qwen3:8b

This will display information including:

context length      40960
embedding length    4096

The "context length" value shows the maximum context window size the model can support, not what's currently being used.

Checking Current Context Length in Ollama

To see the actual context length being used in a running model:

ollama ps

This shows currently loaded models with their actual context length:

NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL
qwen3:8b       abc123def456    6.5 GB   100% GPU     8192       4 minutes from now

The "CONTEXT" column shows the actual context length being used for the running model.

Please note: ollama ps only prints these details if the model is currently loaded in memory. Load the model first by running ollama run <model_name> or by starting a chat from the GUI, and then run the ollama ps command.
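
For example, using the model from earlier in this article:

# Load the model into memory (exit the interactive prompt with /bye when you're done)
ollama run qwen3:8b

# In another terminal, or after exiting, check the loaded model's context length
ollama ps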

Setting Context Window Size

There are several ways to set the context window size in Ollama:

Using the Ollama GUI (Windows and macOS)

Screenshot: the context length setting in the Ollama GUI on Windows.

  1. Open the Ollama application
  2. Go to Settings
  3. Adjust the context window size setting
  4. Restart Ollama for the changes to take effect

Using Environment Variables

To set the default context window for all models:

OLLAMA_CONTEXT_LENGTH=8192 ollama serve
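
That command sets the variable only for that one ollama serve process. If Ollama runs as a systemd service on Linux, a rough sketch of setting it on the service instead looks like this:

# Open an override file for the Ollama service
sudo systemctl edit ollama.service

# Add these lines in the editor that opens, then save and exit:
#   [Service]
#   Environment="OLLAMA_CONTEXT_LENGTH=8192"

# Reload systemd and restart Ollama so the new default takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama

On Windows, setting OLLAMA_CONTEXT_LENGTH as a user environment variable and restarting the Ollama app has the same effect.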

Using the CLI

When running a model interactively:

/set parameter num_ctx 8192
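
Here is a minimal sketch of what that looks like in an interactive session; /show parameters prints the parameters currently set for the session, and /bye exits:

ollama run qwen3:8b
>>> /set parameter num_ctx 8192
>>> /show parameters
>>> /bye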

Using the API

When making API calls:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 8192
  }
}'
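
The chat endpoint accepts the same option. A minimal sketch using /api/chat (the model name and prompt are just examples):

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Why is the sky blue?" }
  ],
  "options": {
    "num_ctx": 8192
  }
}'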

For detailed Windows installation and setup instructions, check our complete guide to running LLMs on Windows.

Persisting Context Window Settings

Changes made with /set parameter num_ctx during an interactive session are temporary and will not persist after the model is unloaded from memory. To make permanent changes to the context window size, you need to create a new custom model with your desired configuration. This ensures that every time you run your model, it will use the larger context length without having to manually set it each time.
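
A quick alternative, if you prefer to stay in the CLI, is to set the parameter in an interactive session and save the result as a new model with /save (the same approach mentioned in the FAQ below). A minimal sketch:

ollama run qwen3:8b
>>> /set parameter num_ctx 8192
>>> /save qwen3-8b-8k
>>> /bye

The Modelfile approach described next does the same thing, but keeps the configuration in a file you can reuse and edit.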

Step 1: Create a New, Simple Modelfile

Create a new Modelfile in any directory where you can run terminal commands. You can give it a different name to avoid confusion, for example Modelfile.custom.

In this new file, you only need two lines:

  • FROM qwen3:8b: This tells Ollama to use your existing qwen3:8b model as the starting point.
  • PARAMETER num_ctx <value>: This is where you set the new context length.

For example, let's say you want to set the context length to 8192 tokens. Create a file named Modelfile.custom and put the following content inside it:

# My custom Modelfile to increase context length
FROM qwen3:8b

# Set the context window size (e.g., 4096, 8192, 16384, etc.)
PARAMETER num_ctx 8192

Step 2: Create a New Model with Your Custom Modelfile

Now, open your terminal (PowerShell or Command Prompt) in the same directory where you saved Modelfile.custom.

Run the ollama create command. You need to give your new, modified model a new name. For example, let's call it qwen3:8b-8k.

ollama create qwen3:8b-8k -f Modelfile.custom

Screenshot: the ollama create command building a custom model with an increased context length from the Modelfile.

Let's break down that command:

  • ollama create: The command to build a new model.
  • qwen3:8b-8k: The new name for your model with the larger context.
  • -f Modelfile.custom: This flag tells Ollama to use the file Modelfile.custom for the instructions.

Ollama will quickly create a new model layer. It will not re-download the 5.2 GB model; it just applies your new configuration on top of the existing one.

Step 3: Verify Your Changes

You can now verify that your new model has the updated context length.

List your models:

ollama list

Screenshot: ollama list output showing both the original qwen3:8b and the new custom qwen3:8b-8k model.

You will now see both the original and your new custom model:

NAME              ID              SIZE      MODIFIED
qwen3:8b          500a1f067a9f    5.2 GB    6 days ago
qwen3:8b-8k       ...             5.2 GB    seconds ago

Show the new model's information:

ollama show qwen3:8b-8k

This time, the output will include the num_ctx parameter you set!

...
PARAMETER num_ctx 8192
...
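
You can also ask ollama show for just the parameters, or for the full Modelfile Ollama generated, using flags available in recent Ollama versions:

ollama show qwen3:8b-8k --parameters    # prints only the PARAMETER lines
ollama show qwen3:8b-8k --modelfile     # prints the complete generated Modelfile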

Step 4: Use Your New Model

You can now run your new model with its extended context window:

ollama run qwen3:8b-8k

Screenshot: the newly created qwen3:8b-8k model running with an 8192-token context window.

Notice that memory usage has now increased to 7.5 GB. With the default 4K context, the model used 6.5 GB and fit entirely in VRAM. In other words, the 4K increase in context length costs about 1 GB of extra memory. In this case, 6193 MiB was loaded into GPU VRAM and the rest spilled over into system RAM.

Ollama loaded 93% of the new model's layers into the GPU's fast VRAM, where most of the heavy lifting and fast processing happens. Because the GPU's VRAM wasn't large enough to hold the entire model, Ollama automatically offloaded the remaining 7% of the layers to regular system RAM, where the CPU processes them.
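
You can confirm this split yourself: ollama ps shows the CPU/GPU percentage for each loaded model, and on NVIDIA cards nvidia-smi (part of the NVIDIA driver tools) reports how much VRAM is actually in use:

ollama ps      # PROCESSOR column shows the CPU/GPU split for each loaded model
nvidia-smi     # shows total and used VRAM on the GPU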

Ollama Context Length vs Embedding Length

When you run ollama show <model_name>, you'll see two different values:

context length      40960
embedding length    4096

Here's a screenshot of the "ollama show qwen3:8b" command I ran locally on a Windows machine:

Screenshot: ollama show output displaying the model's context length and embedding length.

Context Length

  • What it is: The maximum number of tokens the model can consider when generating a response.
  • Purpose: Determines how much previous conversation or text the model can reference.
  • Impact: A larger context window allows the model to consider more information but requires more memory and may slow down processing.
  • Adjustability: Can be set up to the model's maximum context length.

Embedding Length

  • What it is: The dimension of the vector representation of tokens used by the model.
  • Purpose: Determines the size of the semantic space in which tokens are represented.
  • Impact: Affects the model's ability to capture nuanced relationships between tokens but is fixed by the model's architecture.
  • Adjustability: Cannot be changed as it's a fundamental characteristic of the model.

Relationship Between Context Length and Memory Usage

Memory usage in language models is influenced by both context length and embedding dimensions, but the relationship is more complex than simple multiplication. When you increase the context length, the model needs to store more information about each token in the conversation, primarily through a mechanism called the KV cache. This cache grows as the context gets longer, requiring additional memory.
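
As a rough, back-of-the-envelope illustration (the exact figures depend on the model architecture and the cache precision): the KV cache needs roughly 2 × layers × context_length × kv_heads × head_dim × bytes_per_value of memory. For a hypothetical model with 36 layers, 8 KV heads of dimension 128, and a 16-bit cache (2 bytes per value), an 8192-token context works out to 2 × 36 × 8192 × 8 × 128 × 2 ≈ 1.2 GB for the KV cache alone, and halving the context roughly halves that figure.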

The embedding length affects how richly each token is represented in the model's understanding. Larger embedding dimensions allow for more nuanced representations but require more memory for each individual token.

However, memory usage depends on many factors beyond just these two values. The model's architecture, number of layers, attention mechanisms, and precision settings all play important roles. For example, the attention mechanism creates memory requirements that grow more rapidly than linearly with context length.

In practice, longer contexts generally require more memory, and models with larger embedding dimensions tend to use more memory per token, but the exact relationship varies significantly between different model architectures and implementations.

Using Ollama with Frontend Interfaces

When using frontend interfaces like OpenWebUI:

  • Context window settings in the frontend apply only when you interact with the model through that interface
  • These settings do not change the default context window in Ollama itself
  • If you query the same model directly through the Ollama API without passing identical parameters, Ollama treats it as a new model request
  • This may cause the model to reload with different parameters than the ones set in the frontend

Performance Considerations

Setting the context length higher may cause the model to not fit entirely onto the GPU, which will make it run more slowly. You can check where your model is loaded using:

ollama ps

Here's a screenshot of the "ollama ps" command we ran locally on a Windows machine:

Screenshot: ollama ps output showing the current context length and memory usage.

The PROCESSOR column will show:

  • "100% GPU" means the model was loaded entirely into the GPU
  • "100% CPU" means the model was loaded entirely in system memory
  • "48%/52% CPU/GPU" means the model was loaded partially onto both

🧮 Use our VRAM Calculator to see exactly how context length affects your memory requirements and find the optimal balance for your hardware.

Qwen3 8B is a great model to run locally. It has a 40K context window in Ollama and delivers about 42 TPS on an RTX 4060 when plugged in, and about 35 TPS on average on battery. For GPUs with 8 GB of VRAM, 8B 4-bit LLMs offer an impressive balance of accuracy and performance. Here we have compared the best local LLMs for 8 GB VRAM.

Troubleshooting

My Context Window Settings Aren't Taking Effect

  1. Verify you're setting the context window correctly using one of the methods above
  2. Check the current context length with ollama ps
  3. Remember that changes made with /set parameter are temporary unless you save them to a new model
  4. If using a frontend interface, check if the interface is overriding your settings

How to Tell if My Model Was Loaded with the Correct Context Size

Use ollama ps to see the current context length being used by running models. This shows the actual context length in use, not just the model's maximum capability.
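
For example, a quick verification pass with the custom model created earlier might look like this (ollama stop, available in recent Ollama versions, unloads the model when you're done):

ollama run qwen3:8b-8k "hello"   # load the model with a one-shot prompt
ollama ps                        # confirm the CONTEXT column reads 8192
ollama stop qwen3:8b-8k          # unload the model from memory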

Conclusion

Understanding and properly managing context window size in Ollama is essential for getting optimal performance from your language models. By using the correct commands to check and set context length, and understanding the difference between a model's maximum context length and its current setting, you can better control how your models operate and ensure they're configured appropriately for your specific use cases.

FAQ

  1. How do I check the maximum context length of a model in Ollama?

    Run ollama show <model_name> and look for the "context length" field. This value tells you the maximum number of tokens the model can handle in one session. For example, ollama show qwen3:8b may return context length 40960.

  2. How do I see the current context length being used in a running Ollama session?

    Use ollama ps to view active models and their context size. The CONTEXT column shows the session's actual context length, which may be smaller than the maximum supported by the model. This only works if the model is already running in memory.

  3. How can I increase the context window size in Ollama?

    Set the num_ctx parameter when launching or running a model. You can do this by running /set parameter num_ctx <value> in the CLI, setting OLLAMA_CONTEXT_LENGTH as an environment variable, adjusting the context size in the Ollama GUI, or passing "options": {"num_ctx": <value>} in API calls.

  4. How do I make a custom context length persistent in Ollama?

    Save it as a new model or define it in a Modelfile. Session changes with /set parameter num_ctx are temporary. To persist, either run /save my-custom-model after setting the parameter or create a Modelfile with PARAMETER num_ctx and build with ollama create.

  5. What is the difference between context length and embedding length in Ollama?

    Context length is how many tokens the model can consider; embedding length is the size of token vectors. Context length is adjustable up to the model's maximum and affects how much prior conversation is remembered. Embedding length is fixed by architecture and determines the dimensionality of token representations.