Run OpenClaw with Local Models via Ollama

Running an AI agent on your own hardware sounds straightforward — until Ollama silently truncates your prompt and the model starts producing garbage. I tested three tiers of GPU hardware and found three configuration settings that make the difference between a broken agent and a working one. This guide has the configs and commands. The video covers the full journey, including the failures.

Watch the video for the full test journey — including the iGPU that timed out, the €225 GPU running an actual agent, and the security implications of giving small models access to your smart home:

Hardware Tested

Tier	GPU	VRAM	Price	Best Model	Agent Capable?
0	Intel Arc iGPU (Meteor Lake)	Shared	€0	Qwen3 4B	No
1	RTX 3060 12GB	12 GB	~€225 second-hand	Qwen3 14B	Yes (~36s avg)
2	RTX 3090 24GB	24 GB	~€900 second-hand	Qwen3 Coder 30B	Yes (~3–11s avg)

Host: Intel NUC 14 Pro (Core Ultra 5 125H), Proxmox, Thunderbolt 4
eGPU dock: Aoostar AG02 (USB4, ~€120). Any Thunderbolt eGPU dock works. A desktop with a PCIe slot works too.

Model Performance on RTX 3090

Model	Type	VRAM	Avg Response	Notes
gpt-oss 20B	Dense	13 GB	2.7s	Fastest, but struggles with complex chains
Qwen3 14B	Dense	11.5 GB	10.0s	Most detailed answers
Qwen3 Coder 30B	MoE (3.3B active)	19.5 GB	10.9s	Best for complex tool chains
GLM 4.7 Flash	MoE (3.6B active)	20 GB	11.9s	Highest tool-use rate

All four models handle tool calling reliably on the 3090 with the configuration below.

RTX 3060 vs 3090

Both GPUs can drive an agent — the difference is whether you'd use it interactively or as a background worker. The 3060 runs the same tasks 3 to 10 times slower, which makes it a poor fit for anything you'd stand around waiting for, but ideal for scheduled tasks like a morning email summary. The video shows exactly what that looks like in practice with real demos and response times.

1. Install Ollama

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:14b

2. Set the Context Window

Ollama defaults to a 4096-token context window. OpenClaw's system prompt is around 11,500 tokens. At the default setting, Ollama silently truncates the prompt and the model never sees your tool definitions or skill instructions.

Create a modelfile with a larger context:

RTX 3060 (16K context):

cat > /tmp/Modelfile << 'EOF'
FROM qwen3:14b
PARAMETER num_ctx 16384
EOF

ollama create qwen3-16k -f /tmp/Modelfile

RTX 3090 (32K context):

cat > /tmp/Modelfile << 'EOF'
FROM qwen3-coder
PARAMETER num_ctx 32768
EOF

ollama create qwen3-coder-32k -f /tmp/Modelfile

VRAM impact (Qwen3 14B)

num_ctx	KV Cache	Total VRAM	Fits 12 GB?
4096 (default)	~640 MiB	~9.5 GiB	Yes
16384	~2.5 GiB	~11.3 GiB	Yes (tight)
32768	~5 GiB	~14 GiB	No

3. Ollama Environment Variables

export OLLAMA_KV_CACHE_TYPE=q8_0    # Quantize KV cache, saves ~50% VRAM
export OLLAMA_NUM_GPU=99             # All layers on GPU
export OLLAMA_MAX_LOADED_MODELS=1    # Unload previous model first
export OLLAMA_HOST=0.0.0.0           # Allow LAN access

If Ollama runs as a systemd service, add these to /etc/systemd/system/ollama.service.d/override.conf or equivalent.

4. OpenClaw Configuration

Three settings make the difference between a broken agent and a working one: the API mode, the base URL format, and the context window value.

Save to ~/.openclaw/openclaw.json. This example uses Qwen3 Coder on an RTX 3090 with 32K context. For an RTX 3060, use qwen3:14b (or your custom qwen3-16k) and set contextWindow to 16384 instead. You can add more models to the models array if you want to switch between them.

{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434",
        "apiKey": "dummy",
        "api": "ollama",
        "models": [
          {
            "id": "qwen3-coder",
            "name": "Qwen3 Coder 30B",
            "reasoning": false,
            "input": ["text"],
            "cost": { "input": 0, "output": 0 },
            "contextWindow": 32768,
            "maxTokens": 4096
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen3-coder"
      },
      "compaction": {
        "mode": "safeguard",
        "reserveTokensFloor": 8000,
        "memoryFlush": {
          "enabled": true,
          "softThresholdTokens": 4000
        }
      }
    }
  }
}

What these settings do

api: "ollama" uses Ollama's native /api/chat endpoint. The OpenAI-compatible /v1 endpoint doesn't reliably support streaming + tool calling together, which produces malformed tool calls as plain text.

baseUrl without /v1 — the native Ollama API doesn't use the /v1 path. Including it breaks tool calling silently.

contextWindow must match the num_ctx in your Ollama modelfile. A mismatch causes either silent prompt truncation (Ollama side) or overly aggressive compaction (OpenClaw side).

reasoning: false disables thinking mode. On local models, this interferes with tool calling.

compaction.reserveTokensFloor: 8000 prevents OpenClaw from filling the entire context window, leaving room for the model's response.

mode: "merge" keeps both local and cloud providers available. Only needed if you set up the cloud fallback in step 5.

5. Cloud Fallback

The best local model I tested scored 20 out of 53 on the Artificial Analysis Intelligence Index. That's less than halfway to what a frontier model delivers. For multi-step reasoning and complex tasks, local models still fall short.

OpenClaw supports automatic fallback: if the local model fails a request (timeout, error, malformed output), it tries the next provider in the list. Add a fallbacks array to your agent config:

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen3-coder",
        "fallbacks": ["anthropic/claude-sonnet-4-6"]
      }
    }
  }
}

This keeps private queries on your local model. When something is too complex for it, the request escalates to the cloud automatically.

For this to work, the cloud provider needs to be registered in the models.providers block alongside Ollama. Add it to the config from step 4:

{
  "models": {
    "mode": "merge",
    "providers": {
      "ollama": {
        "...": "your existing Ollama config from step 4"
      },
      "anthropic": {
        "apiKey": "sk-ant-..."
      }
    }
  }
}

The "mode": "merge" setting is what makes both providers available at the same time. You can also set up provider credentials during initial setup with openclaw onboard.

6. iGPU: SYCL Docker Image

If you have an Intel iGPU and want to run a local chatbot (not an agent), I published a Docker image that compiles llama.cpp with SYCL support. Find it on GitHub:

GitHub - richardvenneman/llama-cpp-sycl: Run llama.cpp with the SYCL backend on Intel iGPUs (Meteor Lake, Arrow Lake, Arc)

Run llama.cpp with the SYCL backend on Intel iGPUs (Meteor Lake, Arrow Lake, Arc) - richardvenneman/llama-cpp-sycl

github.com/richardvenneman/llama-cpp-sycl

Qwen3 4B runs at ~8.5 tokens/sec on Intel Arc (Meteor Lake). Usable for basic chat. Not enough for an agent like OpenClaw.

7. Security

Smaller local models are significantly more vulnerable to prompt injection than frontier models. The large cloud models spend a substantial portion of their training on recognizing and refusing malicious instructions. A 14B or 30B parameter model doesn't have that luxury — there simply aren't enough parameters to dedicate to safety alongside capability.

What does that mean in practice? A hidden instruction in a document, email, or website — white text on a white background, an invisible prompt buried in metadata — and your agent does whatever it's told. It could read files, call APIs, or execute commands on your behalf without you ever seeing the instruction.

This isn't hypothetical. It's the trade-off you accept when running local models with tool access. See the OpenClaw security docs for more detail on the threat model.

Mitigations:

Always run the built-in security audit before exposing your agent to real data:

openclaw security audit --deep
openclaw security audit --fix

Keep the gateway on your local network. Don't expose it to the internet. And be aware that the hybrid fallback (step 5) only catches technical failures — not a successful-but-compromised response from a manipulated model.

Quick Reference

Ollama:

ollama pull qwen3:14b
ollama list
ollama create <name> -f <file>
ollama serve

OpenClaw:

openclaw onboard
openclaw status
openclaw doctor --fix
openclaw security audit --deep
openclaw logs --follow

Config: ~/.openclaw/openclaw.json

Troubleshooting

Problem	Solution
Model produces garbage	Create a modelfile with higher `num_ctx`
Tool calls appear as plain text	Set `"api": "ollama"`, remove `/v1` from baseUrl
Agent says it can't access Home Assistant	Increase `num_ctx`, verify with `OLLAMA_DEBUG=1`
Responses take minutes	Model spilling to CPU, use a smaller model or more VRAM
"LLM request timed out"	iGPU too slow for agent prompts, need a dedicated GPU
Context disappears mid-conversation	Match `contextWindow` in OpenClaw to `num_ctx` in Ollama
Out of memory	Lower `num_ctx` or set `OLLAMA_KV_CACHE_TYPE=q8_0`