Running an AI agent on your own hardware sounds straightforward — until Ollama silently truncates your prompt and the model starts producing garbage. I tested three tiers of GPU hardware and found three configuration settings that make the difference between a broken agent and a working one. This guide has the configs and commands. The video covers the full journey, including the failures.
Watch the video for the full test journey — including the iGPU that timed out, the €225 GPU running an actual agent, and the security implications of giving small models access to your smart home:
Hardware Tested
Tier | GPU | VRAM | Price | Best Model | Agent Capable? |
|---|---|---|---|---|---|
0 | Intel Arc iGPU (Meteor Lake) | Shared | €0 | Qwen3 4B | No |
1 | RTX 3060 12GB | 12 GB | ~€225 second-hand | Qwen3 14B | Yes (~36s avg) |
2 | RTX 3090 24GB | 24 GB | ~€900 second-hand | Qwen3 Coder 30B | Yes (~3–11s avg) |
Host: Intel NUC 14 Pro (Core Ultra 5 125H), Proxmox, Thunderbolt 4
eGPU dock: Aoostar AG02 (USB4, ~€120). Any Thunderbolt eGPU dock works. A desktop with a PCIe slot works too.
Model Performance on RTX 3090
Model | Type | VRAM | Avg Response | Notes |
|---|---|---|---|---|
gpt-oss 20B | Dense | 13 GB | 2.7s | Fastest, but struggles with complex chains |
Qwen3 14B | Dense | 11.5 GB | 10.0s | Most detailed answers |
Qwen3 Coder 30B | MoE (3.3B active) | 19.5 GB | 10.9s | Best for complex tool chains |
GLM 4.7 Flash | MoE (3.6B active) | 20 GB | 11.9s | Highest tool-use rate |
All four models handle tool calling reliably on the 3090 with the configuration below.
RTX 3060 vs 3090
Both GPUs can drive an agent — the difference is whether you'd use it interactively or as a background worker. The 3060 runs the same tasks 3 to 10 times slower, which makes it a poor fit for anything you'd stand around waiting for, but ideal for scheduled tasks like a morning email summary. The video shows exactly what that looks like in practice with real demos and response times.
1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:14b
2. Set the Context Window
Ollama defaults to a 4096-token context window. OpenClaw's system prompt is around 11,500 tokens. At the default setting, Ollama silently truncates the prompt and the model never sees your tool definitions or skill instructions.
Create a modelfile with a larger context:
RTX 3060 (16K context):
cat > /tmp/Modelfile << 'EOF'
FROM qwen3:14b
PARAMETER num_ctx 16384
EOF
ollama create qwen3-16k -f /tmp/Modelfile
RTX 3090 (32K context):
cat > /tmp/Modelfile << 'EOF'
FROM qwen3-coder
PARAMETER num_ctx 32768
EOF
ollama create qwen3-coder-32k -f /tmp/Modelfile
VRAM impact (Qwen3 14B)
num_ctx | KV Cache | Total VRAM | Fits 12 GB? |
|---|---|---|---|
4096 (default) | ~640 MiB | ~9.5 GiB | Yes |
16384 | ~2.5 GiB | ~11.3 GiB | Yes (tight) |
32768 | ~5 GiB | ~14 GiB | No |
3. Ollama Environment Variables
export OLLAMA_KV_CACHE_TYPE=q8_0 # Quantize KV cache, saves ~50% VRAM
export OLLAMA_NUM_GPU=99 # All layers on GPU
export OLLAMA_MAX_LOADED_MODELS=1 # Unload previous model first
export OLLAMA_HOST=0.0.0.0 # Allow LAN access
If Ollama runs as a systemd service, add these to /etc/systemd/system/ollama.service.d/override.conf or equivalent.
4. OpenClaw Configuration
See also the OpenClaw local models docs and the Ollama blog post on OpenClaw.
Three settings make the difference between a broken agent and a working one: the API mode, the base URL format, and the context window value.
Save to ~/.openclaw/openclaw.json. This example uses Qwen3 Coder on an RTX 3090 with 32K context. For an RTX 3060, use qwen3:14b (or your custom qwen3-16k) and set contextWindow to 16384 instead. You can add more models to the models array if you want to switch between them.
{
"models": {
"providers": {
"ollama": {
"baseUrl": "http://localhost:11434",
"apiKey": "dummy",
"api": "ollama",
"models": [
{
"id": "qwen3-coder",
"name": "Qwen3 Coder 30B",
"reasoning": false,
"input": ["text"],
"cost": { "input": 0, "output": 0 },
"contextWindow": 32768,
"maxTokens": 4096
}
]
}
}
},
"agents": {
"defaults": {
"model": {
"primary": "ollama/qwen3-coder"
},
"compaction": {
"mode": "safeguard",
"reserveTokensFloor": 8000,
"memoryFlush": {
"enabled": true,
"softThresholdTokens": 4000
}
}
}
}
}
What these settings do
api: "ollama" uses Ollama's native /api/chat endpoint. The OpenAI-compatible /v1 endpoint doesn't reliably support streaming + tool calling together, which produces malformed tool calls as plain text.
baseUrl without /v1 — the native Ollama API doesn't use the /v1 path. Including it breaks tool calling silently.
contextWindow must match the num_ctx in your Ollama modelfile. A mismatch causes either silent prompt truncation (Ollama side) or overly aggressive compaction (OpenClaw side).
reasoning: false disables thinking mode. On local models, this interferes with tool calling.
compaction.reserveTokensFloor: 8000 prevents OpenClaw from filling the entire context window, leaving room for the model's response.
mode: "merge" keeps both local and cloud providers available. Only needed if you set up the cloud fallback in step 5.
5. Cloud Fallback
The best local model I tested scored 20 out of 53 on the Artificial Analysis Intelligence Index. That's less than halfway to what a frontier model delivers. For multi-step reasoning and complex tasks, local models still fall short.
OpenClaw supports automatic fallback: if the local model fails a request (timeout, error, malformed output), it tries the next provider in the list. Add a fallbacks array to your agent config:
{
"agents": {
"defaults": {
"model": {
"primary": "ollama/qwen3-coder",
"fallbacks": ["anthropic/claude-sonnet-4-6"]
}
}
}
}
This keeps private queries on your local model. When something is too complex for it, the request escalates to the cloud automatically.
For this to work, the cloud provider needs to be registered in the models.providers block alongside Ollama. Add it to the config from step 4:
{
"models": {
"mode": "merge",
"providers": {
"ollama": {
"...": "your existing Ollama config from step 4"
},
"anthropic": {
"apiKey": "sk-ant-..."
}
}
}
}
The "mode": "merge" setting is what makes both providers available at the same time. You can also set up provider credentials during initial setup with openclaw onboard.
6. iGPU: SYCL Docker Image
If you have an Intel iGPU and want to run a local chatbot (not an agent), I published a Docker image that compiles llama.cpp with SYCL support. Find it on GitHub:
Qwen3 4B runs at ~8.5 tokens/sec on Intel Arc (Meteor Lake). Usable for basic chat. Not enough for an agent like OpenClaw.
7. Security
Smaller local models are significantly more vulnerable to prompt injection than frontier models. The large cloud models spend a substantial portion of their training on recognizing and refusing malicious instructions. A 14B or 30B parameter model doesn't have that luxury — there simply aren't enough parameters to dedicate to safety alongside capability.
What does that mean in practice? A hidden instruction in a document, email, or website — white text on a white background, an invisible prompt buried in metadata — and your agent does whatever it's told. It could read files, call APIs, or execute commands on your behalf without you ever seeing the instruction.
This isn't hypothetical. It's the trade-off you accept when running local models with tool access. See the OpenClaw security docs for more detail on the threat model.
Mitigations:
Always run the built-in security audit before exposing your agent to real data:
openclaw security audit --deep
openclaw security audit --fix
Keep the gateway on your local network. Don't expose it to the internet. And be aware that the hybrid fallback (step 5) only catches technical failures — not a successful-but-compromised response from a manipulated model.
Quick Reference
Ollama:
ollama pull qwen3:14b
ollama list
ollama create <name> -f <file>
ollama serve
OpenClaw:
openclaw onboard
openclaw status
openclaw doctor --fix
openclaw security audit --deep
openclaw logs --follow
Config: ~/.openclaw/openclaw.json
Troubleshooting
Problem | Solution |
|---|---|
Model produces garbage | Create a modelfile with higher |
Tool calls appear as plain text | Set |
Agent says it can't access Home Assistant | Increase |
Responses take minutes | Model spilling to CPU, use a smaller model or more VRAM |
"LLM request timed out" | iGPU too slow for agent prompts, need a dedicated GPU |
Context disappears mid-conversation | Match |
Out of memory | Lower |