Four posts in and we have a working local LLM that responds in about a second per turn -- competitive with Anthropic's API on speed, on a 5-year-old laptop, for $0/month. If you've followed along you've:
- Sized up your hardware and learned the memory math
- Installed Ollama and picked the right Qwen variant
- Wired Claude Code at it with one shell function
- Dropped response time from 86 seconds to 1.1 with --bare and a tiny proxy
But responding to "say hi" in 1.1 seconds isn't the goal. The goal is doing actual work. Tool calling. Reading files. Running ripgrep. Editing code. The stuff you actually use Claude Code for.
This post is about whether local Qwen can do those things -- which I had been quietly doubting -- and how to opt a real project into using the local backend without breaking the cloud fallback.
The question I'd been ducking
Across this whole series I'd been a little hand-wavy about tool calling. The frontier models (Claude Sonnet, Opus, GPT-5) emit beautifully-formatted tool calls every time. Smaller open-source models have historically been worse at this -- they'd describe what they were going to do in plain English instead of actually emitting a tool-use block, or they'd produce malformed JSON that crashed the parser, or they'd get stuck in loops re-trying a failed call without learning from the failure.
I assumed Qwen3.6-35B-A3B, at "only" 35 billion parameters, would have the same problem. I was wrong, and the proof was funny enough to be worth telling.
The weather test
I'd been testing my local setup with prompts like "Say READY" and "Say STEP 2" -- prompts where the model can't really demonstrate anything except that it's awake. To see if it could actually use tools, I asked it something it shouldn't know:
$ claudel -p "what's the weather like today in cleveland?"
This is --bare mode, no skills, no plugins, no MCP servers. Just the basic Bash, Read, and Edit tools that bare mode keeps available. Qwen has no built-in weather data -- it would have to actually go fetch the information.
The response came back:
Right now in Cleveland, it's sunny ☀️
- Temperature: 74°F (23°C)
- Humidity: 71%
- Wind: SSW at 8 mph ...
My first reaction was "those numbers are remarkably plausible -- I bet the model just hallucinated typical mid-May Cleveland weather and got lucky."
I was wrong. To verify, I re-ran with --output-format stream-json --verbose --include-partial-messages so I could see every event the model emitted. Buried in the stream:
{
  "type": "tool_use",
  "name": "Bash",
  "input": {
    "command": "curl -s 'wttr.in/Cleveland?format=%C+%t+%h+%w'",
    "description": "Fetch current weather in Cleveland"
  }
}
And the tool result that came back:
{
  "type": "tool_result",
  "content": "Sunny +74°F 71% ↗8mph"
}
The model decided -- on its own, without being told there was a weather API -- to call curl against wttr.in. It picked an output format that gave it the fields it needed. It got real data back. It synthesized a nicely-formatted response.
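If you want to run the same check without scrolling through the raw stream, a small filter can pull out the tool calls for you. The following is a minimal sketch, not part of Claude Code: it assumes the stream is newline-delimited JSON and that a tool call appears somewhere in an event as an object with "type": "tool_use", as in the excerpt above. The function names are made up for illustration. It walks each event recursively, so it does not depend on how deeply the block is nested.

```typescript
// Shape of a tool-use block, matching the excerpt from the stream.
interface ToolUse {
  type: "tool_use";
  name: string;
  input: Record<string, unknown>;
}

// Recursively collect every object whose `type` is "tool_use".
// Walking the whole value avoids assuming a particular event layout.
function collectToolUses(value: unknown, found: ToolUse[]): void {
  if (Array.isArray(value)) {
    for (const item of value) collectToolUses(item, found);
    return;
  }
  if (value === null || typeof value !== "object") return;
  const obj = value as Record<string, unknown>;
  if (obj.type === "tool_use" && typeof obj.name === "string") {
    found.push({
      type: "tool_use",
      name: obj.name,
      input: (obj.input ?? {}) as Record<string, unknown>,
    });
  }
  for (const key of Object.keys(obj)) collectToolUses(obj[key], found);
}

// Parse newline-delimited JSON and return the tool calls it contains.
// Lines that are not valid JSON are skipped rather than treated as fatal.
function findToolUses(ndjson: string): ToolUse[] {
  const found: ToolUse[] = [];
  for (const line of ndjson.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      collectToolUses(JSON.parse(trimmed), found);
    } catch {
      // Not JSON (for example a stray log line): ignore it.
    }
  }
  return found;
}
```

Feed it the captured output and print each call's name and input.command. With partial messages enabled, the same call may show up more than once in the stream, so treat the result as "calls seen" rather than an exact count.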
This is not "an open-source model can technically emit a tool call." This is "an open-source model autonomously chose the right tool, picked the right invocation, and used the result properly." On a laptop. With 22GB of RAM in use.
I had been underestimating local models in 2026 in exactly the way much of the developer community has been quietly underestimating them. The capability is just there.
So what works and what doesn't
After several days of using claudel for various tasks, here's the honest breakdown.
Works reliably:
- Single-tool calls (Bash, ripgrep, file reads, file edits) -- Qwen emits clean tool-use blocks
- Conversation memory across --resume turns -- the cache fix from yesterday's post makes this fast
- Code refactors within a single file or small set of files
- Generating commit messages, PR descriptions, changelogs from a diff
- One-shot scripted tasks via claude -p --bare
- Lookups via curl to external APIs (weather, GitHub, etc.) when the model decides it needs them
Wobbly but workable:
- Multi-step agentic loops (4-6 tool calls in sequence) -- success rate is high but not 100%
- Tasks that require reading a large number of files to build context first -- prompt processing time grows linearly with how many files Claude Code attaches
- Skill-style behaviors -- with --bare, skills don't auto-trigger; you have to explicitly invoke them via /skill-name
Still better on Anthropic:
- Very long agentic chains (15+ turns deep) -- local models lose coherence faster than Sonnet
- Tasks that genuinely benefit from the heavy auto-context (CLAUDE.md, all your skills, all your hooks) -- those require not using --bare, which puts you back at 20+ seconds per call locally
- Anything where a single subtle reasoning mistake compounds into a much bigger problem -- frontier models have more reliability margin
That's a more nuanced answer than "use it" or "don't use it," but it matches what I see in practice. Local Qwen handles maybe 60-70% of the things I'd reach for Claude Code for. The remaining 30-40% are tasks where Anthropic's reliability margin is genuinely worth the cost.
Adding a local toggle to a real project
The last piece of the puzzle is making local routing easy to turn on in projects that have their own claude -p integration. I run a project called mclaw that does exactly this -- it shells out to claude -p for various judgment calls. To make local routing optional, I added support for an environment variable toggle:
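export function applyLocalBackendToggle(opts: InvokeOptions): InvokeOptions {
  // Opt-in: unless MCLAW_USE_LOCAL is exactly '1', leave the default Anthropic path untouched.
  if (process.env.MCLAW_USE_LOCAL !== '1') return opts;
  // Environment that points the spawned claude process at the local endpoint.
  const localDefaults: Record<string, string> = {
    ANTHROPIC_BASE_URL: process.env.MCLAW_LOCAL_URL ?? 'http://localhost:11435',
    ANTHROPIC_AUTH_TOKEN: 'ollama',
    ANTHROPIC_API_KEY: '',
    ANTHROPIC_MODEL: process.env.MCLAW_LOCAL_MODEL ?? 'qwen3.6:35b-a3b-coding-nvfp4',
  };
  // Spread order matters: anything already set in opts.env overrides the local defaults.
  return { ...opts, env: { ...localDefaults, ...opts.env } };
}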
Three things to notice about this pattern:
It's opt-in. With MCLAW_USE_LOCAL unset, the function returns the options unchanged and the default Anthropic path is used. Nothing about the default behavior changes -- this is backwards-compatible by construction.
Explicit opts.env always wins. If a specific call site already set ANTHROPIC_BASE_URL to something, the spread order ({ ...localDefaults, ...opts.env }) means the caller's choice overrides the local defaults. Good for tests, good for special-case routing.
It's toggleable per shell. export MCLAW_USE_LOCAL=1 in your terminal flips on local routing for that shell. unset MCLAW_USE_LOCAL flips it back. You can also set it per-invocation: MCLAW_USE_LOCAL=1 mclaw <command>. Or permanently in your shell rc if you want all mclaw runs to default to local.
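The precedence rule is easy to get backwards, so here is a small self-contained sketch that exercises it. It is not mclaw's code. InvokeOptionsSketch is a hypothetical stand-in for the real options type, the toggle and defaults are passed in as arguments instead of being read from process.env, and the port 9999 URL is an invented test value.

```typescript
type EnvVars = Record<string, string>;

// Hypothetical stand-in for the real options object; only `env` matters here.
interface InvokeOptionsSketch {
  env?: EnvVars;
}

// Same merge as the function above, with the toggle and defaults passed in
// explicitly so the precedence can be checked in isolation.
function withLocalDefaults(
  opts: InvokeOptionsSketch,
  useLocal: boolean,
  localDefaults: EnvVars,
): InvokeOptionsSketch {
  if (!useLocal) return opts; // opt-in: the default path is returned unchanged
  // Later spreads win, so anything in opts.env overrides localDefaults.
  return { ...opts, env: { ...localDefaults, ...opts.env } };
}

const sketchDefaults: EnvVars = {
  ANTHROPIC_BASE_URL: "http://localhost:11435",
  ANTHROPIC_AUTH_TOKEN: "ollama",
};

// A call site that pins its own base URL (an invented test server).
const pinnedCall: InvokeOptionsSketch = {
  env: { ANTHROPIC_BASE_URL: "http://localhost:9999" },
};

const mergedCall = withLocalDefaults(pinnedCall, true, sketchDefaults);
console.log(mergedCall.env?.ANTHROPIC_BASE_URL);   // http://localhost:9999 (caller wins)
console.log(mergedCall.env?.ANTHROPIC_AUTH_TOKEN); // ollama (filled in from defaults)
console.log(withLocalDefaults(pinnedCall, false, sketchDefaults) === pinnedCall); // true
```

Swapping the spread order to { ...opts.env, ...localDefaults } would silently invert the rule and let the local defaults clobber the caller's choice, which is the failure a test like this catches.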
You'll need one more thing if your project's invoke code already does env scrubbing. Mine had an allowlist of environment variable prefixes that got forwarded to the spawned claude subprocess. That allowlist needed ANTHROPIC_ added to it, otherwise the env vars I was carefully setting would get stripped before the subprocess saw them.
This is the kind of detail that's easy to miss and hard to debug -- the routing "looks like" it should work but the env vars never reach the child. Worth checking explicitly if you're integrating this into an existing codebase.
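To make the failure mode concrete, here is a rough sketch of a prefix-based scrubber. The function name and the prefix lists are assumptions for illustration (mclaw's actual allowlist isn't shown in this post), but the shape of the bug is the same: a variable whose prefix isn't listed never reaches the child.

```typescript
// Keep only the variables whose names start with an allowed prefix.
// Hypothetical helper; a real project's scrubber may differ.
function scrubEnv(
  source: Record<string, string | undefined>,
  allowedPrefixes: string[],
): Record<string, string> {
  const kept: Record<string, string> = {};
  for (const key of Object.keys(source)) {
    const value = source[key];
    if (value === undefined) continue;
    const allowed = allowedPrefixes.some((prefix) => key.indexOf(prefix) === 0);
    if (allowed) kept[key] = value;
  }
  return kept;
}

// Example parent environment (values invented for the sketch).
const parentEnv: Record<string, string | undefined> = {
  PATH: "/usr/bin",
  MCLAW_USE_LOCAL: "1",
  ANTHROPIC_BASE_URL: "http://localhost:11435",
};

// Before the fix: ANTHROPIC_ is not listed, so the routing variable is dropped.
const beforeFix = scrubEnv(parentEnv, ["PATH", "MCLAW_"]);
console.log("ANTHROPIC_BASE_URL" in beforeFix); // false

// After the fix: the prefix is allowed and the variable is forwarded.
const afterFix = scrubEnv(parentEnv, ["PATH", "MCLAW_", "ANTHROPIC_"]);
console.log(afterFix.ANTHROPIC_BASE_URL); // http://localhost:11435
```

A quick way to check a real integration is to log the keys of the environment object just before it is handed to the subprocess and confirm the ANTHROPIC_ entries are still there.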
A side note on Claude Code Max20
I've spent the whole series talking about local as an alternative to my Anthropic subscription. To be clear: I'm still on Max20. The arithmetic genuinely doesn't favor switching for my volume.
When I ran ccusage against my Claude Code session logs from the last 30 days, the breakdown was:
- ~2.3 billion total tokens processed (mostly cache reads -- 94% of traffic)
- ~$1,741/month equivalent if I were paying API rates
- $200/month actual on Max20
That's 8.7× value out of the subscription, almost entirely thanks to Anthropic's aggressive prompt caching. The cheapest open-source equivalent via OpenRouter (Kimi K2.6 with cache support) would still cost me 3-4× what Max20 does. At my volume, there is no scenario in which the math says "switch."
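For anyone who wants to plug in their own ccusage numbers, the arithmetic looks like this. The figures are the ones quoted above. The break-even line is a rough estimate that assumes API cost scales linearly with usage, which is a simplification.

```typescript
// Figures from the 30-day breakdown above.
const apiEquivalentUsd = 1741; // what the same traffic would cost at API rates
const subscriptionUsd = 200;   // flat Max20 price
const totalTokens = 2.3e9;     // total tokens processed
const cacheReadShare = 0.94;   // fraction of traffic that was cache reads

// How many dollars of API usage each subscription dollar buys.
const valueMultiple = apiEquivalentUsd / subscriptionUsd;

// Tokens served from cache, in absolute terms.
const cacheReadTokens = totalTokens * cacheReadShare;

// Share of current usage at which API rates would cost the same as the
// subscription (assumes cost is proportional to volume).
const breakEvenShare = subscriptionUsd / apiEquivalentUsd;

console.log(valueMultiple.toFixed(1) + "x");                          // 8.7x
console.log((cacheReadTokens / 1e9).toFixed(2) + "B cache reads");    // 2.16B cache reads
console.log((breakEvenShare * 100).toFixed(0) + "% of current use");  // 11% of current use
```

Under that linear assumption, usage would have to fall to roughly a ninth of its current level before pay-as-you-go API pricing came out ahead of the flat rate.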
But "switch entirely or don't bother" was always the wrong frame. Local is a third tool in the box, alongside Sonnet and Opus. For one-off scripted prompts, throwaway exploration, offline work, bulk grep-and-summarize tasks, generating commit messages from a diff -- local Qwen at 1 second per turn is the right pick. Save the subscription credits for the work that actually needs Sonnet-grade reasoning.
That mental shift is the actual point of this series. Not "open source has finally replaced the frontier models." It hasn't, and probably won't for a while. The point is that local LLMs have crossed the threshold from "interesting demo" to "tool I reach for daily for specific tasks." That's a fundamentally different category, and most developers I know haven't updated their mental model yet.
Where this leaves us
A year ago, if you'd asked me whether to spend a weekend setting up a local LLM, I would have said no. The models weren't quite good enough, the tooling was rough, the speed was uncomfortable, and the cost of a frontier API call was already so low that "local" was solving the wrong problem.
Since then, every one of those constraints has shifted:
- Qwen3.6-35B-A3B and its peers are within a few percentage points of Sonnet on real benchmarks
- Ollama speaks the Anthropic protocol natively, so existing tooling just works
- With the --bare flag and a tiny proxy, local response times match frontier APIs
- Tool calling actually works, with the right model
- 64GB unified memory Macs (now several years old and available used on the secondary market) can run this
The barrier to entry is now an afternoon of setup and about 20GB of disk space. That's small enough that the next time someone asks me "is local AI usable yet?" my answer is finally a confident yes. Not a replacement for paid frontier APIs -- but a real, useful addition to a developer's toolkit that wasn't there 12 months ago.
If you've read this whole series, you have everything you need to set this up yourself. The model, the install, the connection, the speed fixes, the integration pattern, and an honest read on what works and what doesn't.
The window where "OSS LLMs for coding" was a punchline closed quietly in early 2026. We're in a different era now. Worth checking in.