I've spent the last year shipping code with Claude Code on the Max20 plan. It's been one of the best $200/month I've ever spent. But I keep getting the same question from developer friends -- "is the open-source stuff actually usable yet?" And I keep giving the same hand-wavy answer about "almost, but not quite," because I genuinely didn't know.
So this week I sat down and found out.
This is the first in a short series where I walk through what it actually takes to run a competitive open-source coding model on a Mac, point Claude Code at it, and use it for real work. The TL;DR is: yes, it works, and the gap is much smaller than I expected. But getting there required learning a handful of things I wish someone had told me up front. This series is me writing that up.
What "competitive" means now
The state of open-source LLMs flipped quietly in the first half of 2026. Models like Alibaba's Qwen, Moonshot's Kimi, and DeepSeek V4 are now scoring within a few points of Claude Sonnet on real-world coding benchmarks like SWE-bench Verified. Not on hand-picked questions -- on actual GitHub issue resolution.
The honest comparison looks like this:
- Claude Sonnet 4.6: 77.2% on SWE-bench Verified
- Qwen3.6-27B (open source, fits on a beefy Mac): 77.2% -- same number
- Qwen3-Coder-Next (80B, MoE): "Sonnet 4.5-level" per the Alibaba team
- GLM-4.7-Flash, DeepSeek V4, Kimi K2.6: all in this same neighborhood
A year ago this would have been laughable. Today the question isn't "can open models do it" -- it's "can my hardware run them, and is the workflow tolerable."
My hardware reality
I'm doing all of this on a MacBook Pro M1 Max with 64GB of unified memory. Five years old at this point. Not a Mac Studio, not a beefy datacenter GPU, just a laptop.
Two things matter about this machine for local LLMs:
The unified memory architecture. On Apple Silicon, your CPU, GPU, and "RAM" all share the same physical memory pool. There's no copying data over a PCIe bus from system RAM to GPU VRAM. That's a huge structural advantage for running models that would otherwise need an expensive dedicated GPU.
64GB is the realistic ceiling. macOS reserves around 16GB for itself and various background services, so you have roughly 48GB usable for a model and its working memory. That's enough to run a serious model, but it's not enough to run every serious model -- and the difference between "fits" and "doesn't fit" is the difference between "snappy" and "your whole computer freezes for 90 seconds."
I know because I made that mistake. We'll get to it.
The SSD-as-RAM trap
This is the misconception I had to unlearn first, and I think it's the one that trips up most people considering local LLMs on a Mac.
My laptop has a fast 4TB NVMe SSD. I measured it -- 7.1 GB/s sequential reads, 1.9 GB/s writes. That's genuinely fast storage. So my instinct was: "if a model doesn't fit in RAM, can't macOS just page it to the SSD? The SSD is fast, it should be fine."
It's not fine. It's catastrophic.
Here's why. Generating one token from a dense LLM requires reading all of the model's weights from memory. For a 30B-parameter model at 4-bit quantization, that's ~20GB of memory traffic per token. The M1 Max's unified memory runs at around 400 GB/s, which gives you a ceiling of ~20 tokens per second from bandwidth alone. The SSD at 7 GB/s gives you ~0.35 tokens per second -- about 60× slower. And that's the optimistic sequential number. Real LLM access patterns are more random, which makes the gap bigger.
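The arithmetic above is simple enough to sanity-check yourself. Here's a minimal sketch that uses the same numbers from this section; the function and constant names are my own, and it models only the bandwidth ceiling, not compute or random-access penalties.

```python
def tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when each token streams the weights once."""
    return bandwidth_gb_s / gb_read_per_token


# Figures taken from the paragraph above; treat them as rough estimates.
WEIGHTS_GB = 20.0            # ~30B parameters at 4-bit quantization
UNIFIED_MEMORY_GB_S = 400.0  # M1 Max unified memory bandwidth
SSD_GB_S = 7.0               # measured sequential read speed, rounded

ram_tps = tokens_per_second(UNIFIED_MEMORY_GB_S, WEIGHTS_GB)
ssd_tps = tokens_per_second(SSD_GB_S, WEIGHTS_GB)

print(f"From RAM: {ram_tps:.1f} tokens/sec")
print(f"From SSD: {ssd_tps:.2f} tokens/sec")
print(f"Slowdown: {ram_tps / ssd_tps:.0f}x")
```

The slowdown comes out at 57x, which is where the "about 60×" figure comes from. Swap in your own model size and bandwidth numbers to get a ceiling for your machine.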
In practice, when the model overflows RAM and macOS starts paging to swap, you don't get "slightly slower." You get token generation that takes minutes per word, the whole UI freezes, your fans spin up, and you eventually just kill the process.
So when people in forums say "you need a Mac Studio with 192GB unified memory," they're not being snobby. Local inference is memory-bandwidth-bound, so the whole model has to sit in RAM. The 4TB SSD doesn't help you run bigger models. It only helps you store more model files to switch between.
The math, simplified
Here's the rule of thumb I've landed on for a 64GB Mac:
- The file size of a 4-bit quantized model is roughly its memory footprint at minimum context
- Add 2-8GB for the KV cache (longer context = more cache)
- Leave 15-20GB headroom for macOS, your browser, your editor
So the practical ceiling is a model file of around 30-35GB. Anything bigger and you're flirting with swap.
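If you'd rather check a specific download than eyeball it, the rule of thumb turns into a few lines of Python. This is a rough sketch, not a measurement tool: `fits_in_memory` is a hypothetical helper, and its defaults are the pessimistic ends of the ranges above (8GB of KV cache, 20GB of headroom).

```python
def fits_in_memory(
    model_file_gb: float,
    kv_cache_gb: float = 8.0,    # upper end of the 2-8GB KV cache range
    headroom_gb: float = 20.0,   # upper end of the 15-20GB headroom range
    total_ram_gb: float = 64.0,
) -> tuple[bool, float]:
    """Return (fits, spare_gb) for a model file under the rule of thumb."""
    needed_gb = model_file_gb + kv_cache_gb + headroom_gb
    return needed_gb <= total_ram_gb, total_ram_gb - needed_gb


for size_gb in (22, 35, 40):
    fits, spare_gb = fits_in_memory(size_gb)
    verdict = "fits" if fits else "will swap"
    print(f"{size_gb}GB model file: {verdict} ({spare_gb:+.0f}GB spare)")
```

With those defaults a 35GB file squeaks in with about 1GB to spare and a 40GB file does not, which is why I treat 30-35GB as the line.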
This is where the architectural trick that makes 2026 different from 2025 comes in.
Mixture of Experts (or, "how does a 35B model run like a 3B one")
Traditional dense models -- like Llama 3 70B -- use every parameter to generate every token. Write the word "the" → all 70B weights get read. Write the next word → all 70B get read again. That's why dense models are memory-bandwidth-bound: the bigger the model, the slower the output.
Mixture of Experts (MoE) models break this. The model is split into many smaller specialized sub-networks called "experts," and a small router at each layer picks just a few experts to actually run for any given token. The rest sit idle.
Concrete example: Qwen3.6-35B-A3B. The "35B-A3B" means 35 billion total parameters, 3 billion active per token. The whole 35B has to live in memory because the router might pick any expert next, but the math you actually do is only 3B per token.
You pay the memory cost of a 35B model. You get the speed of a 3B model.
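To make the routing idea concrete, here's a toy top-k router in plain Python. It's a sketch of the general technique only: the expert count, the `TOP_K` value, and the random scores are made up for illustration, and real models (Qwen included) differ in how they score and normalize experts.

```python
import math
import random


def route_top_k(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


NUM_EXPERTS = 64  # hypothetical; not the real Qwen configuration
TOP_K = 4         # hypothetical; experts actually run per token

rng = random.Random(0)
# Stand-in for the scores a learned router would produce for one token.
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route_top_k(logits, TOP_K)
for expert_id, weight in chosen:
    print(f"expert {expert_id:2d} gets weight {weight:.2f}")
print(f"Experts run for this token: {len(chosen)} of {NUM_EXPERTS}")
```

Every expert still has to be resident in memory, because the next token's scores may select a completely different set. Only the chosen few are read and computed for this token.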
The practical impact for a 64GB Mac is dramatic:
| Model type | Size in RAM | Bandwidth per token | Speed on M1 Max |
|---|---|---|---|
| Dense 70B | 40 GB | reads all 40 GB | ~5 tokens/sec (painful) |
| Dense 27B | 22 GB | reads all 22 GB | ~15 tokens/sec (okay) |
| MoE 35B-A3B | 22 GB | reads ~3 GB | ~50+ tokens/sec (snappy) |
Same memory footprint as a dense 27B, but multiple times the speed. MoE is what makes local LLMs feel responsive on a laptop instead of academically interesting on a workstation.
Why I'm writing this series
I went into this expecting to confirm what most developers I know assume -- that local models are still a year away from being useful for real work. Instead I came out the other side with a working setup that's competitive with my Anthropic subscription for a real subset of tasks, and "subscription-grade" for short conversational work.
The catch is that getting there required figuring out things that aren't documented anywhere obvious. Picking the right model variant. Realizing why some download tags will brick your laptop. Discovering that Claude Code sends a hidden header that destroys local cache performance. Patching a real production codebase to opt into local routing without breaking the fallback to the cloud.
The next four posts walk through it in order:
- Installing Ollama and picking the right Qwen model for a 64GB Mac (and learning the hard way that "fits" depends on context window settings too)
- Pointing Claude Code at the local model with one environment variable trick -- and the first painful side-by-side comparison
- The 80-second prompt and how to fix it -- the `--bare` flag, the hidden cache-breaker header, and the tiny proxy that brought response times from 80 seconds to 1.1 seconds
- Wiring local model support into a real project, where tool calls actually do work, and what's still legitimately hard
If you've been waiting for permission to try local LLMs for coding, this is it. The window where it was "interesting but impractical" is closing. By the end of this series you'll have a working setup, you'll understand the tradeoffs honestly, and you'll know exactly when to use local versus when to keep paying Anthropic.
And spoiler: I'm still on Max20. The conclusion isn't "switch to local." It's that for the first time, local is a serious tool in the box -- one that handles a real percentage of your day-to-day work faster, cheaper, and entirely offline.
That's a much bigger deal than it sounds.