Local LLMs for Dev — When Your Laptop Beats the API
Two years ago, "run an LLM locally" meant a quantized 7B model that could barely write a for-loop. In May 2026, an M4 Max with 128 GB unified memory runs Qwen 3 Coder 32B at 40+ tokens per second, no internet, no API key, no rate limit.
The question isn't can you. The question is when should you.
The Stack That Works
The local LLM ecosystem consolidated around three tools:
- Ollama — the easy mode. `ollama run qwen3-coder` and you're done.
- llama.cpp — the engine under most of it. Direct GGUF support, full Metal acceleration on Apple Silicon, Vulkan on Linux, every quantization format that matters.
- LM Studio — GUI for people who don't want to touch a terminal.
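Once Ollama is running, anything that can make an HTTP request can talk to it. Here's a minimal sketch using only the Python standard library, assuming a default install listening on port 11434 and a model already pulled under the tag `qwen3-coder`; the function names are mine, not part of any library.

```python
import json
import urllib.request

# Default address of a local Ollama server (assumes an unmodified install).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send a prompt to the local server and return the completion text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False, the full completion arrives in the "response" field.
    return body["response"]
```

Calling `generate("qwen3-coder", "Write a function that reverses a string.")` should return the model's answer as a single string, provided the server is up and the model tag matches what you pulled.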
The hardware that matters in 2026:
- Apple Silicon M3/M4 Max with 64–128 GB — the sweet spot. Unified memory means you load 70B-class models without a discrete GPU.
- NVIDIA RTX 4090 / 5090 — faster on smaller models, hard ceiling on VRAM (24–32 GB).
- AMD Strix Halo — 128 GB unified memory, a real competitor to the Mac for 70B-class models at roughly half the price.
If you're buying a dev machine in 2026 with local LLM inference in mind, you're choosing between more memory (Apple, AMD) and more speed on smaller models (NVIDIA). Memory wins for coding.
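The memory side of that tradeoff is easy to estimate: weights take roughly parameters × bits per weight ÷ 8 bytes. The sketch below does that arithmetic. The 20% headroom for KV cache and runtime buffers is my assumption, not a measured figure, and real 4-bit GGUF quantizations typically store somewhat more than 4 bits per weight, so treat the results as a lower bound.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_memory(
    params_billions: float,
    bits_per_weight: float,
    memory_gb: float,
    overhead_fraction: float = 0.2,  # assumed headroom for KV cache and buffers
) -> bool:
    """Rough check: do weights plus assumed overhead fit in the given memory?"""
    needed_gb = weight_memory_gb(params_billions, bits_per_weight) * (1 + overhead_fraction)
    return needed_gb <= memory_gb


# 32B at 4-bit: 16.0 GB of weights, fits a 24 GB card with headroom.
print(weight_memory_gb(32, 4), fits_in_memory(32, 4, 24))
# 70B at 4-bit: 35.0 GB of weights, too big for 32 GB of VRAM.
print(weight_memory_gb(70, 4), fits_in_memory(70, 4, 32))
# The same 70B model fits comfortably in 64 GB of unified memory.
print(fits_in_memory(70, 4, 64))
```

This is why the VRAM ceiling matters more than raw speed once you move past the 32B class.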
Models Worth Running
The open-weight landscape changed fast. As of May 2026, my actual rotation:
- Qwen 3 Coder 32B — current best open coding model. Beats GPT-4-class models on most benchmarks, fits in 24 GB at 4-bit.
- DeepSeek V3.5 — strong reasoning, weaker at frontend, requires a 128 GB rig at any reasonable quant.
- Llama 4 Scout 70B — Meta's actually-good open release. General-purpose, decent coder.
- Hermes 4 (Nous) — fine-tune of Llama, less censored, surprisingly good for ambiguous problems.
I keep three loaded: a small fast one for autocomplete (Qwen 3 8B), a mid-tier for chat (Qwen 3 Coder 32B), and a big one for hard problems (Llama 4 Scout 70B).
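A three-tier setup like this comes down to a lookup from task type to model. Here's a rough sketch of that routing; the task names and model tags are placeholders I made up for illustration, so substitute whatever tags your runtime actually lists.

```python
from enum import Enum


class Task(Enum):
    AUTOCOMPLETE = "autocomplete"
    CHAT = "chat"
    HARD_PROBLEM = "hard_problem"


# Placeholder tags (hypothetical): replace with the names your runtime reports.
MODEL_FOR_TASK = {
    Task.AUTOCOMPLETE: "qwen3-8b",     # small and fast
    Task.CHAT: "qwen3-coder-32b",      # mid-tier default
    Task.HARD_PROBLEM: "llama4-70b",   # big and slow
}


def pick_model(task: Task) -> str:
    """Return the model tag assigned to a task tier."""
    return MODEL_FOR_TASK[task]
```

Keeping the mapping in one place makes it cheap to swap a tier when a better model ships.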
The Cost Math
Claude Opus 4.7 and GPT-5 charge per token. Local models charge electricity.
A rough back-of-envelope: heavy daily usage (300K input + 50K output tokens) on Claude Opus 4.7 runs roughly $30–60/day. The same workload on a local M4 Max costs maybe $0.40 in electricity.
The breakeven on a $5,000 M4 Max is roughly three to six months of heavy usage: about 84 days if the machine replaces $60/day of API spend, about 169 days at $30/day. That ignores the API rate limits, the network round trips, and the data you don't want sitting in a vendor's logs.
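The breakeven arithmetic is simple enough to rerun with your own numbers. Here's a minimal sketch using the figures above ($5,000 of hardware, $30–60/day of API spend, $0.40/day of electricity). The function name is mine, and it deliberately ignores resale value and everything else you'd use the machine for.

```python
def breakeven_days(
    hardware_cost: float,
    api_cost_per_day: float,
    electricity_per_day: float,
) -> float:
    """Days until avoided API spend pays for the hardware."""
    daily_savings = api_cost_per_day - electricity_per_day
    if daily_savings <= 0:
        raise ValueError("Local inference never pays off at these rates.")
    return hardware_cost / daily_savings


# Low and high ends of the daily API spend quoted above.
for api_cost in (30.0, 60.0):
    days = breakeven_days(5000.0, api_cost, 0.40)
    print(f"${api_cost:.0f}/day API spend: breakeven after {days:.0f} days")
```

The result is sensitive to the daily API figure, so plug in what your own invoices say before you order the hardware.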
When Local Wins
- Privacy-sensitive code — client work under NDA, security audits, anything you legally can't send to OpenAI
- Offline development — flights, trains, sketchy hotel WiFi
- High-volume agentic loops — when you're running an agent that fires 10,000 prompts a day, API costs eat you alive
- Latency-critical autocomplete — a 4B model on your GPU beats a 200ms round-trip every time
- Experimentation — fine-tuning, LoRAs, weird sampling strategies the APIs don't expose
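The latency and throughput points are worth checking on your own hardware instead of taking on faith. Ollama's non-streaming `/api/generate` response includes timing fields in nanoseconds (`eval_count`, `eval_duration`, `load_duration`, `prompt_eval_duration`), and the sketch below turns them into tokens per second and an approximate time to first token. The helper names and the sample numbers are made up for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Generation throughput: tokens produced divided by generation time."""
    seconds = response["eval_duration"] / 1e9  # nanoseconds to seconds
    return response["eval_count"] / seconds


def time_to_first_token_ms(response: dict) -> float:
    """Approximate wait before output starts: model load plus prompt processing."""
    nanoseconds = response.get("load_duration", 0) + response.get("prompt_eval_duration", 0)
    return nanoseconds / 1e6  # nanoseconds to milliseconds


# Illustrative response body, not a real measurement.
sample = {
    "eval_count": 200,
    "eval_duration": 5_000_000_000,
    "load_duration": 10_000_000,
    "prompt_eval_duration": 140_000_000,
}
print(tokens_per_second(sample))       # 40.0
print(time_to_first_token_ms(sample))  # 150.0
```

Run the same prompt a few times before comparing: the first call usually includes model load time, which inflates the time-to-first-token number.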
When Local Loses
- Frontier reasoning — Claude Opus 4.7 and GPT-5 still beat anything you can run locally on hard novel problems
- Long context — 1M token context on a local machine is theoretically possible, practically painful
- Multimodal — vision and audio models that match the closed offerings don't fit on consumer hardware yet
- You don't want to manage it — Ollama is easy, but model selection, quantization, and prompt formatting are still real work
The Practical Recommendation
If you write code daily and have a recent Mac with 64 GB+ or a 4090: install Ollama tonight, pull Qwen 3 Coder, point your editor at it. Use it for the 80% of tasks that don't need frontier capability. Save the API budget for the 20% that do.
The frontier models are still ahead. They're just no longer essential for most of the work.