Local LLMs for Dev — When Your Laptop Beats the API

Two years ago, "run an LLM locally" meant a quantized 7B model that could barely write a for-loop. In May 2026, an M4 Max with 128 GB unified memory runs Qwen 3 Coder 32B at 40+ tokens per second, no internet, no API key, no rate limit.

The question isn't can you. The question is when should you.

The Stack That Works

The local LLM ecosystem consolidated around three tools:

Ollama — the easy mode. ollama run qwen3-coder and you're done.
llama.cpp — the engine under most of it. Direct GGUF support, full Metal acceleration on Apple Silicon, Vulkan on Linux, every quantization format that matters.
LM Studio — GUI for people who don't want to touch a terminal.
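
To give a feel for what "easy mode" means from code, here is a minimal sketch that sends one prompt to a locally running Ollama server over its HTTP API. It assumes Ollama is listening on its default port (11434) and that a model tagged qwen3-coder has already been pulled; the tag is taken from the command above, and the helper names build_payload and generate are made up for illustration.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, model: str = "qwen3-coder") -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def generate(prompt: str, model: str = "qwen3-coder", timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Example call (needs a running Ollama server with the model pulled):
# print(generate("Write a Python function that reverses a string."))
```

Nothing here leaves the machine: the request goes to localhost, which is the whole point of the setup.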

The hardware that matters in 2026:

Apple Silicon M3/M4 Max with 64–128 GB — the sweet spot. Unified memory means you can load 70B-class models without a discrete GPU.
NVIDIA RTX 4090 / 5090 — faster on smaller models, hard ceiling on VRAM (24–32 GB).
AMD Strix Halo — 128 GB unified memory, a real competitor to the Mac for 70B-class models, at roughly half the price.

If you're buying a dev machine in 2026 with local LLM inference in mind, you're choosing between more memory (Apple, AMD) and more speed on smaller models (NVIDIA). Memory wins for coding.

Models Worth Running

The open-weight landscape changed fast. As of May 2026, my actual rotation:

Qwen 3 Coder 32B — current best open coding model. Beats GPT-4-class models on most benchmarks, fits in 24 GB at 4-bit.
DeepSeek V3.5 — strong reasoning, weaker at frontend, requires a 128 GB rig at any reasonable quant.
Llama 4 Scout 70B — Meta's actually-good open release. General-purpose, decent coder.
Hermes 4 (Nous) — fine-tune of Llama, less censored, surprisingly good for ambiguous problems.

I keep three loaded: a small fast one for autocomplete (Qwen 3 8B), a mid-tier one for chat (Qwen 3 Coder 32B), and a big one for hard problems (Llama 4 Scout 70B).
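
A quick way to sanity-check whether a rotation like this fits in memory is to estimate the size of the weights from parameter count and quantization width. The sketch below is a rough approximation, not a measurement: it assumes exactly 4 bits per weight for all three models and ignores the KV cache and runtime overhead. Real 4-bit GGUF quantizations usually average somewhat more than 4 bits per weight, so treat the results as a lower bound.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


# The three-model rotation from the text, sizes in billions of parameters.
# Assumption: all three are quantized to 4 bits per weight.
rotation = {"Qwen 3 8B": 8, "Qwen 3 Coder 32B": 32, "Llama 4 Scout 70B": 70}

sizes = {name: weight_memory_gb(size, 4) for name, size in rotation.items()}
total = sum(sizes.values())

for name, gb in sizes.items():
    print(f"{name}: ~{gb:.0f} GB of weights")
print(f"Total: ~{total:.0f} GB before KV cache and runtime overhead")
```

Under these assumptions the 32B model needs about 16 GB for weights, which is consistent with fitting in 24 GB, and the full rotation needs about 55 GB. That is why 64 GB is tight for keeping all three resident and 128 GB is comfortable.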

The Cost Math

Claude Opus 4.7 and GPT-5 charge per token. Local models charge electricity.

A rough back-of-envelope: heavy daily usage (300K input + 50K output tokens) on Claude Opus 4.7 runs roughly $30–60/day. The same workload on a local M4 Max costs maybe $0.40 in electricity.

The breakeven on a $5,000 M4 Max is somewhere around 3–6 months of heavy daily usage: $5,000 divided by $30–60 a day is roughly 85 to 170 days. And that's before counting the API rate limits, the network round trips, and the data you don't want sitting in a vendor's logs.
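
The breakeven arithmetic is easy to rerun with your own numbers. This is a minimal sketch using the figures quoted above ($5,000 hardware, $30–60 a day of API spend, $0.40 a day of electricity); the function name is made up, and it assumes the same usage every calendar day and no resale value for the machine.

```python
def breakeven_days(hardware_cost: float, api_cost_per_day: float,
                   electricity_per_day: float = 0.40) -> float:
    """Days until the avoided API spend has paid for the hardware."""
    daily_saving = api_cost_per_day - electricity_per_day
    if daily_saving <= 0:
        raise ValueError("local is not cheaper per day, so there is no breakeven")
    return hardware_cost / daily_saving


# Low and high ends of the daily API spend quoted in the text.
for api_cost in (30, 60):
    days = breakeven_days(5000, api_cost)
    print(f"${api_cost}/day API spend: ~{days:.0f} days (~{days / 30:.1f} months)")
```

With these inputs the answer lands between roughly 84 and 169 days. If you only count working days, or your real API spend is lower, the payback stretches out accordingly.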

When Local Wins

Privacy-sensitive code — client work under NDA, security audits, anything you legally can't send to OpenAI
Offline development — flights, trains, sketchy hotel WiFi
High-volume agentic loops — when you're running an agent that fires 10,000 prompts a day, API costs eat you alive
Latency-critical autocomplete — a small model (4B to 8B) on your own GPU skips the 200 ms network round trip entirely
Experimentation — fine-tuning, LoRAs, weird sampling strategies the APIs don't expose

When Local Loses

Frontier reasoning — Claude Opus 4.7 and GPT-5 still beat anything you can run locally on hard novel problems
Long context — a 1M-token context on a local machine is theoretically possible but practically painful
Multimodal — vision and audio models that match the closed offerings don't fit on consumer hardware yet
You don't want to manage it — Ollama is easy, but model selection, quantization, and prompt formatting are still real work

The Practical Recommendation

If you write code daily and have a recent Mac with 64 GB+ or a 4090: install Ollama tonight, pull Qwen 3 Coder, point your editor at it. Use it for the 80% of tasks that don't need frontier capability. Save the API budget for the 20% that does.
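
One way to make the 80/20 split concrete is a small routing function that encodes the rules of thumb from the two lists above. This is only a sketch: the Task fields, the choose_backend name, and the 32,000-token threshold are all hypothetical, and in practice the "hard reasoning" flag is a judgment call you make yourself.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt_tokens: int
    sensitive: bool = False       # e.g. client code under NDA
    offline: bool = False         # no network available
    hard_reasoning: bool = False  # novel problem, flagged by the developer
    multimodal: bool = False      # needs vision or audio input


# Hypothetical comfort limit for local context; tune for your own hardware.
LOCAL_CONTEXT_LIMIT = 32_000


def choose_backend(task: Task) -> str:
    """Return "local" or "api" using the article's rules of thumb."""
    # Privacy and connectivity are hard constraints: they always force local.
    if task.sensitive or task.offline:
        return "local"
    # Frontier reasoning and multimodal input favor the hosted models.
    if task.hard_reasoning or task.multimodal:
        return "api"
    # Very long prompts are painful locally, so send them out.
    if task.prompt_tokens > LOCAL_CONTEXT_LIMIT:
        return "api"
    # Everything else is the everyday work that stays on the laptop.
    return "local"
```

For example, choose_backend(Task(prompt_tokens=2000)) returns "local", while the same task with hard_reasoning=True returns "api". Sensitive work stays local even when it is long or hard, because that constraint is checked first.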

The frontier models are still ahead. They're just no longer essential for most of the work.
