As I experiment with running local #llm models on my #framework desktop, having 128GB of RAM certainly opens up lots of options. I can run some large models, but they're generally quite slow.
My current set-up is to run two 'smaller' models simultaneously: a planning model and a coding model.
Qwen3-32B is my 'planner' model, which has good reasoning/instruction following capabilities.
Qwen3-Coder-30B-A3B is my 'coding' model, for coding, tool calling and debugging.
I'm running #opencode in the terminal, which by default has two primary agents: plan and build. The two-model setup pairs nicely with that.
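To route each opencode agent to its own model, a per-agent model override in opencode.json can express this. This is a hedged sketch, not a verified config: it assumes LM Studio's OpenAI-compatible server on its default port 1234, the model IDs are placeholders that must match what LM Studio reports, and the exact schema can vary between opencode versions, so check the current opencode docs.

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "lmstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "options": { "baseURL": "http://localhost:1234/v1" },
      "models": {
        "qwen3-32b": {},
        "qwen3-coder-30b-a3b": {}
      }
    }
  },
  "agent": {
    "plan": { "model": "lmstudio/qwen3-32b" },
    "build": { "model": "lmstudio/qwen3-coder-30b-a3b" }
  }
}
```

With this in place, switching between the plan and build agents in opencode switches models automatically.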
Planning + Coding Duo (Better Quality)
Planner: Qwen3-32B (Q5_K_M, ~22GB) — excellent reasoning and instruction following for decomposing tasks, writing specs, and architectural thinking. Has a native "thinking mode" for extended reasoning.
Coder: Qwen3-Coder-30B-A3B (Q5, ~18GB) — the specialist, handles tool calling, code generation, debugging, and repo-level tasks.
Both fit in RAM simultaneously (~40GB combined), leaving plenty of headroom for context. In LM Studio you'd just have both loaded and route to whichever role you need.
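Because LM Studio serves every loaded model through one OpenAI-compatible endpoint (default http://localhost:1234/v1), "routing" is just choosing a model ID per request. A minimal sketch using only the standard library — the model IDs here are hypothetical and need to match whatever LM Studio shows for your loaded models:

```python
import json
import urllib.request

# Hypothetical model IDs -- replace with the IDs LM Studio reports.
MODELS = {
    "plan": "qwen3-32b",
    "build": "qwen3-coder-30b-a3b",
}

def pick_model(role: str) -> str:
    """Map an agent role ('plan' or 'build') to its model ID."""
    return MODELS[role]

def chat(role: str, prompt: str, base_url: str = "http://localhost:1234/v1") -> str:
    """Send one chat request to whichever model handles this role."""
    payload = {
        "model": pick_model(role),
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

For example, `chat("plan", "Outline the refactor")` hits the planner while `chat("build", "Implement step 1")` hits the coder, with both resident in RAM the whole time.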
Some recommended settings for both these models:
Qwen3-32B (Planner)
Thinking Mode is the big one. Qwen3 has a hybrid thinking/non-thinking mode controlled by either a system prompt flag or a special token. In LM Studio you can enable this via the system prompt — add /think to activate extended reasoning, or /no_think to skip it. For planning tasks you generally want thinking enabled.
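Since the soft switch is just a marker appended to the prompt text, it's easy to toggle per request. A tiny helper, assuming the /think and /no_think markers behave as described above:

```python
def with_thinking(prompt: str, think: bool = True) -> str:
    """Append Qwen3's soft-switch marker to toggle extended reasoning."""
    return f"{prompt} /think" if think else f"{prompt} /no_think"

# Planning request with extended reasoning on:
with_thinking("Design the database schema")      # "Design the database schema /think"
# Quick answer, no reasoning trace:
with_thinking("Rename this variable", think=False)
```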
Context Length — set this high, 16K–32K. Planning tasks tend to involve longer back-and-forth and architectural context.
Temperature — 0.6 is Alibaba's recommended value for Qwen3 with thinking enabled. Don't go lower than this with thinking mode or you can get repetitive reasoning loops.
Top-P — 0.95, Top-K — 20. These are Qwen3's officially recommended sampling params for thinking mode.
Repeat Penalty — 1.1 is a safe default.
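Put together, the planner settings above map onto a request body like this. Note that top_k and repeat_penalty are non-standard extensions to the OpenAI schema; LM Studio's llama.cpp-backed server accepts them, but the field names are an assumption worth double-checking against your server version:

```python
# Planner request: thinking on, Qwen3's recommended sampling for thinking mode.
planner_request = {
    "model": "qwen3-32b",    # hypothetical ID; match your loaded model
    "messages": [
        {"role": "user", "content": "Write a migration plan for the auth module. /think"}
    ],
    "temperature": 0.6,      # Alibaba's recommendation with thinking enabled
    "top_p": 0.95,
    "top_k": 20,
    "repeat_penalty": 1.1,   # safe default; avoids reasoning loops
    "max_tokens": 4096,
}
```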
Qwen3-Coder-30B-A3B (Coder)
Temperature — lower than the planner, 0.2–0.3 for deterministic, consistent code output. Some people go as low as 0.1 for pure completion tasks.
Context Length — this model supports up to 256K natively, but practically set it as high as your RAM budget allows after both models are loaded. 32K–64K is a good starting point for repo-level work.
Top-P — 0.9 or lower for coding; you want less randomness in code output.
Repeat Penalty — 1.05–1.1, keep it mild to avoid interfering with repetitive code patterns (loops, boilerplate) that are intentional.
Thinking mode — you can disable it (/no_think) for the coder to keep responses fast and direct. Enable it selectively when debugging complex logic.
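The coder's settings translate to a noticeably colder request body than the planner's. Same caveat as before: repeat_penalty is a non-standard field that LM Studio's server accepts, and the model ID is a placeholder:

```python
# Coder request: thinking off for speed, low temperature for consistent code.
coder_request = {
    "model": "qwen3-coder-30b-a3b",  # hypothetical ID; match your loaded model
    "messages": [
        {"role": "user", "content": "Fix the failing test in parser.py. /no_think"}
    ],
    "temperature": 0.2,              # 0.2-0.3 range; some go to 0.1 for pure completion
    "top_p": 0.9,
    "repeat_penalty": 1.05,          # mild, so intentional boilerplate isn't penalised
    "max_tokens": 8192,
}
```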
Both Models
Flash Attention — enable it in LM Studio's model settings if available for your build. Significant memory and speed improvement at long contexts.
GPU Offload — set layers offloaded to the iGPU to the max. The Strix Halo's Radeon 8060S handles this well, and you'll see noticeably better token throughput than pure CPU inference.
Mlock — worth enabling to keep the models pinned in RAM and avoid swap thrashing when both are loaded simultaneously.
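When deciding how high to push context length, the KV cache is the number to estimate: roughly 2 (K and V) × layers × KV heads × head dim × bytes per element, per token. A back-of-the-envelope sketch — the architecture numbers in the example call are illustrative assumptions, not verified values for these exact checkpoints, so read them from each model's config.json:

```python
def kv_cache_gib(context_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GiB for a context length (fp16 by default)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return context_len * per_token / 2**30

# Illustrative GQA hyperparameters -- check the model's config.json for real ones.
print(kv_cache_gib(65536, n_layers=48, n_kv_heads=4, head_dim=128))  # → 6.0
```

So a 64K-context KV cache on a GQA model of this shape costs on the order of single-digit GiB at fp16, which is why ~40GB of weights in a 128GB box still leaves room for generous contexts on both models.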