Doubling Qwen3.6-27B on One RTX 3090: Ollama → llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)
A reader on my last post said Ollama was leaving a lot on the table: that a tuned backend with multi-token prediction (MTP) could roughly double my 3090's throughput. So I went and measured it, one lever at a time. The short version: they were right, the 2.25× is real, and below is the exact path that got me there on my box.

On a single RTX 3090, Qwen3.6-27B generation went from 35.7 tok/s (Ollama) to 80.2 tok/s (llama.cpp + MTP), a measured 2.25×, by stacking three independent levers: a leaner engine, a smaller quant, and speculative decoding. The interesting part isn't the headline; it's which lever bought how much, and a couple of things that tripped me up on the way. (To be precise up front: MTP on its own is 1.78× at the same quant; the 2.25× is what you get when all three levers stack.)

All on one RTX 3090, Qwen3.6-27B, 200 tokens generated, flash-attention on:

| step | what changed | backend | quant | MTP | gen tok/s | vs Ollama | VRAM |
|---|---|---|---|---|---|---|---|
| baseline | - | Ollama | Q4_K_M | - | 35.7 | 1.00× | 23.2 GB |
| 1 | engine | ik_llama.cpp | Q4_K_M | - | 41.9 | 1.17× | 17.3 GB |
| 2 | + quant | ik_llama.cpp | IQ4_XS | - | 47.5 | 1.33× | 15.1 GB |
| 3 | + MTP | llama.cpp | IQ4_XS | on | 80.2 | 2.25× | ~15 GB |

A note on fairness: the baseline and rows 1–2 use each engine's own native bench path, and row 3 is llama-server. For a clean apples-to-apples read of MTP alone, the same llama-server went 45.1 (MTP off) → 80.2 (MTP on) = 1.78×. So MTP by itself is ~1.78× on identical engine/model/tool; the 2.25× is the full stack vs Ollama. (Both the Ollama baseline and the llama.cpp runs fit fully in VRAM; the baseline ran at num_ctx 8192 and the llama.cpp runs at -c 4096. Generation throughput is largely insensitive to that as long as nothing spills to CPU, though it accounts for part of the VRAM difference in the table.)

Moving the same Q4_K_M model from Ollama to a bare-metal ik_llama.cpp build (CUDA, flash-attention, compiled for the 3090's sm86) took me from 35.7 → 41.9 tok/s and dropped VRAM from 23.2 → 17.3 GB.
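The ratios quoted so far, including the 1.17× for this engine step, are plain divisions of the measured generation rates. A minimal sketch that reproduces them from the numbers in the table (the variable and function names are mine, purely for illustration):

```python
# Measured generation rates (tok/s), copied from the table above.
BASELINE = 35.7  # Ollama, Q4_K_M
steps = {
    "engine (ik_llama.cpp, Q4_K_M)": 41.9,
    "+ quant (ik_llama.cpp, IQ4_XS)": 47.5,
    "+ MTP (llama.cpp, IQ4_XS)": 80.2,
}


def speedup(rate, reference):
    """Ratio of two tok/s figures, rounded to two decimals."""
    return round(rate / reference, 2)


# Full-stack comparison: every step against the Ollama baseline.
vs_ollama = {name: speedup(rate, BASELINE) for name, rate in steps.items()}

# MTP in isolation: same llama-server, MTP off (45.1) vs MTP on (80.2).
mtp_only = speedup(80.2, 45.1)

for name, ratio in vs_ollama.items():
    print(f"{name}: {ratio:.2f}x vs Ollama")
print(f"MTP alone: {mtp_only:.2f}x")
```

Keeping the two reference points separate (35.7 for the full stack, 45.1 for MTP alone) is what stops the 2.25× and the 1.78× from being confused with each other.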
Ollama is convenience-first (it sizes things generously and doesn't expose the lower-level knobs), so a hand-built engine is faster out of the gate. Swapping the quant from Q4_K_M to IQ4_XS added a bit more and shrank VRAM further: 47.5 tok/s, 15.1 GB. Roughly a third faster, and nothing exotic yet.

Multi-token prediction / speculative decoding is the big one. The idea: a small, fast draft predicts several tokens ahead, and the main model verifies them in one pass. When the drafts are accepted, you get multiple tokens for roughly the cost of one. Because the main model verifies every drafted token before it's emitted, the output is preserved: this is a throughput win, not a quality tradeoff.

Two things were worth knowing for my setup:

- In my build, MTP came from mainline llama.cpp, not ik_llama. ik_llama got me to ~47 (engine + quant), but I couldn't get MTP running there: my build rejected the -mtp flags and ignored the model's nextn tensors. Mainline llama.cpp added MTP fairly recently (PR #22673, merged 2026-05-16), and that's where it worked for me. (There may well be an ik_llama path I missed; this is just what got it going on my box.)
- Ollama's GGUF couldn't be reused. Qwen3.6 changed rope.dimension_sections from 3 to 4 elements; Ollama's stored blob still has the older 3-element layout, so llama.cpp refused it (expected 4, got 3). I grabbed a properly-converted GGUF instead (bartowski / a nextn-equipped MTP build). A small heads-up if you're tempted to point llama.cpp at your existing Ollama blob.

With mainline llama.cpp, an MTP-equipped IQ4_XS GGUF, and --spec-draft-n-max 3, generation hit 80.2 tok/s. The one knob that mattered for me was --spec-draft-n-max (how many tokens to draft ahead):

| config | gen tok/s | draft acceptance |
|---|---|---|
| n-max 2 | 77.5 | 78.1% |
| n-max 3 | 80.2 | 70.3% |
| n-max 4 | 70.7 | 53.4% |
| n-max 3 + p-min 0.6 | 54.1 | 80.0% |
| n-max 3 + KV q8_0 | 74.6 | 64.5% |

The counterintuitive bit: higher acceptance ≠ faster.
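One rough way to see why, as a back-of-the-envelope sketch rather than how llama.cpp accounts for anything: assume every cycle drafts exactly n-max tokens, that "draft acceptance" is the fraction of drafted tokens kept, and that each verification pass also contributes one token of its own. None of those assumptions come from the measurements; they are just the textbook picture of speculative decoding.

```python
# (n_max, draft acceptance, measured gen tok/s) from the n-max rows of the table above.
runs = [(2, 0.781, 77.5), (3, 0.703, 80.2), (4, 0.534, 70.7)]


def tokens_per_pass(n_max, acceptance):
    # Assumption: n_max tokens are drafted every cycle, `acceptance` is the
    # fraction of drafted tokens that survive verification, and the verifying
    # pass always yields one token itself.
    return 1 + n_max * acceptance


for n_max, acceptance, tok_s in runs:
    estimate = tokens_per_pass(n_max, acceptance)
    print(f"n-max {n_max}: ~{estimate:.2f} tokens/pass, measured {tok_s} tok/s")
```

Under these assumptions n-max 2 has the best acceptance rate but the fewest tokens per pass, and n-max 4 lands about as many tokens per pass as n-max 3 while drafting one more token every cycle, which is at least consistent with the measured ordering. The p-min and KV rows are left out because this simple model does not cover them.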
Pushing p-min to 0.6 raised acceptance to 80.0% but dropped throughput to 54.1 tok/s. Acceptance is a ratio, not a token count: presumably the stricter cutoff shortens the drafts, so a larger share of them is accepted but fewer tokens land per verification pass. Plain f16 KV beat q8_0 KV too (80.2 vs 74.6 tok/s). n-max 3 with f16 KV was the sweet spot. (I also went looking for a "prefill-off" trick I'd heard about and couldn't find it as a flag in current llama.cpp; --spec-draft-n-max was the lever that actually moved the number for me.)

A few caveats, kept front and center because they're the difference between a benchmark and a benchmark you can trust:

- 80.2 tok/s is this box's number (RTX 3090, WSL2). The originally-cited "~80" was a different setup; I reproduced ~80 honestly here.
- Prefill numbers are noisy: my test prompt was short (~56 tokens), so I'm not headlining prefill. Generation tok/s is solid (±0.1).
- The bartowski Q4_K_M and Ollama's Q4_K_M are the same quantization family but different conversions (the rope change above), so they're not bit-identical weights. The model and quant family are matched; the conversion isn't.
- Single GPU, single request. No batching or concurrency tested; that's a different question.

One benchmarking trap that cost me time: llama-cli -n <N> is ignored under -no-cnv, so the model just generates until timeout (mine produced a 2 GB output file and looked like a 39-minute hang; it was runaway generation). Use llama-bench for token-exact non-MTP runs, and llama-server with n_predict for MTP.

For reproduction, the setup was:

- Hardware: RTX 3090 24 GB (Ampere, sm86), WSL2 Ubuntu 24.04, driver 591.74, nvcc 12.0.
- ik_llama.cpp (commit bbe1a51): cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_NATIVE=ON
- llama.cpp mainline, has MTP (commit e3471b3): cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DBUILD_SHARED_LIBS=OFF
- Models: bartowski/Qwen_Qwen3.6-27B-GGUF (Q4_K_M, IQ4_XS); a nextn/MTP-equipped Qwen3.6-27B-MTP-IQ4_XS GGUF for the speculative step.
- Non-MTP bench: llama-bench -m <gguf> -p 56 -n 200 -ngl 99 -fa 1 -r 3
- MTP run (the winner): llama-server -m Qwen3.6-27B-MTP-IQ4_XS.gguf -ngl 99 -fa on -c 4096 --spec-type draft-mtp --spec-draft-n-max 3, then POST /completion with n_predict: 200. Draft acceptance ≈ 70%.

So the reader's nudge was a good one: Ollama really was leaving a clean ~2× on the table for this model on this card, and most of it is the MTP step. Ollama stays my default for everyday use (it's simple and it's what my tooling talks to); this build is the "I want every token/sec" setup.

If you've gotten MTP working under ik_llama, or found the prefill trick, I'd genuinely like to hear how. That's the part I couldn't crack.
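For reference, the "POST /completion with n_predict: 200" step from the MTP run could be sketched as below. This is a minimal client under assumptions that are not from the post: the server is on llama-server's default 127.0.0.1:8080, the prompt is a made-up placeholder, and the response carries a timings object with a predicted_per_second field, as llama-server builds I'm aware of do. Check the field names against your own build.

```python
import json
import urllib.request

# Assumed default address; adjust if llama-server was started with --host/--port.
SERVER_URL = "http://127.0.0.1:8080/completion"


def build_request(prompt, n_predict=200):
    """Build the POST /completion request with a fixed generation length."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        SERVER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generation_rate(response):
    """Return generation tok/s from the response's timings, or None if absent."""
    timings = response.get("timings") or {}
    return timings.get("predicted_per_second")


def measure(prompt, n_predict=200):
    """Send one request to a running server and return its reported gen tok/s."""
    with urllib.request.urlopen(build_request(prompt, n_predict), timeout=600) as resp:
        return generation_rate(json.load(resp))
```

With the server from the MTP run started, something like `measure("Explain speculative decoding in two sentences.")` (a hypothetical prompt) returns the server-reported generation rate for that single request. Fixing n_predict in the request is what keeps the run token-exact.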
Source: Dev.to.


