llama.cpp MTP Boost, New Gemma-4 GGUF, & Qwen 3.6 Local Benchmarks
The `llama.cpp` project sees a significant performance leap with Multi-Token Prediction (MTP) merged into master, more than doubling generation speed on a Qwen3.6 27B model and cutting total wall time by 11.5%. Meanwhile, a new Gemma-4 finetune optimized for creative writing is released in GGUF format for Ollama, and Qwen 3.6 models demonstrate strong performance on the Terminal-Bench 2.0 leaderboard, outperforming Gemini 2.5 Pro in some local coding tasks.
MTP support merged into llama.cpp (r/LocalLLaMA)
The highly anticipated Multi-Token Prediction (MTP) support has officially been merged into the `llama.cpp` master branch (PR #22673), marking a significant advancement for local LLM inference performance. MTP works along the lines of speculative decoding: the model drafts several future tokens per forward pass using its built-in prediction head, then verifies them, so each accepted draft token costs far less than a full decoding step. Early benchmarks, specifically using Qwen3.6 27B models, show promising results.
Tests comparing MTP-enabled `llama.cpp` against the base version on a 27B Qwen3.6 model showed generation speed more than doubling, from 7.63 tokens/second to 16.15 tokens/second, with total wall time (which also includes prompt processing) dropping by 11.50%. While the 27B model saw clear improvements, benchmarks for larger 35B models were mixed, suggesting the gains depend on model size and hardware and that further optimization may be needed there. This merge provides a foundational improvement for `llama.cpp` users looking to extract more performance from their local setups, especially with small to medium-sized models.
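For readers who want to run the same comparison themselves, an A/B pass with `llama-bench` (the benchmarking tool that ships with `llama.cpp`) is the usual pattern. This is only a sketch of that pattern, not the poster's exact methodology: the commit hash and model path are placeholders, and PR #22673 should be consulted for how MTP is enabled on master.

```bash
# Sketch of a before/after MTP comparison with llama-bench.
# <pre-merge-commit> and the model path are placeholders; see PR #22673
# for whether MTP needs an explicit flag or is on by default.
git checkout <pre-merge-commit>
cmake -B build -DGGML_CUDA=ON && cmake --build build -j
./build/bin/llama-bench -m models/qwen3.6-27b-Q4_K_M.gguf -p 512 -n 256 -ngl 99 -r 3 -o md

git checkout master
cmake --build build -j
./build/bin/llama-bench -m models/qwen3.6-27b-Q4_K_M.gguf -p 512 -n 256 -ngl 99 -r 3 -o md
```

The `-r 3` flag averages three repetitions per test, so single-run noise doesn't masquerade as a regression or a win.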
This MTP merge for `llama.cpp` is a game-changer for my 27B model runs; I've already seen noticeable speedups. It’s exactly the kind of optimization we need for more responsive local inference, though I'll be keen to re-test my larger 35B models after this update.
gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic is Out Now (r/Ollama)
A new finetuned Gemma-4 model, dubbed "Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic," has been released, specifically aiming to enhance the writing quality of the base Gemma 4 31B instruction-tuned model. This finetune focuses on producing more natural English and improved prose, making it particularly suitable for creative writing tasks, translations, and role-playing scenarios. The model is available in both Safetensors and GGUF formats, catering to a wide range of local inference engines.
For Ollama users, running this model is straightforward: the command `ollama run hf.co/llmfan46/gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic:Q4_K_M` downloads and runs the Q4_K_M quantized version. The availability in GGUF format highlights its readiness for optimized local deployment, enabling users to leverage its enhanced creative capabilities directly on consumer GPUs. This release reflects the ongoing community effort to refine and specialize open-weight models for local applications.
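Beyond the interactive CLI, the same pull-by-tag model name works with Ollama's REST API, which is handy for wiring the finetune into a writing tool. A minimal sketch, assuming a local Ollama server on its default port 11434 and `jq` installed:

```bash
# Query the finetune through Ollama's /api/generate endpoint.
# "stream": false returns one JSON object; jq pulls out the generated text.
curl -s http://localhost:11434/api/generate -d '{
  "model": "hf.co/llmfan46/gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic:Q4_K_M",
  "prompt": "Write the opening paragraph of a short story set in an abandoned lighthouse.",
  "stream": false
}' | jq -r '.response'
```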
Finally, a Gemma 4 finetune specifically for creative writing! The GGUF support for Ollama makes it dead simple to spin up, and I'm excited to see if it truly delivers on the 'natural English' promise for my storytelling projects.
Qwen3.6-35B-A3B and 9B are officially on the public Terminal-Bench 2.0 leaderboard! (r/LocalLLaMA)
The new Qwen3.6-35B-A3B and 9B models have officially made their mark on the public Terminal-Bench 2.0 leaderboard, showing impressive capability on agentic terminal and coding tasks. The little-coder × Qwen3.6-35B-A3B pairing scored 24.6% (±3.2), notably placing it above frontier entries such as Gemini 2.5 Pro on Gemini CLI (19.6%) for certain coding tasks. This public recognition underscores Qwen 3.6's potential as a highly competitive open-weight model for local development environments.
Further community testing has corroborated these findings in local environments. One particular experiment compared Qwen 3.6 variants on single-file HTML canvas animation tasks, evaluating their ability to generate functional code. Such practical, local evaluations are crucial for understanding how these models perform beyond generalized benchmarks and indicate their readiness for self-hosted coding assistance and creative generation on consumer hardware. The strong performance of Qwen 3.6 models, even against proprietary alternatives, makes them a compelling choice for users prioritizing local AI inference with advanced capabilities.
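Replicating that kind of canvas test at home takes only a couple of lines. A minimal sketch, assuming an Ollama server on the default port; the `qwen3.6:35b-a3b` tag is hypothetical, so substitute whatever tag your setup actually exposes:

```bash
# Ask the local model for a single-file canvas animation and save the result.
# The model tag is a placeholder for however you pulled Qwen 3.6 locally.
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3.6:35b-a3b",
  "prompt": "Write a complete single-file HTML page with a <canvas> bouncing-ball animation. Output only the HTML, no commentary.",
  "stream": false
}' | jq -r '.response' > canvas_test.html
# Open canvas_test.html in a browser and verify the animation actually runs.
```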
Seeing Qwen 3.6 models climb the Terminal-Bench leaderboard and beat Gemini 2.5 Pro for coding is huge. I'm already using the 35B version locally for my dev tasks, and these benchmarks confirm its value for real-world, self-hosted coding assistance.