Qwen 3.6, llama.cpp Speculative Decoding, Deepseek TileKernels for Local AI on Consumer GPUs
This week highlights Qwen 3.6's strong showing in local inference, where llama.cpp with speculative decoding delivers fast multimodal performance on consumer GPUs. Deepseek also ships new performance-oriented kernels and an updated expert-parallel communication library for open MoE models, further empowering local AI development.
An Overnight Stack for Qwen3.6–27B: 85 TPS, 125K Context, Vision — on One RTX 3090 (r/LocalLLaMA)
This report details a highly optimized local deployment stack for the Qwen 3.6 27B model, demonstrating remarkable performance on a single consumer-grade NVIDIA RTX 3090 GPU. The "Overnight Stack" achieves 85 tokens per second (TPS) while sustaining a 125K-token context window. A key highlight is the model's integrated vision support, a notable milestone for running multimodal AI locally on widely available hardware.
The setup showcases a practical approach to self-hosting large language models, pushing the limits of a single high-end consumer GPU. The combination of a large context window and multimodal support positions Qwen 3.6 as a strong contender for local AI applications ranging from complex coding assistants to advanced data science tasks, and makes a strong case for replacing costly cloud subscriptions with powerful, private, self-managed inference.
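Fitting a 125K-token context into 24 GB of VRAM alongside the model weights comes down largely to KV-cache size, which is why setups like this typically lean on grouped-query attention and a quantized cache (llama.cpp exposes this via its `--cache-type-k`/`--cache-type-v` options). A back-of-the-envelope sketch of the arithmetic, using hypothetical model dimensions rather than Qwen's published config:

```python
# Back-of-the-envelope KV-cache sizing for a long context window.
# The model dimensions below are HYPOTHETICAL placeholders, not the
# published Qwen specs -- substitute real config values to size your own.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    """Total size of the K and V caches across all layers, in bytes."""
    # 2x for keys and values, stored per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical 27B-class config: 48 layers, 8 KV heads (GQA), 128-dim heads.
fp16 = kv_cache_bytes(48, 8, 128, 125_000, 2)  # FP16 cache, 2 bytes/elem
q8   = kv_cache_bytes(48, 8, 128, 125_000, 1)  # ~8-bit quantized cache

print(f"FP16 KV cache: {fp16 / 2**30:.1f} GiB")
print(f"Q8   KV cache: {q8 / 2**30:.1f} GiB")
```

Even under these illustrative numbers, quantizing the cache halves its footprint, which is often the difference between a long context fitting on one card or not.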
Achieving 85 TPS with a 125K context and vision on a single RTX 3090 is a game-changer for local AI. This kind of optimization makes self-hosting truly competitive with cloud services for many use cases.
Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post (r/LocalLLaMA)
This appreciation post highlights the significant performance improvements achieved when running the new Qwen 3.6 27B model locally using `llama.cpp` combined with speculative decoding. The author ran experiments to quantify the speed gains, confirming that speculative decoding substantially accelerates inference of large language models on local hardware. The combination allows users to experience near real-time responses, making interactive local AI applications far more viable.
The synergy comes from pairing `llama.cpp`'s efficient CPU/GPU inference with speculative decoding, in which a small, fast draft model proposes several tokens that the large target model then verifies in a single batched forward pass; every accepted draft token is one the big model did not have to generate step by step. For users on consumer-grade hardware, this translates into a markedly more responsive, fluid experience with models like Qwen 3.6 27B, and, together with the privacy and cost benefits, makes local AI an increasingly attractive alternative to cloud-based solutions.
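The draft-then-verify loop can be sketched in a few lines. Both "models" below are deterministic toy functions standing in for real LLMs; the point is the control flow, not the modeling:

```python
# Toy simulation of speculative decoding's accept/reject loop.
# draft_model and target_model are deterministic stand-ins, not real LLMs.

def draft_model(token):   # small, fast, sometimes wrong
    return (token * 3 + 1) % 97

def target_model(token):  # large, slow, authoritative
    return (token * 3 + 1) % 97 if token % 5 else (token + 7) % 97

def speculative_step(prompt_last, k=4):
    """Draft k tokens, then keep the prefix the target model agrees with.

    Returns (accepted_tokens, n_target_passes). In a real system the k
    verifications happen in one batched forward pass of the target model,
    which is where the speedup comes from.
    """
    drafts, tok = [], prompt_last
    for _ in range(k):                # cheap draft phase
        tok = draft_model(tok)
        drafts.append(tok)

    accepted, tok = [], prompt_last
    for d in drafts:                  # verify drafts in order
        t = target_model(tok)
        accepted.append(t)            # the target's token is always kept
        if t != d:                    # first mismatch ends the step
            break
        tok = t
    return accepted, 1                # one batched target pass
```

When the draft model agrees with the target, a single target pass yields several tokens; when it disagrees, the output is still exactly what the target model alone would have produced, which is why the technique is lossless.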
Speculative decoding with `llama.cpp` fundamentally changes the local inference experience. It's fantastic to see Qwen 3.6 27B benefiting so much from this acceleration technique, making powerful models feel incredibly fast on personal machines.
Deepseek has released DeepEP V2 and TileKernels. (r/LocalLLaMA)
Deepseek AI, a notable contributor to open-weight models, has announced the release of DeepEP V2 and TileKernels. DeepEP is DeepSeek's expert-parallel communication library for Mixture-of-Experts (MoE) models, providing high-throughput, low-latency all-to-all GPU kernels for the dispatch and combine phases of MoE inference and training; V2 brings an updated iteration of the library. Efficient expert-parallel communication is crucial for scaling the growing landscape of MoE architectures in open-source AI.
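The dispatch/combine pattern that such a library accelerates is easy to show in miniature. The sketch below is single-process NumPy with illustrative sizes and random weights, not DeepSeek's actual implementation, which performs the dispatch step as an all-to-all exchange across GPUs:

```python
import numpy as np

# Minimal single-process sketch of MoE top-k routing, dispatch, and combine.
# All sizes and weights are illustrative, not DeepSeek's.
rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, top_k = 6, 8, 4, 2

tokens = rng.standard_normal((n_tokens, d_model))
router_w = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

# Router: pick top-k experts per token, with softmax gate weights over them.
logits = tokens @ router_w
topk = np.argsort(logits, axis=1)[:, -top_k:]
gates = np.exp(np.take_along_axis(logits, topk, axis=1))
gates /= gates.sum(axis=1, keepdims=True)

# Dispatch: group tokens by expert (the all-to-all step on real clusters).
# Combine: each expert processes its tokens; outputs are gate-weighted sums.
out = np.zeros_like(tokens)
for e in range(n_experts):
    rows, slots = np.nonzero(topk == e)
    if rows.size:
        out[rows] += gates[rows, slots, None] * (tokens[rows] @ experts[e])
```

On a multi-GPU deployment, the dispatch and combine steps become cross-device communication, which is exactly the traffic a library in this space is built to make fast.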
Accompanying this is TileKernels, a collection of optimized kernels engineered to accelerate inference. These kernels can significantly improve the speed and efficiency of running large language models, which is particularly beneficial for local inference on consumer GPUs and in self-hosted environments. Both projects are available on GitHub, letting developers integrate these advancements directly into their own stacks and fostering innovation in local AI acceleration.
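The core idea behind tile-based kernels can be illustrated without a GPU: compute the output matrix in small blocks so each block of inputs is loaded once into fast memory and reused many times. The NumPy version below is for intuition only; real kernels do this on-device with shared memory and tensor cores:

```python
import numpy as np

def tiled_matmul(a, b, tile=32):
    """Blocked matrix multiply -- a CPU-side illustration of tiling.

    Each (tile x tile) block of the inputs is reused across a whole block
    of outputs before moving on, which is the cache/shared-memory win that
    GPU tile kernels exploit.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                out[i:i+tile, j:j+tile] += (
                    a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
                )
    return out
```

NumPy slicing handles ragged edges automatically, so the result matches a plain `a @ b` for any shapes; on a GPU, tile sizes are instead chosen to match shared-memory capacity and warp/tensor-core shapes.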
Deepseek's release of TileKernels is a direct boost for local inference acceleration. Combined with DeepEP V2, it pairs faster kernels with faster MoE communication, a powerful combination for developers working with the latest open-weight models.