NVIDIA Path Tracing, AMD RDNA 4m Drivers, & GPU MoE Offloading Benchmarks
This week features significant GPU advancements: NVIDIA's GDC presentation reveals faster path tracing techniques, while AMD's RDNA 4m architecture gains new open-source driver support. Additionally, a user report shows how a large MoE model reached 79 t/s on a consumer GPU by offloading MoE layers to the CPU.
RTX 5070 Ti achieves 79 t/s for Qwen3.6-35B-A3B with CPU MoE Offloading (r/LocalLLaMA)
A user detailed their experience optimizing the Qwen3.6-35B-A3B large language model (LLM) on consumer hardware: an RTX 5070 Ti GPU paired with an AMD Ryzen 7 9800X3D CPU. The setup achieved 79 tokens per second (t/s) with a 128K context window. The crucial optimization identified was the `--n-cpu-moe` flag, which keeps Mixture-of-Experts (MoE) layers on the CPU instead of the GPU.
This technique is particularly significant for running large MoE models on GPUs with limited VRAM. Qwen3.6-35B-A3B has 35 billion total parameters, but only about 3 billion are active for any given token. Because each token is routed through a small subset of experts, the expert weights are large but sparsely used. Keeping some of them in system RAM and running them on the CPU reduces VRAM pressure, leaving room on the GPU for the remaining weights and for the KV cache of a larger context window. The benchmark suggests that high-performance LLM inference is increasingly achievable on consumer-grade hardware when memory and compute are split carefully between CPU and GPU, without requiring enterprise-grade hardware.
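The VRAM trade-off above can be sketched with some back-of-envelope arithmetic. The snippet below is a minimal sketch, not the original poster's method. It assumes the flag is the llama.cpp `--n-cpu-moe N` option, which keeps the expert weights of the first N layers on the CPU. Every size in it (layer count, per-layer expert size, dense weights, KV cache, headroom) is a made-up placeholder rather than a measured value for Qwen3.6-35B-A3B, and the model filename is hypothetical.

```python
def estimate_n_cpu_moe(n_layers, expert_mib_per_layer, dense_mib,
                       kv_cache_mib, vram_mib, headroom_mib=1024):
    """Estimate how many layers' expert weights must stay in system RAM.

    All sizes are integers in MiB. The dense (non-expert) weights and the
    KV cache are assumed to live on the GPU; whatever VRAM is left is filled
    with expert weights, one layer at a time.
    """
    budget = vram_mib - headroom_mib - dense_mib - kv_cache_mib
    # Number of layers whose experts fit in the remaining VRAM (never negative).
    layers_on_gpu = max(0, min(n_layers, budget // expert_mib_per_layer))
    return n_layers - layers_on_gpu


def build_llama_server_args(model_path, ctx_size, n_cpu_moe):
    """Build an argument list for llama.cpp's llama-server (not executed here)."""
    return [
        "llama-server",
        "-m", model_path,              # path to the GGUF model file
        "-c", str(ctx_size),           # context size in tokens
        "-ngl", "99",                  # offload all layers to the GPU...
        "--n-cpu-moe", str(n_cpu_moe), # ...except experts of the first N layers
    ]


if __name__ == "__main__":
    # Placeholder numbers for a 16 GiB card; none of these are measured values.
    n = estimate_n_cpu_moe(
        n_layers=40,               # hypothetical layer count
        expert_mib_per_layer=400,  # hypothetical quantized expert size per layer
        dense_mib=2048,            # hypothetical attention + shared weights
        kv_cache_mib=3072,         # hypothetical KV cache for a long context
        vram_mib=16384,
    )
    # "qwen-moe.gguf" is a hypothetical filename; 131072 tokens = 128K context.
    print(" ".join(build_llama_server_args("qwen-moe.gguf", 131072, n)))
```

An estimate like this is only a starting point. In practice the value is usually tuned by hand: lower N step by step until the model no longer fits in VRAM, then back off.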
This benchmark suggests that distributing MoE layers intelligently across CPU and GPU is an effective way to push context limits and token rates on consumer cards. For VRAM-limited setups, the `--n-cpu-moe` flag looks like a key tuning knob for local LLM performance.
NVIDIA GDC Presentation: Path Tracing Performance Boosts Explained (r/nvidia)
NVIDIA’s recent GDC (Game Developers Conference) presentation highlighted significant advancements in path tracing technology, promising faster and more efficient real-time rendering. Digital Foundry's clips explain these new techniques, which are crucial for achieving photorealistic graphics in modern games and professional visualization. Path tracing, a computationally intensive rendering method, simulates light paths more accurately than traditional rasterization, leading to superior global illumination, reflections, and refractions.
The presentation likely covered optimizations within NVIDIA's RTX ecosystem: better RT Core utilization, enhancements to DLSS (Deep Learning Super Sampling) for path-traced scenes, or new software development kits (SDKs) and APIs that streamline path tracing integration for developers. If so, driver updates and possibly future hardware optimizations may follow. These developments point to NVIDIA's roadmap for keeping its lead in high-fidelity rendering and giving developers more powerful tools for its GPU hardware.
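To see why path tracing is costly enough that sampling and reconstruction improvements matter, consider the Monte Carlo integration at its core. The toy below is a minimal sketch and not any technique from the presentation: it estimates the integral of cos(theta) over the hemisphere (exact value pi, the irradiance under a uniform unit-radiance sky) using uniformly sampled directions. The function name and sample counts are illustrative choices.

```python
import math
import random


def estimate_cosine_integral(n_samples, rng):
    """Monte Carlo estimate of the hemisphere integral of cos(theta).

    Directions are sampled uniformly over the hemisphere, so the pdf is
    1 / (2 * pi) per steradian. Under that sampling, cos(theta) is itself
    uniformly distributed on [0, 1), so it can be drawn directly.
    """
    pdf = 1.0 / (2.0 * math.pi)
    total = 0.0
    for _ in range(n_samples):
        cos_theta = rng.random()   # cosine of the sampled direction
        total += cos_theta / pdf   # integrand divided by sampling density
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (16, 256, 4096, 65536):
        est = estimate_cosine_integral(n, rng)
        print(f"{n:6d} samples: estimate={est:.4f} error={abs(est - math.pi):.4f}")
```

The error of such an estimator shrinks roughly with the square root of the sample count, so halving the noise takes about four times as many samples. A real path tracer evaluates a far more expensive integrand per pixel, which is why better sampling, denoising, and upscaling can matter as much as raw ray throughput.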
Faster path tracing from NVIDIA is a big deal for developers targeting photorealistic graphics, and it hints that driver and SDK improvements may be on the horizon. That could raise the visual fidelity and performance ceilings of upcoming titles.
Valve Developer Improves AMD RADV/ACO Drivers for RDNA 4m Architecture (r/Amd)
A Valve developer has committed significant changes to the open-source Mesa graphics drivers for AMD GPUs, specifically RADV (the Mesa Vulkan driver for AMD GPUs) and ACO (the AMD shader compiler back end in Mesa that RADV uses). These updates target AMD's upcoming GFX11.7 / RDNA 4m architecture. The work shows ongoing progress in preparing the Linux graphics stack for this hardware, with the aim of good performance and compatibility from day one.
Continued contributions from companies like Valve to open-source AMD drivers are critical, especially for the Linux gaming and compute ecosystems. RADV and ACO are fundamental components that translate high-level graphics API calls and shaders into instructions for AMD hardware. These changes likely involve architecture-specific optimizations, bug fixes, or new feature enablement for the RDNA 4m microarchitecture, potentially including improvements in shader compilation, geometry processing, or memory management. For developers and users on Linux, such patches should translate into better performance, stability, and broader support for new hardware, and they reflect a healthy, collaborative ecosystem around AMD's GPU technology.
Seeing Valve contribute to RADV/ACO for RDNA 4m is great news for Linux users and developers, ensuring robust open-source driver support for AMD's next-gen GPUs right out of the gate. This proactive work is crucial for future hardware compatibility and performance.