CUDA SGEMM Bug on RTX 5090, Kernel-Fusing for SGEMV, & Radeon RX 9070 XT Price Surge

hardware · 2026-04-10

This week's top GPU news: a cuBLAS performance bug affecting batched FP32 SGEMM on the NVIDIA RTX 5090, a request for feedback on a kernel-fused CUDA optimization built on top of cuBLAS SGEMV, and ASUS raising prices on its Radeon RX 9070 XT graphics cards by up to 17.5%.

Surfacing a 60% SGEMM performance bug in cuBLAS on RTX 5090 (r/CUDA)

This post reports a significant performance bug in NVIDIA's cuBLAS library affecting FP32 SGEMM (single-precision general matrix multiply) on the new RTX 5090. While benchmarking a TMA-based (Tensor Memory Accelerator) implementation of FP32 SGEMM, a user found that `cuBLAS` was dispatching an inefficient `simt_128x32_8x5` kernel for batched FP32 SGEMM, resulting in a 60% performance degradation compared to expected throughput.

SGEMM is a foundational primitive in high-performance computing and in AI/ML training and inference, so an anomaly of this size in a core linear algebra routine matters. The finding points to a software issue, most plausibly in the `cuBLAS` kernel dispatch logic, rather than an inherent hardware limitation of the RTX 5090.

Developers who rely on `cuBLAS` for compute-intensive work should be aware that the issue could significantly reduce the throughput of their applications. Until NVIDIA releases a fix, they may need to explore alternative SGEMM implementations or specific kernel configurations to mitigate the loss. The report also underscores the value of rigorous benchmarking and community feedback in improving GPU software stacks.
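To put the reported figure in context, the arithmetic behind an SGEMM throughput comparison can be sketched as follows. This is a minimal Python sketch, not the original poster's benchmark: the matrix sizes, batch count, and timings are hypothetical placeholders, and `sgemm_flops`, `tflops`, and `slowdown_percent` are illustrative helper names. It relies only on the standard count of 2·M·N·K floating-point operations per matrix product.

```python
def sgemm_flops(m: int, n: int, k: int, batch: int = 1) -> int:
    """FLOPs for a batched (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * n * k * batch


def tflops(flops: float, seconds: float) -> float:
    """Achieved throughput in TFLOP/s for a run that took `seconds`."""
    return flops / seconds / 1e12


def slowdown_percent(observed: float, expected: float) -> float:
    """Percentage by which observed throughput falls short of expected throughput."""
    return (1.0 - observed / expected) * 100.0


# Hypothetical example: sizes and timings are placeholders, not measurements.
work = sgemm_flops(m=4096, n=4096, k=4096, batch=8)
expected = tflops(work, seconds=0.020)  # assumed time for a well-chosen kernel
observed = tflops(work, seconds=0.050)  # assumed time for the slow dispatch path
print(f"work: {work / 1e12:.2f} TFLOP")
print(f"expected: {expected:.1f} TFLOP/s, observed: {observed:.1f} TFLOP/s")
print(f"shortfall: {slowdown_percent(observed, expected):.0f}%")
```

Under these assumed timings, a run that takes 2.5 times as long as expected reaches 40% of the expected throughput, which is what a 60% degradation means in practice.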

A 60% SGEMM performance drop on a flagship GPU like the RTX 5090 is a serious problem for AI workloads. It shows why thorough benchmarking matters and why NVIDIA needs to address the cuBLAS kernel dispatch inefficiency quickly.

Kernel-fused temporal decay + importance scoring on top of cuBLAS SGEMV — looking for feedback on launch overhead (r/CUDA)

A researcher on r/CUDA is soliciting feedback on a kernel-fused CUDA optimization developed as part of a project named MARS. The project, which includes a research paper and open-source, MIT-licensed code on GitHub, targets GPU vector retrieval. It fuses temporal decay, per-item importance scoring, and streaming inserts into a single CUDA kernel that runs on top of `cuBLAS` SGEMV (single-precision general matrix-vector product).

The goal is to ease memory bandwidth bottlenecks by consolidating several processing steps into one kernel launch, so that intermediate results do not have to be written to and read back from GPU memory between steps. Kernel fusion is a well-established GPU programming technique for removing the latency and overhead of sequential kernel dispatches and repeated data access. It is particularly useful for iterative algorithms, real-time data streaming, and large-scale analytics.

Developers working on vector databases, recommendation systems, or similarity search may find the MARS code and paper a useful practical example of building fused operations around a `cuBLAS` routine. Launch overhead is the specific point on which the author is asking for feedback, so that aspect of the design should be read as a question under discussion rather than a settled result.
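The idea of fusing scoring steps into one pass can be illustrated with a CPU reference in plain Python. This is only a sketch under stated assumptions, not the MARS implementation: the exponential decay form `exp(-decay_rate * age)`, the multiplicative combination of similarity, importance, and decay, and all function and parameter names are hypothetical choices made for illustration. The unfused version materializes an intermediate list after each step, which on a GPU would correspond to a separate kernel launch and a round trip through global memory. The fused version finishes each score in a single pass.

```python
import math
from typing import List, Sequence


def dot(row: Sequence[float], query: Sequence[float]) -> float:
    """Inner product of one stored vector with the query (one output of a matrix-vector product)."""
    return sum(a * b for a, b in zip(row, query))


def score_unfused(matrix, query, importance, ages, decay_rate) -> List[float]:
    """Three passes, each writing an intermediate list (analogous to three kernel launches)."""
    sims = [dot(row, query) for row in matrix]            # pass 1: SGEMV-like step
    weighted = [s * w for s, w in zip(sims, importance)]  # pass 2: importance scoring
    return [x * math.exp(-decay_rate * t) for x, t in zip(weighted, ages)]  # pass 3: decay


def score_fused(matrix, query, importance, ages, decay_rate) -> List[float]:
    """One pass: each score is completed immediately, with no intermediate lists."""
    return [
        dot(row, query) * w * math.exp(-decay_rate * t)
        for row, w, t in zip(matrix, importance, ages)
    ]


# Hypothetical data: three stored vectors and one query (assumed values).
matrix = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
query = [1.0, 1.0]
importance = [1.0, 2.0, 0.5]
ages = [0.0, 1.0, 2.0]  # time since insertion, in arbitrary units
fused = score_fused(matrix, query, importance, ages, decay_rate=0.1)
unfused = score_unfused(matrix, query, importance, ages, decay_rate=0.1)
print(fused)
```

Both functions return the same scores; they differ only in how many passes over the data they make and how much intermediate storage they need, which is the trade-off kernel fusion targets.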

Kernel fusion is key to maximizing GPU throughput, and fusing steps like temporal decay and importance scoring with an SGEMV pass can reduce overhead for real-time vector processing. The approach aims to cut both kernel launches and memory accesses, which directly affects performance.

ASUS raises Radeon RX 9070 XT prices by up to 17.5% (r/Amd)

ASUS has announced a price increase of up to 17.5% for its Radeon RX 9070 XT graphics cards. The original report did not detail the specific rationale, but moves like this generally reflect broader pressures on manufacturing and distribution, such as fluctuating component costs, supply chain disruptions, and shifting demand.

The increase makes the Radeon RX 9070 XT, a key card in AMD's current generation of mid-to-high-range GPUs, more expensive for end users and system builders. That can influence purchasing decisions, change budgets for new builds or upgrades, and shift market dynamics as buyers compare cost-performance ratios against competing GPUs from NVIDIA.

Although this is not a technical benchmark or a driver release, pricing changes from major AIB partners directly affect the availability and perceived value of the hardware, from individual purchases to large-scale data center procurement.

A price hike of up to 17.5% on a key AMD GPU like the RX 9070 XT suggests significant shifts in bill-of-materials (BOM) costs or market demand. It's a clear signal for developers planning GPU clusters or upgrades to factor in potential cost volatility.