Nvidia RTX 5090 and RTX PRO 6000 Crash in Virtual Environments: $1,000 for Bug Squasher


CloudRift discovered persistent stability issues when using Nvidia’s RTX 5090 and RTX PRO 6000 cards in virtual environments.

GPU cloud infrastructure provider CloudRift uncovered serious problems with Nvidia’s RTX 5090 and RTX PRO 6000: in certain virtual environments, the GPUs eventually become completely unusable. The failure occurs seemingly at random, usually after a few days of use or while a virtual machine is starting up or shutting down. Once a card is affected, the only remedy is a full restart of the physical host machine.

The error seems to occur when the GPU is passed through to a VM via PCI passthrough in combination with VFIO and QEMU/KVM. When the GPU is released after the VM shuts down, it fails to correctly perform a Function Level Reset (FLR), leaving the card in an unrecoverable state: the GPU remains visible to the system but no longer responds to commands. Other GPU models, such as the Nvidia H100, B200, and RTX 4090, do not exhibit this problem.
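For readers who want to check their own setup, the standard Linux tools expose both whether a PCI function advertises FLR support and which reset methods the kernel will attempt. This is a generic diagnostic sketch, not CloudRift's procedure; the PCI address used here is a placeholder you would replace with your card's actual address from `lspci`.

```shell
# Hypothetical PCI address; find yours with: lspci | grep -i nvidia
GPU=0000:01:00.0

# DevCap line shows "FLReset+" if the function advertises Function Level Reset
sudo lspci -vvv -s "$GPU" | grep -i flreset

# Kernels >= 5.15 also expose the reset methods they will try (e.g. "flr bus")
cat /sys/bus/pci/devices/$GPU/reset_method
```

If `reset_method` lists `flr` but the card still wedges on release, the failure is inside the device's own reset handling rather than in the host's configuration, which matches CloudRift's description.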

No Clear Cause

According to CloudRift, various possible causes have already been ruled out, including errors in IOMMU configurations, driver bindings, kernel versions, and libvirt settings. The affected systems are based on widely used AMD EPYC Rome and Milan processors.

The problem manifests in kernel messages reporting stalled CPU cores and failed attempts to reset PCI devices. The logs show errors such as "unknown PCI header type" and timeouts during hardware reset attempts, and attempts to rebind the GPU to a driver also fail.
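The rebind attempt the article mentions is typically done through sysfs. The sketch below shows the generic sequence for handing a passed-through GPU back from `vfio-pci` to the host driver; the PCI address is a placeholder, and on an affected card these writes are exactly the steps reported to hang or time out.

```shell
# Hypothetical PCI address of the passed-through GPU
GPU=0000:01:00.0

# Release the card from the VFIO stub driver after the VM shuts down
echo "$GPU" | sudo tee /sys/bus/pci/drivers/vfio-pci/unbind

# Attempt a manual PCI reset (this is where affected cards time out)
echo 1 | sudo tee /sys/bus/pci/devices/$GPU/reset

# Try to rebind the card to the host's nvidia driver
echo "$GPU" | sudo tee /sys/bus/pci/drivers/nvidia/bind
```

On healthy hardware this round trip succeeds; on a wedged RTX 5090 or RTX PRO 6000, only rebooting the host recovers the card.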

$1,000 for Bug Exterminator

CloudRift is at its wit’s end and has set up a bug bounty program: the company is offering $1,000 to anyone who can identify the cause or provide a working solution. A Proxmox user who ran into the same problem claims that Nvidia is aware of the issue and working on a fix. We’re curious whether Nvidia will then also pocket the $1,000 for solving a bug in its own hardware.

CloudRift points out that the problem could undermine the reliability of GPU virtualization, especially in AI workloads that depend on stable and long-term computing performance.