Nvidia GPUs Vulnerable to Bitflip Attack; Enabling ECC Helps

Researchers have for the first time exploited a DRAM vulnerability to cause bitflips in the GDDR6 memory of an Nvidia GPU. The impact can be significant, especially when those GPUs run LLMs.

Researchers from the University of Toronto have for the first time successfully carried out a Rowhammer attack on GPU memory. The attack, named GPUHammer, demonstrates that GDDR6 memory in graphics cards is also vulnerable to so-called bitflips.

Rowhammer is a known vulnerability in DRAM, based on physical effects in memory cells. Attackers repeatedly activate ("hammer") the same memory rows until the resulting electrical disturbance flips bits in cells in adjacent rows.

Until now, the phenomenon had mainly been studied in CPU memory types such as DDR4. The researchers focused on the GDDR6 memory of the Nvidia RTX A6000 GPU and, using CUDA code, succeeded in causing bitflips in all tested DRAM banks.

Discovered Addresses

That is striking because GDDR6 memory in Nvidia GPUs is theoretically better protected. The GPUs do not expose physical memory addresses to CUDA code, and those physical addresses are needed to trigger bitflips in a targeted way.

The researchers therefore investigated how the Nvidia driver assigns addresses, and managed to work out the physical addresses through reverse engineering. This allowed them to generate enough memory accesses to cause bitflips.

In a proof of concept, they demonstrated that a single bitflip in a deep learning model on the GPU can reduce accuracy from 80 percent to less than one percent. To do so, they targeted the exponent of 16-bit floating-point numbers in the model weights, where one flipped bit can change a value by orders of magnitude and so has a significant impact on the final result.
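To illustrate why an exponent bit matters so much, here is a minimal Python sketch that flips a single bit in the IEEE 754 half-precision encoding of a number. It is not the researchers' code: the helper name `flip_bit_fp16` and the example weight of 0.5 are assumptions made for illustration, and the flip is simulated in software rather than induced in GPU memory.

```python
import struct


def flip_bit_fp16(value, bit):
    """Flip one bit of value's IEEE 754 half-precision encoding.

    Bit 0 is the least significant mantissa bit, bits 10-14 are the
    exponent, and bit 15 is the sign. (Hypothetical helper.)
    """
    if not 0 <= bit <= 15:
        raise ValueError("bit must be in the range 0..15")
    # Pack as a 16-bit float, then reinterpret the same bytes as an integer.
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    # XOR toggles exactly one bit; convert back to a float.
    (flipped,) = struct.unpack("<e", struct.pack("<H", raw ^ (1 << bit)))
    return flipped


weight = 0.5  # hypothetical model weight

print(flip_bit_fp16(weight, 0))   # mantissa bit: 0.50048828125
print(flip_bit_fp16(weight, 14))  # top exponent bit: 32768.0
```

A flip in the mantissa nudges the weight by about 0.1 percent, while the same kind of flip in the most significant exponent bit multiplies it by 65,536. One weight of that size can dominate every computation it takes part in.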

ECC Mitigates Damage

The researchers noted that enabling ECC (Error Correction Code) via the Nvidia driver can correct the observed single bitflips. That is to be expected: the purpose of ECC is to detect and correct memory errors such as bitflips, which can also be triggered by, for example, cosmic radiation.

ECC is disabled by default on many GPUs because of its performance impact, which can be up to ten percent. However, ECC does not remove the underlying vulnerability, which stems from the physical properties of DRAM. Bitflips remain possible, but the memory can in principle correct them in time.
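The principle of correcting a single bitflip can be sketched with a textbook Hamming(7,4) code. This is only an illustration: the ECC scheme in real GPU memory protects much wider words and is not described here, and the function names below are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Codeword positions are numbered 1..7; parity bits sit at 1, 2 and 4.
    """
    bits = [0] * 8  # index 0 is unused so indices match positions
    bits[3] = nibble & 1
    bits[5] = (nibble >> 1) & 1
    bits[6] = (nibble >> 2) & 1
    bits[7] = (nibble >> 3) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]  # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]  # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]  # covers positions 4, 5, 6, 7
    return sum(bits[pos] << (pos - 1) for pos in range(1, 8))


def hamming74_decode(codeword):
    """Return (data nibble, corrected position); position 0 means no error."""
    bits = [0] + [(codeword >> (pos - 1)) & 1 for pos in range(1, 8)]
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the position of the flipped bit after a single error.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome] ^= 1  # undo the flip
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, syndrome


stored = hamming74_encode(0b1011)
corrupted = stored ^ (1 << 4)  # simulate one bitflip at position 5
print(hamming74_decode(corrupted))  # (11, 5): data recovered, error located
```

The flip still happens in the stored codeword; the decoder repairs it on read. A code like this handles only one flipped bit per codeword, so several flips in the same word would not be corrected reliably.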

Although the attack was only confirmed on the RTX A6000 with GDDR6, the researchers emphasize that the techniques used can be extended to other GPUs. Newer models with HBM or GDDR7 currently appear immune due to better error correction. The researchers call for further study and for changes to DRAM designs to structurally prevent these types of attacks.

Vulnerable GPUs

Nvidia advises users to enable ECC memory to prevent misuse. This recommendation applies to the following server GPUs:

  • Ampere: A100, A40, A30, A16, A10, A2, A800
  • Ada: L40S, L40, L4
  • Hopper: H100, H200, GH200, H20, H800
  • Blackwell: GB200, B200, B100
  • Turing: T1000, T600, T400, T4
  • Volta: Tesla V100, Tesla V100S

GPUs in workstations are also theoretically vulnerable. These include the following chips:

  • Ampere RTX: A6000, A5000, A4500, A4000, A2000, A1000, A400
  • Ada RTX: 6000, 5000, 4500, 4000, 4000 SFF, 2000
  • Blackwell RTX PRO
  • Turing RTX: 8000, 6000, 5000, 4000
  • Volta: Quadro GV100
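To check whether ECC is currently enabled on the GPUs in a system, the `nvidia-smi` tool that ships with the Nvidia driver can be queried. The sketch below wraps that query in Python. The function names are hypothetical, the rows in the usage note are made-up sample output, and the exact text `nvidia-smi` reports may differ per driver version and GPU.

```python
import shutil
import subprocess


def parse_ecc_modes(csv_text):
    """Parse 'index, name, ecc.mode.current' rows into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split off the first and last field so a comma in the name is safe.
        index, rest = line.split(",", 1)
        name, mode = rest.rsplit(",", 1)
        gpus.append(
            {"index": int(index), "name": name.strip(), "ecc": mode.strip()}
        )
    return gpus


def query_ecc_modes():
    """Return the ECC mode per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,name,ecc.mode.current",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return parse_ecc_modes(result.stdout)


for gpu in query_ecc_modes():
    print(gpu)
```

With sample input such as `0, NVIDIA RTX A6000, Disabled`, `parse_ecc_modes` returns one dictionary per GPU with its index, name and ECC mode. ECC itself is typically switched on with `nvidia-smi -e 1`, which requires administrator rights and a reboot or GPU reset before it takes effect.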

The Rowhammer attack on GPUs can cause damage, but it is also complex to execute. An attacker must already have extensive access to a system to successfully trigger bitflips. Each organization therefore has to weigh the risk against the performance cost of ECC for its own situation.