Itdaily - Nvidia CEO: ‘The solution to the RAM crisis? Work with extremely low precision’

One of the best ways to improve memory usage is to work with extremely low precision. That sounds negative, but it doesn’t have to be, according to Nvidia’s CEO.

During a press event at Computex, Nvidia CEO Jensen Huang explained how the chipmaker intends to counter the global memory shortage. He wants to do more with less to dramatically reduce memory usage.

“The key is NVFP4, a 4-bit format that doubles the number of parameters in the same memory space, complemented by AI techniques such as neural rendering and neural texture compression.”

RTX Spark computer

“One of the best ways to improve memory usage is to work with extremely low precision,” he says. “Instead of storing values in 8 or 16 bits, we compress them to four bits. This allows the same model to fit into half or a quarter of the space.”

That philosophy is not new, but Nvidia is now extending it to consumer and business PCs. RTX Spark, the chip for Windows PCs that Nvidia built together with MediaTek, relies heavily on reduced-precision arithmetic on its Tensor Cores to run local AI agents without exhausting the memory.

What makes NVFP4 different

NVFP4 is Nvidia’s proprietary 4-bit floating point format, introduced with the Blackwell architecture. Each number is stored in the E2M1 structure: one sign bit, two exponent bits, and one mantissa bit. With only sixteen possible bit patterns, a single value can take on just a handful of levels: the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, and 6, each positive or negative, for a range of -6 to 6.
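To make those levels concrete, here is a minimal Python sketch that decodes all eight non-negative E2M1 bit patterns. The function name `decode_e2m1` is hypothetical, and the bit layout (sign in the top bit, then two exponent bits, then the mantissa bit) and the exponent bias of 1 are assumptions based on the common E2M1 convention rather than anything stated in the article.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (0..15) into the value it represents.

    Assumed layout: [sign | exponent (2 bits) | mantissa (1 bit)], exponent bias 1.
    """
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the magnitude is 0 or 0.5.
        magnitude = mantissa * 0.5
    else:
        # Normal: implicit leading 1, scaled by 2^(exponent - bias).
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


# Codes 0..7 have the sign bit clear; codes 8..15 mirror them with a minus sign.
positive_levels = [decode_e2m1(code) for code in range(8)]
print(positive_levels)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The gaps between neighboring levels grow from 0.5 to 2 as the magnitude increases, which is why raw E2M1 values are too coarse on their own.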

In itself, that is far too coarse for neural networks. Nvidia solves this with a two-stage system of scaling factors. A fine-grained FP8 scaling factor (E4M3) applies per micro-block of sixteen values, and on top of that is a global FP32 scaling factor for the entire tensor.
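The two-stage scaling can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Nvidia's implementation: the names `quantize` and `dequantize` are hypothetical, the rule for choosing the global scale is one plausible choice, and the per-block scale is kept as an ordinary Python float instead of being rounded to FP8 (E4M3) as the hardware would do.

```python
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes a 4-bit value can take
BLOCK_SIZE = 16    # values per micro-block, as described in the article
FP4_MAX = 6.0      # largest E2M1 magnitude
E4M3_MAX = 448.0   # largest finite FP8 E4M3 magnitude


def round_to_level(x):
    """Snap x to the nearest E2M1 value (ties go to the smaller level; a simplification)."""
    sign = -1.0 if x < 0 else 1.0
    magnitude = min(abs(x), FP4_MAX)
    return sign * min(E2M1_LEVELS, key=lambda level: abs(level - magnitude))


def quantize(values):
    """Return (global_scale, [(block_scale, [4-bit levels]), ...])."""
    tensor_amax = max(abs(v) for v in values) or 1.0
    # Assumption: pick the global scale so the largest block scale sits at the top of the E4M3 range.
    global_scale = tensor_amax / (FP4_MAX * E4M3_MAX)
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        block_amax = max(abs(v) for v in block)
        # Would be stored in FP8 (E4M3) on real hardware; not rounded here.
        block_scale = (block_amax / FP4_MAX) / global_scale or 1.0
        step = block_scale * global_scale
        blocks.append((block_scale, [round_to_level(v / step) for v in block]))
    return global_scale, blocks


def dequantize(global_scale, blocks):
    """Reconstruct approximate values: level * block scale * global scale."""
    return [level * block_scale * global_scale
            for block_scale, levels in blocks
            for level in levels]


weights = [0.01 * i for i in range(32)]          # two micro-blocks of sixteen values
global_scale, blocks = quantize(weights)
restored = dequantize(global_scale, blocks)
worst_error = max(abs(r - w) for r, w in zip(restored, weights))
print(len(blocks), worst_error)
```

Because each block gets its own scale, the small values in the first block are quantized with a finer step than the larger values in the second block.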

Competition with MXFP4

It is precisely this smaller block size that sets NVFP4 apart from the competing MXFP4 format, which uses blocks of 32 values. By halving the block to sixteen values, NVFP4 gets twice as many scaling factors for the same tensor, so each factor fits its values more closely and the quantization error remains limited.
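Why the block size matters can be shown with a small Python sketch. It assumes a simplified rule in which each block's quantization step is the block maximum divided by 6, the largest 4-bit magnitude, and it ignores the number format in which the scale itself is stored. The helper `block_steps` and the sample data are hypothetical.

```python
def block_steps(values, block_size, fp4_max=6.0):
    """Quantization step per block: block maximum divided by the largest FP4 magnitude."""
    return [max(abs(v) for v in values[i:i + block_size]) / fp4_max
            for i in range(0, len(values), block_size)]


# 31 small values plus a single outlier at the end.
values = [0.1] * 31 + [12.0]

steps_32 = block_steps(values, 32)  # one block: the outlier sets the step for all 32 values
steps_16 = block_steps(values, 16)  # two blocks: only the second one contains the outlier

print(steps_32)  # [2.0]
print(steps_16)  # first step is about 0.0167, second is 2.0
```

With one block of 32, the step of 2.0 is so coarse that every 0.1 rounds to zero. With blocks of sixteen, the first sixteen values keep a step of roughly 0.0167 and survive quantization. The price is more scale data: one 8-bit scale per sixteen values adds 0.5 bits per value, against 0.25 bits for blocks of 32.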

“NVFP4 is not just 4-bit floating point; it is a complete tensor structure, an entire mathematical structure that can calculate with a precision of just four bits, but switches dynamically between 4, 8, 16, and even 32 bits when necessary,” says Huang. He emphasizes, however, that four bits are used as much as possible.

“The fifth-generation Tensor Cores in Blackwell process that format at the hardware level, including the grouping of elements and dynamic scaling.” According to Nvidia, this delivers up to six times higher throughput and a halving of memory usage compared to FP8, while accuracy remains close to that of FP16.

Scarce memory

Huang summarized the concrete gain for RTX Spark in one figure: “This allows us to compress the neural network we place in memory, thereby doubling the number of parameters in just 128 gigabytes.” In other words, a model that normally wouldn’t fit locally now actually runs on the device itself.
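The arithmetic behind that claim is straightforward, as the following Python sketch shows. It makes deliberately generous assumptions: the full 128 gigabytes (taken as 128 × 10^9 bytes) is available for model weights, and the small overhead of the scaling factors is ignored. The function name `max_parameters` is hypothetical.

```python
def max_parameters(memory_bytes: int, bits_per_parameter: int) -> int:
    """How many parameters fit in memory at a given storage width."""
    return memory_bytes * 8 // bits_per_parameter


MEMORY_BYTES = 128 * 10**9  # assumption: decimal gigabytes, all of it used for weights

for bits in (16, 8, 4):
    billions = max_parameters(MEMORY_BYTES, bits) / 10**9
    print(f"{bits:>2}-bit: about {billions:.0f} billion parameters")
```

Under these assumptions the same memory holds about 64 billion parameters at 16 bits, 128 billion at 8 bits, and 256 billion at 4 bits, which is the doubling relative to 8-bit storage that Huang describes.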

This is not a minor detail, but the core of Nvidia’s strategy. Local AI agents need many parameters to be useful, and memory is precisely the scarce, expensive component. By halving the memory footprint, Nvidia makes AI agents feasible on a PC instead of exclusively in the data center. The timing is no coincidence, as the worldwide shortage is putting memory prices under upward pressure.

While the market primarily looks at the availability of HBM and DRAM capacity, Nvidia consciously chooses to solve the problem partly in software and number formats. NVFP4 shifts the burden from ‘buying more memory’ to ‘using memory more intelligently.’ For a platform like RTX Spark, which promises always-on, local AI agents, this may well be the decisive advantage.