Technology

Uncovering the Hidden 16TB Flash Layer in NVIDIA’s Latest AI Servers

By Melissa · April 1, 2026 · 6 min read

A picture started making the rounds among Taiwanese stock traders early on January 7. It showed a NAND flash controller chip from Phison, a Taiwanese semiconductor company, inside Nvidia's recently released Vera Rubin server, which had been on stage at CES in Las Vegas the previous evening. By the time trading resumed, Phison shares had hit their daily limit. The picture reached more people than any press release, and the question it raised (is flash actually becoming part of Nvidia's AI computing stack?) turned out to be more intriguing and significant than the stock move suggested.

Platform: NVIDIA Vera Rubin (announced at CES 2026; full details at GTC March 2026)
Flash storage layer name: NVIDIA Inference Context Memory Storage Platform (CMX), powered by the BlueField-4 DPU
Flash per GPU: Up to 16TB of dedicated context memory space per GPU
Shared flash per rack: 150TB shared, high-speed memory pool per rack
Why flash, not HBM: HBM cannot accommodate the growing context memory of long-context and agentic AI workloads; flash provides far greater capacity at lower cost
Context memory explained: Long-term AI conversation history, KV cache, and multi-turn reasoning data; too large for HBM, but requiring faster access than traditional network storage
BlueField-4 role: Manages the context memory pool; enables efficient sharing and reuse of KV cache data across AI infrastructure
STX reference rack: New storage rack architecture: 16 boxes × 2 BF-4 units = 32 Vera CPUs, 64 CX-9 NICs, 64 SOCAMM modules per rack
Phison connection: Phison NAND flash controller IC spotted inside a Vera Rubin server at CES; Phison CEO K.S. Pua confirmed flash is becoming part of AI computing systems
Groq acquisition: Nvidia paid $20B to license Groq IP and hire its team; the Groq LP30 LPU is integrated into the Vera Rubin inference stack for Attention FFN Disaggregation (AFD)
Jensen Huang statement: The new storage architecture "could become the largest storage market in the world"
Vera Rubin NVL72 specs: 72 Rubin GPUs, 36 Vera CPUs; 260TB/s rack bandwidth; 3.6TB/s per GPU; 10x lower inference token cost vs. Blackwell
Reference links: NVIDIA Newsroom (Vera Rubin Platform Launch); CommonWealth Magazine (Why Nvidia Is Turning to Flash for AI Memory)

The answer is yes. But the full picture of what Nvidia is doing with flash inside the Vera Rubin platform is less a component win for any specific chip supplier than a fundamental rethinking of how AI inference systems handle memory. Jensen Huang tackled it head-on in the last ten minutes of his CES keynote, though the bandwidth figures earlier in the presentation received far more attention. Huang outlined a new storage architecture intended to store what he called "context memory": the accumulated history of an AI system's user interactions, the lengthy chains of reasoning that contemporary models produce during multi-turn conversations, and the key-value cache data that grows without bound as AI becomes more agentic and long-running.
Once stated, the problem is simple. HBM, the stacked DRAM that gives Nvidia's GPUs their remarkable bandwidth, is fast, costly, and physically limited. HBM already accounts for roughly half of the H100's manufacturing cost; for Blackwell, that share rose to about 60%. The Vera Rubin platform improves performance further, but frontier AI models require ever more context memory during inference, and that demand eventually outgrows HBM. On stage, Huang put it plainly: eventually, that context "just won't fit." Nvidia's solution is a distinct storage tier of up to 150 terabytes of shared, high-speed context memory per rack, controlled by four BlueField-4 data processing units. Up to 16 terabytes of dedicated context space can be dynamically allocated from that pool to each GPU in the rack.
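A back-of-envelope calculation shows why KV-cache context outgrows HBM. The sketch below uses purely illustrative model parameters (layer count, head count, and head dimension are assumptions, not figures from NVIDIA or any specific model), but the arithmetic is the standard KV-cache sizing formula:

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Bytes of KV cache for one sequence: a key and a value vector
    (hence the factor of 2) per layer, per KV head, per token,
    stored at 2 bytes each (fp16). All parameters are illustrative."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

per_token = kv_cache_bytes(1)                       # 327,680 bytes, ~320 KiB/token
million_ctx_gib = kv_cache_bytes(1_000_000) / 2**30  # ~305 GiB for 1M tokens
print(f"{per_token} bytes per token; {million_ctx_gib:.0f} GiB at 1M tokens")
```

Even under these modest assumptions, a single million-token session needs hundreds of gigabytes for its KV cache alone, before weights or activations, which is why a capacity tier beyond HBM becomes attractive.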

The NVIDIA Inference Context Memory Storage Platform, or CMX for short, is built around BlueField-4, Nvidia's most recent data processing unit. BlueField-4 manages the KV cache data, enabling efficient sharing and reuse across an AI infrastructure deployment. The companion STX reference rack provides a standardized hardware configuration specifying exactly how many drives, Vera CPUs, BlueField-4 DPUs, and networking components are needed to support a cluster at scale. Nvidia listed every significant storage vendor as supporting STX, including DDN, VAST Data, NetApp, Pure Storage, IBM, HPE, and Dell. A partner announcement of that breadth typically signals the formal establishment of a market category rather than a niche experiment.
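The tiering idea behind a context memory pool can be illustrated with a toy two-tier store: a small, fast tier holding the hottest sessions' KV blocks, backed by a capacity-rich slow tier, with least-recently-used demotion and promotion on reuse. This is a conceptual sketch only; the class and its methods are invented for illustration and bear no relation to NVIDIA's actual CMX interface:

```python
from collections import OrderedDict

class TwoTierKVStore:
    """Toy model of tiered context memory: a small LRU-ordered 'hbm'
    tier in front of a large 'flash' tier. Illustrative only; not an
    NVIDIA API."""
    def __init__(self, hbm_capacity):
        self.hbm = OrderedDict()      # fast tier, oldest entry first
        self.flash = {}               # capacity-rich slow tier
        self.hbm_capacity = hbm_capacity

    def put(self, session_id, kv_block):
        self.hbm[session_id] = kv_block
        self.hbm.move_to_end(session_id)          # mark most recent
        while len(self.hbm) > self.hbm_capacity:
            cold_id, cold_block = self.hbm.popitem(last=False)
            self.flash[cold_id] = cold_block      # demote LRU entry

    def get(self, session_id):
        if session_id in self.hbm:
            self.hbm.move_to_end(session_id)
            return self.hbm[session_id]
        block = self.flash.pop(session_id)        # promote on reuse
        self.put(session_id, block)
        return block
```

The point of the sketch is the access pattern the article describes: context blocks for idle sessions spill to the large tier, and a returning session pulls its state back into fast memory instead of recomputing it.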
As this develops, Vera Rubin's flash layer looks like a genuinely novel position in the AI infrastructure market. It's not a GPU. It's not DRAM. It's not a traditional SSD product that data centers buy for file storage. It was built specifically for the access patterns of AI inference: fast, variable, stateful retrieval of context data across many concurrent AI sessions. On an earnings call, Huang was candid about the size he thinks this will reach, stating that it "will create a market that has never existed before, and could become the largest storage market in the world." Whether that prediction holds will depend on how thoroughly agentic AI and long-context models come to dominate the workload, but the trend is clear.
The Phison story adds a helpful dimension. In January, Phison CEO K.S. Pua told reporters that the real question was not whether his chip would be found in an Nvidia server but whether Phison would be incorporated into the production architecture in a commercially significant way. He was right to be cautious. Appearing in a reference design is one thing; being integrated into the high-volume supply chain for what Jensen Huang called "possibly the largest storage market in the world" is quite another. Storage companies that figure out how to build specifically for AI inference access patterns, rather than adapting general-purpose storage products to a new use case, will probably take a disproportionate share of whatever market develops.
In some respects, Vera Rubin's context memory layer is the quietest announcement on a platform full of striking figures: 260 terabytes per second of rack bandwidth, 50 petaflops of NVFP4 compute per GPU, and a new inference chip architecture joining the stack through the $20 billion Groq acquisition. Next to those numbers, 16 terabytes of flash per GPU seems almost insignificant. But it addresses the limitation that will determine whether frontier AI can truly operate at the scale and duration its applications require, which makes it likely the most important design choice on the entire platform in the long run.
