Pure Storage introduces Data Stream, a full-stack solution that prepares enterprise data and connects it to Nvidia’s AI reference architecture.
Pure Storage has introduced Data Stream, an integrated hardware and software stack designed to help companies connect their data to AI applications. The solution automates the collection, cleaning, and structuring of data so that organizations can subsequently train and deploy AI models.
According to Pure Storage, data preparation consumes a large share of the time companies spend on AI projects, up to 80 percent of the total. Data Stream aims to simplify this process by automatically connecting data pipelines to the AI architecture, where storage and GPUs work together directly.
Part of the Data Platform
Data Stream is a component of the Pure Storage Data Platform and is built for enterprise inference, with the Nvidia AI Data Platform as its reference design. Data Stream supports real-time data ingestion and structuring from various sources, such as text files, PDFs, and tables. The solution offers multi-protocol access (NFS, S3, SMB) and can be integrated with vector databases on Pure Storage FlashBlade//S.
Data Stream works closely with Nvidia NeMo Retriever to convert raw data into vector representations that AI systems can use to understand context and relationships. This approach supports applications such as Retrieval Augmented Generation (RAG). Through integration with Nvidia NIM, organizations can run AI workloads on local infrastructure or in the cloud using standardized APIs.
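The retrieval step of RAG can be illustrated with a minimal sketch in plain Python (no Pure Storage or Nvidia APIs are used; the documents and toy embedding vectors are invented for illustration): embed a query, rank stored document vectors by similarity, and hand the closest match to a generator as context.

```python
import math

# Toy document store: in a real deployment the vectors would come from an
# embedding model (e.g. via NeMo Retriever) and live in a vector database.
documents = {
    "doc1": ("Quarterly sales report", [0.9, 0.1, 0.0]),
    "doc2": ("GPU cluster runbook", [0.1, 0.8, 0.3]),
    "doc3": ("Holiday schedule", [0.0, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vector, top_k=1):
    """Return the top_k (doc_id, text) pairs most similar to the query vector."""
    ranked = sorted(
        documents.items(),
        key=lambda item: cosine_similarity(query_vector, item[1][1]),
        reverse=True,
    )
    return [(doc_id, text) for doc_id, (text, _) in ranked[:top_k]]

# A query vector pointing toward the second dimension lands closest to doc2.
print(retrieve([0.2, 0.9, 0.2]))  # → [('doc2', 'GPU cluster runbook')]
```

In a production RAG pipeline the retrieved text would be prepended to the prompt of a language model, which is what lets the model answer with context it was never trained on.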
Nvidia GPUs
Additionally, Data Stream utilizes GPU-optimized pipelines based on the Nvidia RTX Pro 6000 Blackwell Server Edition and software libraries such as Nvidia Spark RAPIDS and cuVS. The combination with FlashBlade//S aims to prevent computational bottlenecks and improve data processing performance.
Finally, Data Stream processes data directly on the storage layer, which Pure Storage claims reduces the number of data movements. The output is stored in formats such as JSON, Apache Parquet, or Arrow, suitable for scalable vector storage and large-scale RAG datasets.
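As a rough illustration of what such output can look like (a standard-library Python sketch; the record layout is invented for this example, not Pure Storage's actual schema), embedding records destined for a vector store are often serialized as one JSON object per chunk:

```python
import json

# Hypothetical records: each text chunk paired with its embedding vector.
# In practice the vectors would come from an embedding model, and large
# datasets would more likely use Parquet or Arrow for columnar storage.
records = [
    {"chunk_id": "report-p1-c0", "text": "Revenue grew 12% year over year.",
     "embedding": [0.12, -0.45, 0.88]},
    {"chunk_id": "report-p1-c1", "text": "GPU utilization averaged 85%.",
     "embedding": [0.67, 0.03, -0.21]},
]

# JSON Lines: one record per line, easy to stream into a vector database.
jsonl = "\n".join(json.dumps(r) for r in records)

# Reading it back reproduces the original records exactly.
restored = [json.loads(line) for line in jsonl.splitlines()]
assert restored == records
print(f"{len(restored)} records round-tripped")
```

Parquet and Arrow serve the same role for much larger datasets: because they store columns (all embeddings together, all text together) rather than rows, scans over just the vector column avoid reading the text at all.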
