Storage technology explained: AI and data storage | Computer Weekly
Artificial intelligence (AI) and machine learning (ML) promise a step change in the automation fundamental to IT, with applications ranging from simple chatbots to almost unthinkable levels of complexity, content generation and control.
Storage is a key part of AI. It supplies data for training, holds the potentially huge volumes of data generated, and serves data during inference, when the results of AI are applied to real-world workloads.
In this article, we look at the key characteristics of AI workloads, their storage input/output (I/O) profile, the types of storage suited to AI, the suitability of cloud and object storage for AI, and storage supplier strategy and products for AI.
AI and ML are based on training an algorithm to detect patterns in data, gain insight from that data and, often, trigger responses based on those findings. Those could be very simple recommendations based on sales data, such as the “people who bought this also bought” type of recommendation. Or they could be the kind of complex content we see from large language models (LLMs) in generative AI (GenAI), which are trained on vast and multiple datasets to allow them to create convincing text, images and video.
There are three key phases and deployment types to AI workloads:
Where and how AI and ML workloads are trained and run can vary significantly. On the one hand, they can take the form of batch or one-off training and inference runs that resemble high-performance computing (HPC) processing on specific datasets in science and research environments. On the other hand, AI, once trained, can be applied to continuous application workloads, such as the types of sales and marketing operations described above.
The types of data in training and operational datasets could vary from a great many small files, such as sensor readings in internet of things (IoT) workloads, to very large objects such as image and movie files or discrete batches of scientific data. File size upon ingestion also depends on the AI frameworks in use (see below).
Datasets could also form part of primary or secondary data storage, such as sales records or data held in backups, which is increasingly seen as a valuable source of corporate information.
Training and inferencing in AI workloads usually require massively parallel processing, using graphics processing units (GPUs) or similar hardware that offloads processing from central processing units (CPUs).
Processing performance needs to be exceptional to handle AI training and inference in a reasonable timeframe and with as many iterations as possible to maximise quality.
Infrastructure also potentially needs to be able to scale massively to handle very large training datasets and the outputs from training and inference. It requires fast I/O between storage and processing, and may also need to move data between locations so that processing can happen where it is most efficient.
Data is likely to be unstructured and in large volumes, rather than structured and in databases.
As we’ve seen, massively parallel processing using GPUs is the core of AI infrastructure. So, in short, the task of storage is to supply those GPUs with data as quickly as possible to ensure these very costly hardware items are used optimally.
More often than not, that means flash storage for low latency in I/O. Capacity required will vary according to the scale of workloads and the likely scale of the results of AI processing, but hundreds of terabytes, even petabytes, are likely.
Adequate throughput is also a factor, as different AI frameworks store data differently. PyTorch, for example, works with a large number of smaller files, while TensorFlow works with a smaller number of larger files. So, it’s not just a case of getting data to GPUs quickly, but also at the right volume and with the right I/O capabilities.
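To make the difference between the two I/O profiles concrete, the following is a minimal sketch of packing many small files into a few larger shards, so that a training job reads sequentially from large files instead of opening thousands of small ones. It uses only Python’s standard library. The function names, shard naming scheme and size limit are illustrative assumptions, not part of PyTorch, TensorFlow or any supplier’s product.

```python
import io
import tarfile
from pathlib import Path


def pack_into_shards(files, out_dir, max_shard_bytes):
    """Pack (name, bytes) pairs into tar shards of at most max_shard_bytes
    of payload each (a single oversized file still gets its own shard).

    Returns the list of shard paths written, in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shards = []
    tar = None
    current_bytes = 0
    for name, payload in files:
        # Start a new shard when none is open or the current one would overflow.
        if tar is None or current_bytes + len(payload) > max_shard_bytes:
            if tar is not None:
                tar.close()
            shard_path = out_dir / f"shard-{len(shards):05d}.tar"  # hypothetical naming
            tar = tarfile.open(shard_path, "w")
            shards.append(shard_path)
            current_bytes = 0
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
        current_bytes += len(payload)
    if tar is not None:
        tar.close()
    return shards


def iter_shard(shard_path):
    """Yield (name, bytes) pairs by reading one shard from start to end."""
    with tarfile.open(shard_path, "r") as tar:
        for member in tar:
            yield member.name, tar.extractfile(member).read()
```

The trade-off is that many small files stress metadata operations and IOPS, while a few large shards stress sequential throughput, which is why the storage behind each profile may need to be specified differently.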
Recently, storage suppliers have pushed flash-based storage – often using high-density QLC flash – as a potential general-purpose storage, including for datasets hitherto considered “secondary”, such as backup data, because customers may now want to access it at higher speed using AI.
Storage for AI projects will range from that which provides very high performance during training and inference to various forms of longer-term retention because it won’t always be clear at the outset of an AI project what data will be useful.
Cloud storage could be a viable consideration for AI workload data. Holding data in the cloud brings an element of portability, with data able to be “moved” nearer to its processing location.
Many AI projects start in the cloud because you can use the GPUs for the time you need them. The cloud is not cheap, but to deploy hardware on-premise, you need to have committed to a production project before it is justified.
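The rent-or-buy reasoning above can be expressed as a simple break-even calculation. This is a rough sketch only: the function name and all figures in the usage note are hypothetical placeholders, and a real comparison would also need to account for factors such as utilisation, depreciation, staffing and data movement charges.

```python
def break_even_hours(on_prem_capex, on_prem_hourly_opex, cloud_hourly_rate):
    """Return the number of GPU-hours after which buying beats renting.

    on_prem_capex: up-front hardware cost.
    on_prem_hourly_opex: running cost per hour once the hardware is owned.
    cloud_hourly_rate: rental cost per hour for equivalent cloud capacity.
    Returns None if the cloud is never more expensive per hour.
    """
    hourly_saving = cloud_hourly_rate - on_prem_hourly_opex
    if hourly_saving <= 0:
        return None
    return on_prem_capex / hourly_saving
```

With placeholder values of 300,000 for hardware, 5 per hour to run it and 30 per hour to rent the equivalent in the cloud, break_even_hours(300000, 5, 30) gives 12,000 hours. A project that cannot yet commit to that much usage has a case for starting in the cloud.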
All the key cloud providers offer AI services that range from pre-trained models, application programming interfaces (APIs) into models, AI/ML compute with scalable GPU deployment (Nvidia and their own) and storage infrastructure scalable to multiple petabytes.
Object storage is good for unstructured data, can scale massively, is often found in the cloud and can handle almost any data type as an object. That makes it well-suited to the large, unstructured data workloads likely in AI and ML applications.
The presence of rich metadata is another plus to object storage. It can be searched and read to help find and organise the right data for AI training models. Data can be held almost anywhere, including in the cloud with communication via the S3 protocol.
But metadata, for all its benefits, can also overwhelm storage controllers and affect performance. And, if the cloud is the location for object storage, cloud costs need to be taken into account as data is accessed and moved.
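As an illustration of using object metadata to select training data, here is a minimal sketch that filters objects by their user-defined metadata. It assumes the client passed in behaves like a boto3 S3 client (for example, one created with boto3.client("s3")), and relies only on its list_objects_v2 and head_object calls. The function name, bucket, prefix and metadata keys are hypothetical.

```python
def find_keys_by_metadata(s3_client, bucket, prefix, wanted):
    """Return keys under prefix whose user metadata contains all wanted pairs.

    s3_client is assumed to expose list_objects_v2 and head_object in the
    same way a boto3 S3 client does.
    """
    matches = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        for obj in page.get("Contents", []):
            # Listings do not include user metadata, so each object
            # costs one additional request to inspect.
            head = s3_client.head_object(Bucket=bucket, Key=obj["Key"])
            metadata = head.get("Metadata", {})
            if all(metadata.get(k) == v for k, v in wanted.items()):
                matches.append(obj["Key"])
        if not page.get("IsTruncated"):
            return matches
        # Continue from where the previous page stopped.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

Note that this approach issues one request per object, which is exactly the kind of metadata-heavy access that can affect performance and, in the cloud, cost. At scale, a separate metadata index or catalogue would likely be a better fit.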
Nvidia provides reference architectures and hardware stacks that include servers, GPUs and networking. These are the DGX BasePOD reference architecture and DGX SuperPOD turnkey infrastructure stack, which can be specified for industry verticals.
Storage suppliers have also focused on the I/O bottleneck so data can be delivered efficiently to large numbers of (very costly) GPUs.
Those efforts have ranged from integrations with infrastructure from Nvidia, the key player in GPU and AI server technology, via microservices such as NeMo for training and NIM for inference, to storage product validation with AI infrastructure and entire storage infrastructure stacks aimed at AI.
Supplier initiatives have also centred on the development of retrieval augmented generation (RAG) pipelines and hardware architectures to support them. RAG grounds a model’s responses in external, trusted information retrieved at query time, in part to tackle so-called hallucinations.
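The retrieval step in a RAG pipeline can be sketched in a few lines. The toy example below ranks documents against a question using bag-of-words cosine similarity and prepends the best matches to the prompt. This is a deliberate simplification: production pipelines typically use learned embeddings and a vector index rather than word counts, and all function names and the prompt wording here are hypothetical.

```python
import math
import re
from collections import Counter


def _vector(text):
    """Lower-case bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, documents, top_k=2):
    """Return the top_k documents most similar to the query."""
    q = _vector(query)
    ranked = sorted(documents, key=lambda d: _cosine(q, _vector(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, documents, top_k=2):
    """Prepend retrieved passages to the question before it goes to a model."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

From a storage point of view, the document collection being searched is the part that has to be held, indexed and served quickly, which is why suppliers position their products around this step.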
Numerous storage suppliers have products validated with DGX offerings, including the following.
DataDirect Networks (DDN) offers its A³I AI400X2 all-NVMe storage appliances with SuperPOD. Each appliance delivers up to 90GBps throughput and three million IOPS.
Dell’s AI Factory is an integrated hardware stack spanning desktop, laptop and server PowerEdge XE9680 compute, PowerScale F710 storage, software and services, and is validated with Nvidia’s AI infrastructure. It is available via Dell’s Apex as-a-service scheme.
IBM has Spectrum Storage for AI with Nvidia DGX. It is a converged but separately scalable compute, storage and networking solution validated for Nvidia BasePOD and SuperPOD.
Backup provider Cohesity announced at Nvidia’s GTC 2024 event that it would integrate Nvidia NIM microservices and Nvidia AI Enterprise into its Gaia multicloud data platform, which allows use of backup and archive data to form a source of training data.
Hammerspace has GPUDirect certification with Nvidia. Hammerspace markets its Hyperscale NAS as a global file system built for AI/ML workloads and GPU-driven processing.
Hitachi Vantara has its Hitachi iQ, which provides industry-specific AI systems that use Nvidia DGX and HGX GPUs with the company’s storage.
HPE has GenAI supercomputing and enterprise systems with Nvidia components, a RAG reference architecture, and plans to build in NIM microservices. In March 2024, HPE upgraded its Alletra MP storage arrays to connect twice the number of servers and provide four times the capacity in the same rackspace, with 100Gbps connectivity between nodes in a cluster.
NetApp has product integrations with BasePOD and SuperPOD. At GTC 2024 NetApp announced integration of Nvidia’s NeMo Retriever microservice, a RAG software offering, with OnTap customer hybrid cloud storage.
Pure Storage has AIRI, a flash-based AI infrastructure certified with DGX and Nvidia OVX servers and using Pure’s FlashBlade//S storage. At GTC 2024, Pure announced it had created a RAG pipeline that uses Nvidia NeMo-based microservices with Nvidia GPUs and its storage, plus RAGs for specific industry verticals.
Vast Data launched its Vast Data Platform in 2023, which marries its QLC flash-and-fast-cache storage subsystems with database-like capabilities at native storage I/O level, and DGX certification.
In March 2024, hybrid cloud NAS maker Weka announced a hardware appliance certified to work with Nvidia’s DGX SuperPOD AI datacentre infrastructure.
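Headline figures like those quoted for the DDN appliance above (up to 90GBps and three million IOPS each) can be turned into a rough sizing estimate. The sketch below works out how many appliances would be needed to feed a given number of GPUs, taking whichever of throughput or IOPS is the tighter constraint. The per-GPU demand figures are hypothetical inputs that would need to be measured for a real workload, and because the supplier figures are “up to” values, the result is best read as a lower bound.

```python
import math


def appliances_needed(gpus, gbps_per_gpu, iops_per_gpu,
                      appliance_gbps=90.0, appliance_iops=3_000_000):
    """Estimate the minimum number of storage appliances for a GPU cluster.

    gbps_per_gpu and iops_per_gpu are assumed per-GPU demand figures.
    The defaults for appliance_gbps and appliance_iops are the peak
    figures quoted in the article for one appliance.
    """
    by_throughput = math.ceil(gpus * gbps_per_gpu / appliance_gbps)
    by_iops = math.ceil(gpus * iops_per_gpu / appliance_iops)
    # The tighter of the two constraints decides; always at least one unit.
    return max(by_throughput, by_iops, 1)
```

For example, with 256 GPUs each assumed to need 2GBps and 50,000 IOPS, throughput requires six appliances and IOPS requires five, so appliances_needed(256, 2.0, 50_000) returns 6. A workload built from many small files could just as easily make IOPS the limiting factor.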
A source: www.computerweekly.com/feature/Storage-technology-explained-AI-and-the-data-storage-it-needs