"AI," which is to say Generative Artificial Intelligence (GenAI) based on large language models (LLMs), is transitioning from technology demo to the production data center. Server and storage virtualization, cloud and containers have all followed this product lifecycle, moving from tech demo to test and development (test/dev) and finally to production. Most often, little consideration is given to data protection until the move from test/dev to production begins.

GenAI is currently moving past the tech demo stage, where there is backup and recovery of some critical components and in-built redundancy, but little to no integration with enterprise data protection platforms. The industry is sorting out best practices and the technology is often running on a highly customized, performance-oriented platform. Backup and recovery is the last chapter in the manual and usually consists of a method to dump critical databases to a flat file and maybe an online manual page reference for "things to try" if the database becomes corrupted. 

The AI avalanche and the training and implementation of LLMs are generating huge amounts of data that are not being protected. According to Enterprise Strategy Group's (ESG) white paper, "Reinventing Backup and Recovery with AI and ML", 65% of survey respondents admit to backing up less than 50% of their AI-related data.

AI workloads share the same vulnerabilities to natural disasters and hardware failure as other workloads. The increasing value placed on these workloads and data will make them a priority target for cyber criminals. 

With the exponential growth and investment in this technology, the transition to Production is already becoming real. Dell Technologies is at the forefront of blueprinting an end-to-end, enterprise-grade solution with its AI Factory. The Dell AI Factory leverages NVIDIA's AI infrastructure and software suite, running on a Dell hardware and software stack to provide a modular, enterprise-grade environment to run production GenAI workloads. 

 

As part of this effort, Dell has incorporated data protection as a key infrastructure component that needs to be integrated with the stack. Leveraging its flagship PowerProtect Data Manager software and PowerProtect DD appliances, Dell is solving the challenges of identifying and protecting the critical data and workloads that make up a GenAI instance.

Types of data 

Looking at a typical LLM instance we can identify three main types of data associated with GenAI processing: 

  1. Training data sources are the large data lakes that the LLM is trained on. Training data sets are usually extremely large and often unstructured. The size of training data is growing exponentially as more sophisticated models require orders of magnitude more data. This data usually starts out as data created and stored for other purposes and is then replicated and repurposed to train AI, so these data sources are presumed to be protected within their own scope.
  2. AI models are the results of training. The models themselves, consisting of all of the information, nodes and weightings that comprise the neural networks that generate the responses, are high-value assets. They are the work product of the time, compute and power resources that go into a training run.
  3. Prompts and responses record the user interactions with the LLM, including what inputs were received and what answers were generated from those inputs. This information is useful for evaluating the accuracy of the model and identifying areas for additional training. Depending on the industry and data set, this information may be subject to privacy and regulatory compliance requirements that could drive protection and retention via backup.

Driving concerns 

There are multiple reasons to protect AI workloads.  Most customers will need to retain AI workload data for one or more of the following: 

  • Cyber resiliency and system protection – Overall protection of this data from events ranging from hardware failure to cyber-attacks.
  • Compliance – Legal compliance under a wide range of legal frameworks such as HIPAA or GDPR may drive data protection and/or long-term retention of sensitive data.
  • Legal protection – The need to retain and protect proprietary data or intellectual property to protect against IP infringement or copyright violation.
  • Workload state – Maintaining the state of AI workloads to enable recovery to a previous state to remediate issues or recover from a bad training event.
  • Dataset reconstruction – Reconstructing the datasets that were used to generate a model after training data from multiple sources, including the cloud, was consolidated.

Data types and protection strategies 

We will use Dell's AI Factory as a model. Its reference architecture is Kubernetes running over Ubuntu Linux on Dell PowerEdge servers with Dell PowerScale on the back end. Other solutions will vary, but the overall requirements and protection strategies will align with the model we are describing.

Reference Architecture for AI
Source: https://infohub.delltechnologies.com/en-us/t/dell-scalable-architecture-for-retrieval-augmented-generation-rag-with-nvidia-microservices-1/

 

Infrastructure – The server infrastructure and the associated cluster state and metadata are protected via Linux filesystem agents, plus a scripted dump and subsequent sweep of the databases that store the configuration and metadata, so that the cluster and Kubernetes environment can be rebuilt as needed.

Kubernetes Containers and Persistent Volumes – There will likely be stateful containers that need to be backed up on a regular basis, and the actual LLM and associated vector databases will be located on persistent volumes. These should be backed up through a data protection solution capable of protecting the Kubernetes container environment (see "Thoughts on Kubernetes Data Protection" for a deep dive). Additional protection for the persistent volumes, especially during training, should be provided by frequent snapshots of the backend storage. The RPO is determined by how much training data or query/response information can tolerably be lost, typically fifteen minutes.
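To make the RPO idea concrete, here is a minimal sketch in Python that checks a list of snapshot timestamps against a recovery point objective. The function name `rpo_violations` and the sample timestamps are hypothetical, and the fifteen-minute default simply mirrors the typical figure mentioned above. It is not part of any Dell or Kubernetes tooling.

```python
from datetime import datetime, timedelta


def rpo_violations(snapshot_times, rpo=timedelta(minutes=15), now=None):
    """Return (earlier, later) pairs whose gap is larger than the RPO.

    snapshot_times: datetimes of completed snapshots, in any order.
    now: optional current time, used to check exposure since the last snapshot.
    """
    times = sorted(snapshot_times)
    if now is not None:
        times.append(now)
    # Compare each snapshot with the one that follows it.
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > rpo]


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)  # illustrative timestamps only
    snaps = [start, start + timedelta(minutes=15), start + timedelta(minutes=45)]
    for earlier, later in rpo_violations(snaps):
        print(f"gap of {later - earlier} after {earlier:%H:%M} exceeds the RPO")
```

A check like this could run against whatever snapshot catalog the backend storage exposes. Any gap it reports is a window in which more data than the agreed tolerance could have been lost.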

Databases – Current AI solutions run on open-source databases, many of them NoSQL, designed to handle large data volumes. Dell's architecture runs Elasticsearch, PostgreSQL and MongoDB for storing and managing the data used for training models, as well as the models themselves and their configurations. These databases should be protected by periodic backups, either via an agent or dump and sweep, as well as snapshots that provide filesystem crash-consistent rollback points with better RPOs when needed.
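As a rough illustration of the "dump" half of dump and sweep, the Python sketch below builds the command lines for `pg_dump` and `mongodump` so that each database lands as a timestamped flat file in a staging directory. The function name, staging path and database names are assumptions for illustration and do not come from Dell's reference architecture. Elasticsearch is left out because it is normally protected through its own snapshot mechanism rather than a flat-file dump.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_dump_commands(staging_dir, pg_db, mongo_db, stamp=None):
    """Return argv lists that dump PostgreSQL and MongoDB into staging_dir.

    stamp: optional timestamp string for the file names (defaults to UTC now).
    The commands are only built here, not executed.
    """
    stamp = stamp or datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    staging = Path(staging_dir)
    pg_file = staging / f"{pg_db}-{stamp}.pgdump"
    mongo_file = staging / f"{mongo_db}-{stamp}.archive.gz"
    return [
        # pg_dump custom format: compressed, restorable with pg_restore
        ["pg_dump", "--format=custom", f"--file={pg_file}", pg_db],
        # mongodump single-file archive, gzip-compressed
        ["mongodump", f"--db={mongo_db}", f"--archive={mongo_file}", "--gzip"],
    ]


if __name__ == "__main__":
    # Hypothetical staging path and database names
    for cmd in build_dump_commands("/backup/staging", "ragmeta", "prompts"):
        print(" ".join(cmd))
```

In practice each command would be run with something like `subprocess.run(cmd, check=True)`, with connection and credential options added as the environment requires. The "sweep" is then the regular filesystem backup of the staging directory by the data protection agent.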

Dell for AI data protection 

Dell's solution for data protection in its AI Factory is PowerProtect Data Manager (PPDM) software, which writes to PowerProtect Data Domain (PPDD). The combination of Dell's modern data protection software and industry-leading protection storage delivers the required performance and functionality to securely protect the stack and workloads.

 

Why PowerProtect for AI workloads

  • Cloud-ready architecture with multicloud integration
  • Cyber resilient native immutability
  • Orchestrated and automated recovery

PowerProtect Data Manager

  • Robust data protection solution tailored for Kubernetes environments
  • Ensures the safety of both Kubernetes resources and persistent data (PVCs)
  • Defines and enforces data protection policies directly through Kubernetes APIs
  • Discovers all resources and PVCs in the Kubernetes cluster
  • Flexible restore across diverse environments, including on-premises and cloud deployments
  • REST API first architecture

PowerProtect Data Domain

  • Fast performance for both traditional and modern workloads
  • Efficient operation with low power, cost and cooling footprint
  • Secure operational and cyber resilience via data immutability and integrity
  • CPU-centric architecture and unique design innovations drive faster performance and lower latency: faster synchronization in HA configurations, faster compression, faster DDR5 memory, faster software-defined NVRAM and a faster SAS4 SSD cache

Customers looking to bundle this solution into an all-in-one integrated appliance can look to Dell's DM5500 Data Manager Appliance, which bakes the functionality of PPDM and PPDD into a fully integrated appliance. 

Conclusion 

To say that AI is a "rapidly evolving landscape" is a severe understatement. As GenAI makes its way from the lab to the production environment, data protection needs to be addressed. Regardless of one's preferred vendor, Dell's reference architecture and the white paper from ESG provide a solid starting point for data protection architects and administrators looking to better understand the nature of the workloads and how to protect them.
