Breaking Data Silos: How Private Inference Unlocks GPU ROI on Sensitive Data
In this blog
- Enter Private Inference: The Missing Link
- The ARMOR Framework: Building on Solid Governance Foundations
- How Stained Glass Transform Enables Private Inference
- Reference Architecture: Private Inference on NVIDIA GPUs
- Real-World Impact: Three Use Cases
- Additional Benefits Beyond $/outcome Optimization & Privacy
- Technical Considerations for Implementation
- Getting Started: Practical Next Steps
- The Path Forward
- Download
Enterprises are rapidly scaling NVIDIA‑powered AI infrastructure—deploying NVIDIA H100 GPUs and NVIDIA DGX™ GB300 systems, adopting NVIDIA NIM™ microservices, and building modern AI factories to turn data into insight. Yet many organizations are still early in translating this compute potential into consistent business value, particularly when sensitive or regulated data is involved. Compliance, privacy, and data‑residency requirements often limit how broadly GPUs can be shared for inference, creating operational friction and leaving valuable capacity underutilized. The opportunity is clear: if sensitive data could be used safely and at scale, organizations could significantly accelerate ROI from their existing NVIDIA investments—without compromising security or governance.
Healthcare organizations can't run AI models on patient records without extensive de-identification, which can hamper outcome value for many use cases. Financial services firms silo customer data by isolating compute across business units to maintain compliance boundaries. Multi-tenant AI platforms struggle to offer competitive services because each customer's data requires separate infrastructure.
The traditional approach forces a choice between innovation and compliance: either use sensitive data or use cost-efficient shared compute. That trade-off is no longer required. What if your GPU infrastructure could process sensitive data at full capacity while strengthening your security posture rather than compromising it?
Enter Private Inference: The Missing Link
Private inference represents a paradigm shift in how we approach AI on sensitive data. Unlike traditional approaches that rely on data movement, anonymization, or moving workloads to physically isolated environments, private inference enables AI models to process sensitive data without ever exposing that data in its original form. This not only unlocks new sensitive data tiers for use with AI models, it also breaks down inefficient AI compute silos that kill the ROI of AI investments. This isn't just incremental improvement; it's architectural transformation that is enabled by Protopia AI's Stained Glass Transform (SGT) technology.
The ARMOR Framework: Building on Solid Governance Foundations
Before diving into the technical solution, it's important to understand how private inference strengthens critical AI security and governance domains. World Wide Technology's AI Readiness Model for Operational Resilience (ARMOR), a vendor-agnostic framework developed with NVIDIA, provides a comprehensive approach to AI security and governance across six core domains, unified by the overarching principle of Cyber Resilient AI:
- Governance, Risk, and Compliance (GRC): Ensures AI operations align with regulatory requirements, organizational policies, and ethical standards, managing risks across on-premises and cloud environments.
- Model Protection and Application Security: Protects AI models from threats such as poisoning, inversion attacks, and theft, ensuring integrity and reliability throughout their lifecycle.
- Infrastructure Security: Secures the hardware and network foundation, including GPUs, DPUs, and cloud regions, to prevent unauthorized access or tampering.
- Secure AI Operations: Enables real-time monitoring and rapid response to threats, ensuring secure operation of AI platforms in interconnected systems.
- Secure Development Lifecycle (SDLC): Embeds security into the development of AI software and services, mitigating vulnerabilities like prompt injection from design to deployment.
- Data Protection: Safeguards datasets, whether stored on locally connected storage or in a cloud data lake, ensuring confidentiality, integrity, and regulatory compliance without stifling innovation.
These six domains are unified by Cyber Resilient AI: the ARMOR framework's integrating principle that ensures AI systems can withstand, adapt to, and recover from security incidents while maintaining operational continuity.
Private inference architecture, powered by Protopia AI's Stained Glass, directly strengthens multiple ARMOR domains. Let's examine the ones most impacted:
1. Data Protection Domain
ARMOR's Data Protection domain safeguards data throughout the AI lifecycle, from ingestion through model training and inference. While encryption has long been the standard for protecting data at rest and in transit, it leaves a critical gap: data must be decrypted to be processed. This means that even with encryption, sensitive prompts and context are exposed in plaintext to the model host at inference time, where they can leak via logs, memory, or observability tools.
Private inference with Protopia AI addresses this gap by transforming sensitive prompts and context into privacy-preserving representations using Stained Glass Transforms (SGTs) before the data leaves the root of trust, be it stored on locally connected storage or in cloud data lakes. Models run inference directly on the stochastically transformed data from SGTs, returning accurate results without ever exposing raw data to the host.
This aligns perfectly with ARMOR's principle of defense in depth—adding a technical control layer that works alongside encryption, access controls, and audit logging to ensure any potential unauthorized access to the inference hosts will not result in exposure of plain-text sensitive information.
2. Governance, Risk, and Compliance (GRC) Domain
ARMOR's GRC domain requires organizations to maintain clear data lineage, enforce access policies, and demonstrate compliance with regulations like HIPAA, GDPR, and SOC 2 across on-premises and cloud environments. Private inference doesn't replace these requirements; it makes them achievable while still enabling AI innovation.
With private inference, you can:
- Run AI models on regulated data without creating new copies or derivatives
- Maintain data residency requirements while enabling AI for cross-region analytics
- Provide audit trails showing that raw sensitive data was never exposed to the AI model
- Enable multi-tenant AI architectures without data commingling risks
3. Adding Defense‑in‑Depth to Infrastructure and Model Security Domains for Multi‑Tenant AI Factories
AI Factories only achieve "GPU ROI" when they can safely run many workloads through the same expensive infrastructure—especially for inference, where demand is bursty, always‑on, and rarely isolated to one team or one customer. In practice, inference becomes multi‑tenant by default: shared GPU clusters, shared serving stacks, shared routing, and shared operational tooling.
ARMOR's Model Protection domain explicitly calls out multi‑tenant AI workloads as a core AI Factory pattern. That's a signal that the security posture can't stop at "secure the model" or "secure the cluster" in isolation; multi‑tenancy forces Infrastructure Security and Model Security to be designed together.
Infrastructure Security: multi‑tenancy expands the blast radius inside the inference plane.
ARMOR emphasizes segmentation and boundary controls, and notes that segmentation also supports multi‑tenancy. Because high-performance AI fabrics are engineered first for throughput and ultra-low latency, some of the inline inspection and fine-grained internal zoning patterns expected in traditional enterprise networks can be impractical to deploy everywhere at scale. Deep east‑west inspection and perfect zoning are therefore difficult in AI HPC fabrics: bandwidth requirements and tooling constraints mean some parts of the inference data plane must be treated as a lower‑trust zone, even in well‑architected factories.
Model Security (Model Protection): multi‑tenancy multiplies the "places data can leak."
ARMOR recommends a Model Gateway as a centralized control point for policy enforcement and auditable interaction with models. That governance layer is essential, but in conventional inference the shared platform still processes raw tenant inputs. In multi‑tenant deployments, the riskiest leakage paths often aren't "model bugs," they're operational surfaces: logs, tracing/observability tools, debug captures, caches, and in‑memory request/response handling, especially when misconfigurations or overly broad administrative privileges expose those surfaces to the wrong parties.
Where Private Inference fits: reduce the sensitivity of what the shared platform ever sees.
Protopia's Stained Glass Transform (SGT) adds defense‑in‑depth by transforming sensitive inputs at the moment of egress, inside the data owner's root of trust, so the shared inference environment operates on privacy‑preserving representations rather than raw, identifiable data. This complements (not replaces) ARMOR's segmentation, identity controls, and Model Gateway governance:
- **Infrastructure Security benefit (blast radius):** if a node, container, or account in the inference plane is compromised, what's exposed is the transformed representation—not raw tenant inputs—reducing the consequence of inevitable gaps in segmentation or monitoring coverage.
- **Model Security benefit (cross‑tenant isolation):** when multiple tenants share the same model infrastructure, SGT helps prevent one tenant's sensitive inputs from becoming intelligible to other tenants or platform operators through shared operational systems, because raw inputs are never present in the serving zone.
- **Governance benefit (control plane stays intact):** the Model Gateway still determines who can invoke which models and how usage is audited; SGT makes it possible to enforce and log those controls without moving raw sensitive inputs through shared infrastructure.
Model updates remain compatible with this approach. As models evolve, the corresponding SGT transforms can be generated and rolled out as part of the deployment lifecycle, so A/B testing, shadow testing, and version rollbacks can preserve a consistent 'no‑raw‑data‑in‑the‑serving‑zone' posture even as the factory scales.
Cyber Resilient AI: The Integrating Principle
Private inference architecture embodies ARMOR's overarching principle of Cyber Resilient AI: systems that don't just prevent attacks but maintain operational continuity even under adversarial conditions.
Traditional security models assume a breach-or-no-breach binary state. Cyber Resilient AI assumes ongoing threats and designs systems to:
- Withstand attacks through defense-in-depth (SGT adds a privacy layer that works even if other controls fail)
- Adapt to evolving threats (privacy parameters can be tuned without architectural redesign)
- Recover gracefully (if the model hosting infrastructure or associated systems become compromised, any sensitive data remains protected)
Private inference with SGT means that even in a worst-case scenario (complete compromise of your inference infrastructure), sensitive prompt and context data cannot be exfiltrated, because it is never present on the system in plaintext. This is resilience through architectural design, not just perimeter defense.
How Stained Glass Transform Enables Private Inference
At the heart of private inference on NVIDIA infrastructure sits Protopia AI Stained Glass Transform—a technique that converts sensitive data into stochastic representations that preserve AI utility while protecting privacy.
The Technical Mechanics
Here's how SGT works in practice:
- Inference-Time Stochastic Transformation: When a sensitive image, document, or structured data record needs to be processed by an AI model, SGT applies a stochastic transformation, mathematically learned for the target model, before the data enters the inference pipeline. This isn't encryption or tokenization; it's a targeted noise-injection technique that creates a privacy-preserving representation understandable only by the specific model the data is intended for.
- Utility Preservation: The genius of SGT lies in its ability to retain near-perfect model utility. The learned stochastic transformation preserves the statistical features that the target AI model relies on for accurate predictions while holistically obscuring details that could identify individuals or reveal confidential information. A chest X-ray's stochastically transformed representation maintains diagnostic features while patient-identifying characteristics no longer appear in plaintext. A financial transaction record transformed with SGT preserves the fraud indicators a target model needs while obfuscating account details with targeted noise.
- Mathematically Proven Privacy: Unlike heuristic anonymization approaches, SGT holistically transforms all tokens rather than selectively masking or tokenizing some, and it provides mutual information-based privacy guarantees. The transformation ensures that even an unauthorized user with access to the system the inference model is deployed on cannot reverse-engineer the original sensitive data beyond a provable privacy bound from the stochastic representations they may uncover in memory, logs, disk, etc.
- No Model Retraining Required: Here's the operational advantage: your existing AI models, already trained and validated, work with SGT-transformed data without modification. The models were trained on clean data and continue to operate on representations that maintain the statistical properties they learned to recognize. The SGT for any AI model is trained in a post-training step, using Protopia's Stained Glass Engine, without modifying your model weights. This process runs on the same infrastructure the base model is trained or fine-tuned on, with less than 1% of the resources needed for baseline model training.
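The stochastic nature of the transformation can be illustrated with a toy sketch. This is not Protopia's algorithm or API; the per-dimension noise scale below is a hypothetical stand-in for parameters that SGT learns for a specific target model:

```python
import numpy as np

rng = np.random.default_rng()

def toy_stochastic_transform(embedding: np.ndarray,
                             learned_std: np.ndarray) -> np.ndarray:
    """Toy stand-in for a learned stochastic transform: add per-dimension
    noise so each call yields a different representation of the same input."""
    noise = rng.normal(loc=0.0, scale=learned_std, size=embedding.shape)
    return embedding + noise

# A fake 8-dimensional prompt embedding and a hypothetical learned noise scale.
embedding = np.ones(8)
learned_std = np.full(8, 0.5)

# Two transformations of the same input differ (stochastic), yet both
# remain statistically centered on the original embedding.
a = toy_stochastic_transform(embedding, learned_std)
b = toy_stochastic_transform(embedding, learned_std)
```

In the real system, the noise is not independent Gaussian jitter but a transformation trained so the target model's accuracy is preserved while the original values cannot be recovered beyond a provable bound.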
Integration with NVIDIA Infrastructure
SGT integrates seamlessly into NVIDIA's inference stack:
- NVIDIA Triton™ Inference Server: SGT operates as a pre-processing step in the inference pipeline, transforming data before it reaches model ensembles
- NVIDIA® TensorRT™: The transformation leverages GPU acceleration for minimal latency impact—typically adding only milliseconds to inference time
- NVIDIA NIM Microservices: As of NIM release 1.15.0 with prompt embeddings support, models deployed with NIM can natively receive the output of SGTs, providing privacy protection at the microservice deployment level
- NVIDIA DGX™ Platform: For on-premises deployments, SGT enables private inference across multiple tenants on shared infrastructure
The result? Your GPU infrastructure processes sensitive data at full capacity while enhancing your security posture.
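To make the embedding hand-off concrete, here is a minimal client-side sketch of packaging SGT output as prompt embeddings in a request body. The field names (`prompt_embeds`, `embed_shape`), the base64 encoding, and the model name are illustrative assumptions, not the exact NIM or vLLM schema; consult your inference server's API reference for the real format:

```python
import base64
import json

import numpy as np

# Stand-in for SGT output: a transformed prompt-embedding matrix
# (sequence length x hidden size). Values here are random placeholders.
transformed = np.random.default_rng().random((16, 4096), dtype=np.float32)

# Hypothetical request body; note that raw prompt text is deliberately
# absent -- only the protected representation crosses the boundary.
payload = {
    "model": "example-llm",  # hypothetical model name
    "prompt_embeds": base64.b64encode(transformed.tobytes()).decode("ascii"),
    "embed_shape": list(transformed.shape),
    "max_tokens": 128,
}

body = json.dumps(payload)  # would be POSTed over TLS to the inference endpoint
```

The same pattern applies whether the serving layer is NIM, Triton, or vLLM: the client serializes embeddings rather than text, and the server reconstructs the tensor before inference.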
Reference Architecture: Private Inference on NVIDIA GPUs
Let's examine a concrete reference architecture for deploying privacy-preserving inference using vLLM or NVIDIA NIM, in which sensitive prompts are transformed into embeddings before they reach the inference server (Client → Proxy → NVIDIA NIM → Proxy → Client):
Architecture Components
Data Layer: Inside the data owner's root of trust; local or in the cloud
- Data from source systems (EHR, CRM, transaction databases) sits in secure storage with encryption at rest
- Data governance policies, access controls and audit logging are preserved inside client root-of-trust
- Users generate LLM prompts client-side, incorporate context (documents, files, images) from within that sensitive data store
- Stained Glass Transform (SGT) transforms the entirety of the prompt and context into protected embeddings at the moment of egress
Transformation Layer (Stained Glass Transform)
- SGT Proxy application is deployed at the data source, within the root-of-trust, on NVIDIA GPUs using NVIDIA TensorRT for acceleration
- Configurable privacy parameters based on use case requirements
- Generates stochastic representations that preserve utility only for the target model
- Transformed embeddings are encrypted in transit to the model host
- Optionally caches transformed representations for repeated inference
Inference Layer (NVIDIA Triton)
- Off-the-shelf and custom LLMs (e.g., Llama, Mistral, Qwen), as well as traditional machine learning and computer vision models, process SGT-transformed data, not raw sensitive information
- For LLMs, compatible with any inference server that accepts prompt embeddings (e.g. vLLM)
- Unlocks GPU resources to be shared efficiently across multiple inference requests without the different inference request data appearing in plain-text on the inference server
- Model outputs generated without model or infrastructure ever "seeing" raw data
- Model outputs are encrypted in transit and decrypted within the data root-of-trust for roundtrip protection
Application Layer
- Predictions delivered to applications, dashboards, or downstream systems
- Audit trail records inference requests, model versions, and privacy parameters
- Compliance reporting demonstrates data protection throughout pipeline
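The trust boundary between these layers can be sketched with toy stand-ins for the transform and the model (nothing here is Protopia's or NVIDIA's actual API). The point of the sketch is that only the transformed representation ever crosses into the lower-trust serving zone:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

def transform_at_egress(prompt_embedding: np.ndarray) -> np.ndarray:
    """Inside the data owner's root of trust: the raw input never leaves
    this function's caller. Toy stand-in for SGT."""
    return prompt_embedding + rng.normal(scale=0.1, size=prompt_embedding.shape)

def serve(transformed: np.ndarray) -> np.ndarray:
    """Lower-trust inference zone: sees only the transformed representation.
    Toy stand-in for model inference."""
    return transformed.mean(keepdims=True)

raw = np.arange(8, dtype=float)       # sensitive input (stays client-side)
protected = transform_at_egress(raw)  # what actually crosses the boundary
output = serve(protected)             # computed without access to `raw`
```

In a production deployment, `serve` would be the NIM/Triton/vLLM endpoint, and the protected embeddings would additionally be encrypted in transit, as described above.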
Security Boundaries
The critical security insight: The inference infrastructure operates in a lower trust zone. Even if an attacker compromised the inference layer, they would only be able to access SGT-transformed representations, not the underlying sensitive data.
This architectural separation aligns perfectly with ARMOR's principle of zero trust architecture, where we assume breach and design systems to limit the blast radius.
Real-World Impact: Three Use Cases
Use Case 1: Healthcare AI at Scale (PHI Protection)
The Challenge: A major health system operates 30+ hospitals, each with AI-powered clinical decision support systems. HIPAA requires strict PHI protections, traditionally forcing each facility to maintain isolated infrastructure, driving massive GPU underutilization.
The Solution with Private Inference:
- Deploy centralized NVIDIA DGX infrastructure serving all facilities
- Apply SGT to medical images and EHR data housed in existing secure data stores managed by the health systems independently
- When providers prompt the model, all data is transformed at egress and additionally encrypted in transit
- Providers can utilize radiology AI models, sepsis prediction, and clinical NLP safely on transformed patient data
- Each facility's data maintains privacy guarantees even on shared GPU infrastructure
The Results:
- GPU utilization increased from 38% to 87%, driven by increased AI adoption and compute efficiencies; $4.2M in infrastructure cost avoidance over 3 years
- Faster model updates deployed across all facilities simultaneously
- HIPAA audit findings: zero violations related to data exposure
Use Case 2: Financial Services Fraud Detection (PII at Scale)
The Challenge: A multi-national bank operates fraud detection across 12 countries, each with distinct data residency requirements. Traditional approach required separate models and infrastructure per jurisdiction.
The Solution with Private Inference:
- Centralized fraud detection models running on NVIDIA H100 and GB300 clusters
- SGT applied to transaction data, maintaining country-specific privacy parameters
- Cross-border pattern recognition without data movement or commingling
- Near real-time inference maintained with negligible SGT overhead
The Results:
- 23% improvement in fraud detection accuracy from being able to take advantage of latest cross-border learnings
- Infrastructure consolidation from 12 to 3 regional clusters
- Compliance with GDPR, CCPA, and local regulations verified
- GPU costs reduced by 61% while improving model performance
Use Case 3: Multi-Tenant AI Platform (Competitive Isolation)
The Challenge: An AI-as-a-service (AIaaS) provider serves competing companies in the same industry. Customers demand guarantees that their proprietary data isn't commingled or used to benefit competitors. Without SGT, providers often resort to coarse hardware carving, allocating dedicated nodes or boards to single customers in order to satisfy these data separation requirements.
The Solution with Private Inference:
- AIaaS providers operate a central NVIDIA AI Factory to serve their customer base (e.g. industry, sovereign cloud)
- NVIDIA NIM microservices with integrated SGT
- Each tenant's data transformed as it leaves their trusted environment
- Transformed embeddings are sent to the AI Factory and processed by the model, providers never take custody of tenant data in its raw form
- Shared GPU infrastructure with mathematical privacy guarantees
- Per-tenant audit trails and compliance reporting
The Results:
- Reduced operator TCO by 70% vs. dedicated per-tenant deployment
- Expanded customer base to include highly regulated industries
- Differentiated market positioning based on privacy guarantees
- GPU capacity planning simplified with efficient multi-tenancy
Additional Benefits Beyond $/outcome Optimization & Privacy
While improved $/outcome and privacy protection is the primary driver for AI Factories, private inference delivers unexpected operational benefits:
Simplified Compliance Audits: When sensitive data never exists in unprotected form in your inference pipeline, compliance documentation becomes straightforward. You're not proving that you protected the data adequately; you're demonstrating that the data was never exposed, which sets a new standard for what "adequate protection" should be.
Reduced Data Preparation Overhead: Traditional de-identification workflows involve manual review, k-anonymization, and extensive validation. SGT applies consistent, automated transformations to the entirety of input data, with mathematical privacy guarantees, no human review required.
Accelerated Innovation: Data science teams can experiment with real, production-grade data without navigating lengthy compliance review processes. The privacy guarantee is built into the architecture.
Future-Proof Architecture: As regulations evolve and privacy requirements strengthen, the underlying infrastructure doesn't need redesign. Stained Glass protection can quickly be applied to any new model by creating an SGT for that model without architectural changes.
Technical Considerations for Implementation
Performance Impact
Stained Glass Transform latency is target-model dependent, but overall SGT adds minimal (<1%) latency relative to the target model's inference time:
- Image data (medical imaging, computer vision): 3-8ms
- Structured data (transactions, records): <1ms
- Unstructured text (documents, clinical notes): 5-12ms
For most inference workloads, the overhead is negligible. For ultra-low-latency applications, SGT can be applied upstream and cached.
Model Accuracy
Model accuracy impact is determined while training the Stained Glass Transform itself. This is a post-training step that follows the target model's training, in which Stained Glass Engine is used to train an SGT for the target model. During this training process, hyperparameters control the intensity of the trained transformation and tune any impact on the target model's accuracy to what is desirable. Rigorous testing across domains shows that for a variety of computer vision models, LLMs, and structured data models, inference accuracy can be retained at over 99% of the base model's accuracy.
- The key insight: AI models rely on statistical patterns, not specific data values. SGT preserves patterns while protecting the underlying raw values.
Deployment Patterns
- Sidecar Pattern: SGT deployed as a sidecar service to existing inference infrastructure.
- Integrated Pattern: SGT built directly into AI inference pipelines.
- Hybrid Pattern: Some use cases pre-transform and cache, others transform on-demand
Resource Requirements
SGTs require very little compute and can run on standard, off-the-shelf hardware. Even when GPU acceleration is used, the goal is simply to apply a lightweight transformation that produces protected representations for downstream private inference on far more powerful GPUs in the AI Factory.
For example, a medium-sized LLM (under 100B parameters) may need several 80GB GPUs to host the model itself. In contrast, its corresponding SGT can run at full context length on a single, smaller 40GB GPU within the data owner's trust zone—producing transformed outputs that feed into the multi-GPU model deployment in the AI Factory.
Getting Started: Practical Next Steps
If you're convinced that private inference addresses your GPU underutilization challenge, here's your roadmap:
Step 1: Identify Your Highest-Value Blocked Use Case (Week 1)
- Which AI use cases are currently blocked by data sensitivity?
- Where is GPU capacity sitting idle due to compliance restrictions?
- Which business units are building duplicate infrastructure to isolate data?
Work with your CISO, CIO, and Chief Data Officer to prioritize based on business impact and compliance complexity.
Step 2: Conduct a Technical Pilot (Weeks 2-6)
- Select a single use case with measurable business value
- Deploy SGT with a small subset of production data
- Measure accuracy retention, latency impact, and privacy guarantees
- Document compliance implications with your legal/compliance team
Pilot Setup Resources:
- NVIDIA LaunchPad provides cloud-based access to NVIDIA H100 infrastructure
- Protopia AI offers evaluation licenses for SGT, with out-of-the-box support for many open weight models
- WWT's Advanced Technology Center can host proof-of-concept deployments
Step 3: Build Your Business Case (Weeks 6-8)
Quantify three dimensions:
- Infrastructure Efficiency: GPU utilization improvement, capacity planning simplification
- Compliance Velocity: Reduced time-to-deploy for AI on sensitive data
- Business Enablement: New use cases unlocked, revenue opportunities enabled
Step 4: Design Your Production Architecture (Weeks 8-12)
- Integrate SGT into your NVIDIA inference infrastructure (NVIDIA Triton, NVIDIA NIM, or vLLM)
- Define privacy parameter policies aligned with ARMOR governance framework
- Establish monitoring and audit trail requirements
- Plan for scaled deployment and operational handoff
Step 5: Deploy and Measure (Weeks 12+)
- Roll out to production with clear success metrics
- Monitor GPU utilization, model accuracy, and compliance posture
- Iterate on privacy parameters based on business feedback
- Identify next use cases to expand the architecture
The Path Forward
The GPU infrastructure you've already deployed represents enormous potential that's currently constrained by the very real need to protect sensitive data. Private inference with technologies like Stained Glass Transform doesn't ask you to choose between innovation and compliance.
Instead, it reframes the question entirely: What if your compliance requirements enabled you to consolidate infrastructure, improve GPU utilization, and accelerate AI deployment?
The ARMOR framework provides the governance foundation. NVIDIA's infrastructure provides the accelerated computing power. Private inference provides the bridge between them.
Your $100M GPU investment can deliver $100M of value, or more.