Enterprise video intelligence: A new demand  

Enterprise AI has crossed a threshold. After years of experimentation, organizations are now deploying production AI systems at scale, and two persistent barriers continue to slow them down: GPU supply constraints and infrastructure complexity. Demand for accelerators has far outpaced supply, while the engineering overhead of managing GPU-dense infrastructure is nontrivial for most IT teams.

A rapidly growing workload category is making this tension acute: video intelligence. Enterprises are sitting on vast archives of surveillance footage, recorded meetings, and training sessions, generated faster than they can be reviewed. Intel's Video Search and Summarization (VSS) application transforms that passive archive into searchable, summarized intelligence, combining computer vision, vision language models (VLMs), large language models (LLMs), and retrieval-augmented generation (RAG) into a single, deployable solution.

The traditional answer has been GPU-heavy infrastructure. But for the majority of VSS workloads, which involve inference at moderate concurrency with models in the 7B–20B parameter range, there is a better path. Intel® Xeon® 6 processors, deployed on Red Hat OpenShift with KServe, deliver production-grade VSS on CPU infrastructure.

This blog demonstrates that claim with a real deployment: VSS, an open-source, microservices-based solution from Intel's Edge AI Libraries, with VLM and LLM inference served on Xeon 6 via KServe on Red Hat OpenShift.

Video search and summarization: From raw footage to searchable intelligence

Intel's Video Search and Summarization (VSS) application, part of the Intel Edge AI Libraries on the Open Edge Platform, is a modular, microservices-based solution that transforms raw video into searchable, summarized intelligence. Built around Intel's AI systems and Edge AI inference microservices catalog, VSS is validated on Intel Xeon processors and designed for flexible deployment from edge to cloud.

Three operating modes

Video Search: Users submit natural language queries against a video collection. The pipeline uses multimodal embeddings and a LangChain-based RAG backend to retrieve semantically relevant video segments, ranked by relevance and returned with temporal metadata. Target use cases include security incident detection, media content location, and compliance auditing.
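
To make the interaction concrete, here is a minimal Python sketch of a search client. The endpoint path, port, and response fields are hypothetical placeholders, since the actual REST contract is defined by the deployed VSS services:

```python
import requests

# Hypothetical VSS search endpoint; the real path, port, and schema
# are defined by the deployed VSS services, not by this sketch.
VSS_SEARCH_URL = "http://vss-pipeline-manager:3000/search"

def search_videos(query: str, top_k: int = 5) -> list[dict]:
    """Submit a natural-language query and return ranked video segments."""
    resp = requests.post(
        VSS_SEARCH_URL,
        json={"query": query, "top_k": top_k},
        timeout=30,
    )
    resp.raise_for_status()
    # Each hit is assumed to carry a video ID, a relevance score,
    # and temporal seek points (start/end, in seconds).
    return resp.json()["results"]

for hit in search_videos("person entering the loading dock after hours"):
    print(f"{hit['video_id']}  score={hit['score']:.2f}  "
          f"{hit['start_s']}s-{hit['end_s']}s")
```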

Video Summarization: The pipeline processes video through VLMs for frame-by-frame captioning, then routes captions to an LLM for a final narrative summary. An Audio Transcription microservice (using OpenAI Whisper) transcribes the audio channel in parallel, providing additional context to both the VLM and the search index. Use cases include executive briefing summaries, recorded meeting reviews, and training content digestion.

Combined Mode (Video Search and Summarization): The pipeline generates summaries first, then indexes summary embeddings rather than raw frame-level embeddings. The result is more efficient storage and retrieval, and higher contextual relevance in search results. This is the fullest expression of the VSS capability and the mode that most directly showcases the Xeon 6 + OpenShift inference architecture.

Pipeline architecture

The VSS pipeline is orchestrated by a central Pipeline Manager microservice (NestJS-based), which coordinates all downstream services asynchronously, supporting batching and parallel execution for performance optimization.
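
The Pipeline Manager itself is NestJS, but the fan-out pattern it implements is easy to illustrate. The Python sketch below uses stand-in coroutines in place of the real microservice calls:

```python
import asyncio

# Hypothetical stand-ins for calls to the real VSS microservices.
async def caption_chunk(chunk_id: int) -> str:
    await asyncio.sleep(0.1)   # simulates a VLM inference call
    return f"caption for chunk {chunk_id}"

async def transcribe_audio(video_id: str) -> str:
    await asyncio.sleep(0.2)   # simulates a Whisper transcription call
    return f"transcript of {video_id}"

async def process_video(video_id: str, n_chunks: int) -> None:
    # Fan out: caption every chunk and transcribe audio concurrently,
    # mirroring the Pipeline Manager's batched, parallel execution.
    captions, transcript = await asyncio.gather(
        asyncio.gather(*(caption_chunk(i) for i in range(n_chunks))),
        transcribe_audio(video_id),
    )
    print(f"{len(captions)} captions + transcript ready for {video_id}")

asyncio.run(process_video("warehouse-cam-01.mp4", n_chunks=8))
```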

When a video is submitted, the Video Ingestion service, built on Intel® Deep Learning Streamer, decodes and chunks the video, extracts frames at configurable intervals, and optionally runs object detection (YOLOv8). Frames and metadata are sent to the VLM microservice, which, in this deployment, is a KServe InferenceService running on Xeon 6 in Red Hat OpenShift. The VLM generates natural language captions for each video chunk, optionally enriched by object detection metadata and audio transcriptions. For this VSS deployment, Red Hat has validated and made available two quantized VLM models through its model catalog: RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8 and RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w4a16, both optimized to run on Xeon 6 via AMX.
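
Because the VLM is exposed through an OpenAI-compatible endpoint, captioning a frame reduces to a standard chat-completions call. A minimal sketch, with the KServe route URL as a placeholder for your cluster:

```python
import base64
from openai import OpenAI

# Point the standard OpenAI client at the KServe route for the VLM
# InferenceService; the URL below is a placeholder for your cluster.
client = OpenAI(base_url="http://qwen-vl.example.cluster/v1", api_key="none")

with open("frame_0042.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    # Served model name may differ per deployment configuration.
    model="RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens in this frame."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
        ],
    }],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```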

The LLM microservice, also served via KServe on Xeon 6, aggregates the captions into a coherent final summary. In parallel, a Multimodal Embedding microservice converts captions, audio transcriptions, and frame metadata into vector representations stored in a VDMS (Visual Data Management System) vector database for semantic retrieval. When a user submits a query, the Video Search backend embeds the query, performs semantic similarity search against VDMS, and returns ranked video segments with temporal seek points.
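
The retrieval step is ordinary vector similarity. The sketch below uses a toy in-memory index and a deterministic dummy embedding so it runs standalone; in the real pipeline, the embeddings come from the Multimodal Embedding microservice and the index lives in VDMS:

```python
import numpy as np

# Toy embedding used only to make the sketch runnable; the real
# pipeline calls the Multimodal Embedding microservice instead.
def embed(text: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# In-memory stand-in for the VDMS vector database:
# each entry is (video_id, start_seconds, unit vector).
index: list[tuple[str, float, np.ndarray]] = []

def add_segment(video_id: str, start_s: float, summary: str) -> None:
    index.append((video_id, start_s, embed(summary)))

def search(query: str, top_k: int = 3) -> list[tuple[float, str, float]]:
    q = embed(query)
    # Cosine similarity reduces to a dot product on unit vectors.
    scored = [(float(v @ q), vid, t) for vid, t, v in index]
    return sorted(scored, reverse=True)[:top_k]

add_segment("cam01.mp4", 0.0, "forklift moves pallets near bay 3")
add_segment("cam01.mp4", 120.0, "a person enters through the side door")
print(search("someone entering the building"))
```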

All inference microservices expose OpenAI-compatible APIs, ensuring portability across backends. The modular design allows independent swapping of embedding models, VLMs, LLMs, or vector databases without changing the orchestration layer.

Key architecture details

  • VLM and LLM inference are served by Intel Xeon 6 via KServe on Red Hat OpenShift.
  • This makes the inference tier enterprise-grade, scalable, and governed, not a standalone edge process.
  • OpenAI-compatible APIs across all inference microservices ensure portability and future flexibility.

Figure 1: The Video Search and Summarization pipeline. VLM and LLM inference are served by Intel Xeon® 6 running on Red Hat OpenShift via KServe, making the inference tier enterprise-grade, scalable, and governed. All inference microservices expose OpenAI-compatible APIs.

Purpose-built for enterprise verticals

VSS addresses high-value use cases across industries where video is generated faster than it can be reviewed. The combination of semantic search and automated summarization makes previously inaccessible archives actionable:

  • Security & Surveillance: Semantic search across hours of footage to identify incidents, suspicious activity, or patterns without manual review
  • Media & Entertainment: Rapid location of specific scenes in large content archives; automated compliance checks
  • Healthcare: Search and retrieve key moments from recorded procedures, training sessions, or telehealth consultations
  • Legal & Compliance: Pinpoint evidence or verify claims within video records, supporting investigations and audits
  • Education & Training: Retrieve key topics or moments from recorded lectures, enhancing personalized knowledge discovery
  • Manufacturing: Automated anomaly detection and event logging in process or quality-control video streams

Intel® Xeon® 6: Purpose-built for enterprise AI

Running VSS on CPU requires a processor that can sustain VLM and LLM inference at production scale without a discrete accelerator. Intel Xeon 6 (Granite Rapids) is architected precisely for that, built from the ground up for AI workloads at datacenter scale.

Intel® Advanced Matrix Extensions (AMX)

AMX was introduced in 4th Gen Intel Xeon Scalable processors as a dedicated hardware block, built directly into the cores, for matrix multiplication, the core operation in transformer inference. No discrete accelerator is required. AMX supports BF16 and INT8 precision, delivering the computational efficiency modern LLMs demand while maintaining accuracy comparable to FP32.

In Xeon 6 (Granite Rapids), AMX delivers up to 3x AI throughput compared to prior Xeon generations. For VSS, this means the VLM captioning and LLM summarization steps, the most compute-intensive stages of the pipeline, run efficiently on CPU with no discrete accelerator required. AMX optimizations are fully upstreamed to PyTorch and vLLM, so zero code changes are needed for existing AI workloads.
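
Before scheduling inference pods, it is worth verifying that a node actually exposes AMX. On Linux, the CPU flags amx_tile, amx_bf16, and amx_int8 advertise the capability:

```python
# Check /proc/cpuinfo (Linux only) for the AMX feature flags.
# amx_tile / amx_bf16 / amx_int8 indicate hardware AMX support.
def amx_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return {fl for fl in line.split() if fl.startswith("amx")}
    return set()

flags = amx_flags()
print("AMX supported:", bool(flags), sorted(flags))
```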

Multiplexed Rank DIMMs (MRDIMMs) and memory architecture

AI inference is fundamentally memory-bandwidth-bound. In VSS, the VLM processes multiple frames per video chunk, while the LLM holds a growing caption context; both operations place sustained pressure on memory bandwidth. The key-value (KV) cache grows with context length, and for long-form video summarization, this can become a bottleneck.
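
A back-of-the-envelope calculation makes the pressure concrete. The layer and head dimensions below are illustrative values for a Llama-style 7B model with full multi-head attention, not measurements from this deployment:

```python
# Back-of-the-envelope KV-cache size for a Llama-style 7B model.
# Illustrative architecture values, not measured from this deployment.
n_layers   = 32
n_kv_heads = 32        # full multi-head attention (no GQA)
head_dim   = 128
bytes_elem = 2         # BF16

def kv_cache_bytes(context_len: int, batch: int = 1) -> int:
    # 2x for the separate key and value tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_elem * context_len * batch

for ctx in (4_096, 32_768):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"context {ctx:>6}: {gib:.1f} GiB per sequence")
    # ~2 GiB at 4K tokens, ~16 GiB at 32K: long-form video summaries
    # keep sustained pressure on memory bandwidth and capacity.
```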

MRDIMMs deliver over 37% greater memory bandwidth than RDIMMs. For VSS workloads, this directly translates to higher throughput during VLM captioning and lower latency during LLM summarization under load.

Red Hat OpenShift AI: One platform, CPU and GPU

AI infrastructure that requires separate tools, separate pipelines, and separate operational expertise for CPU and GPU workloads creates compounding complexity. Red Hat AI 3.4 eliminates that divide.

First-class Xeon support in Red Hat AI 3.4

Red Hat AI 3.4 introduces Intel Xeon as a first-class inference target at full parity with GPU nodes. The same OpenShift AI control plane that manages GPU inference deployments now manages Xeon inference pods with identical APIs, auto-scaling policies, role-based access control, audit logging, and model governance tooling. There is no separate workflow, no separate toolchain, and no GPU required for any of it.

This means organizations can deploy VSS inference workloads on Xeon 6 nodes through the same MLOps pipeline they use for GPU-accelerated workloads and scale from one to the other as demand evolves.

KServe: Enterprise-grade model serving

KServe is the model serving standard on Red Hat OpenShift AI. For the VSS deployment, the VLM and LLM microservices are deployed as KServe InferenceServices on Xeon 6 nodes, exposing OpenAI-compatible API endpoints to the rest of the pipeline. This transforms the inference tier from a local process into a governed, scalable, observable service with autoscaling, canary deployments, and SLO-aware routing built in.

Red Hat builds and maintains a vLLM ServingRuntime for Xeon, which ships directly with OpenShift AI. Users select it from the OpenShift AI dashboard, and KServe handles deployment automatically, with no manual image management required. The runtime is built with AMX enabled, delivering hardware-accelerated BF16 and INT8 inference on 4th Gen Xeon or newer. The underlying operator stack (OpenShift AI, OpenShift Serverless, and OpenShift Service Mesh) is installed via OperatorHub and provides the full dependency foundation for KServe model serving.
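
For teams automating deployment, an InferenceService can also be created programmatically. A minimal sketch using the kubernetes Python client, with the namespace, runtime name, and model URI as placeholders for your cluster:

```python
from kubernetes import client, config

# Illustrative InferenceService referencing a CPU vLLM ServingRuntime;
# name, namespace, runtime, and storageUri are placeholders.
isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "qwen25-vl-cpu", "namespace": "vss"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "vLLM"},
                "runtime": "vllm-cpu-runtime",  # placeholder runtime name
                "storageUri": "oci://registry.example.com/models/qwen2.5-vl-7b-w8a8",
                "resources": {
                    "requests": {"cpu": "32", "memory": "128Gi"},
                    "limits": {"cpu": "32", "memory": "128Gi"},
                },
            }
        }
    },
}

config.load_kube_config()  # or load_incluster_config() inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1",
    namespace="vss", plural="inferenceservices", body=isvc,
)
```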

Red Hat AI QuickStarts

The Red Hat AI QuickStarts catalog provides production-ready, runnable examples optimized for Intel Xeon 6 on OpenShift AI, including LLM CPU Serving and RAG pipelines that map directly to the inference patterns VSS relies on. Development teams can use these as starting points, customizing them or deploying them as-is to accelerate time-to-value.

The path forward

This deployment is more than a technology demonstration. It is a production-validated proof point that CPU-native AI is ready for VSS workloads today, without GPU infrastructure or proprietary lock-in, and without sacrificing the operational governance that enterprise deployments demand.

VSS continues to evolve. Upcoming capabilities include live camera streaming support, enabling VSS to process live feeds in addition to offline video archives.

For enterprises that handle sensitive video such as surveillance footage, medical recordings, and compliance sessions, Intel® Trust Domain Extensions (Intel® TDX) provides an additional layer of protection. TDX hardware-encrypted Trust Domains protect model weights, inference state, and video data from unauthorized access, including at the hypervisor layer, without requiring changes to the VSS application.

"WWT's Telco AI engineering practice is built on the principle that the best AI solution is accessible, governable, and industry-leading. Our collaboration with Intel and Red Hat on Xeon AI delivers high-performance, CPU-native enterprise AI solutions bringing the power of OpenShift AI to clients who need AI without the GPU premium."

— WWT AI Practice

"The era of GPU-only AI is over. With Xeon AI and Red Hat AI 3.4, every enterprise server is an AI server."

— Bill Pearson, VP, AI Software, Intel

As AI becomes embedded in enterprise operations, the question is no longer whether CPUs can power enterprise AI; it is how far they can go. With Intel Xeon 6, Red Hat OpenShift AI, and VSS, the infrastructure to operationalize video intelligence at enterprise scale is available, validated, and ready to deploy.

Get started

The following resources provide everything needed to explore and deploy the Xeon 6 + OpenShift AI + VSS stack:

Technologies