Part 2: Transforming IT Operations - A Daily Ops Summary Agent

Note: This is the second post in a series exploring how AI agents can support IT operations teams. In this installment, we dive into our first use case, the Daily Ops Summary Agent, which surveys the broad operating environment and surfaces the macroscopic trends that IT teams often struggle to glean.

The context

A typical day on an IT Operations team involves coordinating many moving parts. As mentioned in the first post of this series, these teams rely on various systems to manage ongoing work. One key example is a service management platform.

A service management system is its own ecosystem, usually segmented into logical areas. While there are several, this post focuses on two: incident management and change management.

Incident management is one area familiar to any operations team. It handles the resolution of issues in the environment, such as slow servers, application errors and laptop crashes, typically through an incident management ticketing system. As regular users, we've interacted with these systems when logging a help request.

Change management is another area well known to those in enterprise IT. It focuses on planning the safe application of changes to any part of the IT environment that runs the business. Tasks like deploying new application versions, upgrading firmware or adjusting network configurations are handled through a change management system.

On any given day, IT teams may be working through dozens or even hundreds of incidents and changes. The service management system tracks key details: what the work is, who's doing it, what part of the environment is affected, when it should be completed and more. Each item contains structured data along with freeform notes, technical commentary and conversational input.

Over time, this data collection grows. Even after an incident is resolved or a change is applied, records are retained for institutional knowledge, quality control and auditing.

The gap

With that context established, consider the position of the people who work in these systems every day. If they want to improve the quality of IT operations, how do they inspect macroscopic patterns across the organization and its technology environment? The service management system holds a wealth of data, but given the size and nature of that data, meaningful information can be hard to uncover.

Of course, service management platforms typically include a reporting engine that works well with structured data, producing pie charts, tables and other visualizations of aggregate statistics. A gap emerges, however, because much of the data is unstructured: work notes, conversation threads and so on. Even within the structured data, mislabeling and improper categorization can still occur.

The remedy

Our first use case, the Daily Ops Summary Agent, aims to close this gap with large language models (LLMs), which are well suited to large amounts of textual and otherwise unstructured data. By taking advantage of LLMs, we can quickly 'read' through all the in-flight incidents and changes in the system, even if there are hundreds of each, and use the models' reasoning capabilities to surface insights and patterns. This gives operations staff visibility into potential problems or improvement areas that were not previously apparent.

Our agent solution parses through all the active incidents and changes within a service management platform (we used ServiceNow in our development environment) and produces a summary report in natural language.
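As a rough illustration of that parsing step, the sketch below builds a query for active incidents against the ServiceNow Table API and folds the returned records into a single prompt for an LLM. It is not the agent's actual implementation: the instance name, the field selection, the helper names and the prompt wording are all assumptions made for illustration, and the HTTP call and the model invocation are deliberately left out.

```python
from urllib.parse import urlencode

# Hypothetical field selection; a real deployment would choose its own.
INCIDENT_FIELDS = ["number", "priority", "short_description", "work_notes"]


def build_incident_query_url(instance: str, limit: int = 200) -> str:
    """Return a Table API URL that asks for active incidents only."""
    params = {
        "sysparm_query": "active=true",
        "sysparm_fields": ",".join(INCIDENT_FIELDS),
        "sysparm_limit": str(limit),
    }
    base = f"https://{instance}.service-now.com/api/now/table/incident"
    return f"{base}?{urlencode(params)}"


def build_summary_prompt(incidents: list[dict]) -> str:
    """Fold incident records into one plain-text prompt for an LLM."""
    lines = [
        "Summarize the active incident landscape and call out recurring patterns.",
        "",
    ]
    for rec in incidents:
        lines.append(
            f"- {rec.get('number', '?')} (priority {rec.get('priority', '?')}): "
            f"{rec.get('short_description', '')} | notes: {rec.get('work_notes', '')}"
        )
    return "\n".join(lines)
```

The point of the sketch is that structured fields and freeform notes travel together into the prompt, so the model can reason over both at once.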

This report contains:

A summarization of the active incident landscape
Aggregate statistics and priority breakdown analysis
Incident team performance highlights
An interpretation of the current service impact being experienced
Incident trends over the past seven days
Incident management recommendations
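To make the "aggregate statistics and priority breakdown" item concrete, here is a minimal sketch of how such a breakdown could be computed before it is handed to the model for interpretation. The record shape and the `priority` key are assumptions for illustration, not the agent's actual data model.

```python
from collections import Counter


def priority_breakdown(incidents: list[dict]) -> dict:
    """Count incidents per priority and report each priority's share of the total."""
    counts = Counter(rec.get("priority", "unknown") for rec in incidents)
    total = sum(counts.values())
    return {
        priority: {"count": n, "share": n / total}
        for priority, n in sorted(counts.items())
    }


# Hypothetical sample records
sample = [
    {"number": "INC001", "priority": "1"},
    {"number": "INC002", "priority": "2"},
    {"number": "INC003", "priority": "2"},
    {"number": "INC004", "priority": "2"},
]
breakdown = priority_breakdown(sample)
```

Computing figures like these deterministically and passing them to the LLM as context is one way to keep the numbers in the report exact while leaving interpretation to the model.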

The report goes on to deliver an in-depth analysis of incidents with respect to service level agreements (SLAs), highlighting breach rates, team SLA performance, active incidents at risk of breaching, trend and pattern analysis, and SLA management recommendations.
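Two of the SLA figures mentioned above, breach rate per team and incidents at risk of breaching, reduce to simple arithmetic once the records are in hand. The sketch below shows one way to compute them; the `team`, `sla_breached` and `sla_due` keys and the four-hour risk window are hypothetical choices for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def breach_rate_by_team(incidents: list[dict]) -> dict:
    """Return, per team, the fraction of its incidents that breached their SLA."""
    totals = defaultdict(int)
    breached = defaultdict(int)
    for rec in incidents:
        totals[rec["team"]] += 1
        if rec["sla_breached"]:
            breached[rec["team"]] += 1
    return {team: breached[team] / totals[team] for team in totals}


def at_risk(incidents: list[dict], now: datetime,
            window: timedelta = timedelta(hours=4)) -> list[str]:
    """Return numbers of unbreached incidents whose SLA falls due within the window."""
    return [
        rec["number"]
        for rec in incidents
        if not rec["sla_breached"] and now <= rec["sla_due"] <= now + window
    ]
```

The window is a tuning knob: a wider window flags more incidents earlier, at the cost of a noisier list.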

For change management, the report summarizes the changes to be implemented in the coming seven days, with a closer look at major and emergency changes. It flags components with a history of recurring issues during changes and provides a list of change management recommendations, such as adjustments to documentation, scheduling or communication.
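The change management portion can be pictured as three small filters: changes planned within the next seven days, the subset that deserves a closer look, and components that have repeatedly had issues during past changes. The sketch below assumes hypothetical `planned_start`, `type`, `component` and `had_issue` keys and a threshold of two past issues; none of these come from the agent itself.

```python
from collections import Counter
from datetime import datetime, timedelta


def upcoming_changes(changes: list[dict], now: datetime, days: int = 7) -> list[dict]:
    """Return changes whose planned start falls within the next `days` days."""
    horizon = now + timedelta(days=days)
    return [c for c in changes if now <= c["planned_start"] < horizon]


def needs_closer_look(changes: list[dict]) -> list[dict]:
    """Return the major and emergency changes from the given list."""
    return [c for c in changes if c["type"] in {"major", "emergency"}]


def recurring_problem_components(past_changes: list[dict], min_issues: int = 2) -> list[str]:
    """Return components with at least `min_issues` past changes that had issues."""
    issues = Counter(c["component"] for c in past_changes if c["had_issue"])
    return sorted(comp for comp, n in issues.items() if n >= min_issues)
```

An upcoming change that touches one of the flagged components is a natural candidate for the recommendations section of the report.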

Conclusion

We believe the Daily Ops Summary Agent solution represents a meaningful advancement for IT Operations teams, and we're excited to demonstrate it live in our AI Proving Ground. The Daily Ops Summary Agent was developed using the NVIDIA NeMo Agent Toolkit and relies on models deployed as NVIDIA NIM™, leveraging NVIDIA AI Enterprise. The solution is hosted in an HPE Medium Private Cloud AI cluster that is part of the NVIDIA AI Computing by HPE portfolio and is easily accessible via our on-demand lab.


Follow HPE or NVIDIA on wwt.com now to stay informed on all of our progress!
