Part 3: Transforming IT Operations - An Incident Knowledge Assistant
Note: This is the third post in a series exploring how AI agents can support IT Operations teams. In this installment, we consider the multiple sources of information relevant to resolving IT issues and how they can be identified and presented to help the worker tasked with resolution.
Explore the series:
- Part 1: Transforming IT Operations with Large Language Models
- Part 2: Transforming IT Operations - A Daily Ops Summary Agent
The context
Every day, in every company, tickets come in by the dozens, sometimes hundreds, to the IT organization: forgotten passwords, laptop requests, application errors, network outages and more. Each request becomes an incident in the IT Service Management (ITSM) platform, which tracks it from initial assignment through resolution. The ITSM platform manages the details of the incident: who is working on it, every interaction between the worker and the requestor, and how long it spends in each stage of its journey to completion.
Associated with each stage of the incident journey are Service Level Agreements (SLAs). SLAs are pre-established durations which specify the maximum amount of time an incident can spend in any particular stage of its lifecycle. Every IT organization sets the SLAs it is committed to meeting, establishing the level of responsiveness its customers can expect from it.
SLA data are key metrics IT organizations use to measure the quality of the service they provide, and the trend in that data tells them whether they are improving or falling behind. Anything that shortens response times is therefore highly desirable.
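To make the SLA idea concrete, here is a minimal sketch of checking an incident's time in each stage against per-stage limits. The stage names and durations are hypothetical; real values are set by each IT organization and tracked by its ITSM platform.

```python
from datetime import timedelta

# Hypothetical SLA limits per lifecycle stage (assumed values, not from any real platform).
SLA_LIMITS = {
    "assignment": timedelta(hours=1),
    "in_progress": timedelta(hours=8),
}

def breached_stages(time_in_stage, limits=SLA_LIMITS):
    """Return the stages whose elapsed time exceeds the SLA limit for that stage."""
    return [
        stage
        for stage, spent in time_in_stage.items()
        if stage in limits and spent > limits[stage]
    ]

# Made-up sample: assigned within 30 minutes, but worked on for 9 hours.
incident_times = {
    "assignment": timedelta(minutes=30),
    "in_progress": timedelta(hours=9),
}
print(breached_stages(incident_times))
```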
The gap
With that context in mind, we can put ourselves in the shoes of someone working on these incidents as they are created and assigned. Based on what the incident concerns (a laptop vs. an application problem, for example), it is assigned to the team with technical expertise in that subject area. Even so, not everyone on a team is equally skilled or experienced in that particular domain.
To bridge that gap, the ITSM platform usually includes a built-in knowledge base, which is a collection of articles and documents with work instructions, how-to guides, FAQs and other documentation useful in addressing issues and resolving problems. Typically, the content within the knowledge base is built up over time as new problems are seen and resolved and the solutions are captured in it for posterity.
This means that the knowledge base grows considerably over the years, which is a good thing since it means that institutional knowledge is growing. But it also presents a challenge: it becomes harder to find the specific item in the knowledge base that pertains to the incident you're trying to resolve.
Another source of information incident workers take advantage of is historical incidents. These are all preserved in the ITSM system and are a treasure trove of information on how to fix a particular problem, since they contain all the gritty details of user interaction, problem descriptions, error logs captured, troubleshooting steps and fixes tried until the final solution was found.
But as we mentioned, hundreds of incidents are opened and closed every week. So, the question for the worker becomes: Has an issue like this one been reported and solved in the past? How do I find any past incidents that might apply to the problem I'm dealing with today? A worker who can find that incident could reduce the time to resolution considerably and stay well within the SLA, leading to a happier worker and a happier customer.
The remedy
The idea behind this use case is to address these needle-in-a-haystack challenges using AI-based technology to provide an Incident Knowledge Assistant. Using Large Language Model (LLM) and Retrieval Augmented Generation (RAG) techniques, we analyze the details of the incident being worked on, as stored in the ITSM platform. Based on what is 'understood' from the incident, we can then provide a few areas of assistance to the worker.
First, we inspect the contents of the entire knowledge base and compare its articles to the incident details. From this, we provide the top five articles which seem to best describe and provide a solution to the incident. To save additional time, we parse through the contents of the matching knowledge base articles, looking for work instruction steps, scripts or other command line items that may be used to gather more information or outright solve the issue. These potential remedies are surfaced to help resolve the issue.
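The retrieval step described above can be sketched roughly as follows. This is an illustration only, not the assistant's actual implementation: it swaps in a simple bag-of-words cosine similarity where a RAG pipeline would normally use an embedding model, and the article data, the "$ " command-line convention and the `vpncli` command are all made up for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_articles(incident_text, articles, k=5):
    """Return up to k (score, title) pairs, best match first."""
    query = Counter(tokenize(incident_text))
    scored = [
        (cosine_similarity(query, Counter(tokenize(a["body"]))), a["title"])
        for a in articles
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def extract_commands(article_body):
    """Collect lines that look like commands (assumed to start with '$ ')."""
    stripped = (line.strip() for line in article_body.splitlines())
    return [line[2:] for line in stripped if line.startswith("$ ")]

# Hypothetical knowledge base articles and incident text.
ARTICLES = [
    {"title": "Reset a VPN client profile",
     "body": "VPN connection fails after password change.\n$ vpncli reset-profile\nReconnect to the VPN."},
    {"title": "Request a new laptop",
     "body": "Submit a laptop hardware request through the service catalog."},
    {"title": "Clear the print spooler",
     "body": "Printer jobs stuck in queue.\n$ net stop spooler\n$ net start spooler"},
]
INCIDENT = "User cannot connect to VPN after changing password"

best = top_articles(INCIDENT, ARTICLES, k=5)
print(best[0][1])
print(extract_commands(ARTICLES[0]["body"]))
```

In a production setting the scoring function would be replaced by vector search over embeddings, but the shape of the step stays the same: score every article against the incident, keep the top five, then mine those articles for actionable commands.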
Next, we 'read' all the closed incidents in the system looking for similar issues reported in the past. Comparing them to the current incident, we then present the five closest historical matches, since they may already contain the steps needed to resolve the issue. Additionally, we comb through the work notes inside each of these past related incidents, and based on the initial conversation threads found, we generate an initial response for the worker to send to the customer.
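One way to picture the response-drafting step is as prompt assembly: the current incident and the opening work notes from the matched incidents are combined into a single prompt that is then sent to an LLM. The sketch below shows only that assembly. The field names (`number`, `description`, `work_notes`), the prompt wording and the sample incidents are assumptions for illustration, and the model call itself is left out.

```python
def build_draft_prompt(current, matches, max_notes=2):
    """Assemble an LLM prompt from the current incident and past matches.

    Only the first max_notes work notes of each past incident are used,
    since the goal is to imitate the initial reply to the customer.
    """
    lines = [
        "You are an IT support assistant. Draft a short first reply to the customer.",
        "",
        f"Current incident: {current['description']}",
        "",
        "Early work notes from similar past incidents:",
    ]
    for incident in matches:
        for note in incident["work_notes"][:max_notes]:
            lines.append(f"- [{incident['number']}] {note}")
    return "\n".join(lines)

# Hypothetical data standing in for records pulled from the ITSM platform.
current_incident = {"description": "Outlook keeps asking for my password"}
past_matches = [
    {"number": "INC-A", "work_notes": [
        "Asked user to confirm whether the password was changed recently.",
        "Cleared cached credentials and restarted Outlook.",
        "Closed after user confirmed the fix.",
    ]},
    {"number": "INC-B", "work_notes": [
        "Requested a screenshot of the prompt.",
    ]},
]

prompt = build_draft_prompt(current_incident, past_matches)
print(prompt)
```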
Finally, we search the ITSM platform for any changes implemented in the past week against the same item the incident is opened against, in case one of those changes caused the issue being experienced. At the same time, we present any known problems logged in the ITSM platform that pertain to the same item, quickly identifying whether a larger issue is at play in the broader environment.
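The change and problem lookups amount to simple filters on the item the incident is opened against. Here is a minimal sketch, assuming each record carries a `config_item` field and each change an `implemented_at` timestamp; those field names, the record numbers and the dates are invented sample data, not the schema of any particular ITSM product.

```python
from datetime import datetime, timedelta

def recent_changes(changes, config_item, now, days=7):
    """Changes implemented on config_item within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [
        c for c in changes
        if c["config_item"] == config_item and cutoff <= c["implemented_at"] <= now
    ]

def related_problems(problems, config_item):
    """Known problems logged against the same config_item."""
    return [p for p in problems if p["config_item"] == config_item]

# Made-up sample records.
NOW = datetime(2024, 5, 15, 12, 0)
CHANGES = [
    {"number": "CHG-1", "config_item": "mail-server", "implemented_at": datetime(2024, 5, 13, 9, 0)},
    {"number": "CHG-2", "config_item": "mail-server", "implemented_at": datetime(2024, 4, 20, 9, 0)},
    {"number": "CHG-3", "config_item": "vpn-gateway", "implemented_at": datetime(2024, 5, 14, 9, 0)},
]
PROBLEMS = [
    {"number": "PRB-1", "config_item": "mail-server"},
    {"number": "PRB-2", "config_item": "file-share"},
]

print([c["number"] for c in recent_changes(CHANGES, "mail-server", NOW)])
print([p["number"] for p in related_problems(PROBLEMS, "mail-server")])
```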
Conclusion
Taken together, these capabilities are a significant enabler for IT teams. By surfacing the right knowledge articles, past incidents, recent changes and known problems at the moment they are needed, the assistant can help teams greatly reduce incident response and resolution times.
We're excited to be able to demonstrate it live in our AI Proving Ground. It was developed using the NVIDIA NeMo Agent Toolkit and relies on models deployed as NVIDIA NIM™. We deployed the solution into an HPE Medium Private Cloud AI cluster where it is hosted with easy access via wwt.com.
Follow HPE and NVIDIA on wwt.com now to stay informed on all of our progress!