AppDynamics Essentials: Catching Hardware Issues Before They Occur

If your application is hosted in your own data center, managing the infrastructure and network hardware is another consideration in determining the cause of application-related problems. While using AppDynamics to solve a hardware problem is a quick and painless process, we don't want a reactive approach to troubleshooting. Responding only after an incident has already happened guarantees that your application will always have some amount of downtime. In this article, we are going to use AppDynamics and its Infrastructure Visibility module to preemptively alert on hardware metrics before they become major application problems.

AppDynamics provides real-time monitoring of applications to detect anomalies, monitor application environment performance, and collect and analyze metrics. AppDynamics helps determine the root cause of application issues by looking at application, network, server and machine metrics that measure infrastructure utilization. In this article, we are going to take the Health Rules, Policies and Actions within AppDynamics that we previously covered and apply them to hardware-related problems.

Let's recap

Understanding the basic concepts of Health Rules, Policies and Actions within AppDynamics will help you immensely going forward with the platform, as well as for understanding this article. For now, we're going to summarize them.

Health rules let you specify the parameters that represent what you consider normal or expected operations for your environment, measured with metric values. When the performance of the monitored entity violates a health rule's condition(s), it causes a health rule violation. AppDynamics uses that health rule violation to trigger a policy, which then can initiate automatic actions.

A policy is a trigger based on an event or multiple events and can be used to automate monitoring, alerting and effective problem remediation.

An action is a reusable, predefined and automatic response to an event. In AppDynamics, policies trigger actions in response to an event, like a health rule violation. Some examples are sending an SMS or email notification, making an HTTP request, running a custom script or creating a Jira ticket.
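The relationship between the three concepts can be sketched in a few lines of Python. This is only a conceptual model, not the AppDynamics API: the class names, the `evaluate` method and the `email_action` stand-in are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HealthRule:
    """What we consider normal: the metric should stay at or below the threshold."""
    name: str
    metric: str
    threshold: float

    def is_violated(self, value: float) -> bool:
        return value > self.threshold


@dataclass
class Policy:
    """Ties a health rule violation (the trigger) to one or more actions."""
    name: str
    health_rule: HealthRule
    actions: List[Callable[[str], None]] = field(default_factory=list)

    def evaluate(self, value: float) -> bool:
        if not self.health_rule.is_violated(value):
            return False
        message = f"{self.health_rule.name}: {self.health_rule.metric} = {value}"
        for action in self.actions:
            action(message)
        return True


sent_emails: List[str] = []


def email_action(message: str) -> None:
    # Stand-in for a real notification; it only records the message.
    sent_emails.append(message)


cpu_rule = HealthRule("CPU utilization is too high", "Hardware Resources|CPU|% Busy", 90.0)
cpu_policy = Policy("CPU Utilization High", cpu_rule, [email_action])

fired_high = cpu_policy.evaluate(95.0)    # violation: the email action runs
fired_normal = cpu_policy.evaluate(40.0)  # no violation: nothing happens
```

The point of the separation is reuse: the same action can be attached to many policies, and the same health rule can feed more than one policy.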

Now we're going to take it a step further and use Health Rules, along with Policies and Actions, to help us catch hardware problems before they occur.

Preventing hardware problems from an application perspective

In this example, we're going to use the scenario from my previous article on solving hardware problems quickly. AD-Air-Travel is our primary application, and customers were complaining that the application was acting erratically or had longer than normal load times. The application appeared to be working within normal operating parameters, yet the problem persisted. We used AppDynamics and the Infrastructure Visibility module to find the root cause: the application had filled up the partitions on its virtual hard drives.

We solved this problem quickly using the AppDynamics Infrastructure Visibility module, but it was a reactive approach to troubleshooting. Responding only after an incident has already happened guarantees that your application will experience an outage or downtime. According to a 2014 IDC report, "For the Fortune 1000, the average total cost of unplanned application downtime per year is $1.25 billion to $2.5 billion." We would rather use a proactive approach with AppDynamics to plan and prepare for outages like this, allowing us to address an issue before it becomes a problem.

We accomplish this by building or using the default Health Rules to create Policies and Actions.

Home is where your heart is

Let's start at our home screen after logging into our AppDynamics Controller.

From the top menu, select the Alert & Respond option.

Once we're in the Alert and Respond menu, we see the Health Rules, Policies and Actions that we have discussed previously.

We previously used an AD-Air-Travel application as our example. We're going to use that again. Select the button next to the Alert & Respond title (1), select Applications (2) and select our AD-Air-Travel application.

Now let's take a look at our Health Rules for our AD-Air-Travel application. Because we discovered that there was a shortage of disk space in our application, we want to prevent this from happening in the future.

But first, a little action

Like our previous article, we're going to start with the Action we want to use to notify the application support representative, team or call center that should receive them. Select Actions from the Alert & Respond menu.

We're now at the Actions menu. We're going to treat this like it's the first Action we're creating for this application. Because of that, the only option we have available is to click the "+" and create a new one. As we build them, a list will be created.

In this new Create Action window, there are a large number of action types available to us across several categories. We will discuss how to use each of these in a future article, but for now, we're going to create a simple email notification.

By default, the "Send an email" radio button is selected. Click "OK" and move on to the next window.

From the Create Email Action Window, enter the email address of the application support representative, team or call center that should receive the email notifications for the Health Rule violation. We're going to pick our Health Rules and tie this Email Action to them in the next section.

Save it, and the newly created Action will now show up in our list. Now let's move on to our Health Rule.

Health is the greatest wealth

Now let's take a look at our Health Rules for the AD-Air-Travel application. Select the Alert & Respond menu from the top of the AppDynamics menu. Now select Health Rules.

Since we haven't created any new Health Rules, we're only going to see the default Health Rules that AppDynamics has for us. That's OK. We're going to use these to tie Policies and an Action to alert on.

We reviewed at length what comprises a Health Rule. Because we want to catch hardware problems preemptively, we're going to work with the Health Rules that monitor metrics related to that hardware. In this example, we're going to use two default Health Rules: CPU utilization is too high and Memory utilization is too high. The basics of each rule are as follows:

CPU utilization is too high:

Overview: Always enabled, Using data from the last 20 minutes with a 30-minute wait time after violation.
Affected Entities: Tier/Node Health – Hardware, JVM, CLR (cpu, heap, disk I/O, etc), Nodes, All Nodes in the Application.
Critical Criteria: Single Metric, Value, Hardware Resources|CPU|% Busy, >Specific Value, 90, Trigger only when occurs 16 times in the last 20 minutes.
Warning Criteria: Single Metric, Value, Hardware Resources|CPU|% Busy, >Specific Value, 75, Trigger only when occurs 16 times in the last 20 minutes.

Memory utilization is too high:

Overview: Always enabled, Using data from the last 30 minutes with a 30-minute wait time after violation.
Affected Entities: Tier/Node Health – Hardware, JVM, CLR (cpu, heap, disk I/O, etc), Nodes, All Nodes in the Application.
Critical Criteria: Single Metric, Value, Hardware Resources|Memory|Used %, >Specific Value, 90.
Warning Criteria: Single Metric, Value, Hardware Resources|Memory|Used %, >Specific Value, 75.
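To make the criteria concrete, here is a minimal Python sketch of how the CPU rule's two thresholds and its "occurs 16 times in the last 20 minutes" condition could be evaluated. It assumes one reading per minute; the function names and the NORMAL/WARNING/CRITICAL labels are illustrative, not how AppDynamics implements the evaluation.

```python
from typing import Sequence


def count_over(samples: Sequence[float], threshold: float) -> int:
    """Number of readings strictly above the threshold."""
    return sum(1 for value in samples if value > threshold)


def severity(samples: Sequence[float],
             warning: float = 75.0,
             critical: float = 90.0,
             min_occurrences: int = 16) -> str:
    """Classify a window of per-minute readings (20 readings for the CPU rule)."""
    if count_over(samples, critical) >= min_occurrences:
        return "CRITICAL"
    if count_over(samples, warning) >= min_occurrences:
        return "WARNING"
    return "NORMAL"


sustained_high = [95.0] * 16 + [50.0] * 4   # 16 of 20 readings above 90
elevated = [80.0] * 16 + [50.0] * 4         # 16 of 20 readings above 75
brief_spike = [95.0] * 5 + [50.0] * 15      # only 5 readings above either threshold

results = [severity(sustained_high), severity(elevated), severity(brief_spike)]
```

Under these assumptions the three windows classify as critical, warning and normal respectively, which shows why a short spike alone does not violate the CPU rule.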

Let's talk about policies

Now let's create a policy that we can tie to the two health rules above. Select Policies from the Alert & Respond menu.

Since we're already working with the AD-Air-Travel application, AppDynamics has already selected it for us. Since we built a policy in the previous article, we will now see it in the list.

Select the "+" to create a new policy manually.

Our first stop is the Trigger Tab.

Let's trigger this!

We're going to build a policy to tie in the first Health Rule, CPU utilization is too high. This will cover any processes within the application that cause CPU resource spikes.

Enter a Name. Base it on the Health Rule we're using.
The Enabled option is available if we want to turn this off for any reason. Leave this checked so it remains on.
The Execute actions in batch checkbox is selected by default and combines policy triggers that match in quick succession. For example, if several hundred triggers occurred in a two-second window, the policy would start each action only once for that batch. This reduces alert fatigue. We can pull the individual metrics for each occurrence if we need more details, but we don't need a flood of notifications for violations that are identical and potentially already being addressed.
We want to catch the Health Rule Violation Events for warning as well as critical. In addition, if the same Health Rule continues to violate for the time we set in the "Wait Time After Violation" (30 minutes), the policy will continue to report. Lastly, we want the policy to notify us when the Health Rule Violation escalates from warning to critical.
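The batching idea can be illustrated with a short Python sketch. The two-second window comes from the example above; the grouping logic itself is an assumption made for illustration and is not taken from AppDynamics.

```python
from typing import List, Sequence


def batch_triggers(timestamps: Sequence[float],
                   window_seconds: float = 2.0) -> List[List[float]]:
    """Group trigger timestamps that fall within window_seconds of the
    first trigger in the current batch."""
    batches: List[List[float]] = []
    for ts in sorted(timestamps):
        if batches and ts - batches[-1][0] <= window_seconds:
            batches[-1].append(ts)
        else:
            batches.append([ts])
    return batches


# Five triggers: three in one burst, two in a later burst.
triggers = [0.0, 0.5, 1.9, 10.0, 10.1]
batches = batch_triggers(triggers)

# One action run per batch instead of one per trigger.
actions_started = len(batches)
```

With batching, the five triggers above start each action twice rather than five times, while the individual triggers remain available inside each batch.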

Do not limit this scope

Select the Health Rule Scope tab.

Select the "These Health Rules" radio button.
Select the "+" to expand the available Health Rule options.
Select CPU utilization is too high from the Health Rule list.

On the Object Scope tab, select Any Object. We want the policy to trigger on any and all objects in the Health Rule.

A little more action never hurt anyone

Visit the Actions tab. Click the "+" button to open the Select Action window.

We should see the Email Action we created earlier in the list. Select this action and then click the "Select" button to open another Configure Action window.

You can add notes to include in the email notifications if you want to, but it's not required. The notification will contain a lot of information from the Health Rule violation. Click OK, then click Save to finish up the Policy.
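As a rough picture of what such a notification carries, the sketch below builds an email with Python's standard `email.message.EmailMessage`. The sender address, the recipient, the field layout and the note text are all hypothetical; the actual email that AppDynamics sends is formatted by the Controller, not by this code.

```python
from email.message import EmailMessage


def build_violation_email(to_address: str, rule_name: str, severity: str,
                          metric: str, value: float, note: str = "") -> EmailMessage:
    """Assemble (but do not send) a notification for a health rule violation."""
    message = EmailMessage()
    message["From"] = "appd-alerts@example.com"  # hypothetical sender
    message["To"] = to_address
    message["Subject"] = f"[{severity}] {rule_name}"
    lines = [
        f"Health rule: {rule_name}",
        f"Severity: {severity}",
        f"Metric: {metric}",
        f"Value: {value}",
    ]
    if note:
        # The optional note from the Configure Action window.
        lines.append(f"Note: {note}")
    message.set_content("\n".join(lines))
    return message


email = build_violation_email(
    "app-support@example.com",  # hypothetical recipient
    "CPU utilization is too high",
    "CRITICAL",
    "Hardware Resources|CPU|% Busy",
    95.0,
    note="Check the AD-Air-Travel nodes first.",
)
```

A short note like the one above can save the recipient a few minutes by pointing at where to look first.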

The newly created Policy for CPU Utilization High is now present in the list of policies.

Congratulations! We're well on our way to preventing hardware problems from affecting our application.

Rinse and repeat

Since we're already on the Policies window, follow the steps above starting with the Let's talk about policies section to create another policy to cover the second Health Rule, Memory utilization is too high. Name it Memory Utilization High or something relevant. The Health Rule Scope will use the Memory utilization is too high Health Rule, but the Trigger, Object Scope and Actions options will remain the same.

Once you're done, the Memory Utilization High policy will show up in your list of Policies.

Preventing hardware problems from a server perspective

In my previous article, we discovered that the servers that support our application ran out of hard drive space, causing a number of issues impacting performance. Since we don't want that to happen again, let's use AppDynamics and the Infrastructure Visibility module to build an alert that will notify us if our servers' available hard drive space is getting low.

We could use the default Disk Usage is too high Health Rule like we did above for the memory and CPU usage, but we want to make some small adjustments to the rule to help avoid some alert fatigue.

Don't be afraid to go back home

Let's start back at our home screen on the AppDynamics Controller.

From the top menu, select the Alert & Respond option.

We're back at the very familiar Alert and Respond menu.

This time we're going to click the drop-down menu.
Select Servers (2) from the list.

Pay attention to your health

Select Health Rules.

Notice how Servers is in our drop-down list next to Health Rules. We're building a rule that will affect them.
Click the "+" to create a new Health Rule.

Some kind of overview

We're going to build a new Health Rule to track our server hard drive space usage. Name it something appropriate like "Server Hard Disk Usage Higher than 80%" and leave the defaults for the rest. Double-check that it's enabled.

Move on to the Affected Entities tab.

Affected entities

On the Affected Entities tab, you're going to notice that Server Health is selected here for us. This is because we selected Servers on the Alert & Respond menu.
The rule can either affect all servers being monitored by AppDynamics or we can build subgroups and select those to be affected.
Select Servers allows us to choose all servers in the account, servers in specified subgroups, specific individual servers and servers matching criteria like naming conventions and metric expressions. We're going to select All Servers in the Account. This means that this Health Rule will cover every server that AppDynamics monitors across all applications in the account.
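Selecting servers by naming convention can be pictured as a simple pattern match. The Python sketch below uses shell-style wildcards from the standard `fnmatch` module; the server names and the pattern syntax are hypothetical and only illustrate the idea of narrowing the affected entities.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence


def select_servers(servers: Sequence[str], pattern: str = "*") -> List[str]:
    """Return the servers whose names match the pattern.
    The default pattern matches every server, like All Servers in the Account."""
    return [name for name in servers if fnmatchcase(name, pattern)]


# Hypothetical server names for the AD-Air-Travel environment.
monitored = ["adair-web-01", "adair-web-02", "adair-db-01", "billing-db-01"]

all_servers = select_servers(monitored)
web_only = select_servers(monitored, "adair-web-*")
```

We use the equivalent of the default pattern in this article because disk space matters on every server, but a narrower selection is useful when different groups need different thresholds.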

Move on to the Critical Criteria tab.

Let's get critical

Select the "+" to Add Condition, and the "A" Condition Criteria will pop up.
We have the option to add multiple conditions and to choose whether the Health Rule requires all of them or a combination of them. The "All" option is grayed out because we only have one condition at this time. We're going to leave it in that state because we're only building one condition.
Since we're focusing on server hard drive space usage, name the condition appropriately. Choose Single Metric and Value.
Upon clicking the Value, a new Metric Selection window will open. Since we want to focus the alerting for this Health Rule on hard drive space, we're going to select the Hardware resources|Volumes|Used% metric.
1. The list options available are vast and allow us to choose any of the metrics that AppDynamics with the Infrastructure Visibility module tracks.
2. You can even Specify a Relative Metric path if you want to further refine the information you want to alert on.
3. Select the metric to return to the Critical Criteria Condition "A" menu.

We want to be specific about how much space we consider critical. Select > Specific Value and use 90% for the value. This means that if more than 90% of the hard drive is full, this Health Rule will start violating.
We don't want this Health Rule to trigger our Policy repeatedly before we have a chance to address the issue, or to send us a large number of notifications in quick succession about the same issue. Since AppDynamics reports every minute, this Health Rule could otherwise trigger an alert every minute. By checking the "Trigger only when" box and modifying the values, we make it trigger only if the condition occurs 15 times over the last 30 minutes. This filters out short-lived spikes and prevents alert fatigue.
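The effect of the "15 times over the last 30 minutes" setting can be simulated with a small sliding window. This is a sketch under the assumption of one reading per minute; the class name and its `add` method are made up for illustration and do not mirror AppDynamics internals.

```python
from collections import deque


class OccurrenceWindow:
    """Violates only when enough recent readings exceed the threshold."""

    def __init__(self, threshold: float = 90.0,
                 min_occurrences: int = 15, window_minutes: int = 30) -> None:
        self.threshold = threshold
        self.min_occurrences = min_occurrences
        # Keeps one True/False flag per minute; old flags fall off automatically.
        self.flags = deque(maxlen=window_minutes)

    def add(self, used_percent: float) -> bool:
        """Record one per-minute reading and report whether the rule violates."""
        self.flags.append(used_percent > self.threshold)
        return sum(self.flags) >= self.min_occurrences


# A three-minute spike above 90% never reaches 15 occurrences.
spike_rule = OccurrenceWindow()
spike_results = [spike_rule.add(v) for v in [50.0] * 10 + [95.0] * 3 + [50.0] * 17]

# A disk that stays above 90% violates on the 15th consecutive reading.
sustained_rule = OccurrenceWindow()
sustained_results = [sustained_rule.add(95.0) for _ in range(15)]
```

In this simulation the brief spike never violates, while the sustained condition violates once it has been observed 15 times, which is the behavior we want for disk usage that is genuinely stuck at a high level.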

Warning criteria

On the Warning Criteria tab, we're going to take the easy route here and select the "Copy from Critical Criteria" button. This will bring the entire "A" Condition we just completed from the Critical Criteria tab.
The only change we need to make here is to modify the % of the hard drive space we want to be notified on. In the Critical Criteria tab, we used 90% of the hard drive. For the Warning, we want to use a lower value. The default Health Rule for hard drive space uses 75%, but we're going to use 80% here. The remaining items should be identical to the Critical Criteria tab.
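In code terms, "Copy from Critical Criteria" followed by a single change looks like copying a record and overriding one field. The `Condition` dataclass below is a hypothetical stand-in for the condition form, populated with the values used in this article.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Condition:
    """Illustrative stand-in for one Health Rule condition."""
    name: str
    metric_path: str
    operator: str
    value: float
    min_occurrences: int
    window_minutes: int


critical = Condition(
    name="Server hard disk usage",
    metric_path="Hardware resources|Volumes|Used%",
    operator=">",
    value=90.0,
    min_occurrences=15,
    window_minutes=30,
)

# Copy every field from the critical condition and change only the threshold.
warning = replace(critical, value=80.0)
```

Keeping everything except the threshold identical means a warning is always an earlier signal of the same condition, which makes the warning-to-critical escalation easy to reason about.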

Click Save, and return to the Alert & Respond menu.

Back to the policy!

Let's create a policy for our new Health Rule. Select Policies.

Select the "+" to create a new policy manually.

We're going to build a policy to link to our new Health Rule, Server Hard Disk Usage Higher than 80%. Our first stop is the Trigger Tab.

Trigger happy

Enter a Name. Base it on the Health Rule we're using.
Leave the Enabled option on.
Leave Execute actions in batch checked.
We want to catch the Health Rule Violation Events for warning as well as critical, when either of them continues (30 minutes) and when the warning violation escalates to the critical violation.
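The events selected on the Trigger tab can be thought of as transitions between severity states. The sketch below is an assumption-laden illustration: the event labels are invented for readability and are not the Controller's event type names.

```python
from typing import Optional

# Event labels chosen for this sketch only.
SUBSCRIBED_EVENTS = {
    "STARTED_WARNING",
    "STARTED_CRITICAL",
    "CONTINUES_WARNING",
    "CONTINUES_CRITICAL",
    "UPGRADED_WARNING_TO_CRITICAL",
}


def violation_event(previous: str, current: str) -> Optional[str]:
    """Map a change in severity to an event label, or None if it is not tracked here."""
    if previous == "NORMAL" and current in ("WARNING", "CRITICAL"):
        return f"STARTED_{current}"
    if previous == "WARNING" and current == "CRITICAL":
        return "UPGRADED_WARNING_TO_CRITICAL"
    if previous == current and current != "NORMAL":
        return f"CONTINUES_{current}"
    return None


def should_notify(previous: str, current: str) -> bool:
    return violation_event(previous, current) in SUBSCRIBED_EVENTS


started = should_notify("NORMAL", "WARNING")        # a new warning violation
escalated = should_notify("WARNING", "CRITICAL")    # warning escalates to critical
continuing = should_notify("CRITICAL", "CRITICAL")  # still violating after the wait time
recovered = should_notify("CRITICAL", "NORMAL")     # recovery is not selected in this policy
```

Recovery and downgrade events are deliberately left out of the subscribed set here because the policy in this article only selects the start, continuation and escalation of violations.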

Move on to the Health Rule Scope.

Scope it

Select the "These Health Rules" radio button.
Select the "+" to expand the available Health Rule options.
Select the Server Hard Disk Usage Higher than 80% Health Rule we just created.

Let's visit the Actions tab one more time.

Back to the action… again.

Visit the Actions tab. Click the "+" button to open the Select Action window.

We should see the Email Action we created earlier in the list. Select this action and then click the "Select" button to open another Configure Action window. Add notes here if you need them.

The newly created Policy for Server Hard Disk Usage Higher than 80% is now present in the list of policies. Now when any of the hard drives or partitions that AppDynamics is monitoring reports more than 80 percent usage, the application support person, team or call center will receive email notifications with details on the Health Rule violation.

Congratulations! See how quickly we built these new Health Rules, Policies and Actions to prevent the hardware-related issues we previously experienced? It's that simple with AppDynamics and Infrastructure Visibility.

What's next?

Since you now know how easy it is to solve a hardware problem quickly using AppDynamics and Infrastructure Visibility, let's learn more about what AppDynamics can do with these other topics: