AppDynamics Essentials: Powerful Alerting With Health Rules, Policies and Actions
In this article
- What is a health rule?
- What is a policy?
- What is an action?
- What is Anomaly Detection?
- Defaults are not so bad
- Alert! Alert! Alert… and respond
- Act one, scene one: Action
- Hey application, stay healthy
- Reviewing a health rule
- Danger, Will Robinson!
- Warning: Proceed with caution
- What's the policy, Kenneth?
- Scope it out
- Finally, some actions
- What's next?
- Download
AppDynamics is an Application Performance Monitoring and Management (APM) platform that provides real-time monitoring of applications to detect anomalies, monitor application environment performance and collect and analyze metrics. AppDynamics helps determine the root cause of application issues by looking at application, network, server and machine metrics that measure infrastructure utilization.
Keeping up with an immense application landscape can be a daunting task. AppDynamics can further help you there. Using health rules, policies and actions is an easy way to catch issues before they happen and notify your application support team so they can address them.
What is a health rule?
According to AppDynamics, health rules let you specify the parameters that represent what you consider normal or expected operations for your environment. The parameters rely on metric values, such as the average response time for a business transaction or CPU utilization for a node.
When the performance of the monitored entity violates a health rule's condition(s), it causes a health rule violation, represented by one of the following statuses: critical, warning, normal and unknown. When this status changes, it is further classified as starting, ending, upgrading from warning to critical or downgrading from critical to warning. These are shown on the controller dashboard. The best part is that a health rule violation event can be used to trigger a policy, which then can initiate automatic actions.
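To make those status changes concrete, here is a minimal Python sketch of how a change from one status to another could be labeled. It is only an illustration: the function name, the severity ordering and the returned labels are assumptions for this example, not AppDynamics code.

```python
# Illustrative only: the severity order and labels are assumptions for this sketch.
SEVERITY = {"normal": 0, "warning": 1, "critical": 2}


def classify_transition(previous, current):
    """Label a status change, or return None when nothing notable changed."""
    if previous == current or "unknown" in (previous, current):
        return None
    before, after = SEVERITY[previous], SEVERITY[current]
    if before == 0:
        return "started"      # normal -> warning or critical
    if after == 0:
        return "ended"        # warning or critical -> normal
    # Both statuses are violations, so the severity moved up or down.
    return "upgraded" if after > before else "downgraded"
```

For example, `classify_transition("warning", "critical")` is labeled as an upgrade, while `classify_transition("critical", "normal")` marks the end of the violation.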
What is a policy?
A policy, in AppDynamics lingo, is a simple trigger. Based on event(s) within AppDynamics, a trigger can be used to automate monitoring, alerting and effective problem remediation.
What is an action?
An action is an automatic response to an event that can be predefined and is reusable. In AppDynamics, policies trigger actions in response to an event, like a health rule violation.
What is Anomaly Detection?
Anomalies are a separate option that further uses the power of AppDynamics and its machine learning to discover normal ranges of key Business Transaction metrics and then alert when they significantly deviate. Anomaly Detection is a feature that has to be enabled beforehand and is only available in the AppDynamics SaaS environment.
We're now going to tie health rules, policies and actions together to help us catch problems before they occur.
Defaults are not so bad
When first developing an alerting strategy, it's important to let AppDynamics run first and use the defaults that are in place to provide the view of the runtime operation of your code using its agents. Because those agents detect calls to a service entry point and follow the execution path through the call stack, they gather metrics regarding usage, code exceptions, errors, backend system calls and exit calls. The agent then provides all that data to the Controller for us to see.
AppDynamics is no stranger to this process and has created a number of health rules out-of-the-box that we can use to alert on, helping us solve a number of application problems. Let's take a look.
Alert! Alert! Alert… and respond
Let's start at our home screen after logging into our AppDynamics Controller.
In my previous article, we used an AD-Air-Travel application as our example. We're going to use that again, except we're not going to visit the application dashboard. We'll select the Alert & Respond menu option.
Once we're in the Alert and Respond menu, we see a number of options including the health rules, policies and actions that we defined earlier.
Select the button next to the Alert & Respond title (1), select Applications (2) and select our AD-Air-Travel application (3).
Act one, scene one: Action
Let's start with actions. Select Actions from the Alert & Respond menu.
We're now at the actions menu. Since this is the first action we're creating for this application, the only option we have is to click the "+" to create a new one. As we build them, a list will be created.
In this new Create Action window, there are a large number of action types available to us across several categories. We will discuss how to use each of these in a future article, but for now we're going to create a simple email notification. By default, the "Send an email" radio button is selected. Click "OK" and move on to the next window.
From the Create Email Action window, enter the email address of the application support person, team or call center that should receive the email notifications for the health rule violation. Save it, and the newly created action will now show up in the list. Now let's move on to our health rule.
Hey application, stay healthy
Next, let's take a look at our health rules for the AD-Air-Travel application. Select the Alert & Respond menu from the top of the AppDynamics menu, then select Health Rules.
Since we're just getting started with AppDynamics and working with the AD-Air-Travel application, we're only going to see the out-of-the-box health rules. That's OK. We're going to use one of these to set up some policies and an action to alert on.
Reviewing a health rule
Let's use the first default health rule for Business Transaction Health to review what will cause a violation and set up an alert. A new window opens that puts us at the overview tab.
- We're not changing the defaults here, but the name can be changed to whatever makes sense to you.
- By default, the rule is enabled. Let's leave it that way.
- As another default, the rule is always enabled but can be managed via schedule if we only want it enabled at certain times, like after hours.
- With "Use data from the last," 30 minutes is the minimum amount of data needed for the health rule evaluation. This means that once the health rule is saved, it won't draw any conclusions until it has run for at least 30 minutes. We'll leave this at 30 minutes.
- For "Wait Time after Violation," 30 minutes is how long AppDynamics waits before reporting again when the health rule has triggered and its status remains unchanged. For example, if this health rule triggers as a critical status and continues to remain critical, it won't trigger again until 30 minutes have passed. We'll also leave this at 30 minutes.
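The Wait Time after Violation behavior can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the class name, its method and the idea of passing the time in as a plain number of minutes are all made up for illustration, and the real evaluation inside the Controller may differ.

```python
class ViolationThrottle:
    """Hypothetical model of 'Wait Time after Violation' (not AppDynamics code)."""

    def __init__(self, wait_minutes=30):
        self.wait_minutes = wait_minutes
        self.last_status = None   # status at the last report
        self.last_report = None   # time of the last report, in minutes

    def should_report(self, status, now):
        """Report when the status changes or the wait time has elapsed."""
        changed = status != self.last_status
        waited = (
            self.last_report is not None
            and now - self.last_report >= self.wait_minutes
        )
        if changed or waited:
            self.last_status = status
            self.last_report = now
            return True
        return False
```

With the default of 30 minutes, a rule that goes critical at minute 0 and stays critical is reported at minute 0, stays quiet at minute 10 and is reported again at minute 30.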
Click the Affected Entities tab to further review the health rule.
- Health Rule Type: The Business Transaction Performance (load, response time, slow calls, etc.) covers most of the key performance indicators (KPIs) that AppDynamics primarily tracks.
- Select what Business Transactions this Health Rule affects: We want this to cover all of the Business Transactions that could be covered by this health rule. All Business Transactions in the Application is already selected so we're going to leave it as is.
Danger, Will Robinson!
This is where things really get interesting. Let's move on to the Critical Criteria tab to review what parameters will classify a critical violation for this health rule.
1. We can add additional conditions to customize this, but we're going to leave the defaults alone.
2. If we had built the warning criteria first, we could have copied it here and made slight adjustments to what will violate this health rule.
3. This is for custom combinations. We can use all or a combination of the conditions we create for this health rule. In this case, we're saying that A and B must occur at the same time to violate the health rule or C and D and E must occur at the same time.
4. This is our first condition, "A." The name is Average Response Time (ms) Baseline Condition. The evaluate to true on no data option is used when you want the condition to evaluate to true when a configured metric does not return any data during the evaluation time frame, which is 30 minutes based on the setting on the Overview tab.
5. You can choose a single metric or a metric expression here. AppDynamics combines several single metrics in this health rule.
6. We have value selected here so it uses the value of a single metric. You can also use the minimum, maximum, sum, count, current and group count.
7. You'll see that Average Response Time is the chosen metric, but any of the AppDynamics KPIs can be selected like calls per minute, number of slow calls and errors per minute. You'll see that different ones are chosen for the other conditions.
8. Greater than (>) baseline is selected, so our Average Response Time (ms) is compared to our Default Baseline (9) and must exceed it by three (10) Baseline Standard Deviations (11); Baseline Percentage is the alternative unit. We're not going to discuss the other conditions. I just wanted to highlight the available options that can be modified when building or altering health rules.
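The custom combination from item 3 and the "evaluate to true on no data" option from item 4 can be sketched in Python. This is only a model of the logic described above; the function names and the dictionary of condition labels are assumptions for illustration.

```python
def condition_met(value, threshold, true_on_no_data=False):
    """Evaluate one condition; value is None when the metric returned no data."""
    if value is None:
        # Mirrors the "evaluate to true on no data" checkbox.
        return true_on_no_data
    return value > threshold


def rule_violated(c):
    """c maps the condition labels "A".."E" to booleans.

    Custom combination from the example: (A and B) or (C and D and E).
    """
    return (c["A"] and c["B"]) or (c["C"] and c["D"] and c["E"])
```

If A and B are both true, the rule violates regardless of C, D and E. If only A is true, the rule still needs C, D and E to all be true.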
The baseline is a set of automatically calculated values from metrics within a given time range that shows historic performance patterns and averages. It's used to help find business transaction outliers and determine overall environment health. The long story short on baselines and the use of standard deviations is this: suppose our average response time always sits at around 10ms and the baseline's standard deviation is also 10ms. If the response time increases to 20ms, that's one standard deviation above the baseline. If it increases to 30ms, that's two. 40ms means three standard deviations. In this health rule, the critical is three standard deviations.
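The arithmetic behind that example can be sketched with Python's standard statistics module. The helper names here are made up for illustration, and computing the baseline as a plain mean and population standard deviation of past samples is a simplifying assumption; AppDynamics calculates its baselines internally.

```python
import statistics


def baseline(samples):
    """Return (mean, standard deviation) of historic response times in ms."""
    return statistics.mean(samples), statistics.pstdev(samples)


def deviations_above(value, mean, stdev):
    """How many standard deviations a value sits above the baseline mean."""
    return (value - mean) / stdev


# The example from the text: a 10 ms baseline with a 10 ms standard deviation.
print(deviations_above(20, 10, 10))  # 1.0
print(deviations_above(30, 10, 10))  # 2.0
print(deviations_above(40, 10, 10))  # 3.0
```

Note that a perfectly flat history has a standard deviation of zero, which this sketch does not guard against.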
Warning: Proceed with caution
Now, let's visit the last stop of our health rule journey, the Warning Criteria tab. This is going to look a lot like the Critical Criteria tab because it's what triggers before the critical criteria if a problem occurs.
- You can see the same add (+) condition, copy condition and custom combination options that the Critical Criteria tab had. We can use the Copy from Critical Criteria button to pull in the critical conditions and then adjust them to define what triggers a warning violation.
- The warning condition evaluation fields for this health rule are identical to the critical ones but allow us to make minor adjustments.
- For the Warning Criteria, the only difference is that the Single Metric Value of the Average Response Time is triggered by two standard deviations instead of three. Using our previous example, this rule would trigger on 30ms instead of the 40ms on critical.
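Putting the warning and critical criteria side by side, a rough Python sketch of the classification might look like this. The function name and parameters are assumptions for illustration. It uses a strict greater-than comparison, matching the greater than (>) baseline condition, so a value that sits exactly on a threshold does not trip it in this sketch.

```python
def classify(value_ms, baseline_ms, stdev_ms, warning_devs=2, critical_devs=3):
    """Return "critical", "warning" or "normal" for an average response time."""
    # Check the critical threshold first, since it is the stricter one.
    if value_ms > baseline_ms + critical_devs * stdev_ms:
        return "critical"
    if value_ms > baseline_ms + warning_devs * stdev_ms:
        return "warning"
    return "normal"
```

With our 10ms baseline and 10ms standard deviation, `classify(35, 10, 10)` lands in warning and `classify(45, 10, 10)` lands in critical.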
Since we're just reviewing what this health rule does and not making any changes, we're going to just cancel out of this and return to the original Alert & Respond menu.
What's the policy, Kenneth?
Now let's create a policy that we can tie to the health rule we just reviewed. Select Policies from the Alert & Respond menu.
Since this is our first time building a policy, there are no policies listed here.
- AppDynamics has already selected the AD-Air-Travel application. Like health rules, we can build policies specific to our applications, user experience, databases, servers and analytics.
- Select Create Policy Manually. We want the full list of options available to us.
Our first stop in creating a new policy is the Trigger tab.
- Like the previous health rule review, there is input for a Name. It's a good rule to name it based on the health rule we're using.
- The Enabled option is available if we want to turn this off for any reason. Leave this checked so it remains on.
- The Execute actions in batch checkbox is selected by default and combines policy triggers that match in very quick succession. For example, if several hundred triggers occurred in a two-second window, each action would start only once for that batch of triggers. This reduces monitoring fatigue.
- Health Rule Violation Events includes a host of options we'll highlight shortly.
- Anomalies discover normal ranges of key Business Transaction metrics and then alert when they significantly deviate. We're going to talk about this feature in a future article.
- Other Events gives us a large number of non-specific options that we can have the policy trigger on, but since we're combining this policy with a health rule, we're not going to use them at this time.
- We can create a custom event filter to cover something that isn't already listed.
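To picture what the Execute actions in batch option does, here is a small Python sketch that groups triggers arriving close together so each action runs once per group. The function name is made up, and the two-second window is borrowed from the example above rather than from any documented AppDynamics setting.

```python
def batch_triggers(timestamps, window_seconds=2.0):
    """Group trigger timestamps (in seconds) into batches.

    A new batch starts when a trigger arrives more than window_seconds
    after the first trigger of the current batch.
    """
    batches = []
    for ts in sorted(timestamps):
        if batches and ts - batches[-1][0] <= window_seconds:
            batches[-1].append(ts)
        else:
            batches.append([ts])
    return batches


# Five triggers collapse into two batches, so each action would run twice.
print(len(batch_triggers([0.0, 0.5, 1.9, 5.0, 5.1])))  # 2
```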
Let's expand the Health Rule Violation Events (4 above). We want the policy to trigger on the Business Transaction Health Rule that we reviewed earlier, but we don't want to get overloaded with alerts. Select the Health Rule Violation Started – Warning, Started – Critical, Continues – Warning and Continues – Critical options.
This means that whenever the health rule we selected produces a warning or critical violation event, this policy will trigger. In addition, if the same health rule continues to violate for the time we set in the Wait Time After Violation (30 minutes), it will continue to report.
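In code terms, the selection we just made is a simple filter on event types. The sketch below uses made-up labels that mirror the checkbox names; they are not the event identifiers AppDynamics uses internally.

```python
# Hypothetical labels mirroring the four checkboxes we selected.
TRIGGER_EVENTS = {
    "started_warning",
    "started_critical",
    "continues_warning",
    "continues_critical",
}


def policy_triggers(event_type):
    """Return True when the policy should fire for this event type."""
    return event_type in TRIGGER_EVENTS
```

An event such as a violation ending or being upgraded would not fire this policy, because we left those options unchecked to keep the alert volume down.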
Scope it out
Select the Health Rule Scope tab. We're met with two options here. We can choose any health rule or select a specific one. By default, Any Health Rule is selected. This option uses every health rule and we only want to focus on the one we reviewed. Select the These Health Rules radio button.
- Selecting the "These Health Rules" radio button expands the window further and now lets us click the "+" to add an existing health rule for this application.
- Choose the Business Transaction Health Rule that we've been working with for this entire article. Let's move on to the next tab, Object Scope.
The Object Scope tab looks very similar to the previous Health Rule Scope tab. It has the option for the policy to trigger on any and all objects in the health rule or gives us the ability to choose specific conditions of the health rule that will trigger this policy. Because the health rule already combines multiple conditions in its scope, and we want to use those, select Any Object.
Finally, some actions
Lastly, select the Actions tab. Click the "+" button to open the Select Action window.
Here, at the Select Action window, we see the email action created earlier in the list. Select this action and then click the Select button to open the Configure Action window.
We have the option to add additional notes to be included in the email notifications. It's not required because the email notification will have information from the health rule violation event when this policy triggers and fires off the action: sending the email. Now click OK, then click Save to finish up the policy.
The newly created policy is now present in the list of policies.
That's it!
In the event that a health rule condition is violated (critical or warning), or continues to violate after the 30-minute Wait Time after Violation has passed, the policy will trigger the action, resulting in an email notification being delivered to the recipient(s). If the performance of the entities evaluated by this health rule degrades, the correct people and/or teams will be notified.
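To tie it all together, here is a minimal Python sketch of the chain we built: a health rule violation event reaches a policy, the policy checks its trigger events and health rule scope, and then runs its actions. Every name in it (the classes, the event dictionary keys, the event labels and the example email address) is an assumption for illustration, and the email action only records a message instead of sending mail.

```python
from dataclasses import dataclass, field


@dataclass
class EmailAction:
    """Stand-in for the email action; it records messages instead of sending."""
    recipient: str
    sent: list = field(default_factory=list)

    def run(self, event):
        self.sent.append(f"To {self.recipient}: {event['rule']} {event['type']}")


@dataclass
class Policy:
    events: set        # event types selected on the Trigger tab
    health_rules: set  # an empty set models "Any Health Rule"
    actions: list

    def handle(self, event):
        """Run every action if the event matches the trigger and the scope."""
        if event["type"] not in self.events:
            return False
        if self.health_rules and event["rule"] not in self.health_rules:
            return False
        for action in self.actions:
            action.run(event)
        return True


email = EmailAction("support@example.com")
policy = Policy(
    events={"started_warning", "started_critical"},
    health_rules={"Business Transaction Health"},
    actions=[email],
)
policy.handle({"type": "started_critical", "rule": "Business Transaction Health"})
print(email.sent)
```

An event from a different health rule, or of a type we did not select, is ignored and no email is recorded.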
What's next?
Since you now know how easy it is to configure health rules, policies and actions together, let's dive deeper into what you can accomplish with AppDynamics in these other topics: