Application Metrics in the Era of Cloud

At some point in our careers, most of us have been pulled into an ad-hoc meeting or conference call where the top priority was a critical application's recent outage or unbearably poor performance for the users. The typical outcome was that individuals from the infrastructure, platform and application development teams left with action items to check log entries, review config files and perform ad-hoc tests, until one late afternoon someone had an idea to "try something out."

That "something" seemed to solve the problem, without anybody fully understanding what exactly had happened. But what about the users? Will this happen again? Did we truly fix the issue? What's missing in order to understand the whole picture?

In software development, we evolved from a traditional waterfall approach to an agile methodology, which promotes a more iterative approach to writing software in order to enhance business agility through more frequent deployments. Furthermore, DevOps aims to close the gap between Dev and Ops teams by promoting CI/CD and feedback loops.

Yet we still find ourselves in the same situation with application outages and performance problems, getting pulled into the same war room. Therefore, the issue must not be the way we deliver software. Could it be that we do not have a holistic way of assessing the situation and don't speak a common language across all IT teams, one that is data-driven rather than perception-influenced?

Why metrics?

Nowadays we are building applications that are hybrid in nature, leveraging both data center and cloud components (with some spanning multiple clouds) and making calls to an increasing number of third-party services, all in order to provide business innovation at a much faster rate. The result is exponentially increased (software) complexity, which introduces the potential for multiple points of failure.

Now more than ever, metrics are at the forefront of the delivery of an application and need to span the entire lifecycle of the application.

When it comes to our personal health, we all agree that regular checkups are very important for staying healthy and identifying early signs of health issues. A software system's health must be treated the same way, and metrics are like the regular checkups that provide us with early signs of any issues that need to be addressed.

Imagine a scenario where we have a hybrid application serving millions of users, with a built-in capability to identify performance and availability issues and notify the technical support team before those issues impact the end user. This will have a direct impact not only on user experience, but also on sales and, eventually, business revenue.

We believe that metrics are the much-needed language that brings people together around data. They provide both a perspective on application reliability and a continuous feedback loop, which empowers the development teams to understand how customers are using the application and what their experience is.

A culture of application metrics

What does it mean?

In order to create a culture of metrics that brings Dev and Ops together around data rather than perception, we need to:

Establish a common set of metrics across all platforms involved in the delivery chain of your business application.
Be focused on constantly improving the reliability of the application without stopping innovation.
Understand the app usage patterns in order to make the right architectural decisions and deliver more business value.

We need to move from a reactive state of taking action upon complaints from our users into a more proactive state, where we measure all aspects of our application, all the time. When we take a proactive stance to the point where we can warn users of potential user experience degradation before they actually experience it, the perception of the business value that they get from your application increases.

The next logical question then becomes: What should be measured?

User experience

Functional metrics that are able to answer questions like: How are the users using the app? What are the most used transactions? Is my app intuitive enough for the users?
Performance metrics: How is the app performing for individual transactions for my users? How does the API layer perform? Are some transactions consistently slower than others?
Am I able to run near real-time analytics in order to make better business decisions? Am I missing any revenue today? Are there any user events happening right now that are an indicator of a security breach?
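As a minimal sketch of how the functional and performance questions above could be answered from raw data, the Python snippet below groups latency samples by transaction and reports how often each transaction is used and how slow it typically is. The transaction names, the sample values and the summarize function are hypothetical and only serve the illustration; real data would come from whatever monitoring tooling is already in place.

```python
from collections import defaultdict
from statistics import median

# Hypothetical samples: (transaction name, latency in milliseconds).
samples = [
    ("login", 120), ("login", 135), ("login", 110), ("login", 150),
    ("search", 300), ("search", 320), ("search", 900), ("search", 310),
    ("checkout", 450), ("checkout", 470), ("checkout", 440), ("checkout", 460),
]

def summarize(samples):
    """Group latencies by transaction; return count, median and max for each."""
    by_txn = defaultdict(list)
    for name, latency_ms in samples:
        by_txn[name].append(latency_ms)
    return {
        name: {
            "count": len(values),          # how often the transaction is used
            "median_ms": median(values),   # typical latency
            "max_ms": max(values),         # worst observed latency
        }
        for name, values in by_txn.items()
    }

summary = summarize(samples)

# The transaction with the highest typical (median) latency.
slowest = max(summary, key=lambda name: summary[name]["median_ms"])

for name, stats in summary.items():
    print(name, stats)
print("slowest by median:", slowest)
```

The median is used here rather than the mean so that a single outlier (such as the 900 ms search call) does not hide which transaction is slow on a regular basis; the maximum is kept alongside it so that outliers remain visible.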

The application

Understand if your app's architecture is "fit for purpose" and efficient. In other words, we need to understand how the application is reacting to the usage pattern of the users at an aggregate level (i.e., for all transactions) and at a user level (i.e., for only one user's transactions). Throughput and performance baseline metrics will help us understand what the application does across a whole traffic pattern life cycle (e.g., monthly or annually).
Measure the efficiency of the APIs through metrics that evaluate the utilization of each API, in order to determine whether components are undersized or oversized, whether they are elastic enough to ensure high-performing API calls, and how many endpoints each API exposes.
Considering throughput, latency and performance by individual transaction calling the API often leads to decisions to re-architect the API layer to become more adaptable, and this is often achieved using a microservices architecture pattern.
Measure database query performance in the context of a user's transactions (i.e. performance of queries invoked when performing the log-in transaction).
Additional consideration should be given to the potential performance difference between different types of databases (SQL, NoSQL, Graph, etc.).
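To illustrate what a performance baseline could look like in practice, here is a minimal Python sketch. It assumes a baseline is simply the mean and standard deviation of historical readings for one metric, and that a new reading is suspicious when it falls more than a chosen number of standard deviations from that mean. The function names, the tolerance of three standard deviations and the sample values are all assumptions made for the example, not a prescribed method.

```python
from statistics import mean, stdev

def build_baseline(history):
    """Summarize historical readings of one metric as mean and sample std deviation."""
    return {"mean": mean(history), "stdev": stdev(history)}

def is_out_of_baseline(value, baseline, tolerance=3.0):
    """Return True when a reading deviates from the baseline mean by more than
    `tolerance` standard deviations (the tolerance is an assumed default)."""
    return abs(value - baseline["mean"]) > tolerance * baseline["stdev"]

# Hypothetical history: requests per second observed at the same hour on past days.
history = [100, 102, 98, 101, 99]
baseline = build_baseline(history)

print(baseline)
print(is_out_of_baseline(103, baseline))  # small deviation from the usual traffic
print(is_out_of_baseline(110, baseline))  # much larger deviation
```

In a real setting the history would cover the whole traffic pattern life cycle mentioned above, so that a monthly or annual peak is compared against earlier peaks rather than against a quiet period.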

The platform

It is important to have a readily available visual representation of your application's end-to-end delivery chain, with transaction level tracing for every hop.
Measure and baseline the latency between nodes.
Measure application sizing by measuring the CPU and memory utilization of resources.

The feedback loop

Metrics are entirely useless unless they are relevant and leveraged in a collaborative fashion that brings the app team around the data gathered in order to understand risk and take action in an iterative way. Creating and constantly improving a culture of metrics requires asking some very important questions:

How should we sit down and have a conversation about what's going on without pointing fingers?
How should we document the risk of potential downtime coming from our current architecture?
How should we prioritize what gets worked on first? What are the metrics that will help us decide this?
How can we gather metrics from the beginning of the lifecycle so that we can prevent bad code from getting to production?
Are we acting upon the metrics across the whole CI/CD pipeline?
Should performance or functional health go wrong, who's doing what?
Are we able to measure our application's reliability and its business outcomes in a relevant way for all roles, from executives to the most technical person?
Do we have the right metrics in place to answer these questions?

Next steps — what can I do?

Implementing a culture change is challenging at all levels, regardless of industry. We recognize that all the customers that we interact with are at different points in their journey towards improving their most critical business applications.

Immediate steps: Plug the holes

While most people agree with the merits of having robust metrics, a few questions usually come up. Do I need to make a huge investment now in application monitoring tools? What is the ROI for this investment?

The good news is that, in many cases, you can start small by collecting application performance metrics and be able to show immediate value.

Let's think about the most recent outages and user experience degradation events that you have experienced in the past six months. If you were to experience them again, what are the types of metrics that would help you perform a root cause analysis faster? In other words, can we improve the mean time to resolution (MTTR)?
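As a small worked example, MTTR can be computed from nothing more than the detection and resolution times of past incidents. The Python sketch below assumes MTTR is the average of (resolution time minus detection time); the incident timestamps are invented for the illustration.

```python
from datetime import datetime, timedelta

# Hypothetical incidents: (time detected, time resolved).
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 11, 0)),     # 2 hours
    (datetime(2024, 2, 10, 14, 0), datetime(2024, 2, 10, 14, 30)),  # 30 minutes
    (datetime(2024, 3, 2, 22, 0), datetime(2024, 3, 3, 1, 0)),      # 3 hours
]

def mean_time_to_resolution(incidents):
    """Average duration between detection and resolution across incidents."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)

mttr = mean_time_to_resolution(incidents)
print("MTTR:", mttr)
```

Tracking this number before and after new metrics are introduced gives a simple way to show whether they actually shorten root cause analysis.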

Here are a few steps you can take today:

Document recent outages and user experience degradations.
Map the application's delivery chain end-to-end and confirm that the whole team agrees upon the behavior of the application from the data center, across the Internet and into the public cloud.
Evaluate what metrics are available from current tooling. All major cloud platforms offer comprehensive monitoring services that are embedded in the platform and provide metrics out of the box.
Agree upon core metrics to review from each key area (user experience, app, platform).
Agree upon action to be taken, should any of these metrics report out-of-bounds values.

Going through this exercise will provide a good metric on the level of understanding of the application that your team has. This exercise will ensure that contributors from different domains (e.g. network ops, app dev, automation, operations, etc.) have a common understanding of all the application components and dependencies.
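The last two steps in the list above (agreeing on core metrics per area, and on the action to take when a metric is out of bounds) can be captured in a very small data structure. The Python sketch below is one possible shape for it; the metric names, bounds and actions are hypothetical placeholders that each team would replace with its own agreements.

```python
from dataclasses import dataclass

@dataclass
class MetricRule:
    name: str           # metric identifier (hypothetical)
    area: str           # user experience, app, or platform
    upper_bound: float  # agreed limit; readings above it are out of bounds
    action: str         # agreed action when the limit is exceeded

# Hypothetical agreements, one per key area.
RULES = [
    MetricRule("p95_latency_ms", "user experience", 800,
               "page the on-call app developer"),
    MetricRule("error_rate_pct", "app", 1.0,
               "open an incident and review recent deployments"),
    MetricRule("cpu_utilization_pct", "platform", 85,
               "review sizing with the platform team"),
]

def actions_for(readings, rules=RULES):
    """Return the agreed action for every reading that exceeds its upper bound."""
    return [
        f"{rule.name}: {rule.action}"
        for rule in rules
        if rule.name in readings and readings[rule.name] > rule.upper_bound
    ]

readings = {"p95_latency_ms": 950, "error_rate_pct": 0.2, "cpu_utilization_pct": 90}
for line in actions_for(readings):
    print(line)
```

Writing the agreements down in this form, even in a spreadsheet rather than code, makes it clear who does what when a metric goes wrong.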

Long term: Be strategic

At this point, everybody should have an awareness of the application metrics that matter across the whole delivery chain, along with an assigned responsibility for a subset of metrics. Customers that have already attempted the immediate steps typically realize where the gaps are by now and are ready to start taking a more strategic approach.

A common first step is to investigate the latest industry trends around application performance monitoring, log analytics and native cloud platform monitoring.
Determine where you are and what metrics are missing in order to paint the big picture.
Consider retooling to address the gaps.
Engage the right partner in order to ensure that you are making the right decisions to collect the right actionable data.
Determine what self-healing actions your application can take, should it report out-of-bounds parameters.
Take one application and make it better.
Consider a strategic enterprise approach that prioritizes the metrics collection based on business impact of the applications.
Depending on the above factors, a carefully crafted APM and AIOps solution will have to be identified.
Consider bringing all the metrics together and correlating the data based on business outcomes.

How would this journey look for customers who are looking to migrate their application to a serverless cloud architecture? We have executed several projects where we focused initially on ensuring that both the as-is platform (typically hosted on data center servers) and the to-be (cloud) platform have a common set of metrics for us to reference before, during and after the re-architecture process. In other words, we immediately look to "plug the holes."

Even though it is not very feasible to monitor the on-prem components with cloud-native tools, we do have an increasing number of customers that already have an APM tool deployed, and it is always more efficient when we can leverage such tools.

Leveraging the available metrics helps us to better understand how users are actually using the application. Having metrics from the production environment enables us to create an efficient future-state architecture. These metrics also become very valuable in determining the future cost of the cloud solutions and how the applications should be configured for elasticity of compute.

The most successful and satisfying outcome is to journey alongside our customers' app teams as they start to understand the importance of metrics throughout the evolution of the app, and the value that metrics carry in supporting each team's day-to-day contribution.

We are looking forward to hearing your story of moving from perception about user experience and app behavior to a data-driven DevOps organization that promotes a healthy metrics culture throughout your company.