Four Key DevOps Metrics and How To Measure Them
Across the internet, you can find many blog posts from major organizations promoting the Four Key Metrics of DevOps. While this article is in the same vein, what I've seen missing from most other publications is the how of collecting these metrics; those articles only talk about the what of these metrics. It's important to remember that these metrics only help advance the ultimate goal of the business – delivering value to its customer faster. Metrics should never be the target – they are merely a way to measure progress. We are not perfect; code is not perfect. Outages, bugs, and mistakes will happen. The intent of metrics is to track our progress of continual improvement and raise early warnings when we regress. Let's quickly enumerate these metrics and delve into the why and how of them.
The four metrics
- Change Lead Time
- Deployment Frequency
- Change Failure Rate
- Mean Time to Recover/Restore (MTTR)
Why they are important
Through a decade of research and annual reports, the fine folks at DevOps Research and Assessment (DORA), in association with Puppet, Inc., discovered these four metrics are key performance indicators for highly effective teams applying DevOps principles and practices to their way of working. As teams evolve and improve their processes, their growth is reflected by these metrics. When an organization sets out to be better than its prior self, it needs to know where it started and how it is progressing. Otherwise, there's no strong evidence to show which direction a team is heading. That is why metrics, and these four in particular, are important.
Change lead time
Change Lead Time, commonly called Cycle Time, is the amount of time it takes for prioritized work to be completed by the team. Healthy teams are never fully saturated with work, leaving available capacity to pull in work whenever the business requires. This allows for a fast response to new change requests. Overall, this is an evaluation of efficiency in the development process. If teams are hindered by technical debt, slow deployment cycles, or burdensome approval hurdles, this will be reflected in a longer change lead time.
Tackling those issues is key to reducing the time the development cycle takes. Value Stream Mapping can be a good exercise for the team to help identify specific areas ripe for improvement. Improving change lead time also depends on managing story size. Reducing and standardizing story size keeps a constant pace of work across the team, and a consistent pace improves predictability and helps reduce change lead time. If teams are working in pairs, as in Extreme Programming (XP), this also gives a cleaner delineation for when pairs can rotate. There will always be changes smaller and larger than the goal, and it is a point of continuous improvement for the team to work towards tightening the bands of story size.
Deployment frequency
Deployment frequency describes how often builds make it into the customer's hands, expressed in this article as the percentage of builds that reach production. Extracting data from your CI/CD pipelines will bubble up this metric. This does not mean, however, that functionality must be active upon deployment. Feature toggles can reduce risk by decoupling deployments from delivery of the capability to your customer. This allows for a safer onramp of new functionality to a small subset of your customer base, where, if needed, it can be disabled, improved, or completely rolled back.
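To make the decoupling of deployment from delivery concrete, here is a minimal sketch of a percentage-based feature toggle. The function name is_enabled, the feature name "new-checkout", and the hashing scheme are all illustrative assumptions, not the API of any particular feature-flag product.

```python
import hashlib


def is_enabled(feature: str, user_id: str, rollout_percent: int) -> bool:
    """Return True when this user is inside the rollout cohort for a feature."""
    # Hash the feature and user together so each user lands in a stable
    # bucket from 0 to 99, independent of other features.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


# The code is deployed for everyone, but only roughly 10% of users see it.
# Setting rollout_percent to 0 disables the feature without a redeploy.
if is_enabled("new-checkout", "user-42", rollout_percent=10):
    print("serve the new checkout flow")
else:
    print("serve the existing checkout flow")
```

Hashing instead of picking users at random keeps each user's experience stable between requests, which makes it easier to reason about who was exposed if the feature has to be turned back off.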
The frequency of deployments is important because it demonstrates how development teams are batching their work. Smaller batches of changes can be deployed more frequently at a higher level of confidence because the effect on the system is minimized. Deployment frequency is also dependent on right-sizing work or stories to achieve consistent flow through a development system. It is directly correlated with Change Lead Time and Change Failure Rate. As we increase the quality of the development process, our confidence increases, risk is minimized, and smaller pieces of work can flow through to production unimpeded.
Change failure rate
Change Failure Rate is the share of changes released to production that fail or break existing functionality. It is an important metric to track because it exposes how often new code negatively impacts the system's performance. Observations made through this lens enable teams to find out why these changes cause failures and what they can do to prevent them in the future. A key component of DevOps is the adoption of Lean practices, in particular the idea of failing fast. We want our systems to be resilient and fail quickly, meaning we detect a risk of failure as early in the development lifecycle as possible. As quality is built into the SDLC (Software Development Life Cycle), confidence grows that a change promoted to production carries the lowest risk possible. As with all the other practices, this is a continual process to increase our system's quality.
Mean time to recover/restore (MTTR)
Mean Time to Recover is the duration from the moment an outage is detected to when functionality is restored. The outage could affect the whole system or just a piece of it. MTTR is important because it indicates how quickly a system can react to a failure. This includes both automated and human-driven responses to outages. As an outage drags on, the impact on your customers' experience and on your organization's bottom line grows with it, sometimes exponentially. As systems become more complex, the shift towards localized auto-remediation becomes a necessity. A related metric to MTTR is the quantity of preventative or proactive automated processes built around a system's health. While this is an advanced practice, some tooling allows for building this during the earlier stages of development (think: container health checks). In some systems, a development team may not own the entire ecosystem in which its service exists, and is therefore susceptible to counterparty risk. This often occurs at the infrastructure level. Measuring MTTR and tracking which parts fail for longer durations can provide the data for conversations about shared responsibility or transfer of ownership to the necessary parties to reduce the impact further.
Bonus metric: "Lead time"
Lead Time, not to be confused with Change Lead Time, is the amount of time it takes from the business requesting a feature to when it is fully developed and in the customer's hands. This metric encompasses Change Lead Time, and its duration is important because it lets the business know how soon a new feature can be developed and deployed. It helps qualify new requests and surfaces the cost to develop a requested feature. Additionally, it sets expectations between the business and the development team(s) for how long it takes to complete a feature. As time is money, decreasing lead time enables organizations to react faster to market or industry changes. An example of this is the COVID-19 pandemic and how quickly organizations could pivot the business to handle such a dramatic shift in the economy (e.g., working remotely, food delivery, production of sanitization goods for at-home consumption).
How to measure these metrics
Change lead time
Often Change Lead Time can be tracked by stories moving across a Kanban board from Ready to Work to Done (specifically "Done Done," as deployed to production). The calculation itself is straightforward: Time Deployed to Production – Start Time. Some work tracking systems can provide this metric out of the box – given that the team does not game the process. Otherwise, the data should be available to export or manipulate to create the desired measurement. There is always a fallback to a manual process calculation (albeit less accurate). Elite performing teams have a change lead time of under an hour.
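As a rough illustration of the Time Deployed to Production – Start Time calculation, the sketch below assumes you have exported story records with a start timestamp and a production deployment timestamp. The story IDs and timestamps are made up, and the median is used alongside the raw values only as one possible summary.

```python
from datetime import datetime
from statistics import median


def change_lead_time_hours(started_at: datetime, deployed_at: datetime) -> float:
    """Hours from work starting to the change running in production."""
    if deployed_at < started_at:
        raise ValueError("deployed_at must not be earlier than started_at")
    return (deployed_at - started_at).total_seconds() / 3600


# Hypothetical export from a work tracking system:
# (story id, moved to "Ready to Work", deployed to production).
stories = [
    ("STORY-1", datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 15, 30)),
    ("STORY-2", datetime(2024, 1, 9, 10, 0), datetime(2024, 1, 10, 12, 0)),
    ("STORY-3", datetime(2024, 1, 11, 13, 0), datetime(2024, 1, 11, 15, 0)),
]

lead_times_hours = [
    change_lead_time_hours(started, deployed) for _, started, deployed in stories
]
median_lead_time_hours = median(lead_times_hours)

print(f"Change lead times (hours): {lead_times_hours}")
print(f"Median change lead time (hours): {median_lead_time_hours}")
```

The median is less sensitive than the mean to the occasional oversized story, which can make trends easier to read while story sizes are still uneven.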
Deployment frequency
Of all the metrics, measuring deployment frequency is about as straightforward as it gets. From your pipeline tool of choice (GitLab CI, Jenkins, Azure DevOps, etc.), collect the total number of builds and the number deployed to production successfully, then divide accordingly (Successful Deployments / Total Build Count * 100). As this percentage rises, the time between deployments shrinks. Once the flow of deployments reaches a speed that the business cannot keep up with, the team may switch to on-demand deployments or work on making the business more comfortable with the team's deployment speed. Elite performers have a deployment frequency of multiple times a day.
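The calculation above can be sketched in a few lines. This example assumes you have already pulled build and deployment counts out of your pipeline tool; the function names and sample numbers are illustrative. Because the elite benchmark is stated in deployments per day, the sketch also computes that figure from a list of deployment dates.

```python
from datetime import date


def deployment_rate_percent(total_builds: int, successful_deployments: int) -> float:
    """Successful Deployments / Total Build Count * 100."""
    if total_builds == 0:
        return 0.0
    return successful_deployments / total_builds * 100


def deployments_per_day(deploy_dates: list[date]) -> float:
    """Average production deployments per calendar day in the observed window."""
    if not deploy_dates:
        return 0.0
    # Count both the first and last day of the window.
    span_days = (max(deploy_dates) - min(deploy_dates)).days + 1
    return len(deploy_dates) / span_days


# Hypothetical numbers exported from a pipeline tool.
print(deployment_rate_percent(total_builds=40, successful_deployments=10))
print(
    deployments_per_day(
        [date(2024, 1, 1), date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)]
    )
)
```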
Change failure rate
The most important part of tracking this metric is identifying what constitutes a failed deployment or release. Was a hotfix required? Did you roll forward or back? Was a feature flag turned back off? Did an automated health check fail? Making this decision as a team is a great example of building constructive Working Agreements, keeping everyone in sync. Once you can concretely mark a deployment as a failure or success, the math becomes a simple calculation of dividing failed deployments by total deployments (failed/total * 100). Elite performing teams have a 0-15% failure rate.
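One way to encode such a Working Agreement is to record the agreed failure signals per deployment and derive the rate from them. The sketch below is only an assumption about how a team might model this; the Deployment fields mirror the questions above and are not taken from any specific tool.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """One production deployment and the failure signals the team agreed on."""
    hotfix_required: bool = False
    rolled_back: bool = False
    flag_turned_off: bool = False
    health_check_failed: bool = False


def is_failure(deployment: Deployment) -> bool:
    # Any single agreed signal marks the deployment as failed.
    return (
        deployment.hotfix_required
        or deployment.rolled_back
        or deployment.flag_turned_off
        or deployment.health_check_failed
    )


def change_failure_rate(deployments: list[Deployment]) -> float:
    """failed / total * 100; returns 0.0 when there are no deployments."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if is_failure(d))
    return failed / len(deployments) * 100


# Hypothetical history: four deployments, one of which was rolled back.
history = [
    Deployment(),
    Deployment(rolled_back=True),
    Deployment(),
    Deployment(),
]
print(f"Change failure rate: {change_failure_rate(history)}%")
```

Keeping the definition of failure in one function means the team can revise its agreement later without touching the calculation.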
Mean time to recover/restore
DevOps-enabled teams will have to decide whether they want to start the clock when an outage becomes known or when it actually started. The latter is preferred as it encourages improving monitoring capabilities for parts of a system that were previously not tracked. However, if it's easier to begin tracking from the point of knowing, that's a fine place to start. Ultimately, the team will want to improve monitoring so that these two points, starting and knowing, converge. Once the start time is determined, a calculation of the time between then and when the services are restored is all that's needed. Start simple: Microsoft Excel can easily calculate differences between timestamps (resolved timestamp – start timestamp). From there, you can calculate a mean, graph it, build out standard deviations, etc. Elite performing teams can restore service in under an hour.
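If a spreadsheet becomes limiting, the same arithmetic is easy to script. The sketch below assumes a hand-maintained incident log with three timestamps per incident; the sample data is invented. It computes the mean both from the actual start and from the point of detection, so the gap between the two clocks is visible.

```python
from datetime import datetime
from statistics import mean, stdev

# Hypothetical incident log: (started_at, detected_at, resolved_at).
incidents = [
    (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 10), datetime(2024, 3, 1, 10, 30)),
    (datetime(2024, 3, 9, 14, 0), datetime(2024, 3, 9, 14, 5), datetime(2024, 3, 9, 15, 0)),
    (datetime(2024, 3, 20, 8, 0), datetime(2024, 3, 20, 8, 30), datetime(2024, 3, 20, 9, 30)),
]


def minutes_between(start: datetime, end: datetime) -> float:
    """Resolved timestamp minus start timestamp, in minutes."""
    return (end - start).total_seconds() / 60


# Clock started when the outage actually began.
from_start = [minutes_between(started, resolved) for started, _, resolved in incidents]
# Clock started when the outage became known.
from_detection = [minutes_between(detected, resolved) for _, detected, resolved in incidents]

mttr_from_start = mean(from_start)
mttr_from_detection = mean(from_detection)
spread_from_start = stdev(from_start)  # sample standard deviation

print(f"MTTR from start: {mttr_from_start} min")
print(f"MTTR from detection: {mttr_from_detection} min")
print(f"Standard deviation from start: {spread_from_start} min")
```

The difference between the two means is roughly how long outages go unnoticed on average, which is a useful number to bring to a conversation about monitoring.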
Bonus metric: "Lead time"
As with Change Failure Rate, deciding when to start the clock on this metric is subjective. Generally, it runs from the moment of request by the stakeholder to when the feature is running in production. It gets a bit fuzzy if there isn't tight alignment between stakeholders and the engineering team, meaning that requests may be loose ideas that aren't fully fleshed out and therefore aren't ready to be added to the backlog. Some workflow visualization systems have built-in calculations for lead time; however, you may be constrained to their method of calculation (e.g., Azure DevOps). Like MTTR, the calculation becomes a simple Completed Time – Start Time. Elite performing teams have lead times of less than a day.
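To show how Lead Time relates to Change Lead Time, the sketch below tracks a single hypothetical story with three timestamps. The field names and dates are assumptions for illustration; the point is only that the same Completed Time – Start Time subtraction yields different metrics depending on where the clock starts.

```python
from datetime import datetime

# Hypothetical story record with the three timestamps of interest.
story = {
    "requested_at": datetime(2024, 2, 1, 9, 0),  # stakeholder asks for the feature
    "started_at": datetime(2024, 2, 5, 9, 0),    # team pulls the work in
    "deployed_at": datetime(2024, 2, 6, 9, 0),   # running in production
}

# Lead Time: request to production.
lead_time = story["deployed_at"] - story["requested_at"]
# Change Lead Time: work started to production.
change_lead_time = story["deployed_at"] - story["started_at"]
# Time the request spent waiting before anyone picked it up.
waiting_time = story["started_at"] - story["requested_at"]

print(f"Lead time: {lead_time}")
print(f"Change lead time: {change_lead_time}")
print(f"Waiting in backlog: {waiting_time}")
```

When lead time is long but change lead time is short, the delay sits in the backlog rather than in development, which points the improvement effort at prioritization rather than engineering.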
Taking the Four Key DevOps Metrics outlined here and applying them to your team(s) is a great step towards improving how your organization works. The end goal – delivering value to the customer faster – is achieved through continual improvement. Knowing where the team started on their DevOps Journey and comparing it to where they are now can determine if progress is moving in the right direction. Reviewing these metrics often allows for a fast feedback loop to reinforce good patterns or pivot when experiments do not bear fruit.