In this blog

Effective dashboards have the power to accelerate troubleshooting during outages and provide meaningful insight into business operations supported by hardware, software, and network components. Creating an effective dashboard from scratch can be a daunting task given the wide range of metrics and visualizations available to an engineer. Fortunately for the industry, Google has released a free guide to Site Reliability Engineering from which some of our principles are derived. The Grafana Dashboard Sandbox offers a massive repository of data-driven and template dashboards that can spark inspiration or serve as version 0 starting points.

Target audience

The success of a dashboard can be measured by how quickly it helps answer questions for those viewing the data. The traditional workflow of an outage is:

  • Degradation
  • Troubleshooting Cause
  • Resolution
  • Resiliency (optional automation step)

The time needed to identify degradation and troubleshoot the cause is directly linked to how long users are impacted by the incident; therefore, our goal when creating dashboards is to reduce that time. With this in mind we can introduce personas: types of users who will consume data to solve issues or make decisions. Creating a persona list helps us understand who is using the dashboards, what data needs to be displayed, how to organize that data for the quickest possible workflow, and where the dashboard will be consumed.

A persona can be quickly created by answering four questions (a minimal code sketch of the result follows the list):

  1. What is the role of the person?
    1. Software Engineer
    2. Product Manager
    3. Infrastructure Engineer
  2. What is the goal the person is trying to achieve?
    1. "Is anything broken?"
    2. "How are different devices performing?"
    3. "Is the hardware running optimally?"
  3. What pains do they need to solve by using the dashboard?
    1. "It takes too much time to sort through all my tools to find the root cause."
    2. "My monitoring tools shows me device type but not performance together."
    3. "I have too many tools to quickly verify hardware optimization status."
  4. What goals are expected from having pain point data on a dashboard?
    1. "I have a place for all Metric, Event, Log, and Trace Data for my services in one place to sort through potential incident vectors."
    2. "The combination of metrics and device types are neatly organized in table form for easy reporting."
    3. "I see utilization of all hardware components together and can now roll up a simple optimal score for the environment."

Visual hierarchy

The dashboard below represents a good first attempt at organizing data that was not visualized before. We are able to see a nice historical graph of things happening, yet as with all things software, we can iterate and improve on this dashboard to create a more intuitive visual experience. Alignment, Size, Color, and Shapes are excellent aids in building a dashboard that radiates information for easy consumption.

Example Version 1.0 Dashboard
  • Alignment sets the data to read from most important to least important relative to the view presented. The western world consumes text from left to right, top to bottom, which forms a Z-shaped pattern; when viewers are fatigued by what they are seeing, they normally skip the middle sections and focus on the beginning and end.
  • Size is intuitively weighted: bigger reads as more important and smaller as less important. A time series graph is good, but it does not need to be large. Scale down the graph and summarize the current state in a big number with a distinct color.
  • Color plays an excellent role in drawing attention to specific areas or indicating good/bad health and everything in between. Intuitively we recognize red as bad and green as good. The above dashboard already has colors for the different categories and can go one step further to help the viewer understand how the numbers relate to a good or bad rating by adding a gauge or heatmap widget to showcase this relationship (see the threshold sketch after this list).
  • Shapes are useful for positioning data in ways that quickly express the relationship between widgets. Culturally we place meaning in different shapes and can piggyback on this shared understanding to eliminate redundant headers or build an eye-catching visual that keeps the viewer looking longer.
  • Accuracy and Precision impact the credibility of the dashboard, which is already an uphill battle. Inaccurate metrics across different pages create confusion, and numbers that do not add up to 100% introduce ambiguity.
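The color principle can be made concrete with threshold steps, the same idea Grafana's gauge and stat panels use to switch a panel's color once a value crosses a boundary. The sketch below is illustrative; the power thresholds are hypothetical and not taken from the example dashboard.

```python
# Threshold steps, highest bound first; the first step the value meets wins.
# The bounds below are hypothetical power-output levels for illustration.
THRESHOLDS = [
    (40_000, "green"),   # at or above expected output
    (25_000, "yellow"),  # degraded but still serving
    (0,      "red"),     # significantly below expected output
]

def state_color(value: float) -> str:
    """Map a raw metric value to a good/bad state color."""
    for lower_bound, color in THRESHOLDS:
        if value >= lower_bound:
            return color
    return "red"  # below every bound: treat as worst case

print(state_color(18_500))  # -> "red"
```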

Here we apply some of the above principles to help answer the obvious questions raised by the data. Coloring the current power red indicates that, relative to the 45K visible at the beginning of the graph, we are generating significantly less power than expected. The same holds for the RPM and average turbine power; the quick insight is much better organized than in the earlier version. Another revision could add a geographic panel that relates location to device and surfaces any regional patterns.

Example Version 2.0 Dashboard

Good practices

  • Emphasize key metrics at the top. The first thing the viewer sees should be the most important metric in a format that is easy to understand (see the layout sketch after this list).
  • Supporting Details are placed below or at the bottom.
  • External Links can support higher-tier heuristics or connect related panels together.
  • Variety can help break up the monotony and guide the eye in a predictable manner.
  • Organize based on importance; not all information is equally important.
  • Iterate; the needs of the business continually change and observability should keep pace.
  • Show state with color in gauges or time series graphs to eliminate multiple widgets.
  • Space out widgets and panels so that neatly aligned content does not feel crowded.
  • Layer data to combine similar metrics into categories.
  • Custom Branding overrides allow dashboard templates to utilize variables in display features such as headers, names, and images.
  • Accessibility is important because the odds that a viewer requires a different color palette or a screen reader are statistically high. Be mindful of color palettes and labels and how they might be perceived by different viewers.
  • Real-Time data feeds from read-only sources, used sparingly, can underscore the theme of a dashboard.
  • Interactive Elements allow viewers to manipulate what they see and engage more in exploring the data.
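Several of these practices meet in the layout itself. The sketch below builds a "key metric on top, supporting detail below" arrangement using Grafana's dashboard JSON model, where each panel sits on a 24-column grid via its gridPos; the panel titles are placeholders and datasource/query configuration is omitted for brevity.

```python
import json

def panel(title: str, kind: str, x: int, y: int, w: int, h: int) -> dict:
    """A bare Grafana panel: title, panel type, and position on the 24-column grid."""
    return {"title": title, "type": kind, "gridPos": {"x": x, "y": y, "w": w, "h": h}}

dashboard = {
    "title": "Turbine Overview (sketch)",
    "panels": [
        # Row 1: the most important state, big, first in the Z pattern.
        panel("Current Power", "stat", x=0, y=0, w=8, h=6),
        panel("Average RPM", "stat", x=8, y=0, w=8, h=6),
        panel("Fleet Health", "gauge", x=16, y=0, w=8, h=6),
        # Row 2: supporting historical detail, smaller and below.
        panel("Power Over Time", "timeseries", x=0, y=6, w=24, h=8),
    ],
}

print(json.dumps(dashboard, indent=2))
```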

Starting points

With the wealth of metrics available to us today, it can be difficult to know where to start and how to avoid common pitfalls experienced by others. Fortunately for us, Grafana and Google have easy-to-remember acronyms to help us begin putting data on a dashboard so we can gain insight and make informed decisions.

  • The RED Method is an effective baseline of metrics when starting to observe a microservice or collection of services. Request rate gives an understanding of how many people are interacting with your system. Errors paint a picture of how many requests are adversely impacted. Duration (or latency) indicates how long people wait for things to happen. The combination of these metrics is an implicit view into user experience and a jumping-off point for understanding how to improve the performance of the system (a minimal sketch follows this list).
  • The USE Method defines an infrastructure heuristic that is easy to roll up into a device score (see the roll-up sketch after this list). Utilization tells us how much work the device is doing currently relative to the total work it can perform. Saturation defines the amount of work the resource still needs to perform; common examples are queue length or batch backlog. Error counts help reinforce incidents when paired with high utilization or saturation. It is important to measure these three independently because 100% utilization or saturation is not required for performance degradation.
  • The Four Golden Signals of Monitoring provide a minimum viable set of metrics that creates an informed view when brought together. In some cases monitoring teams are just starting the observability journey and want a quick return on their efforts, or the devices to monitor are in the field under low bandwidth/power restrictions. Latency, Traffic, Errors, and Saturation together build an excellent picture of how the system behaves under user load and allow engineers to understand where an error is presenting.
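To make the RED Method concrete, here is a minimal sketch that derives the three signals from a batch of request records. In practice these numbers come from a metrics backend rather than raw records; the Request type and the sample data are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code

def red_summary(requests: list[Request], window_s: float) -> dict:
    """Rate, Errors, and Duration over a fixed observation window."""
    rate = len(requests) / window_s                                  # Rate: req/s
    errors = sum(1 for r in requests if r.status >= 500) / window_s  # Errors: err/s
    durations = sorted(r.duration_ms for r in requests)
    p95 = durations[int(0.95 * (len(durations) - 1))]                # Duration: p95 latency
    return {"rate_rps": rate, "error_rps": errors, "p95_ms": p95}

sample = [Request(120, 200), Request(340, 200), Request(980, 503), Request(75, 200)]
print(red_summary(sample, window_s=60.0))
```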
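Similarly, the USE Method rolls up naturally into the device score the infrastructure persona asked for. The weights and penalty curves below are illustrative assumptions to tune per environment; the key property is that each dimension is scored independently, since degradation can start well before 100% utilization or saturation.

```python
def use_score(utilization: float, saturation: float, errors_per_min: float) -> float:
    """Fold USE measurements into a 0-100 health score (100 is optimal).

    utilization and saturation are 0.0-1.0 fractions; the weights are
    illustrative, not a standard.
    """
    util_penalty = max(0.0, utilization - 0.7) / 0.3   # only penalize above 70%
    sat_penalty = min(saturation, 1.0)                 # any sustained queueing hurts
    err_penalty = min(errors_per_min / 10.0, 1.0)      # cap at 10 errors/min
    penalty = 0.4 * util_penalty + 0.4 * sat_penalty + 0.2 * err_penalty
    return round(100 * (1 - penalty), 1)

print(use_score(utilization=0.85, saturation=0.1, errors_per_min=2))  # -> 72.0
```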