
API Access Log Monitoring on Splunk

Understanding application health is crucial in the development and frequent delivery of software. For API development, access logs are an ideal data source for monitoring application health. This article provides a guide for leveraging access logs to create powerful monitoring dashboards on Splunk.


Introduction

Understanding application health is crucial in the development and delivery of software. Having a constant view into the performance and behavior of an application under development leads to increased confidence when adding features, fixing bugs, and addressing technical debt. After value is delivered, you can validate your hard work by seeing that the feature works and is being used (and hopefully appreciated!). Any problems that do arise can be identified quickly, allowing for expedient resolution. Taken altogether, the benefits of monitoring application health build the trust necessary to deliver value faster and more often.

For API development, access logs are an ideal data source for monitoring application health. Access logs capture valuable information for every request made to the API, including request metadata, response status, and response latency. The source of access logs varies depending on the architecture of your API: the server or service running your API might provide access logs, or you could implement some form of middleware in code. Structured logs are easier to work with than traditional formats and aid the creation of rich queries and dashboards, so all examples will assume structured JSON logging. For the purposes of this article, the following sample access log format will be used to build out a working dashboard on Splunk.

{
    "host": "fake-host ",
    "httpMethod": "POST",
    "ip": "11.11.111.111",
    "resourcePath": "/",
    "responseLatency": "95",
    "responseLength": "103",
    "status": "200"
}


Getting Data In

The first step is getting data into Splunk. This is a large topic in itself and not something this article can cover in detail. Luckily, Splunk provides a plethora of documentation on Getting Data In. Refer to the documentation for the Splunk product and version at your disposal and get to work! 
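
Once your access logs are flowing in, a quick search can confirm that the JSON fields are being extracted as expected. The following is a minimal sketch using the sample index and source names from this article; substitute your own, and add | spath if the fields are not extracted automatically at search time.

index="sample-env-dev" source="access-logs-FoodAPI*"
| head 20
| table _time, httpMethod, resourcePath, status, responseLatency, ip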

Hint: When configuring indexes, create one index per environment to make it easy to zero in on a specific environment in your queries and dashboards. You can still query multiple indexes at a time if you stick to a naming convention (See below).

Hint: Stick to a naming convention! You will almost certainly want to view metrics for multiple APIs and maybe even multiple environments on the same dashboard. For example, the input sources named access-logs-FoodAPI and access-logs-StuffAPI can be queried together with source="access-logs-*". Likewise, the indexes sample-env-dev and sample-env-prod can be queried together with index="sample-env-*".
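
For example, the following sketch (again using the sample names) counts events per index and source across every environment and API at once, which is a handy way to verify the naming convention is holding up:

index="sample-env-*" source="access-logs-*"
| stats count by index, source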


Building the Dashboard

The rest of this article focuses on building a single access log dashboard on Splunk. This will get technical; a basic understanding of XML and Splunk searching will be helpful in following along. If you have no familiarity with Splunk dashboards, review the About Dashboards documentation before proceeding. The sample configuration that follows will give an idea of what is possible when using access logs to monitor application health.

Input Options

Input options are used to customize dashboard output. With cleverly composed queries, input options allow for powerful dashboards with a wide breadth of coverage. A single dashboard can be used for multiple APIs across many environments, viewed over custom time ranges, with data aggregated over custom time spans. For the examples that follow, the dashboard inputs listed below must be configured. Paste the provided XML as a child of the top-level element in the edit source view, and update the source_index and source_api choices to values that exist in your Splunk instance. (Note that a dashboard with form inputs uses form rather than dashboard as its top-level element.) Input options can also be configured manually through the edit UI view.

  • source_index
    • The source_index should be a dropdown or text type input. This allows the dashboard to support many environments, assuming you create one index per environment.
  • search_time_range
    • The search_time_range should be a time type input. This allows you to customize the time range of the data displayed on the dashboard.
  • chart_interval_time
    • chart_interval_time should be a dropdown or text type input. This allows you to customize the interval at which data is aggregated.
    • The values should be Splunk-supported timespans, for example 1h, 2d, or 3mon.
  • source_api
    • The source_api should be a dropdown or text type input. This allows the dashboard to support multiple APIs. This field corresponds to the Splunk source.
    • Hint: If your input sources follow a naming convention, you can display all APIs at once. For example, an "All" option could have the value "access-logs-*".
<fieldset submitButton="false">
  <input type="dropdown" token="source_index">
    <label>source_index</label>
    <choice value="sample-env-prod">production -- UPDATE VALUE TO REAL INDEX</choice>
    <default>sample-env-prod</default>
  </input>
  <input type="time" token="search_time_range">
    <label>search_time_range</label>
    <default>
      <earliest>-24h@h</earliest>
      <latest>now</latest>
    </default>
  </input>
  <input type="dropdown" token="chart_interval_time">
    <label>chart_interval_time</label>
    <choice value="10m">10 minutes</choice>
    <choice value="15m">15 minutes</choice>
    <choice value="1h">1 hour</choice>
    <choice value="4h">4 hours</choice>
    <choice value="1d">1 day</choice>
    <choice value="1w">1 week</choice>
    <default>1h</default>
  </input>
  <input type="dropdown" token="source_api">
    <label>api</label>
    <choice value="access-logs-FoodAPI*">Food -- UPDATE VALUE TO REAL SOURCE</choice>
    <choice value="access-logs-StuffAPI*">Stuff -- UPDATE VALUE TO REAL SOURCE</choice>
    <choice value="access-logs-*">All</choice>
    <default>access-logs-StuffAPI*</default>
  </input>
</fieldset>

The Base Search

You may be familiar with Splunk dashboards but not with the concept of a base search. A base search is not strictly necessary when building dashboard panels; often, every panel in a dashboard performs a full query to render. Using a base search is an optimization for dashboards with panels sourced from the same data. In Splunk jargon, a base search is an inline search that can be referenced by multiple dashboard panels. Dependent dashboard panels post-process the base search results to generate visualizations.

The base search makes dashboard rendering fast by limiting the amount of work Splunk has to do: one query per dashboard is faster to perform than one per dashboard panel. Base searches also allow for an abstraction of filtering logic; the input options configured above are dealt with in the base search rather than in every panel. Using a base search does have drawbacks. Additional configuration is required, and the query can be gnarly and confusing. Deciding whether or not to use a base search comes down to a trade-off between speed and complexity.
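
Stripped down to its essentials, the pattern looks like the following sketch. The names and query here are placeholders; the real base search used in this article is defined next.

<dashboard>
  <label>Base Search Sketch</label>
  <search id="My_Base">
    <query>index="sample-env-prod" | stats count by status</query>
    <earliest>-24h@h</earliest>
    <latest>now</latest>
  </search>
  <row>
    <panel>
      <chart>
        <!-- The panel post-processes the shared results instead of running its own full query -->
        <search base="My_Base">
          <query>stats sum(count) by status</query>
        </search>
      </chart>
    </panel>
  </row>
</dashboard>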

For the purpose of this article, a base search will be used for most panels. The following base search will be referenced by the dashboard panel examples. The base search will have to be added to the dashboard’s raw XML. Add the following as a child to the top-level element in the edit source view of the dashboard. 

<search id="Access_Logs">
    <query>index="$source_index$" source="$source_api$"
        | eval methodPath = httpMethod + " " + host + " " + resourcePath
        | eval is504 = if(status = 504 , 1, 0)
        | eval isOver5s = if(responseLatency &gt; 5000 , 1, 0)
        | eval isOver10s = if(responseLatency &gt; 10000 , 1, 0)
        | eval isOver15s = if(responseLatency &gt; 15000 , 1, 0)
        | eval isOver20s = if(responseLatency &gt; 20000 , 1, 0)
        | eval isOver25s = if(responseLatency &gt; 25000 , 1, 0)
        | bin span=$chart_interval_time$ _time 
        | stats 
        max(responseLatency) as maxResponseLatency, 
        sum(responseLatency) as sumResponseLatency, 
        count as requestCount, 
        sum(isOver5s) as sumOver5s, sum(isOver10s) as sumOver10s, 
        sum(isOver15s) as sumOver15s, sum(isOver20s) as sumOver20s, 
        sum(isOver25s) as sumOver25s, sum(is504) as sum504s, 
        dc(ip) as userCount 
        by _time, methodPath, status
    </query>
    <earliest>$search_time_range.earliest$</earliest>
    <latest>$search_time_range.latest$</latest>
  </search>

Breaking Down the Base Search

  1. <search id="Access_Logs">
    1. Begin the base search
    2. The ID Access_Logs will be referenced in succeeding panels
  2. <query>index="$source_index$" source="$source_api$"
    1. Begin the query
    2. $source_index$ and $source_api$ are inputs to the dashboard
  3. | eval methodPath = httpMethod + " " + host + " " + resourcePath
    1. The eval command creates a new field called methodPath
    2. methodPath will be used as a bucketing field
  4. | eval is504 = if(status = 504 , 1, 0) 
        | eval isOver5s = if(responseLatency &gt; 5000 , 1, 0)      
        | eval isOver10s = if(responseLatency &gt; 10000 , 1, 0)
        | eval isOver15s = if(responseLatency &gt; 15000 , 1, 0)
        | eval isOver20s = if(responseLatency &gt; 20000 , 1, 0)
        | eval isOver25s = if(responseLatency &gt; 25000 , 1, 0)
    1. Create new fields to count the number of requests over certain latency thresholds
    2. Used exclusively in the Excessive Latency dashboard panel specified below
  5. | bin span=$chart_interval_time$ _time
    1. Bucket data by time to prepare for the subsequent transforming command
    2. The span $chart_interval_time$ is an input to the dashboard
  6. | stats 
        max(responseLatency) as maxResponseLatency,
        sum(responseLatency) as sumResponseLatency, 
        count as requestCount,
        sum(isOver5s) as sumOver5s, sum(isOver10s) as sumOver10s,
        sum(isOver15s) as sumOver15s, sum(isOver20s) as sumOver20s,
        sum(isOver25s) as sumOver25s, sum(is504) as sum504s, 
        dc(ip) as userCount 
        by _time, methodPath, status
    1. Transform the search into a statistical table with aggregating fields and bucketing fields
    2. This is the meat of the base search and will serve as the basis for the visualizations that follow
  7. </query>
    1. End the query
  8. <earliest>$search_time_range.earliest$</earliest>
    <latest>$search_time_range.latest$</latest>
    1. Time range of the query
    2. $search_time_range$ is an input to the dashboard
  9. </search>
    1. End the base search
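
Before wiring the base search into a dashboard, it can be helpful to run a trimmed-down version directly in the Search app with sample values substituted for the tokens. A sketch, assuming the sample index and source names used earlier:

index="sample-env-prod" source="access-logs-StuffAPI*" earliest=-24h@h latest=now
| eval methodPath = httpMethod + " " + host + " " + resourcePath
| bin span=1h _time
| stats count as requestCount, max(responseLatency) as maxResponseLatency, dc(ip) as userCount by _time, methodPath, status

If this returns the statistical table you expect, the full base search (with the latency threshold fields added back in) should behave the same way inside the dashboard.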

Dashboard Panels

Dashboard panels provide the means to visualize API access. This section provides seven useful dashboard panels driven by the base search, plus one bonus for fun. 

Minimal customization has been added to the samples, but two customizations in particular are too useful to leave out. A couple of panels have the y-axis scaled logarithmically to prevent small values from being lost among large outliers. Also, every panel has its legend placed at the bottom to keep graphs aligned when panels are stacked vertically, allowing easy comparison of time series data.

The dashboard panels can be added to the dashboard’s raw XML. Add the following samples as a child to the top-level element in the edit source view. When you’re done, switch to the UI view to move and customize the visualization as desired.

Status Code Breakdown

A high-level visualization to see status code distribution for the entire time range selected. This pie chart does a reasonable job of identifying issues for any API returning a small subset of HTTP status codes. Hover to see percentages.

<row>
  <panel>
    <title>Status Codes Breakdown</title>
    <chart>
      <search base="Access_Logs">
        <query>stats sum(requestCount) by status</query>
      </search>
      <option name="charting.chart">pie</option>
      <option name="charting.drilldown">none</option>
    </chart>
  </panel>
</row>

Call Count

A simple visualization to track the number of calls to the API over time. The y-axis is scaled logarithmically to prevent skewness due to high-traffic timespans. 

<row>
  <panel>
    <title>Call Count</title>
    <chart>
      <search base="Access_Logs">
        <query>timechart span=$chart_interval_time$ sum(requestCount) by methodPath</query>
      </search>
      <option name="charting.axisY.scale">log</option>
      <option name="charting.chart">column</option>
      <option name="charting.drilldown">none</option>
      <option name="charting.legend.placement">bottom</option>
    </chart>
  </panel>
</row>

Status Codes Over Time

This visual combines the two above to track the status code distribution over time. Logarithmic scaling is applied to highlight less frequent status codes.

<row>
  <panel>
    <title>Status Codes Over Time</title>
    <chart>
      <search base="Access_Logs">
        <query>timechart span=$chart_interval_time$ sum(requestCount)  by status</query>
      </search>
      <option name="charting.axisY.scale">log</option>
      <option name="charting.chart">column</option>
      <option name="charting.drilldown">none</option>
      <option name="charting.legend.placement">bottom</option>
    </chart>
  </panel>
</row>

Average Response Time

Using the base search actually makes calculating the average over time a little tricky. The base search buckets by status as well as by methodPath, so the per-bucket sums need to be unwound: sum the latencies and request counts across status values, then divide to get a properly weighted average. A little math and Splunk flexing get the job done. Alternatively, query from scratch without the base search. No logarithmic scaling is needed for averaged values.

With Base Search
<row>
  <panel>
    <title>Average Response Times (ms)</title>
    <chart>
      <search base="Access_Logs">
        <query>stats sum(sumResponseLatency) as t_responselatency, sum(requestCount) as t_count by _time, methodPath | eval avgResponseLatency=t_responselatency/t_count | timechart span=$chart_interval_time$ first(avgResponseLatency) as avgResponseLatency by methodPath</query>
      </search>
      <option name="charting.chart">column</option>
      <option name="charting.drilldown">none</option>
      <option name="charting.legend.placement">bottom</option>
    </chart>
  </panel>
</row>
Without Base Search
<row>
  <panel>
    <title>Average Response Times (ms)</title>
    <chart>
      <search>
        <query>index="$source_index$" source="$source_api$"
        | eval methodPath = httpMethod + " " + host + " " + resourcePath
        | timechart span=$chart_interval_time$ avg(responseLatency) by methodPath
        </query>
        <earliest>$search_time_range.earliest$</earliest>
        <latest>$search_time_range.latest$</latest>
      </search>
      <option name="charting.chart">column</option>
      <option name="charting.drilldown">none</option>
      <option name="charting.legend.placement">bottom</option>
    </chart>
  </panel>
</row>

Max Response Times

Unlike the average, max is easy to find with the base search. Like the average, no logarithmic scaling is applied. Max response time shows how your API is performing at the worst of times.

<row>
  <panel>
    <title>Max Response Times (ms)</title>
    <chart>
      <search base="Access_Logs">
        <query>timechart span=$chart_interval_time$ max(maxResponseLatency) by methodPath</query>
      </search>
      <option name="charting.chart">column</option>
      <option name="charting.drilldown">none</option>
      <option name="charting.legend.placement">bottom</option>
    </chart>
  </panel>
</row>

Excessive Latency

 

[Figure: Excessive Latency panel shown for the week to date (top) and narrowed to an hourly view of March 18 (bottom)]

This area graph over time seeks to highlight requests that take an excessive amount of time to complete. In the top visual, which covers the entire week-to-date time range, you can see that excessive latency spiked on March 18. Seven requests took over 5 seconds, with two of those taking over 10 seconds and three taking over 15 seconds. The lower visual narrows to an hourly view of March 18 to show when requests were affected. This is accomplished simply by changing the search_time_range and chart_interval_time dashboard inputs. The Excessive Latency graph is a little more useful than the Max Response Times graph because it shows the number of slow requests rather than just the latency of the single slowest request.

Excessive is relative, so you may need to tweak the thresholds. Doing so requires modifying both the base search and the panel query (a sketch of such a change follows the panel below). As is, the thresholds are multiples of 5 seconds up to 25 seconds, which is assumed to be the maximum response time before a timeout occurs.

<row>
  <panel>
    <title>Excessive Latency</title>
    <chart>
      <search base="Access_Logs">
        <query>timechart span=$chart_interval_time$ sum(sumOver5s) as "&gt; 5 sec" sum(sumOver10s) as "&gt; 10 sec" sum(sumOver15s) as "&gt; 15 sec" sum(sumOver20s) as "&gt; 20 sec" sum(sumOver25s) as "&gt; 25 sec" sum(sum504s) as timeouts</query>
      </search>
      <option name="charting.chart">area</option>
      <option name="charting.drilldown">all</option>
      <option name="charting.legend.placement">bottom</option>
    </chart>
  </panel>
</row>
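
For example, to add a hypothetical 1-second threshold, three small additions would be needed: a new eval field and a new stats aggregate in the base search, and a matching series in the panel query above.

Add to the eval block of the base search:
| eval isOver1s = if(responseLatency &gt; 1000 , 1, 0)

Add to the stats command of the base search:
sum(isOver1s) as sumOver1s,

Add to the panel query:
sum(sumOver1s) as "&gt; 1 sec"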

Distinct IP

This column chart gives an indication of the number of callers accessing your API over time. Use it with the Call Count graph to get a rough estimate of application chattiness: the number of calls client applications make per user (a sketch of this calculation follows the panel below). It can also help identify times of day with relatively low usage in case you need to schedule planned downtime.

<row>
  <panel>
    <title>Distinct IP Count</title>
    <chart>
      <search base="Access_Logs">
        <query>timechart span=$chart_interval_time$ sum(userCount) as "users"</query>
      </search>
      <option name="charting.chart">column</option>
      <option name="charting.drilldown">none</option>
      <option name="charting.legend.placement">bottom</option>
    </chart>
  </panel>
</row>
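
To chart chattiness directly, a panel like the following sketch can divide calls by users per interval. It reuses the base search and the pattern from the Average Response Times panel; note that summing distinct counts across buckets only approximates true distinct users.

<row>
  <panel>
    <title>Calls per User (approximate)</title>
    <chart>
      <search base="Access_Logs">
        <query>stats sum(requestCount) as calls, sum(userCount) as users by _time | eval callsPerUser = round(calls / users, 1) | timechart span=$chart_interval_time$ first(callsPerUser) as "calls per user"</query>
      </search>
      <option name="charting.chart">column</option>
      <option name="charting.drilldown">none</option>
      <option name="charting.legend.placement">bottom</option>
    </chart>
  </panel>
</row>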

Bonus! Geo Location

This satisfying visualization gives an indication of where your callers are located throughout the world. The map view is valuable and easy to digest. Location mapping over individual IP addresses cannot be derived from the base search's aggregated results, so a new query is necessary.

<row>
  <panel>
    <title>Geo Location</title>
    <map>
      <search>
        <query>index="$source_index$" source="$source_api$"
| iplocation ip
| geom geo_countries allFeatures=True featureIdField=Country
| geostats count(Region) latfield=lat longfield=lon</query>
        <earliest>$search_time_range.earliest$</earliest>
        <latest>$search_time_range.latest$</latest>
      </search>
      <option name="mapping.type">marker</option>
    </map>
  </panel>
</row>
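
If you would rather see a sortable table of calling countries than a map, a variant like this sketch works with the same inputs:

<row>
  <panel>
    <title>Calls by Country</title>
    <table>
      <search>
        <query>index="$source_index$" source="$source_api$"
| iplocation ip
| stats count by Country
| sort -count</query>
        <earliest>$search_time_range.earliest$</earliest>
        <latest>$search_time_range.latest$</latest>
      </search>
    </table>
  </panel>
</row>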


Putting It All Together

Access log dashboards are an obvious first step for API monitoring due to their breadth of coverage and relative ease of setup. The confidence gained from adding API monitoring will enable development teams to move faster and deliver more often. To help you get started, a single XML document containing all sample code from this article is included below. Customize the document to fit your project, and create new visualizations with the data you have available. Happy Splunking!

dashboard.xml

<form>
  <label>API Access</label>
  <fieldset submitButton="false">
    <input type="dropdown" token="source_index">
      <label>source_index</label>
      <choice value="sample-env-prod">production -- UPDATE VALUE TO REAL INDEX</choice>
      <default>sample-env-prod</default>
    </input>
    <input type="time" token="search_time_range">
      <label>search_time_range</label>
      <default>
        <earliest>-24h@h</earliest>
        <latest>now</latest>
      </default>
    </input>
    <input type="dropdown" token="chart_interval_time">
      <label>chart_interval_time</label>
      <choice value="10m">10 minutes</choice>
      <choice value="15m">15 minutes</choice>
      <choice value="1h">1 hour</choice>
      <choice value="4h">4 hours</choice>
      <choice value="1d">1 day</choice>
      <choice value="1w">1 week</choice>
      <default>1h</default>
    </input>
    <input type="dropdown" token="source_api">
      <label>api</label>
      <choice value="access-logs-FoodAPI*">Food -- UPDATE VALUE TO REAL SOURCE</choice>
      <choice value="access-logs-StuffAPI*">Stuff -- UPDATE VALUE TO REAL SOURCE</choice>
      <choice value="access-logs-*">All</choice>
      <default>access-logs-StuffAPI*</default>
    </input>
  </fieldset>
  <search id="Access_Logs">
    <query>index="$source_index$" source="$source_api$"
        | eval methodPath = httpMethod + " " + host + " " + resourcePath
        | eval is504 = if(status = 504 , 1, 0)
        | eval isOver5s = if(responseLatency &gt; 5000 , 1, 0)
        | eval isOver10s = if(responseLatency &gt; 10000 , 1, 0)
        | eval isOver15s = if(responseLatency &gt; 15000 , 1, 0)
        | eval isOver20s = if(responseLatency &gt; 20000 , 1, 0)
        | eval isOver25s = if(responseLatency &gt; 25000 , 1, 0)
        | bin span=$chart_interval_time$ _time 
        | stats 
        max(responseLatency) as maxResponseLatency, 
        sum(responseLatency) as sumResponseLatency, 
        count as requestCount, 
        sum(isOver5s) as sumOver5s, sum(isOver10s) as sumOver10s, 
        sum(isOver15s) as sumOver15s, sum(isOver20s) as sumOver20s, 
        sum(isOver25s) as sumOver25s, sum(is504) as sum504s, 
        dc(ip) as userCount 
        by _time, methodPath, status
    </query>
    <earliest>$search_time_range.earliest$</earliest>
    <latest>$search_time_range.latest$</latest>
  </search>
  <row>
    <panel>
      <title>Status Codes Breakdown</title>
      <chart>
        <search base="Access_Logs">
          <query>stats sum(requestCount) by status</query>
        </search>
        <option name="charting.chart">pie</option>
        <option name="charting.drilldown">none</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <title>Call Count</title>
      <chart>
        <search base="Access_Logs">
          <query>timechart span=$chart_interval_time$ sum(requestCount) by methodPath</query>
        </search>
        <option name="charting.axisY.scale">log</option>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
        <option name="charting.legend.placement">bottom</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <title>Status Codes Over Time</title>
      <chart>
        <search base="Access_Logs">
          <query>timechart span=$chart_interval_time$ sum(requestCount)  by status</query>
        </search>
        <option name="charting.axisY.scale">log</option>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
        <option name="charting.legend.placement">bottom</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <title>Average Response Times (ms)</title>
      <chart>
        <search base="Access_Logs">
          <query>stats sum(sumResponseLatency) as t_responselatency, sum(requestCount) as t_count by _time, methodPath | eval avgResponseLatency=t_responselatency/t_count | timechart span=$chart_interval_time$ first(avgResponseLatency) as avgResponseLatency by methodPath</query>
        </search>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
        <option name="charting.legend.placement">bottom</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <title>Average Response Times (ms)</title>
      <chart>
        <search>
          <query>index="$source_index$" source="$source_api$"
        | eval methodPath = httpMethod + " " + host + " " + resourcePath
        | timechart span=$chart_interval_time$ avg(responseLatency) by methodPath
          </query>
          <earliest>$search_time_range.earliest$</earliest>
          <latest>$search_time_range.latest$</latest>
        </search>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
        <option name="charting.legend.placement">bottom</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <title>Max Response Times (ms)</title>
      <chart>
        <search base="Access_Logs">
          <query>timechart span=$chart_interval_time$ max(maxResponseLatency) by methodPath</query>
        </search>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
        <option name="charting.legend.placement">bottom</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <title>Excessive Latency</title>
      <chart>
        <search base="Access_Logs">
          <query>timechart span=$chart_interval_time$ sum(sumOver5s) as "&gt; 5 sec" sum(sumOver10s) as "&gt; 10 sec" sum(sumOver15s) as "&gt; 15 sec" sum(sumOver20s) as "&gt; 20 sec" sum(sumOver25s) as "&gt; 25 sec" sum(sum504s) as timeouts</query>
        </search>
        <option name="charting.chart">area</option>
        <option name="charting.drilldown">all</option>
        <option name="charting.legend.placement">bottom</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <title>Distinct IP Count</title>
      <chart>
        <search base="Access_Logs">
          <query>timechart span=$chart_interval_time$ sum(userCount) as "users"</query>
        </search>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
        <option name="charting.legend.placement">bottom</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <title>Geo Location</title>
      <map>
        <search>
          <query>index="$source_index$" source="$source_api$"
| iplocation ip
| geom geo_countries allFeatures=True featureIdField=Country
| geostats count(Region) latfield=lat longfield=lon</query>
          <earliest>$search_time_range.earliest$</earliest>
          <latest>$search_time_range.latest$</latest>
        </search>
        <option name="mapping.type">marker</option>
      </map>
    </panel>
  </row>
</form>