Modernizing Network and Infrastructure Observability: SNMP to Streaming Telemetry
Gathering metrics from network elements and related infrastructure devices remains a critical facet of infrastructure observability, complemented but not replaced by other approaches like packet inspection and synthetic testing. Network teams have relied on the Simple Network Management Protocol (SNMP) for decades, and SNMP remains the prevalent approach for metrics gathering and alerting in networks. SNMP is widely supported and continues to be improved with newer versions such as SNMPv3. However, a more modern approach, referred to as Streaming Telemetry, has entered the picture. Streaming Telemetry removes many of the drawbacks of traditional network monitoring and provides a real-time framework aligned with modern observability and AIOps outcomes.
Metrics acquisition with SNMP is based on polling network devices. The network management system (NMS) polls each device for the relevant metrics at a regular interval using the SNMP GET operation, and each device replies with the requested metrics. The NMS stores the metrics in some form of time series database, analyzes them, and makes them available for viewing. While effective, this method has inherent drawbacks. First, only so many devices can be polled at a time, so each device may be polled only every few minutes; in large environments, intervals of up to fifteen minutes are typical. These long intervals reduce the granularity of the data to an undesirable level and impair the ability to detect issues requiring attention. While SNMP does have a trap mechanism for alerting, it is cumbersome to manage and limited to preset, device-detected events.

Another inefficiency of polling is that each poll returns all the data requested even if no changes occurred since the previous poll. SNMP also relies on the User Datagram Protocol (UDP), which can be particularly unreliable during impaired network conditions - precisely when we most need robust visibility. SNMP versions 1 and 2c additionally provide an attack surface by not encrypting data and passwords (community strings). While upgrading to SNMPv3 resolves this, the time and effort might be better spent moving to a modern solution like Streaming Telemetry.
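A rough back-of-the-envelope calculation shows why polling intervals stretch in large environments. The device count and poll rate below are illustrative assumptions, not figures from any specific deployment:

```python
# Rough polling-cycle estimate for a sequential SNMP poller.
# DEVICES and POLLS_PER_SECOND are illustrative assumptions.

DEVICES = 10_000          # managed network elements
POLLS_PER_SECOND = 25     # GET request/response pairs one poller sustains

cycle_seconds = DEVICES / POLLS_PER_SECOND
print(f"Full polling cycle: {cycle_seconds / 60:.1f} minutes")
# With these assumptions, each device is revisited only every ~6.7 minutes,
# which becomes the granularity ceiling for the collected metrics.
```

Parallelizing pollers helps, but each additional poller adds load on the management network and the devices themselves, so intervals rarely shrink to true real time.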
Streaming Telemetry leverages a form of "push technology" in which the network devices push metrics to subscribers. Unlike SNMP polling, metric changes can be forwarded in near real time, yielding sufficient resolution for capacity planning and the ability to detect anomalies proactively. On the network device side, Streaming Telemetry is based on modern alternatives to SNMP like NETCONF and OpenConfig. These leverage YANG data models to describe the device in question (in lieu of SNMP's MIBs). These modern protocols are also widely employed for device control and automation as the preferred alternative to the CLI, so it is now possible to converge on a common management protocol. For transport, telemetry leverages gNMI (gRPC Network Management Interface), built on gRPC - an open-source framework developed by Google and now managed by the CNCF. gNMI provides methods to write to and read from the devices, as well as methods to subscribe to Streaming Telemetry. While some older network devices may still lack support for gNMI, most current devices and software do. We'll show how to support both simultaneously with an enhanced publish / subscribe data framework based on Kafka.
Following is a side-by-side comparison of the traditional SNMP stack versus the modern network telemetry stack.
The typical telemetry architecture incorporates a robust streaming data backbone to provide reliable data transport, enrichment, and publish / subscribe integration fabric. This allows multiple back-end observability platforms to subscribe to and consume the telemetry. For example, the same metrics could be made available to:
- Specialized NMS platforms
- Cross-domain observability platforms spanning both infrastructure and applications
- Data lakes to facilitate advanced machine learning-driven analytics
This data streaming approach is being widely adopted across the broader observability and AIOps space, spanning multiple IT domains including application, cloud, and infrastructure monitoring, as well as IoT. The publish / subscribe architecture makes it easy to add new functionality or to migrate or consolidate existing platforms.
The following diagram shows a modern scalable framework to concurrently support legacy SNMP and Streaming Telemetry. The main parts of this framework are:
- Network Devices
- Server Based Agent
- Event Streaming Platform
- One or more Observability Back-Ends
Each part of the framework is described below.
The framework can support both legacy SNMP devices and devices supporting telemetry. In both cases, devices interact directly with the Telegraf agent.
- SNMP devices send traps to Telegraf.
- Telegraf polls the SNMP devices for metrics.
- Telegraf subscribes to and receives metrics from telemetry-equipped devices via gNMI (gRPC Network Management Interface).
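The three collection paths above map directly to Telegraf input plugins. The sketch below uses the real plugin names (`inputs.snmp`, `inputs.snmp_trap`, `inputs.gnmi`), but the device addresses, credentials, and OID/path selections are placeholder assumptions to be replaced for your environment:

```toml
# SNMP polling of a legacy device (addresses and community are placeholders)
[[inputs.snmp]]
  agents = ["udp://10.0.0.1:161"]
  version = 2
  community = "public"
  [[inputs.snmp.field]]
    name = "uptime"
    oid = "RFC1213-MIB::sysUpTime.0"

# Receive SNMP traps pushed by legacy devices
[[inputs.snmp_trap]]
  service_address = "udp://:162"

# gNMI subscription to a telemetry-capable device
[[inputs.gnmi]]
  addresses = ["10.0.0.2:57400"]
  username = "telemetry"
  password = "change-me"
  [[inputs.gnmi.subscription]]
    name = "ifcounters"
    origin = "openconfig"
    path = "/interfaces/interface/state/counters"
    subscription_mode = "sample"
    sample_interval = "10s"
```

A single Telegraf instance can run all three inputs concurrently, which is what allows legacy and modern devices to coexist behind one agent.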
In this example we leveraged Telegraf, a server-based agent for collecting and sending metrics and events from network devices, databases, systems, and IoT sensors. Telegraf is written in Go, compiles into a single binary with no external dependencies, and has a minimal memory footprint. As described above, Telegraf requests and receives metrics from the devices via SNMP and gNMI and receives SNMP traps from the devices where applicable. Telegraf supports inline filtering and enrichment of the received data and forwards the data to Kafka, making it available to downstream consumers. Telegraf interfaces with the various sources and destinations through a wide range of available input and output plugins.
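Forwarding to Kafka is handled by Telegraf's `outputs.kafka` plugin. A minimal sketch (broker addresses and topic name are placeholder assumptions):

```toml
# Publish collected metrics to the event streaming backbone
[[outputs.kafka]]
  brokers = ["kafka1:9092", "kafka2:9092"]
  topic = "metrics"
  data_format = "json"
```

Because inputs and outputs are decoupled inside Telegraf, the same Kafka output serves metrics collected by SNMP polling, SNMP traps, and gNMI subscriptions alike.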
In this example the event streaming backbone is based on Apache Kafka, a distributed event store and stream-processing platform. Kafka is a widely used open-source platform developed by the Apache Software Foundation. Kafka is written in Java and Scala, and aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Kafka, combined with Telegraf and potentially other tools, forms an observability pipeline that provides routing, enrichment, and robust transport of logs, metrics, and traces from a variety of sources to various back-end systems.
One of Kafka's key benefits is support for the publish / subscribe model which allows multiple consumers to all receive the same data from a single topic. Topics are Kafka's mechanism for sharing data. This example leverages two topics, Metrics and Alerts. The producers of these messages send the data to Kafka on the relevant topic(s). Various consumers can receive these messages by subscribing to their topic(s) of interest. This approach is very flexible compared to point-to-point integrations. A more complex implementation could see different types of metrics getting their own topics (e.g., CPU, memory, disk, network, etc.) to enable more efficient categorization and handling of specific metrics.
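The fan-out behavior of a topic can be illustrated with a minimal in-memory sketch. This is not Kafka itself (a real deployment would use a Kafka client library such as confluent-kafka); it only demonstrates the delivery semantics described above, with illustrative subscriber and device names:

```python
from collections import defaultdict

class MiniBroker:
    """Toy publish / subscribe broker illustrating topic fan-out.
    Not Kafka -- just the delivery semantics the article describes."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> consumer callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Every subscriber on the topic receives the same message.
        for callback in self.subscribers[topic]:
            callback(message)

broker = MiniBroker()
nms_inbox, lake_inbox = [], []
broker.subscribe("metrics", nms_inbox.append)   # e.g., an NMS platform
broker.subscribe("metrics", lake_inbox.append)  # e.g., a data lake
broker.publish("metrics", {"device": "edge-rtr-1", "ifInOctets": 123456})
# Both consumers now hold the identical record, with no point-to-point wiring.
```

Adding a third back end is just one more `subscribe` call; the producer never changes, which is the core advantage over point-to-point integrations.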
In this example the Telegraf agent interacts with Kafka on behalf of the network devices, publishing metrics originating from SNMP and gNMI to the Metrics topic for consumption by various Kafka subscribers, which may include specialized network management platforms, general-purpose observability platforms, and data lakes.
Enrichment of metrics can be accomplished in Telegraf or in the Kafka layer. While Telegraf is capable of simple data enhancement, stream processing in the Kafka ecosystem (e.g., Kafka Streams) enables more sophisticated inline enrichment and transformation. Examples of enrichment include normalizing formats between SNMP and telemetry sources, applying context tags to aid mechanization of incident response, and even incorporating data from external sources.
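On the Telegraf side, simple enrichment can be expressed directly in configuration. The sketch below uses the real `tags` table and `processors.rename` plugin, but the tag values and field names are illustrative assumptions for normalizing SNMP counters toward a gNMI-style name:

```toml
# Tag SNMP-sourced metrics with context for downstream routing
[[inputs.snmp]]
  agents = ["udp://10.1.1.1:161"]
  [inputs.snmp.tags]
    site = "dc-east"
    source_protocol = "snmp"

# Rename an SNMP counter field to match the name used for telemetry data
[[processors.rename]]
  [[processors.rename.replace]]
    field = "ifHCInOctets"
    dest = "in_octets"
```

Normalizing names at this stage means downstream consumers can treat SNMP- and gNMI-sourced metrics uniformly.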
In this example we have one or more back-end platforms capable of consuming time series data from Kafka. This would include a wide variety of solutions:
- Commercial and open-source observability platforms
- Network Monitoring System (NMS)
- General Purpose Time Series Databases
We can have multiple of these platforms concurrently subscribing to and ingesting network metrics, potentially for different purposes. A traditional NMS focused on fault management might be augmented by a data lake where metrics are analyzed using machine learning tools for capacity management and network planning. Observability platforms can ingest network metrics to provide a single view across multiple IT domains including network, cloud, applications, etc.
These back-ends can analyze the metrics in real time, detect anomalous conditions (e.g., exceeded traffic or error thresholds), and generate actionable alerts. These alerts are published to Kafka and consumed by NOC dashboards and an event correlation platform to accelerate incident response.
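A minimal sketch of the threshold logic such a back end might apply to each consumed metrics record. The field names and threshold values here are illustrative assumptions, not part of any specific product:

```python
# Threshold checks a consuming back end might run on each metrics record
# before publishing an alert to the Alerts topic. Field names and
# threshold values are illustrative.

UTIL_THRESHOLD = 0.90    # 90% link utilization
ERROR_THRESHOLD = 100    # input errors per collection interval

def check_metrics(record):
    """Return a list of alert dicts for any exceeded thresholds."""
    alerts = []
    if record["utilization"] > UTIL_THRESHOLD:
        alerts.append({"device": record["device"],
                       "type": "high_utilization",
                       "value": record["utilization"]})
    if record["in_errors"] > ERROR_THRESHOLD:
        alerts.append({"device": record["device"],
                       "type": "input_errors",
                       "value": record["in_errors"]})
    return alerts

sample = {"device": "core-sw-2", "utilization": 0.95, "in_errors": 12}
print(check_metrics(sample))  # one high_utilization alert for core-sw-2
```

In the architecture described here, each alert dict would be serialized and published to the Alerts topic, where NOC dashboards and event correlation tooling consume it.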
This architecture provides several key benefits:
- Moves network management to a modern observability paradigm.
- Aligns network metrics gathering with other infrastructure and application monitoring solutions.
- Provides robust and secure data transport, as opposed to lossy UDP.
- Provides the flexibility of a publish / subscribe integration model (versus point to point).
- Allows consolidation across IT domains and easy addition of advanced ML based capabilities.
- Provides a single data platform enabling easier integration with data from other lines of business.
This framework can be built incrementally once certain core elements are in place. WWT can help in many ways including:
- Rapid prototyping
- End-to-end telemetry integration and data streaming design
- Integration of observability back-end platforms
- Advanced analytics for predictive operations and automation of incident response and resolution
We can meet you where you are and help you deliver a modern, scalable, and optimized network observability platform.