Use the NEXUS Dashboard Free Trial to Proactively Monitor Your ACI Fabric (Part 2)
Use the Virtual NEXUS Dashboard free trial to gain Day 2 operational visibility in ACI fabrics.
With the Virtual NEXUS Dashboard (vND), we can now quickly build a small 3-6 node cluster and use the free 90-day trial period. The trial uses the assurance engine to verify an ACI fabric's policy against best practices, track changes, and troubleshoot any issues that occur. Using the NEXUS Insights tools, we can check the fabric for PSIRTs, bugs, and EOS software and hardware, and use flow telemetry to show dropped packets, endpoint issues, and flow-based analytics. The Cisco Nexus Dashboard uses a correlated data lake to provide real-time insights and automation services to operate multi-cloud data center networks spanning on-premises, virtual edge, and cloud sites. It provides a single pane of glass into proactive operations with continuous assurance and actionable insights across the data center. The Nexus Dashboard incorporates Nexus Insights, NEXUS Dashboard Orchestrator, and third-party telemetry to provide seamless access to network controllers, third-party tools, and cloud-based services.
The vision for NEXUS Dashboard is to provide a unified platform with a correlated data lake, containerized applications that reside on the platform and consume this data lake, and integration with third-party applications. Over time, this leads to a fully observable network with auto-suggested solutions and, eventually, truly autonomous operations.
As users and workloads become more and more distributed, it gets harder to gain accurate insight into the utilization of our transport fabrics and to know whether there are problems we cannot see.
Unifying the data center controllers and Day 2 apps on the NEXUS Dashboard, where they all consume the same correlated data, finally gives us a single pane of glass into business anomalies and whether the network is causing them (or not).
If you consider how we use our Internet browsers today, whether on a computer or smart device, we go to a search engine such as Google, and it is the correlation engine for all of the web content on the internet. The goal of NEXUS Dashboard is to be that correlation engine, but instead of web content, it gathers telemetry and events from various sources across our new distributed and diverse IT landscape. The more diverse and distributed sources we can gather data from, the better insights we shall have into the daily operations of our business and lives.
Customers want to deliver a great user experience, expand into multi-cloud securely, simplify compliance and security posture, accelerate IT operations with automation, and deliver the future of work. NEXUS Dashboard supports those customer initiatives and makes these goals easier to achieve as more applications, integrations, and third-party apps are onboarded.
NEXUS Dashboard Overview and integrating an onboarded site into NEXUS Insights
We left off in Part 1 by installing the NEXUS Dashboard cluster, onboarding an ACI site, installing NEXUS Insights, and connecting to Intersight. The next piece is to integrate the onboarded Site into the NEXUS Insights app.
First, go back to your NEXUS Dashboard and log in with Admin and the defined password.
We see that we can click Do Not Show at log in, Review Setup, or Get Started. Click on Get Started.
When you first log in, ND presents One View, which shows all of your sites in a single pane of glass. If there were Anomalies, Advisories, Audit Logs, or Faults in any site, you could click on them. If they are local to the Federation Master Cluster, they show up directly; if not, the link cross-launches to the NEXUS Dashboard that has the Site(s) with problems. The Home button is in the top left-hand corner; it always gets you back to the One View home screen.
Click on the Admin Console from the left navigation menu.
From the Admin Console, we can see the System is healthy, the Cluster Health is OK, and Intersight is connected. If we click on the blue circle in the upper right-hand corner, it opens a sub-menu to add a site, add a node to the cluster, set up a firmware update, create users and login domains, and verify the NEXUS Dashboard setup. We want to click on the Services tab on the left-hand navigation menu.
From the Services menu, we see the Installed Services and App Store tabs. On the NEXUS Dashboard Insights app, click Open to launch it.
We have no Site Groups configured, so we need to create one. Site Groups are an excellent way to group your sites for easier manageability in NEXUS Insights. Click on Configure Site Groups.
A window launches showing the NTP and in-band management prerequisites that we completed in the Part 1 white paper. Click the check box and click Let's Get Started.
Click Configure to set up a Site Group.
There are no Site Groups; click Add New Site Group.
Give the Site Group a Name and Description. Under Data Collection Type, click the radio button to Add Site(s), then under Entity click Select member. Selecting a member pops up a window that displays the sites onboarded to NEXUS Dashboard. If more than one Site were onboarded, we could add them all to the group or configure separate groupings for administrative purposes. Click MSITE-EQUINIX, or whatever you named the Site, then click Select.
Notice that the Site is added to the Site Group but still needs to be enabled. Click on the drop-down to Enable and then on Configuration.
A window pops up to enter the username and password of the onboarded Site again. Click Save.
Next, click on the checkmark to verify the password. Notice that the Site is now enabled; this step connects to the APIC and verifies connectivity.
The Site is now enabled and configured, and we see a pencil icon where we can edit the username and password if they ever change. Click Save to save the Site Group with the one enabled Site.
We see our Site Group name, 1 entity. Click Done on the bottom right side.
Click Done again.
We next see the Overview. We don't see many errors or issues yet, because we still need to set up Bug Scan, Assurance Analysis, Flow Telemetry, and the AppDynamics third-party integration. We do see 2 leafs, 1 spine, and 1 controller. We also see two advisories. Click on the Topology tab for a different view of the fabric.
Clicking on Leaf302-MSITE-Equinix launches a window on the right that shows details such as the management IP address, events, advisories, and version.
Clicking on the cross-launch icon takes us right to the leaf, where any errors or advisories for that particular leaf, spine, or APIC show up.
The Overview tab shows us the same information we saw on the sidebar view. Click on the Alerts to drill into the 2 advisories shown.
Here we see the two advisories: one for a field notice on the SSD drives, the second a PSIRT notice. Intersight downloads the database daily, and when the Site was onboarded to NEXUS Dashboard Insights, the fabric was scanned for all errors against this database. The scan schedule can be modified in the setup. Click Done or the X to close the window.
Going back to the overview Navigation menu tab, we configure NEXUS Insights to use the assurance engine and flow telemetry and third-party integration with AppDynamics.
Click on the 3 dots next to MSITE (or whatever you named your site group) and click Configure Site Group.
If you had more than one Site or Site Group, you would need to choose the Site first, then click the 3 dots and Configure Site Group. However, for the POC, we are limited by the capacity of the virtual NEXUS Dashboard, so we limit the POC to one Site.
When the Configure Site Group window for the Site opens, we see various tabs that configure NEXUS Insights telemetry and Assurance.
The first tab is Bug Scan. This scan uses the Intersight connector we set up in Part 1 to reach the Cisco database, which downloads daily. The daily download updates the database used to check your fabric hardware and code versions for known bugs. We can see the scan hasn't run yet but is enabled and set to run once a week. You can change the schedule to daily using the pencil icon, and you can also start a scan now.
I've modified mine to run once a day at 2:00 AM; click Save
Bug Scan Configuration
I've also started the Bug Scan to catch an issue that we can use later in the guide.
The next tab is one of the essential pieces: the Assurance Analysis engine. It used to be called the Network Assurance Engine (NAE) before being combined with NEXUS Insights. Assurance Analysis compares the APIC policy with what is actually programmed in the TCAM on each switch. It then compares this configuration to a database of ACI best practices, TAC cases, and bugs to ensure that your APIC policy has no flaws and follows best practices.
Assurance Analysis configuration
We can see Assurance has not run, and it is disabled. To enable Assurance Analysis, click on the pencil icon.
Enable the configuration and set it to repeat every 15 minutes for the POC, so we don't overload the small POC vND. We see later that each of these 15-minute runs produces what is known as an Epoch, which is a snapshot of the policy and the expected state of the infrastructure. If there are discrepancies, we can look back to previous Epochs to see when they started and what changed. These Epochs are one of the major benefits of the correlated database: by going back through them, we can see what happened, what changed, and who is affected.
To configure the Assurance Analysis, set the State to Enabled, choose a start time, and set Repeat Every (15-60 minutes is recommended for the small vND). Then click Save.
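To make the Epoch idea concrete, here is a minimal Python sketch of looking back through periodic snapshots to find when an anomaly first appeared and what changed between two runs. It is only an illustration under stated assumptions: the Epoch class, the anomaly names, and the dates are hypothetical and do not reflect how NEXUS Insights stores its data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Epoch:
    """One Assurance Analysis run: a timestamp plus the anomalies it found."""
    taken_at: datetime
    anomalies: frozenset


def build_epochs(start, interval_minutes, anomaly_sets):
    """Create evenly spaced epochs, one per analysis run."""
    step = timedelta(minutes=interval_minutes)
    return [
        Epoch(start + i * step, frozenset(found))
        for i, found in enumerate(anomaly_sets)
    ]


def first_seen(epochs, anomaly):
    """Return the earliest epoch in which the anomaly appears, or None."""
    for epoch in sorted(epochs, key=lambda e: e.taken_at):
        if anomaly in epoch.anomalies:
            return epoch
    return None


def changes_between(previous, current):
    """Return (raised, cleared) anomaly sets between two epochs."""
    raised = current.anomalies - previous.anomalies
    cleared = previous.anomalies - current.anomalies
    return raised, cleared


# Hypothetical data: four runs at 15-minute intervals.
epochs = build_epochs(
    datetime(2021, 6, 1, 10, 0),
    15,
    [
        {"interface-down"},
        {"interface-down"},
        {"interface-down", "bd-subnet-missing"},
        {"bd-subnet-missing"},
    ],
)

origin = first_seen(epochs, "bd-subnet-missing")
raised, cleared = changes_between(epochs[2], epochs[3])
print(origin.taken_at, raised, cleared)
```

A shorter interval gives finer resolution on when a problem began, at the cost of more analysis runs on the small vND.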
Click Run Now so the Assurance Analysis starts and we have data for later tasks.
Export Data Configuration
The next tab is Export Data, which lets us email basic or advanced datasets and export these datasets to devices configured on a message bus. Export Data is outside the scope of the POC; however, it is worth looking into for production installs after the POC, on a production NEXUS Dashboard cluster that follows proper sizing guidelines.
For the Message Bus piece, we can send the same advanced dataset to a remote messaging bus.
Flow Telemetry Discussion
Before we discuss setting up the Flow Telemetry, we need to understand where we obtain it, the types of flow data, and its use. As we all know, visibility in a network is hard to obtain, especially at a large scale, and now in multi-site and multi-cloud architectures, we need a way to stitch flows together for an end-to-end view.
Network telemetry needs to be pushed from the source to a central repository with a data lake so applications such as Network Insights can consume it. This data lake also needs to provide the means to correlate the telemetry data with other data sources.
The newer ACI -EX, -FX, and -GX series switches use a unique flow telemetry ASIC to push line-rate data to a collector. It is not a complete packet capture like a tap; rather, it is the first 160 bytes of header information, which we can use to determine the health of the network flows. We are not interested in the entire data payload, as security tools would be, but rather in enough information to stream line-rate telemetry to a collector. Other methods of flow telemetry cannot push at line rate, so the telemetry becomes sampled, and we have gaps in visibility. With the custom ASICs on the NEXUS 9000 series switches, the telemetry stream is not sampled, so we do not have any gaps in visibility.
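The visibility gap that sampling creates can be shown with a toy Python example. This is only a sketch of the general idea, not a model of the ASIC or of any export format; the flow names and the 1-in-4 sampling rate are made up for illustration.

```python
from collections import Counter


def flows_seen(packets, sample_every=1):
    """Return per-flow packet counts when only every Nth packet is exported."""
    return Counter(
        flow for index, flow in enumerate(packets) if index % sample_every == 0
    )


# Hypothetical packet stream: one long-lived flow and two short ones.
packets = ["backup"] * 7 + ["dns-query"] + ["backup"] * 7 + ["tcp-reset"]

full = flows_seen(packets)                     # every packet exported
sampled = flows_seen(packets, sample_every=4)  # 1-in-4 sampling

# Flows that exist on the wire but never show up in the sampled export.
missed = set(full) - set(sampled)
print(sorted(missed))
```

In this stream, the two short flows fall between the sampled packets and disappear from the export, which is the kind of gap an unsampled stream avoids.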
Depending on the model, NEXUS 9000 series switches offer Flow Table (FT), Flow Table Events (FTE), and Streaming Statistics Export (SSX). SSX has the most analytical capabilities, going as deep as showing microbursts in the network.
The following table shows the switch models and what type of export they can provide. Network Insights can ingest all three types, but depending on the export type there may be gaps in the details, such as microbursts.
Network Insights 6.0 (NI 6.0) offers many telemetry data sources, such as syslog, RIB and FIB tables, streaming telemetry, and now also NetFlow. It ingests these datasets, extracts the metadata, and correlates it against a database updated from Cisco. From this telemetry and the correlated metadata, NI derives insights and suggests remediation actions for root cause analysis and failure prediction. The new Network Insights 6.0 application is a combination of Network Insights Resources (NIR), which looked at telemetry and correlated events, and Network Insights Advisor, which looked at the code and configuration of the fabric and compared it to a known database of PSIRT issues, bugs, recommended software, and other issues. In the 6.0 release, Assurance Analysis data is also correlated across all of our Day 2 operations suites.
Now, instead of troubleshooting the network by traditional means, we can let Network Insights automatically parse the telemetry data and show where the issues are by identifying a problem, locating it, and determining the root cause.
Understanding the NEXUS Insights application
The NI 6.0 platform needs data to ingest and correlate in order to find anomalies. First, we need streaming software and hardware telemetry data to populate our correlated database. Software telemetry streams from the APICs and switches, and hardware telemetry streams from the switches. The telemetry gathered from the APICs, spines, leafs, and interfaces, along with flow telemetry data, populates the database.
NI also gathers Cisco's data on the latest PSIRTs, bugs, configuration issues, and EOL notices every 24 hours using the Intersight connector. The Network Insights Advisor uses this database and inspects your hardware, software, and configuration to see if any faults need correction. Next, we pull in the running-config, show tech, audit logs, and other data to compare the APIC policy with the policy programmed in hardware. Also, through the Cisco Intersight connector, we pull in PSIRTs, field notices (FN), EOS/EOL notices, bugs, and defects.
Now that we have the data lake, let's look at the architecture of the NI 6.0 platform. We can see data being ingested from either ACI or DCNM via the telemetry collectors, which attach to a Kafka messaging bus. The Kafka bus has a data lake connector and anomaly engines that inspect the data lake for issues in your fabric. If issues are detected, they display in the GUI.
The correlation engines use a Kafka bus to relay messages. In later releases, this Kafka bus can connect other data and telemetry sources such as AppD, Splunk, and other third-party vendors.
The correlation Engine normalizes the data to see all flows and anomalies along a timeline for inspection.
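The collector, bus, and anomaly engine stages described above can be sketched in a few lines of Python. This is a conceptual illustration only: Python's standard queue.Queue stands in for the Kafka bus, and the record fields, leaf names, and drop threshold are assumptions, not the NI 6.0 implementation.

```python
import queue

# Stand-in for the message bus; a real deployment uses Kafka, not queue.Queue.
bus = queue.Queue()


def collector(source, records):
    """Telemetry collector: publish raw records onto the bus."""
    for record in records:
        bus.put({"source": source, **record})


def anomaly_engine(drop_threshold):
    """Drain the bus, store every record in the 'data lake', flag drops."""
    data_lake, anomalies = [], []
    while not bus.empty():
        record = bus.get()
        data_lake.append(record)
        if record.get("drops", 0) > drop_threshold:
            anomalies.append(record)
    # Normalize onto one timeline so events from all sources line up.
    anomalies.sort(key=lambda r: r["ts"])
    return data_lake, anomalies


# Hypothetical records from two leaf switches.
collector("leaf-301", [{"ts": 20, "drops": 0}, {"ts": 40, "drops": 12}])
collector("leaf-302", [{"ts": 30, "drops": 7}])

data_lake, anomalies = anomaly_engine(drop_threshold=5)
print([(a["ts"], a["source"]) for a in anomalies])
```

Sorting the flagged records by timestamp is the toy equivalent of placing anomalies from different sources on one timeline for inspection.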
Now that we are more familiar with the flows coming from the switches into NI, let us configure the flow collectors.
Flow Collector Configuration
The next essential piece is the Flows tab. The Flow Configuration allows us to use the flow telemetry that we configured.
One thing to watch for is the VRF used in the rules. When you configure the Rule Subnets (in this case 172.27.135.0, .16, and .32), the VRF used MUST be the VRF attached to the bridge domain hosting those subnets. If that VRF lives in the Common Tenant or a different Tenant, you MUST still use that VRF. Likewise, if you do VRF leaking and want to export flow telemetry, you MUST use the VRF that the bridge domain's subnets are using. Also, when you create the Rule Subnet, you must use the actual subnet address, not the gateway address that the bridge domain uses. In this example, we ONLY see traffic from the three subnets. In a POC, I would recommend using the Flow Telemetry Rules to limit how many flows are exported to NI.
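The difference between a gateway-style bridge domain subnet and the network address a rule subnet expects can be checked with Python's standard ipaddress module. This is a minimal sketch; the function names are hypothetical, and the /28 prefix length is assumed purely for illustration since the prefix lengths are not stated above.

```python
import ipaddress


def rule_subnet(bd_subnet):
    """Convert a gateway-style BD subnet (e.g. 10.0.0.1/24) to its network."""
    return str(ipaddress.ip_interface(bd_subnet).network)


def is_valid_rule_subnet(value):
    """True when the value is a network address with no host bits set."""
    try:
        ipaddress.ip_network(value, strict=True)
        return True
    except ValueError:
        return False


# Hypothetical BD gateway; the /28 prefix length is assumed for illustration.
print(rule_subnet("172.27.135.17/28"))
print(is_valid_rule_subnet("172.27.135.17/28"))
print(is_valid_rule_subnet("172.27.135.16/28"))
```

Running a check like this against the bridge domain subnets before entering them as rule subnets catches the common mistake of pasting in the gateway address.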
Set the Microburst Sensitivity to Low Sensitivity.
The Alert Rules and Compliance Requirement tabs are discussed in a later portion of this guide.
The last tab, Collection Status, shows us how the Analysis and flow telemetry collection are performing. We can see that some devices may not support certain collection features due to their hardware type. Close this window using the X in the top right corner to return to the Overview.
In the Overview, we now see many alerts, because data collection is populating the correlated database, and our Assurance and Insight engines can consume the data and offer alerts and analysis.
The last integration to configure (optional, if you have AppD) is the third-party integration with AppDynamics. We have many customers who use AppDynamics to show them application issues, and now we can use it as a source of datasets in our correlated database for deeper insights.
To add third-party support, click the gear in the upper right-hand corner, then Integrations, then Manage.
The next screen is the Integrations screen. Click Add Integration.
Here we choose AppDynamics; enter your AppD account name, URL, username, and password, and associate the integration with MSITE or whatever you named your Site.
Now we can see that the AppDynamics connector is Active, so we are gathering data from AppD and putting it into our correlated database. When we see an application issue, we can also see whether there was a network issue at the same time.
The correlated data between AppD and NEXUS Insights becomes the magic button for network operators when the business calls to say the app is slow and blames the network. We can now look in real time, or go back days using Epochs, and see whether there was indeed a network slowdown when AppD flagged the application; if not, the application team has to look at compute and storage. As network engineers, we have never before had the ability to say definitively that it is NOT the network.
AppDynamics is an APM or Application Performance Monitor. The AppD APM uses agents to gather information on a business application, send the sensor data to AppD controllers, and then map the transactions to provide a real-time business view of how your applications perform.
Once the data is collected and correlated on the AppD controllers, AppD performs auto-discovery and application mapping. From the mappings, we get Digital and Business Transaction Monitoring. We can also perform anomaly detection and root cause diagnostics and build custom dashboards, allowing us to see the performance of an application in real time and create dashboards showing revenue lost from outages and a full-stack analysis.
AppDynamics agents are plug-ins or extensions that monitor the performance of your application code, runtime, and behavior, including the time the application takes to process requests and the time network traffic takes to travel between tiers, servers, or services. The agents' performance data is correlated and presented through the Controller UI, which lets you view insights from multiple applications in one place.
We can create Business Metrics Dashboards to see revenue at risk and business transaction health in real-time for the business analysts.
For the Application owners and server teams, we can see storage issues, transaction times, how many calls per sec a server is making and responding to, and if any of the nodes are breaking baseline parameters.
We can also look at how the network is performing, including errors and RTT.
By integrating AppD with NEXUS Dashboard and NEXUS Insights, we can pull telemetry data from the AppD agents, compare it to the flow telemetry from the switches, and see whether the slowness comes from the network or from the application servers or containers.
For more information on AppDynamics and the APM tools, please look at the AIOps section on the WWT platform.
Now that the NEXUS Dashboard has a site onboarded and NEXUS Insights is configured to collect events, anomalies, advisories, telemetry, and third-party data, we can look at use cases.
NEXUS Dashboard Federation Discussion
First, let's go back to One View, as the new Federation feature is worth revisiting. In large ACI multi-site fabrics, or fabrics based on both ACI and DCNM, we can have multiple NEXUS Dashboards gathering data from multiple sites, and with the new Federation we can see all of those sites from the One View dashboard. Federation with One View is the recommended way of configuring NEXUS Dashboard for customers doing a large deployment of NEXUS Dashboards. Because the small vND POC has only one Site, we need to imagine how this single pane of glass lets you see all of your sites quickly and then drill in.
For example, a NEXUS Dashboard with multiple sites and Federation would look similar to this.
If we click on the Home button, this brings us to One View. We see our one Site in red, indicating issues. Under the map, we can see Active Anomalies, Advisories, Audit Logs, and Faults. Click on the Admin Console in the navigation menu on the left.
From the Admin Console Overview, we see a healthy cluster, Intersight status connected, Sites connected, and Services all green. Also notice that we have 3 master and 3 worker nodes in our cluster. Click on Sites in the left-hand navigation menu.
We see that the Site has warnings, along with the Site's name, type, connectivity status, firmware version, and services used. Click on Services in the left-hand menu.
Under Services, we see NEXUS Dashboard Insights; click Open.
NEXUS Insights Use Case - Assurance Analysis
From the NEXUS Insights Overview, we see critical alerts in red, major in orange, and minor in yellow. We also see the Alert Detection Timeline, divided into Epochs. The 15-minute interval was set when we set up the Assurance Analysis. We see multiple minor (yellow) alerts along the timeline.
We can click on one of the alerts and drill in. We can see the errors, and then we can click on Analyze for a deeper view.
We can see the details of the alert, the detection time, and the cleared time.
NEXUS Insights Use Case - Explorer
Go back to the Insights Overview and click on Explore.
If we click Explore and then click on See what objects are associated, Who can talk to whom, or View interface, it brings up the choices What, Can, and Interfaces. Explorer allows us to use natural language searches to see whether the APIC policy allows communications to occur.
One excellent use of this tool is quickly seeing whether EPGs can or cannot talk. A typical call to the networking team is, "I've placed a VM in the DB EPG, and I cannot ping anything in the APP EPG." A NEXUS Dashboard operator simply needs to run a natural language query in Explore to see that these two EPGs can communicate. Then, by clicking the green line between EPG DB and EPG APP, we can see how these entities talk. Further, we can see details such as the source and destination prefix, the VRF and contract, and the filters applied.
Then, by going to the Endpoint tab, we can see what endpoints are learned in the DB EPG and ask the caller what IP address they are using. This deep visibility makes troubleshooting fast, and the operator and VM owner can resolve issues quickly.
During the POC, make sure you explore how you can use the natural language search with What, Can, and View Interfaces to see which EPGs, VRFs, Tenants, and endpoints can talk. The ability to perform natural language searches shows how easy it is to answer the Tier 1 call of "Can A talk to B?"
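The "Can A talk to B?" question boils down to a lookup against the contracts in the policy. The Python sketch below illustrates that idea only; the Contract class, contract names, and filter strings are hypothetical and are not the Explorer query engine or the APIC object model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    name: str
    consumer: str   # consumer EPG
    provider: str   # provider EPG
    filters: tuple  # e.g. ("tcp/3306",)


def can_talk(contracts, epg_a, epg_b):
    """Return the contracts that allow traffic between two EPGs, either way."""
    pair = {epg_a, epg_b}
    return [c for c in contracts if {c.consumer, c.provider} == pair]


# Hypothetical policy for a three-tier application.
contracts = [
    Contract("web-to-app", "WEB", "APP", ("tcp/8080",)),
    Contract("app-to-db", "APP", "DB", ("tcp/3306",)),
]

print([c.name for c in can_talk(contracts, "DB", "APP")])
print(can_talk(contracts, "WEB", "DB"))
```

An empty result means the policy does not permit the conversation; a non-empty one points to the contract and filters to inspect next.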
NEXUS Insights Use Case - Assurance Analysis
Looking at the Nodes tab on the left-hand menu, we see the nodes in our fabric and any issues. We can also choose the time and date range: the last 15 minutes, last day, last week, a date range, or a time window. The time window is helpful if you know when an issue occurred (over the weekend or after hours) and you want to use it like a DVR for networking issues. Here we choose the last two hours and apply.
Click on a spine or leaf with issues; in the example, we can see 31 critical issues. Click the cross-launch button on the upper right-hand side.
The Overview looks healthy, with Resources, Environmental, and Statistics looking normal. Next, click the Alerts tab.
We see almost all the alerts are interfaces administratively up but operationally down. However, we do see an error for endpoint learning; let's drill into that.
Click on the Endpoint Learning error (or any other error you may have), and a window pops up on the right side. Click Analyze. Analyze works for all errors, not just endpoint learning, so make sure you investigate this capability during the POC.
After we analyze, we can glean a tremendous amount of information. We see the IP address affected, a diagnostic report, and plain-English corrective recommendations. We can also see mutual occurrences during the same Epoch (when the Assurance Analysis ran). The following is just an example of what you may find in your ACI fabric, but it is very important to explore all issues and perform the corrective recommendations to resolve them.
NEXUS Insights Use Case - Analyzing Anomalies and Advisory Alerts
Next, let us look at Analyze Alerts Anomalies. Again, the Assurance Analysis engine looks at the fabric policy at every Epoch (the interval determined during Assurance setup) and compares it with known best practices. We have introduced defects and errors in the policy to generate alerts; however, this is another critical place to look during the vND POC to verify that your fabric does not have significant errors such as the ones we introduced.
If we look under advisories, we see the Field Notice on SSD disks and the PSIRT. The advisories are another critical place you should be looking at in your fabric during the POC.
NEXUS Insights Use Case - Compliance
NEXUS Insights lets us set up a compliance policy, and if anyone modifies the APIC policy in a way that breaks Compliance, a notification displays.
To set up Compliance, we start by going to the Compliance menu, where we can see there is nothing set up. Click on the 3 dots next to the site name (MSITE in our case) and click Configure Site Group.
Clicking this brings up our Configure Site Group window; click on the Compliance Requirement tab.
Click on Actions, then Create New Requirement
Before we create a compliance policy, let's look at the simple application profile we use to demonstrate Compliance. We have a straightforward 3-tier app: Web, App, and DB. We want to make sure that Web cannot talk to DB, so there is no contract between the WEB and DB EPGs.
We add the criteria for one side of the communication: the Tenant (tn=ACME_DEV), the Application Profile (ap=DEV_3_Tier_APP), and the EPG (epg=WEB). Click Add.
Repeat the same exercise for the DB side.
Next, we need to add a Traffic Selector rule; in our case, we want IP ALL as the selector for traffic. Click Add.
Here is the review of the Criteria setup for Compliance. Click Save
We see our new Compliance Requirement.
Since Compliance Analysis relies on the Assurance Analysis scan, the analysis won't complete until the next scan, which depends on the interval you set for the POC (15 minutes, 60 minutes). To speed things up for testing, we can go into Assurance Analysis and click Run Now. Also, keep in mind that in production you need to determine your scan frequency, cluster sizing, and the number of sites and devices.
After the Assurance Analysis runs, we can see that Compliance is Satisfied.
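The requirement configured above can be thought of as a rule evaluated against each policy snapshot. Here is a minimal Python sketch of that idea. The tenant, application profile, and WEB/DB EPG names come from the example; the classes, the APP selector, and the way contracts are represented are assumptions made for illustration and do not reflect the product's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpgSelector:
    tenant: str
    app_profile: str
    epg: str


@dataclass(frozen=True)
class MustNotTalk:
    """Compliance requirement: no contract may link side_a and side_b."""
    side_a: EpgSelector
    side_b: EpgSelector

    def evaluate(self, contract_pairs):
        """contract_pairs: set of frozenset({selector, selector}) in policy."""
        violated = frozenset({self.side_a, self.side_b}) in contract_pairs
        return "Violated" if violated else "Satisfied"


web = EpgSelector("ACME_DEV", "DEV_3_Tier_APP", "WEB")
app = EpgSelector("ACME_DEV", "DEV_3_Tier_APP", "APP")  # APP name is assumed
db = EpgSelector("ACME_DEV", "DEV_3_Tier_APP", "DB")

requirement = MustNotTalk(web, db)

# Policy snapshot as seen by one analysis run: WEB-APP and APP-DB only.
policy = {frozenset({web, app}), frozenset({app, db})}
print(requirement.evaluate(policy))

# A hypothetical later snapshot in which a WEB-DB contract has been added.
policy_after_change = policy | {frozenset({web, db})}
print(requirement.evaluate(policy_after_change))
```

Because the rule is evaluated per snapshot, a violation only becomes visible once an analysis run has captured the changed policy.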
A developer tells a friend, the ACI operator, that he needs full access from the WEB EPG, where he has some tools he wants to use to troubleshoot his DB, which is in the DB EPG. The ACI operator doesn't realize that there are many databases in that DB EPG and that a PCI compliance rule is in effect. The ACI operator creates a contract between the WEB and DB EPGs for his friend.
We won't see the non-compliance until Assurance Analysis runs again, so go into Assurance Analysis and click Run Now.
We can now see we are out of Compliance. Using the Compliance tool would help with audits and keeping your network secure if you were doing micro-segmentation. If we click on the Anomaly, we can analyze the Anomaly on the right-hand side.
In the Analysis, we can see violations, the lifespan, and the recommended course of action.
If we scroll down to the bottom where the timelines are, we can see the power of the correlated database. We see the Anomaly happen at 10:55, and we also see an entry in the Audit logs at approximately 10:52. Let's drill into the Audit log.
If we look at the timeline, we can see that the user admin created a contract between WEB and DB, violating Compliance.
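Lining up an anomaly with the audit entries that precede it is a simple time-window query once both sit on the same timeline. The following Python sketch shows the idea; the 10-minute window, the date, and the first audit entry are hypothetical, while the 10:52 and 10:55 times mirror the example above.

```python
from datetime import datetime, timedelta


def changes_before(anomaly_time, audit_log, window_minutes=10):
    """Return audit entries made within the window leading up to an anomaly."""
    window_start = anomaly_time - timedelta(minutes=window_minutes)
    return [
        entry for entry in audit_log
        if window_start <= entry["time"] <= anomaly_time
    ]


# Hypothetical audit log; the date and the first entry are made up.
day = datetime(2021, 6, 1)
audit_log = [
    {"time": day.replace(hour=9, minute=15), "user": "admin",
     "action": "modified interface description"},
    {"time": day.replace(hour=10, minute=52), "user": "admin",
     "action": "created contract between WEB and DB"},
]

anomaly_time = day.replace(hour=10, minute=55)
suspects = changes_before(anomaly_time, audit_log)
print([s["action"] for s in suspects])
```

Only the change made a few minutes before the anomaly is returned, which is the shortlist an operator wants when asking what changed.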
NEXUS Insights Use Case - Using Delta Analysis to Troubleshoot an Issue Rapidly
Delta Analysis is a great tool to use when you suddenly have major or minor issues with fabric connectivity, such as slowness, unreachable devices, or other outages. Go into the Troubleshoot menu, then the Delta Analysis sub-menu. Click on New Analysis.
Click on a date range when you know the fabric was stable.
Click the Latest Snapshot and click Apply.
Give the Analysis a name (In this case, I deleted objects in my AWS tenant to show issues). Click Create
Once Delta Analysis is complete, click on the Analysis, and a window on the right side opens. Click the cross launch icon to bring up the delta Analysis Venn diagram.
We can see many errors in our fabric Venn diagram. Click on the Total and the Later circle. We can see that we have bridge domain and VRF association errors and contract scope errors.
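The Venn diagram is, in effect, a set comparison between the anomalies in two snapshots. A minimal Python sketch of that comparison follows; the anomaly names are hypothetical labels loosely based on the errors mentioned above, not identifiers used by the product.

```python
def delta_analysis(earlier, later):
    """Split anomalies into earlier-only, common, and later-only sets."""
    return {
        "earlier_only": earlier - later,   # resolved since the stable snapshot
        "common": earlier & later,         # present in both snapshots
        "later_only": later - earlier,     # new since the stable snapshot
    }


# Hypothetical anomaly names for a stable snapshot and the latest one.
stable = {"interface-admin-up-oper-down"}
latest = {
    "interface-admin-up-oper-down",
    "bd-vrf-association-error",
    "contract-scope-error",
}

delta = delta_analysis(stable, latest)
print(sorted(delta["later_only"]))
```

The later-only set is where to start, since it contains only what appeared after the fabric was last known to be stable.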
Click on one of the errors, and a window pops up on the right. Click Analyze.
Here we can see the error in plain English: "BD Subnet missing where this BD's EPG is deployed." It gives a description of the error as well as recommendations. Now here is the power of the correlated database. We can see the yellow Anomaly with the blue circle, which we used for the Delta Analysis. If we look a little earlier in the audit logs timeline, we can click on that and see if any changes occurred.
As we look through the audit logs, we see that the user admin modified subnet 10.254.12.49/28 and changed its scope from public shared to private.
We can also switch to Policy Delta to see the JSON that was changed and by whom.
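A policy delta of this kind amounts to a recursive comparison of two JSON documents. The Python sketch below shows one way to compute such a diff. The JSON objects are hypothetical and heavily simplified (real APIC JSON is more deeply nested), and the subnet uses a documentation address rather than one from the fabric.

```python
import json


def policy_delta(before, after, path=""):
    """Yield (path, old, new) for every leaf value that differs."""
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        here = f"{path}/{key}"
        if isinstance(old, dict) and isinstance(new, dict):
            yield from policy_delta(old, new, here)
        elif old != new:
            yield here, old, new


# Hypothetical, simplified objects; real APIC JSON is more deeply nested.
before = json.loads('{"subnet": {"ip": "192.0.2.1/28", "scope": "public,shared"}}')
after = json.loads('{"subnet": {"ip": "192.0.2.1/28", "scope": "private"}}')

print(list(policy_delta(before, after)))
```

Pairing each changed path with the audit log entry for the same time answers both what changed and who changed it.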
The ability to troubleshoot a network down issue in a couple of minutes shows the power of the correlated database and the day 2 operations tools that use it. The reduction in MTTR alone is well worth the cost of the product, and WWT highly recommends its use to become proactive in operating your fabric.
NEXUS Insights Use Case- Log Collecting
Log Collector is used to collect logs from devices for TAC cases. It is very easy to use; click and start a New Log Collection.
Give it a name and choose a site (if you are using Federation, you can pull from any site). Click Next.
Choose the nodes to collect data from and click Start Collection.
NEXUS Insights Use Case - Resources and Environmental
Next, we can look at Browse on the Left side Navigation Menu. Click on Resources. We can see the Site Capacity by Utilization and top Nodes by Utilization. Site Capacity is an excellent place to look at where your fabric is from a utilization standpoint.
Next, we can look at environmental data such as disk usage, CPU, memory, and temperature.
Next, click Browse next to the Dashboard. Look at Operational Resources, Configuration Resources, and Hardware Resources to see how healthy the entire fabric is. In the topology view, we looked at a single device; here, we look at the fabric as a whole. First, let's look at Operational Resources. We can see the MACs learned, IPv4 and IPv6 routes, and multicast routes per leaf. Click on a leaf to see the Operational Resources for an individual switch. You can click on the square on the right side to cross-launch to the individual leaf for more details.
Click on Configuration Resources, and we can see the VRF, EPG, and BD counts
Look at the hardware resources and see port usage, bandwidth trends, TCAM usage, and warnings.
Next, we can look at environments such as disk usage, CPU, memory, and temperature.
If you click on Browse, you can see how the CPU, Memory, temperature are trending.
NEXUS Insights Use Case-Flows
Using the streaming flow telemetry, we can see up-to-the-second flow anomalies for the application, VRF, and subnets we configured under Flow record settings. If we look now, everything is green under dropped flows.
If we click on one of the flows, we can gather very detailed information on the Flow. We can see the time of the Flow, source and destination, Tenant, VRF, and EPG, and the path summary from a hardware perspective. We use this information later to show how we can troubleshoot using the correlation between AppDynamics events and actual flow telemetry from the switches themselves.
NEXUS Insights Use Case-Endpoints
We can use the Endpoints tab on the left sidebar to look at endpoint anomalies. Since this is a relatively new build for the vND POC, there are not many interesting anomalies yet.
However, we can look at past errors and see how useful this can be to look in on any issues. In this particular fabric, we can see an endpoint that came up Yellow with a Major Alert. Drilling into the endpoint, we see the following information. We can see the leaf, interface, and EPG the IP was learned on and see the Major alert. By clicking into the alert, we can drill into the issue
In this case, we can see that the endpoint has moved across ports on scaleleaf-203 and scaleleaf-204. These alerts help troubleshoot VMware DRS issues or flapping links that cause the MAC address to be learned on multiple switches when the links are port-channeled. Also, when there are too many endpoint moves, the fabric treats the endpoint as stale and freezes it, and the endpoint is not learned again if it moves. If the VM is flapping due to DRS, this causes a black hole or intermittent connectivity: the VM loses connectivity when it moves and regains it when it moves back to where it was frozen. The endpoint tool is excellent for troubleshooting complex problems and for proactively seeing issues before they become a help desk call for an outage.
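The move pattern described above can be illustrated with a minimal sketch. It assumes a hypothetical list of endpoint-learn events; the record layout, MAC addresses, interfaces, and the move threshold are invented for illustration and are not a NEXUS Insights export format or an ACI default.

```python
from collections import defaultdict

# Hypothetical endpoint-learn events: (timestamp_seconds, mac, leaf, interface).
# The layout is an assumption for illustration, not a Nexus Insights format.
EVENTS = [
    (0, "00:50:56:aa:bb:01", "scaleleaf-203", "eth1/10"),
    (30, "00:50:56:aa:bb:01", "scaleleaf-204", "eth1/10"),
    (60, "00:50:56:aa:bb:01", "scaleleaf-203", "eth1/10"),
    (90, "00:50:56:aa:bb:01", "scaleleaf-204", "eth1/10"),
    (10, "00:50:56:aa:bb:02", "scaleleaf-203", "eth1/11"),
    (70, "00:50:56:aa:bb:02", "scaleleaf-203", "eth1/11"),
]


def count_moves(events):
    """Return {mac: number of times the learn location changed}."""
    last_location = {}
    moves = defaultdict(int)
    for _, mac, leaf, interface in sorted(events):  # process in time order
        location = (leaf, interface)
        if mac in last_location and last_location[mac] != location:
            moves[mac] += 1
        last_location[mac] = location
        moves.setdefault(mac, 0)  # make sure stable endpoints are reported too
    return dict(moves)


def flapping_endpoints(events, max_moves=2):
    """Endpoints that moved more than max_moves times (illustrative threshold)."""
    return sorted(mac for mac, n in count_moves(events).items() if n > max_moves)


print(flapping_endpoints(EVENTS))
```

In this sample data, the first endpoint bounces between the two leaf switches and is flagged, while the second stays on one port and is not.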
NEXUS Insights Use Case- Applications
This section is for third-party applications that are onboarded to NEXUS Insights. Today the AppDynamics integration is the only third-party app, but as NEXUS Insights matures, more third-party apps will be integrated.
As we can see, the AppDynamics integration is reporting a critical Anomaly Score, and we can drill into it to see what the issues are. We saw earlier how AppDynamics can send application-based anomalies to NEXUS Insights via the messaging bus and APIs, correlating applications, networking issues, and changes. We can see how AppDynamics reports critical application errors to NEXUS Insights and which applications are affected. Notice that AppDynamics is sending telemetry for TeaStore_VM, and we can filter by application. Clicking on the Critical Anomaly launches a window on the right-hand side, and from there, we can cross launch to get a detailed look at the anomalies from AppDynamics.
We can see that the Web Tier is having issues. TCP Loss is going up. Performance Impacting Events (PIE) is going down. The business and application owners would look at this as a network issue because they see TCP loss.
Scrolling further, we can see the critical WebUI Tier anomaly for 3+ hours, and minor TCP loss errors for the WebUI Tier
We can now look at the actual flow telemetry data from the switches where the application tiers reside. We know that the IP address for the WebUI is 172.27.135.2, and AppDynamics is reporting critical-severity anomalies between 7:35 PM and 10:35 PM and TCP loss at 8:46 PM. We can now go into our flow telemetry data and look at these timelines to see whether the fabric is dropping packets or whether the issue is with the VM, where high CPU is causing it to drop packets.
We can also drill into the actual flow record and see that we had no drops or networking issues during that period.
This example shows how applications like AppDynamics can feed APM data into NEXUS Insights, correlate it with the business view of application health, and help determine whether the issue is the network or something else, such as heap memory or CPU, that is causing the VM to drop packets.
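The correlation step in this walkthrough can be sketched in a few lines: take the anomaly window reported by the application tool and check the flow records for drops in that window. This is a minimal sketch under stated assumptions. The flow record fields, the peer addresses, and the calendar date are hypothetical; only the WebUI address and the 7:35 PM to 10:35 PM window come from the text.

```python
from datetime import datetime

WEBUI_IP = "172.27.135.2"  # WebUI address from the text
# The date is a placeholder; only the time-of-day window comes from the text.
WINDOW_START = datetime(2021, 1, 1, 19, 35)
WINDOW_END = datetime(2021, 1, 1, 22, 35)

# Hypothetical flow records; field names and peer IPs are assumptions,
# not the Nexus Insights flow schema.
FLOWS = [
    {"time": datetime(2021, 1, 1, 19, 50), "src": "172.27.135.2",
     "dst": "172.27.135.18", "drops": 0},
    {"time": datetime(2021, 1, 1, 20, 46), "src": "172.27.135.34",
     "dst": "172.27.135.2", "drops": 0},
    {"time": datetime(2021, 1, 1, 23, 5), "src": "172.27.135.2",
     "dst": "172.27.135.18", "drops": 4},
]


def drops_in_window(flows, ip, start, end):
    """Sum drops for flows to or from ip with a timestamp in [start, end]."""
    return sum(
        f["drops"]
        for f in flows
        if ip in (f["src"], f["dst"]) and start <= f["time"] <= end
    )


def likely_cause(flows, ip, start, end):
    """Rough triage: fabric drops point at the network, otherwise at the host."""
    if drops_in_window(flows, ip, start, end) > 0:
        return "network"
    return "host or application"


print(likely_cause(FLOWS, WEBUI_IP, WINDOW_START, WINDOW_END))
```

With no drops recorded inside the window, the sketch points away from the fabric, which matches the conclusion reached in the walkthrough.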
NEXUS Insights Use Case- Interfaces
Using the Interfaces sub-menu, we can quickly view our spine and leaf interface utilization over time. This is helpful for capacity planning because, over time, we can see whether specific ports are being overutilized and are trending upward, which would call for an architectural review.
We can also drill into a single interface and see how that interface is trending. We can see over the last few weeks that the traffic on Leaf301 eth 1/5 is trending upward in all areas, so this is an excellent place to look at long-term trending of the fabric.
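To make the idea of an upward trend concrete, here is a minimal sketch that fits a straight line to evenly spaced utilization samples and estimates when a limit would be reached. The sample values and the 80 percent limit are hypothetical, and a linear fit is only one simple way to look at a trend; it is not how NEXUS Insights computes its charts.

```python
def trend_slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample interval)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def intervals_until(samples, limit):
    """Intervals until the trend reaches limit, or None if not trending upward.

    Extrapolates from the last sample using the fitted slope (a rough estimate).
    """
    slope = trend_slope(samples)
    if slope <= 0:
        return None
    return max(0.0, (limit - samples[-1]) / slope)


# Hypothetical weekly average utilization (percent) for one interface.
WEEKLY_UTILIZATION = [40.0, 45.0, 50.0, 55.0]

print(trend_slope(WEEKLY_UTILIZATION))
print(intervals_until(WEEKLY_UTILIZATION, limit=80.0))
```

A positive slope combined with a short time to the limit is the kind of signal that would justify the architectural review mentioned above.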
NEXUS Insights Use Case- Protocols
From the Protocols Sub menu, we can see what protocols are running on the fabric and errors. We can see here that IGMP Snooping is having many errors and is trending upward. We can then cross launch to the switch to see what issues there are
We can see that both VLAN 114 and VLAN 115 are seeing a lot of IGMP errors and should be investigated.
NEXUS Insights Use Case- Events
Events should be the first place to look when a network outage or connectivity issue occurs. 90% of all networking issues are caused by misconfiguration, and using Events, we can see whether a change caused a fault or a Severe or Major Event. Because NEXUS Dashboard and NEXUS Insights can monitor multiple sites, we can look at issues that affect the local Site and connectivity between sites. If a network is architected properly for DR scenarios, the failure of a switch, link, or server NIC should not cause a loss of connectivity unless the backup path is also affected.
NEXUS Insights Use Case- Firmware upgrade Analysis
A Firmware Analysis verifies the upgrade process before you start it. If NEXUS Insights sees any potential issues, such as hardware problems, CIMC versions, incompatibilities, or errors that need to be corrected, they are all displayed before you go through the upgrade. The Analysis gives the operator time to fix these issues and complete a successful ACI upgrade.
This validation gives customers better insight into upgrading their ACI fabrics.
NEXUS Insights Use Case- Pre-Change Analysis
The Pre-Change Analysis feature (PCV) allows users to model ACI configuration changes against an Assurance Epoch and predict a change's impact. The user can model the changes directly in the GUI or import a JSON/XML file containing the relevant ACI configuration.
When the Analysis is complete, the Pre-Change Analysis job appears in the Pre-Change Analysis table, and Smart Events can be raised as part of the Analysis to alert the user to possible issues. For every Pre-Change Analysis listed in the Manage Pre-Change Analysis table, a delta analysis is generated between the Pre-Change Analysis and the base Epoch. Any Smart Event that affects the later Epoch is a potential issue with the proposed change.
Our junior ACI admin has a ServiceNow ticket from a developer who needs a VMware port group named Jenkins. It needs to use the subnet 172.27.136.224/28 with a gateway of 172.27.136.225. Our ACI admin exports the JSON for one of the existing EPGs and modifies it the way he would an NX-OS configuration, using find and replace to change 224 to Jenkins. He uses VMM integration to create an EPG, which then creates the port group. However, before he pushes it into production, he decides to try the new Pre-Change Analysis feature in NEXUS Insights, a product they are evaluating in a Proof of Concept.
The next step is to run a Pre-Change Analysis on the modified JSON that he plans to push later that night during the change control window.
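Before editing any JSON, the values in the ticket can be sanity-checked with Python's standard ipaddress module. The sketch below confirms that the requested gateway is a usable address inside the requested subnet and formats it in the gateway/prefix form that an ACI bridge domain subnet uses. The function names are hypothetical helpers, not part of any Cisco tool.

```python
import ipaddress


def gateway_is_valid(subnet, gateway):
    """True if gateway is a usable host address inside subnet."""
    network = ipaddress.ip_network(subnet)
    address = ipaddress.ip_address(gateway)
    return (
        address in network
        and address != network.network_address
        and address != network.broadcast_address
    )


def bd_subnet_value(subnet, gateway):
    """Return the gateway in address/prefix form, e.g. for a BD subnet entry."""
    if not gateway_is_valid(subnet, gateway):
        raise ValueError(f"{gateway} is not a usable address in {subnet}")
    prefix_length = ipaddress.ip_network(subnet).prefixlen
    return f"{gateway}/{prefix_length}"


# Values from the ticket in the text.
print(bd_subnet_value("172.27.136.224/28", "172.27.136.225"))
```

A check like this only validates the addressing; it says nothing about whether the objects the JSON references exist in the fabric.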
Give the Pre-Change Analysis a descriptive name, choose the Site to run the Analysis on, and choose the snapshot (Epoch). Choose the latest Epoch, then click Apply.
Choose the JSON file created with the changes you would like to validate before the change control window.
Next, we see the file uploaded, and we can then choose to Save and Analyze
Once the Analysis changes from Running to Completed, check the box next to the Analysis name to bring up a window on the right. The window shows 4 critical and 2 major issues with the proposed JSON file. Click on the square cross launch button in the right-hand window to see the issues.
This cross launch allows us to see the critical issues that pushing this JSON during change control would cause on the ACI fabric.
The following window is the Delta Analysis view that we saw earlier for comparing changes in the fabric. It cannot show what has changed, since we have not pushed the JSON. Instead, Pre-Change Analysis shows what would change if we did push it. We see the earlier snapshot versus the JSON file, along with the anomaly counts. Click on the Total for the later circle (7 in this case) or scroll down to see the anomalies.
Here we see the anomalies, which I have set to Aggregated for easier sorting and viewing. If we click on the first one, we see that the EPG has an invalid BD. When our ACI admin created the JSON, he either did the find and replace incorrectly or put in a BD that does not exist. Click on that error to see the issue and recommendations.
Here we see an EPG created using a BD that does not exist. Click on the Anomaly ID. A window pops up on the right side; click Analyze.
The Pre-Change Analysis tool clearly shows what is wrong and recommends ways to fix the JSON. Using these tools, we can verify within minutes that the JSON configuration would cause issues, which can save a tremendous amount of time and wasted effort during a change control window.
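The invalid-BD anomaly found above is the kind of error a careless find and replace produces, and a small offline check illustrates it. The sketch walks an ACI-style JSON tree and reports EPGs whose bridge domain reference is not defined in the same file. The class and attribute names (fvTenant, fvBD, fvAp, fvAEPg, fvRsBd, tnFvBDName) follow the ACI object model, while the tenant, application profile, and BD names are invented. Unlike Pre-Change Analysis, this check knows nothing about the fabric snapshot, so a BD that already exists in the fabric or in another tenant would be reported as missing.

```python
import json

# Hypothetical proposed change, trimmed to the objects the check needs.
# Object names are invented; "BD_Jenkins" mimics a find-and-replace slip.
PROPOSED_CONFIG = json.loads("""
{"fvTenant": {"attributes": {"name": "Dev"}, "children": [
  {"fvBD": {"attributes": {"name": "BD_224"}}},
  {"fvAp": {"attributes": {"name": "Apps"}, "children": [
    {"fvAEPg": {"attributes": {"name": "Jenkins"}, "children": [
      {"fvRsBd": {"attributes": {"tnFvBDName": "BD_Jenkins"}}}
    ]}}
  ]}}
]}}
""")


def walk(node):
    """Yield (class_name, body) for every object in an ACI-style JSON tree."""
    for class_name, body in node.items():
        yield class_name, body
        for child in body.get("children", []):
            yield from walk(child)


def epgs_with_missing_bd(config):
    """Return (epg_name, bd_name) pairs whose BD is not defined in config."""
    defined = {
        body["attributes"]["name"]
        for class_name, body in walk(config)
        if class_name == "fvBD"
    }
    missing = []
    for class_name, body in walk(config):
        if class_name != "fvAEPg":
            continue
        for child in body.get("children", []):
            rs_bd = child.get("fvRsBd")
            if rs_bd and rs_bd["attributes"]["tnFvBDName"] not in defined:
                missing.append(
                    (body["attributes"]["name"], rs_bd["attributes"]["tnFvBDName"])
                )
    return missing


print(epgs_with_missing_bd(PROPOSED_CONFIG))
```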
NEXUS Insights Use Case- Migrations
When migrating to an ACI environment, customers pre-provision all the required configurations relating to Tenant/VRF/BD/EPGs and program the correct policies required for Layer1/Layer2 connectivity for server ports. The following need to be configured in the Cisco APICs:
* Interface policies, interface policy-groups, interface profiles, leaf profiles, AAEPs, VLAN Pools, physical/virtual domains.
* Tenants / VRFs / Bridge-domains / Application Profiles / EPGs / Static path bindings.
The next step is to extend Layer 2 connectivity from the old infrastructure to the new ACI environment and begin moving workloads into the ACI fabric. During a cut-over, the default gateways for multiple VLANs are moved from the legacy infrastructure to ACI. As ACI takes over the default gateway, several ACI configurations come into play: DHCP relay, contracts, and inter-subnet routing. Proper configuration of contracts is necessary to connect EPGs in the fabric and entities outside the fabric. Also, removing the SVI and external gateway during the migration is critical, as you cannot have two gateways for the same subnet without connectivity issues.
In many cases, migrations become delayed/postponed/backed off due to human errors even after multiple reviews. The Cisco NEXUS Insights Assurance platform allows operators to pinpoint configuration issues within the ACI fabric and potential misconfiguration outside the fabric. NEXUS Insights reduces the time taken to perform migrations and the costs associated with multiple change windows, and you can be confident that the proposed changes do not overlap with existing configurations.
Again, we use Pre-Change Analysis, this time to verify a migration from a standard 3-tier SVI and VLAN-based data center network. A typical migration extends L2 out of the ACI fabric and uses the gateway in the old network. Then, during the migration change control window, the new bridge domain and subnet that duplicate the SVI are created. We want the BD created in the ACI fabric, but we do not want the new subnet advertised until the SVIs are shut down. These changes are scripted; however, we want to verify that the subnet is not turned up and advertised externally.
We create a second Pre-Change Analysis and add the new BD to the ACI fabric. Here we create the Analysis for Site MSITE-EQUINIX, choose the latest window, choose the JSON you use, and run the Analysis.
Here we see multiple errors of overlapping subnets. Pre-Change Analysis and EPOCH Analysis should be a go-to tool for the ACI operations personnel to make sure changes are correct and successful, as well as rapid troubleshooting to find errors, what changed, and who made the changes.
The Pre-Change Analysis allows us to safely simulate configuration changes to the fabric before a change control window, ensuring a successful migration.
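The overlapping-subnet errors seen in this migration example can be illustrated offline with Python's standard ipaddress module. This is a minimal sketch, and both subnet lists are hypothetical values in gateway/prefix form rather than data from the lab fabric. It shows the kind of comparison involved and is not a substitute for running the Analysis against a real Epoch.

```python
import ipaddress


def overlapping_subnets(existing, proposed):
    """Return (proposed, existing) pairs of subnets that overlap."""
    conflicts = []
    for new in proposed:
        # strict=False accepts gateway/prefix notation with host bits set.
        new_net = ipaddress.ip_network(new, strict=False)
        for old in existing:
            old_net = ipaddress.ip_network(old, strict=False)
            if new_net.overlaps(old_net):
                conflicts.append((new, old))
    return conflicts


# Hypothetical subnets already in the fabric and subnets in the proposed change.
EXISTING_SUBNETS = ["10.10.10.1/24", "10.10.20.1/24"]
PROPOSED_SUBNETS = ["10.10.10.129/25", "10.10.30.1/24"]

print(overlapping_subnets(EXISTING_SUBNETS, PROPOSED_SUBNETS))
```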
NEXUS Dashboard and Day 2 Operations wrap up
As NEXUS Dashboard continues to evolve, the vision is to gain telemetry from other Cisco products like SD-WAN, SD-Access, and UCS via Intersight, as well as from third-party vendors. By utilizing the standards-based Kafka bus, the NEXUS Dashboard can ingest telemetry data and correlate it along a timeline for easier application performance troubleshooting. In the next version of NEXUS Dashboard, a cloud-based platform complements the appliance and OVA-based NEXUS Dashboard for cloud telemetry ingestion. The day 2 operations tools are now integrated under one suite of tools (NAE/NIR/NIA); this has already been done with NIR and NIA becoming NEXUS Insights. The goal of bringing all of this correlated data together in one unified platform is to drive business decisions, as we can now see through AppDynamics how network or compute performance affects applications. The second goal is offering the business a much better MTTR using the correlated data and a simplified view of the entire fabric as a single switch.
In the future, look for more integrations with tools such as ThousandEyes for branch and campus telemetry, and exporting to Splunk for security and auditing tools. Also, integration with ServiceNow is available so that when a Critical or Major alert or event happens, NEXUS Dashboard automatically opens a ServiceNow ticket.
Intent Based Networking Overview
Where does NEXUS Dashboard fit into future Intent-Based Networking (IBN) systems? First, we must get a basic understanding of IBN using the illustration below. We have Business Intent, which is a mix of Application SLAs (uptime and response time), Policy and Compliance (A must talk to B but not C), and IT operations (for hardware and software maintenance).
This Business Intent is then imported, usually with automation, into a Translation engine. The Translation engine looks at Business Intent, converts it to Policy, and checks the Policy's integrity against what is already defined.
Once the Policies are defined, they are sent to an activation/automation engine that sends the Policy via APIs to the Physical and Virtual Infrastructure.
Finally, we have Assurance Engines such as NEXUS Dashboard with NEXUS Insights to assure that the Policy from the Translation engine is pushed correctly via the Activation/Automation APIs. We get continuous verification, visibility, and corrective actions displayed in NEXUS Dashboard and NEXUS Insights. However, to close the loop, the assurance engine must be able to feed back to the Translation Engine what corrective action to perform. Once the industry can do that, we can genuinely have autonomous, self-healing networks.
The first step toward closing the loop between Assurance and Translation comes with the latest version, NEXUS Insights 6.0, through a new workflow called One-Click Remediation.
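The translate, assure, and feed-back stages described above can be pictured with a toy model. The sketch below is purely conceptual: the function names and data shapes are hypothetical, policy is reduced to a set of allowed source/destination pairs, and nothing here reflects how NEXUS Dashboard or any real IBN system is implemented.

```python
def translate(intent):
    """Translation: turn business intent into policy (allowed src/dst pairs)."""
    return {(src, dst) for src, dst, allowed in intent if allowed}


def assure(policy, observed):
    """Assurance: compare intended policy with what the infrastructure reports."""
    return {"missing": policy - observed, "unexpected": observed - policy}


def remediate(observed, findings):
    """Closing the loop: apply the findings so deployed state matches intent."""
    return (observed | findings["missing"]) - findings["unexpected"]


# "A must talk to B but not C", the example intent from the text.
INTENT = [("A", "B", True), ("A", "C", False)]
# Hypothetical deployed state that has drifted from the intent.
OBSERVED = {("A", "C")}

policy = translate(INTENT)
findings = assure(policy, OBSERVED)
corrected = remediate(OBSERVED, findings)
print(findings)
print(assure(policy, corrected))
```

The second assurance pass comes back empty, which is the closed-loop behavior the text describes as the goal.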
NEXUS Insights Use Case- One-click Remediation
The One-Click Remediation feature reduces MTTR by enabling you to remediate an anomaly based on recommendations from NEXUS Insights.
Nexus Dashboard Insights uses the enhanced framework and workflow mappings on Cisco APIC to recommend the anomaly diagnostics and impact. The Estimated Impact and Recommendations area in the Analyze Anomaly page describes the anomaly diagnostics impact and recommendations. The One-Click Remediation feature allows you to execute the proposed recommendation on the fabric.
In the NI 6.0 release, One-Click Remediation is supported for the following anomalies:
As future releases become available, more One-Click Remediation actions will become available.
To test One-Click, create a dummy AEP in your APIC. In the following case, I called it One_Click, then ensured no physical or virtual domains were associated.
Once Assurance Analytics runs again, it creates an anomaly that we can click on in NEXUS Insights. Filter by Major for severity, click on the Anomaly, and a window pops up on the right side. Click Analyze.
We see the Anomaly in detail here: the description, affected object, Impact Analysis, and Recommendations (in this case, the recommended course of action is to delete the AEP).
Click on the Remediate Action, and we get a pop-up window that shows the proposed action to delete this AEP. Close the window and click Fix.
Go back to the APIC, and we can see that the AEP One_Click is gone.
While this example is very rudimentary, it shows the power of detecting an anomaly, providing recommended actions, and offering a "FIX" button to fix the Anomaly in the fabric.
The One-Click Remediation concludes the white paper on doing a 90-day free POC using the virtual NEXUS Dashboard to investigate your fabric for anomalies, advisories, and in-depth flow telemetry to increase uptime for the business and reduce MTTR if there is an issue. We feel One-Click Remediation is the first step in closing the feedback loop for autonomous networking, so stay tuned.
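The condition behind this test anomaly, an AEP with no domain associated, can be shown with a short offline check. This is a minimal sketch over a hypothetical APIC-style export. infraAttEntityP and infraRsDomP are ACI object-model class names, while the Prod_AEP profile and its domain are invented, and the check only detects the condition; it does not remediate anything.

```python
# Hypothetical export of attachable entity profiles (AEPs).
# "Prod_AEP" and its physical domain are invented for illustration;
# "One_Click" is the dummy AEP created in the text.
AEPS = [
    {"infraAttEntityP": {"attributes": {"name": "Prod_AEP"}, "children": [
        {"infraRsDomP": {"attributes": {"tDn": "uni/phys-Prod_PhysDom"}}}
    ]}},
    {"infraAttEntityP": {"attributes": {"name": "One_Click"}}},
]


def aeps_without_domain(aeps):
    """Return the names of AEPs that have no domain relation child."""
    names = []
    for item in aeps:
        aep = item["infraAttEntityP"]
        has_domain = any(
            "infraRsDomP" in child for child in aep.get("children", [])
        )
        if not has_domain:
            names.append(aep["attributes"]["name"])
    return names


print(aeps_without_domain(AEPS))
```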
As WWT continues to do more integrations as future code comes out, look back here for future labs, articles on NEXUS Dashboard and AIOps/AppDynamics content.