Tuesday, January 12, 2016

How to set up an IP SLA latency graph on Cisco switches and routers

Whenever there are performance problems in a flow between two points, with customers reporting slowness and communication failures, the first thing network engineers do is run pings or similar reachability checks to verify it.
The problem is when this only happens sometimes during the week. Then you need a graph that measures the latency over time.
Latency can be measured from several Linux machines, using self-made scripts or tools like SmokePing, but the issue normally comes down to a specific link that you want to be sure is not causing the problem. Also, the path may be in a specific VRF that your Linux machine cannot reach.

The best solution is always the IP SLA feature in routers and switches. It measures the latency every minute, and you then poll that IP SLA via SNMP to store the results.

Here is an example of an IP SLA graph generated by capturing the IP SLA data via SNMP and dumping it to an MRTG file.
It shows the latency of a 300 km link between two datacenters. The latency is stable at around 4 ms for most of the day but sometimes rises to 12 ms. The measurement is done using a ping between a router and something that replies to ICMP. It will not show jitter and will not measure UDP transport. For that you need IP SLA running on both sides.
Another graph tells us the service uptime. Here is the graph below. It is very useful for seeing exactly when the ping failed, and therefore when the service failed. It is also an SNMP capture dumped to an MRTG file, and you can associate it with an operational alarm.
You can set up a condition in your monitoring tool: if the measured value is 1 for 3 consecutive reads, wake somebody up.
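The alarm condition above can be sketched in a few lines of Python. This is only an illustration of the logic, not the configuration of any particular monitoring tool; the function name `should_alert` and its parameters are hypothetical.

```python
def should_alert(readings, fail_value=1, consecutive=3):
    """Return True when the last `consecutive` readings all equal fail_value.

    `readings` is a list of polled status values, oldest first
    (1 = fail, 2 = success, as in the status graph).
    """
    if len(readings) < consecutive:
        return False
    return all(value == fail_value for value in readings[-consecutive:])


# Hypothetical history of polled values: two good reads, then three failures.
history = [2, 2, 1, 1, 1]
if should_alert(history):
    print("probe failed 3 reads in a row: wake somebody up")
```

Requiring several consecutive failures avoids paging someone for a single lost ping.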

This is a good start for analysing latency, but MRTG will only poll every 5 minutes while measurements happen every minute. So, obviously, you will lose information, but in the end you will have important data to report in an RCA. From my experience with the IP SLA feature in Cisco routers and switches, it works great 90% of the time. Some versions will measure incorrectly, but you can live with it. After all, this graph is only a hint on where you should search for the problem.
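To make the information loss concrete, here is a small Python sketch. It assumes each poll simply reads the result of the most recent measurement, so only one value in five reaches the graph; the RTT numbers are made up for illustration.

```python
def poll_every(samples, step=5):
    """Return the samples a poller would see if it read every step-th value."""
    return samples[step - 1::step]


# Hypothetical per-minute RTT measurements in ms, with one short spike.
rtt_ms = [4, 4, 12, 4, 4, 4, 4, 4, 4, 4]
polled = poll_every(rtt_ms)

print("measured max:", max(rtt_ms))  # the spike exists on the router
print("graphed max:", max(polled))   # but the 5-minute poll never saw it
```

A spike that falls between two polls never shows up in the graph, which is why the graph should be treated as a hint rather than a complete record.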

So, here's how to do it:

For example, an IP SLA on a Cisco 6500:
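! Probe number 30; it is also the last element of the SNMP OIDs polled by MRTG
ip sla 30
 ! ICMP echo from the router's own address to a host that answers pings
 icmp-echo 192.168.10.10 source-ip 192.168.10.1
 vrf CUSTOMER01
 tag MYPROBE30CUSTOMER01
 ! Frequency is in seconds; use 60 for one measurement per minute
 frequency 10
ip sla schedule 30 life forever start-time now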

If you are using mrtg:
###  PROBE_A Probe 30 status
Target[10.1.1.1_PROBE_P30_STATUS]:1.3.6.1.4.1.9.9.42.1.2.9.1.6.30&1.3.6.1.4.1.9.9.42.1.2.9.1.6.30:public@10.1.1.1
MaxBytes[10.1.1.1_PROBE_P30_STATUS]: 500000
PNGTitle[10.1.1.1_PROBE_P30_STATUS]: PROBE_A Probe 30 status (1-fail 2-success)
YLegend[10.1.1.1_PROBE_P30_STATUS]: Status (1-fail 2-success)
Options[10.1.1.1_PROBE_P30_STATUS]: growright, nopercent, gauge
ShortLegend[10.1.1.1_PROBE_P30_STATUS]: _status
Title[10.1.1.1_PROBE_P30_STATUS]:  PROBE_A Probe 30 status
PageTop[10.1.1.1_PROBE_P30_STATUS]: PROBE_A Probe 30 status

###  PROBE_A Probe 30 rtt
Target[10.1.1.1_PROBE_A_P30_RTT]:1.3.6.1.4.1.9.9.42.1.2.10.1.1.30&1.3.6.1.4.1.9.9.42.1.2.10.1.1.30:public@10.1.1.1
MaxBytes[10.1.1.1_PROBE_A_P30_RTT]: 500000
PNGTitle[10.1.1.1_PROBE_A_P30_RTT]: PROBE_A Probe 30 rtt
YLegend[10.1.1.1_PROBE_A_P30_RTT]: RTT (ms)
Options[10.1.1.1_PROBE_A_P30_RTT]: growright, nopercent, gauge
ShortLegend[10.1.1.1_PROBE_A_P30_RTT]: _ms
Title[10.1.1.1_PROBE_A_P30_RTT]:  PROBE_A Probe 30 rtt
PageTop[10.1.1.1_PROBE_A_P30_RTT]: PROBE_A Probe 30 rtt
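
If you have many probes, typing these stanzas by hand gets tedious. The following Python sketch builds the Target line for a given probe number, using the same two OID prefixes as the MRTG config above with the probe number appended. The helper name `mrtg_target` and the target naming scheme are hypothetical, so adjust them to your own conventions.

```python
# OID prefixes taken from the MRTG config above; the probe number is appended.
STATUS_OID = "1.3.6.1.4.1.9.9.42.1.2.9.1.6"
RTT_OID = "1.3.6.1.4.1.9.9.42.1.2.10.1.1"


def mrtg_target(host, community, probe, base_oid, suffix):
    """Build an MRTG target name and its Target line for one IP SLA probe."""
    name = f"{host}_P{probe}_{suffix}"
    oid = f"{base_oid}.{probe}"
    # MRTG expects two OIDs per target; the same OID is used for both.
    return name, f"Target[{name}]:{oid}&{oid}:{community}@{host}"


for suffix, base in (("STATUS", STATUS_OID), ("RTT", RTT_OID)):
    name, line = mrtg_target("10.1.1.1", "public", 30, base, suffix)
    print(line)
```

The remaining keywords (MaxBytes, Options, Title and so on) can be emitted the same way using the returned target name.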
