Monday, January 11, 2010

Cisco High Load

It is annoying when a small network component acts funny. Here is a switch with a high load.


Consistently, this L2 switch reports a CPU load of more than ninety percent.

System or Traffic?

Checking the traffic at the switch uplink, I see less than ten percent utilization:

#show int e 5/19 | inc util
  300 second input rate: 28243442 bits/sec, 5964 packets/sec, 2.88% utilization
  300 second output rate: 94650470 bits/sec, 13938 packets/sec, 9.59% utilization
#show int e 5/20 | inc util
  300 second input rate: 55866957 bits/sec, 10364 packets/sec, 5.68% utilization  
  300 second output rate: 56331075 bits/sec, 12719 packets/sec, 5.75% utilization
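As a sanity check, the reported bits/sec and utilization percentages should agree with the link speed the switch is measuring against. A quick sketch (plain arithmetic, not a Cisco tool) confirms these numbers are consistent with gigabit uplinks:

```python
# Back out the link bandwidth implied by a Cisco interface counter:
# bits/sec divided by the utilization fraction gives the reference bandwidth.
def implied_bandwidth(bits_per_sec, utilization_percent):
    return bits_per_sec / (utilization_percent / 100.0)

# 94,650,470 bits/sec at 9.59% utilization implies roughly a 1 Gb/s link.
print(round(implied_bandwidth(94650470, 9.59) / 1e9, 2))  # ~0.99
```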

And on the switch itself (I have manually filtered out all the 1/255 lines):

#show interfaces | inc /255
     reliability 255/255, txload 10/255, rxload 3/255
     reliability 255/255, txload 26/255, rxload 2/255
     reliability 255/255, txload 2/255, rxload 1/255
     reliability 255/255, txload 13/255, rxload 16/255
     reliability 255/255, txload 9/255, rxload 46/255
     reliability 255/255, txload 16/255, rxload 3/255
     reliability 255/255, txload 11/255, rxload 31/255
     reliability 255/255, txload 8/255, rxload 5/255
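The txload and rxload columns are fractions of the interface bandwidth expressed out of 255, so none of these are high. A small sketch to convert them to percentages:

```python
# Cisco reports interface load as n/255; convert that to a percentage.
def load_percent(load, scale=255):
    return 100.0 * load / scale

# The busiest value above, txload 26/255, works out to roughly 10 percent.
print(round(load_percent(26), 1))  # ~10.2
```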

So, traffic is not really that high. Let's check the system to see what is going on.

#show processes cpu sorted 5sec | exclude 0.00%
CPU utilization for five seconds: 85%/14%; one minute: 86%; five minutes: 87%
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
 118  3648793960 1297149358       2812 65.97% 67.99% 68.97%   0 LLDP Protocol
 134         471       393       1198  0.15%  0.16%  0.04%   1 Virtual Exec
  97     2604007   1553121       1676  0.15%  0.06%  0.05%   0 HRPC qos request

LLDP Protocol seems to be where the work is happening. (In the summary line, 85%/14% means 85 percent total CPU utilization, of which 14 percent was spent at interrupt level.) Let's dig some more.

So, this looks odd:

#show platform port-asic stats miscellaneous

Port-asic Misc Statistics
    TxBufferFullDropCount               2227541109

That is over two billion buffer-full drops. There are also drops on "Queue 4":

#show platform port-asic stats drop | include Queue  4
    Queue  4: 116263423
    Queue  4: 116263506

"Queue 4" is the L2 Protocol queue (remember the first queue is number zero):

#show controllers cpu-interface
cpu-queue-frames  retrieved  dropped    invalid    hol-block  stray
----------------- ---------- ---------- ---------- ---------- ----------
L2 protocol       1253419075 0          0          5          0

It is time to enter some debug commands, which every network guy hates to do. Debugging is usually the quickest way to make a problem worse. It is akin to an oncologist saying, "I don't know about this lump, so why don't we just slice you open?" The surgery can be worse than the original problem.

configure terminal
no logging console
logging buffered 128000
service timestamps debug datetime msecs localtime
no debug all

#debug platform cpu-queues ?
  broadcast-q         Debug packets received by Broadcast Q
  cbt-to-spt-q        Debug packets received by cbt-to-spt Q
  cpuhub-q            Debug packets received by CPU heartbeat Q
  host-q              Debug packets received by host Q
  icmp-q              Debug packets received by ICMP Q
  igmp-snooping-q     Debug packets received by IGMP snooping Q
  layer2-protocol-q   Debug packets received by layer2 protocol Q
  logging-q           Debug packets received by logging Q
  remote-console-q    Debug packets received by remote console Q
  routing-protocol-q  Debug packets received by routing protocol Q
  rpffail-q           Debug packets received by RPF fail Q
  software-fwd-q      Debug packets received by software forward Q
  stp-q               Debug packets received by STP Q

An Odd Fix

Figuring we might as well move to the latest code during a planned maintenance window, I upgraded the IOS from 12.2(40) to 12.2(53). The problem went away immediately, and the switch has been running error-free at five percent utilization for more than three weeks.
