Problem
The switch has a high CPU load. This L2 switch consistently runs at ninety-plus percent.
System or Traffic?
Checking the traffic at the switch uplink, I see less than ten percent utilization:
#show int e 5/19 | inc util
300 second input rate: 28243442 bits/sec, 5964 packets/sec, 2.88% utilization
300 second output rate: 94650470 bits/sec, 13938 packets/sec, 9.59% utilization
#show int e 5/20 | inc util
300 second input rate: 55866957 bits/sec, 10364 packets/sec, 5.68% utilization
300 second output rate: 56331075 bits/sec, 12719 packets/sec, 5.75% utilization
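For anyone who wants to run this check across many uplinks, here is a minimal Python sketch that pulls the rate lines out of that output. The names `RATE_RE`, `parse_rates`, and `SAMPLE` are made up for illustration, and the pattern assumes the output looks exactly like the lines above. The `implied_mbps` field is only a rough back-calculation from the rounded percentage, not something the switch reports.

```python
import re

# Matches one rate line from `show int ... | inc util` (format assumed from the output above).
RATE_RE = re.compile(
    r"(?P<secs>\d+) second (?P<direction>input|output) rate: "
    r"(?P<bps>\d+) bits/sec, (?P<pps>\d+) packets/sec, "
    r"(?P<util>\d+(?:\.\d+)?)% utilization"
)

def parse_rates(text):
    """Return one dict per rate line found in text."""
    rows = []
    for m in RATE_RE.finditer(text):
        bps = int(m.group("bps"))
        util = float(m.group("util"))
        rows.append({
            "direction": m.group("direction"),
            "bps": bps,
            "pps": int(m.group("pps")),
            "util_pct": util,
            # Link speed implied by rate and percentage (approximate, None if util is 0).
            "implied_mbps": bps / (util / 100) / 1e6 if util else None,
        })
    return rows

SAMPLE = """\
300 second input rate: 28243442 bits/sec, 5964 packets/sec, 2.88% utilization
300 second output rate: 94650470 bits/sec, 13938 packets/sec, 9.59% utilization
300 second input rate: 55866957 bits/sec, 10364 packets/sec, 5.68% utilization
300 second output rate: 56331075 bits/sec, 12719 packets/sec, 5.75% utilization
"""

if __name__ == "__main__":
    rows = parse_rates(SAMPLE)
    busiest = max(rows, key=lambda r: r["util_pct"])
    print(busiest["direction"], busiest["util_pct"])
```

On the sample, the busiest direction is the 9.59% output rate, and the implied link speed comes out near 980 Mb/s, which would be consistent with gigabit uplinks.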
And, on the switch itself (I manually removed all the lines showing only 1/255):
#show interfaces | inc /255
reliability 255/255, txload 10/255, rxload 3/255
reliability 255/255, txload 26/255, rxload 2/255
reliability 255/255, txload 2/255, rxload 1/255
reliability 255/255, txload 13/255, rxload 16/255
reliability 255/255, txload 9/255, rxload 46/255
reliability 255/255, txload 16/255, rxload 3/255
reliability 255/255, txload 11/255, rxload 31/255
reliability 255/255, txload 8/255, rxload 5/255
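The txload and rxload values are fractions of 255, so they are easier to read as percentages. A small Python sketch of that conversion follows. `LOAD_RE`, `load_percentages`, and `LOADS` are hypothetical names, and the pattern assumes the line format shown above.

```python
import re

# Matches the txload/rxload pair on each reliability line (format assumed from above).
LOAD_RE = re.compile(r"txload (\d+)/255, rxload (\d+)/255")

def load_percentages(text):
    """Return a list of (tx_pct, rx_pct) tuples, one per matching line."""
    return [
        (int(tx) / 255 * 100, int(rx) / 255 * 100)
        for tx, rx in LOAD_RE.findall(text)
    ]

LOADS = """\
reliability 255/255, txload 10/255, rxload 3/255
reliability 255/255, txload 26/255, rxload 2/255
reliability 255/255, txload 2/255, rxload 1/255
reliability 255/255, txload 13/255, rxload 16/255
reliability 255/255, txload 9/255, rxload 46/255
reliability 255/255, txload 16/255, rxload 3/255
reliability 255/255, txload 11/255, rxload 31/255
reliability 255/255, txload 8/255, rxload 5/255
"""

if __name__ == "__main__":
    pairs = load_percentages(LOADS)
    # Highest single load value in either direction across all listed ports.
    peak = max(max(tx, rx) for tx, rx in pairs)
    print(f"peak load: {peak:.1f}%")
```

The highest value in the list is rxload 46/255, or about 18 percent, so no port listed here is anywhere near saturated.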
So, traffic is not really that high. Let's check the system to see what is going on.
#show processes cpu sorted 5sec | exclude 0.00%
CPU utilization for five seconds: 85%/14%; one minute: 86%; five minutes: 87%
 PID Runtime(ms)    Invoked  uSecs   5Sec   1Min   5Min TTY Process
 118  3648793960 1297149358   2812 65.97% 67.99% 68.97%   0 LLDP Protocol
 134         471        393   1198  0.15%  0.16%  0.04%   1 Virtual Exec
  97     2604007    1553121   1676  0.15%  0.06%  0.05%   0 HRPC qos request
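In the "85%/14%" pair, the first number is total CPU and the second is the share spent at interrupt level, so the difference is what the scheduled processes are using. Here is a minimal Python sketch of that arithmetic. `CPU_RE` and `parse_cpu_header` are hypothetical names, and the pattern assumes the header line format shown above.

```python
import re

# Matches the summary line of `show processes cpu` (format assumed from the output above).
CPU_RE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)

def parse_cpu_header(line):
    """Split the header into total, interrupt, and process-level percentages."""
    m = CPU_RE.search(line)
    if m is None:
        raise ValueError("not a CPU utilization header line")
    total, interrupt, one_min, five_min = map(int, m.groups())
    return {
        "total": total,
        "interrupt": interrupt,
        # CPU used by scheduled processes rather than interrupt handling.
        "process": total - interrupt,
        "one_min": one_min,
        "five_min": five_min,
    }

HEADER = ("CPU utilization for five seconds: 85%/14%; "
          "one minute: 86%; five minutes: 87%")

if __name__ == "__main__":
    print(parse_cpu_header(HEADER))
```

For this switch the process-level share is 71 percent, and the LLDP Protocol process alone accounts for 65.97 of it.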
LLDP Protocol seems to be where the work is happening. Let's dig some more.
So, this looks odd:
#show platform port-asic stats miscellaneous
Port-asic Misc Statistics
=====================================
TxBufferFullDropCount 2227541109
I found some drops on "Queue 4".
#show platform port-asic stats drop | include Queue 4
Queue 4: 116263423
Queue 4: 116263506
"Queue 4" is the L2 Protocol queue (remember the first queue is number zero):
#show controllers cpu-interface
cpu-queue-frames  retrieved  dropped    invalid    hol-block  stray
----------------- ---------- ---------- ---------- ---------- ----------
L2 protocol       1253419075 0          0          5          0
It is time to enter some debug commands, which every network guy hates to do. Debug is usually the quickest way to make a problem worse. It is akin to an oncologist saying, "I don't know about this lump, why don't we just slice you open?" The surgery could be worse than the initial problem.
configure terminal
no logging console
logging buffered 128000
service timestamps debug datetime msec localtime
end
no debug all
debug platform cpu-queues ?
#debug platform cpu-queues ?
  broadcast-q         Debug packets received by Broadcast Q
  cbt-to-spt-q        Debug packets received by cbt-to-spt Q
  cpuhub-q            Debug packets received by CPU heartbeat Q
  host-q              Debug packets received by host Q
  icmp-q              Debug packets received by ICMP Q
  igmp-snooping-q     Debug packets received by IGMP snooping Q
  layer2-protocol-q   Debug packets received by layer2 protocol Q
  logging-q           Debug packets received by logging Q
  remote-console-q    Debug packets received by remote console Q
  routing-protocol-q  Debug packets received by routing protocol Q
  rpffail-q           Debug packets received by RPF fail Q
  software-fwd-q      Debug packets received by software forward Q
  stp-q               Debug packets received by STP Q
An Odd Fix
Figuring we might as well go to the latest code during a planned maintenance window, I decided to upgrade the IOS from 12.2(40) to 12.2(53). The problem immediately went away, and the switch has been running error-free at five percent utilization for more than three weeks.