The FCIA recently took on the topic of Fibre Channel performance in our live webcast “Fibre Channel Performance: Congestion, Slow Drain, and Over-Utilization, Oh My!” If you missed the live event, it’s now available on-demand. During the live event, attendees asked our experts Earl Apellanes and Ed Mazurek a lot of great questions. Here are answers to them all:
Q: Slow drain doesn’t cause the ISL congestion. Switch could not transfer frames due to lack of credits. Where do you see the ISL congestion?
A: Slow drain and over-utilization do cause ISL congestion. The ISLs themselves are almost never the source of the congestion. Instead, the individual devices that are the destinations of the frames cause the backup, which then spreads across the ISLs.
Q: Suppose a Brocade Director switch shows port errors on one of its blades, and another switch is connected to it via ISL. How do we identify congestion on a particular port or slot?
A: See answer to the next question.
Q: Is there any command to check the daily health of the switch…switchstatushow? It shows healthy but that’s not always the case. Please shed light on this.
A: For Brocade switches:
- The “switchshow” command displays switch, blade, and port status information. This command is very high level but provides an entire switch view.
- The “porterrshow” will display an error summary of the port interfaces of the switch such as encoding out, class 3 discards, link failures and other physical errors causing a link to fail.
- The “portstatsshow” command displays the port hardware statistics counters. This is where you’ll see how much time a port spent at zero transmit credits, but a nonzero transmit-credit-zero counter does not necessarily mean there is a problem. As a rule of thumb, if the transmit-credit-zero counter is about 30% of the frames transmitted, start investigating whether an end device is withholding credits or receiving more frames than it can handle.
- A single command generally won’t provide a smoking gun for a particular problem, especially slow drain. Several commands can help identify a slow drain problem, and correlating the data between them works best. We advise that customers use Brocade Network Advisor (GUI) along with features such as Fabric Vision, MAPS, Fabric Performance Impact, and IO Insight, which can do the data correlation and point customers to exactly which F_Port (end device) is misbehaving. Brocade Network Advisor is available as a free download with full functionality free of charge for up to 4 months. Installing the GUI is non-disruptive to any production SAN and provides visibility to all the ports in the fabric.
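To make the 30% rule of thumb above concrete, here is a minimal Python sketch. It assumes you have sampled a port's counters twice and computed the change in the transmit-credit-zero counter and in the frames-transmitted counter over that window. The function names are hypothetical and are not part of any Brocade tool, and the threshold is only the rough guideline given above.

```python
def credit_zero_ratio(credit_zero_delta: int, frames_tx_delta: int) -> float:
    """Ratio of transmit-credit-zero ticks to frames transmitted in a window.

    Both arguments are counter deltas between two samples of the same port
    (an assumption of this sketch; how you collect them is up to you).
    """
    if frames_tx_delta <= 0:
        # No frames sent: any credit-zero time at all is suspicious.
        return float("inf") if credit_zero_delta > 0 else 0.0
    return credit_zero_delta / frames_tx_delta


def needs_investigation(credit_zero_delta: int,
                        frames_tx_delta: int,
                        threshold: float = 0.30) -> bool:
    """Apply the rough 30% rule of thumb as a starting point for a closer look."""
    return credit_zero_ratio(credit_zero_delta, frames_tx_delta) >= threshold


if __name__ == "__main__":
    # Hypothetical deltas for two ports over the same sampling window.
    print(needs_investigation(400, 1000))  # ratio 0.4, above the guideline
    print(needs_investigation(50, 1000))   # ratio 0.05, below the guideline
```

A result of True is only a prompt to look more closely at the end device on that port; as noted above, no single counter is conclusive on its own.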
A: For Cisco MDS switches:
- Show interface counters – This will list, per interface, TxWait as a total and as 1s/1m/1h/72h averages of how long transmit credits were not available.
- Show logging onboard txwait – This will show any interfaces that have 100ms or more of TxWait over a 20 second interval. This is very helpful. It includes a congestion percentage and is very good for low level latency problems. Each entry includes a date and time when it occurred.
- Show logging onboard error-stats – This will show 100ms Tx credit not available, timeout drops, credit-loss-recovery and is very good for more severe problems. Each entry includes a date and time when it occurred.
- Show process creditmon txwait-history – This will show histograms of TxWait for the last 60 seconds, 60 minutes and 72 hours. This is very good for real-time troubleshooting.
- Show tech-support slowdrain – Wrapper for many slow drain related commands including all of the above and more. This is best when gathered for the entire fabric of switches together. This can be done easily via DCNM.
- Also, enabling port-monitor for several of the slowdrain counters will give automatic SNMP alerts to DCNM or any other network management software that is being used.
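As a rough illustration of the congestion percentage mentioned for the onboard TxWait log, the sketch below assumes the percentage is simply the TxWait time divided by the length of the interval. That is an assumption made for illustration, and the function name is hypothetical rather than anything provided by the switch.

```python
def congestion_percent(txwait_ms: float, interval_s: float = 20.0) -> float:
    """Share of an interval during which no transmit credits were available.

    txwait_ms  -- time spent waiting for credits in the interval, in milliseconds
    interval_s -- interval length in seconds (20 s matches the logging interval
                  described above)
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return 100.0 * txwait_ms / (interval_s * 1000.0)


if __name__ == "__main__":
    # The logging threshold described above: 100 ms of TxWait in 20 seconds.
    print(congestion_percent(100))
    # A much more congested port: 5 seconds of TxWait in 20 seconds.
    print(congestion_percent(5000))
```

Under this assumption, the 100ms logging threshold corresponds to only 0.5% of the 20 second interval, which is consistent with the log being useful for low-level latency problems.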
Q: Do you see discards during over utilization?
A: Yes, but it depends on how severe the congestion is. Remember that the goal of FC flow control is to equalize the flow and to prevent drops. Under light congestion there should be no frame drops, but under more severe congestion drops are expected, specifically timeout drops.
Q: Are there any general rules of thumb about having multiple device port speeds in a single fabric such as having a mix of 2, 4, 8 and 16Gbps device ports?
A: It was mentioned that there should be no more than two different speeds in a fabric, but I don’t know of any specific rules. Fewer speed differences is definitely better. ISLs should always run at a speed >= that of the edge devices.
Q: What about ISL compression/encryption: any best practices there?
A: Compression/encryption don’t affect the line rate so they have no impact on congestion.
Q: I had a case where the interfaces where the tape drives are connected received “TX Credit Not Available” and “Credit Loss Reco”. The servers zoned to the tape drives running at 8Gbps are not affected, but the ones running at 4Gbps are affected. The TX credit events only occurred on the interfaces when backups started. Why are 4Gbps affected and not 8Gbps?
A: Hard to know the specifics. Credit-loss-recovery can be due to severe congestion or due to physical errors on the link resulting in a loss of credits.
Q: What cars does Ed have? (The hint was 1360 total HP)
A: This is a fun one to answer!
- 1986 Dodge Omni GLH Turbo – 146HP
- 2009 Cadillac CTS-V – 556HP
- 2014 Ford Shelby GT 500 convertible – 662HP
All are manual transmissions!
Q: When is the next FCIA webcast?
A: Glad you asked! It’s Fibre Channel Cabling on April 19, 2018. We’ve lined up a great presentation with experts Zach Nason, Data Center Systems; Greg McSorley, Amphenol-Highspeed; and Mark Jones, Broadcom. Register here.