The FCIA recently hosted a fascinating discussion on Fibre Channel Fabric Notifications called “Introducing Fabric Notifications, From Awareness to Action” where our panel of Fibre Channel experts, Howard Johnson, Mark Jones, Nishant Lodha, Rupin Mohan and Kiran Ranabhor explained how Fabric Notifications work and highlighted some exciting new innovations. If you missed the live panel discussion, it’s available on-demand at the FCIA YouTube Channel and at BrightTALK.
The panel answered several interesting questions during the live event. Here are answers to them all:
Q: Any alarming notification is only as good as the personnel being notified. Often, current SAN switch alerting features send notifications (SNMP traps, emails) well before the impact occurs. How will new fabric notifications platforms overcome a failure to respond to alerts?
Answer: Fabric Notifications are sent in-band through the FC SAN, which, unlike other types of notifications, facilitates automation of corrective actions.
Explanation: The Fabric Notifications architecture is built to enable automated responses by the receiving devices. This approach reduces the reliance on the administrators to receive and react to the notifications. Implementations have the ability to process the notifications based on their interpretation of severity. Thus, vendors have the freedom to deploy a range of solutions from logging to automatic mitigation.
Q. Since hosts need to be aware of the new notifications system, this solution seems appropriate for greenfield deployments but very impractical on existing deployments. True?
Answer: False – Fabric Notification capabilities can be enabled on existing deployments via a simple software upgrade and can greatly benefit brownfield deployments too.
Explanation: The Fibre Channel standards community was keenly aware of the deployment concerns for Fabric Notifications and orchestrated the architecture to allow environments to deploy a mixture of capabilities. The registration process ensures that notifications are only sent to the devices that are capable of receiving them. In addition, the registration operation allows implementations to select the notifications of interest, which ensures the device only receives messages about events it is ready to handle. Lastly, the supplementary information provided in the annexes of each standard provides an operational foundation that encourages device implementations to take mitigation actions in small increments and limit those actions to alleviating the problem without unduly compromising the device.
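The registration behavior described above can be illustrated with a minimal Python sketch. It is not taken from the standards or from any product: the `Registry` class, its methods, and the port names are hypothetical, and the only event types modeled are the three named in this Q&A. The point it shows is that a port which never registers is simply never sent anything, and a registered port only receives the event types it selected.

```python
from enum import Enum


class Event(Enum):
    LINK_INTEGRITY = "link-integrity"
    CONGESTION = "congestion"
    PEER_CONGESTION = "peer-congestion"


class Registry:
    """Hypothetical fabric-side table of which port asked for which events."""

    def __init__(self):
        self._interests = {}  # port name -> set of Event

    def register(self, port, events):
        self._interests[port] = set(events)

    def wants(self, port, event):
        # Unregistered (legacy) ports have no entry, so they never match.
        return event in self._interests.get(port, set())

    def deliver(self, candidates, event):
        """Return the candidate ports that should receive this event."""
        return [p for p in candidates if self.wants(p, event)]


registry = Registry()
registry.register("host-a", [Event.LINK_INTEGRITY, Event.CONGESTION])
registry.register("array-1", [Event.PEER_CONGESTION])
# "host-legacy" never registers, as an unaware device would not.

ports = ["host-a", "host-legacy", "array-1"]
print(registry.deliver(ports, Event.LINK_INTEGRITY))  # ['host-a']
```

In this sketch a mixed environment needs no special handling: the legacy port is filtered out by the same lookup that filters out uninterested ports.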
Q. What is the state of readiness of the storage ecosystem (HBAs, Switches, Storage Arrays) to support Fabric Notifications today?
Answer: The ecosystem is ready!
Explanation: Products implementing Fabric Notifications are available today from Fabric vendors, FC HBA vendors, and several OS vendors. In fact, as of November 2020, IBM AIX 7.2 TL5 and Red Hat RHEL 8.3 with EPEL 8 provide MPIO solutions that employ Fabric Notifications for Link Integrity events. These solutions leverage the currently available Fabric and HBA functionality for Gen6 (16/32GFC) and Gen7 (32/64GFC) Fibre Channel solutions.
Q: If you have congestion on a path, and you signal all (some?) initiators to switch away from that path to alleviate it, couldn’t that cause cascading problems on the other paths because of the additional traffic going over them? Wouldn’t you need to lower – not just switch – overall traffic from the initiators?
Answer: No. Fabric Notifications include a feedback loop that prevents conditions such as cascading effects of corrective actions taken in response to FPIN events.
Explanation: The Fabric Notifications architecture restricts the distribution of the notifications to the devices that have registered for the notifications and the devices that are zoned with the impacting port. This limits the notifications to only those devices that are directly affected by the condition. In the case of a congestion notification, the devices receiving the notification are made aware of the congested port, which allows them to decide if they can move traffic to an alternative path or if they need to lower the I/O rate to the impacted port. Regardless, the device now knows the reason for slower response times is due to congestion at the destination.
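As a rough illustration of the two ideas in this answer, the Python sketch below first narrows the recipients to ports that are both registered and zoned with the congested port, and then shows one way a receiving device might choose between moving traffic and lowering its I/O rate. The function names, the zone layout, and the halve-the-rate policy are all assumptions made for the example; real implementations choose their own policies.

```python
def notified_ports(congested_port, zones, registered):
    """Ports zoned with the congested port that also registered for the event."""
    peers = set()
    for members in zones.values():
        if congested_port in members:
            peers.update(members)
    peers.discard(congested_port)
    return sorted(peers & registered)


def choose_action(paths, congested_port, current_rate, min_rate=100):
    """Pick a response on one device that received a congestion notification.

    paths maps a (hypothetical) path name to the target port it reaches.
    """
    alternatives = [name for name, port in sorted(paths.items())
                    if port != congested_port]
    if alternatives:
        return ("move", alternatives[0])
    # No other route: halve the I/O rate toward the congested port, with a floor.
    # Halving is an arbitrary policy chosen for this sketch.
    return ("throttle", max(min_rate, current_rate // 2))


zones = {
    "zone1": {"host-a", "host-b", "array-p1"},
    "zone2": {"host-c", "array-p2"},
}
registered = {"host-a", "host-c"}

print(notified_ports("array-p1", zones, registered))  # ['host-a']
print(choose_action({"path0": "array-p1", "path1": "array-p2"}, "array-p1", 800))
print(choose_action({"path0": "array-p1"}, "array-p1", 800))
```

Here host-b is zoned with the congested port but did not register, and host-c registered but is not zoned with it, so only host-a is notified. A device with a second path moves traffic; a device with only one path lowers its rate instead.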
Q. Timing is essential here since a credit stall can disappear as fast as it comes. So how “fast” is this FPIN stuff? If we are talking several milliseconds, it will get limited use.
Answer: Very fast, as Fabric Notifications include hardware signals.
Explanation: The Fibre Channel standards committee explicitly addressed the Credit Stall case with the architecture of the Congestion Signal function of Fabric Notifications. This mechanism employs the generation of a primitive signal sent from a transmitter to a receiver on the link. Since this signal is hardware based, the response functions can be tuned to address the conditions at wire speed. However, the architecture also recognized that it is not always necessary to perform mitigation actions at hardware rates. Thus, the architecture provides recommendations for leveraging existing tools for Fabric Notifications that can recognize, notify, and mitigate events faster than human response times, which is a significant improvement over the current state of the art.
Q. How does a host know if it should take action or the arrays should take action or both?
Answer: By design, a coordinated response to FPIN events is not required to alleviate the problem.
Explanation: The beauty of the Fabric Notifications architecture is that all of the devices can take actions independently of each other, so there is no need for a host or array to coordinate their actions based on the notifications they receive. For example, when read oversubscription is detected, the host is notified that it is causing the oversubscription via the Congestion Notification FPIN ELS and the array is notified that the host is the cause of the oversubscription via the Peer Congestion Notification FPIN ELS. Both devices can take mitigating actions (i.e. the host begins throttling read requests and the array begins speed matching). These actions combine to mitigate the issue much faster than if just one device performs the mitigation. In our example, once the mitigation has eliminated the oversubscription condition, the host may stop throttling but the array could remember that “that host” is only capable of accepting data at a certain rate and respond accordingly. This provides a “learning” function that has the effect of reducing the occurrences of oversubscription with that host in the future.
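The read oversubscription example can be sketched in Python as two objects that react independently and never talk to each other. This is an illustrative model only: the `Host` and `Array` classes, the queue-depth halving, and the idea that the array is handed the host's acceptable rate as a number are assumptions for the example, not behavior defined by the standards.

```python
class Host:
    def __init__(self, name, read_depth):
        self.name = name
        self.read_depth = read_depth      # outstanding read requests allowed
        self._normal_depth = read_depth

    def on_congestion(self):
        # Throttle: cut outstanding reads in half, never below one.
        self.read_depth = max(1, self.read_depth // 2)

    def on_clear(self):
        # The host may stop throttling once the condition is gone.
        self.read_depth = self._normal_depth


class Array:
    def __init__(self, link_rate_gbps):
        self.link_rate_gbps = link_rate_gbps
        self.learned_rate = {}  # host name -> rate cap in Gb/s

    def on_peer_congestion(self, host_name, host_rate_gbps):
        # Speed matching: remember the rate this host can accept.
        self.learned_rate[host_name] = min(self.link_rate_gbps, host_rate_gbps)

    def send_rate(self, host_name):
        # Hosts the array has learned nothing about get the full link rate.
        return self.learned_rate.get(host_name, self.link_rate_gbps)


host = Host("host-a", read_depth=64)
array = Array(link_rate_gbps=32)

# Each device handles its own notification; neither consults the other.
host.on_congestion()
array.on_peer_congestion("host-a", host_rate_gbps=8)
print(host.read_depth, array.send_rate("host-a"))  # 32 8

# After the condition clears, the host resumes but the array keeps its cap.
host.on_clear()
print(host.read_depth, array.send_rate("host-a"))  # 64 8
```

The second print shows the "learning" effect described above: the host returns to its normal depth while the array continues to pace data for that host.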
Q. Is there some sort of “SIEM” like tool available for handling Fabric Notifications to prioritize and notify the Admin? Or would this sort of triage mechanism be built into tools like Ansible etc.?
Answer: Unlike other notifications that rely on administrator actions, FPIN events and the associated corrective actions are automated.
Explanation: Fabric Notification events are delivered only to the devices that request participation and have an interest in the event (i.e. those that have registered and are zoned with the affected port). Therefore, exposing the notifications to upper layer management tools is an implementation choice of the device. However, the intent of the architecture is that the end devices take actions based on the event type in order to mitigate the effects of the event. For example, a server that receives an FPIN ELS indicating a Link Integrity event surfaces the event to the MPIO layer to cause the path state to be changed to the “degraded” state. The path selection function of the MPIO solution then removes the “degraded” path from consideration in favor of the remaining healthy paths. These actions immediately address the issue caused by the Link Integrity condition and eliminate the need for human intervention. Consequently, logging or surfacing the event to a DevOps tool simply provides the administrator with a record of the automated actions taken by the devices and provides information about the failing connection. That is, further automation via Ansible or other tools is not required for mitigation, but might be nice to have for audit purposes.
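The MPIO example lends itself to a short Python sketch. The `Multipath` class below is hypothetical and does not mirror any particular MPIO implementation; round-robin selection and the fallback to degraded paths when no healthy path remains are assumptions added to keep the example complete.

```python
class Multipath:
    """Hypothetical MPIO layer with round-robin selection over healthy paths."""

    def __init__(self, paths):
        self.state = {p: "active" for p in paths}
        self.log = []     # record of automated actions, for audit purposes
        self._next = 0

    def on_link_integrity(self, path):
        # A Link Integrity event for this path marks it degraded.
        self.state[path] = "degraded"
        self.log.append(f"{path}: active -> degraded (link integrity)")

    def select_path(self):
        healthy = [p for p, s in self.state.items() if s == "active"]
        # Assumption: use degraded paths only if nothing healthy remains.
        pool = healthy or list(self.state)
        path = pool[self._next % len(pool)]
        self._next += 1
        return path


mp = Multipath(["sda", "sdb", "sdc"])
mp.on_link_integrity("sdb")
picks = [mp.select_path() for _ in range(4)]
print(picks)   # ['sda', 'sdc', 'sda', 'sdc']
print(mp.log)
```

No administrator action is involved in steering I/O away from the degraded path. The `log` list stands in for the record that could be surfaced to a management or DevOps tool afterwards.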
Q. A single device not aware of FPIN could make the whole solution ineffective to solve problems since it will not respond to whatever the network tells it to do. Correct?
Answer: No.
Explanation: One of the key elements guiding the Fabric Notifications architecture is the recognition that deployments would occur piecemeal and that not all devices in an environment would be Fabric Notifications aware at the same time. Understanding this reality, the architecture provides both descriptions and recommendations about device behavior to maximize the positive aspects of adding Fabric Notifications capable devices to existing environments. For example, if an unaware device is the cause of read oversubscription, the Fabric Notifications aware devices receive the notifications and can adjust their behavior accordingly; thus, a Target device may throttle I/O to that device to alleviate the condition. Furthermore, Fabric Notifications aware devices can surface the event notifications which help accelerate the problem determination, isolation, and mitigation actions by the administrators. In this manner, the “good actors” all point to the “bad actor” to help resolve the issue.
Q. Dell/EMC Ansible Playbooks? Which do you suggest?
Answer: Here are some references to Ansible playbooks:
- https://docs.ansible.com/ansible/2.9/modules/list_of_storage_modules.html
- https://www.dell.com/support/manuals/en-us/openmanage-ansible-modules-v2.0.1/omam_2.0.1_users_guide/running-your-first-ome-playbook
Attendees of this webcast also had the opportunity to download the FCIA 2020 Solutions Guide. You can download it here on the FCIA website.