By Howard Johnson, FCIA Member, INCITS/Fibre Channel chair, Broadcom Technology Architect
I cannot wait for self-driving cars! I want to be able to hop in my car, say, “go to work,” and have it get me there safe and sound – under any circumstances. Yet, even with all of the advancements from the likes of Tesla, BMW, Mercedes, and the big three, it is going to be a while before we get to that point. In the meantime, I am enjoying some of the cool features that make the task of driving easier, like lane detection, lane keeping, and lane centering. Just thinking of all the technology needed to make these features work reveals the challenges for any system attempting this level of automation. One of the biggest challenges is determining how to assimilate and condense the massive amounts of information presented to the system at any given time.
In the 2020 Solutions Guide, we introduced Fabric Notifications, which is an element of automation for the Autonomous SAN in Fibre Channel. It addresses the challenge of processing device-generated data by a) enlisting the existing features of Fibre Channel SANs to invoke notifications for detected problems and b) condensing the information into a simple description used by the participating devices to drive the automation.
Lane Detection
Lane detection was the first step toward self-driving cars. Automobiles outfitted with cameras gather information about the road, which is fed into an onboard computing complex that warns the driver if the car deviates from its lane.
Similarly, devices in a SAN have significant capabilities to detect anomalies and log messages about them. The difficulty is that the ever-increasing number of devices in storage and SAN infrastructures produces ever-increasing amounts of data, and administrators have to process that data to find the source of the problems the devices report. As the system grows, the amount of information in the logs grows beyond the administrator’s capacity to process it in a timely manner. Even worse, the sheer volume and velocity of the data can turn the smallest problem into hours, if not days, of work to isolate and mitigate.
For example, in a large storage network, a faulty optic can cause each layer of the system to generate log messages. The network hardware (e.g., switches) logs that a transmission error occurred, the transport hardware (e.g., HBAs) logs that an IO request error occurred, the IO system software (e.g., multi-pathing) logs that an IO retry occurred, and the application (e.g., backup application) logs that a read or write request timed out. Each layer of the system produces log entries for the same instance of the error. Often in communications networks, the process repeats numerous times depending on the volume of traffic and frequency of failures caused by the faulty optic. The result is a large amount of log data for the administrator to wade through!
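To make the fan-out concrete, here is a minimal Python sketch of how one fault shows up as a separate log entry at each layer, and how grouping by the affected port collapses those entries again. The record fields, layer names, and port label are hypothetical and invented for illustration; real device logs differ by vendor and rarely share a common key this neatly.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    layer: str    # hypothetical layer name: switch, hba, multipath, application
    port: str     # hypothetical identifier of the affected port
    message: str


def group_by_port(entries):
    """Collapse log entries into one bucket per affected port."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.port].append(entry)
    return dict(groups)


# One faulty optic, reported once per layer (illustrative data only).
entries = [
    LogEntry("switch", "port-7", "transmission error"),
    LogEntry("hba", "port-7", "IO request error"),
    LogEntry("multipath", "port-7", "IO retry"),
    LogEntry("application", "port-7", "read request timed out"),
]

groups = group_by_port(entries)
for port, related in groups.items():
    layers = ", ".join(e.layer for e in related)
    print(f"{port}: {len(related)} entries from {layers}")
```

In practice the hard part is that the layers do not agree on a shared identifier, which is why the administrator ends up correlating the entries by hand.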
Lane Keeping
The task of the administrator is to sort through all of these messages to find the one message that indicates a transmission error occurred. It is no wonder administrators are turning to various DevOps tools to help them process log data for pertinent messages and generate summaries.
Typically, these DevOps tools provide a kind of lane-keeping function similar to what is available in newer vehicles. With lane keeping, the car assists the driver in maintaining its position in the lane – it tries to keep the vehicle between the lines. In a storage network, the DevOps tools leverage the device APIs to extract information and condense it into actions that keep the system from wandering too far off course. These tools can often generate work tickets that summarize the actions necessary to resolve a particular problem.
This level of automation provides the valuable function of reducing the massive quantity of log data generated by the system into a much smaller set of tasks that help keep the environment running. However, the administrators are not off the hook because they react to the tickets generated by the system, execute the instructions summarized on the ticket, and determine if the actions described in the ticket actually resolve the problem.
Lane Centering
The current state of self-driving technology provides a lane-centering function that does more than keep the car between the lines; it optimizes the car’s position in the lane. Rather than warning the driver that the car is outside of the lane or noticeably correcting the car’s path, lane centering constantly monitors the data to keep the car centered.
Fabric Notifications takes a similar approach for Fibre Channel networks. It leverages the intelligence of the devices in the system and provides a method for condensing and sharing the information with the device’s peers. Each device evaluates the information in the notification to determine the most appropriate action to keep the system functioning optimally – keep it centered.
The Art of Automation
On the road to self-driving cars, the automation of driving evolved from lane detection to lane keeping to lane centering. With each innovation, a new level of possibilities is revealed. Automation in the data center is also a continuously evolving endeavor. Each step produces solutions that reveal the objectives for the next level of automation.
The art of the possible begins with an idea that expands and evolves simply. Fabric Notifications embraces this concept and allows solutions to evolve and expand as needed. Storage arrays have adopted Fabric Notifications to address the problem of information overload impeding problem determination, isolation, and mitigation. The arrays register to receive notifications and store the notifications in their system logs. This provides a key for DevOps tools to locate and surface, which reduces problem determination time. The notifications include the location of the detected event, which reduces the problem isolation time. Finally, knowing the nature and location of the problem reduces the problem-mitigation time from days to minutes.
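The register-and-log pattern described above can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the `Fabric` class, the callback signature, and the `FABRIC-NOTIFICATION` log marker are hypothetical and do not represent the Fibre Channel standard's wire format or any vendor's API.

```python
class Fabric:
    """Hypothetical stand-in for the fabric's notification service."""

    def __init__(self):
        self._subscribers = []

    def register(self, callback):
        # A device registers its interest in receiving notifications.
        self._subscribers.append(callback)

    def notify(self, location, nature):
        # Deliver the condensed description to every registered device.
        for callback in self._subscribers:
            callback(location, nature)


class StorageArray:
    KEY = "FABRIC-NOTIFICATION"  # hypothetical marker, not a real log format

    def __init__(self, fabric):
        self.system_log = []
        fabric.register(self.on_notification)

    def on_notification(self, location, nature):
        # Store the notification in the system log under a searchable key.
        self.system_log.append(f"{self.KEY} location={location} nature={nature}")


def find_notifications(log_lines, key=StorageArray.KEY):
    """What a log-processing tool might do: surface only the keyed entries."""
    return [line for line in log_lines if line.startswith(key)]


fabric = Fabric()
array = StorageArray(fabric)
array.system_log.append("routine: cache flush complete")
fabric.notify("switch-1:7", "link integrity")

hits = find_notifications(array.system_log)
print(hits)
```

The point of the sketch is the shape of the flow: the tool searches for one well-known key instead of correlating unrelated messages, and the entry it finds already carries the location and nature of the event.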
Similarly, a Fabric Notifications-enabled server with a multi-path solution leverages the detection capabilities of the devices to locate intermittent physical issues. The detecting device generates the notification and sends it to the peers affected by the event, so the multi-path solution instantly knows the location and nature of the physical error. It can then “route around” the impacted path by utilizing good, alternate paths. Much like lane centering for the self-driving car, the administrator is not involved in the identification, isolation, and recovery of the error.
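As a rough illustration of "routing around" an impacted path, the following Python sketch marks every path that crosses the reported location as unhealthy and leaves the rest available for IO. The `Notification` and `Path` types, their fields, and the port labels are hypothetical; an actual multi-path driver applies its own policies and may degrade a path without removing it from service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    port: str     # hypothetical: where the event was detected
    nature: str   # hypothetical: what kind of event it was


@dataclass
class Path:
    name: str
    ports: tuple  # ports the path traverses (hypothetical labels)
    healthy: bool = True


def apply_notification(paths, notification):
    """Mark every path that crosses the reported port as unhealthy."""
    for path in paths:
        if notification.port in path.ports:
            path.healthy = False


def usable_paths(paths):
    """Return the paths still considered good for IO."""
    return [p for p in paths if p.healthy]


paths = [
    Path("path-a", ("hba-0", "switch-1:7", "array-0")),
    Path("path-b", ("hba-1", "switch-2:3", "array-1")),
]

apply_notification(paths, Notification("switch-1:7", "link integrity"))
print([p.name for p in usable_paths(paths)])
```

Because the notification names the location directly, the selection step needs no log analysis at all, which is what removes the administrator from the loop.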
Self-Driving SANs
The Fibre Channel technical committee developed the architecture for Fabric Notifications to improve the resiliency of Fibre Channel SANs. The objective was to simplify the task of administrators as the scale of their systems grew. By employing the intelligence inherent in the system at the point of detection, the Fabric Notifications architecture produced flexible solutions that can adapt and move closer toward full automation.