In this tidbit I want to provide a quick overview of a hardware issue I recently encountered in a customer's environment, in the hope that it helps someone else who runs into this type of issue and saves them some time. For the record, I did have to engage TAC on this one to get a better understanding of the steps I needed to take. I highly recommend working with TAC for hardware issues, by the way.
Brief overview of the actual server: a UCS C220 M5 operating as one of three nodes in a DNAC cluster, more specifically a DN2-HW-APL appliance.
The morning started with the realization that the DNAC cluster had some service issues and that one of the nodes was 'unreachable' according to the DNAC admin UI.
After further recon via CIMC, the following errors were discovered:
A “RAID CONTROLLER DEGRADED” alert was present in the fault events.
The Storage tab showed a server fault for the modular RAID controller, but the BBU and MRAID were reporting operational parameters, which pointed to a loss of connection on the RAID controller side.
Screenshot from CIMC:
TAC's response: In rare cases the RAID controller hangs because it is unable to report its status to CIMC, and this triggers an alert even though the RAID is operable. The workaround for this scenario is to reseat the RAID controller to re-establish the connection.
At this point there were two options per TAC:
Try reseating the controller to see if that resolves the issue
RMA the controller and/or replace the entire server
Option 1 is obviously the first and quickest choice. The steps for this option were as follows:
Completely shut down the server & remove power
Slide the server out of the rack & remove the top
Remove the mRAID riser from the server:
Using both hands, grasp the external blue handle on the rear of the riser and the blue finger-grip on the front end of the riser.
Lift the riser straight up to disengage it from the motherboard socket.
Set the riser upside down on an antistatic surface.
Remove any existing card from the riser:
Disconnect cables from the existing card.
Open the blue card-ejector lever on the back side of the card to eject it from the socket on the riser.
Pull the card from the riser and set it aside.
Per TAC, wait 2-3 minutes, then seat the card back in the riser socket and reconnect its cables before proceeding with the following steps.
Return the riser to the server:
Align the connector on the riser with the socket on the motherboard. At the same time, align the two slots on the back side of the bracket with the two pegs on the inner chassis wall.
Push down gently to engage the riser connector with the motherboard socket. The metal riser bracket must also engage the two pegs that secure it to the chassis wall.
Replace the top cover on the server.
Finally, return the server to the rack, reconnect the cables, and fully power on the server by pressing the Power button.
Quick inside view of UCS server:
Lastly, a few minutes after powering up the server, return to CIMC and confirm that the card and chassis are in good standing.
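If you'd rather script that last check than click through CIMC, here is a minimal sketch of the idea. It assumes the CIMC Redfish interface is enabled and that you have already pulled the storage resource as JSON (the fetch itself is not shown). It also assumes the payload carries the standard Redfish StorageControllers array with Status.Health and Status.State fields; your firmware may expose this differently. The function name unhealthy_controllers and the sample payload are hypothetical, made up purely for illustration.

```python
from typing import Any


def unhealthy_controllers(storage: dict[str, Any]) -> list[str]:
    """Describe every controller that is not both Health=OK and State=Enabled.

    `storage` is assumed to be a Redfish-style Storage resource that has
    already been decoded from JSON into a dict.
    """
    problems = []
    for ctrl in storage.get("StorageControllers", []):
        status = ctrl.get("Status", {})
        health = status.get("Health")
        state = status.get("State")
        if health != "OK" or state != "Enabled":
            name = ctrl.get("Name", "<unnamed controller>")
            problems.append(f"{name}: Health={health}, State={state}")
    return problems


# Hypothetical payload shaped like a Redfish Storage resource.
# It is NOT captured from a real CIMC; the values are placeholders.
sample = {
    "Id": "MRAID",
    "StorageControllers": [
        {
            "Name": "Example RAID Controller",
            "Status": {"State": "Enabled", "Health": "OK"},
        },
    ],
}

issues = unhealthy_controllers(sample)
print("controller looks healthy" if not issues else "\n".join(issues))
```

An empty list means nothing was flagged. Anything else is worth a closer look in CIMC before you call the reseat a success.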
Detailed steps/info from the Cisco docs: Cisco UCS C220 M5 Server Installation and Service Guide, "Maintaining the Server" chapter.
Good luck & Cheers!