Technical Details on the Roar Collab August Outage - PSU Institute for Computational and Data Sciences

During the August Roar and Roar Collab outage, ICDS worked with its group storage vendor to troubleshoot false positive alerts from hardware.

During a troubleshooting session on Thursday, a piece of hardware became unresponsive and was replaced. While ICDS and the vendor were trying to force SSDs back to their proper owner, the cluster panicked, causing a Data Unavailable Event (DUE) during which NFS was not available. ICDS and the vendor were aware of this possibility and had therefore waited for the planned downtime to attempt the operation. Following the panic, three SSDs were identified as faulted and a RAID rebuild began. The RAID rebuild should complete between 3AM and 7AM Sunday.

During Sunday’s 9AM-1PM planned downtime, ICDS and the vendor will restore the cluster to full hardware High Availability. Because this work could require actions that would cause another DUE, the brief outage on Sunday was deemed prudent for system health.

Finally, despite initial concerns to the contrary, ICDS engineers were able to preserve all user “scratch” data during the outage.