Roar Outage Details

ICDS engineers have recently updated and expanded the outage protocol to improve recovery time and broaden testing. The outage workflow now uses ServiceNow and includes:

  • Expanded tracking and documentation of planned changes
  • Leadership review to understand potential impacts
  • Expanded system test plan, including client-submitted use cases

 

Planned updates for the May 15, 2025 outage include:

  • Troubleshoot VAST power redundancy
  • Troubleshoot VAST RDMA timeout issues
  • Storage: Globus software update from xx to yy
  • Scheduler: Slurm update from 24.05.4 to 24.05.8
  • Operating System Image and Package updates
  • Update the symlink at /storage/icds/tools/sw/firefox to point to the updated Firefox
  • Cluster Admin Node Updates
  • Re-sync the software stack (RC->RR)
  • MATLAB License Updates
  • COMSOL License Updates