ICDS engineers recently updated and expanded the outage protocol to improve recovery time and broaden testing. The outage workflow now uses ServiceNow and includes:
- Expanded tracking and documenting of planned changes
- Leadership review to understand potential impacts
- Expanded system test plan including client-submitted use cases
Planned updates for the May 15, 2025 outage include:
- Troubleshoot VAST power redundancy
- Troubleshoot VAST RDMA timeout issues
- Storage: Globus software update from xx to yy
- Scheduler: Slurm update from 24.05.4 to 24.05.8
- Operating System Image and Package updates
- Update the symlink at /storage/icds/tools/sw/firefox to point to the updated Firefox
- Cluster Admin Node Updates
- Re-sync the software stack (RC->RR)
- MATLAB License Updates
- COMSOL License Updates
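
For the symlink item in the list above, the announcement does not say how the link will be changed. One common approach is sketched below: create a temporary link next to the existing one and rename it over the old link, so the path never disappears for users who are mid-session. The function name `repoint_symlink` and the `firefox-old` / `firefox-new` directory names are hypothetical and used only for illustration; the real target of /storage/icds/tools/sw/firefox is not given here.

```python
import os
import tempfile


def repoint_symlink(link_path: str, new_target: str) -> None:
    """Repoint link_path at new_target by renaming a temporary link over it.

    On POSIX systems the rename is atomic, so link_path always resolves
    to either the old target or the new one.
    """
    link_dir = os.path.dirname(os.path.abspath(link_path))
    tmp_name = f".{os.path.basename(link_path)}.tmp.{os.getpid()}"
    tmp_link = os.path.join(link_dir, tmp_name)

    # Clear any leftover temporary link from an interrupted earlier run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)

    os.symlink(new_target, tmp_link)
    # os.replace renames the temporary link over the existing link itself;
    # it does not follow the link into the old target.
    os.replace(tmp_link, link_path)


if __name__ == "__main__":
    # Demonstration in a scratch directory (hypothetical names, not ICDS paths).
    scratch = tempfile.mkdtemp()
    old_dir = os.path.join(scratch, "firefox-old")
    new_dir = os.path.join(scratch, "firefox-new")
    os.mkdir(old_dir)
    os.mkdir(new_dir)

    link = os.path.join(scratch, "firefox")
    os.symlink(old_dir, link)
    repoint_symlink(link, new_dir)
    print(os.readlink(link))
```

The temporary link is created in the same directory as the final link so that the rename stays within one filesystem, which is what keeps the swap atomic.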