Resource Request for Deployment Automation for AI/HPC Systems
PI: Gary Skouson (ICDS)
Plan for funding tuition for graduate students, or the remainder of the researcher’s salary for postdoc and research faculty: No plan
Purpose:
The ICDS Systems Engineering team is seeking assistance to develop and improve system deployment automation for the ICDS AI and HPC systems. We aim to continue following DevOps practices, implementing infrastructure-as-code processes to deploy AI and computational resources for our research computing and data storage platforms.
Background:
ICDS currently uses xCAT to deploy system resources. With IBM’s exit from supporting and developing xCAT, we are evaluating other options for current and future AI/HPC and cloud system deployments. Candidate software includes Lenovo’s Confluent, the open-source Warewulf, and OpenStack Ironic/Bifrost, as well as emerging systems such as OpenCHAMI.
Project Scope and Deliverables
Work with technical staff to use, configure, and compare different system deployment options for new systems
People Resource Request
One Junior Researcher for Fall 2025 and Spring 2026
The project will include time working with other ICDS technical staff. While not essential initially, the following skills will be helpful in participating fully:
Linux command-line and systems experience
Linux scripting experience (Bash, Python, etc.)
Version control experience using Git as part of DevOps, or experience developing CI/CD (continuous integration/continuous delivery) systems
Comfort learning deployment systems, including out-of-band management (BMC, IPMI, iDRAC, etc.) and network boot protocols (DHCP, PXE, etc.)
Experience building Linux images or containers with image build tools such as Buildah
Experience configuring or deploying cloud-like environments such as OpenStack or Kubernetes
Justification
Additional support is needed to ensure consistent testing of multiple options
Expected Outcomes
Comparison of the various options for future ICDS system deployments, with pro/con lists for each
Pilot setup for system deployment
Options for migrating our existing deployment setup to the proposed new system