Resource Request for HPC Deployment Automation for AI/HPC systems (Faculty/Junior Researcher Collaboration Opportunity)

PI: Gary Skouson (ICDS)

Plan for funding tuition for graduate students, or the remainder of the researcher’s salary for postdoc and research faculty: No plan

Purpose:

The ICDS Systems Engineering team is looking for assistance in developing and improving deployment automation for the ICDS AI and HPC systems. We aim to continue following DevOps practices, using infrastructure-as-code processes to deploy AI and computational resources for our research computing and data storage platforms.

Background:

ICDS currently uses xCAT to deploy system resources. With IBM's exit from supporting and developing xCAT, we are evaluating other options for current and future AI/HPC and cloud system deployments. Candidates that may meet our needs include Lenovo's Confluent, the open-source Warewulf, and OpenStack Ironic/Bifrost, as well as emerging systems such as OpenCHAMI.

Project Scope and Deliverables

Work with technical staff to use, configure, and compare different deployment options for new systems.

People Resource Request

One Junior Researcher for Fall 2025 and Spring 2026

The project will include time working with other ICDS technical staff. While not essential initially, the following skills will be helpful in participating fully:

 Linux command-line and systems experience

 Linux scripting experience (Bash, Python, etc.)

 Version control experience using Git as part of DevOps, or experience developing CI/CD (continuous integration/continuous delivery) systems

 Comfort with learning deployment systems, including out-of-band management (BMC, IPMI, iDRAC, etc.) and network boot protocols (DHCP, PXE, etc.)

 Experience building Linux images or containers with image build tools such as Buildah

 Experience configuring or deploying cloud-like environments such as OpenStack or Kubernetes

Justification

Additional support is needed to ensure that the candidate deployment options are tested consistently.

Expected Outcomes

 Pro/con comparison of the options for future ICDS system deployments

 Pilot setup for system deployment

 Options for migrating our current xCAT-based deployment to the proposed new system