Dynamically Adjustable Queue to Optimize the Roar GPU Cluster (Faculty/Junior Researcher Collaboration Opportunity)

PI: Guido Cervone (Geography)

The goal of this research is to optimize the queue for the Roar GPU cluster. Currently, GPUs are allocated through either the reserved model or the credit model. Under the reserved model, a user buys an allocation that is guaranteed to be available. However, most users do not use their allocations 24/7/365, so the allocations remain idle for a considerable amount of time. Under the credit model, users do not reserve an entire card; they pay only for their actual utilization.

The available cards are already oversold, but because of their high idle time, it is possible to allocate a significant number of computing hours to credit users. The queue needs to be optimized dynamically so that users who paid for a reservation can access their cards promptly when needed, while credit users can also access these highly prized resources when they are not being fully utilized, with minimal queueing time.
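The trade-off described above can be illustrated with a toy two-class queue. This is only a minimal sketch under stated assumptions, not the Roar scheduler: the names `Job`, `schedule`, and `headroom` are hypothetical, each job is assumed to need exactly one GPU, and the policy shown (reserved jobs first, credit jobs filling idle GPUs beyond a held-back margin) is just one simple way to express the idea.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    kind: str  # "reserved" or "credit" (hypothetical labels)


def schedule(free_gpus, reserved_queue, credit_queue, headroom=1):
    """Return the jobs to start now, assuming one GPU per job.

    Reserved jobs are started first. Credit jobs then fill the remaining
    idle GPUs, except for `headroom` GPUs that are held back so that a
    reserved user arriving later does not have to wait.
    """
    started = []
    # Reserved users paid for guaranteed access, so they go first.
    while free_gpus > 0 and reserved_queue:
        started.append(reserved_queue.popleft())
        free_gpus -= 1
    # Credit users take whatever is idle beyond the headroom.
    while free_gpus > headroom and credit_queue:
        started.append(credit_queue.popleft())
        free_gpus -= 1
    return started


reserved = deque([Job("r1", "reserved")])
credit = deque([Job("c1", "credit"), Job("c2", "credit"), Job("c3", "credit")])
started = schedule(4, reserved, credit, headroom=1)
print([job.name for job in started])  # ['r1', 'c1', 'c2']
```

In this sketch, a larger `headroom` favors prompt access for reserved users at the cost of more idle time, while a smaller one favors credit users. A dynamic variant could adjust that value from observed usage, which is the kind of tuning the project is meant to investigate.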

Initial efforts to provide the resources to both types of users have been successful, but also conservative. In fact, overall idle time for the cards remains between 60% and 70%, strengthening the hypothesis that a different queue strategy will better serve PSU research.

The project will leverage queueing theory with dynamic adjustments based on actual usage. Students will have the opportunity to work in an operational HPC environment and to optimize the allocation of resources to better serve the PSU research enterprise.