Arbiter: Dynamically Limiting Resource Consumption on Login Nodes
Event Type
Student Paper
Cluster Management
Time
Wednesday, July 31, 11:30am - 12pm
Location
Regency AB
Description
In a high-performance computing (HPC) environment, shared resources are often the most capricious and unreliable. The performance of each process is affected by the behavior of all connected users. This applies particularly to the login nodes of HPC clusters, which are used by multiple people at a time. Usage policies typically limit users to small, lightweight tasks such as scripting, compiling, submitting batch jobs, and staging data. If a user behaves poorly, for example by using a significant portion of CPU time or consuming a large fraction of the total memory, other processes on the same node can be slowed or run out of memory. Policies on login nodes that prohibit such behavior, however, may not be enforced automatically because of technical limitations.

Arbiter is a daemon that overcomes such limitations by using cgroups, a feature of the Linux kernel, to monitor and limit usage. It can both enforce default limits—to ensure the server remains responsive—and penalize users who are consuming resources immoderately while still allowing for brief testing of workloads better suited to computational resources. Arbiter tracks the total memory and CPU usage of a user, reduces the quantity available for a period of time if usage is excessive, and sends emails to inform users of policies and potentially impactful behaviors. Throttling performed by Arbiter encourages users to use computational resources for intensive tasks (through the batch system), thereby supporting the reliability and responsiveness of login nodes.
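
The track-then-penalize behavior described above can be illustrated with a minimal Python sketch. This is not Arbiter's implementation. The class and function names, the threshold, the strike count, the penalty factor, and the penalty duration are all hypothetical values chosen for illustration. The sketch only decides which limits should apply. It does not write to the cgroup filesystem. The one kernel detail it relies on is the cgroup v2 `cpu.max` format, which is a quota and a period in microseconds separated by a space.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Limits:
    cpu_cores: float  # CPU quota expressed in cores (1.0 = one full core)
    mem_bytes: int    # memory ceiling in bytes


@dataclass
class UserState:
    over_count: int = 0         # consecutive samples above the threshold
    penalty_until: float = 0.0  # timestamp at which the penalty expires


@dataclass
class Throttler:
    """Hypothetical policy engine; every parameter here is an assumption."""
    default: Limits
    threshold: float = 0.8           # fraction of the default limit that counts as "excessive"
    strikes: int = 3                 # consecutive excessive samples tolerated (allows brief tests)
    penalty_factor: float = 0.5      # limits are scaled by this factor while penalized
    penalty_seconds: float = 1800.0  # how long a penalty lasts
    users: dict = field(default_factory=dict)

    @property
    def penalized(self) -> Limits:
        return Limits(self.default.cpu_cores * self.penalty_factor,
                      int(self.default.mem_bytes * self.penalty_factor))

    def observe(self, uid: int, cpu_used: float, mem_used: int, now: float) -> Limits:
        """Record one usage sample and return the limits that should apply."""
        state = self.users.setdefault(uid, UserState())
        if now < state.penalty_until:
            return self.penalized  # penalty still active
        over = (cpu_used > self.threshold * self.default.cpu_cores
                or mem_used > self.threshold * self.default.mem_bytes)
        # Short bursts reset the counter as soon as usage drops again.
        state.over_count = state.over_count + 1 if over else 0
        if state.over_count >= self.strikes:
            state.over_count = 0
            state.penalty_until = now + self.penalty_seconds
            return self.penalized
        return self.default


def cpu_max_value(cpu_cores: float, period_us: int = 100_000) -> str:
    """Format a quota for the cgroup v2 cpu.max file: '<quota_us> <period_us>'."""
    return f"{int(cpu_cores * period_us)} {period_us}"
```

A daemon built along these lines would call `observe` once per sampling interval for each user and write the resulting values into that user's cgroup. It would also send the notification email when a penalty begins, a step this sketch omits.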