Deploying a Slurm Workload Manager cluster
Using ElastiCluster
The following commands are examples of how to use ElastiCluster to create and interact with a simple Slurm cluster. For more information, refer to the ElastiCluster documentation.
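The examples assume an ElastiCluster configuration describing the cluster already exists (typically in ~/.elasticluster/config). The sketch below shows roughly what such a configuration might look like for an OpenStack cloud such as NeCTAR; all credentials, image IDs, flavors and group names are placeholders, and the exact key names should be checked against the ElastiCluster documentation for your installed version:
# Hypothetical sketch of an ElastiCluster configuration for a small Slurm cluster.
# All values are placeholders; adjust them for your own cloud, credentials and images.
[cloud/nectar]
provider=openstack
auth_url=<keystone-url>
username=<username>
password=<password>
project_name=<project>
[login/ubuntu]
image_user=ubuntu
user_key_name=elasticluster
user_key_private=~/.ssh/id_rsa
user_key_public=~/.ssh/id_rsa.pub
[setup/slurm]
provider=ansible
frontend_groups=slurm_master
compute_groups=slurm_worker
[cluster/slurm]
cloud=nectar
login=ubuntu
setup=slurm
security_group=default
image_id=<image-uuid>
flavor=m3.small
frontend_nodes=1
compute_nodes=2
ssh_to=frontend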
Deploy a Slurm cluster on the cloud using the configuration provided:
elasticluster start slurm -n hpc1
List information about the cluster:
elasticluster list-nodes hpc1
Example output:
Cluster name: hpc1
Cluster template: slurm
Default ssh to node: frontend001
- compute nodes: 2
- frontend nodes: 1
To login on the frontend node, run the command:
elasticluster ssh hpc1
To upload or download files to the cluster, use the command:
elasticluster sftp hpc1
compute nodes:
- compute001
    connection IP: <x.x.x.x>
    IPs: <x.x.x.x>
    instance id: <uuid>
    instance flavor: m3.small
- compute002
    connection IP: <x.x.x.x>
    IPs: <x.x.x.x>
    instance id: <uuid>
    instance flavor: m3.small
frontend nodes:
- frontend001
    connection IP: <x.x.x.x>
    IPs: <x.x.x.x>
    instance id: <uuid>
    instance flavor: m3.small
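After logging in to the frontend node you can check that Slurm itself is working. As a simple sanity check using standard Slurm commands (assuming the default partition created by the ElastiCluster playbooks):
# On the frontend node: list partitions and node states
sinfo
# Run a trivial job across both compute nodes to confirm they accept work
srun -N 2 hostname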
Grow the cluster to 10 compute nodes (add another 8 compute nodes):
elasticluster resize hpc1 -a 8:compute
Terminate (destroy) the cluster:
elasticluster stop hpc1
If the deployment fails during the start phase, you may need to fix the issue and then run the following to continue:
elasticluster setup hpc1
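To see more detail about why a setup is failing, ElastiCluster's verbosity can be increased; the -v option can normally be repeated for extra detail, but check elasticluster --help on your installation:
elasticluster -vvv setup hpc1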
Some NeCTAR images are set to update packages on first boot. If you are using an image that does this, the initial setup phase will fail with an error such as:
Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
If this happens, wait a minute for the updates to complete and then run:
elasticluster setup hpc1
to finalise the setup of Slurm itself.
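If the error persists, one way to confirm that the first-boot package update has actually finished is to log in to the frontend node and look for running package-manager processes before re-running the setup step (a suggested check, not part of the ElastiCluster workflow itself):
elasticluster ssh hpc1
# On the frontend node: the update is still running while apt/dpkg processes are present
ps aux | grep -E 'apt|dpkg|unattended-upgrade' | grep -v grep
exit
elasticluster setup hpc1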