Submitting Jobs on Athene using Slurm (salloc, srun, sbatch, sinfo, squeue)
Summary
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. See the Slurm documentation at schedmd.com to learn more.
Body
- Upload your code and data to your Athene home directory using Ondemand, Globus, or SCP.
- Choose the type of job you would like to run: salloc, srun, or sbatch.
- salloc requests and holds an allocation on the cluster so you can run interactive jobs using srun, mpiexec, and other applications.
For more information, execute “man salloc” in an athene-login.hpc.fau.edu terminal.
- Example:
salloc -N 1 --exclusive
srun hostname
- srun requests an allocation on the cluster if one has not already been granted by salloc, then executes the specified command.
For more information, execute “man srun” in an athene-login.hpc.fau.edu terminal.
- Example:
srun -N 1 --exclusive hostname # print the hostname of the allocated node
- sbatch submits a job that runs in the background, detached from your current terminal. It allocates resources similarly to salloc and logs the results to a file. If your computer loses its connection to the cluster, sbatch jobs will continue to run, making this a very powerful command. An example sbatch job is provided below.
Create a script named {JOBNAME}.sh to start your job containing the following:
#!/bin/sh
#SBATCH --partition=shortq7
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH --mem-per-cpu=16000
# Load modules if needed, run staging tasks, etc.
# Execute the task
srun hostname
- Run the command: chmod +x {JOBNAME}.sh to make the job executable.
- Submit the job using the sbatch command.
sbatch {JOBNAME}.sh
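Once submitted, sbatch prints a job ID, and by default Slurm writes the job's output to a file named slurm-<jobid>.out in the directory you submitted from. A quick way to check on the job (the job ID 123456 below is a placeholder, not a real job):

```shell
# Submit the script; sbatch prints something like "Submitted batch job 123456"
sbatch {JOBNAME}.sh

# List only your own jobs to see whether the job is pending (PD) or running (R)
squeue -u $USER

# After the job finishes, inspect its output log (123456 is a placeholder job ID)
cat slurm-123456.out
```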
- Please adjust the partition (queue), application to execute, memory, tasks, and heap sizes as needed in these examples to create your own job scripts. If you need help, please let us know by submitting a ticket to the Help Desk.
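As one illustration of adjusting those settings, a customized job script might look like the following sketch. The partition name, time limit, memory, and task counts here are assumptions to adapt to your own workload, and ./my_application is a hypothetical program:

```shell
#!/bin/sh
#SBATCH --job-name=myjob          # name shown in squeue output
#SBATCH --partition=longq7        # queue, as listed by sinfo
#SBATCH --time=2-00:00:00         # wall-clock limit: 2 days
#SBATCH -N 1                      # one node
#SBATCH --ntasks=4                # four tasks (e.g., MPI ranks)
#SBATCH --mem-per-cpu=8000        # memory per CPU, in MB
#SBATCH --output=myjob-%j.out     # %j expands to the job ID

# Load modules, if needed, then run the application
srun ./my_application
```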
- You can print a list of queues (partitions) with the sinfo command.
[user@rocky-login011 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
shortq7* up 2:00:00 2 down* node[001,056]
shortq7* up 2:00:00 3 mix node[009,030-031]
shortq7* up 2:00:00 37 alloc node[002-006,008,011-014,019-020,051,054,057-058,062-065,067-081,083-084]
shortq7* up 2:00:00 30 idle gpu-exxact[1-5],gpu-k80,node[007,010,027-029,032,052-053,059-061,082,087-098]
longq7 up 7-12:00:00 2 down* node[001,056]
longq7 up 7-12:00:00 1 mix node009
longq7 up 7-12:00:00 37 alloc node[002-006,008,011-014,019-020,051,054,057-058,062-065,067-081,083-084]
longq7 up 7-12:00:00 6 idle node[007,010,059-061,082]
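If the full listing is too long, sinfo accepts filters; for example, to show only one partition or only idle nodes (the partition name below is taken from the output above):

```shell
# Show only the shortq7 partition
sinfo -p shortq7

# Show only idle nodes, across all partitions
sinfo --states=idle
```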
- You can see the status of queued and running jobs using the squeue command.
[user@rocky-login011 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2552838 longq7 MS1b_m14 usera PD 0:00 1 (Dependency)
2553360 longq7 TS_71-72 userb PD 0:00 2 (Resources)
2553425 longq7 TS_69-70 userb PD 0:00 2 (Resources)
2552836 longq7 MS1b_m14 userc R 4-12:08:47 1 node002
2552837 longq7 MS1b_m14 userc R 5-17:38:36 1 node063
2553116 longq7 Homoseri userd R 7-03:41:54 1 node078
2553117 longq7 Homoseri userd R 7-02:29:54 1 node071
2553157 longq7 HGE_V1_1 usere R 6-07:50:41 1 node003
2553288 longq7 Cys_Ket_ userf R 4-21:02:05 1 node004
...
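squeue can likewise be narrowed down; for example, to watch only your own jobs, or only the pending jobs in a given partition (the partition name below is taken from the output above):

```shell
# Show only your own jobs
squeue -u $USER

# Show only pending jobs in longq7, with their wait reasons
squeue -p longq7 -t PENDING
```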
For more information regarding Slurm, see the man pages and the Slurm Quick Start User Guide.
Details
Article ID:
141472
Created
Mon 8/29/22 3:21 PM
Modified
Fri 7/18/25 9:43 AM