Submitting Jobs on Koko using Slurm (salloc, srun, sbatch, sinfo, squeue)

Summary

Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. See the Slurm documentation to learn more.

Body

  1. Upload your code and data to your Koko home directory using OnDemand, Globus, FileZilla, or SCP.
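    For example, a minimal SCP transfer from your own machine might look like the following sketch; the local directory project and the username yourname are placeholders, and koko-login.fau.edu is the login node referenced later in this article:
      scp -r project/ yourname@koko-login.fau.edu:~/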
  2. Choose the type of job you would like to run (salloc, srun, sbatch)
    1. salloc requests and holds an allocation on the cluster so that you can run interactive jobs with srun, mpiexec, and other applications (see srun below).
      For more details, execute “man salloc” in a koko-login.fau.edu terminal.
      1. Example:
        salloc -N 1 --exclusive
        srun hostname
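      A slightly fuller interactive session is sketched below; the shortq7 partition and the 30-minute time limit are only illustrative values, so adjust them for your own work:
        salloc -N 1 --time=00:30:00 --partition=shortq7
        srun hostname   # runs on the node that salloc allocated
        exit            # leave the salloc shell and release the allocation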
         
    2. srun requests an allocation on the cluster if one has not already been granted by salloc, and then executes the specified command.
      For more details, execute “man srun” in a koko-login.fau.edu terminal.
      1. Example:
        srun -N 1 --exclusive hostname # print the hostname of the allocated node
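      As a further sketch, srun can also launch several copies of a command at once; the task count of 4 below is only an example:
        srun -N 1 -n 4 hostname # run 4 tasks of hostname on one node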
         
    3. sbatch submits a job that runs in the background and is not attached to your current terminal. It requests an allocation much like salloc and writes the job's output to a log file. If your computer loses its connection to the cluster, sbatch jobs continue to run, which makes this a very powerful command. An example of an sbatch job is provided below.
      1. Create a script named {JOBNAME}.sh to start your job containing the following:
        #!/bin/sh
        #SBATCH --partition=shortq7
        #SBATCH -N 1
        #SBATCH --exclusive
        #SBATCH --mem-per-cpu=16000
        # Load modules, if needed, run staging tasks, etc…
        # Execute the task
        srun hostname

      2. Run the command chmod +x {JOBNAME}.sh to make the script executable.
      3. Submit the job using the sbatch command: sbatch {JOBNAME}.sh (a short example of checking on a submitted job is shown after the next item).
    4. Adjust the partition (queue), application to execute, memory, number of tasks, and heap sizes in these examples as needed to create your own jobs. If you need help, please let us know by submitting a ticket to the Help Desk.
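      Once a job has been submitted, a minimal way to check on it and view its output is sketched below; the job ID 12345 is purely illustrative, and slurm-12345.out assumes Slurm's default output file name:
        sbatch {JOBNAME}.sh # prints a line such as: Submitted batch job 12345
        squeue -u $USER     # list your jobs and their current state
        cat slurm-12345.out # view the job's output once the job has run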
  3. You can print a list of partitions (queues) with the sinfo command:
    [user@koko-login2 ~]$ sinfo
    PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
    shortq7* up 2:00:00 2 down* node[001,056]
    shortq7* up 2:00:00 3 mix node[009,030-031]
    shortq7* up 2:00:00 37 alloc node[002-006,008,011-014,019-020,051,054,057-058,062-065,067-081,083-084]
    shortq7* up 2:00:00 30 idle gpu-exxact[1-5],gpu-k80,node[007,010,027-029,032,052-053,059-061,082,087-098]
    longq7 up 7-12:00:00 2 down* node[001,056]
    longq7 up 7-12:00:00 1 mix node009
    longq7 up 7-12:00:00 37 alloc node[002-006,008,011-014,019-020,051,054,057-058,062-065,067-081,083-084]
    longq7 up 7-12:00:00 6 idle node[007,010,059-061,082]
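    To narrow the output, sinfo can also be limited to a single partition; for example, using the shortq7 partition shown above:
    sinfo -p shortq7       # show only the shortq7 partition
    sinfo -N -l -p shortq7 # one line per node, with more detail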
  4. You can see the status of jobs in the queue using the squeue command.
    [user@koko-login2 ~]$ squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               2552838    longq7 MS1b_m14 usera PD       0:00      1 (Dependency)
               2553360    longq7 TS_71-72 userb PD       0:00      2 (Resources)
               2553425    longq7 TS_69-70 userb PD       0:00      2 (Resources)
               2552836    longq7 MS1b_m14 userc  R 4-12:08:47      1 node002
               2552837    longq7 MS1b_m14 userc  R 5-17:38:36      1 node063
               2553116    longq7 Homoseri userd  R 7-03:41:54      1 node078
               2553117    longq7 Homoseri userd  R 7-02:29:54      1 node071
               2553157    longq7 HGE_V1_1 usere  R 6-07:50:41      1 node003
               2553288    longq7 Cys_Ket_ userf  R 4-21:02:05      1 node004
    ... 
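    To show only your own jobs rather than the entire queue, filter by user; $USER expands to your username in the shell:
    squeue -u $USER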
    

For more information regarding Slurm, see the Slurm manuals and the quick start guide.

Details


Article ID: 141472
Created
Mon 8/29/22 3:21 PM
Modified
Mon 5/15/23 1:38 PM