Slurm

Slurm is the scheduler that decides where your work runs and for how long. On Atlantis, almost everything you do with compute resources should go through Slurm. You will NOT have access to most compute resources unless they are allocated to you through Slurm.

Partitions

Partitions define the types of jobs that can run on the cluster, and they come with different limits and policies. When you submit a job, you need to pick a partition that matches your needs.

You specify a partition with -p <partition> when you submit a job. If you do not specify one, the scheduler assigns your job to the default partition, normal.

| Partition | Intended use | Max time | Resource limits | Accessible nodes |
|-----------|--------------|----------|-----------------|------------------|
| debug | short tests and quick debugging | 1:00:00 | Max 1 GPU | gauri, khai, matthew, monster, zixian |
| normal | regular jobs | 4:00:00 | N/A | gauri, khai, matthew, monster, zixian |
| long | long checkpointable jobs | unlimited | N/A | gauri, khai, matthew, monster, zixian |
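As a rough illustration of how the limits in the table translate into a choice, the sketch below picks a partition from an expected runtime in minutes. The function name pick_partition and the "debug" purpose argument are hypothetical, and the 60 and 240 minute thresholds are copied from the table above, so they may drift from the live configuration.

```shell
#!/bin/bash
# Hypothetical helper: suggest a partition from an expected runtime in minutes.
# Thresholds mirror the table above (debug: 1 hour, normal: 4 hours) and are
# assumptions that may not match the live Slurm configuration.
pick_partition() {
    local minutes="$1"
    local purpose="${2:-regular}"
    if [ "$purpose" = "debug" ] && [ "$minutes" -le 60 ]; then
        echo debug
    elif [ "$minutes" -le 240 ]; then
        echo normal
    else
        echo long
    fi
}

pick_partition 180   # prints: normal
```

A job that needs longer than the normal limit falls through to long, which is why such jobs should be checkpointable.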

Debug

The debug partition is designed for short runs, debugging sessions, and small interactive checks.

sbatch -A <account> -p debug job.sh

Normal

The normal partition is designed for regular workloads. Default to this partition for most CPU and GPU work, as long as it stays within the time limits.

sbatch -A <account> -p normal job.sh

Long

The long partition is designed for long checkpointable jobs that can survive interruption and restart cleanly. Long jobs should write output incrementally and checkpoint often enough that a restart is practical. If you use long, assume the job may be interrupted at any point and requeued, so it must be able to resume from its last checkpoint.

sbatch -A <account> -p long --requeue job.sh
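One possible shape for a restartable job script is sketched below. It is a minimal sketch, not a prescribed Atlantis template: the checkpoint file name, the step count, and the helper functions load_step, save_step, and run_steps are all hypothetical, and the "work" is a placeholder echo. The idea is that each completed step is recorded, so a requeued run continues from the last recorded step instead of starting over.

```shell
#!/bin/bash
#SBATCH --partition=long
#SBATCH --requeue
#SBATCH --output=long-%j.log
#SBATCH --open-mode=append    # keep earlier log output when the job is requeued

# Hypothetical checkpoint location and workload size; adjust for your job.
CKPT_FILE="${CKPT_FILE:-checkpoint.txt}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

# Print the last completed step, or 0 when no checkpoint exists yet.
load_step() {
    if [ -f "$CKPT_FILE" ]; then
        cat "$CKPT_FILE"
    else
        echo 0
    fi
}

# Write to a temporary file and rename it, so an interruption
# never leaves a half-written checkpoint behind.
save_step() {
    printf '%s\n' "$1" > "$CKPT_FILE.tmp"
    mv "$CKPT_FILE.tmp" "$CKPT_FILE"
}

# Run steps from the last checkpoint up to the limit given as $1.
run_steps() {
    local limit="$1"
    local step
    step="$(load_step)"
    while [ "$step" -lt "$limit" ]; do
        step=$((step + 1))
        echo "running step $step"   # placeholder for real work
        save_step "$step"
    done
}

# Slurm sets SLURM_RESTART_COUNT on requeued jobs; default to 0 otherwise.
echo "restart count: ${SLURM_RESTART_COUNT:-0}"
run_steps "$TOTAL_STEPS"
```

Because the script only reads and writes its own checkpoint file, it can be tried outside Slurm with bash before it is submitted.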

How to check partition limits and which nodes they cover

Slurm configuration can change over time. To get authoritative, live information about time limits, memory limits, node membership, and other constraints for a partition, run the following commands on the cluster:

# show partition configuration (MaxTime, DefaultTime, Nodes, State, etc.)
scontrol show partition <partition-name>

# list nodes with partition membership and per-node resources
sinfo -Nel

# filter the node list for a single partition (human-readable)
sinfo -p <partition-name> -Nel

# or use a custom sinfo format: partition, hostname, node count, GRES, time limit, memory (MB)
sinfo -o "%P %n %D %G %l %m"

The commands above show which node hostnames belong to each partition. You can then cross-reference those hostnames with the Cluster details page to see CPU/GPU models.

When in doubt, ask the cluster administrators or run scontrol show partition for the most current limits.
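If you want a single limit in a script rather than the full scontrol listing, a small filter can pull one KEY=value field out of the output. The helper below is a hypothetical sketch (partition_field is not a Slurm command); it assumes the whitespace-separated KEY=value layout that scontrol show partition normally prints.

```shell
#!/bin/bash
# Hypothetical helper: read `scontrol show partition` output on stdin and
# print the value of the first field whose key matches $1.
partition_field() {
    local key="$1"
    awk -v key="$key" '{
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")
            if (eq > 0 && substr($i, 1, eq - 1) == key) {
                print substr($i, eq + 1)
                exit
            }
        }
    }'
}

# Usage on the cluster (not run here):
#   scontrol show partition debug | partition_field MaxTime
```

Reading from stdin keeps the helper independent of Slurm itself, so it can be checked against saved output.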

Job Submission

Slurm provides several commands for running work. Here are the common ones and when to use them:

  • sbatch: submit a batch job to the scheduler. Use this for non-interactive, scripted runs. The scheduler queues the job and runs it when resources become available. Example:
sbatch -A <account> job.sh
  • srun: request resources and run a command as soon as they are granted; your terminal blocks until then. Inside an existing allocation, srun launches tasks on the allocated nodes, and with --pty it gives you a short interactive session. Example (interactive shell):
srun -A <account> -p debug --time=00:30:00 --pty bash
  • salloc: obtain an allocation (a set of nodes) and then run commands inside that allocation. salloc is useful when you want to reserve resources first and then use srun or SSH to interact with the allocated nodes. Example:
salloc -A <account> -p debug --gres=gpu:1 --time=00:30:00
# you are now inside an allocation
# any srun commands will use the allocated resources
srun --pty bash

Quick guidance:

  • Use sbatch for most production and research runs where the job can run unattended.
  • Use srun for short interactive tests or to launch tasks inside an allocation.
  • Use salloc when you need an allocation ahead of time (for debugging, interactive workflows, or manual experiments).
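The sbatch examples on this page submit a file called job.sh without showing one. A minimal sketch of such a script might look like the following; the job name, time limit, log file name, and the job_summary helper are illustrative assumptions, not Atlantis requirements. Lines starting with #SBATCH are read by sbatch as options, and options given on the sbatch command line (such as -A or -p) take precedence over them.

```shell
#!/bin/bash
#SBATCH --job-name=example          # hypothetical job name
#SBATCH --partition=normal
#SBATCH --time=00:10:00             # illustrative limit, well under the partition maximum
#SBATCH --output=example-%j.log     # %j expands to the job ID

# Summarize where the job is running; the fallbacks apply outside Slurm.
job_summary() {
    echo "job=${SLURM_JOB_ID:-none} partition=${SLURM_JOB_PARTITION:-none} host=$(uname -n)"
}

job_summary
echo "work goes here"   # placeholder for the real workload
```

Since the #SBATCH lines are ordinary comments to bash, the script can also be run directly to check for syntax errors before submitting it.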

Accounts and projects

Every job must be charged to a Slurm account. When you submit, include the account with -A <account>, for example:

sbatch -A my_project -p normal job.sh

If you are not sure which account belongs to your work, check your project membership or ask the cluster administrators before you launch a larger run.
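One way to look up your accounts yourself, assuming sacctmgr is available to regular users on Atlantis (this page does not confirm that), is to list your associations in Slurm's accounting database. The wrapper below is a hypothetical sketch; list_accounts is not a Slurm command.

```shell
#!/bin/bash
# Hypothetical helper: list the distinct Slurm accounts associated with a
# user (defaults to the current user). Assumes sacctmgr is on the PATH.
list_accounts() {
    local user="${1:-$(id -un)}"
    # -n drops the header line, -P prints plain delimiter-separated output
    sacctmgr -nP show associations user="$user" format=account | sort -u
}

# Usage on the cluster (not run here):
#   list_accounts
#   sbatch -A <one-of-the-listed-accounts> -p normal job.sh
```

If the command returns nothing or is not permitted, fall back to asking the cluster administrators.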