Slurm
Slurm is the scheduler that decides where your work runs and for how long. On Atlantis, almost everything you do with compute resources should go through Slurm. You will NOT have access to most compute resources unless Slurm allocates them to you.
Partitions
Partitions define the types of jobs that can run on the cluster, and they come with different limits and policies. When you submit a job, you need to pick a partition that matches your needs.
You specify a partition with -p <partition> when you submit a job. If you do not specify one, the scheduler assigns your job to the default partition, normal.
| Partition | Intended use | Max time | Resource Limits | Accessible nodes |
|---|---|---|---|---|
| debug | short tests and quick debugging | 1:00:00 | Max 1 GPU | gauri, khai, matthew, monster, zixian |
| normal | regular jobs | 4:00:00 | N/A | gauri, khai, matthew, monster, zixian |
| long | long checkpointable jobs | unlimited | N/A | gauri, khai, matthew, monster, zixian |
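As a rough illustration of how the time limits in the table map to a partition choice, here is a minimal sketch. The `pick_partition` function is a hypothetical helper, not something installed on the cluster. It looks only at the requested walltime in minutes and ignores the intended use of each partition and the 1 GPU limit on debug, so treat its answer as a starting point.

```shell
#!/bin/bash
# Hypothetical helper: suggest a partition for a requested walltime (minutes).
# Thresholds mirror the table above: debug allows 1 hour, normal allows 4 hours.
pick_partition() {
    minutes=$1
    if [ "$minutes" -le 60 ]; then
        echo debug
    elif [ "$minutes" -le 240 ]; then
        echo normal
    else
        echo long
    fi
}

# Print (without running) the submit command for a 90-minute job.
# <account> is a placeholder; --time=90 requests 90 minutes.
echo "sbatch -A <account> -p $(pick_partition 90) --time=90 job.sh"
```

If the limits change, the thresholds need to be updated by hand; the live values come from scontrol show partition.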
Debug
The debug partition is designed for short runs, debugging sessions, and small interactive checks.
sbatch -A <account> -p debug job.sh
Normal
The normal partition is designed for regular workloads. Default to this partition for most CPU and GPU work, as long as it stays within the time limits.
sbatch -A <account> -p normal job.sh
Long
The long partition is designed for long checkpointable jobs that can survive interruption and restart cleanly. Long jobs should write output incrementally and checkpoint often enough that a restart is practical. If you use long, assume the job may be cancelled and resumed later.
sbatch -A <account> -p long --requeue job.sh
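The checkpoint and restart behaviour described above could look roughly like the following job script. This is a sketch under assumptions: `run_steps`, the checkpoint file name, and the step count are all hypothetical, and each "step" stands in for a unit of real work. The point is the pattern: read the last completed step on start, and record progress after every step.

```shell
#!/bin/bash
#SBATCH --partition=long
#SBATCH --requeue

# run_steps CKPT_FILE TOTAL_STEPS
# Hypothetical resumable loop; replace the echo with real work.
run_steps() {
    ckpt=$1
    total=$2
    step=0
    # Resume from the last completed step if a checkpoint file exists.
    if [ -f "$ckpt" ]; then
        step=$(cat "$ckpt")
    fi
    while [ "$step" -lt "$total" ]; do
        echo "running step $step"
        step=$((step + 1))
        # Write to a temporary file, then rename, so an interruption
        # never leaves a half-written checkpoint behind.
        echo "$step" > "$ckpt.tmp"
        mv "$ckpt.tmp" "$ckpt"
    done
}

# Only start the loop inside a Slurm job (Slurm sets SLURM_JOB_ID).
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_steps "checkpoint.txt" 5
fi
```

When the job is requeued, the same script runs again from the top, finds the checkpoint file, and continues from the recorded step instead of starting over.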
How to check partition limits and which nodes they cover
Slurm configuration can change over time. To get authoritative, live information about time limits, memory limits, node membership, and other constraints for a partition, run the following commands on the cluster:
# show partition configuration (MaxTime, DefaultTime, Nodes, State, etc.)
scontrol show partition <partition-name>
# list nodes with partition membership and per-node resources
sinfo -Nel
# filter the node list for a single partition (human-readable)
sinfo -p <partition-name> -Nel
# or use sinfo formatting to show partition, hostname, node count, GRES, time limit and memory
sinfo -o "%P %n %D %G %l %m"
The commands above show exactly which node hostnames belong to each partition; you can then cross-reference those hostnames with the Cluster details page to see CPU/GPU models.
When in doubt, ask the cluster administrators or run scontrol show partition for the most current limits.
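scontrol show partition prints its settings as Key=Value pairs, so a single value such as MaxTime can be pulled out with standard text tools. The `partition_field` function below is a hypothetical helper sketched for that purpose; it assumes the value you want contains no spaces.

```shell
#!/bin/bash
# Hypothetical helper: read Key=Value text on stdin and print the value
# for the requested key (for example MaxTime or Nodes).
partition_field() {
    # Put each Key=Value pair on its own line, then keep only the match.
    tr ' ' '\n' | sed -n "s/^$1=//p"
}

# Usage on the cluster (not run here):
#   scontrol show partition <partition-name> | partition_field MaxTime
#   scontrol show partition <partition-name> | partition_field Nodes
```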
Job Submission
Slurm provides several commands for running work. Here are the common ones and when to use them:
sbatch: submit a batch job to the scheduler. Use this for non-interactive, scripted runs. The scheduler queues the job and runs it when resources become available. Example:
sbatch -A <account> job.sh
srun: request resources and run a command immediately. When used inside an allocation or with appropriate flags, srun launches tasks on compute nodes and can be used for short interactive sessions or parallel launches. Example (interactive shell):
srun -A <account> -p debug --time=00:30:00 --pty bash
salloc: obtain an allocation (a set of nodes) and then run commands inside that allocation. salloc is useful when you want to reserve resources first and then use srun or SSH to interact with the allocated nodes. Example:
salloc -A <account> -p debug --gres=gpu:1 --time=00:30:00
# you are now inside an allocation
# any srun commands will use the allocated resources
srun --pty bash
Quick guidance:
- Use sbatch for most production and research runs where the job can run unattended.
- Use srun for short interactive tests or to launch tasks inside an allocation.
- Use salloc when you need an allocation ahead of time (for debugging, interactive workflows, or manual experiments).
Accounts and projects
Every job must be charged to a Slurm account. When you submit, include the account with -A <account>, for example:
sbatch -A my_project -p normal job.sh
If you are not sure which account belongs to your work, check your project membership or ask the cluster administrators before you launch a larger run.
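The commands on this page submit a script called job.sh. A minimal version might look like the sketch below, which sets the account, partition, and time limit as #SBATCH directives instead of command-line flags. The job name, the account my_project, and the `describe_job` function are placeholders for illustration; flags given on the sbatch command line take precedence over directives in the script.

```shell
#!/bin/bash
# Placeholder job name shown in the queue.
#SBATCH --job-name=example
# Placeholder account; use the Slurm account for your project.
#SBATCH --account=my_project
#SBATCH --partition=normal
# Requested walltime; must fit within the partition's max time.
#SBATCH --time=01:00:00
# Output file pattern: %x is the job name, %j is the job ID.
#SBATCH --output=%x-%j.out

# Log where the job is running; Slurm sets these variables inside a job.
describe_job() {
    echo "job ${SLURM_JOB_ID:-none} on partition ${SLURM_JOB_PARTITION:-none}"
}

describe_job
# Replace the line below with the real workload.
echo "work goes here"
```

With the directives in place, the script can be submitted as sbatch job.sh, and a single setting can still be overridden at submit time, for example with -p debug.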