Atlantis User Guide

Atlantis is a computing cluster built and managed by students at the UCSD Supercomputing Club.

If you are new to the cluster, start with Quick Start. If you already know the basics, jump straight to the page you need from the summary.

Quick Start

If you just want to get moving, the basic flow is straightforward: make sure your account is ready, connect to a login node, and submit a small Slurm job before you try anything larger.

1. Set up your account

Once cluster administrators have provisioned your account, log in to the user portal with your temporary password. From there, you can reset your password and add your SSH public keys. SSH password authentication is disabled on cluster nodes, so you need a registered key to connect, but you still need a password for the user portal and for other club-hosted services that rely on SSO.

2. Log in to the cluster

Login nodes are meant for editing files, preparing job scripts, checking status, and doing light interactive work. They are not meant for long training runs, large simulations, or benchmarks.

ssh <username>@132.249.248.230

3. Submit a first job

Create a small job script called job.sh:

#!/bin/bash
#SBATCH -A <account>
#SBATCH -p normal
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

hostname
python my_script.py

Submit it with:

sbatch job.sh

If you want a more comprehensive introduction to using Slurm, this Slurm quick start guide is a solid start.

Cluster Details

Atlantis is made up of a small set of shared compute nodes, each with a different mix of CPUs and GPUs. That matters when you are choosing where to run a job, because the resources you ask for should line up with the hardware that can actually satisfy them.

Compute nodes

| Node | Type | CPUs | Cores | GPUs |
|---|---|---|---|---|
| monster | compute | Intel Xeon CPU E5-2699v4 | 44 | 4 x AMD Instinct MI210 |
| matthew | compute | AMD EPYC 7742 | 64 | 2 x AMD Instinct MI210, 2 x NVIDIA P100 |
| gauri | compute | Intel Xeon CPU E5-2680v4 | 28 | 6 x NVIDIA GTX 980 Ti |
| zixian | compute | Intel Xeon Platinum 8176 | 56 | 8 x NVIDIA RTX 2080 Ti |
| khai | compute | Intel Xeon Gold 6138 | 40 | N/A |

Login nodes

Login nodes are the gateway to the cluster. They are where you prepare your work, check on jobs, and do light interactive work. They are not meant for long training runs, large simulations, or benchmarks.

You should reach the cluster through the login node first:

ssh <username>@132.249.248.230

Direct SSH access to a compute node is only available while you have an active Slurm job or allocation on that node. In practice, that means you cannot just pick a compute node and log in whenever you want. You need to earn that access through the scheduler first.

To see where your jobs are and what they are doing, use:

squeue -u $USER

Slurm

Slurm is the scheduler that decides where your work runs and for how long. On Atlantis, almost everything you do with compute resources should go through Slurm. You will NOT have access to most compute resources unless allocated through Slurm.

Partitions

Partitions define the types of jobs that can run on the cluster, and they come with different limits and policies. When you submit a job, you need to pick a partition that matches your needs.

You specify a partition with -p <partition> when you submit a job. If you do not specify one, the scheduler assigns your job to the default partition, normal.

| Partition | Intended use | Max time | Resource limits | Accessible nodes |
|---|---|---|---|---|
| debug | short tests and quick debugging | 1:00:00 | Max 1 GPU | gauri, khai, matthew, monster, zixian |
| normal | regular jobs | 4:00:00 | N/A | gauri, khai, matthew, monster, zixian |
| long | long checkpointable jobs | unlimited | N/A | gauri, khai, matthew, monster, zixian |
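
As a rough illustration of how the limits in the table combine, the sketch below lists the partitions whose limits would admit a given request. The function name partitions_for is hypothetical, and the limits are hard-coded from the table above, so they may be out of date. Treat the live scheduler configuration as authoritative if the two disagree.

```shell
#!/bin/bash
# Hypothetical helper: list partitions whose limits (copied from the table
# above, and possibly out of date) admit a request.
# Usage: partitions_for <walltime-minutes> [gpu-count]
partitions_for() {
    local minutes="$1" gpus="${2:-0}"
    local -a fits=()
    # debug: at most 1 hour and at most 1 GPU
    if (( minutes <= 60 && gpus <= 1 )); then fits+=(debug); fi
    # normal: at most 4 hours
    if (( minutes <= 240 )); then fits+=(normal); fi
    # long: no time limit is listed, so it always fits
    fits+=(long)
    echo "${fits[*]}"
}
```

For example, partitions_for 30 2 skips debug because of the one-GPU limit. Fitting within a partition's limits does not make it the right choice: debug is still meant for tests, and long for jobs that can restart.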

Debug

The debug partition is designed for short runs, debugging sessions, and small interactive checks.

sbatch -A <account> -p debug job.sh

Normal

The normal partition is designed for regular workloads. Default to this partition for most CPU and GPU work, as long as it stays within the time limits.

sbatch -A <account> -p normal job.sh

Long

The long partition is designed for long checkpointable jobs that can survive interruption and restart cleanly. Long jobs should write output incrementally and checkpoint often enough that a restart is practical. If you use long, assume the job may be cancelled and resumed later.

sbatch -A <account> -p long --requeue job.sh
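
To make the checkpointing advice concrete, here is a minimal sketch of a restartable loop. The function name run_steps, the checkpoint file layout (a single number holding the last completed step), and the step count are all assumptions made for illustration. A real job would save whatever state its application needs.

```shell
#!/bin/bash
# Hypothetical restartable loop: resume from the last completed step.
# Usage: run_steps <checkpoint-file> <total-steps>
run_steps() {
    local ckpt="$1" total="$2" step=0
    # Resume from the last completed step if a checkpoint exists
    if [[ -f "$ckpt" ]]; then
        step="$(<"$ckpt")"
    fi
    while (( step < total )); do
        step=$(( step + 1 ))
        # ... one unit of real work would go here ...
        # Write to a temp file and rename it, so an interruption in the
        # middle of a write cannot leave a corrupt checkpoint behind
        echo "$step" > "$ckpt.tmp"
        mv "$ckpt.tmp" "$ckpt"
    done
    echo "finished at step $step"
}
```

Because a requeued job runs its script again from the top, reading the checkpoint at the start of the script is what lets the second run skip work the first run already finished.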

How to check partition limits and which nodes they cover

Slurm configuration can change over time. To get authoritative, live information about time limits, memory limits, node membership, and other constraints for a partition, run the following commands on the cluster:

# show partition configuration (MaxTime, DefaultTime, Nodes, State, etc.)
scontrol show partition <partition-name>

# list nodes with partition membership and per-node resources
sinfo -Nel

# filter the node list for a single partition (human-readable)
sinfo -p <partition-name> -Nel

# or use sinfo formatting to show partition, hostname, node count, GRES, time limit, and memory
sinfo -o "%P %n %D %G %l %m"

The commands above show exactly which hostnames belong to each partition. You can then cross-reference those hostnames with the Cluster Details page to see the CPU and GPU models.

When in doubt, ask the cluster administrators or run scontrol show partition for the most current limits.

Job Submission

Slurm provides several commands for running work. Here are the common ones and when to use them:

  • sbatch: submit a batch job to the scheduler. Use this for non-interactive, scripted runs. The scheduler queues the job and runs it when resources become available. Example:
sbatch job.sh
  • srun: request resources and run a command as soon as they are granted. srun blocks in your terminal until the resources are available. Inside an allocation it launches tasks on the allocated compute nodes, and with --pty it can start a short interactive session. Example (interactive shell):
srun -A <account> -p debug --time=00:30:00 --pty bash
  • salloc: obtain an allocation (a set of nodes) and then run commands inside that allocation. salloc is useful when you want to reserve resources first and then use srun or SSH to interact with the allocated nodes. Example:
salloc -A <account> -p debug --gres=gpu:1 --time=00:30:00
# you are now inside an allocation
# any srun commands will use the allocated resources
srun --pty bash

Quick guidance:

  • Use sbatch for most production and research runs where the job can run unattended.
  • Use srun for short interactive tests or to launch tasks inside an allocation.
  • Use salloc when you need an allocation ahead of time (for debugging, interactive workflows, or manual experiments).
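
When you script around sbatch, it helps to capture the job ID at submission time. The sketch below uses sbatch's --parsable flag for that. The wrapper name submit_job is hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper: submit a job and print only its numeric job ID.
# Usage: submit_job <sbatch arguments...>
submit_job() {
    local out
    # --parsable makes sbatch print "jobid" or "jobid;cluster" instead of
    # the usual "Submitted batch job <jobid>" sentence
    out="$(sbatch --parsable "$@")" || return 1
    # Drop the optional ";cluster" suffix
    echo "${out%%;*}"
}
```

A script could then store the result of submit_job -A <account> -p debug job.sh in a variable and pass it to squeue -j or sacct -j later.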

Accounts and projects

Every job must be charged to a Slurm account. When you submit, include the account with -A <account>, for example:

sbatch -A my_project -p normal job.sh

If you are not sure which account belongs to your work, check your project membership or ask the cluster administrators before you launch a larger run.

Requesting Resources

This page explains how to request CPUs, memory, runtime, and GPUs for Slurm jobs.

CPUs, memory, and time

Declare CPUs, memory, and runtime with sbatch flags when you submit. The following example requests 4 CPU cores, 16 GB of memory, and 2 hours on the normal partition:

sbatch -A <account> -p normal --cpus-per-task=4 --mem=16G --time=02:00:00 job.sh

Slurm accepts several time formats, including days-hours:minutes:seconds, hours:minutes:seconds, and plain minutes. The examples below use the --time flag to request 30 minutes, 2 hours, and 1 day, respectively:

sbatch -A <account> -p normal --time=30 job.sh
sbatch -A <account> -p normal --time=02:00:00 job.sh
sbatch -A <account> -p long --time=1-00:00:00 job.sh
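
To compare a request against a partition's time limit, it can be handy to turn a --time value into minutes. The sketch below does that for the formats listed above, plus the shorter minutes:seconds, days-hours, and days-hours:minutes forms that sbatch also accepts. The function name slurm_time_to_minutes is hypothetical, and rounding seconds up to a whole minute is a choice made here, not something Slurm does.

```shell
#!/bin/bash
# Hypothetical helper: convert a Slurm --time value to whole minutes.
# Any seconds are rounded up to the next minute.
slurm_time_to_minutes() {
    local spec="$1" days=0 hours=0 minutes=0 seconds=0 rest
    if [[ "$spec" == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days="${spec%%-*}"
        rest="${spec#*-}"
        IFS=: read -r hours minutes seconds <<< "$rest"
    else
        case "$spec" in
            *:*:*) IFS=: read -r hours minutes seconds <<< "$spec" ;;
            *:*)   IFS=: read -r minutes seconds <<< "$spec" ;;
            *)     minutes="$spec" ;;
        esac
    fi
    # 10# forces base 10 so values such as "08" are not parsed as octal
    echo $(( 10#${days:-0} * 1440 + 10#${hours:-0} * 60 \
             + 10#${minutes:-0} + (10#${seconds:-0} + 59) / 60 ))
}
```

With this helper, 02:00:00 becomes 120 and 1-00:00:00 becomes 1440, which makes it easy to see that a one-day request is longer than a 4-hour limit.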

Be conservative with requests during testing, then scale up once the job behavior is confirmed.

GPU jobs

GPU jobs must request GPUs explicitly so the scheduler can place the job on compatible hardware.

If your code runs on any GPU, request a generic GPU:

sbatch -A <account> -p debug --gres=gpu:1 --wrap='hostname'

Request multiple GPUs:

sbatch -A <account> -p normal --gres=gpu:2 job.sh

If you need a specific GPU model, request it explicitly:

sbatch -A <account> -p normal --gres=gpu:mi210:1 job.sh
sbatch -A <account> -p normal --gres=gpu:rtx2080ti:1 job.sh
sbatch -A <account> -p normal --gres=gpu:gtx980ti:1 job.sh
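
Inside a GPU job, it is worth confirming that the job can see the number of GPUs you requested before starting real work. The sketch below assumes Slurm exports CUDA_VISIBLE_DEVICES (for NVIDIA GPUs) or ROCR_VISIBLE_DEVICES (for AMD GPUs) as a comma-separated list. Whether it does depends on how GPU scheduling is configured on the cluster, so this is an assumption rather than documented Atlantis behavior. The function name visible_gpu_count is hypothetical.

```shell
#!/bin/bash
# Hypothetical check: count the GPUs this job can see.
# Assumes the scheduler exports CUDA_VISIBLE_DEVICES or
# ROCR_VISIBLE_DEVICES as a comma-separated list of device indices.
visible_gpu_count() {
    local ids="${CUDA_VISIBLE_DEVICES:-${ROCR_VISIBLE_DEVICES:-}}"
    local -a list=()
    if [[ -n "$ids" ]]; then
        # Split the list on commas into an array
        IFS=, read -r -a list <<< "$ids"
    fi
    echo "${#list[@]}"
}
```

A job script could call visible_gpu_count near the top and exit early if the result does not match the --gres request.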

SBATCH directives

All of the flags shown above (for CPUs, memory, time, and GPUs) can instead be placed inside a job script using #SBATCH directives. For example, the sbatch command:

sbatch -A <account> -p normal --cpus-per-task=4 --mem=16G --time=02:00:00 job.sh

is equivalent to putting the following lines in job.sh:

#!/bin/bash
#SBATCH -A <account>
#SBATCH -p normal
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00

# your commands here
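
One detail worth knowing is that sbatch only reads #SBATCH lines that appear before the first command in the script, and flags given on the command line take precedence over directives in the script. The sketch below mimics the first rule to show which directives in a script would take effect. The function name list_sbatch_directives is hypothetical, and the parsing is a simplification of what sbatch really does.

```shell
#!/bin/bash
# Hypothetical helper: print the #SBATCH directives that sbatch would read,
# i.e. those that appear before the first real command in the script.
# Usage: list_sbatch_directives <job-script>
list_sbatch_directives() {
    local line
    while IFS= read -r line || [[ -n "$line" ]]; do
        if [[ "$line" == "#SBATCH "* ]]; then
            # Print the directive without its "#SBATCH " prefix
            printf '%s\n' "${line#"#SBATCH "}"
        elif [[ "$line" == "#"* || "$line" =~ ^[[:space:]]*$ ]]; then
            continue    # shebang, other comments, blank lines
        else
            break       # first real command: later directives are ignored
        fi
    done < "$1"
}
```

If a directive is missing from the output, it is probably placed below a command and needs to move to the top of the script.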

Practical tips

  • Test on debug before running large GPU workloads.
  • Request only the GPUs you actually need; do not ask for all GPUs on a node unless required.
  • If your job depends on a specific software stack (e.g., a particular CUDA or ROCm version), load and verify the correct modules in your script before starting the application.

Interactive Sessions

Interactive sessions let you run commands and debug directly on compute nodes. Use them for lightweight exploration, testing, and interactive development.

Using srun for interactive shells

GPU interactive session:

srun -A <account> -p debug --gres=gpu:1 --time=00:30:00 --pty bash

CPU-only interactive session:

srun -A <account> -p debug --time=00:30:00 --pty bash

Using salloc for an allocation

If you prefer to obtain an allocation and then connect to the node yourself, request an allocation with salloc:

salloc -A <account> -p debug --gres=gpu:1 --time=00:30:00

After the allocation is granted, find your job with:

squeue -u $USER

Then SSH to the allocated node:

ssh <allocated-node>

You may only SSH into a compute node while that node is reserved for your active job or allocation.
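
The lookup step can be scripted. The sketch below asks squeue for the node list of one running job. The function name job_node is hypothetical, and the output is only directly usable as an SSH target when the job runs on a single node.

```shell
#!/bin/bash
# Hypothetical helper: print the node list of a running job.
# Usage: job_node <jobid>
job_node() {
    # -h drops the header line, -t RUNNING keeps only running jobs,
    # and %N prints the nodes allocated to the job
    squeue -h -j "$1" -t RUNNING -o "%N"
}
```

For a single-node allocation, ssh "$(job_node <jobid>)" then connects to the allocated node. An empty result means the job is not running yet.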

Monitoring and Accounting

Once a job is running, the next question is usually where it is, how much time it has left, and whether it finished the way you expected. Slurm gives you a few commands that cover most of that day-to-day checking.

Cluster status

See what partitions and nodes are available with:

sinfo

Active jobs

Check your own jobs with:

squeue -u $USER

That is usually the first command to run when something seems delayed or when you want to confirm which node your allocation landed on.

Completed jobs

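List your recent jobs, including completed ones, with the command below. By default sacct only reports jobs since midnight of the current day; add -S <date> to look further back.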

sacct -u $USER

To focus on one job:

sacct -j <jobid>

If you want a more detailed accounting line, use:

sacct -j <jobid> --format=JobID,JobName,Account,Partition,QOS,State,Elapsed,AllocTRES,ExitCode
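
For scripts that need to react to how a job ended, sacct can print a compact, machine-readable line. The sketch below is one way to check for a clean finish. The function names job_state and job_succeeded are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers: report how a job ended.
# Usage: job_state <jobid>      prints e.g. "COMPLETED|0:0"
#        job_succeeded <jobid>  returns 0 only for a clean finish
job_state() {
    # -X: the allocation line only (no job steps)
    # -n: no header; -P: fields separated by "|"
    sacct -X -n -P -j "$1" --format=State,ExitCode | head -n 1
}

job_succeeded() {
    [[ "$(job_state "$1")" == "COMPLETED|0:0" ]]
}
```

Any other state, such as FAILED, TIMEOUT, or CANCELLED, makes job_succeeded return a non-zero status, which is a reasonable trigger for looking at the job's output files.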

Job inspection

For a scheduler-level summary of a job, this command is usually the most useful:

scontrol show job <jobid>

Software and Storage

The easier your environment is to reproduce, the easier it is to trust the results. That mostly comes down to two things: loading the right software and keeping data in the right place.

Software modules

Software on the cluster is provided through environment modules. The common commands are:

module avail
module list
module load <module>
module unload <module>
module purge

If a job behaves differently from one session to the next, checking which modules are loaded is a good place to start.

Storage

Storage layout may change over time, but this is the general rule of thumb:

| Location | Purpose | Notes |
|---|---|---|
| /home/<user> | personal files, scripts, small configs | not for large datasets or heavy output |
| /projects/<project> | project-shared files | use for shared project data |

Keep large datasets and job output out of /home whenever you can. Project storage is a better fit for shared work, and temporary files should be cleaned up once a job ends.

Node-local files are temporary, so do not rely on them for anything you need after the job is finished.
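
One common pattern is to work in a temporary directory and copy the results somewhere persistent before the job ends. The sketch below shows the idea. The function name run_in_scratch is hypothetical, and it assumes mktemp -d creates its directory on node-local storage, which this guide does not confirm. Check where temporary directories live on Atlantis before relying on it.

```shell
#!/bin/bash
# Hypothetical helper: run a command in a temporary directory, copy what it
# produced to a persistent destination, then clean up.
# Usage: run_in_scratch <destination-dir> <command> [args...]
run_in_scratch() {
    local dest="$1"; shift
    local scratch
    scratch="$(mktemp -d)" || return 1
    # Run the command with the scratch directory as its working directory
    ( cd "$scratch" && "$@" )
    local status=$?
    # Copy everything the command left behind, then remove the scratch copy
    mkdir -p "$dest" && cp -r "$scratch"/. "$dest"/
    rm -rf "$scratch"
    # Report the command's own exit status to the caller
    return "$status"
}
```

In a job script the destination would typically be a directory under /projects/<project>, so the results outlive the job.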

Good Habits

The long partition is built for work that can survive interruption. If you use it, assume the job may be cancelled and restarted later.

Practical habits help a lot here:

  • write output incrementally
  • checkpoint often enough to recover cleanly
  • test restart logic before you launch a long run
  • avoid requesting more resources than the job really needs

For small sanity checks, use debug. For normal production work, use normal. Save long for jobs that genuinely benefit from restartability.

Before you scale up, it is worth doing one small run first:

  1. Test on debug.
  2. Confirm the script works.
  3. Confirm output paths are correct.
  4. Confirm the resource request is reasonable.
  5. Move to normal or long once the job behaves the way you want.