Atlantis User Guide
Atlantis is a computing cluster built and managed by students at the UCSD Supercomputing Club.
If you are new to the cluster, start with Quick Start. If you already know the basics, jump straight to the page you need from the summary.
Quick Start
If you just want to get moving, the basic flow is straightforward: make sure your account is ready, connect to a login node, and submit a small Slurm job before you try anything larger.
1. Set up your account
Once cluster administrators have provisioned your account, log in to the user portal with your temporary password. From there, you can reset your password and add your SSH public keys. SSH password authentication is disabled on cluster nodes, but you still need a password for the user portal and for other club-hosted services that rely on SSO.
2. Log in to the cluster
Login nodes are meant for editing files, preparing job scripts, checking status, and doing light interactive work. They are not meant for long training runs, large simulations, or benchmarks.
ssh <username>@132.249.248.230
3. Submit a first job
Create a small job script called job.sh:
#!/bin/bash
#SBATCH -A <account>
#SBATCH -p normal
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
hostname
python my_script.py
Submit it with:
sbatch job.sh
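When scripting around sbatch, it can help to capture the job ID so later commands can refer to it. The sketch below uses a hypothetical helper name, submit_job, and relies on sbatch's --parsable flag, which prints the job ID (optionally followed by ;<cluster>) instead of the usual sentence.

```bash
#!/bin/bash
# submit_job is a hypothetical helper, not part of Slurm or Atlantis.
# Usage: submit_job <sbatch arguments...>
submit_job() {
    local out
    # --parsable prints "<jobid>" or "<jobid>;<cluster>" instead of a sentence.
    out=$(sbatch --parsable "$@") || return 1
    # Drop the optional ";<cluster>" suffix so only the job ID remains.
    echo "${out%%;*}"
}

# Example, run on a login node:
#   jobid=$(submit_job job.sh)
#   squeue -j "$jobid"
```

Unless the script sets --output, Slurm writes the job's output to slurm-<jobid>.out in the directory you submitted from.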
If you want a more comprehensive introduction to using Slurm, this Slurm quick start guide is a solid start.
Cluster Details
Atlantis is made up of a small set of shared compute nodes, each with a different mix of CPUs and GPUs. That matters when you are choosing where to run a job, because the resources you ask for should line up with the hardware that can actually satisfy them.
Compute nodes
| Node | Type | CPUs | Cores | GPUs |
|---|---|---|---|---|
| monster | compute | Intel Xeon CPU E5-2699v4 | 44 | 4 x AMD Instinct MI210 |
| matthew | compute | AMD EPYC 7742 | 64 | 2 x AMD Instinct MI210, 2 x NVIDIA P100 |
| gauri | compute | Intel Xeon CPU E5-2680v4 | 28 | 6 x NVIDIA GTX 980 Ti |
| zixian | compute | Intel Xeon Platinum 8176 | 56 | 8 x NVIDIA RTX 2080 Ti |
| khai | compute | Intel Xeon Gold 6138 | 40 | N/A |
Login nodes
Login nodes are the gateway to the cluster. They are where you prepare your work, check on jobs, and do light interactive work. They are not meant for long training runs, large simulations, or benchmarks.
You should reach the cluster through the login node first:
ssh <username>@132.249.248.230
Direct SSH access to a compute node is only available while you have an active Slurm job or allocation on that node. In practice, that means you cannot just pick a compute node and log in whenever you want. You need to earn that access through the scheduler first.
To see where your jobs are and what they are doing, use:
squeue -u $USER
Slurm
Slurm is the scheduler that decides where your work runs and for how long. On Atlantis, almost everything you do with compute resources should go through Slurm. You will NOT have access to most compute resources unless allocated through Slurm.
Partitions
Partitions define the types of jobs that can run on the cluster, and they come with different limits and policies. When you submit a job, you need to pick a partition that matches your needs.
You specify a partition with -p <partition> when you submit a job. If you do not specify one, the scheduler will assign your job to the default partition, which is normal.
| Partition | Intended use | Max time | Resource Limits | Accessible nodes |
|---|---|---|---|---|
| debug | short tests and quick debugging | 1:00:00 | Max 1 GPU | gauri, khai, matthew, monster, zixian |
| normal | regular jobs | 4:00:00 | N/A | gauri, khai, matthew, monster, zixian |
| long | long checkpointable jobs | unlimited | N/A | gauri, khai, matthew, monster, zixian |
Debug
The debug partition is designed for short runs, debugging sessions, and small interactive checks.
sbatch -A <account> -p debug job.sh
Normal
The normal partition is designed for regular workloads. Default to this partition for most CPU and GPU work, as long as it stays within the time limits.
sbatch -A <account> -p normal job.sh
Long
The long partition is designed for long checkpointable jobs that can survive interruption and restart cleanly. Long jobs should write output incrementally and checkpoint often enough that a restart is practical. If you use long, assume the job may be cancelled and resumed later.
sbatch -A <account> -p long --requeue job.sh
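As a rough illustration of matching a runtime request to a partition, the sketch below hard-codes the Max time values from the partition table and checks whether a request in minutes fits. The function names are hypothetical, and the limits are a snapshot that may drift from the live configuration.

```bash
#!/bin/bash
# Max walltime per partition in minutes, copied from the partition table
# (0 means unlimited). These are assumptions; the live Slurm config wins.
partition_max_minutes() {
    case "$1" in
        debug)  echo 60 ;;
        normal) echo 240 ;;
        long)   echo 0 ;;
        *)      return 1 ;;
    esac
}

# fits_partition <partition> <minutes>
# Succeeds if the requested runtime is within the partition's limit.
fits_partition() {
    local max
    max=$(partition_max_minutes "$1") || return 2
    [ "$max" -eq 0 ] || [ "$2" -le "$max" ]
}

# Example:
#   fits_partition normal 120 && echo "a 2-hour job fits in normal"
```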
How to check partition limits and which nodes they cover
Slurm configuration can change over time. To get authoritative, live information about time limits, memory limits, node membership, and other constraints for a partition, run the following commands on the cluster:
# show partition configuration (MaxTime, DefaultTime, Nodes, State, etc.)
scontrol show partition <partition-name>
# list nodes with partition membership and per-node resources
sinfo -Nel
# filter the node list for a single partition (human-readable)
sinfo -p <partition-name> -Nel
# or use sinfo formatting to show partition, hostname, node count, GRES, time limit, and memory
sinfo -o "%P %n %D %G %l %m"
The commands above show exactly which node hostnames belong to each partition; you can then cross-reference those hostnames with the Cluster Details page to see CPU/GPU models.
When in doubt, ask the cluster administrators or run scontrol show partition for the most current limits.
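If you want a single limit rather than the full partition record, the Key=Value output of scontrol show partition is easy to filter. The sketch below uses a hypothetical helper, partition_field, and assumes the value you want contains no spaces (true for fields such as MaxTime and Nodes).

```bash
#!/bin/bash
# partition_field <partition> <Key>
# Prints the value of one Key=Value field from `scontrol show partition`.
partition_field() {
    local partition=$1 key=$2
    # Put every Key=Value pair on its own line, then keep only the requested key.
    scontrol show partition "$partition" | tr -s ' ' '\n' | sed -n "s/^${key}=//p"
}

# Example, run on the cluster:
#   partition_field normal MaxTime
#   partition_field debug Nodes
```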
Job Submission
Slurm provides several commands for running work. Here are the common ones and when to use them:
sbatch: submit a batch job to the scheduler. Use this for non-interactive, scripted runs. The scheduler queues the job and runs it when resources become available. Example:
sbatch job.sh
srun: request resources and run a command immediately. When used inside an allocation or with appropriate flags, srun launches tasks on compute nodes and can be used for short interactive sessions or parallel launches. Example (interactive shell):
srun -A <account> -p debug --time=00:30:00 --pty bash
salloc: obtain an allocation (a set of nodes) and then run commands inside that allocation. salloc is useful when you want to reserve resources first and then use srun or SSH to interact with the allocated nodes. Example:
salloc -A <account> -p debug --gres=gpu:1 --time=00:30:00
# you are now inside an allocation
# any srun commands will use the allocated resources
srun --pty bash
Quick guidance:
- Use sbatch for most production and research runs where the job can run unattended.
- Use srun for short interactive tests or to launch tasks inside an allocation.
- Use salloc when you need an allocation ahead of time (for debugging, interactive workflows, or manual experiments).
Accounts and projects
Every job must be charged to a Slurm account. When you submit, include the account with -A <account>, for example:
sbatch -A my_project -p normal job.sh
If you are not sure which account belongs to your work, check your project membership or ask the cluster administrators before you launch a larger run.
Requesting Resources
This page explains how to request CPUs, memory, runtime, and GPUs for Slurm jobs.
CPUs, memory, and time
Declare CPUs, memory, and runtime when submitting with sbatch flags. Example requesting 4 CPU cores, 16G memory, and 2 hours on the normal partition:
sbatch -A <account> -p normal --cpus-per-task=4 --mem=16G --time=02:00:00 job.sh
Slurm accepts several time formats: days-hours:minutes:seconds, hours:minutes:seconds, or plain minutes. Examples using the --time flag:
sbatch -A <account> -p normal --time=30 job.sh
sbatch -A <account> -p normal --time=02:00:00 job.sh
sbatch -A <account> -p long --time=1-00:00:00 job.sh
Be conservative with requests during testing, then scale up once the job behavior is confirmed.
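To compare a --time value against a partition limit, it helps to normalize it to minutes. The sketch below is a hypothetical helper, to_minutes, that handles only the three formats listed above and drops the seconds field; Slurm itself accepts a few more variants.

```bash
#!/bin/bash
# to_minutes <time>
# Converts D-HH:MM:SS, HH:MM:SS, or plain minutes to whole minutes.
# Seconds are ignored. The 10# prefix stops bash from reading "08" as octal.
to_minutes() {
    local t=$1
    if [[ $t =~ ^([0-9]+)-([0-9]+):([0-9]+):([0-9]+)$ ]]; then
        echo $(( 10#${BASH_REMATCH[1]} * 1440 + 10#${BASH_REMATCH[2]} * 60 + 10#${BASH_REMATCH[3]} ))
    elif [[ $t =~ ^([0-9]+):([0-9]+):([0-9]+)$ ]]; then
        echo $(( 10#${BASH_REMATCH[1]} * 60 + 10#${BASH_REMATCH[2]} ))
    elif [[ $t =~ ^[0-9]+$ ]]; then
        echo $(( 10#$t ))
    else
        echo "unsupported time format: $t" >&2
        return 1
    fi
}

# Example:
#   to_minutes 1-00:00:00   # one day expressed in minutes
```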
GPU jobs
GPU jobs must request GPUs explicitly so the scheduler can place the job on compatible hardware.
If your code runs on any GPU, request a generic GPU:
sbatch -A <account> -p debug --gres=gpu:1 --wrap='hostname'
Request multiple GPUs:
sbatch -A <account> -p normal --gres=gpu:2 job.sh
If you need a specific GPU model, request it explicitly:
sbatch -A <account> -p normal --gres=gpu:mi210:1 job.sh
sbatch -A <account> -p normal --gres=gpu:rtx2080ti:1 job.sh
sbatch -A <account> -p normal --gres=gpu:gtx980ti:1 job.sh
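Once a GPU job starts, a quick check of what the job can actually see saves time later. The sketch below prints a few environment variables that Slurm's GPU support commonly sets; whether each one is set on Atlantis depends on the cluster configuration, so treat the variable list as an assumption. report_gpus is a hypothetical name.

```bash
#!/bin/bash
# report_gpus: print the GPU-related environment of the current job.
# Which of these variables are set depends on how Slurm is configured
# (assumption: SLURM_GPUS_ON_NODE, CUDA_VISIBLE_DEVICES for NVIDIA,
# ROCR_VISIBLE_DEVICES for AMD).
report_gpus() {
    echo "node: ${HOSTNAME:-unknown}"
    echo "SLURM_GPUS_ON_NODE=${SLURM_GPUS_ON_NODE:-unset}"
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
    echo "ROCR_VISIBLE_DEVICES=${ROCR_VISIBLE_DEVICES:-unset}"
}

# Call it near the top of a job script, before the real workload:
#   report_gpus
```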
SBATCH directive
All of the flags shown above (for CPUs, memory, time, and GPUs) can instead be placed inside a job script using #SBATCH directives. For example, the sbatch command:
sbatch -A <account> -p normal --cpus-per-task=4 --mem=16G --time=02:00:00 job.sh
is equivalent to putting the following lines in job.sh:
#!/bin/bash
#SBATCH -A <account>
#SBATCH -p normal
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
# your commands here
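Two details are worth knowing when you move flags into a script: sbatch stops reading #SBATCH directives at the first executable command, and flags given on the command line take precedence over the same directives in the script. The sketch below extends the script above with a job name and an output file pattern; the job name "example" and the job_banner helper are hypothetical.

```bash
#!/bin/bash
# <account> is a placeholder; "example" is a hypothetical job name.
#SBATCH -A <account>
#SBATCH -p normal
#SBATCH --job-name=example
# %x expands to the job name and %j to the job ID, e.g. example-1234.out
#SBATCH --output=%x-%j.out
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00

# Print a one-line summary so the log records where and how the job ran.
job_banner() {
    echo "job ${SLURM_JOB_ID:-none} using ${SLURM_CPUS_PER_TASK:-1} CPU(s) on ${HOSTNAME:-unknown}"
}
job_banner

# Let threaded programs match the CPU request instead of guessing.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

# your commands here
```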
Practical tips
- Test on debug before running large GPU workloads.
- Request only the GPUs you actually need; do not ask for all GPUs on a node unless required.
- If your job depends on a specific software stack (e.g., a particular CUDA version), load and verify the correct modules in your script before starting the application.
Interactive Sessions
Interactive sessions let you run commands and debug directly on compute nodes. Use them for lightweight exploration, testing, and interactive development.
Using srun for interactive shells
GPU interactive session:
srun -A <account> -p debug --gres=gpu:1 --time=00:30:00 --pty bash
CPU-only interactive session:
srun -A <account> -p debug --time=00:30:00 --pty bash
Using salloc for an allocation
If you prefer to obtain an allocation and then connect to the node yourself, request an allocation with salloc:
salloc -A <account> -p debug --gres=gpu:1 --time=00:30:00
After the allocation is granted, find your job with:
squeue -u $USER
Then SSH to the allocated node:
ssh <allocated-node>
You may only SSH into a compute node while that node is reserved for your active job or allocation.
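Rather than reading the node name off the full squeue table, you can ask squeue for just that column. The sketch below uses a hypothetical helper, job_nodes, built on squeue's -h (no header) and -o "%N" (allocated nodes) options.

```bash
#!/bin/bash
# job_nodes <jobid>
# Prints the node(s) allocated to a job, with no header line.
job_nodes() {
    squeue -h -j "$1" -o "%N"
}

# Example for a single-node allocation:
#   ssh "$(job_nodes <jobid>)"
```

For a multi-node job the output is a compact node list rather than a single hostname, so pick one node from it before passing it to ssh.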
Monitoring and Accounting
Once a job is running, the next question is usually where it is, how much time it has left, and whether it finished the way you expected. Slurm gives you a few commands that cover most of that day-to-day checking.
Cluster status
See what partitions and nodes are available with:
sinfo
Active jobs
Check your own jobs with:
squeue -u $USER
That is usually the first command to run when something seems delayed or when you want to confirm which node your allocation landed on.
Completed jobs
By default, sacct only reports jobs since midnight of the current day; add -S <start-time> to look further back. List your own jobs with:
sacct -u $USER
To focus on one job:
sacct -j <jobid>
If you want a more detailed accounting line, use:
sacct -j <jobid> --format=JobID,JobName,Account,Partition,QOS,State,Elapsed,AllocTRES,ExitCode
Job inspection
For a scheduler-level summary of a job, this command is usually the most useful:
scontrol show job <jobid>
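In scripts it is often enough to know a job's final state. The sketch below wraps sacct in a hypothetical helper, job_state, using -X (the job itself, not its steps), -n (no header), and -P (unpadded output).

```bash
#!/bin/bash
# job_state <jobid>
# Prints the job's state as one word, e.g. COMPLETED, FAILED, or CANCELLED.
job_state() {
    local state
    state=$(sacct -j "$1" -X -n -P --format=State | head -n 1)
    # sacct can report "CANCELLED by <uid>"; keep only the first word.
    echo "${state%% *}"
}

# Example:
#   [ "$(job_state <jobid>)" = "COMPLETED" ] && echo "job finished cleanly"
```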
Software and Storage
The easier your environment is to reproduce, the easier it is to trust the results. That mostly comes down to two things: loading the right software and keeping data in the right place.
Software modules
Software on the cluster is provided through environment modules. The common commands are:
module avail
module list
module load <module>
module unload <module>
module purge
If a job behaves differently from one session to the next, checking which modules are loaded is a good place to start.
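One way to keep the module environment consistent is to reset it at the top of every job script. The sketch below is a hypothetical helper, setup_modules; the module names you pass in are whatever module avail lists on the cluster, and none are assumed here.

```bash
#!/bin/bash
# setup_modules <module>...
# Starts from a clean module environment, loads the given modules,
# and records the result in the job log.
setup_modules() {
    if ! type module >/dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 1
    fi
    module purge
    local m
    for m in "$@"; do
        module load "$m" || return 1
    done
    # module list usually writes to stderr, so merge it into stdout for the log.
    module list 2>&1
}

# Example inside a job script:
#   setup_modules <module>
```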
Storage
Storage layout may change over time, but this is the general rule of thumb:
| Location | Purpose | Notes |
|---|---|---|
/home/<user> | personal files, scripts, small configs | not for large datasets or heavy output |
/projects/<project> | project-shared files | use for shared project data |
Keep large datasets and job output out of /home whenever you can. Project storage is a better fit for shared work, and temporary files should be cleaned up once a job ends.
Node-local files are temporary, so do not rely on them for anything you need after the job is finished.
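A common pattern is to work in a temporary directory on the node and copy results to persistent storage before the job ends. The sketch below assumes node-local temporary space is reachable through $TMPDIR or /tmp, which is not confirmed for Atlantis; run_with_scratch is a hypothetical helper.

```bash
#!/bin/bash
# run_with_scratch <dest-dir> <command...>
# Runs a command in a fresh temporary directory, then copies everything
# it produced to <dest-dir>. Assumes $TMPDIR or /tmp is node-local.
run_with_scratch() {
    local dest=$1 scratch status
    shift
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/job-${SLURM_JOB_ID:-manual}-XXXXXX") || return 1
    # Run the command inside the scratch directory and remember its exit code.
    ( cd "$scratch" && "$@" )
    status=$?
    # Only delete the scratch copy once the results are safely stored.
    if mkdir -p "$dest" && cp -R "$scratch"/. "$dest"/; then
        rm -rf "$scratch"
    else
        echo "copy failed; results left in $scratch" >&2
        return 1
    fi
    return "$status"
}

# Example inside a job script:
#   run_with_scratch /projects/<project>/results python my_script.py
```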
Good Habits
The long partition is built for work that can survive interruption. If you use it, assume the job may be paused and resumed later.
Practical habits help a lot here:
- write output incrementally
- checkpoint often enough to recover cleanly
- test restart logic before you launch a long run
- avoid requesting more resources than the job really needs
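A requeued batch job starts its script again from the top, so the script itself has to find out where it left off. The sketch below shows one minimal way to do that with a checkpoint file holding the last completed step; run_steps, the file name, and the echo placeholder for real work are all hypothetical.

```bash
#!/bin/bash
# run_steps <checkpoint-file> <total-steps>
# Runs steps 1..total, skipping any steps already recorded as done.
run_steps() {
    local ckpt=$1 total=$2 step=1
    # Resume after the last completed step if a checkpoint exists.
    if [ -f "$ckpt" ]; then
        step=$(( $(cat "$ckpt") + 1 ))
    fi
    while [ "$step" -le "$total" ]; do
        echo "step $step"    # placeholder for the real work of this step
        # Write to a temp file and rename it, so an interruption
        # never leaves a half-written checkpoint behind.
        echo "$step" > "$ckpt.tmp" && mv "$ckpt.tmp" "$ckpt"
        step=$(( step + 1 ))
    done
}

# Example inside a job script:
#   run_steps checkpoint.txt 100
```

Running the same call twice is a cheap way to test restart logic: the second run should do nothing.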
For small sanity checks, use debug. For normal production work, use normal. Save long for jobs that genuinely benefit from restartability.
Before you scale up, it is worth doing one small run first:
- Test on debug.
- Confirm the script works.
- Confirm output paths are correct.
- Confirm the resource request is reasonable.
- Move to normal or long once the job behaves the way you want.
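Part of that checklist can be automated. sbatch's --test-only flag validates the script and options and reports an estimated start time without submitting anything, so a wrapper can refuse to submit a request the scheduler would reject. check_then_submit below is a hypothetical helper sketched on that flag.

```bash
#!/bin/bash
# check_then_submit <sbatch arguments...>
# Validates the request with --test-only first, then submits it for real.
check_then_submit() {
    if sbatch --test-only "$@"; then
        sbatch "$@"
    else
        echo "sbatch rejected the request; nothing was submitted" >&2
        return 1
    fi
}

# Example:
#   check_then_submit -A <account> -p normal --time=02:00:00 job.sh
```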