3 Running Jobs on Compute Nodes

3.1 Compute Nodes

Thus far, this guide has described how to run your code/jobs on RRN itself. As explained in the section RRN vs compute nodes, if you have very large compute-intensive jobs (e.g., jobs that need more than 16 CPUs and more than 12 hours of run time), it is recommended to submit them to compute nodes. Resources on RRN (a login node) are always shared; when RRN is busy, everyone gets fewer resources and your code runs slower. On the other hand, you can request resources on compute nodes, and once granted, they are exclusively yours for the duration of your job.

Rotman users have access to 8 compute nodes via a “standard” partition (a partition is just a logical set of compute nodes). Each compute node has 24 CPUs and 240G of memory.

In this section, we briefly discuss how to submit jobs to compute nodes. For a more detailed document on how to use compute nodes, see How to Run Jobs Using a Scheduler on the CAC Wiki website.

3.2 Job Submission Script

Resources on compute nodes are managed by a workload management system called Slurm (Simple Linux Utility for Resource Management). To request resources on compute nodes and submit a job, you will need to write a job submission script. The script is a Bash shell script containing a set of Slurm directives that describe your resource requests and other submission options.

Below is a sample script for submitting a Python job. You can name the script whatever you like (e.g., job.sh) and store it in the same directory as your Python script.

#!/bin/bash

#SBATCH --job-name=project_abc    # create a job name
#SBATCH --partition=standard      # specify job partition
#SBATCH --mail-type=ALL           # email you when the job starts, stops, etc.
#SBATCH --mail-user=your_email    # specify your email
#SBATCH --output=STD.out          # save standard output to STD.out
#SBATCH --error=STD.err           # save standard error output to STD.err
#SBATCH -c 4                      # request 4 CPUs
#SBATCH --time=30:00              # set total run time limit to 30mins (days-hrs:mins:secs)
#SBATCH --mem=5000                # request 5000M (~5G) memory

module load StdEnv/2020           # load StdEnv/2020 software environment
module load python/3.11.5         # load python module
source ~/myenv/bin/activate       # load my python virtual environment
python python_test.py             # run python code

The first line in the above script tells the system which shell to use when running the script.

The #SBATCH lines are Slurm directives specifying options for scheduling the job. See the comment on each line (starting with the second # sign) for what each option does. Adjust them to fit your needs.

An important note: under the current cluster configuration, each job is restricted to a single node, i.e., a single parallel job can only take advantage of the memory and CPUs of one node. Therefore, do not request more than 24 CPUs or 240G of memory, as that is the resource capacity of each compute node in the standard partition. In fact, since the Operating System on a node needs some resources as well, the maximum CPUs and memory you request (if needed) should be slightly below full capacity (e.g., 23 CPUs and 220G of memory).
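
For example, a job that needs most of a node could request just under full capacity. A minimal sketch (only the resource lines are shown; the rest of the script stays the same):

#SBATCH -c 23          # just under the 24-CPU capacity of a node
#SBATCH --mem=220G     # leave some memory for the Operating System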

The two module load lines load the StdEnv/2020 software environment and the Python module. In general, you will load the Python module that you have set up for your project (see the section on Python Setup for more details); in this example, we load the Python 3.11.5 module. If you have a Python virtual environment set up for your code/project, you would also activate it here, after loading the Python module, as the source line does for the myenv virtual environment.

The last line runs the Python code (python_test.py). Note that you do not need nohup or & to run your code, as you would normally do when running code in batch mode on RRN itself.
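
For comparison, assuming the same script python_test.py:

nohup python python_test.py &     # batch mode on RRN itself: background the job with nohup and &
python python_test.py             # in a Slurm job script: a plain command is all you need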

If you use software other than Python, simply adjust the module load and run commands at the end of the script. Below are a few examples, followed by a complete sample script for R.

For R,

module load StdEnv/2020
module load r/4.2.1
Rscript mycode.R

For Matlab,

module load StdEnv/2020
module load matlab/R2023a
matlab -nodesktop -nosplash -nodisplay < matlab_test.m

For Stata 18,

stata -b do filename

For Stata 17,

module load stata17
stata -b do filename

For SAS,

sas mycode.sas
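
Putting it together, a complete job script for R might look as follows. This is a sketch assembled from the pieces above; the job name, email, resource requests, and file name mycode.R are placeholders to adjust.

#!/bin/bash

#SBATCH --job-name=project_r      # create a job name
#SBATCH --partition=standard      # specify job partition
#SBATCH --mail-type=ALL           # email you when the job starts, stops, etc.
#SBATCH --mail-user=your_email    # specify your email
#SBATCH --output=STD.out          # save standard output to STD.out
#SBATCH --error=STD.err           # save standard error output to STD.err
#SBATCH -c 4                      # request 4 CPUs
#SBATCH --time=30:00              # set total run time limit to 30mins
#SBATCH --mem=5000                # request 5000M (~5G) memory

module load StdEnv/2020           # load StdEnv/2020 software environment
module load r/4.2.1               # load R module
Rscript mycode.R                  # run R code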

3.3 Submitting Jobs to Compute Nodes

To submit your job, use the sbatch command. A job ID will be returned if the submission is successful.

sbatch job.sh       # job.sh is the name of the job script 
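
If the submission succeeds, Slurm echoes the assigned job ID; the ID below is just an illustration.

$ sbatch job.sh
Submitted batch job 123456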

Your job may wait in a queue for some time until the requested resources are available for you.

Sometimes, you may want to check the current availability of resources on compute nodes. The sinfo commands below can be useful.

List all compute nodes available in the standard partition and their resources (total CPUs, memory, etc.)

sinfo -p standard --Node --long 

List current resource availability of each compute node in the standard partition, including node A/I/O/T (Allocated/Idle/Other/Total), CPU A/I/O/T, and total and free memory.

sinfo -p standard --Node -o"%8N %14F %13C %6m %e" 
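
The output looks roughly like the following; the node names and numbers are purely illustrative.

NODELIST NODES(A/I/O/T) CPUS(A/I/O/T) MEMORY FREE_MEM
node01   1/0/0/1        20/4/0/24     245760 51200
node02   0/1/0/1        0/24/0/24     245760 230000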

3.4 Managing Jobs on Compute Nodes

Show status of jobs submitted.

squeue --user <your_user_name>
squeue --job <job_id>
squeue --start --job <job_id>   # get an estimate of when a job will start
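
Note that squeue only lists pending and running jobs. For a job that has already finished, the sacct command reports its accounting record, assuming job accounting is enabled on the cluster (the format fields below are standard Slurm ones):

sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS   # summary of a completed job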

Cancel jobs.

scancel <job_id>                # kill a job with a job ID <job_id>
scancel -u <your_user_name>     # kill all jobs for user <your_user_name> 

3.5 Best Practice

Debug your code well before submitting it to run on compute nodes. You can debug a less resource-intensive version of your code on your own computer or on RRN. You can also request a small interactive session on a compute node to debug your code. To do so, use the salloc command.

salloc -c 4 --mem=6g      # start a 4-CPU 6G memory interactive session 
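
Once the session starts, you get a shell on a compute node, where you can load modules and test your code interactively, e.g., using the Python setup from the sample script above:

module load StdEnv/2020
module load python/3.11.5
source ~/myenv/bin/activate
python python_test.py      # run and debug interactively
exit                       # end the session and release the resources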

When debugging your code, monitor CPU and memory usage so you can estimate how many CPUs and how much memory to request when running your code on compute nodes. You may request a bit more than your code actually needs, but do not request excessive resources. Note that the more resources you request, the longer your job may wait in the queue before it actually runs, as the compute nodes are also shared and may not have the requested resources available at the time of your request.
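
For example, you could watch your processes with top while the code runs, or query Slurm with sstat for a running job (the format fields below are standard Slurm ones; for a batch job you may need to query the <job_id>.batch step):

top -u $USER                                      # live CPU/memory usage of your processes
sstat -j <job_id> --format=AveCPU,AveRSS,MaxRSS   # CPU time and memory usage of a running job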