We have at CEREMADE a cluster for parallel computing.
The cluster consists of 8 nodes (machines named clust1, clust2, etc.) with the following configurations:

- clust1: 40 CPUs, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla T4 GPU
- clust2: 40 CPUs, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla T4 GPU
- clust3: 40 CPUs, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla P4 GPU
- clust4: 40 CPUs, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla P4 GPU
- clust5: 40 CPUs, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla P4 GPU
- clust6: 40 CPUs, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla P4 GPU
- clust7: 40 CPUs, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla P4 GPU
- clust8: 40 CPUs, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla P4 GPU

So a total of 320 CPUs!
For the ERC MDFT:

- clust9: 40 CPUs, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
- clust10: 40 CPUs, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz

To submit a computation, the SLURM service was set up to manage the submitted jobs. A dedicated front-end machine named cluster.ceremade.dauphine.lan was configured, through which one must submit the desired calculation, requesting the time and resources that the SLURM service will then manage.
For example, we can send our pi directory containing code, data, etc. via scp:
scp -r /home/chupin/pi/ chupin@cluster.ceremade.dauphine.lan:~/
It is also possible to do this via SFTP (using, for example, FileZilla).
We connect to the cluster.ceremade.dauphine.lan machine with ssh.
If you access the cluster through the [VPN](https://www.ceremade.dauphine.fr/doc/fr/logiciels/vpn-dauphine), you must use the IP address 10.101.7.5 rather than the DNS name cluster.ceremade.dauphine.lan:
ssh username@cluster.ceremade.dauphine.lan
or
ssh username@10.101.7.5
To submit a calculation, you have to write an SBATCH file telling SLURM what resources are needed and which commands to run.
SBATCH scripts are bash scripts that contain SBATCH directives as comments.
If you use Python, Julia, or R and your code uses particular libraries, install them locally in your home directory (with pip, Pkg, etc.).
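For example, with Python, a package can be installed under your home directory using pip's --user flag (the package name numpy here is only an illustration; replace it with whatever your code needs):

```shell
# Install a package into ~/.local instead of system-wide
# (numpy is only an example package)
pip3 install --user numpy
```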
A minimal example of SBATCH commands is provided here. All commands are described at the end of the page.
#!/bin/sh
# File submission.SBATCH
#SBATCH --nodes=1
#SBATCH -c 20
#SBATCH --job-name="MY_JOB"
#SBATCH --output=test.out
#SBATCH --mail-user=chupin@ceremade.dauphine.fr
#SBATCH --mail-type=BEGIN,END,FAIL
The following SBATCH script:
#!/bin/sh
# File submission.SBATCH
#SBATCH --nodes=1
#SBATCH -c 20
#SBATCH --job-name="MY_JOB"
#SBATCH --output=%x.%J.out
#SBATCH --error=%x.%J.out
#SBATCH --mail-user=duleu@ceremade.dauphine.fr
#SBATCH --mail-type=BEGIN,FAIL,END
### Some info that may be useful
echo Host `hostname`
### Total number of CPUs
echo It has been allocated $SLURM_JOB_CPUS_PER_NODE cpus
### Definition of the env variable for OpenMP
# $SLURM_JOB_CPUS_PER_NODE is the number of CPUs per node requested
OMP_NUM_THREADS=$SLURM_JOB_CPUS_PER_NODE
export OMP_NUM_THREADS
echo This job has $OMP_NUM_THREADS cpus
will produce the following result:
Host clust3
It has been allocated 20 cpus
This job has 20 cpus
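Outside SLURM the variable $SLURM_JOB_CPUS_PER_NODE is unset, but the export logic of the script can be checked locally by setting it by hand (the value 20 below simply mimics the -c 20 request):

```shell
# Mimic what SLURM would set for a job submitted with -c 20
SLURM_JOB_CPUS_PER_NODE=20
# Hand the CPU count over to OpenMP, as in the script above
OMP_NUM_THREADS=$SLURM_JOB_CPUS_PER_NODE
export OMP_NUM_THREADS
echo This job has $OMP_NUM_THREADS cpus
```

Note that on multi-node allocations SLURM_JOB_CPUS_PER_NODE can take a compound form such as 20(x2), so this direct export is only safe for single-node jobs like the ones shown here.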
To run a job from cluster.ceremade.dauphine.lan, use:
chupin@cluster:~/pi/> sbatch submission.SBATCH
Other SLURM tools to view, cancel, stop, etc. a job are described at the end of the page.
Let's consider, for example, a C++ code compute_pi.cpp that uses the omp.h library, and thus OpenMP directives for the compiler (a Python code also fits in this frame).
Such a code must be compiled in the following way:
g++ -o compute_pi -fopenmp compute_pi.cpp
Once this is done, we create an SBATCH script (in a file named for the example submission.SBATCH) such as the one shown above, and submit it:
chupin@cluster:~/pi/> sbatch submission.SBATCH
If you are not on the Dauphine premises connected by ethernet cable, you must connect to the VPN.
To use Jupyter we need to go through an interactive session, by running this command:
srun --pty -c 10 -N 1 /bin/bash
If all goes well you should see your prompt change from:
duleu@cluster:~/code/test$
to
duleu@clust3:~/code/test$
You can see that the machine name is now clust3 and not cluster. We can now run a jupyter notebook:
jupyter notebook --ip=0.0.0.0
We get the address to copy and paste in a browser:
[I 11:09:58.871 NotebookApp] JupyterLab extension loaded from /home/users/duleu/anaconda3/lib/python3.7/site-packages/jupyterlab
[I 11:09:58.871 NotebookApp] JupyterLab application directory is /home/users/duleu/anaconda3/share/jupyter/lab
[I 11:09:58.876 NotebookApp] Serving notebooks from local directory: /mnt/nfs/rdata02-users/users/duleu/code/test
[I 11:09:58.876 NotebookApp] The Jupyter Notebook is running at:
[I 11:09:58.876 NotebookApp] http://clust3:8888/?token=673591c461f08d5773353be62416ba33a3468551a8926287
[I 11:09:58.876 NotebookApp] or http://127.0.0.1:8888/?token=673591c461f08d5773353be62416ba33a3468551a8926287
[I 11:09:58.876 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 11:09:58.939 NotebookApp] No web browser found: could not locate runnable browser.
[C 11:09:58.939 NotebookApp]
To access the notebook, open this file in a browser:
file:///mnt/nfs/rdata02-users/users/duleu/.local/share/jupyter/runtime/nbserver-1947447-open.html
Or copy and paste one of these URLs:
http://clust3:8888/?token=673591c461f08d5773353be62416ba33a3468551a8926287
or http://127.0.0.1:8888/?token=673591c461f08d5773353be62416ba33a3468551a8926287
You must add .ceremade.dauphine.lan after clust3.
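If the web port of the compute node is not reachable directly from your machine, a classic workaround (a sketch, not an official procedure of the service) is to forward the port through the front-end with ssh -L, then open the notebook locally:

```shell
# Forward local port 8888 to port 8888 of clust3 via the front-end
# (replace clust3 and username with your own values)
ssh -L 8888:clust3.ceremade.dauphine.lan:8888 username@cluster.ceremade.dauphine.lan
```

With the tunnel open, the URL printed for 127.0.0.1 (including the token) can be pasted as-is into a browser on your own machine.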
We want to run our Python program script.py. To do this, we can use the SBATCH file below (note the export of the OpenMP variable).
#!/bin/sh
# file submission.SBATCH
#SBATCH --nodes=1
#SBATCH -c 20
#SBATCH --job-name="Test_Python"
#SBATCH --output=%x.%J.out
#SBATCH --time=10:00
#SBATCH --error=%x.%J.out
#SBATCH --mail-user=<user>@ceremade.dauphine.fr
#SBATCH --mail-type=BEGIN,FAIL,END
### Definition of the env variable for OpenMP
# $SLURM_JOB_CPUS_PER_NODE is the number of CPUs per node requested
OMP_NUM_THREADS=$SLURM_JOB_CPUS_PER_NODE
export OMP_NUM_THREADS
python3 script.py
Matlab is installed on all the nodes of the cluster. So we can use it. Let's suppose that in our working directory, we have a script script.m that we want to run.
#!/bin/sh
# File submission.SBATCH
#SBATCH --nodes=1
#SBATCH -c 20
#SBATCH --job-name="MY_JOB"
#SBATCH --output=%x.%J.out
#SBATCH --error=%x.%J.out
#SBATCH --mail-user=<name>@ceremade.dauphine.fr
#SBATCH --mail-type=BEGIN,END,FAIL
# we execute the matlab program but without graphical interface
matlab -nodisplay -nodesktop -r "run('script.m')"
This is a bash script whose comments starting with #SBATCH are commands for SLURM. Here, the name of the job is MY_JOB.
Warning: here, we requested 20 CPUs on 1 node (it is SLURM that manages the choice of machines and CPUs). In fact, to be able to use the 40 available threads, we would have to do multithreading with Matlab, and we do not know how to do that without the graphical interface.
Warning: on some accounts, matlab is not accessible, and you have to specify the full path of the executable:
/usr/local/bin/matlab -nodisplay -nodesktop -r "run('script.m')"
Once these files are on the cluster machine, in a directory in your home, we submit the job using the following command:
chupin@cluster:~/codematlab/> sbatch submission.SBATCH
To request GPU resources, it is necessary to add this information in the SBATCH file or during an interactive session.
Here is an example of a file with a GPU resource request:
#!/bin/bash
# File submission.SBATCH
#SBATCH --nodes=1
#SBATCH -c 20
#SBATCH --gres=gpu:1
#SBATCH --job-name="MY_JOB"
#SBATCH --output=%x.%J.out
#SBATCH --error=%x.%J.out
#SBATCH --mail-user=<name>@ceremade.dauphine.fr
#SBATCH --mail-type=BEGIN,END,FAIL
# For OpenMP export
OMP_NUM_THREADS=$SLURM_JOB_CPUS_PER_NODE
export OMP_NUM_THREADS
# we move to the SLURM directory
cd $SLURM_SUBMIT_DIR
# execute the matlab program but without the graphical interface
matlab -nodisplay -nodesktop -r "run('script.m')"
The necessary instruction is the following: --gres=gpu:1.
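To check that a GPU was actually allocated, a simple verification (assuming the NVIDIA driver tools are installed on the GPU nodes, and that SLURM sets CUDA_VISIBLE_DEVICES for GPU allocations, which is the usual configuration) is to add at the beginning of the job script:

```shell
# SLURM restricts the job to its allocated GPU(s) through this variable
echo GPUs visible to this job: $CUDA_VISIBLE_DEVICES
# Print the state of the allocated GPU
nvidia-smi
```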
Option | Description
---|---
#SBATCH --partition=&lt;part&gt; | Choose the partition to use for the job
#SBATCH --job-name=&lt;name&gt; | Defines the name of the job as it will be displayed in the various SLURM commands (squeue, sstat, sacct)
#SBATCH --output=&lt;stdOutFile&gt; | The standard output (stdout) will be redirected to the file defined by --output or, if not defined, to a default file slurm-%j.out (SLURM replaces %j with the JobID)
#SBATCH --error=&lt;stdErrFile&gt; | The error output (stderr) will be redirected to the file defined by --error or, if not defined, to the standard output
#SBATCH --input=&lt;stdInFile&gt; | The standard input can also be redirected with --input; by default /dev/null is used (none/empty)
#SBATCH --open-mode=&lt;append,truncate&gt; | Defines the opening (writing) mode of the output files, like open/fopen in most programming languages: append writes after the end of the file (if it exists), truncate (the default) overwrites the file at each batch execution
#SBATCH --mail-user=&lt;e-mail&gt; | Defines the e-mail address of the recipient
#SBATCH --mail-type=&lt;BEGIN,END,FAIL,TIME_LIMIT,TIME_LIMIT_50,...&gt; | Allows being notified by e-mail of a particular event in the life of the job: beginning of execution (BEGIN), end of execution (END, FAIL, TIME_LIMIT)... See the SLURM documentation for the complete list of supported events
#SBATCH --cpus-per-task=&lt;n&gt; | Defines the number of CPUs to allocate per task; the actual use of these CPUs is up to each task (creation of processes and/or threads)
#SBATCH --ntasks=&lt;n&gt; | Defines the maximum number of tasks executed in parallel
#SBATCH --mem-per-cpu=&lt;n&gt; | Defines the RAM in MB allocated to each CPU. By default, 4096 MB are allocated to each CPU; this option allows specifying a different size, less than or equal to 7800 MB (the maximum allocatable per CPU)
#SBATCH --nodes=&lt;minnodes[-maxnodes]&gt; | Minimum [-maximum] number of nodes on which to distribute the tasks
#SBATCH --ntasks-per-node=&lt;n&gt; | Used in conjunction with --nodes, this option is an alternative to --ntasks that controls the distribution of tasks over the individual nodes
You can explicitly request a specific node, and a number of CPUs on it, with:
#SBATCH --nodelist=clust8
#SBATCH -c 5
This explicitly selects the clust8 node, with 5 CPUs on that node. Of course, this is not recommended: SLURM handles the job distribution.
Variable name | Description | Example
---|---|---
SLURM_JOB_ID | The job (calculation) identifier | 12345
SLURM_JOB_NAME | The name of the job defined with the -J option | my_job
SLURM_JOB_NODELIST | The list of nodes allocated to the job |
SLURM_SUBMIT_HOST | Name of the host on which sbatch was run (in our case cluster) | cluster
SLURM_SUBMIT_DIR | Directory from which the job is submitted | /home/user/chupin/scripts_pbs
SLURM_JOB_NUM_NODES | Number of nodes requested for the job (e.g. with -N 5) |
SLURM_NTASKS_PER_NODE | Number of tasks per node requested for the job (set with --ntasks-per-node) |
To list the computations launched on the cluster, we use the smap program:
chupin@cluster:~/pi/> smap -i 1
which produces something like:
┌─────────────────────────────────────────────────────────────────────────────────────┐
│..B....... │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
└─────────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────────┐
│Tue Mar 16 16:57:10 2021 │
│ID JOBID PARTITION USER NAME ST TIME NODES NODELIST │
│A 101 debug duleu MON_JOB R 00:00:20 1 clust3 │
│B 102 debug duleu MON_JOB R 00:00:16 1 clust3 │
│C 103 debug duleu MON_JOB PD 00:00:00 1 waiting... │
│D 104 debug duleu MON_JOB PD 00:00:00 1 waiting... │
│E 105 debug duleu MON_JOB PD 00:00:00 1 waiting... │
│F 106 debug duleu MON_JOB PD 00:00:00 1 waiting... │
│G 107 debug duleu MON_JOB PD 00:00:00 1 waiting... │
│H 108 debug duleu MON_JOB PD 00:00:00 1 waiting... │
│I 109 debug duleu MON_JOB PD 00:00:00 1 waiting... │
│J 110 debug duleu MON_JOB PD 00:00:00 1 waiting... │
│ │
└─────────────────────────────────────────────────────────────────────────────────────┘
To see the occupancy rate of the nodes, we use the command pestat:
chupin@cluster:~/pi/> pestat
Here is an example of what is displayed:
Hostname Partition Node Num_CPU CPUload Memsize Freemem Joblist
State Use/Tot (MB) (MB) JobId User ...
clust1 debug* down* 0 40 0.00* 1 0
clust2 debug* down* 0 40 0.00* 1 0
clust3 debug* idle 0 40 0.20 1 87878
clust4 debug* down* 0 40 0.00* 1 0
clust5 debug* down* 0 40 0.00* 1 0
clust6 debug* down* 0 40 0.00* 1 0
clust7 debug* down* 0 40 0.00* 1 0
clust8 debug* down* 0 40 0.00* 1 0
clust9 erc down* 0 40 0.00* 1 0
clust10 erc down* 0 40 0.00* 1 0
Other programs are available to handle the calculations submitted with sbatch. In particular sstat, which must be given the JobID shown in the JOBID column of smap:
chupin@cluster:~/pi/> sstat 150
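Two other standard SLURM commands are useful for viewing and cancelling jobs (the JobID 150 is just the example above):

```shell
# List your own jobs and their state (R = running, PD = pending)
squeue -u $USER
# Cancel job 150
scancel 150
```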