====== FUCHS Cluster Usage ======
<note tip>In 2023, the cluster was updated to a new operating system (AlmaLinux 9). Please help us to help you: the first weeks after such an update are usually very stressful, because we get bombarded with tickets along the lines of "my software ran before, but now nothing works anymore". It can be necessary to set up software (e.g. with Spack) from scratch again, and some old ssh keys using the rsa cipher might not work anymore. Please give us some time to rearrange our documentation, and please discuss problems within your group first (maybe team members have already found a solution) before opening tickets. The fewer tickets we get, the faster we can provide a working cluster for everybody. Also see [[public: ...]].</note>

[[..:
===== Login =====
<

Am I connected to the right server? Please find our SSH host key fingerprints here:

++++fuchs.hhlr-gu.de fingerprints|

<WRAP center round info 90%>The ''
**ECDSA SHA256: ...**
**ECDSA MD5: ...**
**ED25519 SHA256: ...**
**ED25519 MD5: ...**
</WRAP>

++++

<note important>
You may receive a warning from your system that something is wrong with the security ("maybe somebody is eavesdropping"). If the host key has changed on our side (e.g. after the OS update), remove the old key with

  ssh-keygen -R fuchs.hhlr-gu.de

and accept the new fuchs.hhlr-gu.de key. Above you will find our unique ECDSA and ED25519 fingerprints. Some programs tend to display the fingerprint in the SHA256 or MD5 format. Just click on //fuchs.hhlr-gu.de fingerprints// above to compare.
</note>

Please check with ''
On Windows systems please use/install a Windows SSH client (e.g. PuTTY, ...).
After your [[first login]] you will get a message that your password has expired and that you have to change it. Please use the password provided by CSC at the prompt, choose a new one and retype it. You will be logged out automatically. Now you can log in with your new password and work on the cluster.
<note warning>
\\
''
on the command line. On the login node, any process that exceeds the CPU-time limit (e.g. a long-running test program or a long-running rsync) will be killed automatically.</note>
===== Environment Modules =====
There are several versions of software packages installed on our systems. The same name for an executable (e.g. ''mpirun'') and/or library file may be used by more than one package. The environment module system, with its ''module'' command, lets you select which version of a package is active in your environment.
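A minimal sketch of the most common module commands (the module name and version below are placeholders; check ''module avail'' for what is actually installed):
<code bash>
# list all modules available on the cluster
module avail
# load a specific module (name/version are examples)
module load mpi/openmpi/4.1.5
# show the modules currently loaded in your environment
module list
# unload all loaded modules again
module purge
</code>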
If you want to know more about module commands, the ''module help'' command gives you an overview.
===== Working with Intel oneAPI =====

With the command ''module load intel/<version>'' you load the Intel oneAPI modulefile, which in turn makes a number of additional modules available:

<note tip>To avoid errors, please use version numbers instead of ''latest''.</note>

<code bash>
module load intel/
module avail
...
# new modules available within the Intel oneAPI modulefile
----------------------------------- /
advisor/
advisor/
ccl/
ccl/
compiler-rt/
compiler-rt/
compiler-rt32/
compiler-rt32/
compiler/
compiler/
compiler32/
compiler32/
dal/
dal/
debugger/
debugger/

Key:
loaded
</code>
+ | |||
+ | Please also note, by default, Intel MPI's '' | ||
+ | |||
+ | module load intel/ | ||
+ | module load compiler/ | ||
+ | module load mpi/ | ||
+ | export I_MPI_CC=icx | ||
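With ''I_MPI_CC=icx'' set, the ''mpicc'' wrapper calls the Intel compiler under the hood; a build then looks like this (source and program names are just examples):
<code bash>
# compile an MPI program with the wrapper configured above
mpicc -O2 -o my_mpi_prog my_mpi_prog.c
</code>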
===== Compiling Software =====
You can compile your software on the login nodes (or on any other node, inside a job allocation).
  * Intel compilers
  * MPI libraries
  * other libraries
For the right compilation commands please consider:
<
[[https://www.intel.com/content/
</
To build and manage software which is not available as a module on the cluster, you can use a package manager such as Spack (see the note at the top of this page).
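A minimal sketch of setting up and using Spack in your home directory (the package name ''fftw'' is only an example, and this assumes you set up your own Spack instance rather than a site installation):
<code bash>
# get Spack and activate it in the current shell
git clone https://github.com/spack/spack.git ~/spack
. ~/spack/share/spack/setup-env.sh
# install and load an example package
spack install fftw
spack load fftw
</code>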
===== Storage =====
  * your home directory ''/
  * your scratch directory ''/
  * the non-shared local storage (i.e. unique on each compute node) under ''/
Please use your home directory for small permanent files, e.g. source files, libraries and executables.
By default, the space in your home directory is limited to 30 GB, and in your scratch directory to 5 TB and/or 800000 inodes (which corresponds to approximately 200000+ files).
<
While the data in your home directory is backed up nightly (please ask if you want us to restore anything from there, see also [[http:// ...]]), there is no backup of your scratch directory.</
If you need local storage on the compute nodes, you have to add the ''
<code bash>
  * [[#
For every compute job you have to submit a job script (unless working interactively using [[# ...]]). You submit it by running
  sbatch jobscript.sh
on a login node. A SLURM job script is a shell script containing directives of the form
  #SBATCH ...
==== Your First Job Script ====
In ''
<code bash>
#SBATCH --nodes=3
#SBATCH --ntasks=60
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=512
#SBATCH --time=00:
#SBATCH --no-requeue
#SBATCH --mail-type=FAIL
srun hostname
</code>
<
|''--cpus-per-task=1''|For SLURM, a CPU core (a CPU thread, to be more precise) is a CPU.|
|''--no-requeue''|Prevent the job from being requeued after a failure.|
|''--mail-type=FAIL''|Send an e-mail if something goes wrong.|
</
The ''
Although nodes are allocated exclusively,
After saving the above job script as e.g. ''jobscript.sh'', you can submit it by running ''sbatch jobscript.sh''.
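Submitting then looks like this (the job ID in the output is just an example):
<code bash>
sbatch jobscript.sh
Submitted batch job 1234567
</code>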
==== Job Monitoring ====
For job monitoring (to check the current state of your jobs) you can use the ''squeue'' command.
If you need to cancel a job, you can use the ''scancel'' command.
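For example (the job ID is a placeholder):
<code bash>
# list your own pending and running jobs
squeue -u $USER
# cancel a specific job by its job ID
scancel 1234567
# cancel all of your jobs
scancel -u $USER
</code>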
#SBATCH --partition=fuchs
</code>
to the job script when they want to use the FUCHS cluster.
==== Node Types ====
On FUCHS only **one type** of compute node is available:
^Number^Type^Vendor^Processor^Processor x Core (HT)^RAM [GB]^
|194|dual-socket|Intel|Xeon Ivy Bridge E5-2670 v2| 2x10 (2x20)|128|
| ''
| ''
| ''
| ''
#SBATCH --ntasks=20
#SBATCH --cpus-per-task=1
#SBATCH --mem=100g
#SBATCH --time=01:
#SBATCH --mail-type=FAIL
#
# Replace by a for loop.
./program input01 &> 01.out &
./program input02 &> 02.out &
...
./program input20 &> 20.out &
# Wait for all child processes to terminate.
wait
</code>
In this (SIMD) example we assume that there is a program (called ''program'') which is run 20 times, each time with a different input file.
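As the comment in the script suggests, the 20 lines can be replaced by a loop; a minimal sketch (assuming the input files really are named ''input01'' ... ''input20''):
<code bash>
# start the 20 runs in the background, one per core
for i in $(seq -w 1 20); do
    ./program "input$i" &> "$i.out" &
done
# wait for all child processes to terminate
wait
</code>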
==== Job Arrays ====
If you have lots of single-core computations to run, job arrays are worth a look. Telling SLURM to run a job script as a job array will result in running that script multiple times (after the corresponding resources have been allocated). Each instance will have a distinct ''SLURM_ARRAY_TASK_ID'' in its environment.
Due to our full-node policy, you still have to ensure that your jobs don't waste any resources. Let's say you have 400 single-core tasks. In the following example 400 tasks are run inside a job array while ensuring that only 20-core nodes are used and that each node runs exactly 20 tasks in parallel.
#SBATCH --mem-per-cpu=2000
#SBATCH --time=00:
#SBATCH --array=0-399:20
#SBATCH --mail-type=FAIL
</code>
If the task running times vary a lot, consider using the //thread pool pattern//. Have a look at **GNU parallel**, for instance.
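A minimal sketch with GNU parallel for the 20-inputs-per-node case above (assuming the ''parallel'' command is available on the node): it keeps 20 tasks running at a time and starts a new one as soon as a core becomes free.
<code bash>
# run ./program on inputs 01..20, with at most 20 tasks at the same time
parallel -j 20 './program input{} &> {}.out' ::: $(seq -w 1 20)
</code>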
==== OpenMP Jobs ====
For OpenMP jobs, set the ''OMP_NUM_THREADS'' environment variable accordingly, e.g.:
<code bash>#!/bin/bash
#SBATCH --partition=fuchs
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=4000
#SBATCH --mail-type=ALL
#SBATCH --time=48:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./your_omp_program
</code>
==== MPI Jobs ====
**Remember: due to our full-node policy, whole nodes are allocated to your job.**
See also: http://
As an example, we want to run a program that spawns 80 MPI ranks, with 1200 MB of RAM allocated for each rank.
<code bash>#!/bin/bash
#SBATCH --partition=fuchs
#SBATCH --nodes=4
#SBATCH --ntasks=80
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1200
#SBATCH --mail-type=ALL
#SBATCH --extra-node-info=2:
#SBATCH --time=48:
module load mpi/.../<version>
mpirun ./your_mpi_program
</code>
<
Some MPI installations support launching the MPI ranks directly with ''srun'' instead of ''mpirun'', e.g.:
[...]
module load mpi/
srun --mpi=pmix
MPI implementations are typically designed to work seamlessly with job schedulers like Slurm. When you launch MPI tasks with ''srun'', Slurm takes care of starting and placing the processes on the allocated nodes.
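Putting it together, a job script launching the ranks with ''srun'' might look like this (module name/version and the PMIx option are only examples; use whatever your chosen MPI module supports):
<code bash>#!/bin/bash
#SBATCH --partition=fuchs
#SBATCH --nodes=4
#SBATCH --ntasks=80
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1200
#SBATCH --time=48:00:00

# module name and version are examples
module load mpi/openmpi/4.1.5
# let SLURM start and place the MPI ranks
srun --mpi=pmix ./your_mpi_program
</code>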
==== Hybrid Jobs: MPI/OpenMP ====
MPI example script (20 ranks, 5 threads each and 200 MB per thread, i.e. 1 GB per rank; so, for the 20 x 5 = 100 threads, you'll get five 20-core nodes):
<code bash>#!/bin/bash
#SBATCH --partition=fuchs
#SBATCH --ntasks=20
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=200
#SBATCH --mail-type=ALL
#SBATCH --extra-node-info=2:
#SBATCH --time=48:
module load mpi/
export OMP_NUM_THREADS=5
# When using MVAPICH2 disable core affinity.
export MV2_ENABLE_AFFINITY=0
mpirun -np 20 ./
</code>
Please note that this is just an example. You may or may not be able to run it as-is with your software, which is likely to have different scalability.
You have to disable core affinity when running hybrid jobs with MVAPICH2((MVAPICH2 is another MPI library)). Otherwise all threads of an MPI rank will be pinned to the same core. Our example therefore includes the command
<code bash>
export MV2_ENABLE_AFFINITY=0
</code>
which disables this feature. The OS scheduler is then responsible for the placement of the threads during the runtime of the program. However, the OS scheduler can also change the thread placement dynamically at runtime, which leads to cache invalidation and may cost performance.
+ | |||
+ | When using **Intel MPI**, please also check its [[https:// | ||
+ | |||
+ | ==== Memory Allocation ==== | ||
+ | |||
+ | Normally the memory available per CPU thread is calculated by the whole amount of RAM divided by the number of threads. For instance 128GB / 40 threads = 3.2GB per thread. Keep in mind that the FUCHS cluster provides two threads per core. Now imagine you need more memory, let's say 8192MB per task. Type '' | ||
+ | |||
<code bash>
#!/bin/bash
#SBATCH --job-name=<job_name>
#SBATCH --partition=fuchs
#SBATCH --ntasks=69
#SBATCH --cpus-per-task=1

# #SBATCH --mem-per-cpu=8192 would produce an error
# message, therefore it's commented out.

#SBATCH --mem=0
#SBATCH --ntasks-per-node=15

srun hostname
</code>
+ | |||
+ | If everything works fine you were granted 5 nodes. For example 4 nodes à 14 tasks and 1 node à 13 tasks, i.e. 56 tasks + 13 tasks = 69 tasks, as requested. | ||
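For illustration, the arithmetic behind these numbers (assuming the 128 GB nodes from the table above): 15 tasks x 8192 MB = 122880 MB (120 GB) still fits on one node, while 16 tasks would already need 131072 MB (128 GB), leaving nothing for the operating system. With at most 15 tasks per node, the 69 tasks therefore need ceil(69/15) = 5 nodes.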
==== Local Storage ====
For interactive workflows you can use SLURM's ''salloc'' command:
<code>
salloc: Granted job allocation
salloc: Waiting for resource configuration
salloc: Nodes node27-[012-015] are ready for job
[user@fuchs ~]$
</code>
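The allocation shown above was created with an ''salloc'' call along these lines (the option values are examples only):
<code bash>
salloc --partition=fuchs --nodes=4 --ntasks=80 --time=01:00:00
</code>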
Now you can ''ssh'' to the allocated nodes:
<code>
[user@node27-012 ~]$ hostname
node27-012
[user@node27-012 ~]$ exit
logout
Connection to node27-012 closed.
</code>
Or you can use ''srun'' to run a command on all allocated nodes, e.g. ''srun hostname'':
<code>
node27-013
node27-012
node27-015
node27-014
</code>
Finally you can terminate your interactive job session by running ''exit'':
<code>
exit
salloc: Relinquishing job allocation
</code>