public:usage:fuchs [2020/05/15 10:50] – [Job Monitoring] geier
public:usage:fuchs [2025/05/09 13:35] (current) – [Hybrid Jobs: MPI/OpenMP] geier
====== FUCHS Cluster Usage ======
  
<note tip>In 2023 the cluster was updated to a new operating system (AlmaLinux 9). Please help us to help you: after such an upgrade we usually get flooded with tickets along the lines of "my software ran before, but nothing works anymore". It may be necessary to set up software (e.g. via Spack) from scratch, and old SSH keys using the RSA cipher may no longer work. Please give us some time to rearrange our documentation, and discuss problems within your group first (team members may already have found a solution) before opening tickets. The fewer tickets we receive, the faster we can provide a working cluster for everybody. See also [[public:usage:Common errors]].</note>

[[..:service:fuchs|FUCHS]] is a general-purpose compute cluster based on Intel CPU architectures running AlmaLinux 9 and [[#running_jobs_with_slurm|SLURM]]. Please **read the following instructions and ensure that this guide is fully understood** before using the system.
  
===== Login =====

<code>ssh <user_account>@fuchs.hhlr-gu.de</code>
  
Am I connected to the right server? You can find our SSH host-key fingerprints here:

++++fuchs.hhlr-gu.de fingerprints|

<WRAP center round info 90%>The ''fuchs.hhlr-gu.de'' fingerprints are\\ \\
**ECDSA   SHA256:V5s3UkuRW3tr3xXe80AZAVvsnobfIslTEU+N7gl4yWs** \\
**ECDSA   MD5:75:61:ed:61:b6:43:30:3e:26:dc:d7:e4:00:5c:b5:b1**\\
**ED25519 SHA256:NZtoFOMnT4cdouiF+827eYaL2t7sbsUhJBx2OjFxRAQ** \\
**ED25519 MD5:55:e6:b8:c0:35:f2:13:4b:22:0c:d6:d0:59:7d:cc:be**
</WRAP>

++++

<note important>Warnings - Security Breach - Keys etc.\\ \\
You may receive a warning from your system that something is wrong with the security ("maybe somebody is eavesdropping"). Due to the upgrade of the operating system from Scientific Linux 7.9 to AlmaLinux 9.2, the SSH server keys have changed. Please erase your old FUCHS key with

  ssh-keygen -R fuchs.hhlr-gu.de

and accept the new fuchs.hhlr-gu.de key. Above you will find our unique ECDSA and ED25519 fingerprints. Some programs display the fingerprint in the SHA256 format, others in MD5. Just click on //fuchs.hhlr-gu.de fingerprints// above.</note>

Please check with ''ssh-keyscan fuchs.hhlr-gu.de'' that the key you are offered matches the fingerprints above.
On Windows systems please use/install a Windows SSH client (e.g. PuTTY, MobaXterm, the Cygwin ssh package or the built-in ''ssh'' command).
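To compare the server's keys against the published fingerprints before your first login, you can combine the two standard OpenSSH tools mentioned above (nothing here is FUCHS-specific):

```shell
# Fetch the ECDSA/ED25519 host keys and print their SHA256 fingerprints.
ssh-keyscan -t ecdsa,ed25519 fuchs.hhlr-gu.de 2>/dev/null | ssh-keygen -lf -

# Some clients display MD5-format fingerprints instead:
ssh-keyscan -t ecdsa,ed25519 fuchs.hhlr-gu.de 2>/dev/null | ssh-keygen -E md5 -lf -
```

Compare the output line by line against the fingerprints listed above before accepting the key.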
  
After your [[first login]] you will get a message that your password has expired and must be changed. Please use the password provided by CSC at the prompt, choose a new one and retype it. You will be logged out automatically. Now you can log in with your new password and work on the cluster.
  
<note warning>Never run heavy calculations on the login node, i.e. CPU-time- or RAM-consuming processes. You can check the CPU-time limit (in seconds) by running \\
\\

''ulimit -t''

on the command line. On the login node, any process that exceeds the CPU-time limit (e.g. a long-running test program or a long-running rsync) will be killed automatically.</note>
  
===== Environment Modules =====
  
There are several versions of software packages installed on our systems. The same name for an executable (e.g. ''mpirun'') and/or library file may be used by more than one package. The environment module system, with its ''module'' command, helps to keep them apart and prevents name clashes. You can list the module-managed software by running ''module avail'' on the command line. Other important commands are ''module load <name>'' (loads a module) and ''module list'' (lists the already loaded modules).

<note important>It's important to know which modules you really need. Loading more than one MPI module at the same time will likely lead to conflicts.</note>
  
If you want to know more about module commands, the ''module help'' command will give you an overview.
  
===== Working with Intel oneAPI =====

With the command ''module avail'' you can see all available modules on the cluster. The ''intel/oneapi/xxx'' module works a bit differently, because it functions like a container: load it with ''module load intel/oneapi/xxx'' and run ''module avail'' again to see what is inside. Then load your preferred module.

<note tip>To avoid errors, please use version numbers instead of ''latest''.</note>

<code|Example>
module load intel/oneapi/2023.2.0
module avail
...
# new modules available within the Intel oneAPI modulefile
----------------------------------- /cluster/intel/oneapi/2023.2.0/modulefiles ---------------------
advisor/2023.2.0        dev-utilities/2021.10.0  icc32/2023.2.1                mkl32/2023.2.0
advisor/latest          dev-utilities/latest     icc32/latest                  mkl32/latest
ccl/2021.10.0           dnnl-cpu-gomp/2023.2.0   inspector/2023.2.0            mpi/2021.10.0
ccl/latest              dnnl-cpu-gomp/latest     inspector/latest              mpi/latest
compiler-rt/2023.2.1    dnnl-cpu-iomp/2023.2.0   intel_ipp_ia32/2021.9.0       mpi_BROKEN/2021.10.0
compiler-rt/latest      dnnl-cpu-iomp/latest     intel_ipp_ia32/latest         mpi_BROKEN/latest
compiler-rt32/2023.2.1  dnnl-cpu-tbb/2023.2.0    intel_ipp_intel64/2021.9.0    oclfpga/2023.2.0
compiler-rt32/latest    dnnl-cpu-tbb/latest      intel_ipp_intel64/latest      oclfpga/2023.2.1
compiler/2023.2.1       dnnl/2023.2.0            intel_ippcp_ia32/2021.8.0     oclfpga/latest
compiler/latest         dnnl/latest              intel_ippcp_ia32/latest       tbb/2021.10.0
compiler32/2023.2.1     dpct/2023.2.0            intel_ippcp_intel64/2021.8.0  tbb/latest
compiler32/latest       dpct/latest              intel_ippcp_intel64/latest    tbb32/2021.10.0
dal/2023.2.0            dpl/2022.2.0             itac/2021.10.0                tbb32/latest
dal/latest              dpl/latest               itac/latest                   vtune/2023.2.0
debugger/2023.2.0       icc/2023.2.1             mkl/2023.2.0                  vtune/latest
debugger/latest         icc/latest               mkl/latest

Key:
loaded  modulepath
</code>

Please also note that, by default, Intel MPI's ''mpicc'' uses GCC. To make Intel MPI use an Intel compiler you have to set ''I_MPI_CC'' in your environment (or use ''mpiicc''), e.g.:

  module load intel/oneapi/2023.2.0
  module load compiler/2023.2.1
  module load mpi/2021.10.0
  export I_MPI_CC=icx
===== Compiling Software =====
  
You can compile your software on the login nodes (or on any other node, inside a job allocation). Several compiler suites are available. While GCC version 11.X is the built-in OS default, you can list additional compilers and libraries by running ''module avail'':
  
   * Intel compilers
   * MPI libraries
   * other libraries
  
For the right compilation commands please consider:
  
<note>C/C++, Fortran 77, Fortran 95 \\ \\
[[https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/command-reference/compiler-commands.html|Compilation commands for different compilers]]
</note>
  
To build and manage software which is not available via "''module avail''" and is not available as a built-in OS package, we recommend using //Spack//. Please read this small [[public:usage:spack|introduction]] on how to use Spack on the cluster. More information is available on the [[https://spack.io/|Spack]] webpage.
  
===== Storage =====
  * your home directory ''/home/fuchs/<group>/<user>'' (NFS, slow),
  * your scratch directory ''/scratch/fuchs/<group>/<user>'' (parallel file system BeeGFS, fast),
  * the non-shared local storage (i.e. unique on each compute node) under ''/local/$SLURM_JOB_ID'' (max. 1.4 TB per node, slow)
  
Please use your home directory for small permanent files, e.g. source files, libraries and executables.
{{ :public:loewe-storage4.png }}
  
By default, the space in your home directory is limited to 30 GB and in your scratch directory to 5 TB and/or 800000 inodes (which corresponds to approximately 200000+ files). **You can check your homedir and scratch usage by running the** ''quota'' **command on a login node.**
  
<note>
  
While the data in your home directory is backed up nightly (please ask if you want us to restore anything from there; see also [[http://www.rz.uni-frankfurt.de/49197551/backup_achivdienste?|HRZ-Backup]]), there is no backup of your scratch directory.</note>
If you need local storage on the compute nodes, you have to add the ''%%--%%tmp'' parameter to your job script (see SLURM section below). Set the amount of storage in megabytes, e.g. set ''%%--%%tmp=5000'' to allocate 5 GB of local disk space. The local directory (''/local/$SLURM_JOB_ID'') is deleted after the corresponding job has finished. If, for some reason, you don't want the data to be deleted (e.g. for debugging), you can use ''salloc'' instead of ''sbatch'' and work interactively (see ''man salloc''). Or, one can put an ''rsync'' at the end of the job script, in order to save the local data to ''/scratch'' just before the job exits:
<code bash>
...
</code>
  * [[#hybrid_jobsmpi_openmp|hybrid MPI/OpenMP]]
  
For every compute job you have to submit a job script (unless working interactively using [[#the_salloc_command|salloc]] or ''srun'', see man page for more information). If ''jobscript.sh'' is such a script, then a job can be enqueued by running
  
  sbatch jobscript.sh
  
on a login node. A SLURM job script is a shell script containing SLURM directives (options), i.e. pseudo-comment lines starting with
  
  #SBATCH ...
==== Your First Job Script ====
  
In ''fuchs'' you can allocate up to 120 nodes with two Intel Ivy Bridge CPUs, where each node has 20 cores (or 40 HT threads). In the following example we allocate 60 CPU cores (i.e. three nodes) and 512 MB per core for 5 minutes (SLURM may kill the job after that time, if it's still running):
  
<code bash>
...
#SBATCH --nodes=3
#SBATCH --ntasks=60
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=512
#SBATCH --time=00:05:00
#SBATCH --no-requeue
#SBATCH --mail-type=FAIL
  
srun hostname
...
</code>
  
<note>
|<code>--cpus-per-task=1</code>|For SLURM, a CPU core (a CPU thread, to be more precise) is a CPU. |
|<code>--no-requeue</code>|Prevent the job from being requeued after a failure. |
|<code>--mail-type=FAIL</code>|Send an e-mail if something goes wrong. |
</note>
  
The ''srun'' command is responsible for the distribution of the program (''hostname'' in our case) across the allocated resources, so that 20 instances of ''hostname'' will run on each of the allocated nodes concurrently. Please note that this is not the only way to run or to distribute your processes. Other cases and methods are covered later in this document. In contrast, the ''sleep'' command is executed only on the head((the first one of the three allocated nodes)) node.
  
Although nodes are allocated exclusively, you should always specify a memory value that reflects the RAM requirements of your job. <del>The scheduler treats RAM as a //consumable resource//. As a consequence, if you omit the ''%%--%%nodes'' parameter (so that only the number of CPU cores is defined) and allocate more memory per core than there actually is on a node, you'll automatically get more nodes if the job doesn't fit in otherwise.</del>((please see [[#Memory Allocation]] for the new Slurm behavior)) Moreover, jobs are killed through SLURM's //memory enforcement// when using more memory than requested.
  
After saving the above job script as e.g. ''jobscript.sh'', you can submit your job by running

  sbatch jobscript.sh
==== Job Monitoring ====
  
For job monitoring (to check the current state of your jobs) you can use the ''squeue'' command. Depending on the current cluster utilization (and other factors), your job(s) may take a while to start. You can list the current queuing times by running ''sqtimes'' on the command line.
  
If you need to cancel a job, you can use the ''scancel'' command (please see the man page, ''man scancel'', for further details).
Users have to add
<code>
#SBATCH --partition=fuchs
</code>
to the job script when they want to use the FUCHS cluster.
  
  
==== Node Types ====
  
On FUCHS only **one type** of compute node is available. There are
^Number^Type^Vendor^Processor^Processor x Core (HT)^RAM [GB]^
|194|dual-socket|Intel|Xeon Ivy Bridge E5-2670 v2| 2x10 (2x20)|128|
| ''MaxJobsPU'' |  10| max. number of jobs a user is able to run simultaneously |
| ''MaxSubmitPU'' |  20| max. number of jobs in running or pending state |
| ''MaxNodesPU'' |  50| max. number of nodes a user is able to use at the same time |
| ''MaxArraySize'' |  1001| the maximum job array size |
  
<code bash>
...
#SBATCH --ntasks=20
#SBATCH --cpus-per-task=1
#SBATCH --mem=100g
#SBATCH --time=01:00:00
#SBATCH --mail-type=FAIL
  
#
# Replace by a for loop.
  
./program input01 &> 01.out &
./program input02 &> 02.out &
  
...
  
./program input20 &> 20.out &
# Wait for all child processes to terminate.
wait
</code>
  
In this (SIMD) example we assume that there is a program (called ''program'') which is run 20 times on 20 different inputs (usually input files). Both output streams (''stdout'' and ''stderr'') of each process are redirected to a file ''N.out''. A job script is always executed on the first allocated node, so we don't need to use ''srun'', since exactly one node is allocated. Further we assume that the executable is located in the same directory where the job was submitted (the initial working directory).
  
==== Job Arrays ====
  
If you have lots of single-core computations to run, job arrays are worth a look. Telling SLURM to run a job script as a job array will result in running that script multiple times (after the corresponding resources have been allocated). Each instance will have a distinct ''SLURM_ARRAY_TASK_ID'' variable defined in its environment.
  
Due to our full-node policy, you still have to ensure that your jobs don't waste any resources. Let's say you have 400 single-core tasks. In the following example, 400 tasks are run inside a job array while ensuring that only 20-core nodes are used and that each node runs exactly 20 tasks in parallel.
<code bash>
...
#SBATCH --mem-per-cpu=2000
#SBATCH --time=00:10:00
#SBATCH --array=0-399:20
#SBATCH --mail-type=FAIL
  
...
</code>
  
If the task running times vary a lot, consider using the //thread pool pattern//. Have a look at **GNU parallel**, for instance.
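If GNU parallel is not installed, the same thread-pool behaviour can be sketched with ''xargs -P''; the ''echo'' below is a stand-in for your real program:

```shell
# Thread-pool pattern: keep at most 4 tasks running; as soon as one
# finishes, the next is started. Replace the sh -c body with the real
# invocation, e.g. './program {} > {}.out 2>&1'.
seq 1 20 | xargs -P 4 -I{} sh -c 'echo "task {} done"'
```

On a 20-core node you would set ''-P 20'' so that a new task starts the moment a core becomes free.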
  
==== OpenMP Jobs ====
  
For OpenMP jobs, set the ''%%--%%cpus-per-task'' parameter. You could specify a ''%%--%%mem-per-cpu'' value, but in this case you have to divide the total RAM required by your program by the number of threads. E.g. if your application needs 4000 MB and you want to run 20 threads, then you have to set ''%%--%%mem-per-cpu=200'' (4000/20 = 200). However, it's also possible to specify the total amount of RAM using the ''%%--%%mem'' parameter. Don't forget to set the ''OMP_NUM_THREADS'' environment variable. Example:
  
<code bash>#!/bin/bash
#SBATCH --partition=fuchs
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=4000
#SBATCH --mail-type=ALL
#SBATCH --time=48:00:00
  
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./your_omp_program
</code>
  
==== MPI Jobs ====
  
**Remember:** Nodes are used exclusively. Each node has many [[#Node Types|CPU cores]]. If you want to run small jobs (i.e. where more than one job could be run on a single node concurrently), consider running more than one computation within a job. Otherwise it will most likely result in a waste of resources and will lead to a longer queueing time (for you and others).
  
See also: http://www.schedmd.com/slurmdocs/faq.html#steps
  
As an example, we want to run a program that spawns 80 MPI ranks and where 1200 MB of RAM are allocated for each rank.
  
<code bash>#!/bin/bash
#SBATCH --partition=fuchs
#SBATCH --nodes=4
#SBATCH --ntasks=80
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1200
#SBATCH --mail-type=ALL
#SBATCH --extra-node-info=2:10:1 # Don't use this with Intel MPI.
#SBATCH --time=48:00:00
  
module load mpi/.../<version>
mpirun ./your_mpi_program
</code>
  
<note>If the final amount of memory requested by a job can't be satisfied by any of the nodes configured in the partition, the job will be rejected. This could happen if ''%%--%%mem-per-cpu'' is used for a job allocation and ''%%--%%mem-per-cpu'' times the number of CPUs on a node is greater than the total memory of that node. Please see [[#Memory Allocation]].</note>
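To illustrate this with hypothetical numbers for a 128 GB, 20-core FUCHS node (example values, not site recommendations; the two variants are alternatives, not to be combined):

```shell
# Variant A (rejected): the per-CPU request multiplies up.
# 20 CPUs x 7000 MB = 140 GB, more than the 128 GB a node has.
#SBATCH --ntasks=20
#SBATCH --mem-per-cpu=7000

# Variant B (fits): request the total memory for the job instead.
#SBATCH --mem=120g
```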
  
Some MPI installations support the ''srun'' command (instead of or in addition to ''mpirun''), e.g.:

  [...]
  module load mpi/.../<version>
  srun --mpi=pmix ./your_mpi_program

MPI implementations are typically designed to work seamlessly with job schedulers like Slurm. When you launch MPI tasks with ''mpirun'' (or ''srun'') inside your job script, the MPI library uses the information provided by Slurm (via environment variables or other means) to determine the communication topology and allocate processes accordingly.

==== Hybrid Jobs: MPI/OpenMP ====
  
MPI example script (20 ranks, 5 threads each and 200 MB per thread, i.e. 1 GB per rank; so, for 20*5 threads, you'll get five 20-core nodes):
  
<code bash>#!/bin/bash
#SBATCH --partition=fuchs
#SBATCH --ntasks=20
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=200
#SBATCH --mail-type=ALL
#SBATCH --extra-node-info=2:10:1 # Don't use this with Intel MPI.
#SBATCH --time=48:00:00
  
 +module load mpi/.../<version>
 export OMP_NUM_THREADS=5 export OMP_NUM_THREADS=5
 +# When using MVAPICH2 disable core affinity.
 export MV2_ENABLE_AFFINITY=0 export MV2_ENABLE_AFFINITY=0
 mpirun -np 20 ./example_program mpirun -np 20 ./example_program
 Please note that this is just an example. You may or may not be able to run it as is with your software, which is likely to scale differently. Please note that this is just an example. You may or may not be able to run it as is with your software, which is likely to scale differently.
  
-You have to disable core affinity when running hybrid jobs with MVAPICH2. Otherwise all threads of an MPI rank will be pinned to the same core. Our example now includes the command+You have to disable core affinity when running hybrid jobs with MVAPICH2((MVAPICH2 is another MPI library)). Otherwise all threads of an MPI rank will be pinned to the same core. Our example now includes the command
  
 <code bash> <code bash>
Line 376: Line 407:
 </code> </code>
  
-which disables this feature. The OS scheduler is then responsible for thread placement at runtime, but it may migrate threads between cores dynamically. Such migrations invalidate caches and degrade performance, which can be prevented by thread pinning.+which disables this feature. The OS scheduler is then responsible for thread placement at runtime, but it may migrate threads between cores dynamically. Such migrations invalidate caches and degrade performance, which can be prevented by thread pinning (topic not covered here).
 + 
 +When using **Intel MPI**, please also check its [[https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-15/environment-variables-for-process-pinning.html|pinning]] parameters. 
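
Thread pinning itself can be sketched with the standard OpenMP environment variables ''OMP_PLACES'' and ''OMP_PROC_BIND''; whether this interacts well with your MPI library's own binding has to be checked case by case:

<code bash>
export OMP_NUM_THREADS=5
# Pin each OpenMP thread to its own core and keep the threads of a rank
# on neighbouring cores, so they stay put for the whole run.
export OMP_PLACES=cores
export OMP_PROC_BIND=close
</code>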
 + 
 +==== Memory Allocation ==== 
 + 
 +Normally the memory available per CPU thread is the total amount of RAM divided by the number of threads, e.g. 128GB / 40 threads = 3.2GB per thread. Keep in mind that the FUCHS cluster provides two threads per core. Now imagine you need more memory, say 8192MB per task. Type ''scontrol show node=<node>'' and look for the memory value, or log into an Ivy Bridge node and type ''free -m''. In both cases you will find 128768M(B) of memory. With that we calculate how many such processes fit on one node: 128768MB / 8192MB = 15.72..., i.e. at most 15. From this we can determine how many nodes we need. In the following example we want to run 69 processes with 8192MB each.
 + 
 +<code> 
 +#!/bin/bash 
 +#SBATCH --job-name=<your_job_name> 
 +#SBATCH --partition=fuchs 
 +#SBATCH --ntasks=69            # Total number of processes.
 +#SBATCH --cpus-per-task=1      # One CPU per task.
 + 
 +# #SBATCH --mem-per-cpu=8192   # We can't use this argument here: it is mutually exclusive
 +                               # with --mem below, therefore it's commented out.
 + 
 +#SBATCH --mem=0                # Use this argument instead, which means "full memory of the node"  
 +#SBATCH --ntasks-per-node=15   # At most 15 tasks of 8192MB fit on one node (see calculation above).
 + 
 +srun hostname 
 +</code> 
 + 
 +If everything works fine, you will be granted 5 nodes, for example 4 nodes with 14 tasks each and 1 node with 13 tasks, i.e. 56 tasks + 13 tasks = 69 tasks, as requested.
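
The per-node arithmetic above can be reproduced with plain shell arithmetic; integer division rounds down, which directly gives the "at most 15 tasks per node" figure:

<code bash>
node_mem_mb=128768        # memory of one node, as reported by 'free -m'
mem_per_task_mb=8192      # memory we want per task
total_tasks=69

tasks_per_node=$(( node_mem_mb / mem_per_task_mb ))                      # floors to 15
nodes_needed=$(( (total_tasks + tasks_per_node - 1) / tasks_per_node ))  # ceiling: 5
echo "tasks_per_node=${tasks_per_node} nodes_needed=${nodes_needed}"
</code>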
  
 ==== Local Storage ==== ==== Local Storage ====
 For interactive workflows you can use SLURM's ''salloc'' command. With ''salloc'' almost the same options can be used as with ''sbatch'', e.g.: For interactive workflows you can use SLURM's ''salloc'' command. With ''salloc'' almost the same options can be used as with ''sbatch'', e.g.:
  
-<code>[user@loginnode ~]$ salloc --nodes=4 --time=0:45:00 --mem=100g --partition=test +<code>[user@fuchs ~]$ salloc --nodes=4 --time=0:45:00 --mem=64g --partition=fuchs 
-salloc: Granted job allocation 197553+salloc: Granted job allocation 1122971
 salloc: Waiting for resource configuration salloc: Waiting for resource configuration
-salloc: Nodes node45-[002-005] are ready for job +salloc: Nodes node27-[012-015] are ready for job 
-[user@loginnode ~]$ +[user@fuchs ~]$ 
 </code> </code>
  
 Now you can ''ssh'' into the nodes that were allocated for the job and run further commands, e.g.: Now you can ''ssh'' into the nodes that were allocated for the job and run further commands, e.g.:
- +<code>[user@fuchs ~]$ ssh node27-012 
-<code> +[user@node27-012 ~]$ hostname 
-[user@loginnode ~]$ ssh node45-002 +node27-012 
-[user@node45-002 ~]$ hostname +[user@node27-012 ~]$ exit 
-node45-002.cm.cluster +logout 
-[user@node45-002 ~]$ logout +Connection to node27-012 closed.
-Connection to node45-002 closed. +
-... +
-[user@loginnode ~]$ ssh node45-003 +
-[user@node45-003 ~]$ hostname +
-node45-003.cm.cluster +
-[user@node45-003 ~]$ logout +
-Connection to node45-003 closed. +
-... +
-[user@loginnode ~]$ ssh node45-005 +
-[user@node45-005 ~]$ hostname +
-node45-005.cm.cluster +
-[user@node45-005 ~]$ logout +
-Connection to node45-005 closed.+
 </code> </code>
  
 Or you can use ''srun'' for running a command on all allocated nodes in parallel: Or you can use ''srun'' for running a command on all allocated nodes in parallel:
  
-<code>[user@loginnode ~]$ srun hostname +<code>[user@fuchs ~]$ srun hostname 
-node45-002.cm.cluster +node27-013 
-node45-003.cm.cluster +node27-012 
-node45-005.cm.cluster +node27-015 
-node45-004.cm.cluster +node27-014
-[user@loginnode ~]$+
 </code> </code>
  
 Finally you can terminate your interactive job session by running ''exit'', which will free the allocated nodes: Finally you can terminate your interactive job session by running ''exit'', which will free the allocated nodes:
  
-<code>[user@loginnode ~]$ exit +<code>[user@fuchs ~]$ exit 
-salloc: Relinquishing job allocation 197553 +exit 
-[user@loginnode ~]$ +salloc: Relinquishing job allocation 1122971
 </code> </code>
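
For a single command, allocation and execution can also be combined in one ''srun'' call, which frees the nodes automatically when the command finishes (a sketch; the options shown follow the examples above):

<code bash>
# Allocate 4 nodes for up to 10 minutes, run 'hostname' once per node,
# then release the allocation.
srun --partition=fuchs --nodes=4 --ntasks-per-node=1 --time=0:10:00 hostname
</code>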
  
public/usage/fuchs.1589532626.txt.gz · Last modified: 2020/05/15 10:50 by geier
CC Attribution-Noncommercial-Share Alike 4.0 International