<note important>
You may receive a warning from the system that something is wrong with the security. We switched the old LOEWE Cluster IP to our new GOETHE Cluster, so if you used the LOEWE Cluster in the past you will see such a warning when you connect.
If you use Linux, just look up ''
On Windows systems please use/install a Windows SSH client (e.g. PuTTY, or the Cygwin ssh package).
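The file referred to above is presumably OpenSSH's ''known_hosts'' file, which caches the host keys of systems you have connected to. A minimal sketch for removing the stale entry on Linux, with the hostname left as a placeholder (the actual login address is not given on this page):

<code bash>
# Remove the cached host key of the old address from ~/.ssh/known_hosts.
# Replace <goethe-login-host> with the actual login hostname or IP.
ssh-keygen -R <goethe-login-host>

# On the next login the new host key is offered again; verify and accept it.
ssh <your-account>@<goethe-login-host>
</code>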
On our systems, compute jobs and resources are managed by SLURM (Simple Linux Utility for Resource Management). The compute nodes are organized in partitions (or queues):
^Partition^Node type^GPU^Implemented^
| ''
| ''
| ''
| ''
Nodes are used **exclusively**, i.e. a node is never shared between different jobs.
The following instructions shall provide you with the basic information you need to get started with SLURM on our systems. However, the official SLURM documentation covers some more use cases (also in more detail). Please read the SLURM man pages (e.g. ''
Helpful SLURM links: [[https://
SLURM documentation:
==== The test Partition: Your First Job Script ====
==== Job Monitoring ====
For job monitoring (to check the current state of your jobs) you can use the ''
If you need to cancel a job, you can use the ''
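The standard SLURM commands for these tasks are ''squeue'' (job state) and ''scancel'' (job cancellation); the short sketch below assumes these are the commands meant here:

<code bash>
# Show the current state of all your own jobs (pending, running, ...)
squeue -u $USER

# Show full details of a single job, e.g. job ID 123456
scontrol show job 123456

# Cancel a job by its job ID
scancel 123456
</code>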
==== Node Types And Constraints ====
+ | |||
+ | <note important> | ||
On Goethe-HLR **four different types** of compute nodes are available:
^Number^Type^Vendor^CPU^GPU^Cores per CPU^Cores per Node^Hyper-Threads per Node^RAM [GB]^
|412|dual-socket|Intel|Xeon Skylake Gold 6148 |none|20|40|80|192|
|72 |dual-socket|Intel|Xeon Skylake Gold 6148 |none|20|40|80|772|
|139|dual-socket|Intel|Xeon Broadwell E5-2640 v4|none|10|20|40|128|
|112|dual-socket|AMD |EPYC 7452 |8x MI50 \\ 16GB|32|64|128|512|
In order to separate the node types, we employ the concept of partitions. We provide three partitions
|general1|''#
|general2|''#
|gpu|''#
|test|''#
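Assuming the truncated ''#SBATCH'' entries in the table follow SLURM's standard partition syntax, a job script would select one of the partitions roughly like this (the partition names are taken from the table above, the rest is a generic sketch):

<code bash>
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=general1   # or general2, gpu, test (see table above)
#SBATCH --nodes=1
#SBATCH --time=01:00:00

srun ./my_program
</code>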
| ''
For the partition ''

^Limit^Value^Description^
| ''
| ''
| ''
| ''
==== GPU Jobs ====
Since December 2020, GPU nodes are part of the cluster. If you want to use GPUs in your calculations, select the ''gpu'' partition.
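A minimal sketch of a GPU job script, assuming SLURM's standard ''--partition'' and ''--gres'' syntax (the GRES name, GPU count and module name below are assumptions, not site-specific values from this page):

<code bash>
#!/bin/bash
#SBATCH --partition=gpu          # GPU partition (see partition table above)
#SBATCH --nodes=1
#SBATCH --gres=gpu:1             # request one GPU; the GRES name/count syntax is an assumption
#SBATCH --time=01:00:00

# Hypothetical module name; check 'module avail' for the actual ROCm/HIP
# environment provided for the AMD MI50 cards.
module load rocm

srun ./my_gpu_program
</code>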
==== Hyper-Threading ====
On compute nodes you can use Hyper-Threading. That means, in addition to each physical CPU core a virtual core is available. SLURM identifies all physical and virtual cores of a node, so that you have 80 logical CPU cores on an Intel Skylake node, 40 logical CPU cores on an Intel Broadwell or Ivy Bridge node, and 128 logical CPU cores on an AMD EPYC GPU node. If you don't want to use HT, you can disable it by adding
^Node type^hyperthreading=OFF^#
|Skylake |''#
|Broadwell / Ivy Bridge|''#
|AMD EPYC 7452 |
to your job script. Then you'll get half the threads per node (which will correspond to the number of physical cores). This can be beneficial in some cases (some jobs may run faster and/or more stably).
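One standard SLURM way to avoid the virtual cores is the ''--hint=nomultithread'' option combined with one task per physical core; the sketch below uses that approach as an assumption and may differ from the cluster's own recommended ''#SBATCH'' lines in the table above:

<code bash>
#!/bin/bash
#SBATCH --partition=general1      # assumed Skylake partition: 40 physical cores, 80 hyper-threads
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40      # one task per physical core
#SBATCH --hint=nomultithread      # ask SLURM not to place tasks on the virtual (HT) cores

srun ./my_program
</code>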