LOEWE-CSC Cluster Usage
The LOEWE-CSC is a general-purpose computer cluster based on several CPU and GPU architectures, running Scientific Linux 6 and managed by SLURM. We also maintain a large collection of free and commercial software for your scientific work. Please read the following instructions and make sure you fully understand this guide before using the system.
Login
An SSH client is required to connect to the cluster. On a Linux system the command is usually:
ssh <user_account>@loewe-csc.hhlr-gu.de
On Windows systems please use/install a Windows SSH client (e.g. PuTTY, or the Cygwin ssh package).
After your first login you will get a message that your password has expired and has to be changed. Please enter the password provided by the CSC at the prompt, choose a new one and retype it. You will be logged out automatically. Now you can log in with your new password and work on the cluster.
Please note that CPU time per process is limited on the login nodes. You can check the current limit by running
ulimit -t
on the command line. On a login node, any process that exceeds the CPU-time limit (e.g. a long-running test program or a long-running rsync) will be killed automatically.
Environment Modules
There are several versions of software packages installed on our systems. The same name for an executable (e.g. mpirun) and/or library file may be used by more than one package. The environment module system, with its module
command, helps to keep them apart and prevents name clashes. You can list the module-managed software by running module avail
on the command line. Other important commands are module load <name>
(loads a module) and module list
(lists the already loaded modules). E.g. if you want to work with Open MPI 1.8.1 and the GCC, run module load openmpi/gcc/1.8.1
A number of additional “unstable” and “deprecated” module files are kept in separate name spaces. If you want to see/use them, please run the following commands or add them to your .bashrc
file:
module --append use /cm/shared/modulefiles-unstable
module --append use /cm/shared/modulefiles-deprecated
or
module load unstable
module load deprecated
If you want to know more about module commands, the module help
command will give you an overview.
Despite the number of available modules, you might want to install your own software in your home directory. Also, you can write your own module files (see module help use.own
). Ambitious users might find it useful to take a look at Spack, which is another framework for managing software packages (please also see this guide).
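For example, a minimal sketch of making a personal module file directory visible to the module command (the directory name ~/privatemodules is only an example; see module help use.own for the mechanism the module system itself provides):
# Assumption: you keep your own module files in ~/privatemodules.
mkdir -p $HOME/privatemodules
module --append use $HOME/privatemodules
# Your own module files now show up next to the system-wide ones.
module avail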
Compiling Software
You can compile your software on the login nodes (or on any other node, inside a job allocation). On LOEWE-CSC several compiler suites are available:
- GNU compilers version 4.4.7 (built-in distribution default) + higher GCC versions (as modules)
- Intel compilers version 17.0.1 (as a module)
- PGI compilers version 16.10 (as a module)
Debugging
The TotalView parallel debugger is available on the LOEWE-CSC cluster. Follow these steps to start a debugging session:
- Compile your code with your favored MPI using the debug option -g, e.g.
mpicc -g -o mpi_prog mpi_prog.c
- Load the TotalView module by running
module load totalview
- Allocate the resources you need using salloc, e.g.
salloc -n 4 --partition=test --time=00:59:00
- Start an interactive debugging session, e.g.
totalview -args srun -n 4 ./mpi_prog     (MVAPICH2)
or
totalview -args mpirun -np 4 ./mpi_prog  (Open MPI)
Please note the difference between the MVAPICH2 and Open MPI command lines. For a simple debugging session click 'OK' in the Startup Parameters dialog. Now you can start your application by clicking 'Go'. To follow your program code and set breakpoints, answer the question about stopping your application with 'Yes'. Afterwards click 'Go' again. For more advanced use of the TotalView debugger please read the official documentation.
Storage
There are various storage systems available on the cluster. In this section we describe the most relevant:
- your home directory /home/<group>/<user> (NFS, slow),
- your scratch directory /scratch/<group>/<user> (parallel file system FhGFS, fast),
- the non-shared local storage (i.e. only accessible from the compute node it's connected to, max. 1.4 TB, slow) under /local/$SLURM_JOB_ID on each compute node
- and the two (slow) archive file systems /data01 and /data02 (explained at the end of this section).
Please use your home directory for small permanent files, e.g. source files, libraries and executables. Use the scratch space for large temporary job data and delete the data as soon as you no longer need it, e.g. when it's older than 30 days.
By default, the space in your home directory is limited to 10 GB and in your scratch directory to 5 TB and/or 800000 inodes (which corresponds to approximately 200000+ files). You can check your homedir and scratch usage by running the quota
command on a login node.
If you need local storage on the compute nodes, you have to add the --tmp
parameter to your job script (see SLURM section below). Set the amount of storage in megabytes, e.g. set --tmp=5000
to allocate 5 GB of local disk space. The local directory (/local/$SLURM_JOB_ID
) is deleted after the corresponding job has finished. If, for some reason, you don't want the data to be deleted (e.g. for debugging), you can use salloc
instead of sbatch
and work interactively (see man salloc
). Or, one can put an rsync
at the end of the job script, in order to save the local data to /scratch
just before the job exits:
...
mkdir /scratch/<groupid>/<userid>/$SLURM_JOBID
scontrol show hostnames $SLURM_JOB_NODELIST | xargs -i ssh {} \
    rsync -a /local/$SLURM_JOBID/ \
    /scratch/<groupid>/<userid>/$SLURM_JOBID/{}
In addition to the “volatile” /scratch
and the permanent /home
, which come along with every user account, more permanent disk space (2 × N, where N ≤ 10 TB) can be requested by group leaders for archiving. Upon request, two file systems will be created for every group member, to be accessed through rsync
(if you want to know more about rsync, please read its excellent man page), e.g. list the contents of your folder and archive a /scratch directory:
rsync data01:/archive/<group>/<user>/
...
cd /scratch/<group>/<user>/
rsync [--progress] -a <somefolder> data01:/archive/<group>/<user>/
or, for data02
:
rsync data02:/archive/<group>/<user>/
...
cd /scratch/<group>/<user>/
rsync [--progress] -a <somefolder> data02:/archive/<group>/<user>/
The space on each of the two systems is limited to N. Limits are set for an entire group (there's no user quota). The disk usage can be checked by running
df -h /data0{1,2}/<group>
or
quota
on the command line. The corresponding hardware resides in separate server rooms. There is no automatic backup. However, a possible backup scenario is to back up your data manually to both storage systems, data01 and data02, e.g. at the end of a compute job. Note: although the archive file systems are mounted through NFS, don't use the archive for direct job I/O; please use rsync as described above.
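For instance, a minimal sketch of such a manual end-of-job backup (myresults and the group/user placeholders are assumptions; adjust them to your setup):
# Copy the results of a finished computation to both archive file systems.
cd /scratch/<group>/<user>/
rsync -a myresults/ data01:/archive/<group>/<user>/myresults/
rsync -a myresults/ data02:/archive/<group>/<user>/myresults/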
Although our storage systems are protected by RAID mechanisms, we can't guarantee the safety of your data. It is within the responsibility of the user to backup important files.
Running Jobs With SLURM
On our systems, compute jobs and resources are managed by SLURM (Simple Linux Utility for Resource Management). Most of the compute nodes are organized in the partition (or queue) named parallel
. The GPU nodes are in a separate partition called gpu
. There is also a small test partition called test
. You can see more details (the current number of nodes in each partition and their state) by running the sinfo
command on a login node.
Partition | Node types |
---|---|
parallel | Intel and AMD CPU nodes |
gpu | S10000 GPU nodes |
test | AMD CPU nodes |
Nodes are used exclusively, i.e. only whole nodes are allocated for a job and no other job can use the same nodes concurrently.
In this document we discuss several job types and use cases. In most cases, a compute job falls under one (or more) of the categories covered in the sections below, e.g. bundles of single-threaded tasks, job arrays, OpenMP jobs, MPI jobs, hybrid MPI/OpenMP jobs or GPU jobs.
For every compute job you have to submit a job script (unless working interactively using salloc
or srun
, see man page for more information). If jobscript.sh
is such a script, then a job can be enqueued by running
sbatch jobscript.sh
on a login node. A SLURM job script is a shell script which may contain SLURM directives (options), i.e. pseudo-comment lines starting with
#SBATCH ...
The SLURM options define the resources to be allocated for the job (and some other properties). Apart from these directives, the script contains the “job logic”, i.e. the commands to be executed.
Helpful SLURM links
The following instructions shall provide you with the basic information you need to get started with SLURM on our systems. However, the official SLURM documentation covers some more use cases (also in more detail). Please read the SLURM man pages (e.g. man sbatch
or man salloc
) and/or visit http://www.schedmd.com/slurmdocs/. It's highly recommended.
The test Partition: Your First Job Script
Besides the parallel
and the gpu
partition, where you should run your production jobs, you can use the test partition for pre-production runs or tests. In test you can allocate up to six 24-core AMD nodes (the number of nodes in test can change dynamically, however at least six nodes are always part of test) and run jobs with a walltime of no longer than one hour. Please also note that the nodes in test have a slower connection to the /scratch file system than the production nodes. In the following example we allocate 72 CPU cores and 512 MB per core for 5 minutes (SLURM may kill the job after that time if it's still running):
#!/bin/bash
#SBATCH --job-name=foo
#SBATCH --partition=test
#SBATCH --constraint=dual 1)
#SBATCH --nodes=3
#SBATCH --ntasks=72
#SBATCH --cpus-per-task=1 2)
#SBATCH --mem-per-cpu=512
#SBATCH --time=00:05:00
#SBATCH --no-requeue 3)
#SBATCH --mail-type=FAIL 4)
srun hostname
1) See next section.
2) For SLURM, a CPU core (a CPU thread, to be more precise) is a CPU.
3) Prevent the job from being requeued after node failure.
4) Send an e-mail if something goes wrong.
In this example three nodes are allocated. The srun command is responsible for distributing the program (hostname in our case) across the allocated resources, so that 24 instances of hostname run concurrently on each of the allocated nodes. Please note that this is not the only way to run or distribute your processes. Other cases and methods are covered later in this document (and even more methods exist).
Although nodes are allocated exclusively, you should always specify a memory value that reflects the RAM requirements of your job. If you don't set a memory value, a default of 250 MB per core is used. The scheduler treats RAM as a consumable resource. As a consequence, if you omit the --nodes
parameter (so that only the number of CPU cores is defined) and allocate more memory per core than there is on a node, you'll automatically get more nodes if the job doesn't fit otherwise. Moreover, jobs are killed through SLURM's memory enforcement if they use more memory than requested.
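As a hypothetical example (the numbers are only for illustration): on a 24-core Magny-Cours node with 64 GB of RAM, requesting
#SBATCH --ntasks=24
#SBATCH --mem-per-cpu=4000
asks for 24 × 4000 MB = 96 GB in total, which does not fit on a single node, so SLURM will allocate a second node.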
As already mentioned, after saving the above job script as e.g. jobscript.sh
, you can submit it by running
sbatch jobscript.sh
on the command line. The job's output streams (stdout
and stderr
) will be joined and saved to slurm-ID.out
, where ID
is a SLURM job ID, which is assigned automatically. You can change this behavior by adding an --output
and/or --error
argument to the SLURM options.
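For example (a sketch; the file names are arbitrary, and %j is replaced by the job ID):
#SBATCH --output=myjob-%j.out
#SBATCH --error=myjob-%j.err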
For job monitoring (to check the current state of your jobs) you can use the squeue
command. Depending on the current cluster utilization (and other factors), your job(s) may take a while to start. See also the Queueing And Monitoring section below.
If you need to cancel a job, you can use the scancel
command (please see the manpage
, man scancel
, for further details).
Node Types And Constraints
On LOEWE-CSC four different types of compute nodes are available. There are
- 438 dual-socket AMD Magny-Cours CPU/GPU nodes with 24 CPU cores, 64 GB of RAM and 1 AMD Radeon HD 5870 with 1 GB of RAM,
- 198 dual-socket Intel Xeon Ivy Bridge E5-2670v2 nodes with 20 CPU cores and 128 GB of RAM,
- 139 dual-socket Intel Xeon Broadwell E5-2640 v4 nodes with 20 CPU cores and 128 GB of RAM and
- 50 dual-socket Intel Xeon Ivy Bridge E5-2650v2 CPU/GPU nodes with 12 CPU cores, 128 GB of RAM and 2 AMD FirePro S10000 dual GPU cards, each with 12 GB of RAM.
In order to separate the node types, we employ the concept of constraints. However, as already mentioned, the S10000 GPU nodes are additionally kept in a separate partition. When running CPU jobs, you can select the node type you prefer by setting
- #SBATCH --constraint=dual for AMD Magny-Cours CPU/GPU nodes,
- #SBATCH --constraint=intel20 for Intel Ivy Bridge CPU nodes or
- #SBATCH --constraint=broadwell for Intel Broadwell CPU nodes.
Unless you know what you're doing, please always specify a node type. If you omit the --constraint option, your job will run “somewhere”: the default constraint cpu will be set implicitly, which means the job may run on nodes of any type.
Per-User Resource Limits
On LOEWE-CSC, you have the following default limits for the partitions parallel
and gpu
(most of them are enforced by SLURM QoS rules):
Limit | parallel | gpu | Description |
---|---|---|---|
MaxJobsPU | 40 | 40 | max. number of jobs a user is able to run simultaneously |
MaxSubmitPU | 50 | 50 | max. number of jobs in running or pending state |
MaxNodesPU | 150 | 50 | max. number of nodes a user is able to use at the same time |
MaxArraySize | 1001 | 1001 | the maximum job array size |
The walltime limit (--time
parameter) in the parallel
and gpu
partition is 30 days, and 1 hour in the test
partition. If you set --time=
T, your job will be killed after T+E if not finished, where E is some “extra time”; E can vary between 0 and 2 hours. However, as you can imagine, the shorter the specified walltime T (and the smaller the job), the better it fits into (time) gaps (e.g. when backfilling is applied, which is a queuing strategy we use).
GPU Jobs
If you want to use GPUs in your calculations, select the gpu
partition by setting --partition
to gpu
. On these nodes you can use two AMD FirePro S10000 dual GPU cards with 12 GB of memory, so you have four GPUs (6 GB each) per node. The AMD APP SDK is available as a module. You can list the available version(s) by running module avail amdappsdk
.
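A minimal GPU job script might look like the following sketch (my_opencl_program is a placeholder for your own executable, and the exact module name/version should be checked with module avail amdappsdk):
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=2000
#SBATCH --time=01:00:00
# Load the OpenCL environment (version may differ on your system).
module load amdappsdk
# The program is expected to make use of the node's four GPUs itself.
./my_opencl_program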
Hyper-Threading
On the Intel nodes (constraint intel20
, broadwell
and in the gpu
partition) you can use Hyper-Threading. That means that in addition to each physical CPU core, a virtual core is available. SLURM identifies all physical and virtual cores of a node, so that you have 40 logical CPU cores on an Intel node and 24 logical CPU cores on a GPU node. If you don't want to use HT, you can disable it by adding
#SBATCH --extra-node-info=2:10:1
to your job script (the value 2:10:1 stands for 2 sockets, 10 cores per socket and 1 thread per core). Then you'll get half the logical CPUs per node (which corresponds to the number of physical cores). This can be beneficial in some cases (some jobs may run faster and/or more stably).
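Put together, a job script for an Intel node without Hyper-Threading could look like this sketch (./program is a placeholder; the memory and time values are only examples):
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=intel20
#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH --cpus-per-task=1
# Use only the physical cores (2 sockets x 10 cores x 1 thread).
#SBATCH --extra-node-info=2:10:1
#SBATCH --mem-per-cpu=1000
#SBATCH --time=01:00:00
srun ./program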
Bundling Single-Threaded Tasks
Note: Please also see the Job Arrays section below. Because only full nodes are given to you, you have to ensure that the available resources (the 24 CPU cores on a Magny-Cours compute node or the 20 cores on an Intel node) are used efficiently. Please combine as many single-threaded jobs as possible into one. The limits for the number of combined jobs are given by the number of cores and the available memory. A simple job script to start 24 independent processes may look like this one:
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000
#SBATCH --time=01:00:00
#SBATCH --mail-type=FAIL
export OMP_NUM_THREADS=1
#
# Replace by a for loop.
./program input01 >& 01.out &
./program input02 >& 02.out &
...
./program input24 >& 24.out &
# Wait for all child processes to terminate.
wait
In this (SPMD-style) example we assume that there is a program (called program) which is run 24 times on 24 different inputs (usually input files). Both output streams (stdout and stderr) of each process are redirected to a file N.out. A job script is always executed on the first allocated node, and since exactly one node is allocated, we don't need to use srun. Further, we assume that the executable is located in the directory from which the job was submitted (that is the initial working directory).
If the running times of your processes vary a lot, consider using the thread pool pattern. Have a look at the xargs -P
command, for instance.
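A minimal sketch of such a “pool” with xargs (the input file pattern input*.dat and the program name ./program are assumptions):
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000
#SBATCH --time=01:00:00
export OMP_NUM_THREADS=1
# Keep at most 24 processes running at any time; as soon as one process
# finishes, xargs starts the next one with the next input file.
ls input*.dat | xargs -P 24 -I {} sh -c './program {} > {}.out 2>&1'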
Job Arrays
If you have a lot of single-core computations to run, job arrays are worth a look. Telling SLURM to run a job script as a job array will result in running that script multiple times (after the corresponding resources have been allocated). Each instance will have a distinct SLURM_ARRAY_TASK_ID
variable defined in its environment.
Due to our full-node policy, you still have to ensure that your jobs don't waste any resources. Let's say you have 192 single-core tasks. In the following example 192 tasks are run inside a job array while ensuring that only 24-core nodes are used and that each node runs exactly 24 tasks in parallel.
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000
#SBATCH --time=00:10:00
#SBATCH --array=0-191:24
#SBATCH --mail-type=FAIL

my_task() {
    # Print the given "global task number" with leading zeroes
    # followed by the hostname of the executing node.
    K=$(printf "%03d" $1)
    echo "$K: $HOSTNAME"
    # Do nothing, just sleep for 3 seconds.
    sleep 3
}

#
# Every 24-task block will run on a separate node.
for I in $(seq 24); do
    # This is the "global task number". Since we have an array of
    # 192 tasks, J will range from 1 to 192.
    J=$(($SLURM_ARRAY_TASK_ID+$I))
    # Put each task into background, so that tasks are executed
    # concurrently.
    my_task $J &
    # Wait a little before starting the next one.
    sleep 1
done
# Wait for all child processes to terminate.
wait
If the task running times vary a lot, consider using the thread pool pattern, e.g. with the xargs -P command as sketched in the previous section.
OpenMP Jobs
For OpenMP jobs, set the --cpus-per-task
parameter. As usual, you should also specify a --mem-per-cpu
value. But in this case you have to divide the total RAM required by your program by the number of threads. E.g. if your application needs 4800 MB and you want to run 24 threads, then you have to set --mem-per-cpu=200
(4800/24 = 200). Don't forget to set the OMP_NUM_THREADS
environment variable. Example:
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=200
#SBATCH --mail-type=ALL
#SBATCH --time=48:00:00
export OMP_NUM_THREADS=24
./omp_program
MPI Libraries
Currently, we provide the following MPI implementations:
- MVAPICH2 version 2.0 (Intel and PGI compiler versions), modules:
- mpi/mvapich2/intel-17.0.1/2.2
- mpi/mvapich2/intel-17.0.1/2.2-dbg
- mpi/mvapich2/pgi-16.10/2.2
- Open MPI version 1.8.1 (Intel compiler version), module:
- openmpi/intel-17.0.1/1.8.1
When loading an intel
or pgi
MPI module, you don't have to load any of the compiler modules. A corresponding compiler module is loaded automatically.
With MVAPICH2, please start your MPI programs with the srun command instead of mpirun (see the examples below).
Note: MVAPICH2 is installed with core affinity enabled. Every MPI rank is pinned to a CPU core during run time. This prevents the OS scheduler from shifting the MPI ranks from core to core, which would invalidate caches and degrade performance. But you have to be careful if you want to run MPI + OpenMP jobs (see the Hybrid Jobs section below).
MPI Jobs
Remember: Nodes are used exclusively. Each node has 20 or more CPU cores. If you want to run a lot of small jobs (i.e. where more than one job could be run on a single node concurrently), consider running more than one computation within a job (see next section). Otherwise it will most likely result in a waste of resources and will lead to a longer queueing time (for you and others).
See also: http://www.schedmd.com/slurmdocs/faq.html#steps
As an example, we want to run a program with 96 Open MPI ranks, allocating 1200 MB of RAM for each rank.
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --ntasks=96
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1200
#SBATCH --mail-type=ALL
#SBATCH --time=48:00:00
module load openmpi/intel-XX.X.X/...
export OMP_NUM_THREADS=1
mpirun ./example_program
The main difference between an Open MPI script and an MVAPICH2 script is the command for executing your parallel program. With Open MPI you have to use mpirun
(or srun --mpi=pmi2
) and with MVAPICH2 you have to use srun
(unless you're using one of the hydra modules, which in turn provide an mpirun command). Otherwise an MVAPICH2 script looks almost the same.
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --ntasks=96
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1200
#SBATCH --mail-type=ALL
#SBATCH --time=48:00:00
module load mpi/mvapich2/intel-XX.X.X/...
export OMP_NUM_THREADS=1
srun ./example_program
Note: If you are concerned about InfiniBand bandwidth, SLURM is topology-aware. It “knows” how everything is connected. There is no guarantee that job placement is always optimal, though. However, in most cases you shouldn't worry.
Combining Small MPI Jobs
As mentioned earlier, running small jobs while full nodes are allocated leads to a waste of resources. In cases where you have, let's say, a lot of 12-rank MPI jobs (with similar runtimes and low memory consumption), you can start more than one computation within a single allocation (and on a single node). Open MPI example (running two MPI jobs concurrently on a 24-core node):
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000
#SBATCH --time=48:00:00
#SBATCH --mail-type=FAIL
export OMP_NUM_THREADS=1
mpirun -np 12 ./program input01 >& 01.out &
# Wait a little before starting the next one.
sleep 3
mpirun -np 12 ./program input02 >& 02.out &
# Wait for all child processes to terminate.
wait
You might also need to disable core binding (please see the mpirun
man page, or when using MVAPICH2, set MV2_ENABLE_AFFINITY=0
). Otherwise the ranks of the second run will interfere with the first one.
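For example (a sketch; the --bind-to option refers to Open MPI, so please check the man page of the mpirun you actually load):
# Open MPI: start the ranks without binding them to specific cores.
mpirun --bind-to none -np 12 ./program input01 >& 01.out &
# MVAPICH2: disable its core affinity before starting the ranks.
export MV2_ENABLE_AFFINITY=0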
Hybrid Jobs: MPI/OpenMP
MVAPICH2 example script (24 ranks, 6 threads each and 200 MB per thread, i.e. 1.2 GB per rank; so, for 24*6 threads, you'll get six 24-core nodes):
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=6
#SBATCH --mem-per-cpu=200
#SBATCH --mail-type=ALL
#SBATCH --time=48:00:00
export OMP_NUM_THREADS=6
export MV2_ENABLE_AFFINITY=0
srun -n 24 ./example_program
Please note that this is just an example. Your own software is likely to scale differently, so you may have to adjust the values.
You have to disable the core affinity when running hybrid jobs with MVAPICH2. Otherwise all threads of an MPI rank will be pinned to the same core. Our example now includes the command
export MV2_ENABLE_AFFINITY=0
which disables this feature. The OS scheduler is then responsible for the placement of the threads during the runtime of the program. However, the OS scheduler can dynamically move threads between cores, which invalidates caches and degrades performance. This can be prevented by thread pinning.
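One possible way to pin the OpenMP threads yourself is via the OpenMP runtime, e.g. (a sketch; whether OMP_PROC_BIND is honored depends on the compiler and OpenMP runtime you use):
export OMP_NUM_THREADS=6
# Ask the OpenMP runtime not to move threads between cores.
export OMP_PROC_BIND=true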
Local Storage
On each node there is up to 1.4 TB of local disk space (see also Storage). If you need local storage, you have to add the --tmp
parameter to your SLURM script. Set the amount of storage in megabytes, e.g. set --tmp=5000
to allocate 5 GB of local disk space. The data in the local directory (/local/$SLURM_JOB_ID
) is automatically deleted after the corresponding batch job has finished.
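A minimal sketch of a job script using the local storage (the paths and the program name are placeholders):
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --mem-per-cpu=2000
#SBATCH --time=01:00:00
#SBATCH --tmp=5000
# Work in the node-local directory and copy the results back to
# /scratch before the job ends (the local directory is deleted afterwards).
cd /local/$SLURM_JOB_ID
cp /scratch/<group>/<user>/input.dat .
/path/to/program input.dat > output.dat
cp output.dat /scratch/<group>/<user>/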
Nodes Vs. Tasks And Threads
As already indicated, SLURM resource allocations can be further specified by using the --nodes
parameter (instead of or in addition to --ntasks
). E.g., with the Magny-Cours nodes (constraint dual
),
#SBATCH --nodes=2
#SBATCH --ntasks=48
#SBATCH --cpus-per-task=1
will result in virtually the same resource allocation (i.e. two nodes) as just
#SBATCH --nodes=2
or
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=24
However, for SLURM the three have different meanings (hence resulting in different environments): 48 processes on two nodes vs. 2 processes on two nodes vs. 2 processes on two nodes where each process consists of 24 threads.
Planning Work
Using the --begin
option it's possible to tell SLURM that you need the resources at some point in the future. Also, you might find it useful to use this feature for creating “reservations”. E.g.
- Submit a sleep job (allocate twenty intel20 nodes for 3 days); you can log out after running this command (but check the output of the squeue command first: if there is no corresponding pending job, then something went wrong):
$ sbatch --begin=201X-07-23T08:00 --time=3-0 --nodes=20 \
    --partition=parallel --mem=120g \
    --constraint=intel20 --wrap="sleep 3d"
- Wait until the time has come (07/23/201X 8:00 am or later; there is no guarantee that the allocation will be made on time, but the earlier you submit the job, the more likely you'll get the resources by that time).
- Find out whether the sleep job is running (i.e. is in R state) and run a new job step within that allocation (see also http://slurm.schedmd.com/faq.html#multi_batch):
$ squeue
  JOBID PARTITION   NAME ST    TIME NODES
2717365  parallel sbatch  R 3:28:29    20
$ srun --jobid 2717365 hostname
...
Note: we are using the srun command here; the sbatch command is not supported in this scenario.
- Finally, don't forget to release the allocation if there's time left and the sleep job is still running:
$ scancel 2717365
Queuing Times
After submitting a job you may use the squeue
command to check its status. One can also specify the information to be displayed (please read the man page for more details), e.g.:
squeue -o "%.7i %.9P %.7f %.2t %.10M %.4D %R"
or
squeue --start
The latter shows approximate start times for your jobs. A start time prediction doesn't always exist and is never a guarantee, though.
Upon login via SSH a “message of the day” (MOTD) is shown. The last separator line at the end of the MOTD contains a util and a qtime value, e.g.:
... --- util: 0.97 --- avg./max. qtime (h): 17.41 / 96.69
The util
value is the current utilization A/(A+I) of the partition parallel, where A is the number of allocated nodes and I is the number of idling nodes. The avg.
(max.
) qtime
value is the average (maximum) waiting time (in hours) obtained from the jobs started within the last 72 hours. The information is refreshed every three hours.
While the queuing times may change quickly (see Fig. 1) and range from some minutes to many hours (or even several days), a snapshot of a typical cluster utilization scenario (where > 90% of the currently available resources are allocated) may look like the one in Fig. 2 (the bar chart captures the “job sizes” but doesn't depict the walltimes of the jobs; note that the pending jobs needed even more resources than already allocated).
Fig. 1: Example q-time statistics (two weeks)
Fig. 2: Example cluster utilization