====== LOEWE-CSC Cluster Usage ======

The LOEWE-CSC cluster is operated by the Center for Scientific Computing (CSC) of the Goethe University Frankfurt. This page describes its basic usage.

===== Login =====

An SSH client is required to connect to the cluster. On a Linux system the command is usually:

<code>ssh <your account>@<LOEWE-CSC login node></code>

On Windows systems please use/install a Windows SSH client (e.g. PuTTY, or the Cygwin ssh package).

After your first login you will get a message that your password has expired and has to be changed. Enter the password provided by CSC at the prompt, then choose a new password and retype it. You will be logged out automatically. Now you can log in with your new password and work on the cluster.

<note warning>There is a CPU-time limit on the login nodes. You can display your current limits by running

''ulimit -a''

on the command line. On a login node, any process that exceeds the CPU-time limit (e.g. a long-running test program or a long-running rsync) will be killed automatically.</note>

===== Environment Modules =====

There are several versions of software packages installed on our systems. The same name for an executable (e.g. mpirun) and/or library file may be used by more than one package. The environment module system, with its ''module'' command, lets you select and switch between these versions. To list the available modules, run:

<code>module avail</code>

A number of additional "unstable" and "deprecated" modules can be made available by running

  module --append use /<path to the unstable module files>
  module --append use /<path to the deprecated module files>
or
  module load unstable
  module load deprecated

If you want to know more about module commands, the ''module help'' command and the ''module'' man page are good starting points.

Despite the number of available modules, you might want to install your own software in your home directory. Also, you can write your own module files (see ''man modulefile'') and use them alongside the system-provided ones.
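
A typical module session might then look like the following sketch; the GCC module name below is only an assumption, please take the exact names from the ''module avail'' output:

<code bash>
# List all available modules.
module avail

# Load a module (hypothetical name -- check 'module avail').
module load gcc/7.2.0

# Show the currently loaded modules.
module list

# Unload a single module, or reset the environment entirely.
module unload gcc/7.2.0
module purge
</code>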

===== Compiling Software =====

You can compile your software on the login nodes (or on any other node, inside a job allocation). On LOEWE-CSC several compiler suites are available:

  * GNU compilers version 4.4.7 (built-in distribution default) \\ + higher GCC versions (as modules)
  * Intel compilers version 17.0.1 (as a module)
  * PGI compilers version 16.10 (as a module)
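
As a quick sketch, building a program with one of the compiler suites might look like this; the exact module name is an assumption and should be taken from ''module avail'':

<code bash>
# Load the Intel compiler suite (hypothetical module name).
module load intel/17.0.1

# Compile with optimization.
icc -O2 -o my_prog my_prog.c
</code>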

===== Debugging =====

The [[http://www.roguewave.com|TotalView]] parallel debugger is installed on the cluster. To debug an MPI program, proceed as follows:

  - Compile your code with your favored MPI using the debug option -g, e.g.<code>
mpicc -g -o mpi_prog mpi_prog.c</code>
  - Load the TotalView module by running<code>
module load totalview</code>
  - Allocate the resources you need using salloc, e.g.<code>
salloc -n 4 --partition=test --time=00:30:00</code>
  - Start the debugger by running<code>
totalview -args srun -n 4 ./mpi_prog (MVAPICH2) or
totalview -args mpirun -np 4 ./mpi_prog (Open MPI)</code>

Please notice the difference between the MVAPICH2 and Open MPI command lines. For a simple debugging session, click ''Go'' in the TotalView window once your program has been loaded.

===== Storage =====

There are various storage systems available on the cluster. In this section we describe the most relevant:

  * your home directory under ''/home'' (shared; 10 GB quota by default),
  * your scratch directory under ''/scratch'' (shared, fast; 5 TB quota by default),
  * the non-shared local storage (i.e. only accessible from the compute node it's connected to, max. 1.4 TB, slow) under ''/local''
  * and the two (slow) archive file systems, ''data01'' and ''data02''.

Please use your home directory for small permanent files, e.g. source files, libraries and executables. Use the scratch space for large temporary job data and delete the data as soon as you no longer need it, e.g. when it's older than 30 days.

By default, the space in your home directory is limited to 10 GB and in your scratch directory to 5 TB and/or 800000 inodes (which corresponds to approximately 200000+ files). You can check your home and scratch usage by running the ''quota'' command:

<code>quota</code>

If you need local storage on the compute nodes, you have to add lines like the following to your job script (the scratch paths are placeholders):

<code bash>
#!/bin/bash
...

# Copy the input data to a job directory on the local disk of every
# allocated node.
mkdir /local/$SLURM_JOB_ID
scontrol show hostnames $SLURM_JOB_NODELIST | xargs -i ssh {} \
    rsync -a /scratch/<group>/<user>/<input data> \
    /local/$SLURM_JOB_ID/
</code>
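
Before the job ends, results written to the local disks have to be copied back to a shared file system; a minimal sketch along the same lines (paths are again placeholders):

<code bash>
# Collect the results from the local disk of every node of the job;
# xargs replaces every {} with the node name, so each node gets its
# own result directory on scratch.
scontrol show hostnames $SLURM_JOB_NODELIST | xargs -i ssh {} \
    rsync -a /local/$SLURM_JOB_ID/ /scratch/<group>/<user>/results-{}/
</code>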

In addition to the "online" storage described above, data can be archived on the two archive file systems via the ''rsync'' servers ''data01'' and ''data02'' (an ''rsync'' command without a destination just lists the remote directory). For ''data01'':

  rsync data01:/<your archive directory>/
  ...
  cd /scratch/<group>/<user>
  rsync [--progress] -a <data to archive> data01:/<your archive directory>/
or, for ''data02'':
  rsync data02:/<your archive directory>/
  ...
  cd /scratch/<group>/<user>
  rsync [--progress] -a <data to archive> data02:/<your archive directory>/

The available space is limited on each of the two systems. Limits are set for an entire group (there's no per-user quota). You can check the current usage by running

  df -h /<archive mount point>
or
  quota

on the command line. The corresponding hardware resides in separate server rooms. There is no automatic backup. However, for a user, a possible backup scenario is to back up his or her data manually to both storage systems, ''data01'' **and** ''data02''.
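
Such a manual backup round then simply writes the same data to both archive systems; a sketch with placeholder paths:

<code bash>
# Push the same directory to both archive file systems.
rsync -a /scratch/<group>/<user>/important-results data01:/<your archive directory>/
rsync -a /scratch/<group>/<user>/important-results data02:/<your archive directory>/
</code>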

<note warning>
Although our storage systems are protected by RAID mechanisms, we can't guarantee the safety of your data. It is within the responsibility of the user to backup important files.
</note>

===== Running Jobs With SLURM =====

On our systems, compute jobs and resources are managed by SLURM (Simple Linux Utility for Resource Management). Most of the compute nodes are organized in the partition (or queue) named ''parallel'':

^Partition^Node types^
| ''parallel'' | most of the compute nodes (several CPU and GPU node types, see below) |
| ''test'' | a few nodes reserved for short test jobs |
| ''gpu'' | the FirePro S10000 GPU nodes |

Nodes are used **exclusively**, i.e. a node is never shared between jobs. Even if your job needs only a single CPU core, full nodes will be allocated (and accounted).

In this document we discuss several job types and use cases. In most cases, a compute job falls under one (or more than one) of the following categories:

  * [[#bundling_single-threaded_tasks|bundled single-threaded tasks]],
  * [[#job_arrays|job arrays]],
  * [[#openmp_jobs|OpenMP jobs]],
  * [[#mpi_jobs|MPI jobs]],
  * [[#hybrid_jobs_mpi_openmp|hybrid MPI/OpenMP jobs]].

For every compute job you have to submit a job script (unless working interactively using ''salloc'', see [[#planning_work|Planning Work]]) by running

  sbatch jobscript.sh

on a login node. A SLURM job script is a shell script which may contain SLURM directives (options), i.e. pseudo-comment lines starting with

  #SBATCH ...

The SLURM options define the resources to be allocated for the job (and some other properties). Otherwise the script contains the "job logic", i.e. the commands you want to run.

==== Read More ====

__Helpful SLURM links__

[[https://slurm.schedmd.com/|SLURM documentation]]

The following instructions shall provide you with the basic information you need to get started with SLURM on our systems. However, the official SLURM documentation covers some more use cases (also in more detail). Please read the SLURM man pages (e.g. ''man sbatch'') if you need further details.

==== The test Partition: Your First Job Script ====

Besides the ''parallel'' partition there is a small ''test'' partition for short test runs. A first job script for the ''test'' partition may look like this:

<code bash>
#!/bin/bash
#SBATCH --job-name=foo
#SBATCH --partition=test
#SBATCH --constraint=dual     # 1)
#SBATCH --nodes=3
#SBATCH --ntasks=72
#SBATCH --cpus-per-task=1     # 2)
#SBATCH --mem-per-cpu=512
#SBATCH --time=00:05:00
#SBATCH --no-requeue          # 3)
#SBATCH --mail-type=FAIL      # 4)

srun hostname
</code>

1) See next section.\\
2) For SLURM, a CPU core (a CPU thread, to be more precise) is a CPU.\\
3) Prevent the job from being requeued after node failure.\\
4) Send an e-mail if sth. goes wrong.\\

In this example three nodes are allocated, obviously. The ''--ntasks'' and ''--cpus-per-task'' options request 72 single-CPU tasks, i.e. 24 tasks on each of the three 24-core ''dual'' nodes; the ''srun'' command then starts 72 instances of ''hostname''.

Although nodes are allocated exclusively, please specify the memory your job needs per CPU (''--mem-per-cpu'', in MB), so that SLURM can schedule your job to nodes with sufficient memory.

As already mentioned, after saving the above job script as e.g. ''jobscript.sh'', you can submit it by running

  sbatch jobscript.sh

on the command line. The job's output streams (''stdout'' and ''stderr'') are written to a file named ''slurm-<jobid>.out'' in the submit directory.

For job monitoring (to check the current state of your jobs) you can use the ''squeue'' command.

If you need to cancel a job, you can use the ''scancel'' command, e.g. ''scancel <jobid>''.
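
A minimal submit-monitor-cancel round trip might then look like this (the job ID is an example):

<code bash>
# Submit the script; sbatch prints the ID of the new job.
sbatch jobscript.sh        # -> Submitted batch job 123456

# Show only your own jobs (ST: R = running, PD = pending).
squeue -u $USER

# Cancel the job, if necessary.
scancel 123456
</code>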

==== Node Types And Constraints ====

On LOEWE-CSC **four different types** of compute nodes are available. There are

  * 438 dual-socket AMD Magny-Cours CPU/GPU nodes with 24 CPU cores, 64 GB of RAM and 1 AMD Radeon HD 5870 with 1 GB of RAM,
  * 198 dual-socket Intel Xeon Ivy Bridge E5-2670v2 nodes with 20 CPU cores and 128 GB of RAM,
  * 139 dual-socket Intel Xeon Broadwell E5-2640 v4 nodes with 20 CPU cores and 128 GB of RAM and
  * 50 dual-socket Intel Xeon Ivy Bridge E5-2650v2 CPU/GPU nodes with 12 CPU cores, 128 GB of RAM and 2 AMD FirePro S10000 dual GPU cards, each with 12 GB of RAM.

In order to separate the node types, we employ the concept of constraints. However, as already mentioned, the S10000 GPU nodes are in turn in an extra partition. When running CPU jobs, you can select the node type you prefer by setting

  * ''#SBATCH --constraint=dual'' for the 24-core AMD nodes,
  * ''#SBATCH --constraint=intel20'' for the 20-core Intel Ivy Bridge nodes or
  * ''#SBATCH --constraint=broadwell'' for the 20-core Intel Broadwell nodes.

Unless you know what you're doing, please always specify a node type. If you omit the ''--constraint'' option, your job may be scheduled to any node type, so performance may vary from run to run.
==== Per-User Resource Limits ====

On LOEWE-CSC, per-user default limits apply to the ''parallel'' and ''test'' partitions, among them the maximum number of nodes in use at the same time, the maximum number of running jobs and a walltime limit per job.

The walltime limit (''--time'') is the maximum running time you can request for a single job; a job that exceeds its requested walltime is terminated automatically.

==== GPU Jobs ====

If you want to use GPUs in your calculations, submit your job to the partition containing the GPU nodes you need (see [[#node_types_and_constraints|Node Types And Constraints]] above).

==== Hyper-Threading ====

On the Intel nodes, Hyper-Threading is enabled, i.e. SLURM sees twice as many CPUs (hardware threads) per node as there are physical cores. If you don't want to use Hyper-Threading, add

  #SBATCH --extra-node-info=2:10:1

to your job script. Then you'll get half the threads per node (which will correspond to the number of cores). This can be beneficial in some cases (some jobs may run faster and/or more stable).
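
Whether the setting took effect can be checked from within the job; the expected values below assume the 20-core Intel nodes, and the exact behavior depends on the SLURM configuration:

<code bash>
# Print the number of usable logical CPUs on each allocated node:
# typically 20 with the option above, 40 without it.
srun nproc
</code>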

==== Bundling Single-Threaded Tasks ====

**Note:** Please also see the Job Arrays section below. Because only full nodes are given to you, you have to ensure that the available resources (the 24 CPU cores on a Magny-Cours compute node or the 20 cores on an Intel node) are used efficiently. Please combine as many single-threaded jobs as possible into one. The limits for the number of combined jobs are given by the number of cores and the available memory. A simple job script to start 24 independent processes may look like this one:

<code bash>
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000
#SBATCH --time=01:00:00
#SBATCH --mail-type=FAIL

export OMP_NUM_THREADS=1

# Start 24 processes in the background.
# Replace by a for loop.

./program input01 >& 01.out &
./program input02 >& 02.out &

...

./program input24 >& 24.out &
# Wait for all child processes to terminate.
wait
</code>

In this (SIMD) example we assume that there is a program (called ''program'') which reads an input file and writes its results to stdout. 24 instances are started in the background, each with its own input file and output redirection; the final ''wait'' blocks until all of them have finished.
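
As the comment in the script suggests, the 24 explicit command lines can be replaced by a loop; a sketch assuming the same hypothetical ''program'' and numbered input files:

<code bash>
# Start 24 background processes, one per (zero-padded) input file.
for I in $(seq -w 1 24); do
    ./program input$I >& $I.out &
done
# Wait for all child processes to terminate.
wait
</code>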

If the running times of your processes vary a lot, consider using the //thread pool pattern//, i.e. keep a fixed number of worker processes busy and hand a new input to a worker as soon as it becomes idle.
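
One way to build such a pool with standard tools is ''xargs'' with its ''-P'' option; a sketch, again using the hypothetical ''program'' and its input files:

<code bash>
# Run at most 24 instances in parallel; as soon as one finishes,
# xargs starts the next one (thread pool pattern).
ls input* | xargs -P 24 -I{} bash -c './program {} > {}.out 2>&1'
</code>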

==== Job Arrays ====

If you have a lot of single-core computations to run, job arrays are worth a look. Telling SLURM to run a job script as a job array will result in running that script multiple times (after the corresponding resources have been allocated). Each instance will have a distinct value of the ''SLURM_ARRAY_TASK_ID'' environment variable, which can be used to select the share of the work that instance has to do.

Due to our full-node policy, you still have to ensure that your jobs don't waste any resources. Let's say you have 192 single-core tasks. In the following example 192 tasks are run inside a job array while ensuring that only 24-core nodes are used and that each node runs exactly 24 tasks in parallel.

<code bash>
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000
#SBATCH --time=00:10:00
#SBATCH --array=0-191:24
#SBATCH --mail-type=FAIL

my_task() {
    # Print the given "task number",
    # followed by the hostname of the executing node.
    local K=$1
    echo "$K: $HOSTNAME"

    # Do nothing, just sleep for 3 seconds.
    sleep 3
}

# With --array=0-191:24 the array task IDs are 0, 24, 48, ..., 168,
# i.e. the array consists of 8 jobs running 24 tasks each.
# Every 24-task block will run on a separate node.

for I in $(seq 24); do
    # This is the "global" task number; for
    # 192 tasks, J will range from 1 to 192.
    J=$((SLURM_ARRAY_TASK_ID + I))

    # Put each task into background, so that tasks are executed
    # concurrently.
    my_task $J &

    # Wait a little before starting the next one.
    sleep 1
done

# Wait for all child processes to terminate.
wait
</code>

If the task running times vary a lot, consider using the //thread pool pattern// here as well (see the previous section).

==== OpenMP Jobs ====

For OpenMP jobs, set the ''--cpus-per-task'' option to the number of threads you want to use and set the ''OMP_NUM_THREADS'' environment variable accordingly, e.g.:

<code bash>
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=200
#SBATCH --mail-type=ALL
#SBATCH --time=48:00:00

export OMP_NUM_THREADS=24
./omp_program
</code>
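
For completeness: an OpenMP program has to be compiled with the corresponding compiler flag; as a sketch, with the GNU compiler:

<code bash>
# Enable OpenMP support in GCC.
gcc -fopenmp -O2 -o omp_program omp_program.c
</code>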

==== MPI Libraries ====

Currently, we provide the following MPI implementations((you can list the MPI modules by running ''module avail''.)):

  * MVAPICH2 version 2.0 (Intel and PGI compiler versions, as ''mpi/...'' modules)
  * Open MPI version 1.8.1 (Intel compiler version, as an ''openmpi/...'' module)

When loading an MPI module, please make sure it matches the compiler suite you build your program with.

<note important>Always run your program with the same MPI library it was compiled with.</note>

**Note:** MVAPICH2 is installed with core affinity enabled. Every MPI rank is pinned to a CPU core at runtime. This prevents the OS scheduler from shifting the MPI ranks from core to core, invalidating caches and degrading performance. But you have to be careful if you want to run MPI + OpenMP jobs (see the Hybrid Jobs section below).

==== MPI Jobs ====

**Remember:** nodes are allocated exclusively. Therefore, please choose task counts that fill entire nodes, i.e. multiples of 24 for the ''dual'' nodes or multiples of 20 for the Intel nodes.

See also: https://slurm.schedmd.com/mpi_guide.html

As an example, we want to run a program that spawns 96 Open MPI ranks and where 1200 MB of RAM are allocated for each rank.

<code bash>
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --ntasks=96
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1200
#SBATCH --mail-type=ALL
#SBATCH --time=48:00:00

# Load the Open MPI module (take the exact name from 'module avail').
module load openmpi/<version>
export OMP_NUM_THREADS=1
mpirun ./mpi_program
</code>

The main difference between an Open MPI job script and an MVAPICH2 job script is the command used to start your parallel program: with Open MPI you have to use ''mpirun'', with MVAPICH2 you have to use ''srun''. The MVAPICH2 version of the above script is:

<code bash>
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --ntasks=96
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1200
#SBATCH --mail-type=ALL
#SBATCH --time=48:00:00

# Load the MVAPICH2 module (take the exact name from 'module avail').
module load mpi/<version>
export OMP_NUM_THREADS=1
srun ./mpi_program
</code>

**Note:** If you are concerned about InfiniBand bandwidth: SLURM is topology-aware. It "knows" the InfiniBand topology of the cluster and tries to place each job on nodes that are connected to as few InfiniBand switches as possible.
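
If a job really requires such locality, it can be requested explicitly with SLURM's generic ''--switches'' option; a sketch (the maximum waiting time is an example value, and requesting few switches may delay the job start):

<code bash>
# Ask for nodes under at most one leaf switch, but wait no longer
# than 12 hours for such a placement.
#SBATCH --switches=1@12:00:00
</code>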

==== Combining Small MPI Jobs ====

As mentioned earlier, running small jobs while full nodes are allocated leads to a waste of resources. In cases where you have, let's say, a lot of 12-rank MPI jobs (with similar runtimes and low memory consumption), you can combine two of them to fill a 24-core node:

<code bash>
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000
#SBATCH --time=48:00:00
#SBATCH --mail-type=FAIL

export OMP_NUM_THREADS=1
mpirun -np 12 ./program input01 >& 01.out &
# Wait a little before starting the next one.
sleep 3
mpirun -np 12 ./program input02 >& 02.out &
# Wait for all child processes to terminate.
wait
</code>

You might also need to disable core binding (please see the documentation of your MPI library), so that the two 12-rank programs don't pin their ranks to the same 12 cores.
==== Hybrid Jobs: MPI/OpenMP ====

MVAPICH2 example script (24 ranks, 6 threads each and 200 MB per thread, i.e. 1.2 GB per rank; so, for 24*6 threads, you'll get six 24-core nodes):

<code bash>
#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=dual
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=6
#SBATCH --mem-per-cpu=200
#SBATCH --mail-type=ALL
#SBATCH --time=48:00:00

export OMP_NUM_THREADS=6
export MV2_ENABLE_AFFINITY=0
srun -n 24 ./hybrid_program
</code>

Please note that this is just an example. You may or may not run it as-is with your software, which is likely to have a different scalability.

You have to disable the core affinity when running hybrid jobs with MVAPICH2. Otherwise all threads of an MPI rank will be pinned to the same core. Our example therefore includes the command

<code bash>
export MV2_ENABLE_AFFINITY=0
</code>

which disables this feature. The OS scheduler is now responsible for the placement of the threads during the runtime of the program. But the OS scheduler can dynamically change the thread placement, which invalidates caches and may degrade performance.
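
If that turns out to be a problem, the OpenMP threads can be bound again via the standard OpenMP environment variables; a sketch (supported by all recent OpenMP runtimes):

<code bash>
# Keep each OpenMP thread on the CPU it started on.
export OMP_PROC_BIND=true
</code>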

==== Local Storage ====

On each node there is up to 1.4 TB of local disk space (see also the [[#storage|Storage]] section above, where staging data to and from the local disks is described).

==== Nodes Vs. Tasks And Threads ====

As already indicated, SLURM resource allocations can be further specified by using the ''--ntasks'' and ''--cpus-per-task'' options. On the 24-core ''dual'' nodes, e.g.

<code>
#SBATCH --nodes=2
#SBATCH --ntasks=48
#SBATCH --cpus-per-task=1
</code>

will result in virtually the same resource allocation (i.e. two nodes) as just

<code>
#SBATCH --nodes=2
</code>

or

<code>
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=24
</code>

However, for SLURM the three have different meanings (hence resulting in different job environments): e.g. ''srun'' would start 48 single-CPU tasks in the first case, but only two 24-CPU tasks in the last one.
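
The difference is visible in the environment SLURM sets up for the job; a sketch printing the relevant variables (the values in the comments correspond to the first variant):

<code bash>
# Show how SLURM interpreted the resource request.
echo "nodes:         $SLURM_JOB_NUM_NODES"   # 2
echo "tasks:         $SLURM_NTASKS"          # 48
echo "cpus per task: $SLURM_CPUS_PER_TASK"   # 1
</code>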

==== Planning Work ====

Using the ''--begin'' option you can allocate resources in advance, e.g. for a live demonstration. Example:

  - Submit a sleep job (allocate twenty intel20 nodes for 3 days). You can log out after running this command (but check the output of the ''squeue'' command first; if there is no corresponding pending job, then sth. went wrong): <code>
$ sbatch --begin=201X-07-23T08:00:00 --time=3-00:00:00 --nodes=20 \
    --partition=parallel --mem=120g \
    --constraint=intel20 --wrap="sleep 3d"</code>
  - Wait until the time has come (07/23/201X 8:00 a.m. or later; there is no guarantee that the allocation will be made on time, but the earlier you submit the job, the more likely you'll get the resources by that time).
  - Find out whether the sleep job is running (i.e. is in R state) and run a new job step within that allocation (see also ''man srun''): <code>
$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
2717365  parallel     wrap      ...  R        ...     20 ...

$ srun --jobid 2717365 hostname
...</code>
  - Finally, don't forget to release the allocation, if there's time left: <code>
$ scancel 2717365</code>

==== Queuing Times ====

After submitting a job you may use the ''squeue'' command to check its state and its expected start time, e.g.

  squeue -o "%.7i %.9P %.7f %.2t %.10M %.4D %R"

or

  squeue --start

The latter shows approximate start times for your jobs. A start time prediction doesn't exist for every job, and predictions change over time, e.g. when running jobs finish earlier than planned or jobs with a higher priority enter the queue.

Upon login via SSH a "message of the day" is displayed, containing (among other things) the current utilization and queuing time statistics of the cluster, e.g.:

  ... --- util: 0.97 --- avg./max. qtime (h): 17.41 / 96.69

The ''util'' value is the currently allocated fraction of the cluster's resources (here: 97%); ''avg./max. qtime'' are the average and maximum queuing times (in hours) of the currently waiting jobs.

While the queuing times may change quickly (see Fig. 1) and range from some minutes to many hours (or even several days), a snapshot of a typical cluster utilization scenario (where > 90% of the currently available resources are allocated) may look like the one in Fig. 2 (the bar chart captures the "job sizes", but doesn't show how the jobs are placed on the individual nodes).

Fig. 1: Queuing times on LOEWE-CSC over time.

Fig. 2: Snapshot of a typical cluster utilization (job sizes).