Cluster@WU User’s Manual
Stefan Theußl, Martin Pacala
September 29, 2014

1 Introduction and scope

At the WU Wirtschaftsuniversität Wien, the Research Institute for Computational Methods (Forschungsinstitut für rechenintensive Methoden, or FIRM for short) hosts a cluster of Intel workstations known as cluster@WU. See http://www.wu.ac.at/firm.

The scope of this manual is limited to a general introduction to cluster@WU and its usage. It assumes basic Linux knowledge on the part of the user. For a more comprehensive guide, including detailed instructions for Windows users and users new to Unix as well as the optional arguments of the available commands, please visit http://statmath.wu.ac.at/cluster.

Suggestions and improvements to this manual (as well as the website manual) can be emailed to [email protected].

1.1 cluster@WU

With a total of 528 64-bit computation cores and a total of one terabyte of RAM, cluster@WU is well equipped to tackle challenging problems from various research areas.

The high performance computing cluster consists of four parts: the cluster running the applications, which itself consists of 44 nodes; a login server; a file server; and storage from a storage area network (SAN).

The 44 nodes, each offering 12 cores for a total of 528 cores capable of processing jobs, are combined in the queue node.q. Table 1 provides a brief overview of the specs of each individual node.

    node.q – 44 nodes
    2     Intel X5670 (6 cores) @ 2.93 GHz
    24    GB RAM

    Table 1: cluster@WU specification

The file server (clusterfs.wu.ac.at) hosts the user data, application data and the scheduling system Sun Grid Engine. This grid engine is responsible for job administration and supports the submission of serial tasks as well as parallel tasks.

The login server (cluster.wu.ac.at) is the main entry point for application developers and cluster users. This server solely handles user authentication and execution of programs. This machine provides the cluster users with a platform for managing their computationally intensive jobs.

2 Cluster Access

To get access to cluster@WU, a local Unix account is needed. To acquire such an account, send an email to [email protected]. You will be notified by email once your account is created.

2.1 Login

To log in to the cluster, type

    ssh [email protected]

into your Unix shell (a number of SSH clients, e.g. PuTTY, are available for Windows users). After authentication with your username and password, a shell on the login server becomes available. The user is provided with programs for editing and compiling as well as tools for managing jobs for the grid engine.

The login server solely serves as an access point to the main cluster. It should therefore only be used for editing programs (e.g. installing R packages into your personal library), compiling small applications and managing jobs submitted to the grid engine.

2.2 Changing the password

To change your password, run the command passwd after logging in to cluster@WU and then enter your password into the terminal.

3 Using the Cluster

This section summarizes the capabilities of the Sun Grid Engine (SGE) and gives an overview of how to use this software. Sun Grid Engine is an open source cluster resource management and scheduling software. On cluster@WU, version 6.0 of the Grid Engine manages the remote execution of cluster jobs. It can be obtained from http://gridengine.sunsource.net/; the commercial version, Sun N1 Grid Engine, can be found at http://www.sun.com/software/gridware/.

3.1 Definitions

Before going into further details, the user should be aware of certain frequently used terms:

Nodes: In SGE terminology, a node refers to one core in the cluster. In this manual, a node refers to one rack unit. Each of the cluster@WU nodes has 12 cores, each of which can process one job at a time. This difference in usage can cause confusion when comparing different texts dealing with clusters.
Jobs: User requests for resources available in a grid. The SGE evaluates these requests and distributes them to the nodes for processing.

Tasks: Smaller parts of a job that can be processed separately on different nodes or cores. A single job can consist of many tasks (even thousands). Each of these tasks can perform similar or completely different calculations, depending on the arguments passed to the SGE.

Job Arguments: Each job can be submitted with extra parameters which affect how the job gets processed. These arguments can be specified in the job submission file.

3.2 Fair Use

In order to provide maximum flexibility, some aspects of the grid engine are set up with very few restrictions. For example, there is no limit on how many jobs a user can start or have in the queue. However, this also means that users need to write and submit their jobs in a way that does not adversely affect the rights of other users to run jobs on the cluster.

For example, if you wish to start a job which contains hundreds or even a few thousand tasks and which will occupy a significant amount of resources for extended periods of time, then submit the job with a reduced priority (see the section on arguments below) so that other users’ jobs can get processed whenever any task is completed.

It is not allowed to start programs which are time or resource intensive on the login server (as this will have side effects on all users logged in). If such tasks are started anyway, they may be terminated by the administrators without notice.

3.3 How the Grid Engine Operates

In general, the grid engine has to match the available resources to the requests of the grid users. Therefore the grid engine is responsible for

• accepting jobs from the outside world
• delaying a job until it can be run
• sending jobs from the holding area to an execution device (node)
• managing running jobs
• logging of jobs

The user does not need to care about questions like “On which node should I run my tasks?” or “How do I get the results of my computation back from one node to my home directory?”. All this is handled by the grid engine.

3.4 Submission of a Job

A simple example shows how one can submit a job on the WU cluster infrastructure:

1. Log in to the cluster with your user account (from a terminal, use the following command):

       ssh [email protected]

2. Create a text file on the cluster (we will call the file myJob) with the following content:

       #$ -N firstJob
       R-i --vanilla < myRprogram.R

3. Submit the job to the grid engine with the following command:

       qsub myJob

This will queue the job and run it as soon as a node is free.

3.5 Output

For every submitted and processed job the system creates two files. Each filename consists of the name of the job, a letter indicating whether the file contains the output or the errors ("o" and "e" respectively), and the job ID. The submitted myJob from the previous example would hence result in the following files being created:

firstJob.oXXXXX : Starts with a prologue containing some meta information about the job, continues with the actual output (only the standard output) and ends with an epilogue (containing the runtime etc.)

firstJob.eXXXXX : Contains any errors encountered

XXXXX refers to the job ID. Note that the output is cached while the job is running and will appear in the two files with a delay (or immediately once the job finishes).

3.6 Deletion of a Job

To delete a job, use the qdel command.

3.7 Monitoring Job/cluster status

Statistics of the jobs running on cluster@WU can be obtained by calling qstat. An overview of all jobs/nodes and their utilization can be obtained using the sjs and sns commands.
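The submit, monitor and delete steps of Sections 3.4 to 3.7 can be sketched as a short shell script. This is only a sketch under assumptions: job_output_files is a hypothetical helper (not part of the grid engine) that spells out the file naming scheme of Section 3.5, 12345 is a placeholder job ID, and the -u option of qstat is used to restrict the listing to one user.

```shell
#!/bin/sh
# Hypothetical helper: print the two file names the grid engine creates
# for a job, following the <name>.o<id> / <name>.e<id> scheme of Section 3.5.
job_output_files() {
    job_name=$1
    job_id=$2
    printf '%s.o%s\n%s.e%s\n' "$job_name" "$job_id" "$job_name" "$job_id"
}

# Only talk to the grid engine where its client tools are installed.
if command -v qsub >/dev/null 2>&1; then
    qsub myJob            # submit the job file; the reply contains the job ID
    qstat -u "$USER"      # list your own pending and running jobs
    # qdel 12345          # delete the job with the (placeholder) ID 12345
fi

# Print the expected output and error file names for job ID 12345.
job_output_files firstJob 12345
```

On a machine without the grid engine tools the script skips the qsub and qstat calls and only prints firstJob.o12345 and firstJob.e12345.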
3.8 Summary of Grid Engine Arguments

A job typically begins with commands to the grid engine. These commands start with #$, followed by the argument as described below and then the desired value. One or more arguments can be passed to the grid engine:

-N               defines the job name that will be displayed in the queue and used for the output and error files
-m bea           tells the system to send an email when the job starts (b) / ends (e) / is aborted (a)
-M               followed by the email address
-q               selects one of currently two node queues (either bignode.q or node.q). Default is node.q
-pe [type] [n]   starts up a parallel environment [type] reserving [n] cores

3.9 Job best practices & what to avoid

While it is possible to run jobs on nodes which then spawn further processes (see the Fair Use section above for an explanation), please refrain from doing so unless you have reserved the appropriate number of cores on the node you will be using (for example by submitting the job in a parallel environment, see example below). Otherwise the Grid Engine might try to allocate further jobs to a particular node even though it is already running at maximum capacity.

It might also be tempting to have your jobs write data back into your home directory (which is remotely mounted on each node when needed) as the job gets processed. This isn’t an issue if done in a limited fashion, but if done excessively by hundreds of simultaneous jobs it can cause the file server to become unresponsive. As a result, jobs are unable to pass the data they need to the file server, and users are unable to access their own home directories upon logging in to the cluster (a typical symptom is a user who normally uses SSH key based authentication being asked for their password because, due to the unresponsiveness of the file server, their public key cannot be accessed).

Instead, have your scripts write into the local storage in the /tmp/ directory of each node (approx. 10-15 GB) or into the /scratch/ directory, which is mounted on each node and allows for cross-node access to data, rather than into your home directory. Once you have larger chunks of data ready, you can have them copied to your home directory.

3.10 Troubleshooting & further details

Troubleshooting issues and further details are outside the scope of this manual. Please refer to the website at http://statmath.wu.ac.at/cluster for more information. If that is also insufficient to resolve your issues, you are welcome to contact the cluster admins at [email protected].

4 Job Examples

4.1 A Simple Job

What follows is a simple job without any parameters to the SGE. The shell command date is run and then, after a pause of 30 seconds, run again.

    # print date and time
    date
    # sleep for 30 seconds
    sleep 30
    # print date and time again
    date

4.2 Compilation of Applications on the Cluster

The following job starts a remote compilation on cluster@WU. The arguments to the grid engine define the email address of the user to whom a mail should be sent. The flag -m e causes the email to be sent at the end of the job.

    #$ -N compile-application
    #$ -M [email protected]
    #$ -m e
    #$ -q node.q

    cd /path/to/src
    echo "#### clean ####"
    make clean
    echo "#### configure ####"
    ./configure CC=icc CXX=icpc FC=f77 --prefix=/path/to/local/lib/
    echo "#### make ####"
    make all
    echo "#### install ####"
    make install
    echo "#### finished ####"

4.3 Same Job with Different Parameters

This is a commonly used way to (pseudo) parallelize tasks: for one job, several tasks are executed. The key is an environment variable called SGE_TASK_ID. For each task number in the range provided by the -t argument, a task is started which runs the given job and has access to a unique environment variable identifying this task.
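The mechanism can be sketched in plain shell. This is a minimal sketch under assumptions: the job name shell_alternatives, the helper pick_param and the parameter values are hypothetical, the range 1-5 uses the n-m form given in the qsub manual page, and SGE_TASK_ID falls back to 1 so that the script can also be tried outside the grid engine.

```shell
#!/bin/sh
#$ -N shell_alternatives
#$ -t 1-5

# Hypothetical helper: print the n-th value of a fixed parameter list.
pick_param() {
    index=$1
    set -- 0.01 0.025 0.05 0.075 0.1
    shift $((index - 1))
    printf '%s\n' "$1"
}

# The grid engine sets SGE_TASK_ID for each task of the job;
# default to 1 when the variable is not set.
task_id=${SGE_TASK_ID:-1}
mu=$(pick_param "$task_id")
echo "task $task_id uses mu=$mu"
```

Each of the five tasks runs the same script but selects a different value, so no task needs to know about the others.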
To illustrate this way of job creation, see the following job:

    #$ -N R_alternatives -t 1-10

    R-i --vanilla <<-EOF
    run <- as.integer(Sys.getenv("SGE_TASK_ID"))
    param <- expand.grid(mu=c(0.01, 0.025, 0.05, 0.075, 0.1), sd = c(0.04, 0.1))
    param
    vec <- rnorm(50, param[[run,1]], param[[run,2]])
    mean(vec)
    sd(vec)
    EOF

For each task ID a vector of 50 normally distributed pseudo-random numbers is generated. The parameters of the normal distribution are chosen using the SGE_TASK_ID environment variable: the grid of 5 values for mu and 2 values for sd has 10 rows, one per task.

4.4 openMPI/parallel Job

The grid engine helps the user with setting up parallel environments. The -pe argument followed by the desired parallel environment (e.g., orte, PVM) informs the grid engine to start the specified environment.

    #$ -N RMPI
    #$ -pe orte 20
    #$ -q node.q

    # Job for using the MPI implementation LAM on 20 Nodes
    mpirun -np 20 /path/to/lam/executable

4.5 PVM Job

    #$ -N pvm-example
    #$ -pe pvm 5
    #$ -q node.q

    /path/to/pvm/executable

5 Available Software

In this section a summary of available programs is given. The operating system is Debian GNU/Linux (http://www.debian.org).

    R
    R-i        R compiled with Intel compiler and linked against libgoto
    R-g        R compiled with the default settings

    Compiler
    gcc        GNU C Compiler, Stallman and the GCC Developer Community (2007)
    g++        GNU C++ Compiler
    gfortran   GNU FORTRAN Compiler
    icc        Intel C Compiler, Intel Corporation (2007a)
    icpc       Intel C++ Compiler, Intel Corporation (2007a)
    ifort      Intel FORTRAN Compiler, Intel Corporation (2007b)

    Editor
    emacs
    vi/vim
    nano
    joe

    Scientific
    octave
    LAM/MPI    HPC version 7.1.3; this is for running MPI programs
    PVM        Parallel Virtual Machine version 3.4

    Table 2: Available software

A .bashrc Modifications

In this appendix, parts of the .bashrc are explained which enable specific functionality on the cluster. Keep in mind that jobs need to be modified to specifically include your .bashrc by adding #!/bin/sh to the beginning of the file which contains your job information and instructions.
A.1 Enable openMPI

To get the MPI wrappers and libraries, add the following to your .bashrc:

    export LD_LIBRARY_PATH=/opt/libs/openmpi-1.4.3-INTEL-12.0-64/lib:$LD_LIBRARY_PATH
    export PATH=$PATH:/opt/libs/openmpi-1.4.3-INTEL-12.0-64/bin

A.2 Enable PVM

To enable PVM, add the following to your .bashrc:

    # PVM>
    # you may wish to use this for your own programs (edit the last
    # part to point to a different directory f.e. ~/bin/_$PVM_ARCH.
    #
    # The variable is quoted so that the tests also work when it is empty.
    if [ -z "$PVM_ROOT" ]; then
        if [ -d /usr/lib/pvm3 ]; then
            export PVM_ROOT=/usr/lib/pvm3
        else
            echo "Warning - PVM_ROOT not defined"
            echo "To use PVM, define PVM_ROOT and rerun your .bashrc"
        fi
    fi

    if [ -n "$PVM_ROOT" ]; then
        export PVM_ARCH=`$PVM_ROOT/lib/pvmgetarch`
        #
        # uncomment one of the following lines if you want the PVM commands
        # directory to be added to your shell path.
        #
        # export PATH=$PATH:$PVM_ROOT/lib # generic
        export PATH=$PATH:$PVM_ROOT/lib/$PVM_ARCH # arch-specific
        #
        # uncomment the following line if you want the PVM executable directory
        # to be added to your shell path.
        #
        export PATH=$PATH:$PVM_ROOT/bin/$PVM_ARCH
    fi

A.3 Local R Library

Only the administrator has write access to the site library, and not all packages that users may require are pre-installed. For this reason, users should create their own package library in their home directory. Since the home directories are exported to all nodes during the execution of jobs, the personal package library will be available there as well. To do this, create a directory in your home folder and add the following to your .bashrc file:

    # R package directory
    export R_LIBS=~/path/to/R/lib

If, upon your next login to the cluster, you start R (see the section Available Software above) and your newly created folder is displayed as the primary library, then the .bashrc modification worked.

References

Richard Stallman and the GCC Developer Community. Using the GNU Compiler Collection. The Free Software Foundation, 2007. URL http://gcc.gnu.org

Intel Corporation. Intel C++ Compiler Documentation. Intel Corporation, 2007a. URL www.intel.com

Intel Corporation. Intel C++ Compiler Documentation. Intel Corporation, 2007b. URL www.intel.com