Scyld ClusterWare Orientation
Transcription
Orientation Agenda
Beowulf Architecture
Booting and Provisioning
Name Service
Process Creation, Monitoring, and Control
Providing Data to Clients
System Management
Interactive and Serial Jobs
Resource Management and Job Queuing
Other Resources

Beowulf Architecture

Beowulf Clustering: The Early Years
Conceived by Don Becker and Thomas Sterling in '93 and initiated at NASA in '94
Objective: show that scalable commodity clusters could solve problems usually handled by million-dollar supercomputers, but at a fraction of the cost
Initial prototype
» 16 processors, Ethernet bonding
» Scalable (processing power, memory, storage, network bandwidth)
» Under $50k (1994) but matched the performance of a contemporary $1M SMP system

Traditional "Ad-Hoc" Linux Cluster
The master node:
» Fully loaded with hardware components
» Full complement of RAS features
» Complete Linux distribution
» User access, roles, and security implemented
Interconnection network:
» Two network connections, one for the private cluster and one external
Each compute node:
» Fully loaded with hardware components
» Complete Linux distribution
  • Installation is manual and slow (5 to 30 min)
  • Difficult to manage
» User access, roles, and security implemented
Monitoring & management added as isolated tools
(Diagram: master node bridging the Internet or internal network and the private interconnection network of compute nodes with optional disks)

Lessons Learned
Issues with this approach
» Complexity
» Requires extensive training to install, configure, and use
» Long-term administration and updates were difficult
» Only "static scalability"
A better cluster system implementation
» Create a unified view of independent machines
» Single installation to manage and upgrade
» Single, unified application environment
» Central location for monitoring, troubleshooting, security

Cluster Virtualization Architecture Realized
Manage & use a cluster like a single SMP machine
Compute nodes: no disk required (disks optional), minimal in-memory OS
» Deployed at boot in less than 20 sec
Virtual, unified process space enables intuitive single sign-on and job submission
» One set of user security to manage
» Consistent environment for migrating jobs to nodes
The master is a full installation
Monitor & manage efficiently from the master
» Single system install
» Single process space
» Single point of provisioning
» Better performance due to lightweight nodes

Booting and Provisioning

Booting
Compute node booting requirements
» No dependency on local storage: only CPU, memory, and NIC needed
» Automatic scaling (a few nodes to thousands)
» Centralized control and updates (from the master)
» Reporting and diagnostics designed into the system
Core system is "stateless" (non-persistent)
» Kernel and minimal environment are sent from the master to the client as a ramdisk
» Just enough in the ramdisk for the client to boot and access the master
» Additional elements provided on demand under centralized master control
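As a hedged aside, a quick way to confirm that nodes completed the stateless boot is to check node state and the per-node boot log from the master (the log locations are covered on the following slides; node 0 is only an example):
bpstat                        # list node numbers and their current state (e.g. up or down)
less /var/log/beowulf/node.0  # per-node boot log kept on the master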
Booting implementation
The boot server (beoserv) supports PXE and other well-known protocols
» Understands PXE versions
» Avoids the TFTP "capture effect"
  • Simultaneous access by multiple nodes will not return a 'time out'; UDP requests wait for the connection
» DHCP service for non-cluster hosts
Kernel and minimal environment come from the master
» Just enough to say, "what do I do now"
» Remaining configuration is driven by the master
Boot diagnostics for all nodes are available on the master
– /var/log/messages
– /var/log/beowulf/node.n

NFS to client systems
Mounting file systems is per-installation configuration
» ClusterWare itself needs no NFS file systems; data could be transferred as needed
» Some NFS mounts are set up for path searching and convenience
  • /bin /usr/bin /opt /home /usr/lib /usr/lib64
Administration is done on the master
» File system configuration tables are on the master
  • Standard fstab format, but in /etc/beowulf/fstab
» Cluster-wide default with per-node specialization
» Mount failures are non-fatal and diagnosable

/etc/beowulf/fstab
[root@cw00 beowulf]# grep -v '^#' fstab | grep -v '^$'
none                          /dev/pts              devpts  gid=5,mode=620   0 0
none                          /proc                 proc    defaults         0 0
none                          /sys                  sysfs   defaults         0 0
none                          /bpfs                 bpfs    defaults         0 0
none                          /dev/shm              tmpfs   defaults         0 0
$MASTER:/bin                  /bin                  nfs     nolock,nonfatal  0 0
$MASTER:/usr/bin              /usr/bin              nfs     nolock,nonfatal  0 0
$MASTER:/localhome            /localhome            nfs     nolock,nonfatal  0 0
$MASTER:/usr/lib64/python2.4  /usr/lib64/python2.4  nfs     nolock,nonfatal  0 0
$MASTER:/usr/lib/perl5        /usr/lib/perl5        nfs     nolock,nonfatal  0 0
$MASTER:/usr/lib64/perl5      /usr/lib64/perl5      nfs     nolock,nonfatal  0 0
10.54.30.0:/opt               /opt                  nfs     nolock,nonfatal  0 0
10.54.30.0:/home              /home                 nfs     nolock,nonfatal  0 0
/dev/sda1                     swap                  swap    defaults         0 0
/dev/sda2                     /scratch              ext2    defaults         0 0

Executing init.d scripts on compute nodes
Located in /etc/beowulf/init.d/
Scripts start on the head node and need remote-execution commands to operate on compute nodes
Order is based on file name
» Numbered files can be used to control order
beochkconfig is used to set the +x bit on files (see the example after the listing below)

Typical set of startup scripts
[root@cw00 ~]# ls -w 20 -F /etc/beowulf/init.d
03kickbackproxyd*  08nettune*  09nscd*  10mcelog*  12raid.example  13dmidecode*  13sendstats*
15openib  15openib.local*  16ipoib*  20ipmi*  20srp  23panfs  25cuda  30cpuspeed
80rcmdd  81sshd*  85run2complete  90torque*  99local
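A brief, hedged example of using beochkconfig to turn node startup scripts on or off (script names come from the listing above; 25cuda is used only as an illustration):
beochkconfig 90torque on    # set the execute bit so this script runs at node boot
beochkconfig 25cuda off     # clear the execute bit so this script is skipped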
Client boot diagnostics
Boot diagnostics for all nodes are available on the master
– /var/log/messages
– /var/log/beowulf/node.n

/var/log/messages as a node boots
Sep 28 12:52:53 cw00 beoserv: NODESTATUS 0 dhcp-pxe Assigned address 10.54.50.0 for node 0 during PXE BIOS from 00:A0:D1:E4:87:D6
Sep 28 12:52:53 cw00 beoserv: NODESTATUS 0 tftp-bootloader TFTP download /usr/lib/syslinux/pxelinux.0 to node 0
Sep 28 12:52:53 cw00 beoserv: NODESTATUS 0 tftp-bootloader TFTP download /usr/lib/syslinux/pxelinux.0 to node 0
Sep 28 12:52:53 cw00 beoserv: NODESTATUS 0 tftp-bootconfig TFTP download autogenerated PXELINUX config file to node 0
Sep 28 12:52:53 cw00 beoserv: NODESTATUS 0 tftp-kernel TFTP download /boot/vmlinuz-2.6.18-308.11.1.el5.582g0000 to node 0
Sep 28 12:52:53 cw00 beoserv: NODESTATUS 0 tftp-file TFTP download /var/beowulf/boot/computenode.initrd to node 0
...
Sep 28 12:53:14 10.54.50.0 (none) ib_mthca: Initializing 0000:09:00.0
Sep 28 12:53:14 10.54.50.0 (none) GSI 18 sharing vector 0xC1 and IRQ 18
Sep 28 12:53:14 10.54.50.0 (none) ACPI: PCI Interrupt 0000:09:00.0[A] -> GSI 28 (level, low) -> IRQ 193
Sep 28 12:53:15 10.54.50.0 (none) ib_mthca 0000:09:00.0: HCA FW version 3.4.000 is old (3.5.000 is current).
Sep 28 12:53:15 10.54.50.0 (none) ib_mthca 0000:09:00.0: If you have problems, try updating your HCA FW.
...
Sep 28 12:53:33 10.54.50.0 n0 ipmi: Found new BMC (man_id: 0x005059, prod_id: 0x000e, dev_id: 0x20)
Sep 28 12:53:33 10.54.50.0 n0 IPMI kcs interface initialized
Sep 28 12:53:33 10.54.50.0 n0 ipmi device interface
Sep 28 12:53:34 10.54.50.0 n0 Sep 28 12:53:34 sshd[65653]: Server listening on :: port 22.
Sep 28 12:53:34 10.54.50.0 n0 Sep 28 12:53:34 sshd[65653]: Server listening on 0.0.0.0 port 22.
Sep 28 12:53:34 cw00 mountd[2986]: authenticated mount request from n0:1020 for /var/spool/torque/mom_logs (/var/spool/torque/mom_logs)

/var/log/beowulf/node.0
node_up: Initializing cluster node 0 at Fri Sep 28 12:53:13 PDT 2012.
node_up: Setting system clock from the master.
node_up: Configuring loopback interface.
node_up: Explicitly mount /bpfs.
node_up: Initialize kernel parameters using /etc/beowulf/conf.d/sysctl.conf
node_up: Loading device support modules for kernel version 2.6.18-308.11.1.el5.582g0000.
node_up: eth0 is the cluster interface
node_up: Using eth0:10.54.0.1 as the default route
node_up: Making compute node devices and running setup_fs.
setup_fs: Configuring node filesystems using /etc/beowulf/fstab
setup_fs: Mounting /bpfs (type=bpfs; options=defaults)
setup_fs: Mounting /dev/pts (type=devpts; options=gid=5,mode=620)
setup_fs: Mounting /dev/shm (type=tmpfs; options=defaults)
setup_fs: Mounting /proc (type=proc; options=defaults)
setup_fs: Mounting /sys (type=sysfs; options=defaults)
setup_fs: Mounting 10.54.0.1:/bin on /bin (type=nfs; options=nolock,nonfatal)
...
setup_fs: Creating libcache directory trees.
node_up: Using master's time zone setting from /etc/localtime.
node_up: Copying ld.so.cache.
node_up: Copying loader files.
node_up: Configuring BeoNSS cluster name service (nsswitch.conf).
node_up: Enabling slave node automatic kernel module loading.
node_up: Change slave node /rootfs to be the real root.
node_up: Start rpc.statd daemon for NFS mounts without 'nolock'.
node_up: Prestage /etc/alternatives files.
node_up: Prestage libcache file: /lib64/libcrypt.so.1
...

More on /var/log/beowulf/node.0
Starting /etc/beowulf/init.d/03kickbackproxyd...
Starting /etc/beowulf/init.d/08nettune...
Starting /etc/beowulf/init.d/09nscd...
started nscd on node 0
Starting /etc/beowulf/init.d/10mcelog...
Starting /etc/beowulf/init.d/13dmidecode...
Starting /etc/beowulf/init.d/13sendstats...
Starting /etc/beowulf/init.d/15openib.local...
Creating UDAPL Configuration: [ OK ]
...
Starting /etc/beowulf/init.d/16ipoib...
Configuring IP over Infiniband
modprobe ib_umad
Using device ib1 for IP over IB
6: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 256
    link/infiniband 80:00:04:05:fe:80:00:00:00:00:00:00:00:06:6a:01:a0:00:40:e3
...
Starting /etc/beowulf/init.d/20ipmi...
Loading IPMI drivers
Created ipmi character device 252 0
Starting /etc/beowulf/init.d/81sshd...
started sshd on node 0
Starting /etc/beowulf/init.d/90torque...
Mounts Founds, Configuring Torque
node_up: Node setup completed at Fri Sep 28 12:53:34 PDT 2012.
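As a hedged convenience, the per-node boot logs shown above can be scanned in bulk from the master after a cluster-wide reboot (the grep pattern is only an example):
grep -iE 'error|fail' /var/log/beowulf/node.*   # flag nodes whose node_up or startup scripts reported problems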
Name Service

Host names and IP addresses
BeoNSS sets hostnames and IP addresses
» The master's IP and netmask are taken from its cluster interface (Master IP: 10.54.0.1, netmask: 255.255.0.0)
» Default compute node names: .0 .1 .2 .3 etc.; the master is .-1
» Node info can be set from the defaults in the "config" file
  • Node names: n0 n1 n2 n3 etc.
  • Compute node IP: 10.54.50.$node
  • Node IPMI: 10.54.150.$node
  • Node IB: 10.55.50.$node
Name format
» Cluster host names have the base form n<$node>
» Admin-defined names and IPs go in /etc/hosts (e.g. NFS/storage: 10.54.30.0, GigE switch: 10.54.10.0, IB switch: 10.54.11.0)
Special names for "self" and "master"
» The current machine is ".-2" or "self"
» The master is known as ".-1", "master", or "master0"

Host names and IP addresses in /etc/beowulf/config
(Slide shows the relevant host and IP entries of the /etc/beowulf/config file.)

BeoNSS: dynamically generated lookup
The information BeoNSS provides includes hostnames, netgroups, and user information
» The head node runs the kickback daemon; compute nodes run the kickback service
» BeoNSS creates a netgroup that includes all of the nodes in the cluster
  • Usable in /etc/exports as @cluster
» Name service information available to the master (NIS, LDAP, AD) is transparently available to compute nodes: /etc/nsswitch.conf
» The name services used on the compute nodes are ordered in /etc/beowulf/nsswitch.conf

Process Creation, Monitoring and Control

Unified process creation
The master runs the bpmaster daemon; each compute node runs the bpslave daemon
The process is loaded on the master
The master moves the process image to the compute node
» The process gets everything it needs to begin execution from the master (shell environment, libraries, etc.)
The compute node begins process execution
As additional items are needed by the process, they are transferred from the master and cached

Process monitoring and control
Single virtual process space over the cluster
» One cluster-wide (or unified) process space
  • Process commands on the master such as 'ps' or 'top' show jobs that are running on all nodes
» Standard process monitoring tools work unchanged
  • Well-known POSIX job control: kill, bg, fg
» Negligible performance impact
» Major operational and performance benefits
Consider a cluster-wide "killall"
» Over 7 minutes on a University of Arizona cluster with 'ssh'
» Real-time, interactive response with the Scyld approach

Process operation
Benefits of the single, unified, cluster-wide process space
Execution consistency
» No inconsistent environments on compute nodes
» Remote execution produces the same results as local execution
Implications:
» Cluster jobs are issued from the designated master
» That master has the required environment (no difference on nodes)
» Same executable (including version!)
» Same libraries, including library linkage order
» Same parameters and environment
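A hedged illustration of the unified process space in day-to-day use, based on the job-control behavior described above (the node number is only an example):
bpsh 3 sleep 600 &    # start a process on node 3 from the master's shell
jobs                  # it appears under normal shell job control on the master
kill %1               # standard POSIX job control terminates the remote process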
Summary: Unified Process Space
One of the key advantages of the Scyld cluster system is a unified process space
» The master holds a single process table for the cluster
» Users can submit multiple jobs using bpsh and use standard POSIX job control (i.e. &, bg, fg, kill, etc.)
» ps aux | bpstat -P will show which processes are associated with which nodes

Unified process space implementation
The unified process space is implemented by modifying the kernel, extending the master's process table
» Correct semantics and efficient monitoring/control
» The 2.2/2.4 Linux kernel implementation used custom hooks
» The redesigned 2.6 implementation minimizes/eliminates hooks
Obtain kernel upgrades from Penguin
» Upgrade other packages in the usual way: yum -y --exclude=kernel* update

Providing Data to Clients

Local data
Disks on each compute node for local, temporary data
Local: use storage on each node's disk
» Relatively high performance
» Each node has a potentially different filesystem
» Shared data files must be copied to each node
» No synchronization
» Most useful for temporary/scratch files accessed only by the copy of the program running on a single node

Remote file systems
Persistent data available to clients from remote sources
A file system to support the application
» Just as in managing processes and administering the cluster, the optimal file system would present a single system image to all nodes of the cluster
» Such file systems exist but have various drawbacks, one in particular being degraded performance
» Since each node in a Beowulf can have its own disk, making the same files available on each node can be problematic

Data from NFS mounts
Remote: share a single disk among all nodes via NFS (an example /etc/beowulf/fstab entry appears later in this section)
» Simplest solution for small clusters
  • Reading/writing small files
» Every node sees the same filesystem
» Well-known protocol
» Ubiquitous - supported by all major OSs
» Relatively low performance
  • Tens of MB/s
» Doesn't scale well; the server becomes a bottleneck in large systems

Data from remote parallel file systems
Parallel: stripe files across storage volumes on multiple nodes
» Relatively high performance
» Each node sees the same filesystem
» File system is distributed over many computers and their volumes
» Aggregates the network bandwidth and disks of many computers
» Scalable I/O throughput and capacity (up to 100+ GB/sec)
» Works best for I/O-intensive applications
» Not a good solution for small files

Lustre parallel file system
Lustre
» The three main components are the Metadata Server (MDS), Object Storage Server (OSS), and client
» File system metadata is stored on the MDS
» File data is stored on the OSSs' disks (OSTs)
» Stripes across OSSs for aggregate bandwidth
» Clients can use a number of interconnects
» Installation and management is challenging

Panasas (panfs) parallel file system
Panasas
» Director blade (metadata) and storage blades in an 11-slot shelf
» A single director blade controls file layout and access
» Stripes across storage blades for aggregate bandwidth
» Switched Gigabit Ethernet connects cluster nodes to multiple Panasas blades
» Direct file I/O from cluster node to storage blade
» Relatively easy to set up and manage
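As referenced in the NFS slide above, a hedged sketch of publishing an NFS export to all compute nodes through /etc/beowulf/fstab on the master (the /data export is a placeholder; the field layout follows the fstab listing earlier in this deck):
# 'nonfatal' keeps a failed mount from being treated as fatal at node boot;
# 'nolock' avoids the need for rpc.statd locking on the node
10.54.30.0:/data   /data   nfs   nolock,nonfatal   0 0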
Parallel file systems
Global File System (GFS)
» Available with the Red Hat distribution
» Works best with Fibre Channel
Parallel Virtual File System (PVFS)
» Open-source software developed at Clemson University
General Parallel File System (GPFS)
» Proprietary IBM software solution

System Management

Physical Management
ipmitool
» The Intelligent Platform Management Interface (IPMI) is integrated into the baseboard management controller (BMC)
» Serial-over-LAN (SOL) can be implemented
» Allows access to hardware such as sensor data or power states
» E.g. ipmitool -H n$NODE-ipmi -U admin -P admin power {status,on,off}
  • Use bpctl instead of 'power off'
  • for i in {0..99} ; do ipmitool -H n$i-ipmi -U admin -P admin power on ; done
bpctl
» Controls the operational state and ownership of compute nodes
» Examples might be to reboot or power off a node
  • Reboot: bpctl -S all -R
  • Power off: bpctl -S all -P
» Limits user and group access to run on a particular node or set of nodes

Physical Management
beostat
» Displays raw data from the Beostat system
  • Basic hardware data (CPUs, RAM, network)
  • Load and utilization
beosetup
» GUI to administer a Scyld cluster
» Shows the dynamic node addition when a new node is booted
» Edits other values, which are correctly entered into the global /etc/beowulf/config file
service beowulf {start,stop,restart,reload}
» OS-level control of the beowulf service
  • stop/start and restart will cause all compute nodes to reboot
  • reload implements changes in /etc/beowulf/config without rebooting nodes

Physical Management - User level
bpstat
» Unified state, status, and statistics used for
  • Scheduling
  • Monitoring
  • Diagnostics
» Reports the status of compute nodes and which processes are associated with each
  • ps aux | bpstat -P
beostatus
» Displays status information about the cluster
» X-windows and curses options are available
  • 'beostatus' versus 'beostatus -c'
beoconfig
» Returns keyword values from the Scyld cluster /etc/beowulf/config file for use in scripts if needed
» e.g. beoconfig interface

Physical Management - copying files
bpcp
» Stage data locally on compute nodes to improve performance
  • The default directory is the current directory
  • bpcp host1:file1 host2:file2
  • A global copy can be done when combined with bpsh
    - bpsh -a bpcp master:file /tmp
beochkconfig
» Controls the node startup scripts in /etc/beowulf/init.d
  • These scripts are run on the head node when a compute node boots
  • Modifications to the compute node configuration are done via bpsh commands
» Sets the execute bit on or off

Interactive and Serial Jobs

User-directed process migration
The basic mechanism for running programs on nodes
» Patterned after the rsh and ssh commands
  • Note: by default, nodes don't run remote access daemons (e.g. sshd, rshd)
bpsh [options] nodenumber command [command-args]
» Compare with: ssh -l user hostname uname -a
» nodenumber can be a single node, a comma-separated list of nodes, or -a for all nodes that are up
  • bpsh 1,3,2 hostname
» No guarantee of output order unless -s is specified
» Common flags: bpsh -asp
  • Perform on all nodes, display output in sequential order, prefix output with the node number
Input and output are redirected from the remote process
» -N: no I/O forwarding
» -n: /dev/null is stdin
» -I, -O, -E: redirect from/to a file
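A hedged pair of examples of the bpsh flags just described:
bpsh 1,3,2 hostname   # run on nodes 1, 3, and 2; output order is not guaranteed
bpsh -asp uname -r    # all up nodes, sequential output, each line prefixed with its node number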
Resource Management
bpsh requires a nodenumber to be provided, but how does a user choose which node?
» Assign a given node to a particular user
» Randomly choose a node
» Etc.
Determine which node has the lowest utilization and run there
» Manually, using beostat -l to display load averages
» Use beomap to do so automatically

Map concept
Mapping is the assignment of processes to nodes based on current CPU load
» Parses data from beostat automatically
» Produces a colon-delimited list of nodes
» The default mapping policy consists of the following steps:
  • Run on nodes that are idle
  • Run on CPUs that are idle
  • Minimize the load per CPU
bpsh `beomap -nolocal` command
» Benefit: standard I/O is forwarded and redirected

Distributed Serial Applications
mpprun and beorun provide true "dynamic execution" capabilities, whereas bpsh provides "directed execution" only
» Specify the number of processors on which to start copies of the program
» Start one copy on each node in the cluster
» Start one copy on each CPU in the cluster
» Force all jobs to run on the master node
» Prevent any jobs from running on the master node
Key difference between mpprun and beorun:
» beorun runs the job on the selected nodes concurrently
» mpprun runs the job sequentially on one node at a time

beorun vs. mpprun
beorun takes ~1 second to run all 8 threads
» [user@cluster username]$ date;beorun -np 8 sleep 1;date
  Mon Mar 22 11:48:30 PDT 2010
  Mon Mar 22 11:48:32 PDT 2010
mpprun takes 8 seconds to run all 8 threads
» [user@cluster username]$ date;mpprun -np 8 sleep 1;date
  Mon Mar 22 11:48:46 PDT 2010
  Mon Mar 22 11:48:54 PDT 2010

Combining with beomap
beorun and mpprun can be used to dynamically select nodes when combined with beomap
» mpprun -map `beomap -np 4 -nolocal` hostname
A mapping can also be specified explicitly:
» mpprun -map 0:0:0:0 hostname
» mpprun -map 0:1:2:3 hostname

Resource Management and Job Queuing

Queuing
How are resources allocated among multiple users and/or groups?
» Statically, by using bpctl user and group permissions
  • Slave node 5 can only run jobs by user Needy: bpctl -S 5 -u Needy
  • Slave node 5 can only run jobs by group GotGrant: bpctl -S 5 -g GotGrant
  • Make slave node 5 unavailable to run jobs: bpctl -S 5 -s unavailable
» ClusterWare supports a variety of queuing packages
  • TORQUE
    - Open-source scheduler sponsored by Adaptive Computing
    - Included with the ClusterWare distribution
  • Moab
    - Advanced policy-based scheduler product from Adaptive Computing
    - Integrates with the resource manager daemons from TORQUE (pbs_server, pbs_mom)
    - Available from Penguin Computing for integration into ClusterWare
  • Grid Engine (formerly Sun Grid Engine, now Oracle Grid Engine), Son of Grid Engine, Open Grid Scheduler
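A hedged illustration of how the colon-delimited map produced by beomap (described above) can be captured and reused; the format shown in the comment is only an example:
MAP=`beomap -np 4 -nolocal`   # e.g. a value such as 1:2:3:4
echo $MAP
mpprun -map $MAP hostname     # run on exactly the nodes in the map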
Lineage of PBS-based Queuing Systems
A number of queuing systems have been developed (e.g. NQS, LSF, OpenPBS, PBSPro, TORQUE)
» PBSPro is a commercial product
» OpenPBS was an open-source component of the product
  • OpenPBS had many contributions from the community, but the vendor ceased development
» TORQUE (Terascale Open-source Resource and QUEue manager)
  • Forked from the OpenPBS project
  • Sponsored by Adaptive Computing
All of the PBS-type schedulers consist of three components:
» pbs_server - keeps track of jobs in the queue and the resources available to run jobs
» pbs_sched - scheduler that analyzes information from pbs_server and decides which jobs should be run
» pbs_mom - communicates with pbs_server about what resources are available and used, and also spawns the job submission scripts

TaskMaster Implementation
Integration scheme for TORQUE and Moab
» pbs_mom on the compute nodes (nodes 0 - N)
» pbs_server and pbs_sched on the master

Scheduler Improvements
The default scheduler in TORQUE is pbs_sched
» Essentially a FIFO scheduler; some capabilities exist for more complex policies such as priority-based scheduling
» Queue-based scheduling, where multiple queues are defined for different job profiles
Maui is an improvement on the default pbs_sched
» Maui extends the capabilities of the base resource management system by adding a number of fine-grained scheduling features
» Utilizes TORQUE's pbs_server and pbs_mom components
Adaptive Computing has improved and commercialized Maui as the Moab product
» More functionality and better administration and user interfaces
Penguin Computing has licensed Moab and integrated it with ClusterWare

Interacting with TORQUE
To submit a job:
» All jobs are submitted to qsub in a script
  • Example script.sh:
    #!/bin/sh
    #PBS -j oe
    #PBS -l nodes=4
    cd $PBS_O_WORKDIR
    hostname
» The #PBS lines are PBS directives
» qsub does not accept arguments for script.sh; all executable arguments must be included in the script itself
  • Administrators can create a 'qapp' script that takes user arguments, creates script.sh with the user arguments embedded, and runs 'qsub script.sh'

Other options to qsub
Options that can be included in a script (with the #PBS directive) or on the qsub command line
» Join output and error files: #PBS -j oe
» Request resources: #PBS -l nodes=2:ppn=2
» Request walltime: #PBS -l walltime=24:00:00
» Define a job name: #PBS -N jobname
» Send mail at job events: #PBS -m be
» Assign the job to an account: #PBS -A account
» Export current environment variables: #PBS -V
To start an interactive queue job use:
» qsub -I
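A hedged example that combines several of the directives above into one submission script (the job name and resource values are placeholders):
#!/bin/sh
#PBS -j oe
#PBS -l nodes=2:ppn=2
#PBS -l walltime=24:00:00
#PBS -N jobname
#PBS -V
cd $PBS_O_WORKDIR
hostname
Submit it with 'qsub script.sh', or use 'qsub -I' for an interactive session.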
Interacting with TORQUE
Some TORQUE commands and files
» qstat - status of the queue server and jobs
» qdel - remove a job from the queue
» qhold, qrls - hold and release a job in the queue
» qmgr - administrator command to configure pbs_server
» /var/spool/torque/server_name: should match the hostname of the head node
» /var/spool/torque/mom_priv/config: file to configure pbs_mom
  • '$usecp *:/home /home' indicates that pbs_mom should use 'cp' rather than 'rcp' or 'scp' to relocate the stdout and stderr files at the end of execution
» pbsnodes - administrator command to monitor the status of the resources
» qalter - administrator command to modify the parameters of a particular job (e.g. requested time)

Torque and Scyld
Scyld bundles TORQUE with its distribution
The pbs_server, pbs_sched, and pbs_mom services in /etc/init.d/ are started by the torque service
To enable torque:
» beochkconfig 90torque on
To configure torque:
» service torque reconfigure
  • Sets up the nodes file with all the compute nodes and the correct CPU core counts
  • Sets up the server_name file correctly
  • Reinitializes the pbs_server configuration (do not run 'service torque reconfigure' on a cluster with customizations)

Torque and Scyld 2
To start torque: service torque cluster-start
To stop torque: service torque cluster-stop
To verify the configuration:
> pbsnodes -a
n0
  state = free
  np = 16
  ntype = cluster
  status = rectime=1348867669,varattr=,jobs=,state=free,netload=263331978,gres=,loadave=0.00,ncpus=16,physmem=66053816kb,availmem=65117716kb,totmem=66053816kb,idletime=72048,nusers=0,nsessions=0,uname=Linux node01 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64,opsys=linux
  mom_service_port = 15002
  mom_manager_port = 15003
  gpus = 0
n1
  state = free
  np = 16
  . . .
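Putting the pieces above together, a hedged end-to-end Torque bring-up on the master would look roughly like this:
beochkconfig 90torque on       # enable the node-side torque startup script
service torque reconfigure     # regenerate the nodes and server_name files (skip on a customized cluster)
service torque cluster-start   # start pbs_server, pbs_sched, and the pbs_moms
pbsnodes -a                    # confirm each node reports 'state = free'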
FAQ: Scripts with arguments
qapp script:
» Be careful about escaping special characters in the redirect section (\$, \', \")
#!/bin/bash
# Usage: qapp arg1 arg2
debug=0
opt1="${1}"
opt2="${2}"
if [[ "${opt2}" == "" ]] ; then
  echo "Not enough arguments"
  exit 1
fi
cat > app.sh << EOF
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd \$PBS_O_WORKDIR
app $opt1 $opt2
EOF
if [[ "${debug}" -lt 1 ]] ; then
  qsub app.sh
fi
if [[ "${debug}" -eq 0 ]] ; then
  /bin/rm -f app.sh
fi

FAQ: Data on local scratch
Using local scratch:
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd $PBS_O_WORKDIR
tmpdir="/scratch/$USER/$PBS_JOBID"
/bin/mkdir -p $tmpdir
rsync -a ./ $tmpdir
cd $tmpdir
$pathto/app $1 $2
cd $PBS_O_WORKDIR
rsync -a $tmpdir/ .
/bin/rm -fr $tmpdir

FAQ: Data on local scratch with MPICH
Using local scratch for MPICH parallel jobs:
#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
cd $PBS_O_WORKDIR
tmpdir="/scratch/$USER/$PBS_JOBID"
/usr/bin/pbsdsh -u "/bin/mkdir -p $tmpdir"
/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR ; rsync -a ./ $tmpdir"
cd $tmpdir
mpirun -machine vapi $pathto/app $1 $2
cd $PBS_O_WORKDIR
/usr/bin/pbsdsh -u "rsync -a $tmpdir/ $PBS_O_WORKDIR"
/usr/bin/pbsdsh -u "/bin/rm -fr $tmpdir"

FAQ: Data on local scratch with OpenMPI
Using local scratch for OpenMPI parallel jobs:
» Do a 'module load openmpi/gnu' prior to running qsub
#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
#PBS -V
cd $PBS_O_WORKDIR
tmpdir="/scratch/$USER/$PBS_JOBID"
/usr/bin/pbsdsh -u "/bin/mkdir -p $tmpdir"
/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR ; rsync -a ./ $tmpdir"
cd $tmpdir
/usr/openmpi/gnu/bin/mpirun -np `cat $PBS_NODEFILE | wc -l` -mca btl openib,sm,self $pathto/app $1 $2
cd $PBS_O_WORKDIR
/usr/bin/pbsdsh -u "rsync -a $tmpdir/ $PBS_O_WORKDIR"
/usr/bin/pbsdsh -u "/bin/rm -fr $tmpdir"

Other considerations
A queue script need not be a single command
» Multiple steps can be performed from a single script
  • Guaranteed resources
  • Jobs should typically be a minimum of 2 minutes
» Pre-processing and post-processing can be done from the same script using the local scratch space
» If configured, it is possible to submit additional jobs from a running queued job
To remove multiple jobs from the queue:
» qstat | grep " [RQ] " | awk '{print $1}' | xargs qdel

Other Resources

Other ClusterWare Resources
PDF manuals (/usr/share/doc/PDF)
» Administrator's Guide
» Programmer's Guide
» User's Guide
» Reference Guides

Online Documentation
man pages exist for most commands
» man command
HTML documentation is available in a web browser (/usr/share/doc/HTML)
» Need to start the httpd service to access it remotely
Penguin Computing Masterlink
» http://www.penguincomputing.com/ScyldSupport
  • A login ID can be generated by Technical Support
Moab/TORQUE
» Available at http://www.adaptivecomputing.com at the "Support" link

Support Contacts
Penguin Computing Technical Support
» Can help with ClusterWare configuration and basic system questions
» Provided as part of the software support contract
  • 1-888-PENGUIN
  • [email protected]
Penguin Computing Professional Services
» Higher-level, application-specific support and optimization based on a pre-scoped Statement of Work
» Other custom consulting

Additional Topics

Moab Scheduler

Integration with MOAB
When the MOAB scheduler is installed, the pbs_sched service has to be disabled (typically through changes to the /etc/init.d/torque script) and the moab service enabled.
Scyld releases RPMs of MOAB that directly integrate with Scyld ClusterWare. These releases are not the latest versions, but they do make the necessary changes to the startup scripts and configure a basic /opt/moab/etc/moab.cfg configuration file.
One can also download MOAB from adaptivecomputing.com, configure it to work with torque, and install it normally. With this option, the /etc/init.d/moab script has to be created and enabled, and pbs_sched disabled, by hand.
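For the by-hand route just described, the service switch on the master would look roughly like the following hedged sketch (it assumes an /etc/init.d/moab script with chkconfig headers has already been written, and that pbs_sched has been disabled by editing /etc/init.d/torque):
chkconfig --add moab    # register the hand-written init script
chkconfig moab on       # start moab at boot
service moab start      # start the scheduler now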
MOAB Initial Setup
Edit the configuration in /opt/moab/etc/moab.cfg
» SCHEDCFG[Scyld] MODE=NORMAL SERVER=scyld.localdomain:42559
  • Ensure the hostname is consistent with `hostname`
» ADMINCFG[1] USERS=root
  • Add additional users who can be queue managers
» RMCFG[base] TYPE=PBS
or, for multiple remote servers running pbs_server:
» RMCFG[b0] HOST=hn0.localdomain SUBMITCMD=qsub TYPE=PBS VERSION=2.5.9
» RMCFG[b1] HOST=hn1.localdomain SUBMITCMD=qsub TYPE=PBS VERSION=2.5.9

Interacting with Moab
Because the Moab scheduler uses the TORQUE pbs_server and pbs_mom components, all TORQUE commands are still valid
» qsub will submit a job to TORQUE; Moab then polls pbs_server to detect new jobs
» msub will submit a job to Moab, which then pushes the job to pbs_server
Other Moab commands
» qstat -> showq
» qdel, qhold, qrls -> mjobctl
» pbsnodes -> showstate
» qmgr -> mschedctl, mdiag
» Configuration in /opt/moab/moab.cfg

Tuning
The default walltime can be set in Torque using:
» qmgr -c 'set queue batch resources_default.walltime=16:00:00'
If many small jobs need to be submitted, uncomment the following in /opt/moab/moab.cfg
» JOBAGGREGATIONTIME 10
To exactly match node and processor requests, add the following to /opt/moab/moab.cfg
» JOBNODEMATCHPOLICY EXACTNODE
Changes in /opt/moab/moab.cfg can be activated by doing a 'service moab restart'

Parallel Jobs

Explicitly Parallel Programs
Different paradigms exist for parallelizing programs
» Shared memory
» OpenMP
» Sockets
» PVM
» Linda
» MPI
Most distributed parallel programs are now written using MPI
» De facto standard
» Different options for MPI stacks: MPICH, OpenMPI, HP, Intel
» ClusterWare comes integrated with customized versions of MPICH and OpenMPI

MPI Implementation Comparison
MPICH is provided by Argonne National Laboratory
» Runs only over Ethernet
Ohio State University ported MPICH to use the Verbs API => MVAPICH
» Similar to MPICH but uses InfiniBand
LAM-MPI was another implementation, which provided a more modular design
OpenMPI is the successor to LAM-MPI and has many options
» Can use different physical interfaces and spawning mechanisms
» http://www.openmpi.org
HP-MPI, Intel-MPI
» Licensed MPICH2 code and added functionality
» Can use a variety of physical interconnects

MPI Implementation Comparison, part 2
MPICH2 is provided by Argonne National Laboratory
» MPI-1 and MPI-2 compliant
» The hydra process manager is preferred - no longer any need to deploy a daemon on the compute nodes
» Has run on the Raspberry Pi
MVAPICH2 is provided by Ohio State University
» MPI-1 and MPI-2 compliant
» Based on MPICH2
» Has InfiniBand support

Compiling MPICH programs
mpicc, mpiCC, mpif77, and mpif90 are used to automatically compile code and link in the correct MPI libraries from /usr/lib64/MPICH
» Environment variables can be used to set the compiler:
  • CC, CPP, FC, F90
» Command-line options to set the compiler:
  • -cc=, -cxx=, -fc=, -f90=
» GNU, PGI, and Intel compilers are supported

Running MPICH programs
mpirun is used to launch MPICH programs
Dynamic allocation can be done when using the -np flag
Mapping is also supported when using the -map flags
If InfiniBand is installed, the interconnect fabric can be chosen using the -machine flag:
» -machine p4
» -machine vapi
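A hedged end-to-end example for the bundled MPICH ('hello.c' is a placeholder source file):
mpicc -o hello hello.c               # link against the MPICH libraries in /usr/lib64/MPICH
mpirun -np 8 ./hello                 # dynamic allocation of 8 processes
mpirun -machine vapi -np 8 ./hello   # select the InfiniBand (VAPI) fabric instead of p4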
Environment Variable Options
Additional environment variable control:
» NP - the number of processes requested, but not the number of processors. As in the example earlier in this section, NP=4 ./a.out will run the MPI program a.out with 4 processes.
» ALL_CPUS - set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs.
» ALL_NODES - set the number of processes to the number of nodes available to the current user. Similar to ALL_CPUS, but you get a maximum of one CPU per node. This is useful for running a job per node instead of per CPU.
» ALL_LOCAL - run every process on the master node; used for debugging purposes.
» NO_LOCAL - don't run any processes on the master node.
» EXCLUDE - a colon-delimited list of nodes to be avoided during node assignment.
» BEOWULF_JOB_MAP - a colon-delimited list of nodes. The first node listed will run the first process (MPI rank 0), and so on.

Compiling and Running OpenMPI programs
The env-modules package allows users to change their environment variables according to predefined files
» module avail
» module load openmpi/gnu
» GNU, PGI, and Intel compilers are supported
mpicc, mpiCC, mpif77, and mpif90 are used to automatically compile code and link in the correct MPI libraries from /usr/lib64/OMPI
mpirun is used to run the code
The interconnect can be selected at runtime
» -mca btl openib,tcp,sm,self
» -mca btl udapl,tcp,sm,self

Compiling and Running OpenMPI programs
What env-modules does: sets the user environment prior to compiling
» export PATH=/usr/openmpi/gnu/bin:${PATH}
mpicc, mpiCC, mpif77, and mpif90 are used to automatically compile code and link in the correct MPI libraries from /usr/lib64/OMPI
» Environment variables can be used to set the compiler:
  • OMPI_CC, OMPI_CXX, OMPI_F77, OMPI_FC
Prior to running, PATH and LD_LIBRARY_PATH should be set
» module load openmpi/gnu
» /usr/openmpi/gnu/bin/mpirun -np 16 a.out
OR:
» export PATH=/usr/openmpi/gnu/bin:${PATH}
  export OPAL_PKGDATADIR=/usr/openmpi/gnu/share
  export MANPATH=/usr/openmpi/man
  export LD_LIBRARY_PATH=/usr/lib64/OMPI/gnu:${LD_LIBRARY_PATH}
» /usr/openmpi/gnu/bin/mpirun -np 16 a.out
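A hedged end-to-end sketch of the OpenMPI workflow above ('hello.c' and the process count of 16 are placeholders):
module load openmpi/gnu                          # sets PATH and LD_LIBRARY_PATH as described above
mpicc -o a.out hello.c                           # links against the OpenMPI libraries in /usr/lib64/OMPI
mpirun -np 16 -mca btl openib,sm,self ./a.out    # run over InfiniBand, shared memory, and self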