Scyld ClusterWare Orientation

Transcription

Orientation Agenda
 Beowulf Architecture
 Booting and Provisioning
 Name Service
 Process Creation, Monitoring, and Control
 Providing Data to Clients
 System Management
 Interactive and Serial Jobs
 Resource Management and Job Queuing
 Other Resources
Beowulf Architecture
Beowulf Clustering: The Early Years
 Conceived by Don Becker and Thomas
Sterling in ’93 and initiated at NASA in ’94
 Objective: show that scalable commodity clusters could solve problems usually handled by million-dollar supercomputers, but at a fraction of the cost
 Initial prototype
» 16 processors, Ethernet bonding
» Scalable (processing power, memory,
storage, network bandwidth)
» Under $50k (1994) but matched
performance of contemporary $1M
SMP system
Traditional “Ad-Hoc” Linux Cluster
[Diagram: master node with two network connections - one to the private interconnection network of compute nodes, one to the Internet or an internal network]
 The Master node:
» Fully loaded with hardware components
» Full complement of RAS features
» Complete Linux distribution
» User access, roles and security implemented
» Two network connections: one for the private cluster and one external
 Each compute node:
» Fully loaded with hardware components
» Complete Linux distribution
• Installation is manual and slow (5 to 30 min)
» User access, roles and security implemented
• Difficult to manage
 Monitoring & management added as isolated tools
Lessons Learned
 Issues with this approach
» Complexity
» Requires extensive training to install, configure and use
» Long-term administration and updates were difficult
» Only “static scalability”
 A better cluster system implementation
» Create a unified view of independent machines
» Single installation to manage and upgrade
» Single, unified application environment
» Central location for monitoring, trouble-shooting, security
Cluster Virtualization Architecture Realized
Manage & use a cluster like a single SMP machine
[Diagram: master node and compute nodes (with optional disks) on the interconnection network; master also connects to the Internet or an internal network]
 Compute nodes: no disk required
 Minimal in-memory OS
» Deployed at boot in less than 20 sec
 Virtual, unified process space enables intuitive single sign-on and job submission
» One set of user security to manage
» Consistent environment for migrating jobs to nodes
 Master is a full implementation
 Monitor & manage efficiently from the Master
» Single system install
• Single point of provisioning
» Single process space
» Better performance due to lightweight nodes
Booting and Provisioning
Booting
 Compute node booting requirements
» No dependency on local storage:
only CPU, memory and NIC needed (local disks optional)
» Automatic scaling (a few nodes to thousands)
» Centralized control and updates (from the Master)
» Reporting and diagnostics designed into the system
 Core system is "stateless" (non-persistent)
» Kernel and minimal environment sent from the master to the client as a ramdisk
» Just enough in the ramdisk for the client to boot and access the master
» Additional elements provided on demand under centralized master control
Booting implementation
 Boot server (beoServe) supports PXE and other well-known protocols
» Understands PXE versions
» Avoids the TFTP "capture effect"
• Multiple node access will not return 'time out'
• UDP will wait for connection
» DHCP service for non-cluster hosts
 Kernel and minimal environment from the Master
 Just enough to say, “what do I do now?”
 Remaining configuration driven by the Master
 Boot diagnostics for all nodes available on the Master
– /var/log/messages
– /var/log/beowulf/node.n
NFS to client systems
 Mounting file systems is per-installation configuration
» ClusterWare itself needs “no” NFS file systems;
data could be transferred as needed
» Some NFS mounts are set up for path searching and convenience
• /bin /usr/bin /opt /home /usr/lib /usr/lib64
 Administration is done on the master
» File system configuration tables are on the master
• Standard format, but in /etc/beowulf/fstab
» Cluster-wide default with per-node specialization
» Mount failures are non-fatal and diagnosable
/etc/beowulf/fstab
[root@cw00 beowulf]# grep -v '^#' fstab | grep -v '^$'
none                          /dev/pts              devpts  gid=5,mode=620   0 0
none                          /proc                 proc    defaults         0 0
none                          /sys                  sysfs   defaults         0 0
none                          /bpfs                 bpfs    defaults         0 0
none                          /dev/shm              tmpfs   defaults         0 0
$MASTER:/bin                  /bin                  nfs     nolock,nonfatal  0 0
$MASTER:/usr/bin              /usr/bin              nfs     nolock,nonfatal  0 0
$MASTER:/localhome            /localhome            nfs     nolock,nonfatal  0 0
$MASTER:/usr/lib64/python2.4  /usr/lib64/python2.4  nfs     nolock,nonfatal  0 0
$MASTER:/usr/lib/perl5        /usr/lib/perl5        nfs     nolock,nonfatal  0 0
$MASTER:/usr/lib64/perl5      /usr/lib64/perl5      nfs     nolock,nonfatal  0 0
10.54.30.0:/opt               /opt                  nfs     nolock,nonfatal  0 0
10.54.30.0:/home              /home                 nfs     nolock,nonfatal  0 0
/dev/sda1                     swap                  swap    defaults         0 0
/dev/sda2                     /scratch              ext2    defaults         0 0
Executing init.d scripts on compute nodes
 Located in /etc/beowulf/init.d/
 Scripts start on the head node and need remote-execution commands (e.g. bpsh) to operate on compute nodes (see the sketch after this slide)
 Order is based on file name
» Numbered files can be used to control order
 beochkconfig is used to set the +x bit on files
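A minimal sketch of what such a startup script might look like. This is illustrative only and not from the original slides: it assumes the booting node's number is passed to the script as its first argument, and it uses bpsh (covered later in this deck) to act on that node; check the Administrator's Guide for the exact calling convention.

#!/bin/bash
# /etc/beowulf/init.d/25example -- hypothetical startup script
# Assumption: the number of the booting compute node arrives as $1.
NODE=$1
# The script itself runs on the master, so act on the client via bpsh:
bpsh $NODE /sbin/sysctl -w vm.overcommit_memory=1
exit 0

Enable or disable it with beochkconfig (e.g. beochkconfig 25example on), which simply sets or clears the execute bit.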
Typical set of startup scripts
[root@cw00 ~]# ls -w 20 -F /etc/beowulf/init.d
03kickbackproxyd*
08nettune*
09nscd*
10mcelog*
12raid.example
13dmidecode*
13sendstats*
15openib
15openib.local*
16ipoib*
20ipmi*
20srp
23panfs
25cuda
30cpuspeed
80rcmdd
81sshd*
85run2complete
90torque*
99local
Client boot diagnostics
 Boot diagnostics for all nodes are available on the Master:
– /var/log/messages
– /var/log/beowulf/node.n
/var/log/messages as a node boots
Sep 28 12:52:53 cw00 beoserv: NODESTATUS 0 dhcp-pxe Assigned address 10.54.50.0 for node 0
during PXE BIOS from 00:A0:D1:E4:87:D6
Sep 28 12:52:53 cw00 beoserv: NODESTATUS 0 tftp-bootloader TFTP download
/usr/lib/syslinux/pxelinux.0 to node 0
Sep 28 12:52:53 cw00 beoserv: NODESTATUS 0 tftp-bootloader TFTP download
/usr/lib/syslinux/pxelinux.0 to node 0
Sep 28 12:52:53 cw00 beoserv: NODESTATUS 0 tftp-bootconfig TFTP download autogenerated
PXELINUX config file to node 0
Sep 28 12:52:53 cw00 beoserv: NODESTATUS 0 tftp-kernel TFTP download /boot/vmlinuz-2.6.18-308.11.1.el5.582g0000 to node 0
Sep 28 12:52:53 cw00 beoserv: NODESTATUS 0 tftp-file TFTP download
/var/beowulf/boot/computenode.initrd to node 0
...
Sep 28 12:53:14 10.54.50.0 (none) ib_mthca: Initializing 0000:09:00.0
Sep 28 12:53:14 10.54.50.0 (none) GSI 18 sharing vector 0xC1 and IRQ 18
Sep 28 12:53:14 10.54.50.0 (none) ACPI: PCI Interrupt 0000:09:00.0[A] -> GSI 28 (level,
low) -> IRQ 193
Sep 28 12:53:15 10.54.50.0 (none) ib_mthca 0000:09:00.0: HCA FW version 3.4.000 is old
(3.5.000 is current).
Sep 28 12:53:15 10.54.50.0 (none) ib_mthca 0000:09:00.0: If you have problems, try
updating your HCA FW.
...
Sep 28 12:53:33 10.54.50.0 n0 ipmi: Found new BMC (man_id: 0x005059, prod_id: 0x000e,
dev_id: 0x20)
Sep 28 12:53:33 10.54.50.0 n0 IPMI kcs interface initialized
Sep 28 12:53:33 10.54.50.0 n0 ipmi device interface
Sep 28 12:53:34 10.54.50.0 n0 Sep 28 12:53:34 sshd[65653]: Server listening on :: port 22.
Sep 28 12:53:34 10.54.50.0 n0 Sep 28 12:53:34 sshd[65653]: Server listening on 0.0.0.0
port 22.
Sep 28 12:53:34 cw00 mountd[2986]: authenticated mount request from n0:1020 for
/var/spool/torque/mom_logs (/var/spool/torque/mom_logs)
/var/log/beowulf/node.0
node_up: Initializing cluster node 0 at Fri Sep 28 12:53:13 PDT 2012.
node_up: Setting system clock from the master.
node_up: Configuring loopback interface.
node_up: Explicitly mount /bpfs.
node_up: Initialize kernel parameters using /etc/beowulf/conf.d/sysctl.conf
node_up: Loading device support modules for kernel version 2.6.18-308.11.1.el5.582g0000.
node_up: eth0 is the cluster interface
node_up: Using eth0:10.54.0.1 as the default route
node_up: Making compute node devices and running setup_fs.
setup_fs: Configuring node filesystems using /etc/beowulf/fstab
setup_fs: Mounting /bpfs (type=bpfs; options=defaults)
setup_fs: Mounting /dev/pts (type=devpts; options=gid=5,mode=620)
setup_fs: Mounting /dev/shm (type=tmpfs; options=defaults)
setup_fs: Mounting /proc (type=proc; options=defaults)
setup_fs: Mounting /sys (type=sysfs; options=defaults)
setup_fs: Mounting 10.54.0.1:/bin on /bin (type=nfs; options=nolock,nonfatal)
...
setup_fs: Creating libcache directory trees.
node_up: Using master's time zone setting from /etc/localtime.
node_up: Copying ld.so.cache.
node_up: Copying loader files.
node_up: Configuring BeoNSS cluster name service (nsswitch.conf).
node_up: Enabling slave node automatic kernel module loading.
node_up: Change slave node /rootfs to be the real root.
node_up: Start rpc.statd daemon for NFS mounts without 'nolock'.
node_up: Prestage /etc/alternatives files.
node_up: Prestage libcache file: /lib64/libcrypt.so.1
...
more on /var/log/beowulf/node.0
Starting /etc/beowulf/init.d/03kickbackproxyd...
Starting /etc/beowulf/init.d/08nettune...
Starting /etc/beowulf/init.d/09nscd...
started nscd on node 0
Starting /etc/beowulf/init.d/10mcelog...
Starting /etc/beowulf/init.d/13dmidecode...
Starting /etc/beowulf/init.d/13sendstats...
Starting /etc/beowulf/init.d/15openib.local...
Creating UDAPL Configuration: [ OK ]
...
Starting /etc/beowulf/init.d/16ipoib...
Configuring IP over Infiniband
modprobe ib_umad
Using device ib1 for IP over IB
6: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 256
link/infiniband
80:00:04:05:fe:80:00:00:00:00:00:00:00:06:6a:01:a0:00:40:e3
...
Starting /etc/beowulf/init.d/20ipmi...
Loading IPMI drivers
Created ipmi character device 252 0
Starting /etc/beowulf/init.d/81sshd...
started sshd on node 0
Starting /etc/beowulf/init.d/90torque...
Mounts Founds, Configuring Torque
node_up: Node setup completed at Fri Sep 28 12:53:34 PDT 2012.
Name Service
Host names and IP addresses
[Diagram: compute nodes n0 through n5 and the master (".-1") on the interconnection network; the master also connects to the Internet or an internal network]
 Master node IP and netmask come from its interface
Master IP: 10.54.0.1
Netmask: 255.255.0.0
 BeoNSS sets hostnames and IP addresses
» Default compute node names: .0 .1 .2 .3 etc.
» Node info is set from defaults in /etc/beowulf/config
Node name: n0 n1 n2 n3 etc.
Compute node IP: 10.54.50.$node
Node IPMI: 10.54.150.$node
Node IB: 10.55.50.$node
 Name format
» Cluster host names have the base form: n<$node>
» Admin-defined names and IPs go in /etc/hosts
NFS/Storage: 10.54.30.0
GigE Switch: 10.54.10.0
IB Switch: 10.54.11.0
 Special names for "self" and "master"
» The current machine is ".-2" or "self"
» The master is known as ".-1", "master", or "master0"
Host names and IP in /etc/beowulf/config
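The original slide showed a screenshot of /etc/beowulf/config that is not preserved in this transcription. Purely as a hypothetical sketch of the kind of entries involved (keyword names and syntax here are assumptions; inspect the file and the Administrator's Guide on your own master):

interface eth0                    # cluster-private interface (compare: beoconfig interface)
iprange 10.54.50.0 10.54.50.99    # hypothetical compute-node address range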
beonss: Dynamically generated lookup
The information beonss provides includes hostnames, netgroups, and user information
 The head node runs the kickback daemon; compute nodes run the kickback service
 beonss creates a netgroup which includes all of the nodes in the cluster
– referenced as @cluster, for example in /etc/exports (see the sketch after this slide)
 Name service information available to the master (NIS, LDAP, AD) is transparently available to compute nodes:
/etc/nsswitch.conf
 Name services used on the compute nodes are ordered in:
/etc/beowulf/nsswitch.conf
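For illustration (an assumption, not from the slides): an export on the master restricted to the cluster netgroup could look like the following line in /etc/exports; the mount options shown are arbitrary.

/home   @cluster(rw,async,no_root_squash)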
Process Creation, Monitoring and Control
Unified process creation
 The Master runs the bpmaster daemon
 Each compute node runs the bpslave daemon
 The process is loaded on the master
 The Master moves the process image to the compute node
» The process gets everything it needs to begin execution from the Master (shell environment, libraries, etc.)
 The compute node begins process execution
 As additional items are needed by the process, they are transferred from the master and cached
Process monitoring and control
 Single virtual process space over cluster
» One cluster-wide (or unified) process space
• Process commands on master like 'ps' or 'top' will output jobs that are
running on all nodes.
» Standard process monitoring tools work unchanged
• Well known POSIX job control: kill, bg, fg
» Negligible performance impact
» Major operational and performance benefits
 Consider cluster-wide “killall”
» Over 7 minutes on University of Arizona cluster with 'ssh'
» Real-time, interactive response with Scyld approach
Process operation
Benefits of the single, unified, cluster-wide process space
 Execution consistency
» No inconsistent environments on compute nodes
» Remote execution produces same results as local execution
 Implications:
» Cluster jobs are issued from a designated master
» That master has the required environment (no differences on the nodes)
» Same executable (including version!)
» Same libraries, including library linkage order
» Same parameters and environment
Summary: Unified Process Space
 One of the key advantages of Scyld cluster systems is the unified process space: a single, cluster-wide process table on the master
 Users can submit multiple jobs using bpsh and use standard POSIX job control (i.e. &, bg, fg, kill, etc.)
 ps aux | bpstat -P will show which processes are associated with which nodes (see the sample session after this slide)
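A short illustration of the idea (a hypothetical session, assuming compute node 0 is up):

[user@cw00 ~]$ bpsh 0 sleep 300 &                # start a job on node 0; it is a child of this shell
[user@cw00 ~]$ ps aux | bpstat -P | grep sleep   # the job appears in the master's process table, tagged with its node
[user@cw00 ~]$ kill %1                           # ordinary POSIX job control signals the remote process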
Unified process space implementation
Unified process space is implemented by modifying the kernel – extending the master's process table
 Correct semantics and efficient monitoring/control
» The 2.2/2.4 Linux kernel implementation used custom hooks
» The redesigned 2.6 implementation minimizes/eliminates hooks
 Obtain kernel upgrades from Penguin
» Upgrade apps in the usual way:
yum -y --exclude=kernel* update
Providing Data to Clients
Local data
Disks on each compute node for local, temporary data
 Local - use storage on each node's disk
» Relatively high performance
» Each node has a potentially different filesystem
» Shared data files must be copied to each node
» No synchronization
» Most useful for temporary/scratch files accessed only by the copy of the program running on a single node
Remote file systems
Persistent data available to clients from remote sources
 File system to support application
» Just like in managing processes and administering the cluster,
the optimal file system would have a single system image to all
nodes of the cluster
» Such file systems exist but have various drawbacks, one in particular being degraded performance
» Since each node in a Beowulf can have its own disk, making
the same files available on each node can be problematic
Data from NFS mounts
[Diagram: client data served over NFS from an NFS node on the interconnection network]
 Remote - share a single disk among all nodes
 NFS
» Simplest solution for small clusters
• Reading/writing small files
» Every node sees the same filesystem
» Well-known protocol
» Ubiquitous - supported by all major OSs
» Relatively low performance
• 10s of MB/s
» Doesn't scale well; the server becomes a bottleneck in large systems
Data from remote parallel file systems
 Parallel - Stripe files across storage volumes on multiple nodes
» Relatively high performance
» Each node sees same filesystem
» File system distributed over many computers and their volumes
» Aggregate network bandwidth and disks of many computers
» Scalable IO throughput and capacity (up to 100+ GB/sec)
» Works best for I/O intensive applications
» Not a good solution for small files
Lustre parallel file system
 Lustre
» The three main components are the Metadata Server (MDS), Object Storage Server (OSS), and client
» File system metadata is stored on the MDS
» File data is stored on OSS disks (OSTs)
» Stripe across OSSs for aggregate bandwidth
» Clients can use a number of interconnects
» Installation and management are challenging
Panasas (panfs) parallel file systems
 Panasas
» Director blade (metadata) and storage
blades in an 11 slot shelf
» Single director blade controls file
layout and access
» Stripe across storage blades for
aggregate bandwidth
» Switched Gigabit Ethernet connects
cluster nodes to multiple Panasas
blades
» Direct file I/O from cluster node to
storage blade
» Relatively easy to set up and manage
Parallel file systems
 Global File System (GFS)
» Available with Red Hat distribution
» Works best with Fibre Channel
 Parallel Virtual File System (PVFS)
» Open software developed at Clemson University
 General Parallel File System (GPFS)
» Proprietary IBM software solution
System Management
Physical Management
 ipmitool
» The Intelligent Platform Management Interface (IPMI) is implemented by the baseboard management controller (BMC)
» Serial-over-LAN (SOL) can be implemented
» Allows access to hardware such as sensor data or power states
» E.g. ipmitool -H n$NODE-ipmi -U admin -P admin power {status,on,off}
• Use bpctl instead of 'power off'
• for i in {0..99} ; do ipmitool -H n$i-ipmi -U admin -P admin power on ; done
 bpctl
» Controls the operational state and ownership of compute nodes
» Examples might be to reboot or power off a node
• Reboot: bpctl -S all -R
• Power off: bpctl -S all -P
» Limit user and group access to run on a particular node or set of nodes
Physical Management
 beostat
» Displays raw data from the Beostat system
• Basic hardware data (CPUs, RAM, network)
• Load and utilization
 beosetup
» GUI to administer a Scyld cluster
» Shows the dynamic node addition when a new node is booted
» Edit other values which will be correctly entered into the
global /etc/beowulf/config file
 service beowulf {start,stop,restart,reload}
» O/S level control of beowulf service.
• Stop/start and restart will cause all compute nodes to reboot
• Reload to implement changes in /etc/beowulf/config without rebooting
nodes
Physical Management – User level
 bpstat
» Unified state, status and statistics used for
• Scheduling
• Monitoring
• Diagnostics
» Report status of compute nodes and which processes
are associated with each
• ps aux | bpstat -P
 beostatus
» Display status information about the cluster
» X-windows and curses options are available
• 'beostatus' versus 'beostatus -c'
 beoconfig
» Returns keyword values from the Scyld cluster
/etc/beowulf/config file for use in scripts if needed
» e.g. beoconfig interface (see the sketch after this slide)
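For example, a site script might read values out of the cluster configuration this way (an illustrative sketch; 'interface' is the keyword shown above, everything else is arbitrary):

#!/bin/bash
# Hypothetical helper: report which interface the master uses for the cluster
IFACE=$(beoconfig interface)
echo "ClusterWare is configured on interface: ${IFACE}"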
Physical Management – copying files
 bpcp
» Stage data on compute nodes locally to
improve performance
• Default directory is the current directory
• bpcp host1:file1 host2:file2
• Global copy can be done when combined with bpsh
– bpsh -a bpcp master:file /tmp
 beochkconfig
» Controls the node startup scripts in /etc/beowulf/init.d
• Scripts that are run on the headnode when a compute node boots
• Modifications to the compute node configuration are done via bpsh
commands
» Sets the execute bit on or off
Interactive and Serial Jobs
User directed process migration
 Basic mechanism for running programs on nodes
» patterned after the rsh and ssh commands
• Note: by default, nodes don't run remote access daemons (e.g. sshd, rshd)
 bpsh [options] nodenumber command [command-args]
» Compare with ssh -l user hostname uname -a
» nodenumber can be a single node, a comma separated list of nodes, or –a for all
nodes that are up
• bpsh 1,3,2 hostname
» No guarantee of ordering unless -s is specified
» Common flags: bpsh -asp
• Perform on all nodes, display output in sequential order, prefix output with node number
 Input and output are redirected from the remote process
» -N: no IO forwarding
» -n: /dev/null is stdin
» -I, -O, -E: redirect from/to a file
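A hedged illustration of the common-flags form described above (the output is hypothetical; the kernel version is the one seen in the boot logs earlier in this deck):

[user@cw00 ~]$ bpsh -asp uname -r
0: 2.6.18-308.11.1.el5.582g0000
1: 2.6.18-308.11.1.el5.582g0000
...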
Resource Management
 bpsh requires a nodenumber to be provided, but how
does a user choose which node?
» Assign a given node to a particular user
» Randomly choose a node
» Etc.
 Determine which node has the lowest utilization and
run there
» Manually, using beostat -l to display load averages
» Use beomap to do so automatically
Map concept
 Mapping is the assignment of processes to nodes
based on current CPU load
» Parses data from beostat automatically
» Colon delimited list of nodes
» default mapping policy consists of the following steps:
• run on nodes that are idle
• run on CPUs that are idle
• minimize the load per CPU
 bpsh `beomap -nolocal` command
» Benefit: standard IO is still forwarded and redirected
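An illustrative session (the node list shown is hypothetical; beomap prints the colon-delimited map described above):

[user@cw00 ~]$ beomap -np 4 -nolocal
1:2:3:4
[user@cw00 ~]$ bpsh `beomap -np 1 -nolocal` ./myapp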
Distributed Serial Applications
 mpprun and beorun provide you with true "dynamic execution" capabilities, whereas bpsh provides "directed execution" only
» Specify the number of processors on which to start copies of
the program
» Start one copy on each node in the cluster
» Start one copy on each CPU in the cluster
» Force all jobs to run on the master node
» Prevent any jobs from running on the master node
 Key difference between mpprun and beorun:
» beorun runs the job on the selected nodes concurrently
» mpprun runs the job sequentially on one node at a time
beorun vs. mpprun
 beorun takes ~1 second to run all 8 threads
» [user@cluster username]$ date;beorun -np 8 sleep 1;date
Mon Mar 22 11:48:30 PDT 2010
Mon Mar 22 11:48:32 PDT 2010
 mpprun takes 8 seconds to run all 8 threads
» [user@cluster username]$ date;mpprun -np 8 sleep 1;date
Mon Mar 22 11:48:46 PDT 2010
Mon Mar 22 11:48:54 PDT 2010
Combining with beomap
 beorun and mpprun can be used to dynamically select
nodes when combined with beomap
» mpprun -map `beomap -np 4 -nolocal` hostname
 Can be used to specify a mapping specifically:
» mpprun -map 0:0:0:0 hostname
» mpprun -map 0:1:2:3 hostname
Resource Management and Job Queuing
Queuing
 How are resources allocated among multiple users and/or groups?
» Statically by using bpctl user and group permissions
• Slave node 5 can only run jobs by user Needy: bpctl -S 5 -u Needy
• Slave node 5 can only run jobs by group GotGrant: bpctl -S 5 -g GotGrant
• Make slave node 5 unavailable to run jobs: bpctl -S 5 -s unavailable
» ClusterWare supports a variety of queuing packages
• TORQUE
– Open source scheduler sponsored by Adaptive Computing
– Included with ClusterWare distribution
• Moab
– Advanced policy-based scheduler product from Adaptive Computing
– Integrates with resource manager daemons from TORQUE (pbs_server, pbs_mom)
– Available from Penguin Computing for integration into ClusterWare
• Grid Engine (was Sun Grid Engine, now Oracle Grid Engine)
– Son of Grid Engine
– Open Grid Scheduler
Lineage of PBS-based Queuing Systems
 A number of queuing systems have been developed (e.g. NQS, LSF, OpenPBS,
PBSPro, Torque)
» PBSPro is a commercial product
» OpenPBS was an open source component of the product
• OpenPBS had many contributions from the community, but vendor ceased development
» TORQUE (Terascale Open-source Resource and QUEue manager)
• Was forked from the OpenPBS project
• Sponsored by Adaptive Computing
 All of the PBS-type schedulers consist of three components:
» pbs_server – keeps track of jobs in the queue and the resources available to run them
» pbs_sched – scheduler that analyzes information from the pbs_server and returns which jobs should be run
» pbs_mom – communicates with the pbs_server about what resources are available and used, and ALSO spawns the job submission scripts
TaskMaster Implementation
 Integration schemes for TORQUE and Moab
» pbs_mom runs on the compute nodes (compute nodes 0 - N: pbs_mom)
» pbs_server and pbs_sched run on the master (master node: pbs_server, pbs_sched)
Scheduler Improvements
 Default scheduler in TORQUE is pbs_sched
» Essentially a FIFO scheduler. Some capabilities exist for more complex
policies such as priority based scheduling
» Queue based scheduling where multiple queues are defined for different job
profiles
 Maui is an improvement on the default pbs_sched
» Maui extends the capabilities of the base resource management system by
adding a number of fine-grained scheduling features
» Utilizes TORQUE’s pbs_server and pbs_mom components
 Adaptive Computing has improved and commercialized Maui as the Moab
product
» More functionality and administration and user interfaces
 Penguin Computing has licensed Moab and integrated it with ClusterWare
Interacting with TORQUE
 To submit a job:
» All jobs are submitted to qsub in a script
• Example script.sh:
#!/bin/sh
#PBS -j oe
#PBS -l nodes=4
cd $PBS_O_WORKDIR
hostname
» PBS directives
» qsub does not accept arguments for script.sh. All executable
arguments must be included in the script itself
• Administrators can create a ‘qapp’ script that takes user
arguments, creates script.sh with the user arguments embedded,
and runs ‘qsub script.sh’
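To submit the example above and check on it (the job ID shown is hypothetical):

[user@cw00 ~]$ qsub script.sh
17.cw00
[user@cw00 ~]$ qstat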
Other options to qsub
 Options that can be included in a script (with the #PBS
directive) or on the qsub command line
» Join output and error files: #PBS -j oe
» Request resources: #PBS -l nodes=2:ppn=2
» Request walltime: #PBS -l walltime=24:00:00
» Define a job name: #PBS -N jobname
» Send mail at job events: #PBS -m be
» Assign job to an account: #PBS -A account
» Export current environment variables: #PBS -V
 To start an interactive queue job use:
» qsub -I
Interacting with TORQUE
 Some TORQUE commands and files
» qstat – Status of queue server and jobs
» qdel – Remove a job from the queue
» qhold, qrls – Hold and release a job in the queue
» qmgr – Administrator command to configure pbs_server
» /var/spool/torque/server_name: should match hostname of the head
node
» /var/spool/torque/mom_priv/config: file to configure pbs_mom
• ‘$usecp *:/home /home’ indicates that pbs_mom should use ‘cp’
rather than ‘rcp’ or ‘scp’ to relocate the stdout and stderr files at the end of
execution
» pbsnodes – Administrator command to monitor the status of the
resources
» qalter – Administrator command to modify the parameters of a
particular job (e.g. requested time)
Torque and Scyld
 Scyld bundles torque with its distribution
 The pbs_server, pbs_sched and pbs_mom services in /etc/init.d/ are started by the torque service
 To enable torque
» beochkconfig 90torque on
 To configure torque
» service torque reconfigure
• this will set up the nodes file with all the compute nodes and correct CPU core counts
• set up the server_name file correctly
• reinitialize the pbs_server configuration (do not run service torque reconfigure on a cluster with customizations)
Torque and Scyld 2
 To start torque
» service torque cluster-start
 To stop torque
» service torque cluster-stop
 To verify the configuration
> pbsnodes -a
n0
  state = free
  np = 16
  ntype = cluster
  status = rectime=1348867669,varattr=,jobs=,state=free,netload=263331978,gres=,loadave=0.00,ncpus=16,physmem=66053816kb,availmem=65117716kb,totmem=66053816kb,idletime=72048,nusers=0,nsessions=0,uname=Linux node01 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64,opsys=linux
  mom_service_port = 15002
  mom_manager_port = 15003
  gpus = 0
n1
  state = free
  np = 16
. . .
FAQ: Scripts with arguments
 qapp script:
» Be careful about escaping special characters in the redirect section (\$, \', \")
#!/bin/bash
# Usage: qapp arg1 arg2
debug=0
opt1="${1}"
opt2="${2}"
if [[ "${opt2}" == "" ]] ; then
  echo "Not enough arguments"
  exit 1
fi
cat > app.sh << EOF
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd \$PBS_O_WORKDIR
app $opt1 $opt2
EOF
if [[ "${debug}" -lt 1 ]] ; then
  qsub app.sh
fi
if [[ "${debug}" -eq 0 ]] ; then
  /bin/rm -f app.sh
fi
FAQ: Data on local scratch
 Using local scratch:
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd $PBS_O_WORKDIR
tmpdir="/scratch/$USER/$PBS_JOBID"
/bin/mkdir -p $tmpdir
rsync -a ./ $tmpdir
cd $tmpdir
$pathto/app $1 $2
cd $PBS_O_WORKDIR
rsync -a $tmpdir/ .
/bin/rm -fr $tmpdir
FAQ: Data on local scratch with MPICH
 Using local scratch for MPICH parallel jobs:
#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
cd $PBS_O_WORKDIR
tmpdir="/scratch/$USER/$PBS_JOBID"
/usr/bin/pbsdsh -u "/bin/mkdir -p $tmpdir"
/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR ; rsync -a ./ $tmpdir"
cd $tmpdir
mpirun -machine vapi $pathto/app $1 $2
cd $PBS_O_WORKDIR
/usr/bin/pbsdsh -u "rsync -a $tmpdir/ $PBS_O_WORKDIR"
/usr/bin/pbsdsh -u "/bin/rm -fr $tmpdir"
FAQ: Data on local scratch with OpenMPI
 Using local scratch for OpenMPI parallel jobs:
» Do a 'module load openmpi/gnu' prior to running qsub
#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
#PBS -V
cd $PBS_O_WORKDIR
tmpdir="/scratch/$USER/$PBS_JOBID"
/usr/bin/pbsdsh -u "/bin/mkdir -p $tmpdir"
/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR ; rsync -a ./ $tmpdir"
cd $tmpdir
/usr/openmpi/gnu/bin/mpirun -np `cat $PBS_NODEFILE | wc -l` -mca btl openib,sm,self $pathto/app $1 $2
cd $PBS_O_WORKDIR
/usr/bin/pbsdsh -u "rsync -a $tmpdir/ $PBS_O_WORKDIR"
/usr/bin/pbsdsh -u "/bin/rm -fr $tmpdir"
Other considerations
 A queue script need not be a single command
» Multiple steps can be performed from a single script
• Guaranteed resources
• Jobs should typically be a minimum of 2 minutes
» Pre-processing and post-processing can be done from the
same script using the local scratch space
» If configured, it is possible to submit additional jobs from a
running queued job
 To remove multiple jobs from the queue:
» qstat | grep " [RQ] " | awk '{print $1}' | xargs qdel
Other Resources
Other ClusterWare Resources
 PDF Manuals (/usr/share/doc/PDF)
» Administrator’s Guide
» Programmer’s Guide
» User’s Guide
» Reference Guides
Online Documentation
 man pages exist for most commands
» man command
 HTML documentation is available in a web browser
(/usr/share/doc/HTML)
» Need to start httpd service to access remotely
 Penguin Computing Masterlink
» http://www.penguincomputing.com/ScyldSupport
• Login ID can be generated by Technical Support
 Moab/TORQUE
» Available at http://www.adaptivecomputing.com at the “Support” link
Support Contacts
 Penguin Computing Technical Support
» Can help with ClusterWare configuration and basic system
questions
» Provided as part of the software support contract
• 1-888-PENGUIN
• [email protected]
 Penguin Computing Professional Services
» Higher level application specific support and optimization
based on a pre-scoped Statement of Work
» Other custom consulting
Additional Topics
Moab Scheduler
Integration with MOAB
 When the MOAB scheduler is installed, the pbs_sched service has to be disabled – typically through changes to the /etc/init.d/torque script – and the moab service enabled.
 Scyld releases rpms of MOAB that directly integrate with Scyld/ClusterWare. These releases are not the latest versions, but do make the necessary changes to the startup scripts and configure a basic /opt/moab/etc/moab.cfg configuration file.
 One can also download MOAB from adaptivecomputing.com, configure it to work with torque, and install normally. With this option, the /etc/init.d/moab script will have to be created and enabled, and pbs_sched disabled by hand.
MOAB Initial Setup
 Edit configuration in /opt/moab/etc/moab.cfg
» SCHEDCFG[Scyld]   MODE=NORMAL SERVER=scyld.localdomain:42559
• Ensure the hostname is consistent with `hostname`
» ADMINCFG[1]       USERS=root
• Add additional users who can be queue managers
» RMCFG[base]       TYPE=PBS
or, for multiple remote servers running pbs_server:
» RMCFG[b0]         HOST=hn0.localdomain SUBMITCMD=qsub TYPE=PBS VERSION=2.5.9
» RMCFG[b1]         HOST=hn1.localdomain SUBMITCMD=qsub TYPE=PBS VERSION=2.5.9
Interacting with Moab
 Because the Moab scheduler uses TORQUE pbs_server and
pbs_mom components, all TORQUE commands are still valid
» qsub will submit a job to TORQUE, Moab then polls pbs_server to
detect new jobs
» msub will submit a job to Moab which then pushes the job to
pbs_server
 Other Moab commands
» qstat -> showq
» qdel, qhold, qrls -> mjobctl
» pbsnodes -> showstate
» qmgr -> mschedctl, mdiag
» Configuration in /opt/moab/moab.cfg
Tuning
 Default walltime can be set in Torque using:
» qmgr -c 'set queue batch resources_default.walltime=16:00:00'
 If many small jobs need to be submitted, uncomment
the following in /opt/moab/moab.cfg
» JOBAGGREGATIONTIME 10
 To exactly match node and processor requests, add
the following to /opt/moab/moab.cfg
» JOBNODEMATCHPOLICY EXACTNODE
 Changes in /opt/moab/moab.cfg can be activated by
doing a ‘service moab restart’
Parallel Jobs
Explicitly Parallel Programs
 Different paradigms exist for parallelizing programs
» Shared memory
» OpenMP
» Sockets
» PVM
» Linda
» MPI
 Most distributed parallel programs are now written using MPI
» Government standard
» Different options for MPI stacks: MPICH, OpenMPI, HP, Intel
» ClusterWare comes integrated with customized versions of MPICH
and OpenMPI
MPI Implementation Comparison
 MPICH is provided by Argonne National Labs
» Runs only over Ethernet
 Ohio State University has ported MPICH to use the Verbs API =>
MVAPICH
» Similar to MPICH but uses Infiniband
 LAM-MPI was another implementation which provided a more
modular format
 OpenMPI is the successor to LAM-MPI and has many options
» Can use different physical interfaces and spawning mechanisms
» http://www.openmpi.org
 HP-MPI, Intel-MPI
» Licensed MPICH2 code and added functionality
» Can use a variety of physical interconnects
MPI Implementation Comparison part 2
 MPICH2 is provided by Argonne National Labs
» MPI-1 and MPI-2 compliant
» the hydra process manager is preferred – no longer need to deploy a daemon on a compute node
» has run on the Raspberry Pi
 MVAPICH2 is provided by Ohio State University
» MPI-1 and MPI-2 compliant
» Based on MPICH2
» has InfiniBand support
Compiling MPICH programs
 mpicc, mpiCC, mpif77, mpif90 are used to
automatically compile code and link in the correct MPI
libraries from /usr/lib64/MPICH
» Environment variables can be used to set the compiler:
• CC, CPP, FC, F90
» Command line options to set the compiler:
• -cc=, -cxx=, -fc=, -f90=
» GNU, PGI, and Intel compilers are supported
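An illustrative compile, assuming the Intel compilers are installed (source file names are arbitrary):

CC=icc mpicc -O2 mycode.c -o mycode          # select the C compiler via the CC environment variable
mpif90 -f90=ifort mysolver.f90 -o mysolver   # or select the Fortran compiler via the command-line option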
Running MPICH programs
 mpirun is used to launch MPICH programs
 Dynamic allocation can be done when using the -np flag
 Mapping is also supported when using the -map flags
 If Infiniband is installed, the interconnect fabric can be
chosen using the machine flag:
» -machine p4
» -machine vapi
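For instance (hypothetical executable name; the flags are the ones listed above):

mpirun -np 16 -machine vapi ./mycode    # 16 processes over the InfiniBand (vapi) fabric
mpirun -np 16 -machine p4 ./mycode      # the same job over Ethernet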
Environment Variable Options
 Additional environment variable control:
» NP — The number of processes requested, but not the number of
processors. As in the example earlier in this section, NP=4 ./a.out will run the
MPI program a.out with 4 processes.
» ALL_CPUS — Set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs.
» ALL_NODES—Set the number of processes to the number of nodes
available to the current user. Similar to the ALL_CPUS variable, but you get
a maximum of one CPU per node. This is useful for running a job per node
instead of per CPU.
» ALL_LOCAL — Run every process on the master node; used for debugging
purposes.
» NO_LOCAL — Don’t run any processes on the master node.
» EXCLUDE — A colon-delimited list of nodes to be avoided during node
assignment.
» BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node listed will be the first process (MPI Rank 0) and so on.
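Putting a few of these together (hypothetical program name; this assumes the on/off variables are enabled by setting them to 1, in the style of the NP example above):

NP=4 NO_LOCAL=1 ./a.out             # 4 processes, none on the master node
NP=8 EXCLUDE=2:3 ./a.out            # 8 processes, avoiding nodes 2 and 3
BEOWULF_JOB_MAP=0:0:1:1 ./a.out     # explicit placement: ranks 0-1 on node 0, ranks 2-3 on node 1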
Compiling and Running OpenMPI programs
 The env-modules package allows users to change their environment variables according to predefined files
» module avail
» module load openmpi/gnu
» GNU, PGI, and Intel compilers are supported
 mpicc, mpiCC, mpif77, mpif90 are used to
automatically compile code and link in the correct MPI
libraries from /usr/lib64/OMPI
 mpirun is used to run code
 Interconnect can be selected at runtime
» -mca btl openib,tcp,sm,self
» -mca btl udapl,tcp,sm,self
Compiling and Running OpenMPI programs
What env-modules does:
 Set user environment prior to compiling
» export PATH=/usr/openmpi/gnu/bin:${PATH}
 mpicc, mpiCC, mpif77, mpif90 are used to automatically compile code and
link in the correct MPI libraries from /usr/lib64/OMPI
» Environment variables can be used to set the compiler:
• OMPI_CC, OMPI_CXX, OMPI_F77, OMPI_FC
 Prior to running PATH and LD_LIBRARY_PATH should be set
» module load openmpi/gnu
» /usr/openmpi/gnu/bin/mpirun -np 16 a.out
OR:
» export PATH=/usr/openmpi/gnu/bin:${PATH}
export OPAL_PKGDATADIR=/usr/openmpi/gnu/share
export MANPATH=/usr/openmpi/man
export LD_LIBRARY_PATH=/usr/lib64/OMPI/gnu:${LD_LIBRARY_PATH}
» /usr/openmpi/gnu/bin/mpirun -np 16 a.out