Arno Ziebart - Tools for High Performance Computing 2015

Arno Ziebart
Business Development Manager
Germany
5th Parallel Tools Workshop Sept. 2011
www.clustervision.com
About us
• Specialists in Compute, Storage & GPU Clusters (Tailor-Made, Turn-Key)
• Unique position in Europe (EMEA)
• Offices in Amsterdam, Gloucester, Munich, Paris, Milan, Geneva, Madrid, Oslo
• 50 staff, most technical, all specialized in clusters
• Hardware independent
• Background in Science, Research, Engineering
• At forefront of clustering technology
• Financially strong, profitable, growing
• Over 300 customers
• Spin-off in USA – Bright Computing
•Customers — Academia
Customers — TOP500
TOP500 (2008/2009/2010)
• University Frankfurt
• RU Groningen
• Saudi Aramco (Saudi Arabia)
• University of Cambridge (UK)
• University of Bristol (UK)
• University College London (UK)
• CASPUR (Italy)
• University of Gent (Belgium)
ClusterVision Customers
22nd on TOP500 list (Nov. 2010 + June 2011)
20 784 CPU cores (2.1GHz)
772 ATI Radeon HD 5870 GPUs
Fastest x86-based System in Germany
Fastest system in the world based on AMD/ATI GPUs
60.7% efficiency
Products & Services
• Turnkey clusters
– Compute clusters
– Storage clusters
– GPU clusters
• Cluster software
– Bright Cluster Manager
– MS Windows HPC Server 2008
• HPC Services
– Cluster design and benchmarking
– Cluster installation and deployment
– Cluster support and service
– Cluster cooling
• Parallel file systems
– Lustre
– Fraunhofer Global Filesystem (FhGFS)
– IBM GPFS (Official world-wide OEM)
– NAS
Cluster Architecture
[Diagram: HeadNode01 and HeadNode02; LoginNode01 to LoginNode04; ProvisioningNode01 and ProvisioningNode02; MonitoringNode01 and MonitoringNode02; FabricNode01; Storage001 to Storage016; compute nodes node001 to node512. Further labels: PDUs, Switch, Racks and SNMP, with multipliers x 24, x 16 and x 8.]
Management environment
based on Linux, which is
integrated
Goals
1. Make clusters really easy to manage and use
2. Scale clusters to thousands of nodes
3. Be complete
4. Let users focus on performing computations
Cluster Management
• Most solutions use the “toolkit” approach
• Tools typically used: Ganglia, Cacti, Nagios, Cfengine, xCAT, etc.
• Issues with the “toolkit” approach:
– Tools rarely designed to work together
– Tools rarely designed for HPC
– Tools rarely designed to scale
– Each tool has its own command line interface and GUI
– Each tool has its own daemon and database
– Roadmap dependent on developers of the tools
• Making a collection of unrelated tools work together
– Requires a lot of expertise and scripting
– Rarely leads to a really easy-to-use and scalable solution
– Often leads to long installation and ramp-up time
– Low throughput
Capacity Approach
[Chart: utilisation against time in months. Annotations: extended time to fully operational system; extended learning period for users; invisible cost of delay to productive use and utilisation; it takes a long time to “sweat the assets”.]
Capability Approach
[Chart: utilisation against time in months. Annotations: faster time to full system readiness; faster time to full user productivity; better throughput and utilisation; strong, policy-driven allocation of resources; “sweat the assets” much earlier.]
Cluster Manager
• Intel Cluster Ready certified
• Integration and Support
• Years of HPC expertise
• User Modules Environment
• Cluster Administration
• Monitoring
• Node boot and provisioning system
• Linux distribution
• Workload manager
• HPC Middleware
[Diagram: Cluster Manager software stack. Labels: CentOS, Suse, Redhat; Provisioning; Monitoring; Cluster Administration; Workload Management; Account.; Parallel Filesystem; MPI Libraries; Application Libraries; HPC User Environment; application labels Molecular, Biophysic, QCD, Physics, CFD, Chemical, Scientific, Manufacturing.]
Architecture — CMDaemon
[Diagram: CMDaemon architecture. Labels: Cluster, CMDaemon, Head Node, node001 to node003; clients Admin GUI, Admin CLI, User application and Monitoring script; connections marked "procedure call", "SOAP+SSL" and "event".]
Management Interface
Graphical User Interface (GUI)
• Offers administrator full cluster control
• Standalone desktop application
• Manages multiple clusters simultaneously
• Runs on Linux, Windows, MacOS X
• Built on top of Mozilla XUL engine
[Screenshot: Admin GUI]
Command Line Interface (CLI)
• All GUI functionality also available through Command Line Interface (CLI)
• Interactive and scriptable in batch mode
[Screenshot: Admin CLI]
Bright Health Check
• Goal: provide a problem-free environment for running jobs
• Hardware & software health
• Three types of health check
– Health checks before jobs are run
• Halt workload manager a few (milli)seconds before job is executed
• Check health of each reserved node
• If unhealthy, take offline, inform system administrator
• Hand job back to workload manager
– Frequently scheduled health checks
• Run health check when node is not used
• Run health check through queuing system
– Hardware burn-in environment
• Most thorough health check
• Requires reboot
• All types are extensible
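The pre-job health check described above can be sketched roughly as follows. This is only an illustration, not Bright Cluster Manager's implementation: `Node`, `prejob_health_check`, `run_or_requeue` and the `notify` callback are hypothetical names, and the health probe is passed in as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Node:
    """Hypothetical record for one reserved compute node."""
    name: str
    online: bool = True


def prejob_health_check(
    reserved: Iterable[Node],
    is_healthy: Callable[[Node], bool],
    notify: Callable[[str], None],
) -> List[Node]:
    """Check every reserved node just before a job starts.

    Unhealthy nodes are taken offline and reported to the administrator;
    the healthy nodes are returned to the caller.
    """
    healthy = []
    for node in reserved:
        if is_healthy(node):
            healthy.append(node)
        else:
            node.online = False  # take the node offline
            notify(f"{node.name} failed pre-job health check")
    return healthy


def run_or_requeue(
    reserved: List[Node],
    is_healthy: Callable[[Node], bool],
    notify: Callable[[str], None],
) -> str:
    """Return 'run' if all reserved nodes pass, else 'requeue'.

    'requeue' stands for handing the job back to the workload manager.
    """
    healthy = prejob_health_check(reserved, is_healthy, notify)
    return "run" if len(healthy) == len(reserved) else "requeue"
```

A caller would plug in its own probe (for example a script that checks disks, memory or interconnect) and its own notification channel; the sketch only fixes the order of the steps.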
Architecture — Monitoring
[Diagram: monitoring architecture. Labels: Cluster, CMDaemon, Admin GUI, Head Node, node001 to node003; flows marked "metrics", "monitoring data" and "events"; data stores marked "Raw data" and "Consolidated data".]
Bright GPU Metrics
Node Provisioning
• Image based
• Nodes always boot over the network
• Slave nodes PXE boot into Node Installer, which
– Identifies node (switch port or MAC based)
– Configures BMC
– Partitions disks (if any) and creates file systems
– Installs or updates software image
– Pivots the root from NFS to the local file system
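The Node Installer steps from the Node Provisioning slide can be sketched as a small lookup plus an ordered plan. This is a minimal illustration under stated assumptions: the lookup tables, `identify_node` and `provisioning_plan` are hypothetical, and trying the switch port before the MAC address is an assumed order that the slide does not specify.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lookup tables; a real installer would ask the head node.
NODES_BY_SWITCH_PORT: Dict[Tuple[str, int], str] = {("switch01", 1): "node001"}
NODES_BY_MAC: Dict[str, str] = {"00:11:22:33:44:55": "node002"}


def identify_node(
    mac: str, switch_port: Optional[Tuple[str, int]] = None
) -> Optional[str]:
    """Identify a booting node by switch port if known, otherwise by MAC."""
    if switch_port is not None and switch_port in NODES_BY_SWITCH_PORT:
        return NODES_BY_SWITCH_PORT[switch_port]
    return NODES_BY_MAC.get(mac.lower())


def provisioning_plan(has_disk: bool) -> List[str]:
    """Return the installer steps in the order given on the slide.

    Only the partitioning step is conditional ("if any" disks).
    """
    steps = ["identify node", "configure BMC"]
    if has_disk:
        steps.append("partition disks and create file systems")
    steps.append("install or update software image")
    steps.append("pivot root from NFS to local file system")
    return steps
```

Identification by switch port lets a replacement node with a new MAC address pick up the identity of the node it replaces, which is presumably why both methods are offered.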
Redundancy
In an HA setup, two master nodes monitor each other:
• One master node active, one passive
• If active goes down, passive takes over all resources (services, storage, IP addresses)
• Goal is not to interrupt compute jobs
In alternative cluster management software, setting up HA requires a large amount of manual work.
Bright Cluster Manager allows a robust failover set-up to be created with minimal effort.
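The active/passive takeover described on the Redundancy slide can be sketched as one monitoring round. This is a toy model, not the product's failover logic: `MasterNode` and `failover_if_needed` are hypothetical names, and liveness is a simple flag rather than a real heartbeat.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MasterNode:
    """Hypothetical state of one master (head) node in an HA pair."""
    name: str
    active: bool = False
    alive: bool = True
    resources: List[str] = field(default_factory=list)


def failover_if_needed(active: MasterNode, passive: MasterNode) -> MasterNode:
    """Return the master that should be active after one monitoring round."""
    if active.alive:
        return active  # nothing to do
    if not passive.alive:
        raise RuntimeError("both master nodes are down")
    # Passive takes over all resources (services, storage, IP addresses).
    passive.resources.extend(active.resources)
    active.resources.clear()
    active.active = False
    passive.active = True
    return passive
```

Compute jobs are not modelled here; the point of the takeover is that they keep running while the management role moves to the other master.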
Scalability
Cluster management software should not be the limiting factor for cluster size.
Philosophy used for Bright Cluster Manager:
• All tasks performed by master node should be off-loadable to dedicated nodes.
• If master node cannot handle a task as a result of cluster size, task can be placed on 1 or more dedicated nodes.
• For example: multiple dedicated load-balanced provisioning nodes may be assigned in a cluster.
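As an illustration of the provisioning example on the Scalability slide, the sketch below spreads compute nodes over several provisioning nodes. Round-robin is an assumed policy chosen for simplicity; the slide only says "load-balanced", and `assign_provisioners` is a hypothetical helper.

```python
from itertools import cycle
from typing import Dict, List


def assign_provisioners(nodes: List[str], provisioners: List[str]) -> Dict[str, str]:
    """Assign each compute node to a provisioning node in round-robin order."""
    if not provisioners:
        raise ValueError("at least one provisioning node is required")
    rotation = cycle(provisioners)
    return {node: next(rotation) for node in nodes}
```

With two provisioning nodes, each serves about half of the compute nodes, so adding provisioning nodes reduces the load on each as the cluster grows.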
Advanced Features
• Daemon with low resource consumption (multithreaded)
• Synchronised daemon to prevent OS jitter
• Multiple, load-balanced provisioning nodes
• Node discovery using Ethernet switch port detection
• Live & incremental image updates
• Automated BIOS updates and configurations
• Infiniband only storage & diskless client support
• Node and service checks (pre/post to scheduler)
Roadmap Features
• More power saving features
• Scheduler job integration
• Virtualisation, Cloud computing
Cluster Architecture
[Diagram: same cluster layout as before, with an added "Cluster Management" label. HeadNode01 and HeadNode02; LoginNode01 to LoginNode04; ProvisioningNode01 and ProvisioningNode02; MonitoringNode01 and MonitoringNode02; FabricNode01; Storage001 to Storage016; compute nodes node001 to node512. Further labels: PDUs, Switch, Racks and SNMP, with multipliers x 24, x 16 and x 8.]
Conclusions
• Proven track-record in cluster computing
• Best cluster software stack on the market
– Easy to manage and use
– Scalable for very large clusters
– Comprehensive HPC user environment
– Complete & consistently integrated
• 100% committed to cluster computing
Thank you