Munich University of Applied Sciences
Department of Computer Science and Mathematics
Computer Science in Commerce

Diploma Thesis

Designing and Deploying High Availability Cluster Solutions in UNIX Environments

Stefan Peinkofer
2006-01-12

Supervisor: Prof. Dr. Christian Vogt

Stefan Peinkofer (born 12 June 1982), matriculation number 01333101, study group 8W (winter semester 2005/2006)

Declaration pursuant to § 31(7) RaPO: I hereby declare that I have written this diploma thesis independently, that I have not submitted it elsewhere for examination purposes, that I have used no sources or aids other than those stated, and that I have marked verbatim and paraphrased quotations as such.

Oberhaching, 12 January 2006
Stefan Peinkofer

Contents

1 Preface
  1.1 Overview
  1.2 Background
  1.3 The Zentrum für angewandte Kommunikationstechnologien
  1.4 Problem Description
    1.4.1 Central File Services
    1.4.2 Radius Authentication
    1.4.3 Telephone Directory
    1.4.4 Identity Management System
  1.5 Objective of the Diploma Thesis
  1.6 Typographic Conventions

2 High Availability Theory
  2.1 Availability and High Availability
  2.2 Faults, Errors and Failures
    2.2.1 Types of Faults
    2.2.2 Planned Downtime
    2.2.3 Dealing with Faults
  2.3 Avoiding Single Points of Failure
  2.4 High Availability Cluster vs. Fault Tolerant Systems

3 High Availability Cluster Theory
  3.1 Clusters
  3.2 Node Level Fail Over
    3.2.1 Heartbeats
    3.2.2 Resources
    3.2.3 Resource Agents
    3.2.4 Resource Relocation
    3.2.5 Data Relocation
    3.2.6 IP Address Relocation
    3.2.7 Fencing
    3.2.8 Putting it all Together
  3.3 Resource Level Fail Over
  3.4 Problems to Address
    3.4.1 Split Brain
    3.4.2 Fencing Loops
    3.4.3 Amnesia
    3.4.4 Data Corruption
  3.5 Data Sharing
    3.5.1 Cluster File System vs. SAN File System
    3.5.2 Types of Shared File Systems
    3.5.3 Lock Management
    3.5.4 Cache consistency

4 Designing for High Availability
  4.1 System Management and Organizational Issues
    4.1.1 Requirements
    4.1.2 Personnel
    4.1.3 Security
    4.1.4 Maintenance and Modifications
    4.1.5 Testing
    4.1.6 Backup
    4.1.7 Disaster Recovery
    4.1.8 Active/Passive vs. Active/Active Configuration
  4.2 Hardware
    4.2.1 Network
    4.2.2 Shared Storage
    4.2.3 Server
    4.2.4 Cables
    4.2.5 Environment
  4.3 Software
    4.3.1 Operating System
    4.3.2 Cluster Software
    4.3.3 Applications
    4.3.4 Cluster Agents

5 IT Infrastructure of the Munich University of Applied Sciences
  5.1 Electricity Supply
  5.2 Air Conditioning
  5.3 Public Network
  5.4 Shared Storage Device
  5.5 Storage Area Network

6 Implementing a High Availability Cluster System Using Sun Cluster
  6.1 Initial Situation
  6.2 Requirements
  6.3 General Information on Sun Cluster
  6.4 Initial Cluster Design and Configuration
    6.4.1 Hardware Layout
    6.4.2 Operating System
    6.4.3 Shared Disks
    6.4.4 Cluster Software
    6.4.5 Applications
  6.5 Development of a Cluster Agent for Freeradius
    6.5.1 Sun Cluster Resource Agent Callback Model
    6.5.2 Sun Cluster Resource Monitoring
    6.5.3 Sun Cluster Resource Agent Properties
    6.5.4 The Sun Cluster Process Management Facility
    6.5.5 Creating the Cluster Agent Framework
    6.5.6 Modifying the Cluster Agent Framework
    6.5.7 Radius Health Checking
  6.6 Using SUN QFS as Highly Available SAN File System
    6.6.1 Challenge 1: SCSI Reservations
    6.6.2 Challenge 2: Meta Data Communications
    6.6.3 Challenge 3: QFS Cluster Agent
    6.6.4 Cluster Redesign

7 Implementing a High Availability Cluster System Using Heartbeat
  7.1 Initial Situation
  7.2 Customer Requirements
  7.3 General Information on Heartbeat Version 2
    7.3.1 Heartbeat 1.x vs. Heartbeat 2.x
  7.4 Cluster Design and Configuration
    7.4.1 Hardware Layout
    7.4.2 Operating System
    7.4.3 Shared Disks
    7.4.4 Cluster Software
    7.4.5 Applications
    7.4.6 Configuring the STONITH Devices
    7.4.7 Creating the Heartbeat Resource Configuration
  7.5 Development of a Cluster Agent for PostgreSQL
    7.5.1 Heartbeat Resource Agent Callback Model
    7.5.2 Heartbeat Resource Monitoring
    7.5.3 Heartbeat Resource Agent Properties
    7.5.4 Creating the PostgreSQL Resource Agent
  7.6 Evaluation of Heartbeat 2.0.x
    7.6.1 Test Procedure Used
    7.6.2 Problems Encountered During Testing

8 Comparing Sun Cluster with Heartbeat
  8.1 Comparing the Heartbeat and Sun Cluster Software
    8.1.1 Cluster Software Features
    8.1.2 Documentation
    8.1.3 Usability
    8.1.4 Cluster Monitoring
    8.1.5 Support
    8.1.6 Costs
    8.1.7 Cluster Software Bug Fixes and Updates
  8.2 Comparing the Heartbeat and Sun Cluster Solutions
    8.2.1 Documentation
    8.2.2 Commercial Support
    8.2.3 Software and Firmware Bug Fixes
    8.2.4 Costs
    8.2.5 Additional Availability Features
    8.2.6 "Time to Market"
  8.3 Conclusion

9 Future Prospects of High Availability Solutions
  9.1 High Availability Cluster Software
  9.2 Operating System
  9.3 Hardware

A High Availability Cluster Product Overview

List of Figures

3.1 Shared Storage
3.2 Remote mirroring
3.3 Sample fail over 1
3.4 Sample fail over 2
3.5 Sample fail over 3
3.6 Split Brain 1
3.7 Split Brain 2
3.8 Split Brain 3
4.1 Active/Active Configuration
4.2 Active/Passive Configuration
4.3 Inter-Switch Link Failure Without Spanning Tree
4.4 Inter-Switch Links With Spanning Tree
4.5 Inter-Switch Link Failure With Spanning Tree
4.6 Redundant RAID Controller Configuration
4.7 Redundant Storage Enclosure Solution
4.8 Drawing a Resource Dependency Graph Step 1
4.9 Drawing a Resource Dependency Graph Step 2
4.10 Drawing a Resource Dependency Graph Step 3
5.1 Electricity Supply of the Server Room
5.2 3510 Configuration
5.3 Fibre Channel Fabric Zone Configuration
6.1 PCI Card Installation Fire V440
6.2 PCI Card Installation Enterprise 450
6.3 Cluster Connection Scheme
6.4 Shared Disks Without I/O Multipathing
6.5 Shared Disks With I/O Multipathing
6.6 Resources and Resource Dependencies on the Sun Cluster
6.7 Cluster Interconnect and Meta Data Network Connection Scheme
6.8 Adopted Cluster Connection Scheme
7.1 PCI Card Installation RX 300
7.2 Cluster Connection Scheme
7.3 Important World Wide Names (WWNs) of a 3510 Fibre Channel Array
7.4 New Fibre Channel Zone Configuration
7.5 3510 Fibre Channel Array Connection Scheme
7.6 3510 Fibre Channel Array Failure
7.7 Resources and Resource Dependencies on the Heartbeat Cluster
7.8 Valid STONITH Resource Location Configuration
7.9 Invalid STONITH Resource Location Configuration
9.1 High Availability Cluster and Server Virtualization
9.2 Virtual Host Fail Over

List of Tables

2.1 Classes of Availability
6.1 Boot Disk Partition Layout
6.2 Boot Disk Volumes V440
6.3 Boot Disk Volumes Enterprise 450
7.1 Heartbeat Test Procedure
A.1 High Availability Cluster Products
Chapter 1  Preface

1.1 Overview

The diploma thesis is divided into nine main sections and an appendix.

• Section 1 contains the conceptual formulation and the goal of the diploma thesis as well as the structure of the document.
• Section 2 discusses the basic theory of high availability systems in general.
• Section 3 contains the underlying theory of high availability cluster systems.
• Section 4 discusses design issues for high availability systems in general and for high availability cluster systems in particular.
• Section 5 briefly introduces the infrastructure in which the concrete cluster implementations were deployed.
• Section 6 discusses the sample implementation of a high availability cluster solution which is based on Sun's cluster product Sun Cluster.
• Section 7 discusses the sample implementation of a high availability cluster solution which is based on the Open Source cluster product Heartbeat.
• Section 8 contains a comparison of the two cluster products Sun Cluster and Heartbeat.
• Section 9 gives a brief overview of the future trends of high availability systems in general and high availability cluster systems in particular.
• The appendix contains references to various high availability cluster systems.

1.2 Background

In recent years, computers have dramatically changed the way we live and work. Almost everything in our "brave new world" depends on computers. Communication, business processes, purchasing and entertainment are just a few examples. Unfortunately, computer systems are not perfect.
Sooner or later every system will fail. When your personal computer ends up with a blue screen while you are breaking the high score of your fancy new game, it is just annoying to you. But when a system supporting a business process of a company breaks, many people get annoyed and the company loses money, either because the employees cannot get their work done without the system or because the customers cannot submit orders and therefore will change to a competitor.

The obvious solution to minimize system downtime is to deploy a spare system which can do the work when the primary system fails to do it. If the spare system is able to detect that the primary system has failed and is able to take over the work automatically, the combination of primary system and spare system is called a high availability cluster.

1.3 The Zentrum für angewandte Kommunikationstechnologien

The Zentrum für angewandte Kommunikationstechnologien (ZaK) is the computer center of the Munich University of Applied Sciences. The field of activity of the department is divided into two main areas:

• University Computing - This area includes but is not limited to the following tasks:
  – Operation of the fibre optics network between the headquarters and the branch offices of the university.
  – Operation of a central Identity Management System, which holds the data of all students, professors and employees.
  – Operation of the central IT systems for E-mail, HTTP, DNS, backup and remote disk space, for example.
  – IT support for faculties and other departments of the university.

• Student Administration Computing - This area includes the following tasks:
  – Development and maintenance of a student administration application, which is also used by approximately twelve other German universities.
  – Development and maintenance of online services for students, like exam registration, mark announcement and course registration.

1.4 Problem Description

Since the usage of the university computing infrastructure has dramatically increased over the last few years, assuring the availability of the central server and network systems has become a big issue for the ZaK. Currently most of the server systems deployed at the ZaK are not highly available. To decrease the downtime in case of a hardware failure, the ZaK keeps a spare server for every deployed server type. In case a server fails, the administrator takes the disks out of the failed server and puts them into the spare server. This concept dates from a time when the university IT systems were not extremely important for most people. But since today nearly everyone in the university, be they students, employees or university leaders, uses the IT infrastructure on a regular basis, it no longer satisfies today's availability demands.

Four of the most critical applications the ZaK provides to its customers, besides E-mail and the Internet presence, are:

• Central file services for Unix and Windows, providing the user home directories.
• Radius authentication for dial-in and WLAN access to the Munich Science Network.
• The backend database for the internal telephone directory.
• The backend database for the Identity Management System.

The following sections show why the availability of these systems is so important.

1.4.1 Central File Services

If the central file server fails, the users' home directories become inaccessible.
Since the mail server needs to access the user's home directory to process incoming mail, messages are rejected with a "No such user" error. Also, registration of new users through the Identity Management System will partly fail, because it will not be able to create the user's home directory.

1.4.2 Radius Authentication

If the Radius server is unavailable, users are not able to access the Munich Science Network via dial-in or WLAN. Additionally, some Web sites that are protected by an authentication mechanism using Radius are inaccessible.

1.4.3 Telephone Directory

If the backend database of the telephone directory fails, employees are unable to perform internal directory searches. This is particularly critical because the telephone directory is frequently used by the university leaders.

1.4.4 Identity Management System

If the backend database of the Identity Management System is unavailable, users are not able to:

• enable their accounts for using the computers of the ZaK and some faculties
• change or reset their passwords
• use laboratories which are protected by the card reader access control system of the Identity Management System
• access the Web applications for exam registration, mark announcement and course registration

1.5 Objective of the Diploma Thesis

The main objective of this diploma thesis is to provide the ZaK with two reference implementations of high availability clustered systems:

• A file server cluster running NFS, Samba, the SAN file system SUN SAM/QFS and Radius.
• A database cluster running PostgreSQL.

The file server cluster will be based on Sun Solaris 10 using the Sun Cluster 3.1 high availability software. The database cluster will be based on Red Hat Enterprise Linux 4.0 and the Open Source cluster software Heartbeat 2.0. This thesis should provide the Unix administrators of the ZaK with the knowledge and basic experience needed to make other services highly available and to decide which of the two cluster systems is appropriate for a specific service. However, this thesis should not be understood as a replacement for the actual hardware and software documentation.

1.6 Typographic Conventions

The following list describes the typographic conventions used in this thesis.

• AaBbCc123 - The names of commands, configuration variables, files, directories and hostnames.
• AaBbCc123 - New terms and terms to be emphasized.

In addition, the construct <description> is sometimes used. It has to be understood as a placeholder for the value described in the angle brackets.

Chapter 2  High Availability Theory

2.1 Availability and High Availability

A system is considered available if it is able to do the work for which it was designated. Availability is the probability that the system is available over a specific period of time. It is measured by the ratio of system uptime to total time ([HELD1] page 2):

Availability = Uptime / (Uptime + Downtime)

In more theoretical discussions, the term uptime is often replaced by the term Mean Time Between Failure (MTBF) and the term downtime is replaced by the term Mean Time To Repair (MTTR).
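To illustrate the formula, a minimal sketch follows; the MTBF and MTTR figures are assumed purely for illustration and are not taken from any measurement in this thesis.

    # Availability from MTBF/MTTR, and the downtime a given
    # availability level allows per year.
    MINUTES_PER_YEAR = 365 * 24 * 60

    def availability(mtbf_hours, mttr_hours):
        # Availability = Uptime / (Uptime + Downtime)
        return mtbf_hours / (mtbf_hours + mttr_hours)

    def downtime_per_year(avail):
        return (1.0 - avail) * MINUTES_PER_YEAR

    # A server failing about once a month (MTBF ~ 720 h) with 4 h repair time:
    a = availability(720, 4)
    print(f"{a * 100:.2f} % available, {downtime_per_year(a):.0f} minutes down per year")

    # "Four nines" corresponds to roughly 52 minutes of downtime per year:
    print(f"{downtime_per_year(0.9999):.1f} minutes per year at 99.99 %")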
If we ask people what high availability is, they will probably show us something like table 2.1, which tells us the maximum amount of time a system is allowed to be unavailable per year. The answer to our question would then be, "If it has a certain number of nines, it is highly available". At first glance, this seems reasonable, because availability is measured by system downtime ([PFISTER] pages 383-385). But when system vendors say, "Our high availability system is available 99.99 percent of the time", by "available" they normally mean that it is showing an operating system prompt. So of what avail is it if our high availability system shows us the operating system prompt 99.99 percent of the time but our application is only running 99 percent of that time?

  Availability class   Name                    Availability (%)   Downtime per year
  2                    Stable                  99                 3.7 days
  3                    Available               99.9               8.8 hours
  4                    Highly Available        99.99              52.2 minutes
  5                    Very Highly Available   99.999             5.3 minutes
  6                    Fault Tolerant          99.9999            32 seconds
  7                    Fault Resistant         99.99999           3 seconds

  Table 2.1: Classes of Availability ([HELD2] page 13)

Another definition of high availability, which I like best because it is unambiguous, is from Gregory F. Pfister. A system which is termed highly available must comply with the following requirements:

• No single point of failure exists in the system. This means the system is able to provide its services even in the case of a single component failure.
• The likelihood that a failed component can be repaired or replaced before another component fails is sufficiently high. ([PFISTER] page 393)

2.2 Faults, Errors and Failures

Faults are the ultimate cause which forces us to think about how we can improve the availability of our critical IT systems. This section gives a brief overview of the different types of faults and how we can deal with them. But first let us define what the terms fault, error and failure mean.

• Fault / Defect - Anything that has the potential to prevent a functional unit from operating in the way it was meant to. Faults in software are often referred to as bugs.
• Error - An error is a discrepancy between the observed and the expected behavior of a functional unit. Errors are caused by faults that occur.
• Failure - A failure is a situation in which a system is not able to provide its services in the expected manner. Failures result from uncorrected errors. ([ELLING] page 5, [BENEDI])

2.2.1 Types of Faults

We can distinguish between three types of faults:

• Persistent Faults - Faults that appear and, without human intervention, do not disappear. Hardware and software can contain this type of fault in equal measure. Persistent faults in hardware could be caused by a broken wire or microchip, for example. In software, these faults can be caused by a design error in an application module or an inadequate specification of the module. Persistent faults are easy to analyze. In case of a persistent hardware fault, normally a maintenance light will flash on the affected unit. If this is not the case, we can still find the defective parts by swapping the units one after another. To analyze persistent software faults, we can normally find a sequence of actions which will result in the occurrence of the specific fault. That makes it easy to locate and fix the problem. Even if the software cannot be fixed immediately, it is very likely that we will find a procedure to work around the bug ([SOLTAU] page 14).

• Transient Faults - Faults that appear and after a while disappear. This type of fault appears in the hardware of the system because of outside influences like electrical interference, electrical stress peaks and so on.
Software, on the other hand, cannot contain transient faults. Although faults in software may appear to be transient, they are persistent faults activated by a procedure which is too complex to reproduce ([SOLTAU] page 14, [MENEZES] page 1).

• Intermittent Faults - Faults that are similar to transient faults but reappear after some time. Like transient faults, this type is a hardware-only fault. It can be caused by overheating under high load or loose contacts, for example ([ANON1]).

2.2.2 Planned Downtime

When people think of downtime, they first associate it with a failure. That is what we refer to as unplanned downtime. But there is another type of downtime, namely the result of an intended system shutdown. This type of downtime is termed planned downtime. Planned downtime is mostly required to perform maintenance tasks like adding or replacing hardware, applying patches or installing software updates. If these maintenance tasks cannot be performed at a time when the system does not need to be available (outside business hours, for example), planned downtime can be considered a failure. Companies purchase IT systems to make money. From the company's point of view it makes no difference whether the system is unavailable because of unplanned or planned downtime. It is not making money, so it is broken ([SOLTAU] pages 14-15).

There is another point which makes planned downtime an issue we should think about: the ratio between planned and unplanned downtime is approximately two-thirds to one-third, although it depends highly on whom we ask ([MARCUS] page 12). What makes planned downtime less severe than unplanned downtime is that we can schedule it during hours when it will result in the lowest revenue losses ([ANON2], [SOLTAU] page 15) and we can notify users in advance, so they can plan to do something else while the system maintenance is performed ([PFISTER]).

2.2.3 Dealing with Faults

High availability systems are typically based on normal server hardware. Therefore the components in these systems will fail at the same rates as they would in normal systems. The difference between a normal system and a high availability system is the way in which they respond to faults. This fault response process can be divided into six elementary steps.

2.2.3.1 Fault Detection

To detect faults, high availability systems use so-called agents or probes. Agents are programs which monitor the health of a specific hardware or software component and provide this health information to a high availability management process. Monitoring is done by querying status information from a component, by actively checking a component, or by both ([ELLING] pages 7-9, [ANON3]). For example, an agent which monitors the operation of the cooling fans can simply query the fan status from the hardware. To monitor a database application, an agent program could query the status of the application (if the application supports this) and/or perform a sequence of database transactions and see whether they complete successfully.
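A minimal sketch of such a probe follows; the host, port and the "PING"/"PONG" exchange are invented for the example and do not correspond to any product described later in this thesis. It combines a passive reachability check with an active end-to-end check.

    # Minimal health probe: passive status query plus active check.
    import socket

    def port_open(host, port, timeout=2.0):
        """Passive check: is the service at least accepting connections?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def active_check(host, port, timeout=2.0):
        """Active check: send a request and verify the expected reply.
        The PING/PONG exchange is a made-up protocol for illustration."""
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.sendall(b"PING\n")
                return s.recv(16).startswith(b"PONG")
        except OSError:
            return False

    def probe(host="127.0.0.1", port=5432):
        if not port_open(host, port):
            return "FAULT: component not reachable"
        if not active_check(host, port):
            return "FAULT: component reachable but not working correctly"
        return "OK"

    if __name__ == "__main__":
        print(probe())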
2.2.3.2 Fault Isolation

Fault isolation or fault diagnosis is the process of determining which component, and which fault in that component, caused the error. Since every agent is normally responsible for a single component, the system can identify the erroneous component by determining firstly the agent which reported the fault and secondly the component the agent is responsible for. After the erroneous component is isolated, the system must determine which fault caused the component to fail. However, in some error scenarios it is almost impossible to identify the fault accurately. In this case the fault isolation process has to find out which faults could have caused the error ([ANON4], [ELLING] page 10, [RAHNAMAI] page 9).

For example, if the network fails because the network cable is unplugged, it is easy to identify the fault, because the link status of the network interface card will switch to off. Since the error signature "link status is off" is unique to the fault "no physical network connection available", it is the only possible fault that could cause the network failure. But if the network fails because the connected network cable is too long, it is impossible to identify the fault unambiguously. This is because the error signature for this fault is "unexpectedly high bit error rate", which is also the error signature of other faults like electromagnetic interference ([ELLING] page 8).

2.2.3.3 Fault Reporting

The fault reporting process informs components and the system administrator about a detected fault. This can be done in various ways: writing log files, sending E-mails, issuing an SNMP (Simple Network Management Protocol) trap, feeding a fault database and many more. Independent of the way in which fault reporting is done, the usability of fault reports depends primarily on two factors:

• Accuracy of fault isolation - The more accurately the system can determine which component caused an error and what the reason for the error is, the better and clearer the fault information that can be provided to the administrators.

• Good prioritization of faults - Different faults have different importance to the administrator. Faults which can affect the availability of the system are of course more important than faults which cannot. Additionally, faults of the latter type occur much more often than the former ones. Reporting both types with the same priority to the system administrator makes it harder to respond to the faults in an appropriate manner, first because the administrator may not be able to determine how critical the reported fault is and second because the administrator may lose sight of the critical faults among the huge number of noncritical ones. ([ELLING] pages 10-12) A small sketch of such a prioritization follows this list.
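The severity model and the example faults below are invented purely to illustrate the prioritization point; they describe no particular fault reporting product.

    # Prioritized fault reporting: availability-affecting faults are
    # listed before the noncritical noise.
    from dataclasses import dataclass

    @dataclass
    class FaultReport:
        component: str
        description: str
        affects_availability: bool

    def sort_reports(reports):
        # Availability-affecting faults first, the rest afterwards.
        return sorted(reports, key=lambda r: not r.affects_availability)

    reports = [
        FaultReport("fan tray 2", "fan speed slightly below nominal", False),
        FaultReport("shared disk c1t3d0", "disk not responding", True),
        FaultReport("power supply 1", "temperature sensor reads high", False),
    ]

    for r in sort_reports(reports):
        level = "CRITICAL" if r.affects_availability else "notice"
        print(f"{level:8} {r.component}: {r.description}")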
2.2.3.4 Fault Correction

A fault correction process can only be performed by components which are able to detect and correct errors internally, transparently to the other components. The most famous example of this type of component is Error Correcting Code (ECC) memory. On each memory access it checks the requested data for accuracy and automatically corrects invalid bits before it passes the data to the requestor ([ELLING] pages 6 and 11).

2.2.3.5 Fault Containment

Fault containment is the process of trying to prevent the effects of a fault from spreading beyond a defined boundary. This should prevent a fault from setting off other faults in other components. If two components A and B share one common component C, like a SCSI bus or a shared disk, a fault in component A could propagate over the shared component C to component B. To prevent this, the fault must be contained in component A. On high availability cluster systems, for example, the typical boundary for faults is the server. This means that containing a fault is done by keeping the faulty server away from the shared components ([ELLING] pages 12-13).

2.2.3.6 System Reconfiguration

The system reconfiguration step recovers the system from a non-correctable fault. The way in which the system is reconfigured depends on the fault. For example, if a network interface card of a server fails, the server will use an alternate network interface card. If a server in a high availability cluster system fails completely, the system will use another server to provide the services of the failed one ([ELLING] pages 13-14).

2.3 Avoiding Single Points of Failure

A Single Point Of Failure (SPOF) is anything that will cause unavailability of the system if it fails. Obvious SPOFs are the hardware components of the high availability system like cables, controller cards, disks, power supplies, and so on. But there are also other types of SPOFs, such as applications, network and storage components, external services like DNS, server rooms, buildings and many more. To prevent all these components from becoming SPOFs, the common strategy is to keep them redundant, so that in case the primary component breaks, the secondary component takes over.

Although it is easy to remove a SPOF, it may be very complex firstly to figure out the SPOFs and secondly to determine whether it is cost effective to remove them. To find the SPOFs we must look at the whole system workflow, from the data backend over the HA system itself to the clients. This requires a massive engineering effort by many different IT subdivisions. After all the SPOFs are identified, we must do a risk analysis for every component which constitutes a SPOF, to find out how expensive a failure of the component would be. The definition of risk is:

Risk = Occurrence Probability × Amount of Loss

The occurrence probability has to be estimated. To give a good estimate, we could use the mean time between failure (MTBF) information of components, use insurance field studies or consult an expert. To calculate the amount of loss, we must know how long it takes to recover from a failure of the specific component and how much money we lose because of the system unavailability. After we have calculated the risk, we can compare it to the cost of removing the SPOF to see whether we should live with the risk or eliminate the SPOF ([MARCUS] pages 27-28 and 32-33).
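As a worked example of the risk formula, the following sketch compares the yearly risk posed by a non-redundant power supply with the cost of removing that SPOF. All numbers are hypothetical and chosen only to show the calculation.

    # Hypothetical SPOF risk analysis: Risk = occurrence probability * amount of loss.
    failures_per_year = 0.2        # estimated from MTBF data: about one failure every 5 years
    hours_to_recover = 8.0         # time to obtain and fit a replacement part
    loss_per_hour = 500.0          # revenue lost per hour of unavailability (EUR)

    amount_of_loss = hours_to_recover * loss_per_hour
    risk_per_year = failures_per_year * amount_of_loss

    cost_of_redundancy = 400.0     # price of a second, redundant power supply (EUR)

    print(f"Expected loss per year: {risk_per_year:.0f} EUR")
    if risk_per_year > cost_of_redundancy:
        print("Removing the SPOF pays off within a year.")
    else:
        print("Living with the risk may be acceptable.")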
2.4 High Availability Cluster vs. Fault Tolerant Systems

Many people use the terms high availability cluster and fault tolerant system interchangeably, but there are big differences between them. Fault tolerant systems use specialized, proprietary hardware and software to guarantee that an application is available without any interruption. This is achieved not only by duplicating each hardware component but also by replicating every software process of the application. So in the worst case scenario, in which the memory of a server fails, the replicated version of the application continues to run. In contrast, a high availability cluster does not replicate the application processes. If the memory of a server in a high availability cluster fails, the application gets restarted on another server. For a database system running on a high availability cluster, this means for instance that the users get disconnected from the database and all their not yet committed transactions are lost. As soon as they reconnect, they can normally resume their work. The users of a fault tolerant database system would, in this scenario, not even notice that something had gone wrong with their database system ([BENDER] page 3).

However, high availability clusters have some advantages compared to fault tolerant systems. They are composed of commodity hardware and software, so they are less expensive and can be deployed in a wider range of scenarios. While maintenance tasks like adding hardware or applying patches are performed, application availability is not impacted, because most of these tasks can be done one server after another. Additionally, high availability clusters are able to recover from some types of software faults which are single points of failure in fault tolerant systems ([ELLING] page 53).

Chapter 3  High Availability Cluster Theory

What has been discussed in the last chapter applies to high availability systems in general. This is why the term high availability cluster has been avoided as far as possible. Although people often use high availability system and high availability cluster synonymously, a system which is highly available does not necessarily have to be a cluster. As the definition in chapter 2.1 said, a high availability system must not contain a single point of failure. This characteristic applies to some non-clustered systems as well. Especially high-end, enterprise-scale server systems like the SUN Fire Enterprise or HP Superdome servers are designed without a single point of failure, and because almost every component is hot-pluggable, a failed component can be replaced without service disruption.

In the following chapter we will discuss the basic theory of high availability clusters. We will look at the building blocks of a high availability cluster, how they work together, what particular problems arise and how these problems are solved.

3.1 Clusters

A cluster, in the context of computer science, is an accumulation of interconnected standalone computers working together on a common problem. To the users, the cluster thereby acts like one large consistent computer system. Usually there are two reasons to build a cluster: either to deliver the computing power or the reliability (or both) that a single computer system cannot achieve without being much more expensive than a cluster ([WIKI1]).

The individual computers forming a cluster are usually referred to as cluster nodes. The boot-up of the first cluster node initializes the cluster; this is referred to as the incarnation of a cluster. A cluster node which is up and running and delivers its computing resources to the cluster is referred to as a cluster member. Accordingly, the event when a node starts to deliver its computing resources to an already incarnated cluster is referred to as joining the cluster.
A high availability cluster, in the context of this thesis, is a cluster which makes an application instance highly available by running the application instance on one cluster node and starting it on another node in case either the application instance itself or the cluster node it ran on has failed. This means that on a high availability cluster, no more than one specific instance of an application is allowed to run at a time. An application instance is thereby defined as the collection of processes of the application that belong together, the corresponding IP address on which the processes are listening, and the files and directories in which the configuration and application state information of the application instance is stored. Application state information files are, for instance, .pid files or log files. So on a high availability cluster it is only possible to run a specific application more than once at a time if each set of related processes listens on a dedicated IP address and uses a dedicated set of configuration and application state information files.

3.2 Node Level Fail Over

A cluster typically consists of two or more nodes. To achieve high availability, the basic concept of a high availability cluster is known as fail over: when a cluster member fails, the other cluster members will do the work of the failed member. This concept sounds rather simple, but there are a few issues we have to look at:

1. How can the other members know that another member failed?
2. How can the other members know which work the failed member did and which things they need in order to do the work?
3. Which cluster member(s) should do the work of the failed node?
4. How can the other members access the data the failed node used for its work?
5. How do the clients of the failed node know which member they have to contact if a fail over occurred?

3.2.1 Heartbeats

Cluster members continuously send so-called heartbeat messages to the other cluster members. To transport these heartbeat messages, several communication paths like normal Ethernet connections, proprietary cluster interconnects, serial connections or I/O interconnects can be used. These heartbeat messages indicate to the receiver that the cluster member which sent them is operational. Every cluster member expects to receive another heartbeat message from every other cluster member within a specific time interval. When an expected heartbeat message fails to appear within the specified time, the node whose heartbeat message is missing is considered dead.

Of course, real-life heartbeat processing is not that easy. The problem is that sending and receiving heartbeat messages is a hard real-time task, because a node has to send its next heartbeat message before exceeding a deadline which is given by the other nodes. Unfortunately, almost none of the common operating systems which are used for high availability clustering are capable of handling hard real-time tasks. The only things that can be done to alleviate the problem are giving the heartbeat process the highest scheduling priority, preventing parts of the heartbeat process from getting paged out and, of course, preventing the complete heartbeat process from getting swapped out onto disk. However, this does not solve the problem completely.
A node may have managed to send the heartbeat message within the deadline, yet one or some of the other nodes did not receive the message in time. Reasons could be that network traffic is high or that some nodes are experiencing a high workload and hence the message receive path from the network card to the heartbeat process takes too long. To alleviate the problem further, we can use dedicated communication paths for the heartbeat messages, though this does not solve the problem completely either. The last things we can do are to set the deadline to a reasonably high value, so that the probability of a missed deadline is low enough, or to consider a node dead only after a specific number of heartbeats have been missed ([PFISTER] pages 419-421). However, the problem itself cannot be eliminated completely and, therefore, the cluster system must be able to respond appropriately when it occurs. How cluster systems do this in particular is discussed in chapter 3.4.1. From now on, when we denote a node as failed, we mean that the other cluster members no longer receive heartbeat messages from the node, regardless of the cause.
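The deadline-and-miss-count idea can be sketched as follows; the interval, the miss threshold and the peer name are arbitrary choices made only for this illustration and do not describe any specific cluster product.

    # Receiver-side heartbeat bookkeeping: a peer is declared dead only
    # after several consecutive deadlines have been missed.
    import time

    INTERVAL = 1.0        # seconds between heartbeats a peer promises to send
    MAX_MISSES = 3        # tolerated consecutive missed deadlines

    class PeerState:
        def __init__(self, name):
            self.name = name
            self.last_seen = time.monotonic()
            self.misses = 0

        def heartbeat_received(self):
            self.last_seen = time.monotonic()
            self.misses = 0

        def check_deadline(self):
            """Called once per interval; returns True while the peer is considered alive."""
            if time.monotonic() - self.last_seen > INTERVAL:
                self.misses += 1
            return self.misses < MAX_MISSES

    peer = PeerState("node2")
    # In a real cluster, heartbeat_received() would be driven by messages
    # arriving over one or more dedicated communication paths.
    peer.heartbeat_received()
    print("node2 alive:", peer.check_deadline())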
3.2.2 Resources

Everything in a cluster which can be failed over from one cluster member to another is called a resource. The most prominent examples of resources are application instances, IP addresses and disks. Resources can depend on one another. For example, an application may depend on an IP address to listen on and a disk which contains the data of the application. A cluster system must be aware of the resources and their dependencies. An application which runs on a cluster node but of which the cluster system is not aware is not highly available, because the application will not be started elsewhere if the node it is currently running on dies. On the other hand, an application resource which depends on a disk resource is not highly available either if the cluster system is not aware of the dependency. In case the cluster member currently hosting the resources dies, the application resource may get started on one member and the disk may be mounted on another member. Even if they get started on the same node, they may get started in the wrong order. And even if they get started in the right order, the cluster system would start the application even if mounting the shared disk had failed.

In addition, resources may depend not only on resources which have to be started on the same node. They may also depend on a resource which simply has to be online, independent of the cluster member it runs on ([PFISTER] pages 398-400). For example, an application server like Apache Tomcat may depend on a MySQL database, but for Tomcat it is not important that the MySQL database runs on the same node. Another challenge is that resources may not be able to run on all cluster nodes, for example because an application is not installed on all nodes or some nodes cannot access the needed application data.

To keep track of all cluster resources, their dependencies and their potential host nodes, cluster systems use a cluster-wide resource repository ([PFISTER] page 398). Since the cluster system itself usually cannot figure out what resources and what dependencies exist on the cluster (that would be the optimal solution, but it is very hard to implement), it typically provides a set of tools which allow the administrator to add, remove and modify the resource information.

To define which resources must run on the same node, most cluster systems use so-called resource groups. On these cluster systems, a resource group is the entity which will be failed over to another node. Between the resources within a resource group, further dependencies have to be specified to indicate the order in which the resources have to be started ([ELLING] pages 102-104). To designate a resource as depending on another resource running elsewhere in the cluster, the resources must be put into two different resource groups and, depending on the cluster system used, either a dependency between the two resources or a dependency between the two resource groups has to be specified. For clarity, the second method is preferable, because in this case resource dependencies exist only within a resource group. However, not all cluster systems stick to this.

3.2.3 Resource Agents

A cluster system contains many different types of resources, and almost every resource type requires a custom start-up procedure. As we already know, the cluster system knows which resources exist and how they depend on one another. But now there is another question to answer: how does the cluster system know what exactly it has to do to start a particular type of resource? The answer is, it does not know and it does not have to know. The cluster system leaves this task to an external program or set of programs called resource agents. Resource agents are the key to one of the main features of high availability clusters: almost any application can be made highly available. All that is needed is a resource agent for the application. All the cluster system knows about the start-up of a resource is which resource agent it has to call. Typically, resources not only get started but also get stopped and monitored. So the basic functions a resource agent must provide are start, stop and monitor functions. The cluster system tells the agent what it should do, and the agent performs whatever is needed to carry out the task and reports back to the cluster system whether it was successful or not.
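The start/stop/monitor contract can be sketched roughly as follows. This is a generic illustration: it mirrors neither the Sun Cluster nor the Heartbeat agent interface exactly, and the daemon name "exampled", the init script path and the exit-code convention (0 for success, non-zero for failure) are assumptions.

    # Skeleton of a resource agent: the cluster software calls it with
    # "start", "stop" or "monitor" and only evaluates the exit status.
    import subprocess
    import sys

    SERVICE = "exampled"      # hypothetical daemon managed by this agent

    def start():
        return subprocess.call(["/etc/init.d/" + SERVICE, "start"])

    def stop():
        return subprocess.call(["/etc/init.d/" + SERVICE, "stop"])

    def monitor():
        # A real agent would also perform an application-level health check,
        # as the Radius and PostgreSQL agents in chapters 6.5 and 7.5 do.
        rc = subprocess.call(["pgrep", "-x", SERVICE], stdout=subprocess.DEVNULL)
        return 0 if rc == 0 else 1

    if __name__ == "__main__":
        actions = {"start": start, "stop": stop, "monitor": monitor}
        action = sys.argv[1] if len(sys.argv) > 1 else "monitor"
        sys.exit(actions.get(action, monitor)())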
3.2.4 Resource Relocation

When a cluster member fails, the resources of the failed node have to be relocated to the remaining cluster members. In a two-node cluster, the decision of which node will host the resources is straightforward. In a cluster of three or more nodes, things get more difficult. A good solution would be to distribute the resource groups among the remaining nodes in such a manner that every node has roughly the same workload. An even better solution would be to distribute the resource groups in such a manner that the service level agreements of the various applications are violated as little as possible. However, this requires a facility which has a comprehensive understanding of the workload or the service levels of the applications. Some cluster systems which are tightly integrated with the operating system (like the VMScluster) have such facilities and can therefore provide this type of solution. But the majority of high availability cluster systems are not that smart ([PFISTER] pages 416-417). They use various more or less sophisticated strategies, such as the following (a small selection sketch follows the list):

• Call a user-defined program which determines which node is best for a particular resource group ([PFISTER] page 416).
• Let the administrator define constraints on how resource groups should be allocated among the nodes.
• Use a user-defined list of nodes for each resource group which indicates that the resource group should run on the first node in the list or, if this is not possible, on the second node in the list, and so on.
• Distribute the resource groups so that every cluster member runs roughly the same number of resources.
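The third strategy, an ordered preference list per resource group, can be illustrated with a few lines; the node names and the set of online members are invented for the example.

    # Pick a fail-over target from an ordered preference list,
    # skipping nodes that are currently not cluster members.
    def choose_node(preferred_nodes, online_nodes):
        for node in preferred_nodes:
            if node in online_nodes:
                return node
        return None   # no eligible node: the resource group stays offline

    preference = ["node1", "node2", "node3"]   # defined by the administrator
    online = {"node2", "node3"}                # node1 has just failed

    print(choose_node(preference, online))     # -> node2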
3.2.5 Data Relocation

If we want to fail over an application from one node to another, we have to fail over the application data as well. Basically there are two ways to achieve this: either deploy a central disk to which all or some cluster nodes are connected, or replicate the data from the node hosting the application to all or some of the other nodes. Both methods have benefits and drawbacks. In the following section, we will discuss how the two techniques basically work and compare them to each other.

3.2.5.1 Shared Storage

A shared storage configuration requires that every cluster member which should potentially be able to access a particular set of application data is physically connected to one or more central disks which contain the application data. Therefore, as figure 3.1 shows, a special type of I/O interconnect is required which must allow more than one host to be attached to it. In the past, a couple of proprietary I/O interconnects with this feature existed (and probably still exist). Nowadays, mostly two industry-standard I/O interconnects are used:

• Multi-initiator SCSI (Small Computer System Interface) is used in low-end, two-node cluster systems. The SCSI bus allows two hosts to be connected to the ends of the bus and to share the disks which are connected in between.

• Fibre Channel (FC) is used in high-end cluster systems and in clusters with more than two nodes. With Fibre Channel it is possible to connect many disks and hosts together in a storage network. This is often referred to as a Storage Area Network (SAN).

Figure 3.1: Shared Storage

3.2.5.2 Remote Mirroring

A remote mirroring configuration typically uses a network connection to replicate the data. As figure 3.2 shows, every node needs a locally attached disk which holds a copy of the data and a network connection to the other nodes. Depending on the application, the replication process can be done at different intervals and on different levels. For example, the data of a network file server has to be replicated instantaneously on the disk block level, whereas the data of a domain name server may only require replication on the file level, done manually by the administrator every time he has changed something in the DNS files.

Figure 3.2: Remote mirroring

In any case, however, it must be ensured that every replication member holds the same data. This means that a data update must be applied either on all members or on no member at all. To achieve this, a two-phase commit protocol can be used. In phase one, every member tries to apply the update but also remembers the state before the update. If a member successfully applies the update, it sends out an OK message; if it does not, it sends an ERROR message. Phase two begins after all members have sent their message. If all members sent an OK, the state before the update is discarded and the write call on the source host returns successfully. If at least one member sent an ERROR message, the members restore the state before the update and the write call on the source host returns with an error ([SMITH]).
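A compressed sketch of this two-phase commit exchange follows; the in-memory dictionaries merely stand in for the disks of the replication targets, and the whole example is illustrative rather than a description of any real replication product.

    # Two-phase commit over a set of replicas: the update is kept only
    # if every replica can apply it, otherwise all of them roll back.
    class Replica:
        def __init__(self, name, healthy=True):
            self.name, self.healthy = name, healthy
            self.data, self.undo = {}, None

        def prepare(self, key, value):
            """Phase 1: apply tentatively, remember the old state, vote."""
            self.undo = (key, self.data.get(key))
            if not self.healthy:
                return "ERROR"
            self.data[key] = value
            return "OK"

        def commit(self):
            self.undo = None                      # discard the old state

        def rollback(self):
            key, old = self.undo
            if old is None:
                self.data.pop(key, None)
            else:
                self.data[key] = old

    def replicated_write(replicas, key, value):
        votes = [r.prepare(key, value) for r in replicas]   # phase 1
        if all(v == "OK" for v in votes):                   # phase 2
            for r in replicas:
                r.commit()
            return True
        for r in replicas:
            r.rollback()
        return False

    nodes = [Replica("node1"), Replica("node2", healthy=False)]
    print(replicated_write(nodes, "block42", b"new data"))  # -> False, no member changed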
That is tolerable in most cases, but if we want to span a distance of 100 kilometres the fibre channel link delay adds 16.66 percent of the disk’s mean access delay to the overall delay. Especially for applications which perform many small random disk accesses this might become a performance issue. • Disaster tolerance - Since the SCSI bus length can be up to 12 meters, both cluster nodes and the storage must be located in a single site. In case of a disaster like a flood for instance the whole cluster may become unavailable. A remote mirroring configuration can survive such a disaster since the cluster nodes and with it the data can be located in different sites.18 Fibre channel storage configurations are not disaster tolerant per se since we could use only one fibre channel storage device, which can be placed only on one site, of course. To make fibre channel configuration disaster tolerant, we can put one storage device on each site and use software RAID (Redundant Array of Independent Disks) to mirror the data. Since software RAID is not the optimal solution to mirror disks, today’s more advanced fibre channel storage devices provide in-the-box off-site mirroring capabilities. 16 17 [MELLOR] and [MOREAU] Page 19 3 milliseconds for average seek time + 2 milliseconds for average rotational delay + 1 millisecond which compensates the palliation of the hard disk manufacture’s marketing department. 18 [PFISTER] Page 403 c Stefan Peinkofer 26 [email protected] 3.2. NODE LEVEL FAIL OVER • Simultaneous data access - In conjunction with special file systems19 , the data on shared storage solutions can be accessed by multiple nodes at the same time. Remote mirroring solutions don’t provide this capability yet. • Costs - Shared storage configurations using fibre channel are typically the most expensive solutions. We need special fibre channel controller cards, one or two fibre channel storage enclosures and eventually two or more fibre channel hubs or switches. Low budget fibre channel solutions are available with costs of approximately 20,000 EUR and enterprise level fibre channel solutions can cost millions. The costs of multi-initiator SCSI and remote mirroring solutions are roughly the same. For shared SCSI we need common SCSI controller cards and at least two external SCSI drives or an external SCSI attached RAID sub-system. Remote mirroring requires Ethernet adapters, some type of local disk in each replication target host and a license for the remote mirroring software. SCSI and remote mirroring solutions cost about 1,500 to 15,000 EUR. 3.2.6 IP Address Relocation Clients don’t know on which cluster node their application is running. In fact they don’t even know that the application is running on a cluster. So clients cannot use the IP address of a cluster node to contact their application because in case of a fail over the application would listen on a different IP address. To solve this problem, every application is assigned a dedicated IP address, which will be failed over together with the application. Now, regardless of which node the application is running on, the clients can always contact the application through the same IP address. To make IP Address Fail Over reasonably fast, we have to address an issue with the data link layer of LANs. The link layer doesn’t use IP addresses to identify the devices on the network; it uses Media Access Control (MAC) addresses. 
For this reason, a host which wants to send something over the network to another host must first determine the MAC address of the network interface through which the IP address of the remote host is reachable. In Ethernet networks, the Address Resolution Protocol (ARP) is responsible for this task. ARP basically broadcasts a question on the network, asking whether anybody knows the MAC address corresponding to an IP address, and awaits a response. To keep the ARP traffic low and to speed up the address resolution process, operating systems usually cache already resolved IP - MAC address mappings for some time. This means that a client would not be able to contact a failed over IP address until the corresponding ARP cache entry on the client has expired. The solution is that a cluster member which takes over an IP address sends out a gratuitous ARP message. This is a special ARP packet which is broadcast to the network devices, announcing that the IP address is now reachable over the MAC address of the new node. Thus the ARP caches of the clients will be updated and a new TCP/IP connection can be established.20

19 Which are discussed in chapter 3.5 on page 41. 20 [KOPPER] Page 122

3.2.7 Fencing

As we already know, missing heartbeat messages from a node need not necessarily mean that the node is really dead and therefore is not hosting resources or issuing I/O operations anymore. Taking over the resources in this state is potentially dangerous because it could end up with more than one instance of the resources running. This situation can lead to application unavailability, for example because of a duplicate IP address error. On the storage level, it can even lead to data corruption and data loss. So before a cluster member takes over the resources of a failed node, it has to make sure that the failed node is really dead, or at least that the failed node no longer accesses the shared disks and no longer hosts resources. The operation which achieves this is called fencing. In the following section, some of the common fencing methods are discussed in more detail.

3.2.7.1 STOMITH

STOMITH is an acronym for Shoot The Other Machine In The Head, which means that the failed node is rebooted or shut down21 by another cluster member. Since the cluster member which wants to take over the resources cannot ask the failed node to reboot/shut down itself, some type of external device is needed which can reliably trigger the reboot/shutdown of the failed node. The most commonly used STOMITH devices are software controllable power switches and uninterruptible power supplies, since the most reliable method to reboot/shut down a node is to power cycle it or to just power it off. Of course this method is not the optimal solution, and therefore STOMITH is only used in environments in which no other method can be used.

21 Based on the cluster developer's religion.

Note: Many people use the acronym STONITH (Shoot The Other Node In The Head) instead of STOMITH.
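In practice, the STOMITH operation often boils down to asking such an external device to power cycle the failed node. The following sketch assumes a hypothetical network power switch with a simple HTTP interface; the address, URL scheme and outlet mapping are invented for the example and will differ for every real device.

    import urllib.request

    # Hypothetical mapping of cluster nodes to outlets of a network power switch.
    OUTLET_OF_NODE = {"worp": 1, "hal": 2, "earth": 3}
    POWER_SWITCH = "http://powerswitch.example.com"    # invented address

    def stomith(node_name):
        """Power cycle a failed node via the (hypothetical) power switch so the
        surviving members can safely take over its resources."""
        outlet = OUTLET_OF_NODE[node_name]
        # The exact URL and protocol depend entirely on the switch vendor.
        request = urllib.request.Request(
            f"{POWER_SWITCH}/outlet/{outlet}?action=cycle", method="POST")
        with urllib.request.urlopen(request, timeout=10) as response:
            if response.status != 200:
                raise RuntimeError(f"fencing of {node_name} failed")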
3.2.7.2 SCSI-2 Reservation

SCSI-2 reservation is a feature of the SCSI-2 command set which allows a node to prevent other nodes from accessing a particular disk. To fence a node off the storage, a cluster member which wants to take over the data of a failed node must first put a SCSI reservation on the disk. When the failed node tries to access the reserved disk, it receives a SCSI reservation conflict error. To prevent the failed node from running any resources, a common method is that a node which gets a SCSI reservation conflict error "commits suicide" by issuing a kernel panic, which implicitly stops all operations on the node. When the failed node becomes a cluster member again, the SCSI reservation is released, so that all nodes can access the disks again. However, SCSI-2 reservations have a drawback: they act in a mutual exclusion manner, which means that only one node is able to reserve and access the disk at a time. So simultaneous data access by more than two nodes is not supported.22

22 [ELLING] Page 110

3.2.7.3 SCSI-3 Persistent Group Reservation

SCSI-3 Persistent Group Reservation is the logical successor of the SCSI-2 reservation and, as the name suggests, it allows a group of nodes to reserve a disk. SCSI-3 group reservations allow up to 64 nodes to register on a disk by putting a unique key on it23. In addition, one node can reserve the disk. The reserving node can choose between different reservation modes. The mode which is typically used in cluster environments is WRITE EXCLUSIVE / REGISTRANTS ONLY, which means that only registered nodes have write access to the disk. Since nodes can register on the disk even if a reservation is already in effect, the disks are usually continuously reserved by one cluster member. To fence a node from the disk, the cluster members remove the registration key of the failed node so it can no longer write to it.24 If the node which should be fenced currently holds the reservation of the disk, the reservation is also removed and another cluster member reserves the disk. To keep a fenced node from re-registering on the disk, the cluster software ensures that the registration task is only performed by a node at boot time, when it joins the cluster.

23 In fact, the key is written by the drive firmware. 24 [ELLING] Page 110

3.2.8 Putting it all Together

Now that we have discussed the building blocks of node level fail over, let us look at an example fail over scenario. As shown in figure 3.3, we have three cluster members, WORP, HAL and EARTH, in our example scenario. Every member is hosting one resource group. The application data is stored on a shared storage pool.

Figure 3.3: Sample fail over 1

Now, as shown in figure 3.4, we consider that EARTH is not sending heartbeat messages anymore.

Figure 3.4: Sample fail over 2

As can be seen in figure 3.5, the surviving nodes prepare to take over the resources by fencing EARTH from the shared storage pool. After that, they negotiate which node will start the resource group. In our example, HAL will start the resource group. To do so, HAL assigns the fail over IP address of the resource group to its network interface, mounts the disks which are required for the application and finally starts the application resource. Now the fail over process is completed.

Figure 3.5: Sample fail over 3
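The individual steps HAL performs in this example can be summarised in a short sketch. All helper functions and the ResourceGroup structure are placeholders for whatever mechanisms a concrete cluster product provides; the names are chosen only for illustration.

    from dataclasses import dataclass, field

    # Placeholder helpers standing in for the real cluster mechanisms.
    def fence_node(node):        print(f"fencing {node} off the shared storage")
    def plumb_ip(address):       print(f"bringing up {address}, sending gratuitous ARP")
    def mount_disk(disk):        print(f"mounting {disk}")
    def start_application(app):  print(f"starting {app}")

    @dataclass
    class ResourceGroup:
        ip_address: str
        disks: list = field(default_factory=list)
        application: str = ""

    def take_over(group, failed_node):
        fence_node(failed_node)               # 1. make sure EARTH is really gone
        plumb_ip(group.ip_address)            # 2. relocate the fail over IP address
        for disk in group.disks:              # 3. mount the application data
            mount_disk(disk)
        start_application(group.application)  # 4. start the application resource

    take_over(ResourceGroup("192.0.2.30", ["/dev/dsk/shared3"], "R3"), "EARTH")

The order of the steps matters: fencing must have succeeded before the IP address and the data are touched, otherwise two instances of R3 could end up running at the same time.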
3.3 Resource Level Fail Over

So far, we have assumed that a fail over occurs only when a cluster node fails. But what if the node itself is healthy and just a hosted resource fails? Since our concern is the availability of the resources25 and not the availability of cluster nodes showing an operating system prompt, we must also deal with resource failures. Declaring the node which hosts the failed resource as failed and initiating a node level fail over would do the job, but it is obviously not the best solution. The node may be hosting many other resources which operate just fine. The best solution would be to fail over just the resource group which contains the failed resource.

25 At least it should be that.

As we have discussed in chapter 3.2.3 on page 21, resource agents can monitor the health of a resource. So to observe the state of the resources, the cluster system will ask the resource agent from time to time to perform the monitor operation. When a resource agent returns a negative result, the cluster system will either immediately initiate a fail over of the resource group, or it will first try to restart the resource locally and only fail over if the resource still fails. To fail over, the cluster system will stop the failed resource and all resources which belong to the same resource group by requesting that the appropriate resource agents perform the stop operation. After all resources are stopped successfully, the cluster system will ask the other nodes in the cluster to take over the resource group. Since the node which originally hosted the failed resource is still a member of the cluster in good standing, the node taking over must not fence it.

It is up to the resource agents to stop the resources reliably, to prevent multiple instances of the same resource from running. The resource agent must make sure that the resource was stopped successfully and return an error if it failed in doing so. How the cluster system reacts to such an error depends on the cluster system and its configuration. Basically, there are two options: either leave the resource alone and call for human intervention, or stop the resource by removing the node from the cluster membership and then performing a node level fail over. Stopping the resource is implicit in this case because the node is fenced off during the node level fail over.

Another problem arises if a resource fails because of a failure which will cause the resource to fail on every node it is taken over to. Typically this is caused by shared data failures or application configuration mistakes. In such a case, the resource group will be failed over from node to node until the resource can be started successfully again. These ping-pong fail overs are usually not harmful, but they are not desirable because they are typically caused by failures which require human intervention. In other words, ping-pong fail overs provide no benefit, so most cluster systems will give up failing over a resource group if it failed to start N times on every cluster member.
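Such a resource level fail over policy can be sketched as a simple supervision loop. The agent object follows the start/stop/monitor operations of the resource agents from chapter 3.2.3, while the cluster object, the failover_count attribute and the retry limits are hypothetical example values rather than the behaviour of any specific product.

    import time

    MAX_LOCAL_RESTARTS = 1   # try one local restart before failing over
    MAX_FAILOVERS = 3        # give up after this many failed starts per node

    def supervise(group, agent, cluster, interval=30):
        """Periodically run the monitor operation and react to resource failures."""
        local_restarts = 0
        while True:
            time.sleep(interval)
            if agent.monitor(group):          # resource is healthy
                local_restarts = 0
                continue
            if local_restarts < MAX_LOCAL_RESTARTS:
                agent.stop(group)             # first try a local restart
                agent.start(group)
                local_restarts += 1
                continue
            if not agent.stop(group):         # resource cannot be stopped reliably
                cluster.escalate_to_node_level_failover()
                return
            if group.failover_count >= MAX_FAILOVERS:
                cluster.call_for_human_intervention()   # avoid ping-pong fail overs
                return
            group.failover_count += 1
            cluster.relocate(group)           # hand the group to another node
            return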
3.4 Problems to Address

In the fail over chapter above we left some problems which might occur on a high availability cluster unaddressed. In this chapter we want to look at these problems and discuss how a cluster system can deal with them.

3.4.1 Split Brain

The split brain syndrome, or cluster partitioning, is a common failure scenario in clusters. It is usually caused by a failure of all available heartbeat paths between one or more cluster nodes. In such a scenario, a working cluster is divided into two or more independent cluster partitions, each assuming it has to take over the resources of the other partition(s). It is very hard to predict what will happen in such a case, since each partition will try to fence the other partitions off. In the best case, a single partition will manage to fence all the other partitions off before they can do it and therefore will survive. In the worst and more likely case, each partition will fence the other partitions off simultaneously and therefore no partition will survive. How this can happen is easily understood in a STOMITH environment, in which the partitions simultaneously trigger the reboot of the other partitions. In a SCSI reservation environment it is not so obvious, but it can occur too. As figure 3.6 shows, in a two-node cluster with two shared disks A and B, node one may reserve A and then B while node two reserves B and then A. As shown in figure 3.7, this procedure leads to a deadlock, and because both nodes will get a SCSI reservation error when reserving the second disk, both nodes will stop working.

Figure 3.6: Split Brain 1
Figure 3.7: Split Brain 2
Figure 3.8: Split Brain 3

So as we have seen, fencing alone cannot solve the problem of split brain scenarios. What we need is some kind of tie breaking algorithm which elects one winner partition that will take over the resources of the other partitions. Since the most preferable winner partition is the one with the most nodes in it, cluster systems use a voting algorithm to determine the winner partition. For this purpose, every cluster node gets one vote. In order to continue with its work, a cluster partition must have a quorum. The minimum number of votes needed to constitute a quorum is more than half of the overall votes. More formally, to gain quorum in a cluster with n possible votes, a partition must hold at least ⌊n ∗ 0.5⌋ + 1 votes. All nodes in a cluster partition without quorum must reliably give up their cluster membership, which means that they must stop their resources and must not carry out any fencing operation, so the winner partition can fence the nodes in the other partitions and take over their resources.

Assigning votes only to cluster nodes is not sufficient in two-node cluster configurations, because in a split brain situation, or if one node dies, no partition can constitute a quorum and therefore every node will give up its cluster membership. To prevent this, we must use an additional vote which will deliver quorum to one of the two partitions. A common approach to deliver this additional vote is the use of a quorum disk. A quorum disk must be shared between the two nodes and delivers one additional vote to the cluster.
Now when the nodes lose all heartbeat paths, they first try to acquire the vote of the quorum disk by an atomic test and set method, like the SCSI-2 or SCSI-3 reservation feature or by using some kind of synchronisation algorithm which eliminates the possibility of both nodes thinking they have acquired the quorum disk. Using this method, only one node will gain quorum and therefore can continue as a viable cluster. Although a quorum disk is not required in a cluster of more than two nodes, its deployment is advisable to prevent unnecessary unavailability if none of the cluster partitions can constitute quorum – for example when a four-node cluster splits into two partitions, each holds two votes, or when two of the four nodes die. The optimal quorum disk configuration in a 2+ node cluster is to share the quorum disk among all nodes and assign it a vote of N −1 where N is the number of cluster nodes. So the minimal votes needed to gain quorum is N . Since the quorum disk has a vote of N − 1 a single node in the cluster can gain quorum. This provides the advantage of system availability, even if all but one node fails. However this has the disadvantage that when a cluster is partitioned, the partition with fewer nodes could gain quorum if it wins the race to the quorum disk. To avoid this, the partition which contains the most nodes must be given a head start. An easy and reliable way to achieve this is that every partition waits S seconds before it tries to acquire the quorum disk, whereas S is the number of nodes which are not in the partition. This approach will reliably deliver quorum to the partition with the most nodes. c Stefan Peinkofer 37 [email protected] CHAPTER 3. HIGH AVAILABILITY CLUSTER THEORY A few cluster systems don’t support the concept of quorum devices. These systems solve the problem of two-node clusters by asserting that even a single-node cluster partition has quorum, and therefore has the permission to fence the other node off. To prevent both nodes from getting fenced at the same time, they use a random time delay before the fencing operation is carried out. However, this approach may cause the nodes to enter a fence loop. Fence loops are discussed in chapter 3.4.2 on page 38. We also have to discuss the relationship between quorum and fencing. At first glance, it seems that through the use of the quorum algorithm, the fencing step becomes dispensable. For the majority of cluster systems this is not true. Most cluster systems are built on top of the operating system as a set of user processes. If a cluster is partitioned, the nodes in one partition don’t know anything about the state of the nodes in the other partition. Maybe the cluster software itself failed and is no longer able to stop the resources, or maybe the operating system is causing errors on the shared storage even though the resources have been stopped. So the nodes in the quorum partition cannot rely on the convention that a node without quorum will stop its resources and the I/O operations on the shared disks. So for these cluster systems, quorum defines who should proceed and successful accomplishment of the fencing operations defines that it is safe to proceed. It is worth mentioning that some cluster systems which are tightly integrated with the operating system, like the VMScluster, don’t need the fencing step. The loss of quorum causes the operating system to suspend all I/O operations and processes. On these cluster systems, having quorum also means it’s safe to proceed. 
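To make the voting rules concrete, the following sketch computes whether a partition may continue, using the ⌊n ∗ 0.5⌋ + 1 threshold from chapter 3.4.1, a quorum disk worth N − 1 votes and the head start of S seconds for the larger partition. It is a simplified model of the rules described above, not the algorithm of any particular cluster system.

    import math

    def quorum_threshold(total_votes):
        # A partition needs more than half of all votes: floor(n * 0.5) + 1.
        return math.floor(total_votes * 0.5) + 1

    def partition_has_quorum(nodes_in_partition, nodes_total, holds_quorum_disk=False):
        quorum_disk_votes = nodes_total - 1              # "optimal" quorum disk weight
        total_votes = nodes_total + quorum_disk_votes
        votes = nodes_in_partition
        if holds_quorum_disk:
            votes += quorum_disk_votes
        return votes >= quorum_threshold(total_votes)

    def head_start_delay(nodes_in_partition, nodes_total):
        # Wait S seconds before racing for the quorum disk, where S is the
        # number of nodes that are not in the local partition.
        return nodes_total - nodes_in_partition

    # A four-node cluster split into two partitions of two nodes each:
    print(partition_has_quorum(2, 4))         # False - two votes are not enough
    print(partition_has_quorum(2, 4, True))   # True  - the quorum disk decides
    print(head_start_delay(3, 4))             # 1 - the larger partition races first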
Of course this requires the quorum algorithm itself and the “loss of quorum code“ to work reliably under all circumstances. 3.4.2 Fencing Loops As already discussed in chapter 3.4.1 on page 37, some cluster systems ignore quorum in two-node cluster configurations. If a node is not shut down or halted, but rebooted as an effect of being fenced, the nodes will enter a fencing loop if the fencing was an effect of a split c Stefan Peinkofer 38 [email protected] 3.4. PROBLEMS TO ADDRESS brain syndrome. In a fencing loop, the fenced node A will reboot and therefore try to join the cluster, once it’s up again. The cluster system on A will notice that it cannot reach the other node B and fence node B to make sure that it is safe to incarnate a new cluster. After node B has rebooted, it cannot reach node A and will fence node A and so on. The nodes will continue with this behavior as long as they are not able to exchange heartbeat messages or until human intervention occurs. If a cluster system ignores quorum, it is not possible to prevent the nodes from entering a fencing loop. This fact has to be kept in mind when designing a cluster which uses such a cluster system. The only thing that can be done to alleviate the problem is to use any available interconnect between the nodes to exchange heartbeat messages, so the likelihood of a split brain scenario is minimized. The reason cluster software developers may choose to fence a node by rebooting and not halting the node is that it is likely that the failure can be removed by rebooting the node. 3.4.3 Amnesia Amnesia is a failure mode in which a cluster is incarnated with outdated configuration information. Amnesia can occur if the administrator does some reconfiguration on the cluster, like adding resources, while one or more nodes are down. If one of the down nodes is started and joins the cluster again it receives the configuration updates from the other nodes. However, if the administrator brings down the cluster after he does the reconfiguration and then starts one or more of the nodes which were down during the reconfiguration, they will form a cluster that is using the outdated configuration.26 Some cluster systems prevent this by leaving the names of the nodes which were members of the last cluster incarnation on a shared storage medium. Before a node incarnates a new cluster it checks to determine whether it was part of the last incarnation and if not, it waits until a member of the last incarnation comes up.27 Some other 26 27 [ELLING] Page 30 [ELLING] Pages 107 - 108 c Stefan Peinkofer 39 [email protected] CHAPTER 3. HIGH AVAILABILITY CLUSTER THEORY cluster systems leave the task of avoiding amnesia to the system administrator. 3.4.4 Data Corruption Even if we could assume that a healthy node doesn’t corrupt the data it is using, we cannot assume the same of a failed node. Maybe it failed while it was writing a file to disk, for example. The cluster software ensures that data is not corrupted by uncoordinated simultaneous data access of more than one node. As we have seen, data corruption is not only caused by this failure scenario. Even the fencing operation could cause data corruption when it fences a node in the middle of an I/O operation. So the node which takes over the data must accept that the data may got corrupted and it needs some strategies to recover from that data corruption. To deal with data corruption, we can basically use two different approaches. 
The first one is to use some kind of analyzing and repair program like the fsck command for UNIX file systems. Those programs will check to determine whether the data got corrupted. If so, they will try to recover it by bringing the data back to a usable state, somehow. However, these tools are usually very time consuming because they have to “guess“ which parts of the data are corrupted and how to recover them. Therefore it would be useful if the corrupted data could tell the taking over node what the problem is and how it can be fixed. The key to this approach are transactions. Among other things, they provide durability and atomicity. This means that a transaction which has completed will survive a system failure and that a transaction which could not be completed can be undone. This is achieved by maintaining a log file on disk which contains all the changes that were made to the data and a property which indicates whether the change was already successfully applied or not. To get a better understanding of transactions, let’s briefly look at the steps of an example transaction:28 1. Update request “Change value A to 5“ comes in 2. Look up the current value of A (e.g. 1) and append a record to the log file containing “Changed value A from 1 to 5“ 28 [PFISTER] Pages 408 - 409 c Stefan Peinkofer 40 [email protected] 3.5. DATA SHARING 3. Make sure that the log record is written to disk 4. Change value A to 5 on disk 5. Note in the log file record that the update was applied 6. Make sure that the log record is written to disk 7. Return success to the requestor When a node takes over the data of another node, it just has to look at the log file and undo the changes which aren’t marked as applied yet. It is worth mentioning that this algorithm is even tolerant against corruption of the log file. For example if step 3 is not carried out completely, and therefore, the log file is corrupted, the corrupted log file record can be ignored because no changes have been made to the data yet. Almost any modern operating system uses transactional based file systems because of the great advantages of transactions, compared to the analyze and repair tools. These file systems are usually termed as journalizing or logging file systems. 3.5 Data Sharing As mentioned in chapter 3.2.5.3 on page 25 the use of shared storage devices provides the opportunity to let multiple nodes access the data on the shared storage at the same time. Of course, what benefit this provides to high availability clusters is a legitimate question, since a resource can only run on one node at a time. In fact, there are not so many scenarios in which this is beneficial. Generally speaking, it’s only valuable if a set of applications which normally has to run on a single node in order to access a common set of data can be distributed among two or more nodes to distribute the workload. For example if we want to build a file server cluster which provides the same data to UNIX and Windows users, we have to use two different file serving applications, namely NFS and c Stefan Peinkofer 41 [email protected] CHAPTER 3. HIGH AVAILABILITY CLUSTER THEORY Samba. Without sharing the data between the cluster nodes, both resources have to be located on the same node. With sharing the data, we can distribute the load among two nodes by running NFS on one and Samba on another node. Unfortunately, standard file systems are not able to be mounted more than once at a time. 
To understand why this restriction is in effect, we must take a look at how these file systems operate. Every file system contains data and meta data. Data is the actual content of the files and meta data contains the management information of the file system, like • which disk blocks belong to which file, • which disk blocks are free, • which directories exist and which files are in them. If a file system is mounted, parts of the file system data and meta data are cached in the main memory of the computer which has mounted the file system. If a cached part of the file system is modified, the changes are applied to the disk and to the cache. If a cached part of the file system is read, the information is retrieved only from main memory, since this is many times faster than retrieving it from disk.29 In addition the operating system on the computer which has mounted the file system assumes that it has exclusive access to the file system. Therefore it does not need to pay attention to file system modifications which are carried out by another computer which has mounted the file system at the same time, since this is forbidden. In order to be able to mount a file system on more than one node simultaneously, four main problems have to be solved. • Meta data cache inconsistency - Changes of the meta data, which are carried out by one node, are not recognized by the other nodes. For example, if node A creates a new file X and allocates disk block 1 for it, it will update the file system’s free block list in its 29 [STALKER] c Stefan Peinkofer 42 [email protected] 3.5. DATA SHARING local memory as well as on the disk, but node B is unaware of this update. Now if node B creates a new file Y, it will allocate disk block 1, too, since the cached free block list on B still indicates that block 1 is not yet allocated.30 • Meta data inconsistency - The file system assumes that it has exclusive access to the meta data and therefore does not need any locking mechanism for that. Meta data changes are not atomic operations but a series of I/O operations. If two nodes perform an update of the same meta data item at the same time, the meta data item on the disk can become corrupted. • Data cache inconsistency - A once written or read block will remain in the file system cache of the node for some time. If a block is written by a node while it is cached by another node, the file system cache of that node becomes inconsistent. For example, node A reads block 1 from disk, which contains the value 1000. Now when node B changes the value of block 1 to 2000, the cache of node A becomes outdated. But since A is not aware of that, it will pass the value 1000 back to the processes which request the value of block 1 until the file system cache entry expires.31 • Data inconsistency - If a process locks a file for exclusive access, this lock is just in effect on the node on which the process runs. Therefore a process on another node could gain a lock for the same file at the same time. This can lead to data inconsistency. For example, let node A lock file X and read a value, say 4000, from it. Now node B locks the file too and reads the same value. Node A adds 1000 to the value and B adds 2000 to the value. After that node A updates the value on the disk and then node B updates the value too. So the new value on disk is 6000 but it’s supposed to be 7000. The special file systems which are able to deal with these problems are usually termed as cluster file systems or SAN file systems. 
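To see why node-local locking is not sufficient, the data inconsistency example above can be replayed step by step in a few lines (a deliberately simplified model with one shared block and one private cache per node):

    # Lost update caused by independent caches and node-local locks.
    block_on_disk = 4000

    cache_a = block_on_disk   # node A locks the file locally and reads the value
    cache_b = block_on_disk   # node B does the same - A's lock is not visible on B

    cache_a += 1000           # node A adds 1000
    cache_b += 2000           # node B adds 2000

    block_on_disk = cache_a   # node A writes back 5000
    block_on_disk = cache_b   # node B overwrites it with 6000

    print(block_on_disk)      # 6000, although 7000 was intended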
In the following sections we will look at the differences between cluster and SAN file systems as well as the different design approaches of these file systems. 30 31 [STALKER] [STALKER] c Stefan Peinkofer 43 [email protected] CHAPTER 3. HIGH AVAILABILITY CLUSTER THEORY 3.5.1 Cluster File System vs. SAN File System Before storage area networks were invented, using a cluster file system was the only possibility to share a file system on the I/O interconnect level. With the emergence of storage area networks as an industry standard shared I/O interconnect, customers wanted to be able to share their file systems not only within a cluster but also among virtually any node which is attached to the SAN. So companies like IBM, SUN and many more began to develop stand alone shared file systems, which are termed SAN file systems. Actually it is very hard to set a clear boundary between cluster and SAN file systems. One approach is that a cluster file system is a shared file system which cannot be deployed without an appropriate cluster system because it makes use of functions the cluster system provides. On the other hand, some shared file systems32 don’t rely on a cluster system but behave exactly like a file system which does. They simply implement the needed cluster concepts themselves. So a better definition may be that a cluster file system uses the concepts of cluster membership and quorum in order to determine which hosts are allowed to access the file system whereas SAN file systems don’t. If we use this definition, we can point out further differences between cluster and SAN file systems: 1. SAN file systems must deploy a central file system coordinator which manages the file system accesses. To perform a file system operation, a node has to get the permission of the file system coordinator first. If the node fails to contact the coordinator it must not write to the file system. In contrast, cluster file systems can, but do not have to, deploy such a coordinator since every node is allowed to access the file system, as long as it is a member of the quorum partition, a fact which hosts on a SAN file system are not aware of. 2. SAN file systems are not highly available by default since the file system coordinator is a single point of failure. However, the coordinator task can usually be manually failed over to an alternate host. Cluster file systems which use a central file system coordinator will 32 Like the Lustre file system. c Stefan Peinkofer 44 [email protected] 3.5. DATA SHARING automatically ensure that the file system coordinator task is always done by a member of the quorum partition. 3. SAN file systems can be deployed in a cluster as a cluster file system33 . But making the file system highly available as a SAN file system, meaning that nodes outside of the cluster can access the file system too, can be difficult if the cluster system uses SCSI reservation for fencing.34 This is because the cluster software will ensure that only cluster members can access the disks, so non-cluster members are fenced by default. 4. Cluster file systems can usually only be shared between nodes which run the same operating system type. SAN file systems can typically be shared between more than one operating system type. 3.5.2 Types of Shared File Systems This chapter discusses the different approaches to how file systems can be shared between hosts. 
The first two methods discussed deal with file systems which really share access to the physical disks, whereas the third one deals with a virtual method of disk sharing, sometimes termed I/O shipping. 3.5.2.1 Asymmetric Shared File Systems On asymmetric shared file systems, every node is allowed to access the file system data, but only one is allowed to access the meta data. This node is called meta data server whereas the other nodes are called meta data clients. To access meta data, all meta data clients must request this from the meta data server. So if a meta data client wishes to create a file, for example, it advises the meta data server to create it and the meta data server returns the disk block address which it has allocated for the file to the meta data client.35 Since all meta data operations are coordinated by a single instance, meta data consistency is assured implicitly. 33 34 The use of quorum and membership is implicit in this case through the cluster software. Therefore some vendors like IBM offer special appliances which provide a highly available file system direc- tor. 35 [KRAMER] c Stefan Peinkofer 45 [email protected] CHAPTER 3. HIGH AVAILABILITY CLUSTER THEORY 3.5.2.2 Symmetric Shared File Systems On symmetric shared file systems, every node is allowed to access not only the file system data but also the meta data directly. In order to prevent meta data inconsistency, it has to be ensured that only one host can modify a specific meta data item at a time, and that no host is able to read a meta data item which is currently being changed by another node. This functionality is provided by a file system wide lock manager. 3.5.2.3 Proxy File Systems On proxied file systems, the disks are not physically shared. Instead, one node mounts the file system physically and shares the file system with the other nodes over a network connection. The node which has mounted the file system physically is called file system proxy server; the other nodes are called file system proxy clients. In principle a proxy file system works like a network file system like NFS or CIFS (Common Internet File System). The difference is that network file systems share the files which are located on some type of file system, whereas proxy file systems directly share the file system on which the files are located. For example, let’s consider that a server exports an UFS file system over NFS and over a cluster proxy file system. The network file system clients mount the exported file system as NFS file system but the cluster nodes mount the exported file system directly as UFS. If an application on a file system proxy client requests a file system operation, the kernel reroutes it over the network connection to the kernel on the file system proxy server, which carries out the actual I/O operation and returns the result to the requesting node. 36 Usually, this type of file system is only deployed in clusters since in non-cluster environments network file systems are widely accepted as a standard for sharing data over the network. Since only one instance controls access to the whole file system, data and meta data consistency are implicit. 36 [ARPACI] Pages 8 - 9 c Stefan Peinkofer 46 [email protected] 3.5. DATA SHARING 3.5.3 Lock Management As we have seen, file locks and eventually even locks on meta data items must be maintained file system wide so that data and meta data inconsistency is avoided. To implement locking in a distributed environment, there are two basic approaches. 
The first is deploying a central lock server and the second is distributing the lock server tasks among the nodes. The basic problems a lock manager has to deal with are deadlock detection, releasing locks of failed nodes and recovering the file system locks if the/a lock manager has failed. The concepts of centralized lock management are similar to the meta data server concept of asymmetric shared file systems. The process of requesting a lock is the same as requesting a meta data operation. Since all lock management operations are done on a single node, deadlock detection is no problem because ordinary algorithms can be used for this. Centralized lock management can be used by a cluster file system but it must be used by a SAN file system, since the central lock manager coordinates the file system accesses. With distributed lock management, every node can be a lock server for a well defined, not overlapping subset of resources. For example node A is responsible for files beginning with A-M and node B is responsible for files beginning with N-Z37 . The main advantage of this method is that the computing overhead for lock management can be distributed among all nodes.38 The main disadvantage is that deadlock detection is much more complex and slower, since a distributed algorithm has to be used. How the lock manager deals with locks of failed clients depends on whether it is used by a cluster file system or not. On a cluster file system, the lock manager knows when a node fails and therefore can safely release all locks of the failed member.39 On a SAN file system the lock server doesn’t know if a client has failed, so another strategy must be used. One possible solu37 Of course this is an abstract example. [KRONEN] Pages 140 - 141 39 [KRONEN] Page 142 38 c Stefan Peinkofer 47 [email protected] CHAPTER 3. HIGH AVAILABILITY CLUSTER THEORY tion is to grant locks only for a specific period of time, called lease time. If a client needs a lock for longer than the lease time, it has to re-request the lock before the lock times out. If the client is not able to contact the lock server to request more time, it must suspend all I/O operations until it can contact the lock server again. Assuming this works reliably, the lock manager can safely release all locks for which the lease time has expired. To recover the lock state in case of a lock manager failure, two different strategies can be used. The first one is to keep a permanent copy of the lock information on a disk so if a lock manager fails, another node can read the last lock state and take over the log manager function. Of course, this method performs not very well, since a hard disk access is required for each lock operation. The other method is that every node keeps a list of the locks it currently owns. To recover the lock state, the new lock manager asks all nodes to tell it what locks are in effect on the system.40 3.5.4 Cache consistency The final thing we have to discuss is how the data and meta data caches can be kept synchronized on all nodes. For this purpose basically three different approaches can be deployed. The first and easiest method is called read-modify-write. The method is so easy because read means reading from disk, so no caching is done at all. Of course a file system which uses this method does not perform very well. But it may be suitable for solving the meta data cache problem in symmetric shared file systems41 . The second concept is active cache invalidation. 
If a node wants to modify an item on the disk, it notifies all other nodes about that. The notified nodes will look in their local cache if it contains the announced item and, if so, they will remove it from the cache or mark the cache 40 41 [PFISTER] Pages 418 - 419 [YOSHITAKE] Page 3 c Stefan Peinkofer 48 [email protected] 3.5. DATA SHARING entry as invalid42 . The last method is passive cache invalidation. It’s based on maintaining a version number for each file system item. If a node modifies the item, the version number gets incremented. If another node wants to read the item, it looks first at the version number of the item on disk and compares it with the version number of the item in the cache. If they match, the node can use the cached version; if not, it has to re-read it from the disk. Of course, having a version number for every disk block, for example, would be too large an overhead. Because of this, version numbers are usually assigned at the level of lockable resources. For example if a lock manager allows file level locks, every file gets a version number. The coupling of passive cache invalidation and locking adds another advantage. Instead of writing and reading the version numbers to/from the disk by each node individually, the numbers can be maintained by the lock manager. So if a node requests a lock, the version number of the locked item is passed to the node together with the lock.43 42 43 [KRONEN] Page 142 [KRONEN] Page 142 c Stefan Peinkofer 49 [email protected] Chapter 4 Designing for High Availability After we have seen how high availability clusters work in general, we have to look at some basic design considerations, which have to be taken in account when planning a high availability cluster solution. The chapter is divided into three main areas of design considerations. The first area deals with general high-level design considerations which are usually implemented together with the IT management. The second area is about planning the hardware layout of the cluster system and the environment in which the high availability system will be deployed. The third area is dedicated to the various software components involved in a cluster system. Since the concrete design of a high availability cluster solution depends mainly on the hardware and software components used, the available environment and the customer requirements, we can only discuss the general design issues here. We will look at two concrete designs in the sample implementations in chapters 6 and 7. This chapter also addresses design issues which are not directly related to cluster systems but deal with high availability in general. It’s worth mentioning that if someone plans to deploy a high availability cluster system in a poorly available environment, it’s often better to use a non-clustered system and spend the saved money on improving the availability of the environment first. The design recommendations and examples in the following chapters should be understood as best case solutions to achieve a maximum of availability. In a real-life scenario, the decision 50 4.1. SYSTEM MANAGEMENT AND ORGANIZATIONAL ISSUES for or against a particular recommendation is the result of a cost/risk analysis. Also it is possible to implement particular recommendations only to certain extents. 4.1 System Management and Organizational Issues High availability cluster systems provide a framework to increase the availability of IT services. 
However, achieving real high availability requires more than just two servers and a cluster software. In order to build, deploy and maintain these systems successfully, the IT management must provide a basic framework which defines clear processes for system management and implements some organizational rules. 4.1.1 Requirements The first task in the design process is to identify and record the requirements of the high availability cluster system. The “requirements document“ contains high level information which will be needed in the subsequent design process. The document is the result of an requirements engineering process which can contain, but is not limited to, the following steps: • Create an abstract description of the project, together with the management. • Identify the services the system should provide and the users of the various services. • Determine the individual availability and performance requirements of the different services. • Identify dependencies between services hosted by the system and services hosted by external systems. • Negotiate service level agreements like service call response time and maximum downtime with the various vendors of the system components. • Work out a timeline for system development.1 1 [ELLING] Page 198 c Stefan Peinkofer 51 [email protected] CHAPTER 4. DESIGNING FOR HIGH AVAILABILITY 4.1.2 Personnel As mentioned before, high availability cannot be achieved by simply buying an expensive high availability system. One of the key factors to high availability is the staff which administers the system. Therefore we have to take some considerations about the personnel into account. The first thing we have to do is to remove personnel single points of failures. For example, when a high availability system is managed by only one system administrator, this person is a SPOF. If he leaves the company or goes on holidays for a week, the other administrators may not be able to operate the system in the appropriate manner. The first step in removing the SPOF is creating a comprehensive documentation of the system design, including network topology, hardware diagrams, deployed applications, inter-system dependencies and so on. In addition to that a troubleshooting guide, which contains advice for various failure scenarios and hints for problem tracking, must be created. The troubleshooting guide should also contain all problems and their solutions which already occurred during the system deployment.2 Having system documentation is mandatory, but it is not sufficient. If a system fails and the primary system administrator is unavailable, the backup administrator usually does not have the time to read through the system documentation. Therefore the backup administrators have to be trained on the system design, the handling of the various hardware and software components and basic system troubleshooting techniques, before the system goes into production use. An additional approach could be that a system is designed and maintained by a team of administrators, in the first place.3 What has to be kept in mind is that documentation and training cannot replace experience. Some managers think that a trained administrator has the same skills as an administrator who has maintained a system for years. 
Since this is not the case, unless the system is very simple, personnel turnover of such highly experienced people should be avoided, if at all possible.4 2 [MARCUS] Pages 289 - 291 [ELLING] Page 199 4 [MARCUS] Pages 291 - 293 3 c Stefan Peinkofer 52 [email protected] 4.1. SYSTEM MANAGEMENT AND ORGANIZATIONAL ISSUES To achieve real high availability, not only systems, but also the administrators, have to be highly available. Since systems not only fail during business hours5 , it must be ensured that someone from the IT staff can be notified about failures, 24 hours a day, 7 days a week and 52 weeks a year. The primary solution for this are pagers or mobile phones. However, this doesn’t guarantee that the person is reachable all the time, so this is another single point of failure which must be removed. To solve this problem, we must define an escalation process. This process defines which person should be notified first and which person should be notified next, in case the first person does not respond within a specific time. Of course, the notification is useless if the administrators cannot access the system during non-business hours. They need at least physical access to the system around the clock. A better solution is to provide the administrators additionally with remote access to the systems. This can significantly speed up the failure response process, because the time for getting dressed and driving to the office building can be saved. However, since some tasks can only be performed with physical access to the system, remote access can only be an add-on for physical access.6 4.1.3 Security Security leaks and weakness can doubtless lead to unavailability if someone in bad faith exploits them to access the systems. But even someone in good faith could cause system downtime because of a security weakness. Therefore the systems must be protected from unauthorized access from both outside and inside the company. Some common methods for this are firewalls, to protect the systems against attackers from the Internet, intrusion detection systems, to alert of attacks and passwords that are hard to guess and are really kept secret. Additionally, as few people as possible should be authorized to have administrative access to the system. For example, developers should usually have their own development environment, but under some circumstances, developers may also need access to the productive systems. Giving them administrative access to the production system when unprivileged access to the system would suffice 5 6 In fact it seems that they fail more often during non-business hours. [MARCUS] Pages 294 - 295 c Stefan Peinkofer 53 [email protected] CHAPTER 4. DESIGNING FOR HIGH AVAILABILITY must not be allowed since privileged users have far more possibilities to make a system unavailable by accident than unprivileged users. If the specific task of the developer requires special privileges, he must also not be given full administrative access but his unprivileged account has to be assigned only the privileges which are necessary to let him carry out the specific task.7 Another aspect of security is that physical access to the system must be limited to authorized personel.8 This is needed to protect against the famous cleaning lady pulling the power plug of the server to plug in the vacuum cleaner and on the other hand from wilful sabotage by an angry employee for example. 
4.1.4 Maintenance and Modifications Like any other computer system, high availability clusters require software and hardware maintenance and modifications from time to time. The advantage of high availability clusters is that most of the tasks can be done without putting the whole cluster out of service. A common strategy is to bring one node out of cluster operation, perform the maintenance tasks on that node, check and see whether the maintenance was successful, bring the node back in the cluster and perform maintenance on the next node.9 However, this means that the services are unavailable for a short time period because the resources, hosted by the node on which the maintenance should be applied, have to be failed over. Therefore, performing maintenance tasks in high workload times, in which the short unavailability would affect many users, should be avoided. In addition to that, a few maintenance tasks even require the shutdown of the whole cluster. Not performing maintenance tasks which require that a node or the whole cluster has to be put out of operation at all is no option, since the likelihood of unplanned downtime increases over the time when a system is not properly maintained. So we should appoint periodical maintenance 7 During my practice term, a customer called us to restore one of their cluster nodes. A developer wrote a “clean up“ script which should delete all files in a particular directory and the sub directories, which are older than 30 days. The problem was that the script did not change in the directory to clean up and she scheduled the task in the crontab of root. So in the evening, the script began to run as root, in the home directory of root, which is / on Solaris. 8 [MARCUS] Pages 287 - 288 9 [MARCUS] Page 270 c Stefan Peinkofer 54 [email protected] 4.1. SYSTEM MANAGEMENT AND ORGANIZATIONAL ISSUES windows, preferable in times when the system does not have to be available or at least in low workload times. These windows are used to perform common maintenance tasks like software and firmware updates, adding or removing hardware, installing new applications, creating new cluster resources, and so on.10 Unfortunately, maintenance and modification tasks are critical even if they are performed during a maintenance window. For example the maintenance could take longer than the maintenance window or something may break because of the performed task. Another “dirty trick“ of maintenance tasks is that sometimes they seem to work fine at first, but cause a system failure many weeks after the actual maintenance task is carried out. To minimize the likelihood of maintenance tasks affecting availability, we must define and follow some guidelines. • Plan for maintenance - Every maintenance task has to be well planned in the first place. Reading documentation and creating step-by-step guidelines of the various tasks is mandatory. Since something could go wrong during the maintenance, identifying the worst case scenarios and planning for their occurrence is also vital. In addition to that, a fallback plan to roll back the changes, in case the changes do not work as expected or the maintenance task cannot be finished during the maintenance window, has to be prepared. • Document all changes - Every maintenance task has to be documented in a run book or another appropriate place like a change management system. 
Things to document, among the usual things like date, time and name of the person who performed the maintenance, are the purpose of the task, what files or hardware items were changed and how to undo the changes. In addition to the run book, it’s a good idea to note changes in configurations files with the same information as in the run book. • Make all changes permanent - Especially in stressful times, administrators seem to take the path of least resistance. In this case, it can happen that changes are applied only in a non-permanent way. For example adding a network route or an IP address, by the usual commands route and ifconfig lasts only until the next reboot unless 10 [STOCK] Page 20 c Stefan Peinkofer 55 [email protected] CHAPTER 4. DESIGNING FOR HIGH AVAILABILITY they are made permanent by changing the appropriate configurations files. The effect after the next reboot, which is usually carried out some time later, most likely within a maintenance window in which several maintenance tasks are carried out, is that the non-permanent modifications have vanished and the users will complain about it. Since the actual modifications were made some time ago and a system maintenance was carried out a few minutes or hours ago, it is usually the beginning of a long fault isolation night since everybody thinks in the first place that the problem was caused by the recent system maintenance and not by a non-permanent modification made some days or weeks ago. • Apply changes one after another - Applying more than one change at a time makes it very hard to track the problem if something goes wrong. Administrators should apply only one change at a time and after that make sure that everything still works as expected. Rebooting after the change is also a good idea, since some changes only take full effect after a reboot. After that, the next change can be applied.11 Another point to consider, in conjunction with maintenance, are spare parts. Keeping a spare parts inventory can help to decrease the mean time to repair. In order to get the most benefit of such an inventory, from the administrator’s as well as form the manager’s point of view, some rules have to be followed. The first thing is to decide which spare parts will be stocked. This should be at least the parts which fail the most, like disks or power supplies for example and the parts which are hard to get, meaning they have a long delivery period or are offered only by a few suppliers. Another point is that it must be ensured that the parts in stock are working. So it’s mandatory to test new parts before they are put into the inventory. In addition to that, authorized personnel must have access to the spare parts around the clock and access of unauthorized personnel must be prevented.12 11 12 [MARCUS] Pages 270 - 271 [MARCUS] Page 273 c Stefan Peinkofer 56 [email protected] 4.1. SYSTEM MANAGEMENT AND ORGANIZATIONAL ISSUES 4.1.5 Testing Every change which is planned to be applied to a productive system should first be tested in an insular test environment. This is especially important for patches and software updates which should be applied and for new software which should be installed. An ideal test environment would be an identical copy of the productive system. In this case, the test environment can be used as spare part inventory, too.13 But mostly, the test environment is a shrunken copy with smaller servers and storage. 
Another point to consider, in conjunction with maintenance, are spare parts. Keeping a spare parts inventory can help to decrease the mean time to repair. In order to get the most benefit from such an inventory, from the administrator's as well as from the manager's point of view, some rules have to be followed. The first thing is to decide which spare parts will be stocked. This should be at least the parts which fail the most, like disks or power supplies, and the parts which are hard to get, meaning they have a long delivery period or are offered only by a few suppliers. Another point is that it must be ensured that the parts in stock are working, so it is mandatory to test new parts before they are put into the inventory. In addition to that, authorized personnel must have access to the spare parts around the clock and access by unauthorized personnel must be prevented.12

12 [MARCUS] Page 273

4.1.5 Testing

Every change which is planned to be applied to a productive system should first be tested in an insular test environment. This is especially important for patches and software updates which should be applied and for new software which should be installed. An ideal test environment would be an identical copy of the productive system; in this case, the test environment can be used as a spare part inventory, too.13 Mostly, however, the test environment is a shrunken copy with smaller servers and storage. Often a company deploys more than one nearly identical productive system, so that only one test environment is needed. Though the costs of such a test environment are not negligible, it provides various benefits. Applying broken patches and software on the productive systems, and therefore unplanned downtime, can be avoided. Maintenance tasks can be tested with no risk, and performing the maintenance task on the productive system can be done faster, since the administrators are already familiar with the specific tasks. The test environment can also be used for training and to gain experience. In addition to that, application developers can use the test environment for developing and testing new applications.

13 [STOCK] Page 23

Another aspect of testing is the regular functional checking of the productive systems. For this purpose, common failure scenarios are initiated while monitoring whether the systems respond to the failure in the desired way. But not only the cluster system itself has to be tested. Infrastructure services, like the network, air conditioning or uninterruptible power supplies, have to be tested regularly as well.14

14 [SNOOPY] Page 7

4.1.6 Backup

It should be self-evident that all local and shared data of the cluster has to be backed up somehow. In addition to proper backup media handling, some additional guidelines have to be followed to get the maximum benefit from a backup system.

1. Disk mirroring is not backup, because mirroring cannot restore accidentally deleted files.15

2. Backup to disk alone is not an effective backup system. The price of hard disks has declined over the last few years, so ATA and S-ATA disks have become cheaper than magnetic tapes. In addition to that, ATA RAID systems provide a better read/write performance than tape drives, so the backup process can be finished faster. So companies have begun to back up to ATA disks rather than to magnetic tapes. However, ATA disks are not very reliable; they are not built for round-the-clock operation; and they tend to fail at the same time if they have been equally burdened over time. Therefore the possibility of too many disks in a RAID set breaking at nearly the same time, and all the data being lost, is considerably high. So in order to get a fast and reliable backup, the data on the backup disks has to be backed up to tapes as well.16

3. Backup tapes which contain the data of the high availability system should be stored in another building or at least in another fire compartment. In addition to that, the backup tapes should be copied and the copies stored in a further building or fire compartment, so that the backup is not destroyed in case of a disaster.17

15 [MARCUS] Page 238
16 [PARABEL]
17 [MARCUS] Page 239

Some applications must be shut down in order to back up the application data. If this cannot be done in times in which the system does not have to be available, other strategies have to be used. One solution for this problem is taking a block level snapshot of the related disks. Block level snapshots take a "picture" of a disk at a specific point in time. This is done by a copy on write algorithm, which copies the blocks that are modified after the snapshot was taken to another place. For the operating system, the snapshot looks like a new disk which can be mounted in read-only mode. To back up the application data, the application has to be shut down only for a short moment, during which the snapshot is taken; after that, the snapshot can be mounted and the data can be backed up. The block level snapshot feature is provided by almost all enterprise scale storage sub-systems. Additionally, there are various software tools which implement block level snapshots in software. An advantage of snapshots provided by the storage sub-system is that the backup task can be transferred to another server, because the snapshot can be mounted on any server which is connected to the storage sub-system.
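As a minimal sketch of a software-based snapshot, the following shows the Solaris fssnap tool, assuming a UFS file system mounted at /export/home and a hypothetical backing-store path; enterprise arrays and other volume managers offer comparable functions under different commands, so this is an illustration only.

    # Quiesce or briefly stop the application, then take the snapshot
    fssnap -F ufs -o bs=/var/tmp/home_snap /export/home   # prints e.g. /dev/fssnap/0

    # Restart the application; mount the snapshot read-only and back it up
    mkdir -p /mnt/home_snap
    mount -F ufs -o ro /dev/fssnap/0 /mnt/home_snap
    # ... run the backup client against /mnt/home_snap ...

    # Clean up after the backup has finished
    umount /mnt/home_snap
    fssnap -d /export/home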
The major rule of backup is: the faster lost files can be restored from the backup, the better. Unfortunately, this rule is often violated in favour of a fast backup process. Today's magnetic tape drives provide a write performance which is greater than the average read performance of a file system on a single disk. To gain the full tape write performance, many backup systems provide the ability to write more than one backup stream to a single tape simultaneously. This speeds up the backup process but slows down the restoration process. For example, if ten backup streams are written to a tape simultaneously and the tape has a read/write performance of 30 MB/s, the restore will run at only 3 MB/s.18 Such features have to be used with caution.

18 During my practice term, we had to spend the night and half of the next day at the customer's site to restore 4 GB of data, because they backed up 40 parallel streams to one tape at the same time.

To speed up the overall time to restore, meaning the time needed from starting the restore application until the system is available again, restoring a system should be practised on a regular basis, or at least a documented step-by-step restore procedure should be created. Normal backup software requires a working operating system and backup client application in order to be able to restore files from the backup system. So it is a good idea to take disk images of the local boot disks from time to time. These images usually back up the disk on the block level and therefore preserve the partition layout and the master boot record of the disk. So in case of a boot disk failure, the administrators just have to copy the disk image to the new disk instead of first reinstalling the complete operating system before the last backed up state of the boot disk can be restored.

4.1.7 Disaster Recovery

Disaster recovery deals with making the computer systems of a company (or even the whole company) available again within a specific time span in case a disaster strikes.19 A disaster is an event which causes the unavailability of one or more computer systems or even the unavailability of parts of the company or the whole company. Disasters can be major fire, flood, earthquake, storm, war, plane crash, terrorist attack, sabotage, area wide power failure and many more.20 Clusters can protect against some, but not all, disasters because the maximum physical distance between the cluster nodes is limited.

19 [ANON6]
20 [MARCUS] Pages 302 - 303
The greater the distance between the nodes, the higher the probability of a split brain situation.21 Placing the cluster nodes in different buildings which are some kilometres apart can protect against fire, plane crashes and floods, but for other disasters the distance is not large enough. There are many ways in which a company can protect against disasters, and the concrete implementation of disaster recovery goes beyond the scope of this thesis; we will just discuss some high-level matters. The first thing which is needed is a backup computer center which is far enough away from the primary center that a disaster cannot affect both sites.22 The second thing that is needed for disaster recovery is a disaster recovery plan, which should contain at least the following points:

21 [STOCK] Page 22
22 [MARCUS] Page 299

• What types of disasters the plan covers.
• A risk analysis of each covered disaster type.
• What precautions were taken to prevent or contain the effects of the covered disaster types.
• Who has to be notified about the disaster and who has the authority to decide about all further actions taken.
• Which systems are covered by the disaster recovery plan.
• How the data gets to the backup data center.
• Who is responsible for recovering the various systems.
• With which priority the systems at the backup site should be started again.
• What steps have to be taken to start up the various systems.
• Who is responsible for maintaining the disaster recovery plan.23

It is mandatory that the disaster recovery plan is always kept up-to-date24 and that the procedures in the plan are practised on a regular basis. It is worth mentioning that even if a company cannot afford a backup computer center for disaster recovery, it is a good idea to create at least a computer center cold start plan, because in case of a computer center shutdown, for instance as the effect of a power outage, it usually takes weeks until everything works as it did before.25

23 [ANON5]
24 [STOCK] Page 20
25 [SNOOPY] Page 8

4.1.8 Active/Passive vs. Active/Active Configuration

One of the main decisions which has to be made when deploying a high availability cluster is whether at least one node should do nothing but wait until one of the others fails, or whether every node in the cluster should do some work. An active/passive configuration has a slight availability advantage over active/active configurations because applications can cause system outages, too. So the risk of a node not being available because of a software bug is higher in an active/active configuration. However, arguing to management that an active/passive solution is needed can be hard, because it is not very economical. The economics can be improved if the cluster contains more than two nodes, so that only one passive system is needed for many active systems. Still, most high availability clusters used today are active/active solutions because they appear more cost-efficient to management. What has to be kept in mind with active/active solutions is that every server must have enough CPU power and memory capacity to run all the cluster resources by itself. As figures 4.2 and 4.1 show, active/passive solutions require more servers than active/active solutions, and active/active solutions require more powerful servers than active/passive solutions.
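A hypothetical sizing example (the numbers are illustrative only, not taken from the systems discussed later) may make this trade-off more concrete. Assume four services, each needing about 40 % of the capacity of a reference server. In an active/active two-node cluster, each node normally runs two services at 80 % load, but after a fail over one node must carry all four, i.e. about 160 %; each of the two servers therefore has to be sized at roughly 1.6 times the reference capacity. In an active/passive configuration the same workload can be spread over two active nodes at 80 % each plus one passive node of the same size, so three servers are needed, but none of them ever has to be larger than the reference server. This is precisely the sense in which active/passive costs more servers while active/active costs more powerful servers.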
Figure 4.1: Active/Active Configuration
Figure 4.2: Active/Passive Configuration

4.2 Hardware

In the following sections we will look at the hardware layout design of high availability clusters. In addition to that, we will look at some other hardware components which are not directly cluster related, but have to be reliable too, in order to achieve high availability.

4.2.1 Network

Networks are not part of clusters, but since clusters usually provide their services over a network, the network has to be highly available, too. There are many different implementations which make networks highly available, so we will discuss the whole issue only on a high level. The first thing we have to consider is network connectivity. This can be divided into three different paths:

1. Server to Switch
2. Switch to Switch
3. Switch to Router

In order to make server to switch connections highly available, we need two network cards on the server, which are connected to two different switches. In addition, we need some piece of software on the server which either detects the failure of a connection and fails communication over to the other connection, or which uses both connections at the same time and automatically discontinues the use of a failed connection. Of course, the clients have to be connected to both switches, too, in order to benefit from the highly available network.

Usually, a company network consists of more than two switches. In this case we have to consider switch to switch connections, too. On ISO/OSI Layer 2, Ethernet based networks are not allowed to contain loops. A loop exists, for example, in a three switch network when switch A is connected to B, B is connected to C, and C is connected back to A. But without such a loop, one or more switches can be a single point of failure, as shown in figure 4.3.

Figure 4.3: Inter-Switch Link Failure Without Spanning Tree

One method for removing the loop limitation is the IEEE 802.1w Rapid Spanning Tree Protocol, which is supported by mid-range and enterprise level switches. This method allows the forming of interconnect loops. As figures 4.4 and 4.5 show, the switches set the redundant paths offline and activate them in case an active interconnect fails.26

26 [KAKADIA] Page 15

Figure 4.4: Inter-Switch Links With Spanning Tree
Figure 4.5: Inter-Switch Link Failure With Spanning Tree

In addition to that, there are some proprietary solutions which do not disable the additional links, but utilize them like any other inter-switch connection. In contrast to the rapid spanning tree algorithm, these solutions work only between switches of the same manufacturer. To let the network tolerate more than one switch or link failure, both methods provide the ability to deploy additional switch to switch connections.
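One way to implement the server to switch redundancy described at the beginning of this section is Solaris IP Multipathing (IPMP), where a daemon monitors the interfaces of a group and moves the data address to the surviving interface on failure. The following is only a rough sketch with hypothetical interface names (ce0, ce2), group name and addresses; the exact syntax of the boot-time files varies between Solaris releases, so the IPMP documentation should be consulted before relying on it.

    # /etc/hostname.ce0 - data address plus a non-failover test address
    192.168.10.20 netmask + broadcast + group prod up addif 192.168.10.21 deprecated -failover netmask + broadcast + up

    # /etc/hostname.ce2 - second interface, cabled to the other switch, test address only
    192.168.10.22 netmask + broadcast + group prod deprecated -failover up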
If the services should be provided to clients on the Internet or on a remote site, the routers and Internet/WAN connections have to be highly available, too. What we need are two routers which are connected to two different switches, with each router connected to the outside network over a different service provider. However, the use of two routers introduces a challenge, since a server usually does not use a routing protocol to find the appropriate routes. A server normally just knows a default IP address to which all traffic that is not destined for the local subnet is sent. This IP address is called the default gateway. One possible solution for this problem is that the routers themselves act like a high availability cluster: when the active router or its connections fail, the other router takes over the IP and MAC address of the failed router.27

27 [KAKADIA] Pages 19 - 20

The second thing to look at is failures and their impact on the logical level. These problems are harder to address because they cannot be solved by simply adding redundancy. Some of the common failure scenarios on the logical level are duplicate IP address errors or high network latency, caused by a broadcast storm or by a new Internet worm which has infected the company's Microsoft workstations.28 To minimize the occurrence of these failures, a comprehensive management effort is needed to implement clearly defined processes and security policies.

28 [MARCUS] Pages 138 - 139

4.2.2 Shared Storage

The storage sub-system is the most critical part of a high availability solution, since the failure of a storage system can cause data corruption or loss. In order to provide a "bullet-proof" storage system, various things have to be taken into account:

1. Requirements for disks
2. Requirements for hardware RAID controllers
3. Requirements for disk enclosures
4. Server to Storage connections

To deploy redundant disks, some type of RAID level is used. The commonly used RAID levels are RAID 1 to mirror one disk to another, RAID 10 to mirror a set of disks to another set, or RAID 5 to provide redundancy of a disk set by reserving the capacity of one disk for parity information, which is distributed across the set. In addition to the disks in the RAID set, some hot spare drives, which are enabled when a disk fails, have to be deployed.29

29 [ELLING] Page 202

The RAID functionality can be provided either by software on the cluster nodes or, if available, by a hardware RAID controller in the disk enclosure. If software RAID is used, some amount of CPU and I/O capacity will be occupied by the RAID software. Since RAID 5 requires the calculation of parity bits, deploying it with a software RAID solution is not recommended because it does not perform very well. In addition to that, not all software RAID solutions can be used for shared cluster or SAN file systems. A hardware RAID controller must provide redundant I/O interfaces, so that in case of an I/O interconnect failure the nodes can use a second path to the controller. If a hardware RAID controller uses a write cache, it must be ensured that the write cache is battery backed up or, if this is not the case, that the write cache is turned off. Otherwise, the data in the write cache, which can be a few GB, is lost in case of a power outage.30 In addition to that, as shown in figure 4.6, the RAID controllers themselves have to be redundant, so that in case of a primary controller failure the secondary controller continues the work.

30 [ELLING] Page 202
Figure 4.6: Redundant RAID Controller Configuration

The disk enclosure must have redundant power supplies which are connected to different power sources, and it must provide the ability to hot-swap all field replaceable units. This means that every functional unit, like a disk, a controller or a power supply, can be changed during normal operation. Also, some environmental monitoring capabilities, like temperature sensors and an automatic shutdown capability which turns off the enclosure when the environmental values deviate from the specified range, are desirable.31 If a disk enclosure contains no hardware RAID controller, it must provide at least two I/O interfaces to survive an I/O path failure.

31 [ELLING] Page 202

To improve storage availability, or to compensate for the lack of redundant RAID controllers, I/O interfaces or power supplies, the disk enclosures themselves can be deployed in a redundant way. As shown in figure 4.7, we must mirror the disks between two enclosures for this purpose. For low cost enclosures we have to use software RAID 1; high-end enclosures usually provide this feature on the enclosures' RAID controller level. With redundant enclosures, the data can be held on two different sites, if desired.

Figure 4.7: Redundant Storage Enclosure Solution

The same considerations as for network server to switch connections also apply to fibre channel server to switch connections, if a SAN is used. In contrast to the network switch to switch connections, the loop restriction does not apply to SAN switches; hence, a SAN natively supports fault resilient topologies. In contrast to Ethernet networks, the costs per connection port of SANs are not yet negligible. To let the SAN tolerate more than one failure at a time, additional inter-switch links are needed, so the decision regarding how many failures the SAN can tolerate should be based upon a comprehensive cost/risk analysis.

If a cluster system is connected to a SAN and uses SCSI reservations for fencing, it must be ensured that it will only reserve the shared disks which are dedicated to the cluster system. Usually the cluster system will provide a method to exclude shared disks from the fencing operation. If this method is based on an "opt-out" algorithm, the system administrators must continuously maintain the list of excluded shared disks, so that the cluster does not place SCSI reservations on newly added shared disks. A better approach is the use of LUN (Logical Unit Number) masking, which provides the ability to define which hosts can access a shared disk, directly on the corresponding storage device. However, this function is not provided by all storage devices.
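A minimal sketch of the host-based RAID 1 mirroring between two low-cost enclosures mentioned above, using the Solaris Volume Manager as an example; the device names (one slice from each enclosure) and metadevice numbers are hypothetical.

    # Create one submirror per enclosure from identically sized slices
    metainit d11 1 1 c2t0d0s0     # submirror on enclosure A
    metainit d12 1 1 c3t0d0s0     # submirror on enclosure B

    # Build the mirror from the first submirror, then attach the second;
    # the file system is created on the mirror device d10.
    metainit d10 -m d11
    metattach d10 d12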
4.2.3 Server

Today's server market provides a vast number of different server types with different availability features. Generally, high-end servers can be used as cluster nodes without restriction, since they were designed with availability considerations in mind. However, things look different at the low end. In low-end servers, sometimes even basic availability features are omitted in order to achieve a lower price. Since many smaller companies on the one hand cannot afford and do not even need enterprise scale servers, but on the other hand have a demand for high availability clusters, we will look only at the basic availability features of servers in a cluster environment.

The first component we have to look at is the power supplies. They must be redundant and additionally provide the capability to be connected to two different power sources. The cooling fans of the server chassis must provide at least N + 1 redundancy, which means that the failure of one fan can be compensated by the other fans. As for the storage enclosure, environmental monitoring and automatic shutdown functions are desirable to prevent hardware damage.32

32 [ELLING] Page 201

The server should have at least two internal disks, so that the operating system and local data can be mirrored to the second disk. In addition, the disks should be connected to two different I/O controllers, so that in case of a controller failure only one disk is unavailable. It must also be ensured that the server can boot from the second disk in case the primary disk fails. At least the power supplies and the disks must be hot pluggable. The server must provide enough PCI slots to contain the needed PCI cards like network or I/O controllers. At a minimum, it must provide two network connections to the public net, two connections for the cluster interconnect, and two I/O controller cards. It should also use two separate PCI buses, so that the failure of a bus affects only half of the PCI cards. Some vendors provide PCI cards which offer more than one connection at once, such as dual or quad Ethernet cards. If such cards are used, at least two of them must be used so that the cards do not become single points of failure. The system memory must be ECC memory to prevent memory corruption. In addition to the availability features, servers as well as storage systems should be acquired with later increases of capacity requirements in mind. It is always a good idea to have some free slots for CPU, memory and disk expansions, since otherwise we will be forced to acquire new servers and to build a new cluster when the actual capacity requirements exceed the system capacity.33

33 [MARCUS] Page 40

4.2.4 Cables

Cables are a common source of failures in computer environments. Often, they get accidentally disconnected because someone thinks they are no longer being used, or they get cut during construction or maintenance work. To minimize the potential danger of cables, we have to consider them in the design process. The first rule is that all cables should be labeled at both ends. The label should tell where the cable comes from and where it goes. If the cabling is changed for some reason, the labels have to be updated immediately, too, since a false label is worse than no label. The second rule is that redundant cables should be laid along different routes. For example, if we have our cluster nodes in two different buildings and all cables between the nodes are laid along the same route, an excavator digging at the wrong place will likely cut all of them. This rule also applies to externally maintained cables, like redundant Internet/WAN connections and power grid connections.
It is worth mentioning that we must not assume that two different suppliers use two different cable routes. So it has to be verified with the suppliers that different routes are actually used.34

34 [SNOOPY] Page 8

4.2.5 Environment

The last item we have to consider in the hardware design process is the environment in which the cluster system will be deployed. A high availability system can only be beneficial if the environment meets some criteria. The first point we have to consider is the power supply. A power outage is probably the most dangerous threat for data centers, since even if the center is well prepared, something will always go wrong in case of an emergency. To minimize the effects of a power outage, at least battery backed up uninterruptible power supplies have to be used, to bridge short power outages and to let the systems shut down gracefully in case of a longer power outage. If the systems are critical enough that they have to be available even during a longer power outage, the use of backup power generators is mandatory. What has to be kept in mind is that these generators require fuel in order to operate, so the fuel tank should always be kept full. It is also a good idea to use redundant power grid connections, but it has to be ensured that the power actually comes from different power lines.35

35 [SNOOPY] Pages 7 - 8

The second item to consider is the air conditioning. As we have already discussed in previous chapters, the systems should be able to shut themselves down if the environmental temperature gets too high. In order to prevent this situation, the air conditioning has to be redundant. High temperature can not only be caused by an air conditioning failure, it can also occur if the cooling power of the air conditioning becomes insufficient. This can happen, for example, because of high outdoor temperatures or because new servers were added to the computer room. Therefore the environmental temperature and the relative humidity have to be monitored continuously, and someone has to be notified if they exceed a defined threshold.36 If the redundant air conditioning runs in an active/active configuration, it must be ensured that one air conditioner alone can deliver sufficient cooling power for the computer center. Therefore the waste heat produced by the IT systems has to be compared with the cooling power of the air conditioning every time a new system is added to the center.

36 [ELLING] Page 201

The third problem we have to deal with is dust. Dust can cause overheating if it deposits on cooling elements or if it clogs air filters. It can also damage the bearings of cooling fans. In addition to that, metallic particles can cause short circuits in the electronic components. To minimize contamination, the air in the computer room should be filtered and the filters should be maintained regularly.37

37 [ELLING] Page 201

The fourth issue is the automatic fire extinguishing equipment deployed in the computer room. Under all circumstances, the equipment must use an extinguishing agent which causes no damage to the electrical equipment, so water or dry powder must not be used.38 If such a system cannot be afforded, it is better to have no automatic fire extinguishing equipment at all, since it usually causes more damage than the fire itself.

38 [ELLING] Page 201
However, in this case it is mandatory that the fire alarms automatically notify the fire department and that fast and highly available first responders are on hand, like janitors who live in the building which contains the computer systems.

In addition to the four main concerns discussed above, some other precautions have to be taken, depending on the geographical position of the data center. For example, in earthquake prone areas the computer equipment has to be secured to the floor to prevent it from falling over. Also, it is a good idea to keep all computer equipment on at least the first floor, and not only in flood prone areas.

4.3 Software

The last big design area covers the software components which will be used on the cluster system. This area is divided into four main components: the operating system, the cluster software, the applications which should be made highly available, and the cluster agents for the applications. In addition to the component specific design considerations, there are some common issues which should be mentioned.

The first rule in the software selection process for a high availability system is to use only mature and well tested software. This will minimize the likelihood of encountering new software bugs, because many of them will already have been found by other people. However, if the deployment of an "x.0 software release" is necessary, plenty of time should be scheduled for testing the software in a non-productive environment. All problems and bugs which are found during testing must be reported to the software producers.39 The software producers will need some time until they deliver a patch for the bugs; this should also be considered in the project time plan.

39 [PFISTER] Page 395

For each commercial software product which is deployed in the cluster, a support contract should be concluded in order to get support if something does not work as expected. Unlike the open source community, commercial software producers will not provide support free of charge. In addition to that, not all known bugs and problems are disclosed to the public, so in order to get the information needed, a support case has to be submitted. Without a support contract, such calls will be billed on a time and material basis, which is in the long term usually more expensive than a support contract. Typically, different support contracts with different service level agreements are available. For high availability systems, premium grade support contracts which provide round-the-clock support should be chosen.

For open source software, the open source community usually provides free support through mailing lists and IRC (Internet Relay Chat) channels. However, the quality of the support provided through these channels varies from software to software. Also, there is no guarantee that someone will reply to a support call within an acceptable time range. To eliminate this drawback, some companies provide commercial support for open source software. If the IT staff does not have comprehensive skills in the deployed open source software, such contracts are mandatory in order to provide high availability.

During the software life cycle, customers will continuously encounter software bugs and software producers will deliver patches to fix them.
In order to fix known bugs in the production environment before they cause failures, patches have to be installed proactively. Unfortunately, it is very hard to keep track of all available patches manually; five hundred patches for an operating system alone are not unusual. Additionally, patch revision dependencies exist in some cases. For example, application A may not work with revision 10 of operating system patch number 500. To alleviate the problem, most software producers maintain a list of recommended patches, which contains patches for the most critical bugs. At least these patches should be applied regularly, as long as no revision dependencies exist. In addition to the recommended patches, the system should be analyzed to determine whether further system specific patches are needed.40

40 [MARCUS] Pages 272 - 273

Some software producers provide tools which analyze the software and automatically find and install all patches which are available for it. These tools can dramatically simplify and speed up the patch selection process. Unfortunately, they usually do not pay attention to patch dependencies with other software components, like the operating system. Regardless of the method which is used to find the needed software patches, all proactively installed patches should first be tested in the test environment41. In this way we ensure that they work as expected, since some patches will introduce new bugs. For example, during the practice part of this thesis, I proactively applied a patch for the NFS server program. After the patch was applied, shutting down the NFS server triggered a kernel panic. Applying such a patch to a production environment can cause unexpected downtime, and it will definitely require planned downtime to isolate and back out the faulty patch.

41 [MARCUS] Page 272
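To illustrate this proactive patching workflow on Solaris (command availability differs between releases, and the patch ID below is hypothetical, so treat this only as a sketch):

    # List the patches that are already installed
    showrev -p | more

    # On Solaris 10, the update tools can propose missing patches
    smpatch analyze

    # Install a downloaded patch on the test system first, then reboot and verify
    patchadd /var/tmp/123456-01
    init 6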
If software is deployed on a cluster system, special requirements or restrictions can exist. It is mandatory to read the documentation of the deployed software and to follow the stated guidelines. This is of particular importance if support contracts are concluded, since the software producers will usually refuse to support a configuration which violates their restrictions42.

42 This goes so far that you will not be supported even if your problem obviously has nothing to do with the violation.

4.3.1 Operating System

The first design issue on the operating system level is the partition layout of the boot disk. The first task is to find out whether the cluster software, the volume manager or an application has special requirements for the partition layout. If these requirements are not known before the partition layout is created, repartitioning of the boot disk and therefore a reinstallation of the operating system may become necessary. After these requirements are met, the partition layout for the root file system can be designed. As a general rule, it should be as simple as possible, so creating one partition for the whole root file system is advisable, but only if the available space on the root file system is sufficiently large. Cluster systems typically produce a huge amount of log messages. These messages are usually deleted automatically after some time. However, if the root file system is too small, it may run out of space when too many log messages are generated over time. In such a situation, along with all the other negative effects of a full file system, no one will be able to log on to the node, since the log on procedure will try to write some information to disk, which of course fails. For smaller root file systems it is therefore recommended to put the /var directory, which contains the system log messages, on a separate partition, so that the administrator can still log on in such a situation.

Depending on the deployed cluster software, some local component fail over mechanisms are left to the operating system. For example, fail over of redundant storage paths is often done by the operating system. All redundant components for which the cluster system provides no fail over function have to be identified, and alternative fail over methods have to be deployed. If such fail over methods cannot be found for the desired operating system, it should not be used on the cluster system.

The system time is a critical issue on cluster systems. If cluster nodes have different times, random side effects can occur in a fail over situation. To prevent this, the time of all cluster nodes must be synchronized. Usually the network time protocol is used for this purpose, but it has to be assured that the nodes are kept synchronized even if the used time servers are not available, since the synchronisation of time between the nodes is more important than accuracy with respect to the real time.

Operating systems depend on various externally provided services, like the Domain Name System (DNS) for hostname to IP resolution or the Lightweight Directory Access Protocol (LDAP) and the Network Information Service (NIS) for user authentication. All these external services must be identified and it has to be assured that they are highly available, too. If this is not the case, it has to be ensured that the system is able to provide its services even when the external services are not available.

4.3.2 Cluster Software

The design issues for the cluster software are highly dependent on the deployed cluster product and are usually discussed in detail in the documentation provided along with the cluster software. One of the common design tasks is to decide which resources should run on which cluster node during normal operation and which resources have to be failed over together in a resource group. Additionally, the resource dependencies within the resource group and, if they exist, the dependencies between resources in different resource groups have to be identified. A good method for planning the dependencies is to draw a graph in which the resources are represented by vertices and the dependencies by edges, as shown in figures 4.8, 4.9 and 4.10.

Figure 4.8: Drawing a Resource Dependency Graph Step 1
Figure 4.9: Drawing a Resource Dependency Graph Step 2
Figure 4.10: Drawing a Resource Dependency Graph Step 3

The next thing to decide is whether a resource group which was failed over to another node should automatically be failed back to the original node when that node joins the cluster again. In general, auto fail back should be disabled unless there is a good reason to enable it. Failing back a resource group means that the resources are unavailable for a short period of time. This may not be tolerable during some hours of the day, or it may not even be tolerable until the next maintenance window. In addition, in a failure scenario in which a node repeatedly joins and, after a few minutes, leaves the cluster again, for example because of a CPU failure which occurs only under special conditions, the resource group would be ping-ponged between the nodes. The only reason which legitimizes the use of an automatic fail back is performance. If the performance of the application is more important than a short disruption in service and the risk of a ping pong fail over / fail back, then auto fail back can be used.

4.3.3 Applications

On the application level, again, many design issues are unique to the particular application. The first common design task is to decide which software product should be used in order to provide the desired service. An application which should be deployed on a high availability cluster system has to meet some requirements. The application must provide its services through a client/server model, whereby the clients access the server over a path which can be failed over from one node to another. For example, an application which is connected to its clients over a serial line cannot be deployed on a high availability cluster. In addition, most cluster systems will only support the use of TCP/IP as the client access path. Some applications require human intervention to recover when they are restarted after a system crash, or they require the administrator to provide a password during the start-up process, for example. Such applications are also not suitable for high availability clusters. In addition to that, the recovery process must finish within a predictable time limit. The time limit can be specified by the administrator, and it is used by the cluster software to determine whether an application failed during the start procedure. Since the application data and possibly the application configuration files must be placed on shared storage, the location of these files must be configurable. If this is not the case, it can be next to impossible to place the files on shared storage.

If the application provides its service through TCP/IP43, the cluster has to be configured to provide one or more dedicated IP addresses, which will be failed over along with the application. For this reason, the application must provide the ability to let the system administrator define to which IP addresses the application should bind. Some applications which do not provide this feature will bind to all available IP addresses on the system. That behavior is acceptable as long as no other application running on the cluster uses the same TCP port. If both applications ran on the same host, one application would not be able to bind to the IP addresses.44

43 Which is the default for applications deployed on a HA cluster.
44 [BIANCO] Pages 45 - 49
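Samba, which will reappear in the sample implementation later on, is one example of an application that lets the administrator restrict the addresses it binds to. A minimal, hypothetical smb.conf fragment could look as follows, where 192.168.10.50 stands for the logical host IP address that fails over with the resource group:

    [global]
        # Bind only to the loopback interface and the logical host address
        # of the cluster resource group, not to every address on the node.
        interfaces = lo 192.168.10.50/24
        bind interfaces only = yes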
The decision which has to be made after the software product selection is how the application will be installed. Option one is to place the application binaries and configuration files on the shared storage, and option two is to place them on the local disk of each cluster node. Each option has assets and drawbacks. Option one provides the advantage that only one copy of the application and configuration files has to be maintained; applying patches or changing the configuration has to be done only once. The disadvantage is that the application has to be shut down cluster wide in order to upgrade it. Option two provides the advantage of rolling upgrades. The software can first be upgraded or reconfigured on the standby nodes and, after that, the service can be switched over to an upgraded node in order to perform the upgrade on the remaining node. This provides the additional advantage that when a problem arises during the upgrade process, or during the start of the new version or new configuration of the software, the node which hosted the application originally provides a fail back opportunity45. The disadvantage is that several copies of the application and the configuration have to be maintained. It must also be ensured that the configuration files are synchronized on all hosts.46

45 Note that this doesn't remove the need to test the upgrade in the test environment in the first place.
46 [ANON7] Pages 16 - 17

Sometimes, applications depend on services which are not provided by applications that run on the cluster system. These services have to be identified and it must be ensured that they are highly available. A better approach would be to deploy the applications which provide these services on the same cluster as the applications that depend on them. This allows the cluster system to take care of the dependencies.

4.3.4 Cluster Agents

The design of a cluster agent depends mainly on two factors: the used cluster software, which specifies the functions that must or can be provided by the agent, and the application the cluster agent should handle, which specifies how the application can be started, stopped and monitored. Usually, an application can be monitored in different ways and at different levels of detail. The more detailed the monitoring of the application is, the more failures can be detected and the better the fault reporting that can be provided. What should be kept in mind is that the complexity of the agent will increase along with the monitoring detail, and therefore the likelihood that the agent itself contains bugs, and hence may fail, increases as well.47 So the general design rule for the monitoring function is: as detailed as needed and as simple as possible.

47 [ELLING] Page 95

One requirement of nearly any cluster system is that all, or at least some, of the resource agent functions have to be idempotent. Idempotency means that the result of calling a function two or more times in a row is the same as calling the function only once. For example, calling the stop function once should stop the resource and return successfully. Calling the stop function a second time should leave the resource stopped and return successfully. Likewise, calling the start function once should start the resource and return successfully. Calling the start function a second time should not start the resource again48, but only return successfully.

48 We assume that the resource is still running.
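A minimal sketch of idempotent start and stop functions for a hypothetical daemon mydaemon is shown below; the path names and the PID-file convention are assumptions and not prescribed by any particular cluster product. Both functions first check the current state and simply report success if there is nothing left to do:

    #!/bin/sh
    PIDFILE=/var/run/mydaemon.pid

    is_running() {
        # Succeeds only if the PID file exists and the process is still alive
        [ -f "$PIDFILE" ] && kill -0 "`cat $PIDFILE`" 2>/dev/null
    }

    start() {
        is_running && return 0          # already running: nothing to do
        /opt/mydaemon/bin/mydaemond && return 0
        return 1                        # signal failure to the cluster framework
    }

    stop() {
        is_running || return 0          # already stopped: nothing to do
        kill "`cat $PIDFILE`" && rm -f "$PIDFILE"
    }

    case "$1" in
        start) start ;;
        stop)  stop ;;
    esac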
Chapter 5
IT Infrastructure of the Munich University of Applied Sciences

In the following chapter we will look at the infrastructure which is used by the sample implementations of the Sun Cluster and Heartbeat high availability cluster systems, and also analyze which of these components constitute a single point of failure.

5.1 Electricity Supply

As figure 5.1 shows, the building which contains the server room provides three main electric circuits, two of which are available in the server room; each device with a redundant power supply is connected to both circuits. Each of the main circuits is fed by a dedicated transformer in the basement of the building. However, all of the transformers are fed by a single common high voltage transmission line. Therefore the provision of electricity is a single point of failure. Because of the high costs of a second high-voltage transmission line or a centralized uninterruptible power supply system, and the relatively rare occurrence of major power outages, this single point of failure will probably never be removed.

Figure 5.1: Electricity Supply of the Server Room

5.2 Air Conditioning

The air conditioning of the server room is provided by two air conditioning units which work in an active/active configuration. Although in the past one unit alone was able to handle the waste heat of the IT systems, this is no longer the case. More and more servers have been deployed over the last years, and so the produced waste heat has exceeded the cooling capacity of a single air conditioning unit. So the air conditioning is a single point of failure.1 A direct solution for removing the single point of failure, namely installing new air conditioning units with higher capacity, is not possible for cost reasons. In the building which contains the computer room, there are two other rooms with autonomous air conditioning. Since these rooms will become available in the near future, because the faculty which owns them is moving to another building, the redundancy of the air conditioning in the central computer room could be restored by moving some servers to the other rooms.

1 This has already failed more than once.

5.3 Public Network

The public network of the Munich University of Applied Sciences spreads out over several buildings which are distributed across the whole city. Every building is connected to a router in building G, which also contains the central server room. While most sub-networks within the buildings are redundant, using the rapid spanning tree algorithm and a proprietary enhancement, the connections to the router and the router itself are not redundant. In addition to that, no redundant Internet connection is available. So the router itself, the inter-building connections and the Internet connection are single points of failure. Unfortunately, the situation cannot be improved in the medium term, because fully redundant inter-building connections would cost several million euros.
Although the network within the building which contains the server room provides redundancy, most servers and workstations do not fully utilize this feature yet. They are connected to the network over just one path. If a switch fails, network connectivity to and from the computers connected to the failed switch is lost. What makes this worse is that most switches do not have redundant power supplies, for cost reasons. So, from the point of view of the cluster system, the switch to service consumer connection is a single point of failure. In the short term, this single point of failure is planned to be removed for the servers which use the services provided by the high availability clusters and for the servers which provide services on which the clusters depend. However, even in the medium term, the single point of failure cannot be removed for workstations, because the costs would be too high. As we have seen, the public network is an area which needs further improvement in order to provide comprehensive reliability for all who use the services provided by the cluster systems.

5.4 Shared Storage Device

As shared storage device, the Sun StorEdge 3510 fibre channel array is used. The array uses hot pluggable, redundant power supplies, cooling fans and RAID controllers; in case of a controller failure, the data flow is failed over transparently to the connected servers. Also, every controller provides two I/O paths which can be connected to the SAN. In addition to that, the controllers can work in an active/active configuration in which every controller maintains one or more separate RAID sets. This feature is especially useful if RAID 5 is used, since the load for computing the parity bits is distributed among both controllers. Furthermore, the enclosure provides a comprehensive set of monitoring features, and in conjunction with a special software the administrators can be notified automatically about critical events. The 3510 also supports LUN masking but unfortunately no off-site mirroring, so mirroring disks between different enclosures has to be done by software RAID tools. As figure 5.2 shows, in our configuration the 3510 contains two RAID 5 sets consisting of 5 disks each; additionally, two hot spare disks are deployed. To increase performance, the RAID sets are maintained by the controllers in an active/active configuration. As we have seen, the 3510 storage array meets all requirements to provide highly available data access and is therefore suitable to be deployed in a high availability cluster environment.

Figure 5.2: 3510 Configuration

5.5 Storage Area Network

The storage area network used consists of two dedicated switch fabrics, each consisting of two cascaded switches. One switch in each fabric is a 16-port switch which provides redundant power supplies; the other switch is an 8-port switch without redundant power supplies. However, both provide at least redundant cooling fans.
All switches are built as a single field replaceable unit, so no hot pluggable components exist; if something fails, the whole switch has to be replaced. As figure 5.3 shows, the fabrics2 are divided into two different zones, a production zone and a test environment zone. The test zone will be used by the two sample cluster systems until they are put into production use. A zone confines the fibre channel traffic, so no device in zone A can access a device in zone B and vice versa.

2 Note that the figure shows only one fabric. The other fabric is configured identically.

Figure 5.3: Fibre Channel Fabric Zone Configuration

The production zone consists only of ports on the 16-port switches, since they provide better reliability than the 8-port switches. The chosen topology protects against all single switch or path failures and against some double failures. Since adding more inter-switch links only increases the number of link failures the topology could tolerate, but not the number of switch failures, the use of more inter-switch links was rejected for cost reasons.

Chapter 6
Implementing a High Availability Cluster System Using Sun Cluster

6.1 Initial Situation

Currently, a single server hosts the file serving applications NFS and Samba for the users' home directories. This server is based on the SPARC platform and runs the Solaris 9 operating system. The home directory data is placed on a 1 TB shared disk, which is hosted on a 3510 fibre channel storage array. The file system used on this volume is a non-shared SUN QFS, which would also provide the possibility of being deployed as a shared SAN file system. In addition to that, the server also hosts the Radius authentication service.

6.2 Requirements

The requirements for the new system are two-tiered. Tier one is to provide a high availability cluster solution, using two SPARC based servers, Solaris 10 as the operating system and Sun Cluster as the cluster software. On this cluster system, the services NFS, Samba and Radius should be made highly available. To eliminate the need to migrate the home directory data to a new file system, the cluster should be able to use the already existing SUN QFS file system once the cluster goes into production. In addition to that, the SUN QFS file system should be deployed as an asymmetric shared SUN QFS1, and thereby act as a cluster file system, in order to distribute the load of NFS and Samba among the two nodes by running NFS on one node and Samba on the other.

1 QFS provides the possibility to migrate a non-shared QFS to a shared QFS and vice versa.

Tier two of the requirements is to evaluate whether the SUN QFS which contains the home directory data can also be deployed as a highly available SAN file system, so that servers outside of the cluster can access the home directory data directly over the SAN. This is mainly needed for backup reasons, because backing up a terabyte over the local area network would take too much time. In order to do a LAN-free backup, the backup server must be able to mount the home directory volume, which is of course not possible with a cluster file system.

6.3 General Information on Sun Cluster

The Sun Cluster software is actually a hybrid cluster product, which can be deployed as a traditional fail over cluster as well as a load balancing or high performance cluster.
To this end, Sun Cluster provides various mechanisms and APIs which can be used by the corresponding types of services. For example, Sun Cluster provides an integrated load balancing mechanism whereby one node receives the requests and distributes them among the available nodes. For high performance computing, Sun Cluster provides a Remote Shared Memory API, which enables an application running on one node to access a memory region of another node. However, the features for load balancing and high performance computing are not discussed further in this thesis.

Sun Cluster supports three different types of cluster interconnects:

• Ethernet
• Scalable Coherent Interconnect (SCI)
• Sun Fire Link

For normal fail over and load balancing clusters, typically Ethernet is used as the cluster interconnect, whereby Sun Cluster uses raw Ethernet packets to exchange heartbeats and TCP/IP packets to exchange further data. SCI or Sun Fire Link are typically used in a high performance computing configuration, since these interconnects enable the remote shared memory feature. Larger load balancing configurations may also benefit from these cluster interconnects because of their low latency and high data bandwidth.

Sun Cluster uses a shared disk as quorum tie breaker and SCSI-2 or SCSI-3 reservations, respectively, to fence a failed node. In addition to the raw SCSI reservations, Sun Cluster deploys a so-called failfast driver on each cluster node, which initiates a kernel panic when a node gets a SCSI reservation conflict while trying to access a disk.

6.4 Initial Cluster Design and Configuration

In the following sections we will discuss the design of the cluster for the tier one requirements.

6.4.1 Hardware Layout

To build the cluster, two machines of different types were available: one SUN Fire V440 and one SUN Enterprise 450. Each server must provide various network and fibre channel interfaces: two for connecting to the public network, two for the cluster interconnect network and two fibre channel interfaces for connecting to the storage area network. An additional connection for a SUN QFS meta data network is not needed, since a design restriction for deploying SUN QFS as a cluster file system is that the cluster interconnect has to be used for meta data exchange. For the public network connection, 1 GBit fibre optic Ethernet connections are deployed because the public network switches mainly provide fibre optic ports. For the cluster interconnect, copper based Ethernet is deployed because it is cheaper than fibre optics. Figures 6.1 and 6.2 show how the interface cards are installed in the servers.

Figure 6.1: PCI Card Installation Fire V440

Figure 6.2: PCI Card Installation Enterprise 450

The V440 already provides two copper gigabit Ethernet connections on board. Each of them is addressed by a different PCI bus.
The additional network and fibre channels cards are installed in the PCI slots. One half is connected to PCI bus A and the other half to PCI bus B. This hardware setup of the V440 can tolerate a PCI bus failure. Unfortunately, this could not be achieved for the Enterprise 450, although it provides two dedicated PCI buses. The problem is that one of the busses provides only two PCI slots which can handle 64-bit cards and all interface cards require 64-bit slots. c Stefan Peinkofer 92 [email protected] 6.4. INITIAL CLUSTER DESIGN AND CONFIGURATION Figure 6.3 shows the various connections of the cluster nodes. 3510 Storage Enclosure Fibre Channel Switches gagh tribble Ethernet Switches Ethernet Copper (Twisted Pair) Ethernet Fibre Ethernet Fibre Fibre Channel Cluster Interconnect Public Network Connection Redundant Inter-Switch Link Fibre Channel Connection Figure 6.3: Cluster Connection Scheme The servers, fibre channel switches and public network switches are distributed throughout the server room. The cables were not laid in different lanes because the gained increase in availability does not justify the costs for doing so. The cluster interconnect interfaces are connected directly with cross-over Ethernet cables, since it was not planned to increase the number of cluster nodes in the future which would require the deployment of Ethernet switches. The two public network switches, to which the cluster nodes are connected, are built on a modular concept. Each switch is able to accommodate eight switch modules. To keep the modules from becoming a single point of failure, each public network cable is connected to a different switch module. As already mentioned in chapter 5.3 on page 84 the public network switches are rec Stefan Peinkofer 93 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER dundantly connected together. Each server is connected to both fibre channel fabrics, so it can survive a whole fabric failure. The V440 contains four hot pluggable 74 GB SCSI disks, which are all connected to a single SCSI controller. This single point of failure cannot be removed, since the SCSI back plane, to which the disks are connected, provides only a single I/O controller connection. The Enterprise 450 contains three hot pluggable 74 GB SCSI, whereby two are connected to SCSI controller A and one is connected to SCSI controller B. Even though the V440 provides a hardware RAID option for the local disks, it is not used in order to simplify management, since the Enterprise 450 does not provide such an option and therefore software RAID has to be used to mirror the boot disks. So it was decided to use software RAID on both servers. The Enterprise 450 provides three redundant power supplies, but it provides only a single power connector. So connecting the server to two different main power circuits is not possible and therefore power connection is a single point of failure on this machine. The V440 provides two redundant power supplies and each provides a dedicated power connector. This machine is connected to two different main power circuits. Uninterruptible power supplies are not deployed because of the high maintenance costs for the batteries. As we have seen, the servers are not completely free of single points of failures. Unfortunately, the ZaK cannot afford to buy other servers. 
Fortunately, the possibility that a component which constitutes a single point of failure in these systems will fail is very low, except with the non-redundant power connection, of course. The single points of failures are accepted because the costs to remove those points are greater than the benefit of the gained increase in availability. c Stefan Peinkofer 94 [email protected] 6.4. INITIAL CLUSTER DESIGN AND CONFIGURATION 6.4.2 Operating System Except for some special requirements concerning the boot disk partition layout, the operating system is installed as usual. Every node has to be assigned a hostname and single public network IP address during the installation. The hostname assigned in this step is called physical hostname. The V440 is named tribble and the Enterprise 450 is named gagh. 6.4.2.1 Boot Disk Partition Layout For the boot disk partition layout, there are two design requirements from the Solaris Volume Manager (SVM), which is used for software mirroring the boot disk, and the Sun Cluster software. The SVM requires a small, at least 8 MB large, partition on which the state database replicas will be stored. The state database replicas contain configuration and state information about the SVM volumes. The Sun Cluster software requires a partition at least 512 MB large, which will contain the global device files. This partition has to be mounted on /globaldevices. The global device file system will be exported to all cluster nodes over a proxy file system. This allows all cluster members to access the devices of all other cluster members. In addition to that, the global device file system contains an unified disk device naming scheme, which identifies each disk device, be it shared or non-shared, by a cluster wide unique name. For example instead of accessing a shared disk on two nodes over two different operating system generated device names each node can access the shared disk over a common name. For the root file system partition layout, a single 59 GB partition was created. A dedicated /var partition for log files is not needed because of the more than sufficient size of the root partition. For the swap partition, an 8 GB large partition was created. Since the Enterprise 450 has 4 GB memory and the V440 has 8 GB memory, 8 GB swap space should suffice. Table 6.1 gives an overview of the partition layout. c Stefan Peinkofer 95 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER Slice Tag Mount Point Size 0 root / 59 GB 1 swap swap 8 GB 2 backup na 68 GB 6 usr /globaldevices 1 GB 7 usr na 52 MB Table 6.1: Boot Disk Partition Layout 6.4.2.2 Boot Disk Mirroring For the boot disk mirroring, three disks are used. Two are used for mirroring and the third acts as a hot spare drive. The use of a hot spare drive is not necessarily needed. However, it is a good idea to have a third drive which contains an additional set of state data replicas, so using this third disk as a hot spare drive is the obvious procedure. To understand why the third drive is recommended, we must understand how the Solaris Volume Manager works. To determine whether and which state database replicas are valid, the SVM uses a majority consensus algorithm. This algorithm works in the following way: • The system is able to continue operation when at least the half of the state database replicas are available/valid. • The system will issue a kernel panic when less than half of the state database replicas are available/valid. 
• The system cannot boot into multi-user mode when the number of available/valid state database replicas does not constitute a quorum, the quorum being defined as ⌊overall number of state database replicas × 0.5⌋ + 1 [ANON8], page 67.

The SVM requires at least three state database replicas. If these three are distributed among only two disks, the failure of the wrong disk, namely the one which contains two state database replicas, will lead to a system panic. If four state database replicas are distributed evenly among the two disks, the failure of one disk will prevent the system from being rebooted without human intervention. With a third disk and three or six state database replicas distributed evenly among the disks, a single disk failure will not compromise system operation. To recover the system in case the state database replicas cannot constitute a quorum, the system must be booted into single user mode and the unavailable/invalid state database replicas have to be removed, so the available/valid ones can constitute a quorum again.

To mirror the root disk, each of the three disks has to have the same partition layout, since the Solaris Volume Manager will not mirror the whole disk but each partition separately. To be able to mirror a partition, the partitions on the disk first have to be encapsulated in a pseudo RAID 0 volume, also referred to as sub mirror. Thereby, each volume has to be assigned a unique name of the form d<0-127>. Since mirroring RAID 0 volumes creates a new volume which also needs a unique name, the following naming scheme is used to keep track of the various volumes:

• The number of the mirrored volume begins at 10 and is increased by steps of 10 for each additional mirrored volume.
• The first sub mirror which is part of the mirrored volume is assigned the number <number of mirrored volume> + 1.
• The second sub mirror which is part of the mirrored volume is assigned the number <number of mirrored volume> + 2.
• The hot-spare sub mirror which is part of the mirrored volume is assigned the number <number of mirrored volume> + 3.

A special restriction from the Sun Cluster software is that each volume which contains a /globaldevices file system, or on which the /globaldevices file system is mounted (which is the / partition in our case), has to be assigned a cluster wide unique volume name. Tables 6.2 and 6.3 give an overview of the boot disk volumes.

Volume Name   Type     Parts         Mount Point
d10           RAID 1   d11 d12 d13   /
d11           RAID 0   c3t0d0s0      na
d12           RAID 0   c3t1d0s0      na
d13           RAID 0   c3t2d0s0      na
d20           RAID 1   d21 d22 d23   swap
d21           RAID 0   c3t0d0s1      na
d22           RAID 0   c3t1d0s1      na
d23           RAID 0   c3t2d0s1      na
d30           RAID 1   d31 d32 d33   /globaldevices
d31           RAID 0   c3t0d0s6      na
d32           RAID 0   c3t1d0s6      na
d33           RAID 0   c3t2d0s6      na

Table 6.2: Boot Disk Volumes V440 (c = Controller ID, t = SCSI Target ID, d = SCSI LUN, s = Slice)

Volume Name   Type     Parts         Mount Point
d20           RAID 1   d21 d22 d23   swap
d21           RAID 0   c0t0d0s1      na
d22           RAID 0   c4t0d0s1      na
d23           RAID 0   c0t2d0s1      na
d40           RAID 1   d41 d42 d43   /
d41           RAID 0   c0t0d0s0      na
d42           RAID 0   c4t0d0s0      na
d43           RAID 0   c0t2d0s0      na
d60           RAID 1   d61 d62 d63   /globaldevices
d61           RAID 0   c0t0d0s6      na
d62           RAID 0   c4t0d0s6      na
d63           RAID 0   c0t2d0s6      na

Table 6.3: Boot Disk Volumes Enterprise 450
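As an illustration of this naming scheme, the following is a minimal sketch of the Solaris Volume Manager commands that would build the root mirror d10 of table 6.2 on the V440. It is a simplified sketch, not the exact procedure used on the cluster nodes; it assumes that the state database replicas are placed on the small slice 7 of each disk (see table 6.1).

    # create one state database replica on slice 7 of each of the three disks
    metadb -a -f c3t0d0s7
    metadb -a c3t1d0s7 c3t2d0s7

    # encapsulate the root slices in RAID 0 sub mirrors (naming scheme: d10 + 1, + 2, + 3)
    metainit -f d11 1 1 c3t0d0s0
    metainit d12 1 1 c3t1d0s0
    metainit d13 1 1 c3t2d0s0

    # create the RAID 1 volume with one sub mirror and register it as root device
    metainit d10 -m d11
    metaroot d10          # updates /etc/vfstab and /etc/system

    # after a reboot, attach the second sub mirror
    # (how d13 is associated as the hot spare is not shown in this sketch)
    metattach d10 d12

The same pattern is then repeated for the swap and /globaldevices volumes d20 and d30.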
6.4.2.3 Fibre Channel I/O Multipathing

The Sun Cluster software does not provide disk I/O path fail over and therefore this task has to be done on the operating system level. As already mentioned, the hosts are connected to the storage device over two dedicated fibre channel controllers. Each controller can access the same set of shared disks. Since the operating system is, by default, not aware of this fact, it will treat every path to a shared disk as a dedicated device. As figure 6.4 shows, this means that a shared disk can be accessed by two different device names. In order to access a shared disk over a common device name, which uses the two dedicated paths in a fail over configuration, the Solaris MPXIO (Multiplex I/O) function has to be enabled. As figure 6.5 shows, MPXIO replaces the dedicated device names of a shared disk with a virtual device name which is provided by the SCSI Virtual Host Controller Interconnect (VHCI) driver. The VHCI driver provides transparent I/O path fail over between the available physical paths to the disks. In addition to that, the VHCI driver can use the physical I/O paths in an active/active configuration, which can nearly double the I/O throughput rate.

Figure 6.4: Shared Disks Without I/O Multipathing
Figure 6.5: Shared Disks With I/O Multipathing

6.4.2.4 Dependencies on Externally Provided Services

The operating system depends on two externally provided services: DNS for hostname and IP lookups and LDAP for user authentication. Neither service is highly available yet. It is planned to make the DNS server highly available through a cluster solution and to make LDAP highly available through a multi-master replication mechanism provided by the deployed LDAP server software. The use of these external services is needed, since some applications which should be made highly available access these services indirectly over operating system functions. A temporary workaround for these single points of failure would be to keep the needed information locally on the cluster nodes. However, because of the large number of users and hosts, this is not practical.

6.4.3 Shared Disks

The following sections describe the various shared disks which are needed for implementing the sample cluster configuration. As mentioned in chapter 5.4 on page 85, the used 3510 storage array maintains two RAID 5 sets. However, these RAID sets are not directly visible to the attached servers. To let the servers access the space on the RAID sets, each set has to be partitioned and the partitions have to be mapped to SCSI LUNs. In the following, the term shared disk is synonymous with a partition of a 3510 internal RAID set.
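As a brief, hedged illustration of how the MPXIO function from section 6.4.2.3 is typically switched on under Solaris 10 (the device names are made-up examples, not the ones used on the cluster nodes):

    # enable multipathing on all fibre channel HBA ports and reboot;
    # alternatively mpxio-disable="no" can be set in /kernel/drv/fp.conf
    stmsboot -e

    # after the reboot, list the mapping between the old per-path device
    # names and the new virtual scsi_vhci device names, e.g. the two paths
    # c2t1d0 and c3t1d0 collapsing into a single c6t<WWN>d0 device
    stmsboot -L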
6.4.3.1 Sun Cluster Proxy File System The proxy file system is used to store the application configuration files, application state information and some application binaries, which are used by the various application instances which should be made highly available. Although the 3510 provides an acceptable level of availability, it was chosen to mirror additionally two shared disks by software to increase the reliability. Therefore, one shared disk from the 3510 RAID set one and one shared disk from the 3510 RAID set two is used. In a later production environment, the shared disks would of course be provided by different 3510 enclosures but in the test environment only one enclosure is available. The size of the disks are 10 GB since this is sufficient to store all needed data. c Stefan Peinkofer 101 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER 6.4.3.2 SUN Shared QFS For the shared QFS file system, two disks are needed: one large disk, which contains the file system data and one smaller disk, which contains the file system meta data. The size of the meta data disk determines how many files and directories can be created on the file system. The formula for calculating the needed disk size in bytes is as follows: ((number of f iles + number of directories) ∗ 512) + (16384 ∗ number of directories) Since it is very difficult to predict how much files and directories will be created in the future, the space allocated for the meta data was calculated as follows. At the time the home directory data was migrated to the production QFS, the current number of allocated files and directories was determined. Based on this data, the currently needed meta data disk size was calculated and was found to be about 2 GB. This value was multiplied by an estimated growth factor of 5. On the production file system, which the cluster system should overtake someday, the meta data disk is 10 GB large. This value was taken over for the test system. It is worth mentioning that additional data and meta data disks can be added to a QFS later on. So when the used data or meta data runs out of space, additional space can be added easily. Since the deployed SUN QFS version does not support volume manager enclosed disks in a shared QFS configuration5 it is not possible to mirror the disk between two enclosures, with software. Because of this and the fact that providing two 1 TB large shared disks is too expensive for the ZaK, mirroring the QFS disks was abandoned. 6.4.4 Cluster Software The Sun Cluster software can be installed and configured in several ways. It was chosen to manually install the software on all nodes and to configure the cluster over a text based interface since this seems to be the least error prone procedure. During the initial configuration, the following information has to be provided to the cluster system: 5 The deployed version is 4.3. After the completion of the practice part, SUN QFS 4.4 was released, which now supports the use of a volume manager. c Stefan Peinkofer 102 [email protected] 6.4. INITIAL CLUSTER DESIGN AND CONFIGURATION • The physical host names of the cluster nodes. • The network interface names, which should be used for the cluster interconnect. • The global device name of the quorum disk. After the initial configuration, the nodes are rebooted in order to incarnate the cluster for the first time. After this, various additional configuration tasks have to be performed, to implement the cluster design. 
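To make the quorum step more concrete, the following sketch shows how the quorum disk could be registered and verified with the Sun Cluster 3.1 command line tools. Normally this is handled through the interactive scinstall/scsetup dialogues, and the DID device name d4 is only an assumed example.

    # register the shared DID device d4 as quorum tie breaker disk
    scconf -a -q globaldev=d4

    # display the current quorum votes of the nodes and the quorum device
    scstat -q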
6.4.4.1 Cluster Time The initial configuration procedure will automatically create a NTP configuration which will synchronize the time between all cluster nodes. If the cluster should also synchronize to a time server, which was true in our case, the server directive in the NTP configuration file has to be changed from the local clock to the IP address of the time server. 6.4.4.2 IP Multipathing A node in a Sun Cluster system is typically connected to two different types of networks, a single cluster interconnect network and one or more public networks. For simplification reasons, we assume in the following that the cluster nodes are connected to only one public network. Since the cluster nodes are connected to each network over two or more network interfaces, the failure of a single public network interface should cause the cluster not to fail over the resource to another node, but just to fail over the assigned IP addresses to another network interface. Also the failure of a single cluster interconnect interface should not cause a split brain scenario. The way in which this functionality is achieved for the public network is different from the way for the cluster interconnect. On the public network interfaces, the IP addresses are assigned to only one physical interface at a time. If this interface fails, the IP addresses are reassigned to one of the standby interfaces. On the cluster interconnect interfaces, the IP address is assigned to a special virtual network device driver which bundles the available interfaces and uses them in parallel. So if a public network interface fails, clients will experience the c Stefan Peinkofer 103 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER network traffic stopping for a short period of time, whereas the failure of a cluster interconnect interface is completely transparent because the IP is so to speak assigned to all cluster interconnect interfaces at the same time. Before we can discuss the public network interface fail over process in more detail, we should first look at the various IP address and host name types, which are used by the cluster system. • Each cluster node is assigned a node address on the public network. The host name assigned to this address is referred to as public node name. If a public network interface fails, this IP address will be failed over to a standby interface. If the cluster node which is assigned this address fails, this IP will not be failed over to another node. If the cluster nodes are connected to only one public network, the public node name and node address refer to the values which were specified for the physical hostname and IP address at operating system installation time. • Each cluster node is assigned an IP address on the cluster interconnect network. The host name assigned to this address is referred to as private node name. If a cluster node fails, this IP address will not be failed over to another node. • For each resource which provides its services over a public network connection, a dedicated IP address is assigned to one of the public network interfaces of the node the resource currently runs on. The host name assigned to this address is referred to as logical host name. If a public network interface to which such an address is assigned fails, the IP address will be failed over to a standby interface. Also in case of a node failure, this type of addresses will be failed over to another node, together with the resource which uses the IP address. 
• Each interface which is connected to a public network is assigned a so-called test address. This IP address will neither be failed over to another local interface, nor be failed over to another node. c Stefan Peinkofer 104 [email protected] 6.4. INITIAL CLUSTER DESIGN AND CONFIGURATION The Sun Cluster software requires that all IP addresses used on the cluster system are assigned a unique host name. This host name to IP address mapping has to be defined in the local /etc/hosts file and in each name service system the cluster nodes use to do IP address resolutions. The functionality for failing over IP addresses, between the public network interfaces is actually provided by an operating system function called IPMP (IP Multipathing). In contrast to the MPXIO function, which is completely separated from the cluster software, the Sun Cluster software and the IPMP function are closely coupled. This means that Sun Cluster subscribes to the operating systems sysevent notification facility in order to be notified about events concerning the IPMP function. This allows the cluster to react to IPMP events in an appropriate manner. For example, if IPMP detects that all public network interfaces on a node have failed, the cluster system receives the event and will fail over all resources which use IP addresses assigned to the failed public network connection, to another node. To use IPMP on a node, first of all a group of two or more public network interfaces, between which the IP addresses should be failed over, has to be defined. The next step is to assign a test address to each interface in this group. On these IP addresses, a special flag named deprecated is set, which prevents any application but IPMP from using the IP address, since it is not highly available. In the last step, the IP address of the public node name has to be assigned to one of the network interfaces in the IPMP group. Further IP addresses, which should be failed over between the interfaces in the IPMP group, can either be assigned to the same interface or they can be assigned to different interfaces in the group to distribute the network load. The IP addresses of the logical host names must not be assigned to the interfaces of the IPMP groups since the cluster software will do this automatically. Of course, these steps have to be repeated on each cluster node. Although the design and configuration of IPMP seems to be simple and straightforward at first glance, it is found not to be so at further inspection. This is because IPMP behaves uniquely c Stefan Peinkofer 105 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER under some circumstances. To understand this, we need to take a closer look at it. IPMP can detect an interface failure in two ways. The first way is to monitor the network interface driver for link failure events. The second way is to check the network interfaces actively, which is done by sending and receiving ICMP echo requests/replies. Therefore the special test addresses are used. If one of the two failure detection methods indicates a failure, the IP addresses will be failed over to another network interface in the IPMP group. By default, IPMP will contact the IP addresses of the default gateway routers for the probe based failure detection. If no default gateway is specified on the system, it will send an ICMP echo broadcast at start up and then elect some of the hosts which responded to it as ping hosts. 
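A minimal sketch of the IPMP group definition described above, using the /etc/hostname.<interface> files of Solaris 10. The interface names ce0/ce2, the group name ipmp0 and the test host names are illustrative assumptions; the public interfaces and hostnames on the real nodes differ.

    # /etc/hostname.ce0 - public node name plus a deprecated, non-failover test address
    tribble group ipmp0 netmask + broadcast + up
    addif tribble-test0 deprecated -failover netmask + broadcast + up

    # /etc/hostname.ce2 - second group member, carrying only its test address
    tribble-test2 deprecated -failover group ipmp0 netmask + broadcast + up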
As long as one of the ping hosts responds to the ICMP echo requests, the corresponding interface is considered healthy, even if another interface in the same group can reach more ping hosts. If an interface is considered failed, IPMP will set a special flag called fail on all IP addresses which are currently assigned to the failed interface. This flag prevents applications from using these IP addresses to send data. As long as another interface in the IPMP group is considered healthy, this is no problem since no IP address, except for the test addresses, will be assigned to the failed interface. If all interfaces of the IPMP group are considered failed, the applications are not available anymore. Of course, this will trigger a fail over of the resources but under the following circumstances, a fail over won’t help. In a configuration in which only a single not highly available default router exists, the failure of the router would cause all public network interfaces to be considered failed since the router does not respond to the ICMP echo replies anymore. If all cluster nodes use the same, single default router entry, all cluster nodes are affected by this router failure and so the public network IPMP groups of all cluster nodes would be considered failed. This would cause the applications on the cluster to become unavailable, even to clients which can access the cluster directly, without the router. Fortunately IPMP provides a method to specify the ping targets manually, by setting static c Stefan Peinkofer 106 [email protected] 6.4. INITIAL CLUSTER DESIGN AND CONFIGURATION host routes to the desired ping hosts. So with this feature, it can be assured that IPMP will not use a single, not highly available ping target. On our cluster nodes, four of the most important servers for the cluster are manually specified as ping nodes. • Mail server • LDAP server • DNS server • NIS server 6.4.4.3 Shared File System for Application Files To mirror the two disks designated for the Sun Cluster proxy file system that will contain the application configuration files, state information files and binaries, the Solaris Volume Manager is used. In contrast to the local disk mirroring procedure, the mirroring of shared disks is a little bit more complicated. In addition to that, the shared disk volumes are controlled by the Sun Cluster software, which is not the case with local disk volumes. First of all a so-called shared disk set has to be created. During creation of the disk set, the global device names of the shared disks and the physical hostnames of the nodes, which should be able to access the shared disks, have to be specified. A shared disk set can be controlled by only one node at a time. This means that only one node can mount the volumes of a disk set at a time. The node which controls the disk set currently is referred to as disk set owner. If the disk set owner is no longer available, ownership of the disk set is failed over by Sun Cluster to another host which was specified as potential owner at disk set creation time. When the disks are added to the disk set, they are automatically repartitioned in the following way: The first few cylinders are occupied by slice 7, which contains the state database replicas. The rest of the available space is assigned to slice 0. Also, one state database replica is created automatically on c Stefan Peinkofer 107 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER slice 7 of each disk. 
If the disks should not be automatically repartitioned, the partition layout of the disks must meet the requirement that slice 7 begins at cylinder 0 and has sufficient space to contain the state database replica6 . However, for our cluster configuration, this is not required since all application configuration, state information files and binaries should be placed on one slice. After the disk set is created, the mirrored volumes can be created by first encapsulating the appropriate slices in pseudo RAID 0 volumes and then creating RAID 1 volumes which consist of the appropriate sub mirrors. Before the RAID 1 volumes can be used, the two cluster nodes must be configured as so-called mediator hosts of the disk set. While it is easy to understand how the cluster nodes are configured as mediator hosts, this is simply done by calling a single command which takes the name of the disk set and the physical hostnames of the cluster nodes as command line arguments, it is hard to understand why and under which circumstances mediator hosts are needed. The majority consensus algorithm for state database replicas, described in chapter 6.4.2.2 on page 96, is not applicable to shared disk sets. On shared disk sets, the loss of half of the state database replicas would render the system already unusable. In configurations in which a failure of a single component, like a disk enclosure or a disk controller would cause the loss of half of the state database replicas, this component would be a single point of failure, although it is redundant. The Sun Cluster documentation calls such configurations dual disk string configurations, whereby a disk string, in the context of a fibre channel environment, consists of a single controller disk enclosure, its physical disks and the fibre channel connections from the enclosure to the fibre channel switches. Since we use a dual controller enclosure and two disks from two different RAID sets, the failure of a RAID set would cause the described scenario in our configuration. Therefore we must remove this special single point of failure. To remove such special single points of failure in general, the Solaris Volume Manager provides two options: • Provide additional redundancy by having each component threefold, so the failure of a single component causes only the loss of a third of the state database replicas. 6 Usually at least 4 MB. c Stefan Peinkofer 108 [email protected] 6.4. INITIAL CLUSTER DESIGN AND CONFIGURATION • Configure cluster nodes as mediator hosts, which act as additional vote in the case only half of the state database replicas are active/valid. Mediator host configurations must meet the following criteria. Unfortunately, the reason why these rules apply is not documented: • A shared disk set must be configured with exactly two mediator hosts. • Only the two hosts which act as mediator hosts for the specific shared disk set are allowed to be potential owners of the disk set. Therefore only these two hosts can act as cluster proxy file system server for the file systems, contained on the disks within the disk set. These rules do not mean that the number of cluster nodes is limited to two but only that physical access to a particular disk set is limited to two of the cluster nodes. Mediator hosts keep track of the commit count of the state database replicas in a specific shared disk set. Therefore they are able to decide whether a state database replica is valid or not. 
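A minimal sketch of the disk set and mediator configuration described above, using the disk set name dg-global-1 that is introduced later in this section; the DID device names d4 and d7 are assumed examples.

    # create the shared disk set with both cluster nodes as potential owners
    metaset -s dg-global-1 -a -h tribble gagh

    # add the two shared disks; they are repartitioned automatically (slice 7 + slice 0)
    metaset -s dg-global-1 -a /dev/did/rdsk/d4 /dev/did/rdsk/d7

    # configure the two cluster nodes as mediator hosts of the disk set
    metaset -s dg-global-1 -a -m tribble gagh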
Before we can discuss the algorithm, which is used by the Solaris Volume Manager to decide whether access to the disks is granted or not, we must first define two terms. • Replica quorum - It is achieved when more than half of the total number of state database replicas in a shared disk set are accessible/valid. • Mediator quorum - It is achieved when both mediator hosts are running and they both agree on which of the current state database replica commit counts is the valid one.7 The algorithm works as follows. • If the state database replicas constitute replica quorum, the disks within the disk set can be accessed. No mediator host is involved at this time. • If the state database replicas cannot constitute replica quorum, but half of the state database replicas are accessible/valid and the mediator quorum is met, the disks within the disk set can be accessed. 7 [ANON9] c Stefan Peinkofer 109 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER • If the state database replicas cannot constitute replica quorum, but half of the state database replicas are accessible/valid and the mediator hosts cannot constitute mediator quorum but one of the two mediator hosts is available and the commit counts of the state database replicas and the mediator host match, the system will call for human intervention to decide whether access to the disk in the disk set should be granted or not. • In all other cases, access to the disk set is automatically limited to read-only access.8 After the mediator hosts are defined, the proxy file system can be created on the mirrored volumes. This is done by creating an UFS file system as usual and specifying the mount option global in the /etc/vfstab configuration file. The /etc/vfstab file contains information about which disk partition or volume should be mounted to which mount point and which mount options should be applied. According to the Sun Cluster documentation, the shared file systems should be mounted under /global/<disk group name>/<volume name> but actually any mount point which exists on all cluster nodes can be used. The global mount option defines that the Sun Cluster software should enable the proxy file system feature for the specified file system. If the specified block device is a volume of a shared disk set, the disk set owner is automatically also the file system proxy server node. If the current disk set owner leaves the cluster for some reason, the cluster software will automatically fail over the disk set ownership and with it the file system proxy task to another node which is configured to be a potential owner of the particular disk set. Although all cluster members can access the data on a shared proxy file system, there is a performance discrepancy between the file system proxy server node and the file system proxy client nodes. The Sun Cluster software provides the ability to define that the file system proxy server task should be failed over together with a specific resource group so the applications in the resource group get the maximum I/O performance on the shared file system. In a scenario in which application data that is frequently accessed is placed on the proxy file system, such a configuration is highly recommended. 8 [ANON9] c Stefan Peinkofer 110 [email protected] 6.4. 
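Continuing the sketch from above, the mirrored volume and its globally mounted UFS file system could be set up roughly as follows. The DID device names remain assumed examples, while the volume names d100, d101 and d102 and the mount point are the ones used further on in this section.

    # encapsulate slice 0 of each shared disk and mirror the two sub mirrors
    metainit -s dg-global-1 d101 1 1 /dev/did/rdsk/d4s0
    metainit -s dg-global-1 d102 1 1 /dev/did/rdsk/d7s0
    metainit -s dg-global-1 d100 -m d101
    metattach -s dg-global-1 d100 d102

    # create the UFS file system on the mirror
    newfs /dev/md/dg-global-1/rdsk/d100

    # /etc/vfstab entry on both nodes; the global option makes it a proxy file system
    /dev/md/dg-global-1/dsk/d100 /dev/md/dg-global-1/rdsk/d100 /global/dg-global-1/d100 ufs 2 yes global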
INITIAL CLUSTER DESIGN AND CONFIGURATION If more than one resource group requires this feature and the underlying block devices of the proxy file systems are managed by SVM, a shared disk set has to be created for each resource group so that the disk ownership, and with it the file system proxy server tasks for the file systems contained in the disk set, can be failed over independently for each resource group. For the sample cluster system, the choice was made to use only one common proxy file system for all application instances since the only data which is frequently changed are the application log files and the I/O performance of a proxy file system client is considered as sufficient for this task. One shared disk group named dg-global-1 was created, which consists of the two 10 GB volumes, described in chapter 6.4.3.1 on page 101. Since the automatic partition feature was used, the sub mirrors encapsulate slice 0 of the shared disks. The mirror volume is named d100 and the two sub mirrors d101 and d102, according the used naming scheme. The proxy file system is mounted on /global/dg-global-1/d100. 6.4.4.4 Resources and Resource Dependencies This chapter gives a high-level overview of the various resources, resource dependencies and resource groups which were configured for our cluster system. Based on the requirements, the cluster should provide three highly available applications. Each application requires a dedicated IP address and it requires that the global proxy file system is mounted before the application is started. In addition, NFS and Samba require that the meta data server of the shared QFS file system is online on the cluster. The needed resources and resource dependencies are shown in figure 6.6. c Stefan Peinkofer 111 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER ha-user-nfs-home nfs-cl1-rg ha-user-smb-home ha-nfs-appfs smb-cl1-rg ha-nfs ha-smb-appfs ha-smb qfsclusterfs ha-qfs ha-user-radiusd-auth radius-cl1-rg Resource Resource Group ha-radius-appfs ha-radius Resource X depends on that Resource Y runs on the same host Resource X depends on that Resource Y runs somewhere in the cluster Figure 6.6: Resources and Resource Dependencies on the Sun Cluster The names of the vertexes are the resource names, whereby the ha-user-* resources represent the application resources, the *-cl1-rg represent the IP address resources and the ha-*-appfs the resource which ensures that the global proxy file system is mounted. In addition, there is the special resource qfsclusterfs, which represents the meta data server of the QFS shared file system. The green arrows define strong dependencies between the resources, which means the resources have to be started on the same node, whereby the resource an arrow points to has to be started before the resource the arrow starts from. For the blue arrows, the same is true but it defines a weak dependency, which means that the resource must just be started somewhere in the cluster. The bright ellipses indicate that the resources contained in the ellipse form a resource group. c Stefan Peinkofer 112 [email protected] 6.4. INITIAL CLUSTER DESIGN AND CONFIGURATION The default resource group location is as follows: • V440: ha-nfs, ha-qfs, ha-radius • E450: ha-smb This is founded on the following thoughts: The V440 has more than twice the CPU power, that the E450 has. According to the requirements, NFS and Samba should be run on two different nodes. 
Since most of the file serving will be done by NFS, the NFS resource group is placed on the V440. In addition, the QFS resource group is also placed on the V440 since the host, which acts as QFS meta data server, has a slight I/O performance advantage. The Radius resource group is also placed on the V440 because it has more CPU power. However, the CPU power occupied by Radius is marginal, so it could also be placed on the E450. The Enterprise 450 also hosts, in addition to the Samba resource group, the proxy file system server by default since Samba will write the most log file messages to the shared proxy file system. The creation of resource groups and resources is done by a common command, which basically takes as arguments: • whether a resource group or a resource should be created, • the name of the entity to create, • dependencies to one or more other resources or resource groups, • zero or more resource group or resource attributes. In addition, when creating a resource, the resource type and the resource group, in which the resource should be created, have to be specified. The resource group and resource attributes contain values which are either used by the cluster system itself or by the corresponding cluster resource agent. c Stefan Peinkofer 113 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER The common two resource types on which the application resource depend are called LogicalHostname and HAStoragePlus. The LogicalHostname resource is responsible for assigning one or more IP addresses to the appropriate public network interfaces. To create a LogicalHostname resource, a comma separated list of logical hostnames has to be specified when creating the resource. Even if the cluster system is connected to more than one public network it is not necessary to specify the IPMP group to which the IP address should be assigned, since the resource agent will automatically assign it to the IPMP group which is connected to the appropriate public network. The HAStoragePlus resource ensures that one or more file systems are mounted. In addition to that, it provides the feature of failing over the cluster file system proxy server task for the specified file systems onto the cluster node on which the HAStoragePlus resource is started. To create a HAStoragePlus resource, one or more mount points have to be assigned, in a colon separated list, to the resource property FilesystemMountPoints. 6.4.5 Applications In the following sections we will discuss the design and configuration of the deployed applications. The application binaries for NFS and SUN QFS are installed through Solaris packages locally on each host so that rolling upgrades can be performed. Radius and Samba are installed on the cluster proxy file system, since these applications have to be compiled manually and so the overhead of compiling each application twice is avoided. To be able to perform rolling upgrades on the two globally placed applications, a special configuration was applied. On the global proxy file system, which is mounted on /global/dg-global1/d100, two directories were created, one named slocal-production and the other slocal-testing. On both nodes, the directory /usr/slocal is a symbolic link to either slocal-production or slocal-testing. Within the directories, two further directories were created, one named samba-stuff and the other named radius-stuff. To compile and test a new application version, /usr/slocal is linked on one node to slocal-testing. 
Then the application is compiled on this node with the install prefix /usr/slocal/<application>-stuff. After the application is successfully compiled c Stefan Peinkofer 114 [email protected] 6.4. INITIAL CLUSTER DESIGN AND CONFIGURATION and tested, the <application>-stuff directory is copied to the slocal-production directory and the /usr/slocal link on the node is set back to slocal-production. 6.4.5.1 SUN QFS The SUN QFS file system is a high performance file system, which can be used as a stand alone or as an asymmetric shared SAN file system. The SUN QFS cluster agent will make the meta data server service of a shared QFS highly available, by automatically failing over the meta data server task to another cluster node, when needed. Additionally the agent will mount the shared file system automatically on the cluster nodes when they join the cluster. For the use of SUN QFS as cluster file system, several restrictions exist. First of all it is not possible to access the file systems from outside the cluster. Second, the meta data traffic has to travel over the cluster interconnect. Third, although the configuration files must contain the same information on all nodes, all configuration files must be placed locally on the cluster nodes in the directory /etc/opt/SUNWsamfs/. And fourth, all cluster nodes which should be able to mount the file system must be configured as a potential meta data server. In order to create a SUN QFS shared file system, first of all, two configuration files have to be created. The first configuration file, named mcf contains the file system name and the global device names of the shared disks for file system data and meta data. The second file, called hosts.<file system name>, contains a mapping entry for each cluster node, which should be able to mount the file system. Such an entry maps the physical host name to an IP address, which the node will use to send and receive meta data communication messages. As already mentioned, the IP address which has to be specified when QFS is used as a cluster file system is the address of the cluster interconnect interface. In addition to that, this file provides the ability to define that a corresponding node cannot become a meta data server. However, because of the special restrictions for the use of SUN QFS in a cluster environment, this feature must not be used. After the configuration files are created, the file system can be constructed. After that, the file system has to be registered in the /etc/vfstab configuration file. In this c Stefan Peinkofer 115 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER file the QFS file system must be assigned a mount point and the mount option shared, which indicates that the file system is a shared SUN QFS, must be set. Now, the shared QFS can be registered on the cluster software. First a resource group has to be created and, after this, the QFS cluster resource has to be registered within the resource group. During the registration of the QFS resource, the file system mount point has to be specified. After this, the resource group can be brought online for the first time. SUN QFS does not depend on an IP address which will be failed over together with the meta data server, since a special region of the meta data disk contains the information indicating which node currently acts as meta data server. 
So a meta data client which wants to mount the file system looks in this special region to determine which host it has to contact for meta data operations. In case of a meta data server fail over, the change will be announced to the meta data clients so they can establish a new connection to the new meta data server. 6.4.5.2 Radius As Radius server, the open source software Freeradius is deployed. Since no Freeradius Solaris package is available, the program had to be compiled from source and therefore the application binaries are placed on the global proxy file system. Also no cluster agent for Freeradius was available, so a new cluster agent for Freeradius had to be developed. The development of the agent is discussed in chapter 6.5 on page 123. To deploy Freeradius in conjunction with the cluster agent, the application configuration must meet some special requirements. The Freeradius cluster agent allows more than one instance of Freeradius to be run; of course, all of these have to bind to different IP addresses. Therefore, each Freeradius instance needs a dedicated directory on a shared file system, which contains the configuration files, application state information and log files of the instance. The name of this instance directory has to be exactly the same as the cluster resource name of the corresponding Freeradius resource. However, on our cluster, only one instance is needed. The instance directory is named ha-user-radiusd-auth and it’s located in the directory /usr/slocal/radius-stuff/ on the cluster proxy file c Stefan Peinkofer 116 [email protected] 6.4. INITIAL CLUSTER DESIGN AND CONFIGURATION system. Inside this directory, the following directory structure has to be created. 1 etc 2 var 3 var / run 4 var / run / r a d i u s 5 var / log 6 var / log / radius 7 var / log / radius / radacct After that, the default configuration directory raddb, which is located in <specified-install-prefix-at-compile-time>/etc has to be copied to the ha-user-radiusd-auth/etc directory. Now the configuration can be customized to meet the respective needs. The general configuration of Freeradius is not further discussed here. However, some cluster specific configuration changes are needed: • The configuration directive bind_address has to be set to the IP which will be used by the Radius resource group as fail over IP address so the Freeradius instance will only listen for requests on the dedicated IP address • The configuration directive prefix has to be set to the application instance directory, that is /usr/slocal/radius-stuff/ha-user-radiud-auth in our configuration. • The configuration directive exec_prefix has to be set to the installation prefix which was specified at compile time, which is /usr/slocal/radius-stuff in our configuration. • All public node names configured on the cluster must be allowed to access the Radius server, to monitor the service. For this, the node names have to be configured as Radius clients. Since Radius works with shared secret keys to encrypt the password sent between client and server, all these client entries must be given the same shared secret key. In the next step, a local user has to be created on each cluster node, which will be used by the cluster agent to monitor the Freeradius instance. Usually Freeradius will be configured to use c Stefan Peinkofer 117 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER one or more remote password backends, either directly or indirectly over the operating system functions. 
Even if these backends are highly available, it is recommended to use a local user for monitoring the service. This is because in a scenario in which the password backend is not available, the resource would fail consecutively on every cluster node it is failed over to, which would cause Sun Cluster to put the resource group in a maintenance state to prevent further ping-pong fail overs. If a resource group is in this maintenance state, it can only brought online with human intervention again, so even if the password backend becomes available again, the Radius resource group would remain offline. In contrast to a cluster wide failure of the authentication backend, a situation in which one cluster node can access the password backend and the other node can’t is very unlikely, since the only likely failure scenario which could cause such a behavior is the failure of all public network interfaces on a node9 and this will cause a resource fail over anyway. In the last step, a file named monitor-radiusd.conf has to be created in the etc directory of the Radius instance directory. In this file the two following values have to be specified: • RADIUS_SECRET - The shared secret key which should be used by the monitoring function • LOCAL_PASSWORD - The password of the local user which was created for the monitoring function To register the Radius instance on the cluster system, a resource group has to be created. After this a LogicalHostname resource for the IP address and a HAStoragePlus resource have to be created, which ensures that the file system that contains the Radius instance directory has been mounted. After this, the Freeradius resource can be created. To create the Freeradius resource, the following special resource parameters have to be set: • Resource_bin_dir - This is the absolute installation prefix path, with which Freeradius was compiled 9 We assume here that the password backend is connected to the cluster over the public network. c Stefan Peinkofer 118 [email protected] 6.4. INITIAL CLUSTER DESIGN AND CONFIGURATION • Resource_base_dir - This is the absolute path to the Freeradius instance directory • Local_username - This is the user name with which the monitoring function will try to authenticate • Radius_ld_lib_path - This defines the directories which contain shared libraries, used by Freeradius There are several other resource parameters which can be set, but usually don’t have to be because they are set to reasonable values by default. These additional values are further discussed in chapter 6.5 on page 123. In addition to that, it has to be specified that the Radius resource depends on the HAStoragePlus resource. For the LogicalHostname resource, this does not have to be specified since the Sun Cluster software implicitly assumes that all resources in a resource group depend on the IP address resource. After this, the resource group can be brought online for the first time. 6.4.5.3 NFS The application binaries needed by NFS are usually automatically installed during the operating system installation. Configuring NFS as a cluster resource is relatively straightforward. First of all a directory on a shared file system has to be created. On our cluster it is created on the cluster proxy file system under /global/dg-global1/d100/nfs. After this, the resource group has to be created whereby a special resource group property named Pathprefix has to be set to the created directory on the shared storage. 
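Putting these registration steps together, they could look roughly like the following sketch. The resource group name radius-rg, the resource type name ZAK.radius of the self-developed agent, the monitor user name and the library path are illustrative assumptions; the resource names and the mount point are the ones used in this chapter, and the logical hostname is assumed to match the resource name radius-cl1-rg, as it does for the Samba instance.

    # resource group and logical hostname (IP address) resource
    scrgadm -a -g radius-rg
    scrgadm -a -L -g radius-rg -j radius-cl1-rg -l radius-cl1-rg

    # HAStoragePlus resource which ensures the proxy file system is mounted
    scrgadm -a -j ha-radius-appfs -g radius-rg -t SUNW.HAStoragePlus \
        -x FilesystemMountPoints=/global/dg-global-1/d100

    # Freeradius resource with the special resource parameters described above
    scrgadm -a -j ha-user-radiusd-auth -g radius-rg -t ZAK.radius \
        -y Resource_dependencies=ha-radius-appfs \
        -x Resource_bin_dir=/usr/slocal/radius-stuff \
        -x Resource_base_dir=/usr/slocal/radius-stuff/ha-user-radiusd-auth \
        -x Local_username=radmon \
        -x Radius_ld_lib_path=/usr/slocal/radius-stuff/lib

    # bring the resource group online for the first time
    scswitch -Z -g radius-rg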
The NFS resource requires that hostname and RPC (Remote Procedure Call) lookups are performed first on the local files before the operating system tries to contact an external backend like DNS, NIS or LDAP. Therefore, the name service switch configuration, which is located in the file /etc/nsswitch.conf, has to be adopted. The directive hosts: has to be set to: cluster files [SUCCESS=return] <external services> and the directive rpc: has to be set to: files <external services>. c Stefan Peinkofer 119 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER The statement [SUCCESS=return] defines that no external services should be queried if the corresponding entry is found in the local files. This statement is only needed for the hosts: directive, since it is already the default setting for the rpc: directive. The next step is to create a directory named SUNW.nfs within the directory which was specified as Pathprefix during resource group creation. Within the SUNW.nfs directory a file named dfstab.<resource name> has to be created, whereby <resource name> is the name which will be assigned to the NFS resource. On our cluster, the file is named dfstab.ha-user-nfs-home. The dfstab file contains the configuration of which directories are to be shared with which hosts. For the share configuration, the following special restrictions apply: • The hostnames of the cluster interconnect interfaces must not have access to the NFS service. • All hostnames which are assigned to public network interfaces of the cluster must have read/write access to the NFS service. Also, it turned out that these hostnames must be specified twice, once with the full qualified domain name and once with the bare hostname. After this, the LogicalHostname and the HAStoragePlus resource can be created within the resource group. The last step is to create the NFS resource, whereby only the dependencies to the HAStoragePlus and the QFS resource have to be specified during creation. It is worth mentioning that the NFS resource uses the SUNW.nfs directory not only for the dfstab configuration file, but also for state information, which enables the NFS program suite to perform NFS lock recovery in case of a resource fail over. The core NFS program suite consists of three daemons, nfsd, lockd and statd, whereby nfsd is responsible for file serving, lockd is responsible for translating NFS locks, acquired by clients into local file system locks on the server and statd keeps track of which clients have currently locked files. If a client locks a file, statd creates a file under SUNW.nfs/statmon/sm which is named like the client hostname which acquired the lock. If the NFS service is restarted, statd looks c Stefan Peinkofer 120 [email protected] 6.4. INITIAL CLUSTER DESIGN AND CONFIGURATION in the SUNW.nfs/statmon/sm directory and notifies each hostname for which a file was created in the directory to re-establish all locks the client held prior to the server restart. 6.4.5.4 Samba Like Radius, Samba, the windows file serving application for UNIX, has to be compiled from source and therefore the application binaries are placed on the cluster proxy file system. Since the Samba cluster agent provides the ability to run multiple Samba instances on the cluster, each instance requires a dedicated directory on a shared file system to store configuration files, application state information and log files. The names of these directories can be chosen freely. 
For our cluster, it was chosen to use the NetBIOS name of the Samba instance (SMB-CL1-RG) as instance directory name. The directory was created under /usr/slocal/samba-stuff/SMB-CL1-RG. Within the instance directory, the following subdirectory structure has to be created. 1 lib 2 logs 3 netlogon 4 private 5 shares 6 var 7 var / locks 8 var / log After this, the Samba configuration file smb.conf has to be created in the lib directory of the instance directory. The general configuration of Samba is not further discussed here, but again some cluster specific configuration settings have to be applied which are listed as follows: • interface - Must be set to the IP address or hostname of the dedicated IP address for the Samba resource group. • bind interfaces only - Must be set to true so that smbd and nmbd, the core daemons of the samba package, only bind to the IP address specified by the interface directive. c Stefan Peinkofer 121 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER • netbios name - Must be set to the NetBIOS name of the dedicated IP address, specified by the interface directive. • log file - Specifies the absolute path to the samba.log file, which should be located under <instance-directory>/var/log/samba.log • lock directory - Specifies the absolute path to the lock directory, which should be located under <instance-directory>/var/locks • pid directory - Specifies the absolute path to the pid directory, which should be located under <instance-directory>/var/locks • private dir - Specifies the absolute path to the Samba private directory, which should be located under <instance-directory>/private After this, a local user has to be created, which will be used by the monitor function of the cluster agent to test Samba. This user has to be created as a UNIX account and as a Samba account. Also a subdirectory has to be created within one of the directories, which will be shared by Samba. Ownership of this subdirectory must be set to the newly created monitor user. In the next step, the Samba resource group, the LogicalHostname and HAStoragePlus resources have to be created. After that, a special configuration file, used by the Samba resource agent, has to be created. In this configuration file, the following information has to be provided: • RS - The name of the Samba application resource which should be created. • RG - The name of the resource group in which the Samba application resource should be created. • SMB_BIN - The absolute path to the Samba bin directory. • SMB_SBIN - The absolute path to the Samba sbin directory. • SMB_INST - The absolute path to the Samba instance directory. • SMB_LOG - The absolute path to the Samba instance log directory. c Stefan Peinkofer 122 [email protected] 6.5. DEVELOPMENT OF A CLUSTER AGENT FOR FREERADIUS • SMB_LIB_PATH - A list of directories which contain shared libraries, used by Samba. • FMUSER - The username of the local user which was created for the monitor function. • FMPASS - The password of the monitor user. • RUN_NMBD - Specification of whether the Samba resource uses the NetBIOS daemon nmbd or not. • LH - Specification of the IP address or hostname which was configured by the interface directive in the smb.conf file. • HAS_RS - Specification of the resources on which the Samba resource depends. 
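A hedged example of what such an agent configuration file could contain, written as shell style variable assignments. The resource group name, library path, monitor user and password are illustrative assumptions; the remaining values follow from the configuration described above. The exact file name and format expected by the agent's registration program are not reproduced here.

    RS=ha-user-smb-home                           # Samba application resource to create
    RG=smb-rg                                     # assumed resource group name
    SMB_BIN=/usr/slocal/samba-stuff/bin
    SMB_SBIN=/usr/slocal/samba-stuff/sbin
    SMB_INST=/usr/slocal/samba-stuff/SMB-CL1-RG
    SMB_LOG=/usr/slocal/samba-stuff/SMB-CL1-RG/var/log
    SMB_LIB_PATH=/usr/slocal/samba-stuff/lib      # assumed shared library location
    FMUSER=smbmon                                 # assumed local monitor user
    FMPASS=secret                                 # password of the monitor user
    RUN_NMBD=true
    LH=smb-cl1-rg                                 # logical hostname from the interface directive
    HAS_RS=ha-smb-appfs                           # resource the Samba resource depends on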
The last step is to call a special program, provided by the Samba cluster agent, which will register the Samba resource on the cluster, based on the information in the cluster agent configuration file. 6.5 Development of a Cluster Agent for Freeradius In the following sections we will look at the development of a cluster agent for the Freeradius application. The Sun Cluster software provides various ways and extensive APIs to implement a cluster agent. To discuss all of them would go beyond the scope of this thesis and therefore we will look only at the particular topics which were necessary to build the Freeradius cluster agent. Before we can discuss the concrete implementation of the agent, we must first look at how a cluster agent interacts with the cluster software. 6.5.1 Sun Cluster Resource Agent Callback Model The Sun Cluster software defines a fixed set of callback methods, which will be executed by the cluster software under well defined circumstances. The cluster software also defines which tasks the individual callback methods require the cluster agent to do, which arguments are provided to the cluster agent and which return values are expected from the cluster agent. To implement a c Stefan Peinkofer 123 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER cluster agent, a dedicated callback function program has to be written for each callback method, whereby a cluster agent is not required to implement all defined callback methods. A cluster agent for Sun Cluster does not consist of a single executable but of various executables which implement a specific callback function. The callback functions can either be implemented in C programs or in executable shell scripts. To define which callback function the cluster software should call for carrying out a particular callback method, a so-called Resource Type Registration (RTR) file has to be created, which must contain among other things a mapping between callback method and callback function. In the following section, we will look briefly at the defined callback methods. • Prenet_start - This method is called before the LogicalHostname resources in the same resource group are started. This can be used to implement special start-up tasks which have to be carried out before the IP addresses are configured. • Start - This method is called when the cluster software wants to start the resource. This function must implement the appropriate procedure to start the application and it must only return successfully if the application was successfully started. • Stop - This method is called when the cluster software wants to stop a resource. This function must implement the appropriate procedure to stop the application and must only return successfully if the application was successfully stopped. • Postnet_stop - This method is called after the LogicalHostname resource in the same resource group is stopped. This can be used to implement special stop tasks which have to be carried out after the IP addresses are unconfigured. • Monitor_start - This method is called when the cluster software wants to start the resource monitoring. This function must start the monitor program for the particular application and must only return successfully if it succeeds in starting the resource monitoring program. c Stefan Peinkofer 124 [email protected] 6.5. DEVELOPMENT OF A CLUSTER AGENT FOR FREERADIUS • Monitor_stop - This method is called when the cluster software wants to stop the resource monitoring. 
This function must stop the monitor program and must only return successfully if the monitoring program is stopped. • Monitor_check - This method is called when the cluster software wants to determine whether the resource is runnable on a particular hosts. This function must perform the needed steps to predict whether the resource will be runnable on the node or not. • Validate - This method is called on any hosts which is configured to be able to run the resource, when: – a resource of the corresponding type is created – resource properties of a resource of the corresponding type are changed – resource group properties of a group which contains a resource of the corresponding type are updated. Since the function is called before the particular action is carried out, this function is not used to test the new configuration but to do a basic sanity check of the environment on the nodes. • Update - This method is called by the cluster software to notify a resource agent when resource, resource group or resource type properties are changed. This function should implement the appropriate steps to reinitialize the resource with the new properties. • Init - The cluster software will call this function on all nodes which are potentially able to run the resource, when the resource is set to the managed state by the administrator. The managed state defines that the resource is controlled by the cluster software, which means for example that the resource can be brought online by an administrative command. It also means that the cluster software will automatically bring the resource online on the next node which joins the cluster. This function can be used to perform initialization tasks which have to be carried out when the resource becomes managed. c Stefan Peinkofer 125 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER • Fini - The cluster software will call this function on all nodes which are configured to be able to run the resource, when the resource is set to the unmanaged state by the administrator. This function can be used to perform clean up tasks which have to be carried out before a resource becomes unmanaged. • Boot - If the resource is in managed state, the cluster software will call this function on a node which is configured to be able to run the resource, when the node joins the cluster. This function can be used to perform initialization task which have to be carried out when a node joins the cluster. The Sun Cluster software requires that the callback functions for Stop, Monitor_stop, Init, Fini, Boot and Update are idempotent. Except for the Start and Stop methods, which must be implemented by the cluster agent, all other methods are optional. 6.5.2 Sun Cluster Resource Monitoring As we saw in the previous chapter, Sun Cluster defines no direct callback method for resource monitoring, i.e. it does not call the monitoring function directly and evaluates the return value of the function to determine whether the resource is healthy or not. Instead it defines two callback methods to start and stop the monitoring. This means that a cluster agent, which should perform resource monitoring, must implement a Probe function, which is started and stopped by the two callback methods and which continuously monitors the application in the configured interval. In addition to that, the Probe function must be able to initiate the appropriate actions when the probe failed, i.e. 
it must first decide whether the application should be restarted or failed over and second it must trigger the appropriate action by itself. 6.5.3 Sun Cluster Resource Agent Properties The Sun Cluster software defines a set of resource type properties and resource properties which are used to specify the configuration of a cluster agent. The values or default values respectively for the properties are specified in the Resource Type Registration file of the cluster c Stefan Peinkofer 126 [email protected] 6.5. DEVELOPMENT OF A CLUSTER AGENT FOR FREERADIUS agent. Resource type properties specify general attributes which are common for all resources of the specific type. Resource properties specify attributes which can be different for each resource of the specific type. In addition to the predefined set of resource properties, the cluster agent developer can define additional resource properties which contain special configuration attributes for the agent. In the following two sections we will look at some important resource type properties and resource properties. 6.5.3.1 Resource Type Properties The most important resource type properties are the “callback method to callback function“ mapping properties, which were already discussed in chapter 6.5.1 on page 123. Besides that, the following important resource type properties exist: • Failover - Defines whether the resource type is a fail over or a scalable resource. A fail over resource cannot be simultaneously online on multiple nodes, whereas a scalable resource can. Scalable resources are typically deployed when Sun Cluster is used as a Load Balancing or High Performance Computing cluster. • Resource_type - Defines the name of the resource type. • RT_basedir - Defines the absolute path to the directory, to which the resource agent is installed. • RT_version - Defines the program version of the cluster agent. • Single_instance - If this property is set to TRUE, only one resource of this type can be created on the cluster. • Vendor_ID - Defines the name of the organization which created the cluster agent. The syntax for defining resource type properties in the RTR is as follows: <property-name> = <value>; c Stefan Peinkofer 127 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER 6.5.3.2 Resource Properties • <Callback Method>_timeout - Defines the time in seconds the <Callback Method> is allowed run until the cluster considers the execution of the corresponding callback function failed. • Resource_dependencies - Takes a comma separated list of resources in the same or in another resource group, on which the resource depends. • Resource_name - The name of the resource. This value is specified when a new resource is created. • Retry_count - The number of times the Probe function should try to restart the resource before it triggers a fail over. • Retry_interval - This defines the time span, beginning with the first restart attempt, after which the restart retry counter will be reset. • Thorough_probe_interval - Defines the time interval in seconds which should elapse between two resource monitor sequence invocations. In contrast to the resource type properties, a resource property is defined by one or more resource property attributes. The most important resource property attributes are: • Default - The default value for the resource property • Min - The minimum allowed value for a resource property of the data type Integer. 
• Max - The maximum allowed value for a resource property of the data type Integer. • Minlength - The minimum allowed length of a resource property of the data type String or Stringarray. • Maxlength - The maximum allowed length of a resource property of the data type String or Stringarray. c Stefan Peinkofer 128 [email protected] 6.5. DEVELOPMENT OF A CLUSTER AGENT FOR FREERADIUS • Tuneable - This attribute specifies under which circumstances the administrator is allowed to change the value of the resource property. Legal values are: – NONE - The value can never be changed. – ANYTIME - The value can be changed at any time. – AT_CREATION - The value can only be changed when a resource is created. – WHEN_DISABLED - The value can only be changed when the resource is in disabled state. To define custom resource properties, the special resource property attribute Extension and one of the following resource property attributes which define the data type of the custom resource property have to be specified: • Boolean • Integer • Enum • String • Stringarray The syntax for defining resource properties in the RTR is as follows: { PROPERTY = <property name>; <resource property attribute>; | <resource property attribute> = <attribute value>; ... <resource property attribute>; | <resource property attribute> = <attribute value> } c Stefan Peinkofer 129 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER 6.5.4 The Sun Cluster Process Management Facility All processes which will be started by the callback functions should run under the control of the Process Management Facility (PMF). The PMF continuously checks to determine whether the application process or at least one of its child processes is alive. If not, it restarts the application. To start an application instance under the control of PMF, a special command has to be called to which the command to start the application has to be assigned as an argument. To identify an application instance which was “created“ under the control of PMF, a unique identifier tag has to be specified as argument, when calling the PMF to start an application. Since it is not desirable for PMF to restart an application indefinitely, the resource property values of Retry_count and Retry_interval are also specified as arguments. Besides the process control, PMF provides some other functions to the callback functions. For example a callback function can send a signal to all processes of an application instance by calling the PMF, specifying the identification tag of the application instance and the signal to send to the processes. 6.5.5 Creating the Cluster Agent Framework Creating a comprehensive cluster agent from scratch is very complex and time consuming because various callback functions have to be implemented and a comprehensive understanding of how Sun Cluster requires a cluster agent to be written is needed. Fortunately, the Cluster Software provides a Graphical User Interface, with which a cluster agent can be created. This wizard allows even a person with virtually no experience in programming to create a cluster agent in two steps. In the first step, the values for the resource type properties Vendor_ID, Resource_type, RT_version and Failover have to be specified and the user has to choose whether the agent programs will be “written“ as a C or a Kornshell program. 
In the second step, the commands to start and stop the applications and an optional command which will carry out the application health check have to be specified. The only requirement for these commands is that they return 0 if they are successful and a value other than 0 if they are not. c Stefan Peinkofer 130 [email protected] 6.5. DEVELOPMENT OF A CLUSTER AGENT FOR FREERADIUS In addition to that, for each of the three callback methods, a default timeout has to be specified, which is assigned as the default of the corresponding <Callback Method>_timeout resource property. After that, the wizard will create the needed source and configuration files, compile the sources if necessary and create a Solaris installation package. Although the creation of a cluster agent by using the wizard is very easy, it has one major drawback. The wizard provides no facility to pass any resource or resource type properties to the commands for starting, stopping and checking the applications. This means: • If the agent should be deployed on another cluster, it is required that the commands be installed to the same path to which they were installed on the original system. • Only one resource of this type can be created on the cluster because the location of the instance directory which contains the configuration, log and application state information files is “hard coded“. However, these restrictions do not render the wizard useless, since the created source files can be used as a framework which can be manually adapted to the actual requirements. 6.5.6 Modifying the Cluster Agent Framework One primary goal for the development of the Freeradius cluster agent was that it should be reusable on other cluster systems and that it should provide the ability to deploy more than one Freeradius resource on one cluster. Therefore, the cluster agent creation wizard was used to create the needed source files, which were manually extended by the needed functionality to make it freely configurable. To do so, the following callback functions had to be adopted: • Start • Validate In addition to that, the Probe function had to be adopted, which is responsible for calling the health check program in regular intervals to determine whether the resource is healthy and if c Stefan Peinkofer 131 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER this is not the case, to react in the appropriate manner. For the Start callback function, the following resource extension properties were defined in the RTR file: • Radiusd_bin_dir - This value defines the absolute path to the directory which contains the Radius application binary. • Resource_base_dir - This value defines the absolute path to the directory which will contain the instance directory of the Radius resource. • Radiusd_ld_lib_path - This value defines a list of directories which contain shared libraries, used by the Radius application. The functional extension of the Start callback function is that it determines the values of the three resource extension properties and uses them to assemble the start command. Instead of calling: /usr/slocal/radius-stuff/bin/radiusd -d \ /usr/slocal/radius-stuff/etc/raddb which would start the Freeradius application and tell it that the configuration files are found in the directory which was specified after the -d parameter, it will now call: <Radiusd_bin_dir>/radiusd -d \ <Resource_base_dir>/<Resource_name>/etc/raddb. 
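In the RTR file, these extension properties are declared with the resource property syntax shown in section 6.5.3.2. The following is a hedged sketch for two of the properties; the default values and the tunability setting are chosen for illustration only, and the exact keyword spellings should be compared against an existing Sun supplied RTR file.

{
        PROPERTY = Radiusd_bin_dir;
        EXTENSION;
        STRING;
        DEFAULT = "/usr/slocal/radius-stuff/bin";
        TUNABLE = AT_CREATION;
}

{
        PROPERTY = Resource_base_dir;
        EXTENSION;
        STRING;
        DEFAULT = "/usr/slocal/radius-stuff";
        TUNABLE = AT_CREATION;
}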
The path specified by <Resource_base_dir> was not directly used as instance directory, since this property is assigned a default value and there is no way to force a value assigned to an extension property to be unique throughout all resources of the same type. So when more than one Freeradius resource is created on the cluster, the creator could forget to specify a different value for the property and, therefore, both resources would use the same instance directory which could lead to random side effects. Therefore it was chosen to force the resource creator to create a unique instance directory by using the resource name, which has to c Stefan Peinkofer 132 [email protected] 6.5. DEVELOPMENT OF A CLUSTER AGENT FOR FREERADIUS be unique throughout the cluster as the name for the instance directory. The directories specified by <Radiusd_ld_lib_path> are passed as command line argument to the PMF call which executes the start command. This causes the PMF to assign the directories to the environment variable LD_LIBRARY_PATH in the environment in which PMF will call the start command so the dynamic linker will also include these directories to search for shared libraries. The Validate function, created by the cluster agent creation wizards, checks to determine whether the application start command exists and is an executable file. The function was extended like the Start callback function, and instead of checking whether the file: /usr/slocal/radius-stuff/bin/radius exists and is executable, it checks the file which is specified by: <Radiusd_bin_dir>/radiusd The other checks of the Validate function need not be adapted. For the Probe function, the following resource extension properties were defined: • Probe_timeout - Defines the time in seconds the health check program is allowed to run until the Probe function considers the execution of the health check program failed. This extension property is actually defined by the cluster agent creation wizard. • Radius_port - The TCP port on which the Radius daemon listens for incoming requests. • Radius_dictionary - This allows the user to specify a path to an alternate Radius Dictionary file, which the health check program should use to communicate with the Radius daemon. • Login_attempts - This defines how many times the health check program tries to authenticate against Radius before it considers the Radius instance unhealthy. c Stefan Peinkofer 133 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER • Local_username - This defines the username the health check function will use to authenticate against Radius. • Radius_secrets_file - This defines the absolute path to a file which contains the password of the Local_username and the Radius secret which will be used to encrypt the password before it is sent to the Radius daemon. It was chosen to place this information in a external file, to which only privileged users have access, rather than put it in the cluster configuration since the resource properties can also be read by unprivileged users. • RFC_user_password - This defines whether the health check program should use User-Password, which is suggested by the Radius RFC, or Password, which is currently expected by the Freeradius application, as “password command“ in the Radius network protocol. • Probe_debug - This defines whether the health check program should do extensive logging or not. 
• SCDS_syslog - This defines the absolute path to the Sun Cluster Data Service syslog program the health check application will use to submit log messages. Except for the Probe_timeout property, the values of these properties are not used by the Probe function directly but are passed to the program which carries out the actual application health check, which is discussed in the next section. In addition to these values, the hostname of the cluster node the resource is currently running on is passed to the health check program, too. The complete source of the Radius cluster agent can be found on the CD-ROM which is delivered along with this document. 6.5.7 Radius Health Checking For the health check of the Freeradius application, it was chosen to perform a Radius authentication of a local user. Although the Freeradius program suite provides a Radius client application, c Stefan Peinkofer 134 [email protected] 6.6. USING SUN QFS AS HIGHLY AVAILABLE SAN FILE SYSTEM this application cannot be used as a health check application because it reports failures only by printing an error message to stderr but not by setting the exit code to a value other than 0. Another health check program was needed because of this fact. The health check program used for the Radius resource agent is an adopted version of a monitoring script provided by the Open Source service monitoring tool mon. The check program is written in PERL by James Fitz Gibbon and it is based upon Brian Moore’s Radius monitor script, posted to the mon mailing list. The program was adopted to meet the special requirements of the Freeradius daemon and the requirements of the Sun Cluster environment. 6.6 Using SUN QFS as Highly Available SAN File System Although Sun supports the deployment of a shared SUN QFS file system inside a cluster as a cluster file system, Sun does not support the deployment of it as a highly available SAN file system, which would allow computers from outside the cluster to access the file system as meta data clients. For the ZaK there are two main reasons why the use of a shared SUN QFS as highly available SAN file system, in addition to the use as cluster file system, is desirable: 1. Ability to do LAN-less backup. Doing a full backup of one TB data over the local area network cannot be finished within a adequate time span. Since the backup system of the ZaK is also connected to the storage area network, the obvious solution is to back up the data directly over the SAN. Therefore, the backup system, which cannot be part of the file serving cluster because it is a dedicated cluster, managed by an external company, must be able to mount the home directory file system. 2. Increased I/O performance. Some services, which run on servers outside the cluster, have currently mounted the home directories over NFS. If these servers could mount the home directory file system natively as shared file system meta data clients, they would benefit from the increased I/O performance. c Stefan Peinkofer 135 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER Unfortunately, using a shared SUN QFS as cluster file system and highly available SAN file system is not only unsupported but also is not possible without applying special workarounds. For this, basically three challenges have to be accomplished. In the following chapters we will discuss these challenges and the possible ways to accomplish them. 
6.6.1 Challenge 1: SCSI Reservations 6.6.1.1 Problem Description The Sun Cluster software uses SCSI reservations to fence failed nodes from the shared storage devices. Which reservation method, SCSI-2 or SCSI-3, the cluster software uses is automatically determined by the cluster software in the following way: For each shared disk, which is connected to exactly two cluster nodes, SCSI-2 reservations are used. For shared disks which are connected to more than two cluster nodes, SCSI-3 persistent group reservations are used. This behavior is “hard wired“ and cannot be overwritten by a configuration parameter. A shared QFS meta data client needs at least read/write access to the shared disk(s) which contain(s) the file system data and read access to the shared disk(s) which contain(s) the file system meta data. The read access to the meta data disks is needed to read the file system super block that contains the information regarding which host currently acts as meta data server. Our cluster system consists of two nodes. This implies that the cluster software uses SCSI-2 reservations for fencing. As long as both cluster nodes are up, servers outside of the cluster can access the shared disk, since the disks are not reserved. If one cluster node goes down, the remaining node will reserve all shared disks, so the servers outside of the cluster cannot access the file system anymore. 6.6.1.2 Possible Solutions For the SCSI reservation problem, three possible solutions were found. Solution one is relatively straightforward. The Sun Cluster software allows the administrator to set a special flag called LocalOnly on a shared disk. This flag causes the cluster software to exclude the c Stefan Peinkofer 136 [email protected] 6.6. USING SUN QFS AS HIGHLY AVAILABLE SAN FILE SYSTEM disk from the fencing operation. If all disks which are used by the shared QFS are marked as LocalOnly, the servers outside of the cluster will be able to access the shared file system even if only one cluster node is up. However, this course is potentially dangerous and may lead to a corruption of the file system. A shared QFS does not require that the data and meta data disks be fenced in case a meta data client which has mounted the file system fails. However, it requires that if the server, which acted as the file system meta data server, fails, it be fenced off the meta data disk before another server can take over the meta data server task. If the shared QFS file system is deployed outside of a cluster, this is done by human intervention and if it is deployed inside a cluster it is done by the fencing mechanism. So the discussed solution cannot eliminate the possibility that a failed cluster node, which acted as meta data server, accesses the meta data disks, after the task was taken over by another cluster member. The second and third solutions are a little bit more complex than the first one. In addition they require at least a three-node cluster, since they rely on SCSI-3 persistent group reservations. To understand them, we have to discuss how Sun Cluster uses SCSI-3 persistent group reservations for shared disk fencing. The principles of SCSI-3 persistent group reservations were already discussed in chapter 3.2.7 on page 28. The Sun Cluster software uses the described WRITE EXCLUSIVE / REGISTRANTS ONLY reservation type. This allows any server which is attached to the disk to access the shared disk on a read-only basis and it allows write access only to those servers which are registered on the shared disk. 
Registering means that a node puts a unique 8-byte key on a special area on the disk, by issuing a special SCSI command. The key is created by the Sun Cluster software as follows: The first 4 bytes contain the cluster ID, which was created by the cluster software during the first time configuration process. The next 3 bytes are zero and the last byte contains the node ID of the corresponding cluster node. The node ID is a number between 1 and 64 and indicates the sequence in which the cluster nodes were installed. To fence a failed cluster node from the disks, the cluster software on the remaining cluster nodes computes the registration of the failed node and removes it10 from the shared disks by a special SCSI-3 command. If the failed node joins the cluster again, the cluster 10 And with it the reservation, if held by the node. c Stefan Peinkofer 137 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER software on the node places its registration key on the shared disks again. Solution two is basically the same idea of solution one, applied to a SCSI-3 persistent group reservation environment. As already said the WRITE EXCLUSIVE / REGISTRANTS ONLY prevents only write access to a shared disk from unregistered servers, and servers from outside the cluster must only have read-only access to the meta data disk. So the LocalOnly flag has only been applied to the shared disk(s) which contain(s) the file system data. In this configuration, it is assured that a failed meta data server node is fenced off the meta data disks and since the shared QFS does not require that a server is fenced off the file system data disks, file system consistency is ensured. With this solution, a virtually unlimited number of cluster external servers can access the file system and in addition to that, the servers can run any operating system which is supported by QFS11 . Although solution two seems to be sufficient, there is one imponderability. Although a QFS meta data client should not need write access to the file system meta data disks in theory, it is nowhere explicitly documented that it does not need write access in practice, too. So it cannot be ruled out that a QFS meta data client will try to write to the shared meta data disks for some special reason. The third solution goes a completely different way; instead of excluding disks from the SCSI-3 persistent group reservation, it includes the servers outside of the cluster in the SCSI-3 persistent group reservation. Therefore, a small application is needed, which registers the external server to the shared disks, used by SUN QFS. This application must be executed on any external host which should access the file system, since SCSI-3 registrations can only be removed by another node but not added for another node. Since SCSI-3 reservations are persistent, which means they survive power cycles of servers and storage and the Sun Cluster software will only remove the keys of failed cluster members from the shared disks to fence a cluster node off the shared disks, this step has to carried out only once when a new server is added to the shared QFS file system. To eliminate the imponderability that a reservation is lost for some reasons, the registration application could also be called every time before the 11 Which is currently only Solaris and a few Linux distributions. c Stefan Peinkofer 138 [email protected] 6.6. 
USING SUN QFS AS HIGHLY AVAILABLE SAN FILE SYSTEM shared QFS file system is mounted on the node since multiple reservations of a server are not possible. Unfortunately no freely available application which is able to place SCSI-3 persistent group reservations on shared disks exists. Although such an application is delivered with the Sun Cluster software, it cannot be used since it is tightly integrated with the cluster software and works only on a cluster node which is a member of a quorum cluster partition. Fortunately, Solaris provides a well documented programming interface named multihost disk control interface. By using this interface, such an application can be created easily. With this solution, the servers outside of the cluster have full read/write access to all shared disks used by QFS. This simulates exactly the conditions which exist on a shared SUN QFS which is deployed outside of a cluster. In addition to that, the fencing mechanism of the cluster is not impacted since all shared disks are included in the fencing operation. However, the overall count of servers which can access the file system is limited to 64 because SCSI-3 persistent group reservations can handle only 64 registrants. In addition to that, the application which registers a server can only be used on Solaris, since the used programming interface is not available on other operating systems. If operating systems other than Solaris should be able to access the file system, a new SCSI reservation application has to be found or written.12 6.6.2 Challenge 2: Meta Data Communications 6.6.2.1 Problem Description As described in section 6.4.5.1 on page 115, the meta data communication between cluster nodes has to travel over the cluster interconnect network. This restriction is in effect because of the following reason: The QFS resource agent makes only the QFS meta data server service highly available and therefore only monitors the function of the meta data server. What is left completely unaddressed by the QFS resource is the surveillance of whether a meta data client 12 Since QFS currently only supports Solaris and Linux, I did an Internet search for a Linux version of such an application. What I found was the sg3_utils, a set of applications which use functions provided by the Linux SCSI Generic Driver. Unfortunately I cannot say whether these tools work since during my tests, I had no success in placing a SCSI-3 reservation on a shared disk. But this may be because I used the tool in the wrong way. c Stefan Peinkofer 139 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER is able to communicate with the meta data server or whether the meta data server is able to communicate with all meta data clients in the cluster. This functionality is implicitly achieved by using the cluster interconnect as meta data interconnect, since all members of the quorum cluster partition are always able to communicate with each other over the cluster interconnect. If the meta data communication travelled over network interfaces other than the cluster interconnect interfaces, the failure of all these interfaces on a cluster node would either prevent the node from accessing the file system or prevent all other nodes from accessing the file system, depending upon whether the node was a meta data client or meta data server. 
Of course, if a node can no longer access the file system, the resources which depend on the file system are failed over to another node, because the monitor functions of these resources fail. If the interfaces fail on the meta data server, all services which depend on the file system are failed over to the meta data server, since it is the only node that is still able to access the file system. In a two-node cluster this behavior is in principle no problem, but we should keep in mind that the resources are failed over to the very node which caused the problem, so this behavior is not desirable. In a 2+N node cluster, or if external servers should be able to access the file system, this behavior is clearly unacceptable, because the meta data server task should instead be failed over to another node, so that all nodes except the one which caused the problem can access the file system again. Since dedicated physical network interfaces cannot be used for meta data communication between the cluster nodes, the obvious solution would be to connect the external servers to the cluster interconnect network. However, this is not possible either, since the Sun Cluster software requires that only cluster nodes are connected to the cluster interconnect network.

6.6.2.2 Possible Solutions

As we have seen, the fundamental problem with meta data communication is that cluster nodes must use the cluster interconnect for sending and receiving meta data messages, while cluster external hosts must not use the cluster interconnect for this. To get around this restriction, the host which currently acts as meta data server should be able to send and receive meta data communication over more than one IP address. This would make it possible to use the cluster interconnect network for meta data messages to and from cluster nodes, and a public network for meta data messages to and from cluster external nodes. Fortunately, SUN QFS provides such a feature: in the hosts.<file system name> configuration file, a comma separated list of IP addresses can simply be mapped to the physical hostname of each node. SUN recommends for cluster external shared QFS file systems that meta data messages be sent over a dedicated network. To provide highly available meta data communication between the potential meta data servers within the cluster and the cluster external meta data clients, at least two redundant cascaded switches are needed, and each cluster node and external node must be connected to the switches by two network interfaces, so that interface A is connected to switch A and interface B is connected to switch B. To provide local interface IP address fail over, an IPMP group consisting of the two meta data network interfaces has to be defined on each cluster node and external node. Additionally, each of the newly created IPMP groups is assigned an IP address which is failed over between the corresponding local interfaces of the group. These IP addresses have to be added to the hosts.<file system name> file to tell QFS that it should also use them for meta data communication.
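A hedged sketch of the resulting hosts.<file system name> file is shown below. Only the physical host names and the idea of listing several addresses per host are taken from the text; the meta data network host names, the external backup client, the use of the default private host names for the cluster interconnect, and the exact column layout (server priority, unused field and the marker for the current meta data server) are assumptions that would have to be checked against the SUN QFS documentation.

# hosts.<file system name> (sketch)
# host     meta data addresses              prio  -   current server
tribble    clusternode1-priv,tribble-mds     1    -   server
gagh       clusternode2-priv,gagh-mds        2    -
backup1    backup1-mds                       0    -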
As discussed in section 6.6.2.1 on 139, if the meta data communication does not travel over the cluster interconnect but over a normal network connection, the failure of this network connection on the current meta data server would prevent the meta data client hosts from being able to access the file system. Since the meta data network, which connects the cluster nodes with the cluster external hosts together, is a normal network connection, special precautions have to be taken so that the cluster system can appropriately respond in case all interfaces in the meta c Stefan Peinkofer 141 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER data IPMP group of the current meta data server fail. What is meant by “appropriately respond“ is fail over the meta data server task to a cluster node which still has connectivity to the external meta data network. To achieve this behavior, a LogicalHostname resource has to be created within the resource group which contains the QFS resource. The IP address which is assigned to the LogicalHostname resource must be from the same subnet as the meta data network IP addresses are from, so the resource will assign the IP address to the meta data IPMP group. Now, when all interfaces of the meta data IPMP group fail on the current meta data server, the LogicalHostname resource will fail and therefore the resource group, which contains the LogicalHostname resource and the QFS resource, will be failed over to a cluster node whose meta data IPMP group is healthy. As discussed in section 6.4.4.2 on page 103 IPMP will set the special flag fail on IP addresses, assigned to a failed interface. When all interfaces in the meta data IPMP group have failed, whereby failed can mean that just the elected ping nodes are not reachable, this can become a problem. As long as another cluster node exists whose meta data IPMP group is not considered failed this is no problem. But maybe all cluster nodes have automatically selected the same set of ping nodes. In this case, the meta data server service would become unavailable to the external hosts. In order to prevent this scenario, the ping nodes have to be specified manually on the cluster nodes. Basically, there are three options which will result in the desired cluster node behavior and will keep the IPMP group from being considered failed because of a “false alarm“ caused by the IPMP probe based failure detection mechanism: • All external meta data client hosts are configured as ping targets of the cluster nodes. In doing so, the IPMP probe based failure detection will only consider an IPMP group failed if all external meta data clients hosts are not reachable. The advantage of this method is that it actually monitors the logical connectivity to the external meta data client hosts. The drawback of this option is that the IPMP ping host configuration has to be adopted each time a new external meta data client is configured. • Each cluster node is configured to use only its own IPMP test addresses as ping nodes. c Stefan Peinkofer 142 [email protected] 6.6. USING SUN QFS AS HIGHLY AVAILABLE SAN FILE SYSTEM This method simply bypasses the IPMP probe based failure detection mechanism since the IPMP test addresses of the local interfaces are always available. The drawback of this method is that only physical connection failures can be detected. 
• If the network switches, deployed in the meta data network, are reachable through an IP address13 , the addresses of the switches to which the cluster nodes are directly connected can be used as ping targets. This is based on the thought that a switch which no longer responds to a ping request has a problem and, therefore, the IP address should be failed over to another interface, which is connected to another switch. The advantage of this option is that the IPMP ping host configuration does not have to be adopted if a new external meta data client is configured. The external meta data client hosts are configured to use only the IP address, provided by the LogicalHostname resource in the QFS resource group as ping node. Since this IP is always hosted by the current meta data server, this configuration is the best case solution since a path is only considered failed when the meta data server host cannot be reached over that path. 6.6.3 Challenge 3: QFS Cluster Agent 6.6.3.1 Problem Description The QFS cluster agent implements the two optional callback methods Validate and Monitor_check which both will validate the QFS configuration file hosts.<file system name>. The methods will fail if not all “physical hostname to meta data IP address“ mappings in the file are according the following syntax: <public network hostname> <cluster interconnect IP of the node> Since the syntax of the hosts.<file system name> does not meet this criteria anymore, because an additional IP address was specified after the cluster interconnect IP to solve challenge 2, the two functions will fail. To understand the effects of this failure, we will look a little closer at these callback functions. 13 For example to provide a configuration interface over the network. c Stefan Peinkofer 143 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER • Validate - The Validate function is called on any host which potentially can run this resource, when a resource is created or when resource or resource type attributes are changed by the administrator. The failure of the Validate method will prevent the resource from being created or the property from being updated. • Monitor_check - The Monitor_check function is called on a host to which the cluster system wants to fail over the resource, to determine whether the resource will be runnable on that particular host. Unfortunately it is not documented in which cases this function is called exactly. By observing the cluster system, the following could be determined: When a node failure, a failure of the QFS meta data server resource or a manual relocation of the resource with a special administrator command caused the resource relocation, the Monitor_check command was not executed. The only failure scenario in which an execution of the Monitor_check function could be observed was the failure of a resource on which the QFS meta data server resource depends. Since an IP address resource was added to the QFS resource group, on which the QFS resource implicitly depends, a failure of the meta data IPMP group would keep the QFS resource from failing over because the Monitor_check function was called and failed. 6.6.3.2 Possible Solutions Since both the Validate and Monitor_check functions of the QFS cluster agent are binary executable files, the only possible solution to solve this problem is to replace the two functions. This can be done in either of two ways. 
The first and easiest way is to replace just the two executable files for the Validate and the Monitor_check functions. The disadvantage of this solution is that applying a patch for the SUN QFS file system will overwrite the replaced files and so the files will have to be replaced every time the QFS file system is patched. The second way is to tell the cluster system to use other callback functions for the Validate c Stefan Peinkofer 144 [email protected] 6.6. USING SUN QFS AS HIGHLY AVAILABLE SAN FILE SYSTEM and Monitor_check methods. Unfortunately the two values cannot be changed by simply calling an administrative command. To change the values, the QFS cluster agent resource type registration file has to be changed. To let the changes come into effect, the QFS resource type has to be newly registered within the cluster system. If it was not registered before, this is no problem. If it was, the resource type must be first unregistered which means that every resource of this type has to be removed as well. The advantage of this method is that the changes remain in effect even when the QFS file system is patched. Since it is hard to determine which tasks the two callback functions would carry out, the replaced files do nothing but pass a return value of 0 (OK) back. For the Validate method, this means that newly created QFS resources and resource type property or resource property changes should be deliberate and require extensive testing. For the Monitor_check method, this means that in the special case that the meta data server task is failed over to a node, which is really not capable of running the resource, the fail over process takes a little bit longer. This is because the cluster system will not notice that the resource is not able to run on that host until the Start method or the Monitor method fails. But since the likelihood of such a scenario is reasonably small, this risk can be tolerated. 6.6.4 Cluster Redesign The following sections describe the redesign of the sample cluster implementation in order to use the SUN QFS as highly available SAN file system. 6.6.4.1 Installation of a Third Cluster Node To solve the SCSI reservation challenge, it was chosen to implement the solution which uses SCSI-3 persistent group reservations in conjunction with a special program which registers an external meta data client host on the SUN QFS disks to gain read/write access to them. As already said, this solution requires a three-node cluster since Sun Cluster will use SCSI-2 c Stefan Peinkofer 145 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER reservations on a two-node cluster. Since the cluster interconnect of our cluster is implemented as two direct cross-over network connections, the cluster interconnect had to be reconfigured to use network switches so an additional cluster node can be connected to the cluster interconnect network. This task is relatively easy and can be done in a rolling upgrade manner. First, one cluster interconnect path is removed from the cluster configuration; after this, the corresponding network interfaces are connected to the first switch; and then the cluster configuration is updated to use the new path. After this, the same procedure is carried out for the remaining cluster interconnect paths. To save hardware costs, it was decided to use the two switches for the cluster interconnect network as well as for the meta data network for the external meta data clients. 
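Expressed as administrative commands, one pass of this rolling reconfiguration could look roughly like the following sketch. The adapter and switch names are placeholders, and the exact scconf option syntax should be verified against the scconf(1M) manual page before use.

# Remove the old point-to-point cable of the first interconnect path
scconf -c -m endpoint=tribble:ce1,state=disabled
scconf -r -m endpoint=tribble:ce1

# Recable both interfaces to the first switch, then register the switch
# and the two new cables with the cluster configuration
scconf -a -B type=switch,name=ic-switch1
scconf -a -m endpoint=tribble:ce1,endpoint=ic-switch1
scconf -a -m endpoint=gagh:ce1,endpoint=ic-switch1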
Figure 6.7 shows the reconfigured cluster interconnect and meta data network. tribble Physical Network Interface Virtual Network Interface Virtual Network Interface gagh Ethernet Switches Cluster Interconnect VLAN Meta-Data VLAN Trunked Inter-Switch Link Node 3 Figure 6.7: Cluster Interconnect and Meta Data Network Connection Scheme c Stefan Peinkofer 146 [email protected] 6.6. USING SUN QFS AS HIGHLY AVAILABLE SAN FILE SYSTEM Since Sun Cluster requires that the cluster interconnect is a dedicated network, to which only cluster nodes have access, two tagged, port based VLANs are configured on each switch, one for the cluster interconnect network and one for the meta data network. As figure 6.7 shows, switch ports which connect cluster nodes are assigned to both tagged VLANs and the other ports are assigned only to the tagged VLAN for the meta data network. Tagged VLANs provide the ability to partition a physical network into several logical networks which are identified by a VLAN ID. To designate a switch port as a member of a VLAN, the corresponding VLAN ID is assigned to that port, whereby it is possible to assign more than one VLAN ID to a single port. If a port is a member of more than one VLAN, which are assigned as untagged VLANs to the port, it seems for the attached host that all traffic comes and goes to a single network. If the VLANs are assigned as tagged VLANs, all Ethernet packets are assigned a MAC header extension which contains the corresponding VLAN ID. Since the attached host is not aware of this header extension by default, these packets are dropped until a special virtual VLAN network interface which is aware of the VLAN ID extension field, is defined. To configure the VLAN interface, the VLAN ID of the VLAN, which the interface should use to send and receive data, has to be specified. So on our cluster nodes, two virtual VLAN interfaces, one for the cluster interconnect and one for the meta data network are configured upon each physical interface which is connected to one of the two switches. So although a common physical connection is used, for the cluster software and the other applications it looks like two dedicated networks exist. After the cluster interconnect network was reconfigured, the third cluster node was installed. Figure 6.8 shows the adopted connection scheme of the cluster. c Stefan Peinkofer 147 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER 3510 Storage Enclosure Fibre Channel Switches Ethernet Switches Node 3 gagh tribble Ethernet Switches Cluster Interconnect VLAN Meta Data Network VLAN Public Network Connection Redundant Inter-Switch Link Fibre Channel Connection Ethernet Copper (Twisted Pair) Ethernet Fibre Ethernet Fibre Ethernet Fibre Fibre Channel Figure 6.8: Adopted Cluster Connection Scheme Unfortunately, a third cluster node, which has the same performance as the other two nodes, was not affordable for the ZaK, and so we were forced to use a server, which is currently not used yet, but already assigned to another project. So the basic idea was to install the server temporarily as a third cluster node to force the Sun Cluster software to “think“ it is running on a three-node cluster and therefore has to use SCSI-3 reservations and, after that, give back the server to the project to which it is assigned. It is worth mentioning that this is not an ideal solution, since it may be that some special configuration task can only be done when all cluster nodes are up and running. 
To get around this problem, it was planned to obtain a small and cheap server as a third cluster node but it was not possible to obtain the server in a timely manner. Therefore, this solution should be understood as proof of concept implementation. c Stefan Peinkofer 148 [email protected] 6.6. USING SUN QFS AS HIGHLY AVAILABLE SAN FILE SYSTEM Since the third cluster node couldn’t be used as a “real“ cluster node, only a small subset of the discussed configuration tasks which are necessary for the cluster node to join the cluster was performed. This subset consisted of the following tasks: • Connect the third cluster node to the SAN. • Install the operating system without mirroring the boot disk. • Install the cluster software. • Configure for the first time the cluster software on the third node. At the point when the third node joins the cluster for the first time, the cluster updates the global device database and uses SCSI-3 persistent group reservations for every shared disk which can be accessed by all three cluster nodes, which in our case are all shared disks since the third cluster node is connected to the same SAN zone as the others. After the third node was installed and joined the cluster, the vote count of the quorum disk had to be adjusted, since now, four possible votes were available in the cluster: three from the cluster nodes and one from the quorum disk. To achieve that even a single node can constitute a quorum if it owns the quorum device, three votes are needed and so the quorum disk must be assigned a vote count of two. This is done by removing and then reassigning the quorum device from or respectively to the cluster configuration. Now the cluster assigns the quorum device automatically a vote count of two. After this, the third node was brought offline and given back to the project to which it was assigned. 6.6.4.2 Meta Data Network Design As already said, it was chosen to use the two switches for the cluster interconnect network also for the meta data network. The basic difference between the cluster interconnect network and the meta data network is that the meta data network, which uses IPMP for making the network connections highly available, requires that the two switches are connected together since the meta data server could listen on an interface which is connected to switch A and an external c Stefan Peinkofer 149 [email protected] CHAPTER 6. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING SUN CLUSTER meta data client could listen on an interface which is connected to switch B. In order to provide a redundant inter-switch link, the switches are connected together by two paths, which are used in a trunking configuration, i.e. the switches will utilize both connections simultaneously. Since these inter-switch connections are only required for the meta data network, the inter-switch links are configured to forward only the traffic of the meta data network VLAN. After this, the LogicalHostname resource was created within the QFS resource group. As IPMP ping targets of the cluster nodes, it was chosen to use the IP addresses of the two switches. The decision is based on the thought that maintaining a list of all external meta data clients is too error prone, since adding a new external meta data client to the list can be easily overseen. In the last step, the QFS configuration file hosts.<file system name> was adopted, so that each cluster node bound the meta data server to its cluster interconnect IP and its meta data network IP. 
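The resource and routing related steps of this subsection can be sketched as follows. The resource group name, the logical hostname and the switch addresses are placeholders; the static host route technique is the documented Solaris way of pinning IPMP probe targets, and the scrgadm call corresponds to the LogicalHostname creation described above.

# Add the meta data network LogicalHostname to the QFS resource group
scrgadm -a -L -g qfs-rg -l qfs-mds

# On every cluster node, pin the IPMP probe targets to the two switch
# management addresses by adding static host routes (these routes are not
# persistent and have to be re-added by a boot time script)
route add -host 192.168.100.1 192.168.100.1 -static
route add -host 192.168.100.2 192.168.100.2 -static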
6.6.4.3 QFS Cluster Agent Reconfiguration To get around the restrictions of the QFS cluster agent, it was chosen to change the resource type configuration not to use the original callback functions but to use the replacements. Since the QFS cluster agent was already configured on the cluster and, therefore, the QFS resource type was already configured, the resource type had to be unregistered. Therefore, all resources had to be brought offline and set to the unmanaged state. Then, the resource dependencies between QFS and the NFS and the Samba resources had to be deleted. After this, the QFS resource had to be deleted. Finally, the QFS resource type could be unregistered. In the next step, the QFS resource type configuration was adopted so that the Validate and Monitor_Check callback methods pointed to the void replacement callback functions. After this, the resource type was registered again, the QFS resource was created and the dependencies between the QFS resource and the NFS and Samba resources were re-established. c Stefan Peinkofer 150 [email protected] Chapter 7 Implementing a High Availability Cluster System Using Heartbeat 7.1 Initial Situation The databases for the telephone directory and the Identity Management System are currently hosted on two x86 based servers. The server which hosts the Identity Management System database runs Red Hat Linux 9, which is no longer supported by Red Hat. The server which hosts the telephone directory database runs Fedora Core 2 Linux. The databases are currently located on local SCSI disks. The Identity Management System database is placed on a hardware RAID 5 of four disks and the telephone directory database is placed on a software RAID 1 of two disks. 7.2 Customer Requirements The requirements of the new system are to provide a reference implementation of a high availability cluster solution, using two identical x86 based servers, Red Hat Enterprise Linux 4 as operating system and Heartbeat 2.x as cluster software. On this cluster system, the two PostgreSQL databases for the Identity Management System and the telephone directory should be made highly available in an active/active configuration. 151 CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT Since Heartbeat 2.0.0 was released only a few weeks before the cluster system was created, the main purpose of this cluster system is to evaluate whether the latest Heartbeat version1 at that time is already reliable enough to be deployed on a production system. 7.3 General Information on Heartbeat Version 2 Heartbeat is a typical Fail Over cluster. Although Heartbeat supports that more than one instance of a particular resource is online simultaneously through so-called resource clones, it provides no functions for load balancing or high performance computing. So the use of these resource clones is very limited. Heartbeat supports two types of cluster interconnects: • Ethernet • Serial Interfaces Since Heartbeat exchanges heartbeats over the TCP/IP protocol, it is highly recommended to use a serial connection in addition to the Ethernet based cluster interconnects so that a split brain scenario, caused by a failure of the TCP/IP stack, is avoided. Heartbeat uses no quorum tie breaker, like a quorum disk. This is mainly caused by the fact that the Linux kernel provides poor and unreliable support for SCSI-2 and SCSI-3 reservations. The Heartbeat developers are currently deliberating about using ping nodes as quorum tie breakers but this solution is still under design. 
Because of the poor SCSI reservation support, Heartbeat also cannot use SCSI reservation for fencing and so it has to use STONITH. Since no quorum tie breaker is available, Heartbeat ignores quorum in a two-node configuration. To prevent the two nodes from “STONITHing“ each other simultaneously, one of the two nodes is given a head start. Which node is given the head start is negotiated between the two cluster nodes each time a node joins the cluster. 1 Which was 2.0.2 during this thesis. c Stefan Peinkofer 152 [email protected] 7.4. CLUSTER DESIGN AND CONFIGURATION 7.3.1 Heartbeat 1.x vs. Heartbeat 2.x To understand the desire to use Heartbeat 2.x on the cluster system, we must briefly look at the differences between Heartbeat version 1 and 2. • The maximum number of cluster nodes is limited to two in version 1, whereas it is virtually unlimited in version 2. At the time of this writing version 2 has been successfully tested with 16 nodes. • Heartbeat version 1 monitors only the health of the other cluster node, but not the resources which run on the cluster. Therefore version 1 provides only node level fail over. Heartbeat version 2 deploys a resource manager which can call monitoring functions to determine whether a resource is healthy or not and can react to a resource failure in the appropriate way. So version 2 provides also resource level fail over. • With Heartbeat version 1 it is only possible to define a single resource group for each cluster node. Heartbeat version 2 provides the ability to define a virtually infinite number of resource groups. So the feature set of Heartbeat version 2 meets the requirements on a modern high availability cluster system whereas version 1 lacks some fundamental features. 7.4 Cluster Design and Configuration In the following sections we will discuss the design of the Heartbeat cluster system. 7.4.1 Hardware Layout To build the cluster, two identical dual CPU servers were available. The required external connections the server had to provide are as follows: • 2 network connections for the public network. • 1 network connection for the cluster interconnect. c Stefan Peinkofer 153 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT • 1 serial connection for the cluster interconnect. • 1 network connection to the STONITH device. • 2 fibre channel connections to the SAN. For all network connections, copper based Gigabit Ethernet is deployed since fibre optic Ethernet cards for x86 based servers are disproportionately more expensive than copper based cards. Figure 7.1 shows how the interface cards are installed in the server. PCI Slots ttyS0 ttyS1 eth0 eth1 eth2 qla0 eth3 qla1 Gigabit Ethernet Copper (Twisted Pair) Fibre Channel RS-232 (Serial) Figure 7.1: PCI Card Installation RX 300 The servers already provide two Gigabit Ethernet copper interfaces and two serial ports on board. The additional two network and fibre channel connections are provided by dual port cards. For the network connections, this is no problem since the public network can be connected by one onboard port and one PCI network card port, the cluster interconnect connection is redundant through the use of the additional serial connection and the STONITH device provides only one network port. However, the use of the single fibre channel interface card constitutes a single point of failure, which should be removed before the system goes into production use. 
From the available server documentation it is not determinable whether the system board c Stefan Peinkofer 154 [email protected] 7.4. CLUSTER DESIGN AND CONFIGURATION provides more than one PCI bus and, if it does, which PCI slots are assigned to which PCI bus. Therefore the distribution of the PCI cards among the available PCI slots is randomly chosen, but consistent among both nodes. Figure 7.2 shows the various connections of the cluster nodes. 3510 Storage Enclosure Fibre Channel Switches Power Switches spock sarek Cluster Interconnect Cluster Interconnect Public Network Connection Redundant Inter-Switch Link Fibre Channel Connection STONITH Connection Power Connection Ethernet Switches Ethernet Copper (Twisted Pair) RS-232 (Null-Modem) Ethernet Fibre Fibre Channel Ethernet Fibre Ethernet Copper (Twisted Pair) Figure 7.2: Cluster Connection Scheme As already said in the Sun Cluster chapter, the cables for the various connections are not laid in different lanes. The cluster interconnection interfaces are connected directly with cross-over Ethernet cables and respectively null-modem cables. The public network connections of the two nodes are connected to two different switches and all paths are connected to different switch modules. Each server is connected to both SAN fabrics to tolerate the failure of one fabric. c Stefan Peinkofer 155 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT Each server contains 6 hot pluggable 147 GB SCSI disks, which are all connected to a single SCSI controller. This single point of failure cannot be removed, since the SCSI back plane provides only a single I/O controller connection. Although the servers were purchased with a hardware RAID controller option, this RAID controller cannot be used. This is because the RAID controller option is realized by a relatively new technology called Zero Channel RAID. A traditional RAID controller combines the SCSI controller and RAID controller task in one logical unit, i.e. the disks are directly connected to the RAID controller. In a Zero Channel RAID configuration, the disks are connected to a typical SCSI controller which is placed on the motherboard and the RAID controller is installed to a special PCI slot. This provides the advantage that the RAID functionality can be upgraded without recabling the disk connections. However, at the time the cluster was set up, no driver for the purchased Zero Channel RAID controller was available for Red Hat Enterprise Linux 4. In addition to that, the results of a performance test, using a Linux distribution which provides drivers for this controller showed that the performance of the Zero Channel RAID controller is inferior to the performance of software RAID. So it was chosen to abandon and uninstall the Zero Channel RAID controller. The servers provide two redundant power supplies whereby each power supply is connected to a different main power circuit. As on the Sun Cluster, no uninterruptible power supplies are deployed because of the maintenance costs. 7.4.2 Operating System For the installation of the operating system, no special requirements exist. Every node is assigned a physical hostname and a public network IP address as usual. On our Heartbeat cluster, the nodes are named spock and sarek. 7.4.2.1 Boot Disk Partition Layout Since no special requirements for the boot disk partition layout exist, the created layout is very simple. 
Although both servers have 4 GB of main memory, it was chosen to put the root file system and the swap area on different disks, since each server has enough local disks and this provides a slight performance advantage in case the swap area is ever really needed. So one disk, which will be used for the root file system, contains a single partition consuming the whole disk, and one disk, which will be used for the swap area, contains two partitions: an 8 GB swap partition and a partition consuming the remaining space of the disk, which will not be used.

7.4.2.2 Boot Disk Mirroring

Since each server contains 6 disks, it was chosen to use three disks to mirror the disk which contains the root file system and three disks to mirror the disk which contains the swap area, whereby in each case two disks are mirrored and the third disk is assigned as a hot spare drive which will stand in when one of the two disks fails. Since the setup of the software mirroring can be done through a graphical user interface during the operating system installation (which works), it is recommended to do so. The Linux software RAID driver does not mirror whole disks but individual partitions. Therefore, first of all, the four remaining disks have to be partitioned identically to the corresponding disks which should be mirrored. After that, it has to be defined which partitions from which disks should be used as a mirror and which partition should be assigned as a hot spare to the mirror. After that, the virtual devices which represent the mirrored partitions will be created and it has to be specified which partition should be used for which file system. Finally, the operating system will install itself directly to the mirrored disks.

7.4.2.3 Fibre Channel I/O Multipathing

Like Sun Cluster, Heartbeat does not provide I/O path fail over for the storage devices and, therefore, this task has to be done at the operating system level. For fibre channel I/O multipathing, which provides the ability to fail over the I/O traffic to the second fibre channel connection in case the first one fails, two different methods can be deployed on Red Hat Enterprise Linux 4.

The first method is to use the Multi Disk (MD) driver, which is contained in the official Linux kernel and is also used for the software RAID functions. The driver will utilize one fibre channel connection at a time and fail over the I/O traffic to the alternate path when the first path fails. The drawback of this method is that the MD driver works only with a simple, non-meshed SAN. Non-meshed means that only two different paths to a single disk exist. Although the currently deployed SAN is non-meshed, the ZaK does not want to abandon the option to upgrade to a meshed SAN topology later.

The second method is to use proprietary driver software for the deployed fibre channel Host Bus Adapters (HBAs), provided by the manufacturer of the HBAs, Qlogic. This driver supports I/O multipathing natively. It recognizes that the same shared disk can be accessed over two or more paths and exposes only one logical shared disk to the operating system, instead of representing each path to the disk as a dedicated disk. Like the MD driver, the HBA driver utilizes only one path to a shared disk at a time and fails over to another path in case the active path fails.
The advantage of this driver is that it also supports meshed SAN topologies. The disadvantage is that this driver does not work with the active/active RAID controller configuration of the deployed 3510 storage array. To understand why this restriction is in effect, we have to look at how the multipathing part of the HBA driver works, but before that we have to look a little bit closer at the addressing scheme of the fibre channel protocol. Each participant in a fibre channel environment has a unique ID which is referred to as World Wide Name (WWN). The following list gives an overview of some fibre channel environment participants. • Fibre channel host bus adapter cards. • Fibre channel ports on a host bus adapter card. c Stefan Peinkofer 158 [email protected] 7.4. CLUSTER DESIGN AND CONFIGURATION • Fibre channel storage enclosures. • RAID controllers within a fibre channel storage enclosure. • Fibre channel ports on RAID controllers. • Fibre channel switches. • Fibre channel ports on a switch. Figure 7.3 gives an overview of the important WWNs which are assigned to the 3510 storage enclosure. WWN of Enclosure RAID-Set A Partition 3 WWN of LUN 2 (A) Partition 2 WWN of LUN 1 (A) Partition 1 WWN of LUN 0 (A) WWN of Controller 1 WWN of Controller 2 RAID-Set B Partition 3 WWN of LUN 2 (B) Partition 2 WWN of LUN 1 (B) Partition 1 WWN of LUN 0 (B) Figure 7.3: Important World Wide Names (WWNs) of a 3510 Fibre Channel Array As shown in the figure, each LUN, which is a partition of a 3510 internal RAID 5 set and which we refer to as shared disk, is assigned a dedicated WWN and the enclosure itself is assigned a WWN, too. In addition to that, a LUN is assigned not only a WWN but also a LUN number. In contrast to the WWN which has to be unique throughout the fibre channel environment, a LUN number is only unique in the scope of the RAID controller, which exports this LUN to the “outside“. Therefore on each RAID controller which is connected to the fibre channel environment, a LUN 0 for instance can exist. c Stefan Peinkofer 159 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT It follows that I/O multipathing software should use the LUN WWNs to identify the various paths available to a particular LUN. Unfortunately, the multipathing function of the HBA driver does not use the LUN WWNs for this but uses another approach. It uses the WWN of the storage enclosure in conjunction with the LUN number to determine which paths to a particular LUN are available. This works perfectly for storage enclosures with only one RAID controller or with two RAID controllers in an active/passive configuration but introduces a big problem in active/active dual RAID controller configurations. Let’s consider that on an enclosure two LUNs are exported, one on each RAID controller. Both LUNs are assigned the LUN number 0 which is allowed since the LUNs are exported by different controllers. The HBA driver will now “think“ that it has four dedicated paths to a single LUN 0 which is wrong since in effect there are two dedicated LUN 0 which each can be reached over two paths in each case. The HBA driver makes this mistake because it assumes that from a single storage enclosure, only one LUN 0 can be exported and therefore each LUN 0 from the same storage enclosure represents a single physical disk space. To work around this problem, there are basically two solutions. The first solution would be to reconfigure the 3510 to use an active/passive RAID controller configuration. 
Since this configuration would degrade the performance of the 3510 which would affect not only the Linux servers, but also the Solaris servers, which constitute the majority of SAN attached hosts, this solution is not acceptable for the ZaK. The second solution is to configure the SAN in such a manner that the Linux servers can only access one of the RAID controllers of the 3510 enclosure. Therefore, the zone configuration on the fibre channel switches has to be changed. In addition to the already deployed test environment zone, an additional zone has to be created which contains the switch ports that connect the Linux servers and the switch ports that connect the first or respectively second RAID controller of the 3510. Since fibre channel zones allow a specific port to be a member of more than one zone, this configuration is acceptable since the original test environment zone, to which the c Stefan Peinkofer 160 [email protected] 7.4. CLUSTER DESIGN AND CONFIGURATION Solaris servers are connected, can still contain the ports that connect to the first and the second RAID controller. Figure 7.4 shows the reconfigured fibre channel zone configuration. 3510 Controller 1 3510 Controller 2 Inter-Switch Link Fibre Channel Switches Production Zone SUN Test Zone LINUX Test Zone Figure 7.4: New Fibre Channel Zone Configuration The restriction that the Linux servers can only access one RAID controller does not constitute a single point of failure in this special case because of the fibre channel connection scheme of the 3510 storage enclosure. To understand this, we must take a look at how the 3510 is connected to the SAN. As shown in figure 7.5, each of the two RAID controllers provide four ports which can be used to connect the controller to the SAN. c Stefan Peinkofer 161 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT Fibre Channel Switch FC Connection Ports RAID-Set A 3510 Storage Enclosure 0 1 4 5 0 1 4 5 RAID-Set B Fibre Channel Hubs Fibre Channel Switch Signal of Controller 1 Signal of Controller 2 Figure 7.5: 3510 Fibre Channel Array Connection Scheme Thereby, it does not necessarily mean that a port on the first controller is logically connected to the first controller. Instead, the ports which are on top of each other can be viewed as a functional entity, which can either be assigned to the first or the second controller. This means that if port entity 0 is assigned to controller 0, for instance, the signal which is transmitted over the two corresponding ports is the same. In fact the two ports 0 of both controller form a fibre channel hub and therefore it is irrelevant whether a cable is connected to the upper or lower port 0; the signal is always routed to controller 0. In our concrete configuration, the port entities 0 and 4 are assigned to the first controller and the entities 1 and 5 are assigned to the second controller. As figure 7.5 shows, every controller is connected once over a port, provided by itself, and once over a port provided by the other controller. c Stefan Peinkofer 162 [email protected] 7.4. CLUSTER DESIGN AND CONFIGURATION What will happen in case of a controller failure is shown in figure 7.6. The ports of the failed controller will become unusable and the work will be failed over to the second controller. So even if a zone contains only the first controller, the two switch ports are connected to both controllers and, therefore, this special zone configuration can survive a controller failure. 
Figure 7.6: 3510 Fibre Channel Array Failure (the figure shows that the ports of the failed controller no longer carry a signal, while the surviving controller serves both RAID sets).

7.4.2.4 IP Multipathing

Like on a Sun Cluster, a node in a Heartbeat cluster is typically connected to two different types of networks, the cluster interconnect network and one or more public networks. To provide local network interface IP fail over functionality, either for the public network or the cluster interconnect network interfaces, a special virtual network interface driver called the bonding driver has to be used. This driver is part of the official Linux kernel. Using this driver for the cluster interconnect network interfaces is only required if applications running on the cluster should be able to communicate with each other over the cluster interconnect interfaces. Heartbeat itself does not require this driver for local interface IP address fail over on the cluster interconnect interfaces because it can also utilize the various IP addresses assigned to the cluster interconnect interfaces in parallel for sending and receiving heartbeat messages.

To configure and activate the bonding driver, first of all the appropriate kernel module has to be loaded, whereby some driver parameters have to be set. The interesting parameters for a fail over configuration of the bonding driver are the following:

• mode - This specifies the operation mode of the bonding module. Besides the desired active/passive fail over mode, several other modes which distribute the load among the interfaces are available (the bonding driver was originally developed for the Beowulf high performance computing cluster).

• miimon - This specifies the time interval in milliseconds in which the bonding driver will evaluate the link status of the physical network interfaces to determine whether an interface has a link to the network or not. Usually a value of 100 milliseconds is sufficient.

• downdelay - This defines the time delay in milliseconds after which the IP address will be failed over when the bonding driver encounters a link failure on the active interface. The value should be set to at least twice the miimon interval to prevent false alarms.

• updelay - This defines the time delay in milliseconds after which the IP address will be failed back when the bonding driver detects that the link on the primary interface has been restored.

By loading the bonding driver, one virtual network device will be created. If multiple bonding devices are needed, for example for the public network and the cluster interconnect, either a special parameter has to be specified when loading the driver which defines how many bonding interfaces should be created, or the driver has to be loaded multiple times. The second method provides the advantage that the additional bonding interface can be assigned a different configuration, whereas the first method will create all bonding interfaces with the same configuration. A minimal sketch of a fail over bonding configuration is given below.
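To make the parameter discussion concrete, the following sketch loads and configures a single bonding interface in active/passive mode. The interface names, the IP address and the parameter values are purely illustrative, and the distribution-specific way of making the setup persistent across reboots is intentionally left out:

    # Load the bonding driver in active/passive mode (mode=1), checking the link
    # state every 100 ms and waiting 200 ms before failing over or failing back.
    # The max_bonds parameter defines how many identically configured bonding
    # interfaces are created; a differently configured second instance would
    # have to be loaded separately.
    modprobe bonding mode=1 miimon=100 downdelay=200 updelay=200 max_bonds=1

    # Assign an IP address to the virtual interface ...
    ifconfig bond0 192.168.1.11 netmask 255.255.255.0 up

    # ... and enslave the two physical interfaces; the first one becomes active.
    ifenslave bond0 eth0 eth2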
The bonding driver also provides a probe based failure detection. In contrast to IPMP on Solaris, this method does not send and receive ICMP echo requests and replies but ARP requests and replies to/from specified IP addresses. Unfortunately, the bonding driver can only use one of the two methods at a time, either link based or probe based failure detection. For our cluster system, it was chosen to use the link based failure detection, because a probe based failure detection could easily be retrofitted by implementing a user space program which pings a set of IP addresses and initiates a manual interface fail over when no ICMP echo reply is received anymore.

After the bonding driver is loaded, the newly created virtual network interface appears as a normal network device on the system and therefore IP addresses can be assigned to it in the usual ways. Before the IP addresses configured on the virtual interface can be used, the two physical network interfaces, between which the IP addresses will be failed over, have to be assigned to the virtual network interface. This is done by calling a special command which takes the name of the desired virtual network interface and the names of the active and passive physical network interfaces as arguments. To make this configuration persistent across reboots, the system documentation of the deployed Linux distribution has to be consulted, because the method for doing so differs from distribution to distribution.

7.4.2.5 Dependencies on Externally Provided Services

The operating system depends on the DNS service, which is provided by an external host. Since the DNS service is not highly available yet, this service constitutes a single point of failure. Fortunately, access to the databases on the cluster nodes is restricted to four hosts. So in order to work around this single point of failure, the hostname to IP address mappings of these hosts are stored in the local /etc/hosts file of the cluster nodes, which will be used in addition to DNS to perform hostname to IP and IP to hostname resolutions.

7.4.2.6 Time Synchronization

To synchronize the time between the cluster nodes, the NTP configuration from the Sun Cluster was copied and adapted to the Heartbeat cluster environment. Since our Heartbeat cluster possesses only one Ethernet cluster interconnect, synchronizing the time between the cluster nodes over this single path constitutes a single point of failure. However, it is doubtful that a redundant path over the public network would be more reliable than the cluster interconnect path, since the path over the public network involves more components which could fail. The optimal solution would be to use the cluster interconnect path as well as the public network path for sending and receiving NTP messages. This means that the NTP daemon would have to send and receive NTP messages to/from a single node over two dedicated IP addresses, whereby the NTP daemons would treat every IP address as a dedicated node. From the available documentation it could not be determined whether such a configuration is supported, so another solution was deployed. On our cluster system, it was chosen to synchronize the time between the nodes only over the single cluster interconnect path. In addition to that, each cluster node synchronizes to three different NTP servers over the public network connection; a sketch of such a configuration is given below.
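As an illustration, the NTP configuration of one node might look roughly as follows; the peer address and the external server names are placeholders, not the actual values used on the cluster:

    # /etc/ntp.conf on spock (sarek uses the same file with the peer address swapped)
    # Three external time servers, reached over the public network:
    server ntp1.example.org
    server ntp2.example.org
    server ntp3.example.org
    # Peer the other cluster node over its cluster interconnect address:
    peer 10.0.0.2
    driftfile /var/lib/ntp/drift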
With this configuration, even if the Ethernet cluster interconnect path fails, the time on the cluster nodes stays synchronized because all nodes are still synchronized to the external time servers. The configuration therefore tolerates a single path failure and is suitable to be deployed on the cluster.

7.4.3 Shared Disks

For the example Heartbeat cluster, two shared disks to store the database files are needed. Because of the current size of the databases, which is 7.3 GB for the Identity Management System database and 1.1 GB for the telephone directory database, a 50 GB and a 10 GB shared disk were chosen, respectively. This space should be sufficient for the planned utilization time of the databases. To prevent the cluster nodes of the Sun Cluster system from putting SCSI reservations on these two disks, access to the two disks is restricted to the two Linux nodes by the LUN masking feature of the 3510 storage array.

Although it would be desirable to mirror the shared disks across two 3510 enclosures in the production environment, it was chosen to set this aside. It would be possible to create a software mirror between two shared disks by using the MD driver, but the Heartbeat developers strongly recommend not to use the MD driver for mirroring shared disks, because it was not built with shared disks in mind. The work around to fail over an MD mirrored shared disk is to remove the RAID set on the server which currently maintains it and then to create the RAID set again on the second server. According to some postings on the Heartbeat mailing list, this procedure is error prone, and every time the RAID set is failed over the mirror has to be resynchronized. So in order to mirror shared disks in software on a Linux system, commercial products have to be used, which is not intended by the ZaK.

7.4.4 Cluster Software

In the following sections we will look at the initial setup of the Heartbeat environment.

7.4.4.1 Installation of Heartbeat

The Heartbeat program suite is available as precompiled installation packages for various Linux distributions as well as plain program sources. Since no installation package is available for Red Hat Enterprise Linux 4, it was chosen to compile the Heartbeat program suite manually. Before Heartbeat can be compiled, it is mandatory to create the group haclient and a user named hacluster, which is a member of this group, on all cluster nodes. If this is not done before Heartbeat is compiled, the program suite will not work because of erroneous file permissions. To compile Heartbeat, the usual configure; make; make install procedure has to be carried out, as with any other Linux program which is compiled from source.

After the Heartbeat program suite is installed, the kernel watchdog has to be enabled, which is done by loading the appropriate kernel module. The watchdog module will automatically reboot a node when it is not continuously queried by an application. This can be understood as a local heartbeat: if Heartbeat does not contact the watchdog for a specific time interval, the watchdog will consider the system failed and reboot it.
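As a hedged illustration (the text does not name the concrete watchdog module, so the software watchdog module softdog is an assumption here), enabling the kernel watchdog could look like this; the purpose of the nowayout option is explained in the next paragraph:

    # Load the software watchdog. nowayout=0 keeps the timer stoppable by software,
    # so that a clean shutdown of Heartbeat does not trigger an unwanted reboot.
    modprobe softdog nowayout=0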
It is important to set a special module option when loading the watchdog module which allows the watchdog timer, once enabled, to be disabled again by the software, since otherwise a manual shutdown of the Heartbeat program would cause the system to reboot.

7.4.4.2 Configuration of Heartbeat

After Heartbeat is installed and the watchdog module is loaded, the initial Heartbeat configuration can be created. This is done by creating the two files ha.cf and authkeys. The ha.cf file contains the main configuration of Heartbeat. In the following, we will look at the most important configuration options of the ha.cf:

• node - Defines the name of a cluster node. The name specified here must exactly match the output of the hostname command. For our configuration, one entry for sarek and one for spock has to be specified.

• bcast - This defines the name of a network interface which Heartbeat will use to broadcast heartbeat packets to the other nodes. In our case, one entry for the dedicated cluster interconnect interface eth3 and one for the public network interface bond0 is used. Although Heartbeat can use unicasts and multicasts for exchanging heartbeat messages over Ethernet, it is highly recommended to use the broadcast mode, since it is the least error prone way to exchange messages over an Ethernet network.

• udpport - This defines the UDP port to which heartbeat packets are sent. This parameter only has to be specified if more than one Heartbeat cluster shares a common network for exchanging heartbeat packets, since the packets are not sent directly to the appropriate cluster nodes but broadcast to the whole network. Therefore, each cluster must use a unique UDP port so that the packets are processed only by the appropriate cluster nodes.

• serial - This defines the name of a serial interface which Heartbeat will use to exchange heartbeat messages with another node. In our case, one entry for the serial device /dev/ttyS1 is specified.

• baud - This defines the data rate which will be used on the serial interface(s) to exchange heartbeat messages.

• keepalive - This defines the time interval in seconds in which a node will send heartbeat messages.

• warntime - This defines the time span in seconds after which a warning message will be logged when Heartbeat detects that a node is no longer sending heartbeat messages.

• deadtime - This defines the time span in seconds after which Heartbeat will declare a node dead when it detects that the node is no longer sending heartbeat messages.

• initdead - When the Heartbeat program is started, it will wait this time span before it declares those cluster nodes dead from which no heartbeat messages have been received yet.

• auto_failback - This defines whether resource groups should be automatically failed back or not.

• watchdog - This defines the path to the device file of the watchdog.

• use_logd - This defines whether Heartbeat will use the system's syslog daemon or a custom log daemon to write log messages. The advantage of the custom log daemon is that the log messages are written asynchronously, which means that the Heartbeat processes do not have to wait until a log message is written to the file but can continue right after the message is delivered to the log daemon. This increases the performance of Heartbeat.

• crm - This defines whether Heartbeat should run in Heartbeat v2 mode, which uses the new Cluster Resource Manager (CRM) to manage the resources, or in the Heartbeat v1 compatibility mode.

A sketch of a complete ha.cf along these lines is given below.
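The following hedged sketch pulls these options together into one ha.cf; the timing values shown are illustrative rather than the exact values deployed on the cluster:

    # /etc/ha.d/ha.cf (illustrative values)
    node            spock
    node            sarek
    bcast           eth3 bond0
    udpport         694
    serial          /dev/ttyS1
    baud            19200
    keepalive       1
    warntime        5
    deadtime        10
    initdead        60
    auto_failback   off
    watchdog        /dev/watchdog
    use_logd        yes
    crm             yes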
The authkeys configuration file defines a password and a hash algorithm with which the heartbeat messages are signed. The following hash algorithms can be specified:

• CRC - Use the Cyclic Redundancy Check algorithm
• MD5 - Use the MD5 hash algorithm
• SHA1 - Use the SHA1 hash algorithm

The CRC method should only be used if all paths used as cluster interconnect are physically secure networks, since it provides no security and only protects against packet corruption.
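For illustration, a minimal authkeys file using SHA1 might look as follows; the passphrase is a placeholder and the file must be readable by root only:

    # /etc/ha.d/authkeys (chmod 600)
    auth 1
    1 sha1 ReplaceWithASecretPassphrase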
After Heartbeat is configured, it has to be tested whether the specified cluster interconnect paths work. For this purpose, Heartbeat provides a special command which tests whether the specified paths can be used as cluster interconnect paths. The most common failure scenarios which prevent Heartbeat from sending heartbeat messages are bad firewall rules on Ethernet interfaces and faulty cabling between serial interfaces.

7.4.4.3 File System for the Database Files

Since PostgreSQL cannot benefit from a shared file system and Heartbeat itself does not provide a shared file system, the file system deployed on the shared disks is an ordinary Linux ext3 file system. Although Linux in general supports many file systems and some of them provide better performance than ext3, ext3 has to be used, since it is the only file system which is supported by Red Hat Enterprise Linux 4.

After the two file systems have been created on the shared disks, an appropriate mount point has to be created for each shared disk. In contrast to Solaris, the "disk partition to mount point" mapping would normally be specified in the /etc/fstab file, but the Heartbeat developers recommend not to list the shared file systems in this file, to avoid the file systems being accidentally mounted manually.

7.4.5 Applications

The only application which will be made highly available on the cluster system is the PostgreSQL database software. Although PostgreSQL was already installed together with the operating system, it was chosen to use a self-compiled version of PostgreSQL, because the version delivered along with the operating system is 7.x and the up-to-date version is 8.x. The decision for PostgreSQL version 8.x is mainly based on the fact that version 8.x provides a point-in-time recovery mechanism. With point-in-time recovery it is possible to restore the state the database had at a specific point in time. This is useful, for example, when the database is logically corrupted by a database command, like an accidental deletion of data records. Without point-in-time recovery, the backup of the database has to be restored, which may have been taken hours before the actual database corruption, so all database changes made since the last database backup are lost. With point-in-time recovery, only the changes which were made after the database corruption are lost, since the database can be rolled back to the point in time right before the execution of the hazardous command.

It was chosen to store the PostgreSQL application binaries on the local disks of the cluster nodes, since no shared file system is used and therefore two instances of the application binaries have to be maintained anyway. Storing the application binaries on the shared disks would therefore provide no benefit.

Before PostgreSQL can be compiled, a user called postgres, which is a member of the group postgres, has to be created. After that, the compilation and installation of PostgreSQL is done like with any other software which has to be compiled from source.

After PostgreSQL is installed, the database instance files on the shared disks have to be created. In the first step, the shared disks have to be mounted on the appropriate mount points. After that, a directory called data has to be created on both shared disks. It must be ensured that the mount points and the data directories are owned by the postgres user and the postgres group and that this user and group have full access to them. After that, the database instance files have to be created within both data directories by calling a special PostgreSQL command.

After the database instance files are created, the database instances have to be configured. This is done by adapting the postgresql.conf file, which has been created automatically along with the database instance files within the data directory. To use PostgreSQL on the cluster, the following configuration parameters have to be changed:

• listen_addresses - The value of this parameter has to be set to the IP address which is assigned to the specific database instance. If the database should not listen on any IP address, the value must be set to an empty string, since PostgreSQL binds to the localhost IP address by default. Otherwise, if both PostgreSQL instances ran on the same node, one instance would fail to bind to the localhost IP address, since the other instance would already be bound to that IP address on the same TCP port.

• unix_socket_directory - This is the directory in which the PostgreSQL instance will create its UNIX domain socket file, which will be used by local clients to contact the database. Since the default value for this parameter is /tmp and the socket directory cannot be shared by PostgreSQL instances, it has to be set to a different directory for each PostgreSQL instance. On our cluster it was chosen to use the directory to which the shared disk of the specific PostgreSQL instance is mounted.

Finally, both PostgreSQL instances must be configured to accept connections from the postgres user from all IP addresses which are bound to the public network interfaces of the cluster nodes. This is needed because the health check function of the PostgreSQL resource agent does not specify a password when connecting to the database instance.
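As an illustration of these settings, the relevant lines of one instance's postgresql.conf might look roughly as follows; the IP address and the mount point are placeholders, and the corresponding pg_hba.conf entries for the postgres user are not shown:

    # postgresql.conf of one database instance (illustrative values)
    listen_addresses      = '192.168.1.20'   # logical IP address of this instance
    unix_socket_directory = '/infobase'      # mount point of this instance's shared disk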
7.4.6 Configuring the STONITH Devices

In this section we will discuss how the STONITH devices have to be configured so that Heartbeat is able to determine which STONITH devices can be used to STONITH a particular node. Heartbeat treats a STONITH device like a normal cluster resource. Depending on whether only one or multiple nodes can access the STONITH device simultaneously, a STONITH resource can be active on only one or on multiple nodes at a time. Depending on the deployed STONITH device, the hostnames of the cluster nodes which can be "STONITHed" by a particular STONITH resource are either configured as a custom resource property of the STONITH resource or directly on the STONITH device.

The STONITH devices for which Heartbeat requires that the hostnames are configured on the STONITH device usually provide a way to let the administrator assign names to the various outlet plugs. To define that a cluster node can be "STONITHed" by such a device, all outlet plugs which are connected to the particular host must be assigned the hostname of that cluster node. When Heartbeat starts the resource of such a STONITH device, it will query the hostnames directly from the STONITH device.

It is worth mentioning how Heartbeat carries out a STONITH operation. In every cluster partition a so-called Designated Coordinator (DC) exists, which is, among other things, responsible for initiating STONITH operations. If the DC decides to STONITH a node, it broadcasts a request containing the name of the node to STONITH to all cluster members, including itself. Every node which receives the request will check whether it currently runs a STONITH resource which is able to STONITH that node; if so, it will carry out the STONITH operation and announce to the other cluster members whether the operation succeeded or failed.

7.4.7 Creating the Heartbeat Resource Configuration

Since the configuration of resources and resource groups for Heartbeat is not as easy and well documented as for Sun Cluster, we will look at how exactly the resources and resource groups are defined in a little more detail in the following sections.

7.4.7.1 Resources and Resource Dependencies

Based on the requirements, the cluster should provide two highly available PostgreSQL database instances, whereby each node should run one instance by default. Therefore each database instance requires a dedicated IP address and requires that the shared disk which contains the database files of the instance is mounted. Additionally, it was chosen to deploy two other resource types which are used to inform the administrators in case of a failure. Since Heartbeat requires that a node must be able to fence itself, but in our case every node is only connected to the STONITH device which can fence the other node, two STONITH resources for software STONITH devices have to be deployed in addition to the two STONITH resources for the physical STONITH devices. To STONITH a node, the software STONITH devices will initiate a quick and ungraceful reboot of the node.

Figure 7.7 shows the needed resources and resource dependencies, whereby two peculiarities exist:

1. The STONITH resources are not configured within a resource group. This is done because Heartbeat does not necessarily require that a resource is contained in a resource group, and since all STONITH resources are independent of each other, the overhead of defining four additional resource groups can be saved.

2. The IP address, shared disk and application resources do not really depend on the two resources which are used to notify the administrators, but since the failure of the resource group which contains the database instance is of interest to the administrators, the two resources have to be contained in the same resource group as the database application instance.
Figure 7.7: Resources and Resource Dependencies on the Heartbeat Cluster (the figure shows the resource groups infobase_rg and telebase_rg, each containing an AudibleAlarm, a MailTo, an IPaddr, a Filesystem and a Postgres resource, as well as the stand-alone STONITH resources suicide_spock and suicide_sarek of type suicide and kill_spock and kill_sarek of type wti_nps).

A special constraint on the STONITH resources is that the resources for the physical STONITH devices are only allowed to run on the node which is connected by the Ethernet connection to the corresponding STONITH device, whereas the resources of the software STONITH devices are only allowed to run on the node which should be fenced by the resource. Figure 7.8 shows the valid location configuration of the STONITH resources and figure 7.9 shows the invalid one.

Figure 7.8: Valid STONITH Resource Location Configuration

Figure 7.9: Invalid STONITH Resource Location Configuration

7.4.7.2 Creating the Cluster Information Base

The configuration of the resources and resource groups is done by creating an XML file which is called the Cluster Information Base (CIB). Unfortunately, there is little documentation about how this file should look. The only information available is a commented Document Type Definition (DTD) of the XML file and a few example CIB files. What is left completely unaddressed is the definition of STONITH resources. An example for the definition of STONITH resources had to be retrieved from the source code of Heartbeat's Cluster Test System (CTS), which contains a test CIB in which STONITH resources are defined.

In contrast to Sun Cluster, the definition of resources and resource groups is very complex, since Heartbeat requires, in addition to the usual resource group and resource information, also information about what the Cluster Resource Manager should do with the resource group or resource when certain events occur. Since a discussion of all configuration options the CIB provides would go beyond the scope of this thesis, we will limit the discussion to a logical description of the CIB which was created for our cluster system. However, the example CIB file is contained on the CD-ROM delivered along with this document.

The Cluster Information Base is divided into three sections. Section one contains the basic configuration of the Cluster Resource Manager, which is responsible for all resource related tasks, like starting, stopping and monitoring the resources. Section two contains the actual configuration of the resource groups and resources. Section three contains constraints, which define for example on which node a resource should run by default, or resource dependencies between resources contained in different resource groups. A heavily abbreviated sketch of this structure is given below.
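For orientation, a heavily abbreviated and hedged sketch of such a CIB follows. The element structure follows the Heartbeat 2.0.x DTD, but the cluster option names, resource parameters and values shown here are illustrative and should be checked against the DTD and the example files mentioned above; the full CIB is on the accompanying CD-ROM:

    <cib>
      <configuration>
        <crm_config>           <!-- section one: Cluster Resource Manager options -->
          <cluster_property_set id="cib-bootstrap-options">
            <attributes>
              <nvpair id="opt-stonith" name="stonith_enabled" value="true"/>
            </attributes>
          </cluster_property_set>
        </crm_config>
        <nodes/>
        <resources>            <!-- section two: resource groups and resources -->
          <group id="infobase_rg">
            <!-- the real group contains AudibleAlarm, MailTo, IPaddr, Filesystem and
                 Postgres, in this order; only the IPaddr resource is shown here -->
            <primitive id="infobase_ip" class="ocf" provider="heartbeat" type="IPaddr">
              <operations>
                <op id="infobase_ip_monitor" name="monitor" interval="10s" timeout="30s"/>
              </operations>
              <instance_attributes id="infobase_ip_attr">
                <attributes>
                  <nvpair id="infobase_ip_addr" name="ip" value="192.168.1.20"/>
                </attributes>
              </instance_attributes>
            </primitive>
          </group>
          <!-- telebase_rg and the four STONITH resources are defined analogously -->
        </resources>
        <constraints>          <!-- section three: location constraints and dependencies -->
          <rsc_location id="infobase_rg_on_spock" rsc="infobase_rg">
            <rule id="infobase_rg_rule" score="100">
              <expression id="infobase_rg_expr" attribute="#uname" operation="eq" value="spock"/>
            </rule>
          </rsc_location>
        </constraints>
      </configuration>
      <status/>
    </cib>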
In the Cluster Resource Manager configuration segment, the following information was provided: • A cluster transition, like a fail over, has to be completed within 120 seconds. If the transition takes longer, it is considered failed and a new transition has to be initiated. • By default every resource can run on every cluster node. • The Cluster Resource Manager should enable fencing of failed nodes. For the resource groups, the following information was specified: • The corresponding name of the resource group. • When the Resource Group should be failed over because of a node failure, the Resource Manager must not start the resources of the resource group, until the failed node can be successfully fenced. c Stefan Peinkofer 177 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT Heartbeat implicitly assumes that the order in which the resources within a resource group are defined reflects the resource dependencies of the resources within this group. This means that Heartbeat will start the resource group by starting the resources within the group in a top-down order and it will stop them in the reverse order. So the resources within the resource groups are specified in the following order: • AudibleAlarm • MailTo • IPaddr • Filesystem • Postgres To configure a resource, the following information has to be provided: • The name of the resource. • The class of the cluster agent which should be used for the resource. For Heartbeat v2 resource agents, which provide a monitoring callback function, the class has to be set to ocf (Open Cluster Framework). • The name of the resource agent to use. • The name of the vendor who implemented the resource agent. Heartbeat v2 defines that OCF resource agents, which are implemented by different vendors, can have the same name. This option is used to distinguish between them. • The timeout of the Start callback function, after which a not yet completed Start operation is considered failed. • The timeout of the Stop callback function, after which a not yet completed Stop operation is considered failed. c Stefan Peinkofer 178 [email protected] 7.4. CLUSTER DESIGN AND CONFIGURATION • The timeout of the Monitor callback function, after which a not yet completed Monitor operation is considered failed. • The time interval in which the Monitor function should be executed. • The custom resource properties of the specific resource type. The concrete custom resource properties, specified for the deployed resource types are: – AudibleAlarm ∗ The hostname of the cluster node on which the resource group should run by default. – MailTo ∗ The E-mail addresses of the administrators who should be notified. – IPaddr ∗ The IP address of the logical hostname, which should be maintained by this resource. Like on Sun Cluster, the network interface to which the IP address should be assigned need not be specified because the resource agent will automatically assign it to the appropriate network interface. – Filesystem ∗ The device name of the disk partition, which should be mounted. ∗ The directory to which the disk partition should be mounted. ∗ The file system type which is used on the disk partition. – Postgres ∗ The directory to which the PostgreSQL application was installed. ∗ The directory which contains the database and configuration file of the PostgreSQL instance. ∗ The absolute path to a file, to which all messages which are written by the PostgreSQL process to stdout and stderr, are redirected. 
∗ The hostname of the IP address to which the PostgreSQL instance is bound.

The order in which the resources are defined within the resource group causes one negative side effect. The failure of a Start or Stop callback function of the AudibleAlarm or MailTo resource would cause the Cluster Resource Manager to cancel starting or stopping the resource group and to fail it over to another node. Since the two resources are only used to notify the administrators about a fail over, the failure of such a resource does not justify cancelling the start or stop of the resource group on a node, which could leave the group inactive until a human intervenes. To get around this problem, the Cluster Resource Manager was configured to ignore failures of the callback functions provided by these two resources; if such a function fails, the Cluster Resource Manager will pretend that it did not fail. Additionally, the Cluster Resource Manager was configured not to perform any monitoring operation on the two resources.

For the other resources within the group, the following behavior was configured: If the Start or Monitor callback functions fail or time out, the resource group should be failed over to another node. If the Stop callback function fails or times out, the node on which the resource group is currently hosted should be fenced. This is needed in order to fail over the resource group in this case, since the failure of a stop operation indicates that the resource is still running. Fencing the node on which the stop operation failed will implicitly stop the resource, and therefore another node can take over the resources after the node has been fenced successfully.

As already said, the STONITH resources were defined without being assigned to a specific resource group. To configure a STONITH resource, the following information has to be provided:

• The STONITH resource can be started without any prerequisites, like a successful fencing of a failed node.

• When any callback function of the STONITH resource fails, the corresponding STONITH resource should be restarted. Since the STONITH resources cannot be failed over in our configuration, this is the only sensible option.

• The name of the STONITH resource.

• The class of the STONITH resource agent, which is stonith.

• The name of the deployed STONITH device type.

• The timeout of the Start callback function.

• The timeout of the Stop callback function.

• The timeout of the Monitor callback function.

• The time interval in which the Monitor function should be performed.

• The custom resource properties of the specific STONITH resource type. The concrete custom resource properties specified for the deployed STONITH resource types are:

– wti_nps (physical STONITH device)
∗ The IP address of the STONITH device.
∗ The password which has to be specified in order to log in to the STONITH device.

– suicide (software STONITH device)
∗ No custom resource property is needed, since the suicide resource will query the name of the node which can be "STONITHed" by calling the hostname command.

In the third section, the following constraints were defined:

• The resource group of the Identity Management System database, infobase_rg, should run on spock by default.
• The resource group of the telephone directory database telebase_rg should run on sarek by default. c Stefan Peinkofer 181 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT • The STONITH resource kill_sarek, which can be used by spock to fence sarek, can only run on spock. • The STONITH resource suicide_spock, which can be used by spock to fence itself, can only run on spock. • The STONITH resource kill_spock, which can be used by sarek to fence spock, can only run on sarek. • The STONITH resource suicide_sarek, which can be used by sarek to fence itself, can only run on sarek. 7.5 Development of a Cluster Agent for PostgreSQL In the following section we will look at the development of a cluster agent for the PostgreSQL application. For Heartbeat v2 resource agents, Heartbeat provides a small library, implemented as a shell script, which currently provides some functions for logging and debugging and defines some return values and file system paths to various important Heartbeat directories. Before we can discuss the implementation of the PostgreSQL agent, we must first look at the interaction model between Heartbeat and the cluster agent. 7.5.1 Heartbeat Resource Agent Callback Model Like Sun Cluster, Heartbeat provides a fixed set of callback functions, which will be called by the cluster software under well defined circumstances. In contrast to Sun Cluster, Heartbeat provides the ability to define further callback functions. Since the only way to call these functions is to define that Heartbeat should call them in regular time intervals, the use, these additional callback functions provide, is limited. One possible use case would be to implement an additional monitor function that performs a more comprehensive health checking procedure which uses more computing resources and therefore should not be called so often as the basic monitoring function. For the predefined callback methods Heartbeat also defines the task of c Stefan Peinkofer 182 [email protected] 7.5. DEVELOPMENT OF A CLUSTER AGENT FOR POSTGRESQL this callback method and the expected return values. To implement a Heartbeat cluster agent, one executable function which contains all callback functions has to be developed. To call a specific callback function, Heartbeat will pass the method name as command line argument to the cluster agent. In fact, Heartbeat does not require that a cluster agent is written in a specific programming language, but typically the cluster agents are implemented as shell scripts. In the following we will look briefly at the predefined callback methods: • Start - This method is called when Heartbeat wants to start the resource. The function must implement the necessary steps to start the application and it must only return successfully if the application was started. • Stop - This method is called when Heartbeat wants to stop a resource. The function must implement the necessary steps to stop the application and it must only return successfully if the application was stopped. • Status - The Heartbeat documentation does not describe under which circumstances this callback method is called; it just states that it is called in many places. The purpose of the status callback method is to determine whether the application processes of the specific resource instance are running or not. • Monitor - This method is called by Heartbeat in a regular, definable time interval to verify the health of the resource. 
It only must return successfully if the specific resource instance is considered healthy, based on the performed health check procedure. • Meta-data - The Heartbeat documentation does not describe under which circumstances this callback method is called. It must return a description of the cluster agent in XML format. The description contains the definition of the resource agent properties and the definition of the implemented callback functions. The description this function returns is comparable to the resource type properties and custom resource properties which are contained in the resource type registration file of a Sun Cluster agent. c Stefan Peinkofer 183 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT Heartbeat requires that a cluster agent implements at least the Start, Stop, Status and Meta-data callback methods. 7.5.2 Heartbeat Resource Monitoring As we saw in the previous chapter, Heartbeat defines a direct callback function for the resource monitoring task. In contrast to Sun Cluster, Heartbeat requires the resource agent just to return the health status of the resource instance and the appropriate actions for a failed resource are determined and carried out by Heartbeat itself. 7.5.3 Heartbeat Resource Agent Properties Like Sun Cluster, Heartbeat defines a set of resource type properties and resource properties which are used to define the configuration of a cluster agent. As already discussed, additional custom resource properties can be specified, too. In contrast to Sun Cluster, which provides a common file for all properties, the resource type properties and custom resource properties are specified within the cluster agent and passed to Heartbeat by the Meta-data function and the resource properties are specified directly in the cluster information base. 7.5.3.1 Resource Type Properties In the following section, we will look at the resource type properties of a Heartbeat cluster agent and their corresponding attributes: • resource-agent - This property specifies general information about the cluster agent. It takes the following attributes: – name - Defines the name of the resource agent type. – version - Defines the program version of the agent. • action - This property defines a callback function which the cluster agent provides. It takes the following attributes: c Stefan Peinkofer 184 [email protected] 7.5. DEVELOPMENT OF A CLUSTER AGENT FOR POSTGRESQL – name - Defines the name of the callback function. – timeout - Defines the default timeout, after which the cluster will consider the function failed, if it has not yet returned. – interval - Defines the default interval in which the function should be called. This attribute is only necessary for monitoring functions. – start-delay - Defines the time delay Heartbeat will wait after the execution of a Start function before it calls the status function. 7.5.3.2 Custom Resource Properties To define a custom resource property, the special property parameter in conjunction with the property content has to be specified in the XML description which is printed to stdout by the Meta-data callback function. The property parameter takes the following attributes: • name - The name of the custom resource property. • unique - Defines whether the value assigned to the custom resource property must be unique across all configured instances of this cluster agent type or not. 
The content property takes the following attributes: • type - This attribute defines the data type of the custom resource property value. Valid types are: boolean, integer and string. • default - This attribute defines the default value which is assigned to the custom resource property. The values of the custom resource properties can be individually overwritten in the cluster information base, for each resource of the specific type. The values of these properties as well as the values of the normal resource properties are passed to the resource agent as environment variables which are named according the following naming scheme: $OCF_RESKEY_<property name>. c Stefan Peinkofer 185 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT 7.5.4 Creating the PostgreSQL Resource Agent A Heartbeat resource agent has to be created from scratch since Heartbeat provides neither an agent builder similar to Sun Cluster, nor a resource agent template. Since there is sparse documentation about how to create a resource agent it is a good idea to look at the cluster agents, which are delivered along with Heartbeat, to determine how a Heartbeat cluster agent should be programmed. We will look at the development of a Heartbeat cluster agent in a bit more detail than we did in the Sun Cluster section because there is still so little documentation about it. The source of the cluster agent can be found on the CD-ROM delivered along with this document. A special requirement of Heartbeat is that each function a cluster agent provides must be idempotent. 7.5.4.1 Possible Return Values The Open Cluster Framework defines a fixed set of return values a cluster agent is allowed to return. The defined return values are: • OCF_SUCCESS - Must be returned if a callback function finished successfully. • OCF_ERR_GENERIC - Must be returned if an error occurred which does not match any other defined error return code. • OCF_ERR_ARGS - Must be returned if a custom resource property value is not reasonable. • OCF_ERR_UNIMPLEMENTED - Must be returned if the callback function name, specified by Heartbeat as command line argument, is not implemented by the resource agent. • OCF_ERR_PERM - Must be returned if a task cannot be carried out because of wrong user permissions. • OCF_ERR_INSTALLED - Must be returned if the application or a tool which is used by the cluster agent is not installed. c Stefan Peinkofer 186 [email protected] 7.5. DEVELOPMENT OF A CLUSTER AGENT FOR POSTGRESQL • OCF_ERR_CONFIGURED - Must be returned if the configuration of the application instance is invalid for some reason. • OCF_NOT_RUNNING - Must be returned if the application instance is not running. It is worth mentioning that except for the Status callback function, the cluster agent must not print messages to stdout or stderr since doing so can cause segmentation faults in Heartbeat under special circumstances. To print messages, the special function ocf_log has to be used, which is provided by Heartbeats cluster agent library and writes the messages directly to the appropriate log file. 7.5.4.2 Main Function The main function of the cluster agent must perform initialization tasks like retrieving the custom resource property values from the shell environment. In addition to that, it should validate whether all external commands used by the resource agent functions are available and whether the custom resource property values are set reasonably. 
In the last step, the main function must call the appropriate callback function, which was specified by Heartbeat as command line argument. 7.5.4.3 Meta-data Function The PostgreSQL resource agent defines the following custom resource properties: • basedir - Defines the absolute path of the base directory, to which PostgreSQL was installed. This value does not have to be unique, since many resource instances can use the same application binaries. • datadir - Defines the absolute path of the directory in which the database and configuration files of the application instance are stored. This value has to be unique since every instance must use different database and configuration files. • dbhost - Defines the hostname or IP address on which the specific PostgreSQL instance is listening. c Stefan Peinkofer 187 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT • logfile - Defines the absolute path of a file, to which the stdout and stderr output of the PostgreSQL instance is redirected. 7.5.4.4 Start Function The Start function should validate the configuration of the application instance before it tries to start the application and return OCF_ERR_CONFIGURED if the configuration is invalid. The Start function of the PostgreSQL resource agent performs the following steps: • Determine if the directory specified by the custom resource property datadir contains PostgreSQL database and configuration files. If not, return OCF_ERR_ARGS. • Determine if the version of the database files matches the deployed PostgreSQL version4 . If not, return OCF_ERR_CONFIGURED. • Determine if the specified instance of PostgreSQL is already running. If so, return OCF_SUCCESS immediately for idempotency reasons. • Remove the application state file postmaster.pid, if it exists. This step is needed because PostgreSQL will store the key of its shared memory area in this file. In an active/active configuration it is very likely that both PostgreSQL instances will use the same key, since they are running on different nodes. However, if one node dies and the instance is failed over, PostgreSQL will refuse to start the instance on the other node as a precaution, because a shared memory segment with the same key it used before already exists. PostgreSQL suggests the following two options to deal with such a situation: – Remove the shared memory segment manually, which cannot be done in this special case because the shared memory segment belongs to another PostgreSQL instance. – Remove the postmaster.pid file, which will cause PostgreSQL to create a new shared memory segment, which then is implicitly assigned a different key. 4 The format of the database files can change between major releases. c Stefan Peinkofer 188 [email protected] 7.5. DEVELOPMENT OF A CLUSTER AGENT FOR POSTGRESQL • Call the appropriate command, which starts the PostgreSQL instance. • Wait five seconds and then determine if the specified instance of PostgreSQL is running. If so, return OCF_SUCCESS, if not return OCF_ERR_GENERIC. 7.5.4.5 Stop Function The Stop function of the PostgreSQL resource agent performs the following steps: • Call the appropriate command which stops the PostgreSQL instance. Do not check to determine whether the call returned successfully or not because of idempotency reasons. • Determine if the specified application instance is still running. If so return OCF_ERR_GENERIC, if not return OCF_SUCCESS. 
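To make the structure described in the preceding sections more tangible, the following listing sketches how such an agent can be put together as a shell script. This is a simplified illustration only, not the agent that was actually developed for this thesis (that one is on the accompanying CD-ROM): the agent name pgsql-sketch, the helper function names, the default values, the timeout figures and the assumption that the PostgreSQL binaries live under <basedir>/bin are all illustrative, the return codes are hard-coded to the conventional OCF exit values instead of being taken from Heartbeat's shell script library, and the Status and Monitor callbacks, which the next sections describe, are reduced to a simple process check.

#!/bin/sh
# Sketch of a Heartbeat v2 (OCF-style) resource agent for PostgreSQL.
# Illustrative only: a real agent would source Heartbeat's shell script
# library to obtain the return value definitions and the ocf_log function;
# here the conventional OCF exit codes are hard-coded to keep the example
# self-contained.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_ARGS=2
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# The resource properties arrive as environment variables named
# $OCF_RESKEY_<property name>.
BASEDIR="${OCF_RESKEY_basedir:-/usr/local/pgsql}"   # assumed default
DATADIR="${OCF_RESKEY_datadir}"
LOGFILE="${OCF_RESKEY_logfile:-/dev/null}"
PG_CTL="$BASEDIR/bin/pg_ctl"   # assumes the binaries live under <basedir>/bin

pgsql_running() {
    # pg_ctl exits with 0 if a server is running in $DATADIR.
    "$PG_CTL" -D "$DATADIR" status >/dev/null 2>&1
}

pgsql_start() {
    pgsql_running && return $OCF_SUCCESS              # idempotency
    [ -f "$DATADIR/PG_VERSION" ] || return $OCF_ERR_ARGS
    rm -f "$DATADIR/postmaster.pid"                   # avoid the stale shared memory key
    "$PG_CTL" -D "$DATADIR" -l "$LOGFILE" start >/dev/null 2>&1
    sleep 5
    pgsql_running && return $OCF_SUCCESS
    return $OCF_ERR_GENERIC
}

pgsql_stop() {
    # The exit code of the stop command is deliberately ignored (idempotency).
    "$PG_CTL" -D "$DATADIR" -m fast stop >/dev/null 2>&1
    pgsql_running && return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

pgsql_status() {
    if pgsql_running; then
        echo "running"; return $OCF_SUCCESS
    else
        echo "stopped"; return $OCF_NOT_RUNNING
    fi
}

pgsql_meta_data() {
    cat <<'END'
<?xml version="1.0"?>
<resource-agent name="pgsql-sketch" version="0.1">
  <parameters>
    <parameter name="basedir" unique="0">
      <content type="string" default="/usr/local/pgsql"/>
    </parameter>
    <parameter name="datadir" unique="1">
      <content type="string" default=""/>
    </parameter>
    <parameter name="logfile" unique="1">
      <content type="string" default="/dev/null"/>
    </parameter>
  </parameters>
  <actions>
    <action name="start"     timeout="60"/>
    <action name="stop"      timeout="60"/>
    <action name="status"    timeout="30"/>
    <action name="monitor"   timeout="30" interval="60" start-delay="10"/>
    <action name="meta-data" timeout="10"/>
  </actions>
</resource-agent>
END
    return $OCF_SUCCESS
}

# Heartbeat passes the name of the callback as the first command line argument.
case "$1" in
    start)      pgsql_start ;;
    stop)       pgsql_stop ;;
    status)     pgsql_status ;;
    monitor)    pgsql_status ;;   # the real agent performs its SQL checks here
    meta-data)  pgsql_meta_data ;;
    *)          exit $OCF_ERR_UNIMPLEMENTED ;;
esac
exit $?

The case statement at the end implements the callback model of chapter 7.5.1: Heartbeat passes the callback name as the first command line argument, the agent dispatches to the corresponding function and returns OCF_ERR_UNIMPLEMENTED for callbacks it does not provide.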
7.5.4.6 Status Function The Status function of the PostgreSQL resource agent performs the following step: • Determine if a PostgreSQL process exists in the process list, which uses the directory specified by the custom resource property datadir as instance directory. If so print running to stdout and return OCF_SUCCESS. If not print stopped to stdout and return OCF_NOT_RUNNING. 7.5.4.7 Monitor Function The Monitor function of the PostgreSQL resource agent performs the following steps: • Determine if the specified instance of PostgreSQL is already running. If not, return OCF_NOT_RUNNING. This is important since it is not guaranteed that the monitoring function is not called until the start function is called. Returning OCF_ERR_GENERIC in this case would indicate to Heartbeat that the resource has failed and Heartbeat would trigger the appropriate action for a failed resource, like failing over the resource for example. c Stefan Peinkofer 189 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT • Connect to the PostgreSQL server, which listens on the hostname or IP address as specified by the custom resource property dbhost. Then call the following SQL commands: – Remove the test database called hb_rg_testdb. – Create the test database hb_rg_testdb again. – Create a database table within the test database. – Insert a data record to the test table. – Select the inserted data record from the test table. – Delete the test table. – Remove the test database. • If any of the performed SQL commands, except for the first database remove call, failed, return OCF_ERR_GENERIC, otherwise return OCF_SUCCESS. 7.6 Evaluation of Heartbeat 2.0.x As already discussed in chapter 4.3 on page 73, a “brand new“ software version should not be deployed in a production environment without performing a comprehensive set of test cases, in the first place. The following sections will discuss the testing process which was used to evaluate the maturity of Heartbeat 2.0.2. 7.6.1 Test Procedure Used Usually, software is tested by comparing the actual behavior of the software with the expected behavior described by the software specification. The Heartbeat 2.0.x implementation orients itself on the Open Cluster Framework specification. Unfortunately, not the whole OCF specification is implemented in Heartbeat 2.0.x yet and the OCF specification does not cover all things which are implemented in Heartbeat 2.0.x. For Heartbeat 2.0.x itself, no real specifications exist. The only available information about the desired behavior is the Heartbeat documentation. Unfortunately, the sparse documentation which is available is not sufficient to derive a complete c Stefan Peinkofer 190 [email protected] 7.6. EVALUATION OF HEARTBEAT 2.0.X specification. In addition to that, the behavior of Heartbeat is mainly swayed by the deployed configuration. To test Heartbeat 2.0.x under these conditions, it was chosen to create a test procedure which initiates common cluster events and failure scenarios. The reaction of Heartbeat to these failure scenarios was then compared to the expected behavior which was derived partly from the available documentation, partly by comments in the source code and partly by implicit knowledge of cluster theory. Although Heartbeat provides an automated test tool called Cluster Test System (CTS) it was chosen to use a manual test procedure. 
This decision is mainly founded on the following thoughts: • The Heartbeat developers could not guarantee that the CTS would work with the configuration, which was created for our cluster system since the CTS cannot deal with all possible CIB constructs. • Setting up the CTS test environment would take a lot of time, which would be wasted in case the CTS really could not deal with the concrete CIB configuration. 5 The several steps of the developed test procedure, as well as the expected behavior, are shown in table 7.1. (Note: With the terms of starting and stopping a resource group, it is meant that the resources belonging to the resource group are started or stopped and it is implicitly assumed that they are started or stopped in the right order.) Since at the time the test steps were developed, Heartbeat provided no function to manually fail over resource groups yet, the auto_failback option was enabled for the test procedure so that a resource group is automatically failed back to the default node by the time the node joins the cluster again. In addition to that, Heartbeat was only started manually and not automatically at system start. 5 The initial timeline for the practice part was already violated because of two unexpected software bugs, found in the Solaris operating system and the Sun Cluster software. c Stefan Peinkofer 191 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT Step Test Case Expected Behavior 1 Start Heartbeat simultane- Both nodes are able to communicate over the ously on both nodes cluster interconnect. Sarek will start the kill_spock and suicide_sarek resources and the telebase_rg resource group. Spock will start the kill_sarek and suicide_spock resource and the infobase_rg resource group. 2 Stop Heartbeat on sarek Sarek will stop its resources in the right order and inform spock that it is going to shut down. Spock will start the telebase_rg resource group after all resources are stopped on sarek, without fencing spock. Spock will not start the kill_spock and suicide_sarek resources. 3 Start Heartbeat on sarek Sarek will rejoin the cluster and start the again kill_spock and suicide_sarek resources. Spock will stop the telebase_rg resource group. After it is stopped, sarek will start the telebase_rg resource group. 4 Initiate a split brain failure Both nodes will discover that the other node is dead. by disconnecting all cluster One of the nodes will STONITH the other node, be- interconnects fore the other node is able to issue a STONITH operation as well. The remaining node will take over the resource group of the other node but not until the STONITH operation is completed successfully. The remaining node will not take over the kill_* and suicide_* resources of the dead node. c Stefan Peinkofer 192 [email protected] 7.6. EVALUATION OF HEARTBEAT 2.0.X Step Test Case Expected Behavior 5 After the killed node has re- The node will rejoin the cluster and take over its re- booted, reconnect the clus- source group and the kill_* and suicide_* re- ter interconnect paths and sources, as described in step 3. start Heartbeat on that node again 6 Bring down spock without Sarek will discover that spock is dead. Sarek shutting down Heartbeat will STONITH spock. Sarek will start the infobase_rg resource group but not until the STONITH operation is completed successfully. Sarek will not start the kill_sarek and suicide_spock resources. 7 Start Heartbeat on spock Same result as in step 3 just with interchanged roles. 
again, after it is rebooted 8 Stop Heartbeat on sarek Same result as in step 2. 9 Stop Heartbeat on spock Spock will stop the infobase_rg and the telebase_rg as well as the kill_sarek and suicide_spock resource. 10 Start Heartbeat simultane- Same as in step 1. ously on both nodes 11 Shut down Heartbeat on Sarek and spock will encounter that the whole both nodes simultaneously. cluster should be shut down. Each node will stop its kill_* and suicide_* resources and its resource group. The nodes will not try to take over the resources of each other. 12 Start Heartbeat simultane- Same as in step 1. ously on both nodes c Stefan Peinkofer 193 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT Step Test Case Expected Behavior 13 Let Heartbeat run for least a No special events, triggered by a failure of Heartbeat week will occur. Table 7.1: Heartbeat Test Procedure It has to be mentioned that this test procedure covers only the basic functionality of Heartbeat v2. What is left completely unaddressed for example are test cases which check whether Heartbeat properly reacts to failures of resource agent callback methods. The actual plan was to verify the basic functionality of Heartbeat with the described test procedure and to develop further test cases after that. Unfortunately, it took too much time to fix the problems which were discovered by the basic test procedure so that no time was left to develop further test cases. Starting with version 2.0.2 of Heartbeat, the test procedure was run through until the observed behavior of a step departured from the expected behavior. In such a case, the failure was reported to the Heartbeat developers. Depending on the fault, it was decided whether it made sense to continue with the testing of the specific version or not. After the problem was fixed by the developers, the test procedure was run through from the beginning on the new version. This loop should last until either the observed behavior of each step matches exactly the expected behavior or until the time plan of this thesis prevents us from continuing with testing. 7.6.2 Problems Encountered During Testing In the following section we will look at the various software bugs of Heartbeat which were encountered during the test process. As mentioned, the test process started with version 2.0.2 of Heartbeat. Unfortunately, the Heartbeat developers provide no patches to the customers which would fix the bugs in the version in which it was encountered but they fix the bugs only in the actual development version, which can be retrieved from the CVS repository. That is why all found bugs, except for the first one, refer to the development version 2.0.3. c Stefan Peinkofer 194 [email protected] 7.6. EVALUATION OF HEARTBEAT 2.0.X 7.6.2.1 Heartbeat Started Resources Before STONITH Completed While performing test step 4 it was encountered that when a node triggered the STONITH of the other node, it did not wait until the STONITH operation completed before it started to take over the resource group of the other node. Therefore for a small period of time, the resources were active simultaneously on both nodes, which led to data corruption. After reporting the bug to the Heartbeat mail list, a developer responded that the problem is already known and fixed in the current CVS version. Unfortunately, it turned out that the problem was only fixed for resources which are not contained in a resource group. 
After reporting this to the mail list, the problem was fixed for the resources contained in a resource group as well. 7.6.2.2 The Filesystem Resource Agent Returned False Error Codes With the new CVS version, which fixed problem 1, a new bug was introduced which was encountered by test step 1. The callback functions Status and Monitor of the Heartbeat cluster agent Filesystem, which is responsible for mounting and unmounting shared disks, returned OCF_ERR_GENERIC when the resource was not started yet. As discussed in chapter 7.5.4.1 on page 186, the right error code for this scenario would be OCF_NOT_RUNNING. Since Heartbeat calls the Monitor method once before it calls the Start method, this caused Heartbeat not to start the resource groups, since it assumes that a return code of OCF_ERR_GENERIC indicates that a resource is unhealthy, even if it is not started yet. What happened is that each node tried to start its resource group, which failed because of the wrong return code, so the resource groups were failed over to the respective other node, on which the start of the resource group failed as well, of course. After that, Heartbeat left the resource groups alone, to avoid further ping-pong fail overs. Fortunately, the problem was easy to isolate and so a detailed description to fix the bug was provided to the Heartbeat developers. 7.6.2.3 Heartbeat Could not STONITH a Node Again, the new CVS version, which fixed problem 2, introduced a new bug, which was encountered by test step 4. Apparently, Heartbeat was not able to carry out STONITH operations. The c Stefan Peinkofer 195 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT cause of this problem was an unreasonable “if-condition“ in the STONITH code, which caused the STONITH operation to return with an error before the actual STONITH command was called. So what happened was that both nodes continuously tried to issue STONITH operations to fence the other node, which failed. The problem was reported to the Heartbeat developers, who fixed the unreasonable “if-condition“. 7.6.2.4 Heartbeat Stopped Resources in the Wrong Order After problem 3 was fixed, it was encountered that the specific stop procedure of Heartbeat, carried out in test step 9, stopped the resources in the resource group not in the right order, but in a random one. After a basic fault analysis, it turned out that this behavior could only be observed when Heartbeat was shut down on the last cluster member. The effects of this bug were random; sometimes everything worked fine, sometimes the application data disk remained mounted after Heartbeat was stopped. The problem was reported to the Heartbeat developers, who included the problem in their bug database. A final solution is still pending, since it seems that a fix will require a significant amount of work to be done. 7.6.2.5 STONITH Resources Failed After Some Time The last problem which was encountered was very hard to analyze because the actual problem cause was not Heartbeat itself but the deployed STONITH devices. During test step 13, the STONITH resources of the physical STONITH devices became inactive after some time. The available Heartbeat log files showed that the monitoring function of the STONITH resource timed out. After that Heartbeat stopped the STONITH resource and then tried to restart the STONITH resource again, which failed. 
After 2 seconds, a log file message which said that the STONITH daemon, which is responsible for carrying out all STONITH operations on the corresponding node, was killed because of a segmentation fault. Since the segmentation fault log message was not generated by the STONITH daemon itself but by Heartbeat, which monitors its child processes and respawns them if they exit, we assumed that the segmentation fault of the STONITH daemon actually happened already before the monitoring method timed out but was logged only after the timeout. Therefore we assumed that the segmentation fault was the c Stefan Peinkofer 196 [email protected] 7.6. EVALUATION OF HEARTBEAT 2.0.X initial cause of the problem. The Heartbeat developers said that they already knew this problem, but were not able to reproduce it reliably and asked us if they could get access to our systems to track the bug. So we gave them access to our machines and they fixed the segmentation fault problem. Unfortunately, it turned out that the segmentation fault did not cause the problem, but was only a consequence of it, since even with the fixed version of Heartbeat, the STONITH resources still became inactive after some time; the only difference was that the STONITH daemon caused no segmentation fault anymore. After a short period of perplexity, we decided to concentrate on the timed out monitoring method of the STONITH resource, which connects to the STONITH device over the network, calls the help command and disconnects from the STONITH device. Since Heartbeat v1 provides a command line tool, which performs the same procedure as the monitoring method, it was chosen to exercise the STONITH device by continuously calling this tool in a loop. The intention of this test was to figure out if the problem is caused by Heartbeat itself, by the STONITH code or by the STONITH device6 . At the beginning of this test, the monitoring command completed within a second. After a short time period, the command took about 3 to 5 seconds to complete and after another short period of time, the monitoring command completed unsuccessfully after 30 seconds. During this period, the monitoring function repeatedly printed out error messages, which said that it was not able to log in to the STONITH device. After stopping the test after this event, it was tried to ping the STONITH device and it didn’t respond to the ping request anymore. Funnily enough, the STONITH device began to respond to the ping messages after 2 minutes again, which is the configured network connection timeout on the STONITH device and the monitoring function could be called successfully again, too. So the fault could be isolated to a firmware bug of the STONITH device. Unfortunately, the manufacturer of the deployed STONITH devices does not provide firmware upgrades at all, so the source problem could not be eliminated. Since the STONITH device recovered automatically after 2 minutes, the last idea to work around the problem was to call the monitoring 6 In fact no one expected that the problem could be caused by the STONITH device at this point. c Stefan Peinkofer 197 [email protected] CHAPTER 7. IMPLEMENTING A HIGH AVAILABILITY CLUSTER SYSTEM USING HEARTBEAT function in time intervals higher than 2 minutes. Unfortunately, it turned out that even with a time interval of 4 minutes, the problem still occurred. The only difference was that it occurred not after some minutes but after some days. So it was decided to replace the STONITH devices with other ones. 
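For reference, the exercising loop mentioned above amounted to little more than a small shell wrapper of the following shape. The listing is schematic: the actual stonith invocation depends on the deployed STONITH device type and its login parameters, so the command is left as a placeholder variable that has to be filled in for the concrete device; only the timing and logging logic is shown.

#!/bin/sh
# Schematic hammer test for a STONITH device: repeatedly run the Heartbeat v1
# "stonith" command line tool and log how long each call takes.  The concrete
# command, including device type and login parameters, is an assumption and
# must be adapted to the deployed STONITH device.
STONITH_CMD=${STONITH_CMD:-"stonith -h"}   # placeholder invocation

i=0
while :; do
    i=$((i + 1))
    start=$(date +%s)
    if $STONITH_CMD >/dev/null 2>&1; then
        result="ok"
    else
        result="FAILED"
    fi
    end=$(date +%s)
    echo "$(date '+%Y-%m-%d %H:%M:%S') call $i: $result after $((end - start)) s"
    sleep 1
done

Logging the duration of every call makes the gradual degradation described above (sub-second responses, then 3 to 5 seconds, then a 30 second timeout) directly visible in the output.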
After an evaluation of which other STONITH devices are supported by Heartbeat and which of them provide the ability to connect to at least two different power sources so the power source of a node does not become a single point of failure, it turned out that only one other STONITH device can be used7 . Unfortunately, it turned out that this STONITH device is hard to get in Germany and so it didn’t arrive in time for this thesis. Since this version of Heartbeat passed the test procedure in a tolerable manner, because except for the STONITH resource problem which was not caused by Heartbeat and the resource stop problem, which occurs only under rare circumstances, the whole test procedure could be completed successfully, it was decided to stop the tests at this point. It has to be mentioned that the CVS version will not be used if this system goes into production use. Therefore the test process cannot be considered finished since because of the ongoing development it is very likely that new bugs are introduced. At least after the announcement of the feature freeze for the next stable Heartbeat version, the test process has to be run through again. Unfortunately, this was not possible because the feature freeze was announced too late for this thesis. 7 In fact there was a second one but it is not available anymore because production was discontinued. c Stefan Peinkofer 198 [email protected] Chapter 8 Comparing Sun Cluster with Heartbeat In the following sections we will look at the differences and similarities between Sun Cluster and Heartbeat. Thereby we will limit our discussion to a high-level comparison, since comparing the two products on the implementation level would be like comparing apples and oranges. The first section will be limited to the comparison of the pure Cluster Software. Since a high availability cluster solution has to be seen as a complete system, consisting of hardware, operating system and cluster software, we will look at further pros and cons that result from the concrete system composition: • Sun Cluster - Solaris - SPARC hardware • Heartbeat v2 - Linux - x86 hardware 8.1 Comparing the Heartbeat and Sun Cluster Software The following section will discuss the benefits and drawbacks of Heartbeat and Sun Cluster. 8.1.1 Cluster Software Features • Maximum number of cluster nodes - Sun currently officially limits the number of supported nodes to 16. However, parts of the cluster software, like the global device file system are obviously already prepared for 64 nodes. So it seems to be very likely that Sun 199 CHAPTER 8. COMPARING SUN CLUSTER WITH HEARTBEAT Cluster will support 64 nodes with one of the next software releases. Heartbeat v2 has no limitation of the number of cluster nodes. At the time of this writing, Heartbeat has been verified to run on a 16-node cluster. • Supported Operating Systems - Sun Cluster can only be deployed on Solaris for SPARC and Solaris for x86 whereby the x86 version provides only a subset of the features from the SPARC version. The limitation to Solaris results mainly from the tight integration of Sun Cluster and the Solaris kernel. Although Heartbeat is also called the Linux-HA project, it is not limited to the use of Linux as operating system. Heartbeat is also known to run on FreeBSD, Solaris and MacOS X. In fact, when Heartbeat can be compiled cleanly on an operating system there are good chances that it will also run on the corresponding OS. 
One person even tried to cluster Windows servers by using Heartbeat in the cygwin environment. Unfortunately it is not known whether the experiment was successful or not. • Supported Shared File Systems - As we already know, Sun Cluster supports the Sun Cluster proxy file system and the Sun QFS SAN file system as shared cluster file systems. At the time of this writing, Heartbeat currently does not support any shared file system. This does not necessarily mean that Heartbeat won’t work with a shared file system; it only means that Heartbeat requires the user to find out whether Heartbeat works in conjunction with a specific shared file system or not. However, the Heartbeat developers plan to support the Oracle Cluster File System (OCFS) 2, which is distributed under the terms of the GNU Public License, with one of the next Heartbeat releases. • Out-of-the-box Cluster Agents - Heartbeat v2 currently only ships with OCF resource agents for Apache, DB2, IBM Websphere and Xinetd. Sun Cluster currently provides 32 cluster agents, which support, amongst others, Oracle RAC (Real Application Cluster, SAP, IBM Websphere and Siebel. c Stefan Peinkofer 200 [email protected] 8.1. COMPARING THE HEARTBEAT AND SUN CLUSTER SOFTWARE 8.1.2 Documentation Sun provides a comprehensive set of documentation, which is comprised of a few thousand pages. Despite the great size of the documentation, it is well structured in several documents, so the right documentation for a particular topic can be retrieved relatively quickly and the documentation is always kept up-to-date. The documentation itself is mainly written as step-by-step instructions which lead the user straight to the desired goal. The only drawback, which is in fact only experienced by expert users, is that the documentation does not always describe how particular tasks are carried out by Sun Cluster in detail. But since this knowledge is usually not needed to build a “supported“ Sun Cluster system it may be legitimate from a normal user’s point of view. In addition to the general Sun Cluster documentation, a comprehensive “on-line“ help guide for the various cluster commands is available in the form of UNIX man pages. The Heartbeat documentation provides great room for improvement. First of all, the available documentation is very unstructured, which makes it very time consuming to retrieve the information for a particular topic. Second, only one documentation set for all available Heartbeat versions exists, which makes it in some cases very hard to determine whether particular information is valid for the concrete Heartbeat version deployed. Third, the documentation leaves some important topics either completely unaddressed or contains only a subset of the needed information. Fourth, Heartbeat provides virtually no “on-line“ help for the various cluster commands. The only advantage of the Heartbeat documentation is that it provides some information about how certain things are implemented, which are of course not very useful for users who just want to build a Heartbeat cluster, but are very interesting to people who want to learn something about how a high availability system could be implemented. 8.1.3 Usability Sun Cluster provides a comprehensive set of command line tools which can be used to configure and maintain the cluster system. The common command line options of the tools are named consistently throughout all commands, which eases the use of the tools and after a short c Stefan Peinkofer 201 [email protected] CHAPTER 8. 
COMPARING SUN CLUSTER WITH HEARTBEAT adaptation phase, the commands can be used nearly intuitive1 . In addition to that, the command line tools prevent the user from accidentally misusing the tools, by verifying whether the effects caused by the execution of the specific command are sensible or not. In addition to the command line tools, the Sun Cluster software also provides a graphical user interface for configuring and maintaining the cluster. However, not all possible tasks can be carried out by the graphical user interface, but for configuring a cluster and for performing normal “day-to-day“ tasks, the graphical user interface should be sufficient. In addition to that, the SUN Cluster software provides an easy to use graphical user interface which allows even users with virtually no programming experience to create custom cluster agents. The command line tools provided by Heartbeat are still evolving. In Heartbeat version 2.0.2 even some important command line tools, like a tool which allows the user to switch individual resource groups on- and offline, were missing. However, version 2.0.3 will introduce the missing commands. Compared to the Sun Cluster command line tools, the tools provided by Heartbeat are more complex to use. Another drawback in the usability of Heartbeat is that a user needs programming experience to create a cluster agent. However, the greatest drawback in the usability of Heartbeat is the configuration of the Cluster Information Base, since the structure of the XML file is very complex and it provides an overwhelming number of set screws which probably ask too much of a not so experienced user. Fortunately, this drawback was already recognized by the Heartbeat developers and so they are currently developing a graphical user interface for configuring Heartbeat and the Cluster Information Base. 8.1.4 Cluster Monitoring Sun Cluster provides a cluster monitoring functionality by using a Sun Cluster module for the general purpose monitoring and management platform Sun Management Center. Heartbeat itself provides only a simple, text oriented monitoring tool. However, through the use of additional software components like the Spumoni program, which enables virtually any program 1 Yes, command line tools can be intuitive, at least to UNIX gurus. c Stefan Peinkofer 202 [email protected] 8.1. COMPARING THE HEARTBEAT AND SUN CLUSTER SOFTWARE which can be queried via local commands to be health-checked via SNMP, Heartbeat can be integrated in enterprise level monitoring programs like HP OpenView or OpenNMS. 8.1.5 Support In the following we will compare the support which is available for the discussed cluster products. Therefore, we will look first at the support which is available at no charge and then at the additional support through commercial support contracts. 8.1.5.1 Free Support Sun provides two sources for free support for the Sun Cluster software. Source one is a knowledge database called SunSolve. This database provides current software patches as well as information of already known bugs and troubleshooting guides for common problems. However, the knowledge base does not contain all the information which is contained in Sun’s internal knowledge base and therefore for some problems, it is necessary to consult the Sun support to get information about how to fix the problem. The second source is a Web forum for Sun Cluster users. 
Registered users can post their questions to the forum but use of this forum seems to be very limited, since most of the questions to which users have replied could easily have been answered by the SunSolve knowledge database and to most of the questions which could not be answered by SunSolve, users have not replied. The Heartbeat community provides free support over their Heartbeat user mailing list, which is also available as a searchable archive. In addition to the mailing list, a Heartbeat IRC channel exists, over which a user can get in touch with the Heartbeat developers in real time. Questions to the mailing list are usually answered within 24 hours, whereby most of the questions are directly answered by the Heartbeat developers themselves, who are very friendly and patient2 . If response time is no issue, the quality of support, provided through the mailing list, can be 2 A factor which is not self-evident in Open Source and commercial software forums. c Stefan Peinkofer 203 [email protected] CHAPTER 8. COMPARING SUN CLUSTER WITH HEARTBEAT compared to the quality of the commercial telephone support for Sun Cluster. 8.1.5.2 Commercial Support Sun provides two commercial support levels for the Sun Cluster software, standard level and premium level support, whereby the support is already included in the license costs for the software. The standard support level allows the user to submit support calls during extended business hours which are 12 hours from Monday to Friday. The reaction time3 depends on the priority of the support call and is 4 hours for high priority, 8 hours for medium priority and 24 hours for low priority support calls. The premium support level allows the user to submit support calls 24 hours a day, 7 days a week. The reaction time is 2 hours for medium priority and 4 hours for low priority support calls. High priority support calls will be immediately transferred to a support engineer. In addition to this support, Sun offers the opportunity to place a contract with Suns Remote Enterprise Operation Services Center (REOS) which will undertake the task of doing the installation of the system as well as remote monitoring and administration tasks. The Heartbeat community itself does not provide commercial support. However, third parties like IBM Global Services or SUSE/Novell provide the ability to place support contracts for Heartbeat. SUSE for example provides various support levels for their SUSE Linux Enterprise distribution, which includes support for Heartbeat. Unfortunately, currently only Heartbeat v1 is supported by SUSE, since the SUSE Linux Enterprise distribution does not yet contain Heartbeat v2. The support levels vary from support during normal business hours and 8 hours response time, to 24/7 support with 30 minutes response time. The costs for this support varies from 8,100 EUR to 343,000 EUR per year whereby the support seems to enfold all SUSE Enterprise Linux installations of the organization. In addition to this support, SUSE provides also a remote management option, which is very similar to Sun’s REOS. 3 This is the time interval which is allowed to elapse between the point in time the support call is submitted and the point in time a support engineer responds to the call. c Stefan Peinkofer 204 [email protected] 8.1. COMPARING THE HEARTBEAT AND SUN CLUSTER SOFTWARE 8.1.6 Costs Currently, the license costs for Sun Cluster constitute 50 EUR per employee per year, which includes standard support for the Sun Cluster software. 
For premium support, the license costs are 60 EUR. In addition to that, Sun charges further license costs for some cluster agents. Since Heartbeat is distributed under the terms of the GNU Public License, it is available at no cost. 8.1.7 Cluster Software Bug Fixes and Updates Bugs encountered in a specific Sun Cluster version are fixed by applying patches, which are provided by Sun over the SunSolve knowledge base. Therefore bug fixes can be applied without upgrading the software to a new version, whereby nearly all patches can be applied by a rolling upgrade process. For Sun Cluster version updates, a distinction must be made between minor and major version updates. Minor version updates, which are denoted by extending the version number by the release date of the update, can be performed by a rolling upgrade process. Major version updates, for example from version 3.0 to 3.1, require the shutdown of the whole cluster and therefore cannot be applied by a rolling upgrade process. The same is true for updates of the Solaris operating system. This is caused by the tight integration of Sun Cluster and the Solaris kernel. Bugs encountered in a specific Heartbeat version cannot be fixed by applying patches, since the Heartbeat developers do not provide patches. The only chance to fix the bug is to deploy a successor version which does not contain the bug. This can mean that if no stable successor version exists yet, either the unstable CVS version has to be used or the user must wait until the next stable version is released. The only way to get around this problem would be to use a Linux distribution which provides back ports of recent bug fixes, for the Heartbeat version which was shipped with the Linux distribution. All Heartbeat version updates, except the update from v1 to v2, can be performed by a rolling upgrade process. In addition to that, all types of operating system updates can be performed by a rolling update process, too, since Heartbeat is decoupled from the deployed operating system kernel. c Stefan Peinkofer 205 [email protected] CHAPTER 8. COMPARING SUN CLUSTER WITH HEARTBEAT 8.2 Comparing the Heartbeat and Sun Cluster Solutions In the following sections we will look at further benefits and drawbacks which result from the concrete combination of Heartbeat together with Linux on x86 hardware and Sun Cluster together with Solaris on SPARC hardware. 8.2.1 Documentation Although the documentation of Linux and x86 hardware is not as bad as the Heartbeat documentation, the documentation of Solaris and SPARC hardware is still better. This is mainly founded on the fact that all documentation of Sun Cluster, Solaris and SPARC servers can be accessed over a common Web site, which is well structured and covers all important issues in step-by-step guides. In addition to that, the Solaris “on-line“ help, provided by UNIX man pages, provides far more information than the man pages of Linux and in contrast to Linux, Solaris provides a man page for every command line and graphical application. 8.2.2 Commercial Support Since virtual identically support contracts can be placed for Linux and x86 servers as well as for Solaris and SPARC servers, no differences in the available support levels exist. However, the SPARC solution provides one advantage: The support of the overall system is provided by a single company whereas for the x86 solution, at least two companies are involved, namely the company which provided the hardware and the company which provided the Linux distribution. 
Theoretically this should constitute no drawback, but in the real world it occurs from time to time that a support division of a company shifts the responsibility on to the support division of another company and the other way round. For example, consider a failure scenario in which a server reboots from time to time without giving a hint of what caused the problem. The company which provides the support for Linux will begin with saying that this is not caused by Linux but by the server hardware, and the company which provides the support for the hardware will say it is caused by Linux. So to get support at all, the customer must first prove that the c Stefan Peinkofer 206 [email protected] 8.2. COMPARING THE HEARTBEAT AND SUN CLUSTER SOLUTIONS problem is caused by either Linux or the server hardware. With a SPARC solution, the task of determining which component caused the failure is totally relinquished to the Sun support. 8.2.3 Software and Firmware Bug Fixes The main advantage of the SPARC solution for software and firmware patches is that all required patches can be downloaded from a common Web page. In addition to that, Sun usually keeps track of patch revision dependencies between the cluster software, the operating system and firmware patches and notifies users about this dependencies in the respective patch documentation. Since with a Linux solution, at least two companies are always involved, the operating system and firmware patches have to be downloaded from two different places and it is not guaranteed that the companies keep track of dependencies between their patches and the patches of other companies. 8.2.4 Costs The overall costs for servers and operating system for a x86 solution should be about 10 to 20 percent lower than the costs for a comparable SPARC solution. This is because of the slightly higher hardware costs for SPARC systems, since license costs are demanded neither for Linux nor for Solaris. 8.2.5 Additional Availability Features The use of midrange and enterprise level SPARC servers in conjunction with Solaris provides further availability features. This features are discussed in the following section. • Hot plug PCI bus - PCI devices can be removed and added without rebooting the system. Indeed, some x86 servers provide this feature too, but not all PCI cards, available for x86 severs support this feature and using the hot plug functionality with Linux is more complex than with Solaris. c Stefan Peinkofer 207 [email protected] CHAPTER 8. COMPARING SUN CLUSTER WITH HEARTBEAT • Hot plug memory and CPUs - Memory and CPUs can be removed and added without rebooting. Although it seems that some x86 systems support this feature4 , the Linux support for hot plug memory and CPUs is still in the alpha state. • Automatic recovery from CPU, memory and PCI device failures - If one of the mentioned components fail, the system will be rebooted by a kernel panic. During the reboot, the failed component will be unconfigured and the system will restore system operation without using the failed component. Unfortunately, no information about whether x86 systems provide such a functionality could be found. 8.2.6 “Time to Market“ Assuming that Heartbeat v2 works as expected, the overall time which is currently needed to design and configure a Sun Cluster system is less than the time which is needed to build a Heartbeat v2 system. This is mainly caused by the lack of documentation of the Heartbeat software. 
However, assuming that the documentation of Heartbeat is as good as the documentation for Sun Cluster, the time to market for a simple cluster configuration which does not use a shared file system and software mirrored shared disks and which does not require the development of a cluster agent would be approximately the same for both system types. For more complex cluster configurations, the time to market should be less for Sun Cluster since these configurations can be implemented in a very straightforward way, whereas a Linux - Heartbeat combination usually requires the user to perform complex configuration tasks to implement a complex configuration. 8.3 Conclusion As we have seen, the harnessed team Sun Cluster, Solaris and SPARC provides a comprehensive high availability cluster solution, which is mature and reliable enough to be deployed in a production environment. However, if commercial support from Sun is required it is mandatory 4 For example the ES7000 servers from Unisys. c Stefan Peinkofer 208 [email protected] 8.3. CONCLUSION that the concrete cluster configuration matches all special configuration constraints of the various applications. For the harnessed team Heartbeat v2, Linux and x86, things look different yet. As we have seen, Heartbeat v2 still contains too many bugs and the documentation is not good enough. However, with the basic design of Heartbeat v2 and the already planned improvements, Heartbeat v2 has the potential to become the best freely available cluster solution which does not need to hide from commercial cluster systems. If the documentation of Heartbeat is improved, there will be no reason not to deploy a Heartbeat v2 cluster in a production environment. So it is worth it to keep an eye on the evolution of Heartbeat v2. Linux and the x86 hardware lack still some high availability features which are desirable in midrange and enterprise scale configurations and the features already available are complicated to configure and to use. However, since many big companies like IBM, Red Hat and SUSE/Novell enforce Linux and x86 as an enterprise suitable operating system - hardware combination, it can be expected that these things will improve over time. c Stefan Peinkofer 209 [email protected] Chapter 9 Future Prospects of High Availability Solutions Finally we will briefly look at the emerging evolution of high availability solutions. 9.1 High Availability Cluster Software Unfortunately, most of the cluster software development is done behind closed doors and so not as much information about emerging new cluster software features is disclosed. One of the emerging features of cluster systems are so-called continental clusters which allow the cluster nodes to be separated by an unlimited distance. This feature will enable the customers to deploy high availability clusters even for services for which comprehensive disaster tolerance is required. Another emerging feature is that, by the use of the emerging technology of server virtualization and a cluster system which is aware of the virtualization technique, it will be possible to reduce the number of cluster installations. As figure 9.1 shows, server virtualization allows customers to run more than one operating system instance on a single server. 210 9.1. 
Figure 9.1: High Availability Cluster and Server Virtualization

If the cluster system is aware of the underlying virtualization technique, it will be possible to deploy a single cluster instance which maintains all services contained in the various operating system instances. As figure 9.2 shows, to make these services highly available, the cluster system will no longer fail over the application instance, but fail over the whole virtual operating system instance.

Figure 9.2: Virtual Host Fail Over

9.2 Operating System

On the commercial operating system level, the current emerging technology concerning availability is self healing. The self healing functionality is divided into two parts:

• Proactive self healing - Tries to predict failures of components before they occur and automatically reconfigures around the suspect component without affecting the availability of the system.

• Reactive self healing - Tries to react automatically to failures that have already occurred by reconfiguring around the failed component while affecting the system availability as little as possible.

In addition to that, the self healing vision includes the idea that the system will explain to the user what actually caused the problem and that it will also give recommendations regarding what should be done to fix it. If these recommendations are reliable enough, the mean time to repair can be reduced, since the task of finding a solution for the problem is already done by the system itself. On the non-commercial operating system level, we can expect that more and more of the availability features which are currently available on the commercial operating systems will be implemented and that the configuration and administration of these features will become as easy as it is on the commercial operating systems.

9.3 Hardware

On the hardware level, we can expect that the reliability of hardware components will improve over time. In addition to that, more and more availability features, which are currently only available for midrange and enterprise scale hardware, will also become available for entry level hardware. Also, the complexity of the configuration and administration of hardware components like storage sub-systems or Ethernet switches and routers will be reduced.

Appendix A High Availability Cluster Product Overview

Table A.1 gives an overview of the most important high availability cluster products.

Table A.1: High Availability Cluster Products

• Heartbeat - Vendor: Open Source; Operating System(s): Linux, Solaris, FreeBSD, MacOS X, others; Hardware: x86, SPARC, PowerPC, others; Number of Nodes: not limited; Web: http://www.linux-ha.org

• HACMP - Vendor: IBM; Operating System(s): AIX; Hardware: PowerPC; Number of Nodes: 32; Web: http://www.ibm.com

• IRIS FailSafe - Vendor: SGI; Operating System(s): IRIX; Hardware: MIPS; Number of Nodes: 8; Web: http://www.sgi.com

• LifeKeeper - Vendor: SteelEye Technology; Operating System(s): Linux, Windows (NT, 2000, 2003); Hardware: x86, PowerPC (c); Number of Nodes: 32; Web: http://www.steeleye.com/

• Linux FailSafe (b) - Vendor: Open Source (SGI originally); Operating System(s): Linux; Hardware: x86 (maybe others too); Number of Nodes: 16; Web: http://oss.sgi.com

• HP Serviceguard - Vendor: Hewlett-Packard; Operating System(s): HP-UX, Linux; Hardware: Itanium, PA-RISC, x86; Number of Nodes: 16; Web: http://www.hp.com

• Microsoft Windows Cluster - Vendor: Microsoft; Operating System(s): Windows (NT, 2000, 2003); Hardware: x86; Number of Nodes: 8 (d); Web: http://www.microsoft.com

• Red Hat Cluster Suite - Vendor: Red Hat; Operating System(s): Red Hat Linux; Hardware: x86; Number of Nodes: 16; Web: http://www.redhat.com

• Sun Cluster - Vendor: Sun Microsystems; Operating System(s): Solaris; Hardware: SPARC, x86; Number of Nodes: 16; Web: http://www.sun.com

(b) Note that this is no typo. The development of this product is discontinued.
(c) Linux only.
(d) For fail over configurations.

Nomenclature

API - Application Programming Interface
ARP - Address Resolution Protocol
ATA - Advanced Technology Attachments
BIOS - Basic Input/Output System
CIB - Cluster Information Base
CIFS - Common Internet File System
CPU - Central Processing Unit
CRC - Cyclic Redundancy Check
CRM - Cluster Resource Manager
CTS - Cluster Test System
CVS - Concurrent Versions System
DC - Designated Coordinator
DNS - Domain Name System
DTD - Document Type Definition
ECC - Error Correction Code
FC - Fibre Channel
GNU - GNU's Not Unix
HA - High Availability
HBA - Host Bus Adapter
HP - Hewlett-Packard
HTTP - Hypertext Transfer Protocol
IBM - International Business Machines
ICMP - Internet Control Message Protocol
IEEE - Institute of Electrical and Electronics Engineers
IP - Internet Protocol
IPMP - IP Multipathing
IRC - Internet Relay Chat
ISO - International Organization for Standardization
LAN - Local Area Network
LDAP - Lightweight Directory Access Protocol
LUN - Logical Unit Number
MAC - Media Access Control
MB - Megabyte
MD - Multi Disk
MPxIO - Multiplex Input/Output
MTBF - Mean Time Between Failure
MTTR - Mean Time To Repair
NFS - Network File System
NIS - Network Information System
NTP - Network Time Protocol
OCF - Open Cluster Framework
OCFS - Oracle Cluster File System
OS - Operating System
OSI - Open Systems Interconnection
PCI - Peripheral Component Interconnect
PERL - Practical Extraction and Report Language
PMF - Process Management Facility
PXFS - Proxy File System
QFS - Quick File System
RAC - Real Application Cluster
RAID - Redundant Array of Independent Disks
REOS - Remote Enterprise Operation Services Center
RFC - Requests for Comments
ROM - Read Only Memory
RPC - Remote Procedure Call
RTR - Resource Type Registration
SAM - Storage and Archive Manager
SAN - Storage Area Network
SCI - Scalable Coherent Interconnect
SCSI - Small Computer System Interface
SMART - Self-Monitoring, Analysis and Reporting Technology
SNMP - Simple Network Management Protocol
SPARC - Scalable Processor Architecture
SPOF - Single Point Of Failure
SQL - Structured Query Language
STOMITH - Shoot The Other Machine In The Head
STONITH - Shoot The Other Node In The Head
SVM - Solaris Volume Manager
TCP - Transmission Control Protocol
UDP - User Datagram Protocol
UFS - Unix File System
VHCI - Virtual Host Controller Interconnect
VLAN - Virtual Local Area Network
VMS - Virtual Memory System
WAN - Wide Area Network
WLAN - Wireless Local Area Network
WWN - World Wide Name
XML - Extensible Markup Language
ZaK - Zentrum für angewandte Kommunikationstechnologien