How to Leverage Disk to Improve Recovery Plans
White Paper
Introduction

If you are still using tape to back up your production servers, you are likely experiencing problems with backup windows, data loss on recovery, recovery time, or recovery reliability. For enterprises that need to retain backup data for long periods of time to meet business and/or regulatory mandates, tape still has a place in the backup infrastructure. But disk offers significant advantages when it is integrated into the backup infrastructure in the right locations, and it provides ready access to a number of other next generation recovery technologies that are critical in meeting business recovery requirements.

As we recover from the industry downturn, cost-cutting and cost-containment are still key issues for most enterprises. Being able to provide the right recovery capabilities for your business extends to more than just data – it must cover applications as well. If you're like most information technology (IT) shops, this means multiple products and multiple vendors. Interestingly, the strategic deployment of disk in your backup infrastructure can help you resolve this issue, simplifying your environment and actually lowering costs relative to what you're doing now.

Assuming disk is a well-integrated part of the backup infrastructure, this white paper discusses the technologies that are critical in designing a comprehensive and scalable recovery capability for your business – one that covers data and applications, and that meets both everyday needs and the more infrequent disaster recovery (DR) requirements.

Understanding Why Disk Is A Strategic Data Protection Technology

Data is growing at 50% - 60% a year for most enterprises. Using legacy tape backup-based approaches, providing adequate protection for mushrooming data stores that are in many cases already multiple terabytes in size puts a huge load on servers, networks, storage devices, and administrators. As the data grows, these loads increase. With its performance and serial access characteristics, tape is a poor choice as a backup target for large data sets. It just takes too long to back up and/or restore data – activities that your shop is doing on a daily basis.

By designating disk as the initial backup target, its performance and random access characteristics can be leveraged to address backup window and recovery time issues. Conventional backups can be completed much faster against disk-based targets than against tape-based targets, and this can in many cases cut backup times considerably. Given that most restores are done from the most recent backups, keeping the last couple of backups on disk ensures that data can be found and restored faster. And for the types of random access activities that typify initial backups and most object-level restores, disk is much better suited and therefore performs much more reliably.

Figure 1. Comparing the relevant "data protection" characteristics of disk and tape.

DISK ADVANTAGES
– Ability to operate at varying speeds shortens backups
– On-line searchability lets admins find data faster
– Disk is a more reliable media than tape for backup tasks
– Disk provides access to key next generation recovery technologies
– Data does not need to be converted to disk format prior to restore

TAPE ADVANTAGES
– Tape has a lower $/GB cost than disk (10x – 100x)
– Tape can dump large amounts of data faster than disk – if the network infrastructure can support it
– Each tape device operates at a single rated speed
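To put rough numbers on the backup window problem described above, the short sketch below estimates how long a full backup takes at a given sustained throughput. The data set size and throughput figures are illustrative assumptions, not figures taken from this paper.

```python
# Rough backup-window estimate: hours needed to move a data set at a given
# sustained throughput. All figures below are illustrative assumptions.

def backup_window_hours(data_tb: float, sustained_mb_per_s: float) -> float:
    """Hours to back up data_tb terabytes at sustained_mb_per_s MB/s."""
    total_mb = data_tb * 1024 * 1024          # TB -> MB
    return total_mb / sustained_mb_per_s / 3600

if __name__ == "__main__":
    data_tb = 10.0                            # assumed production data set
    for label, rate in [("tape drive, streaming well", 120.0),
                        ("tape drive, starved/start-stop", 30.0),
                        ("disk target, parallel streams", 400.0)]:
        print(f"{label:32s}: {backup_window_hours(data_tb, rate):5.1f} h")
```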
Figure 2. To improve data protection and recovery, leverage the advantages of disk and tape in the appropriate backup "tiers".

Protected Servers
Backup Tier 1 – SATA Disk
• Handles initial "backups"
• Feeds all near term restores
Backup Tier 2 – Tape
• Low cost, long term retention
• Meets regulatory, compliance requirements

The use of disk as a backup target opens up the use of disk-based snapshots. For most recovery scenarios, it is best to use application-consistent recovery points. Relative to crash-consistent recovery points (the only other kind), application-consistent recovery points allow data and/or an entire server to be recovered faster and more reliably. When the objective is to restore data fast or to bring an application service (such as Exchange) back up as quickly as possible, you want application-consistent recovery options. Many key enterprise applications offer APIs that allow third party products to interact with them to create application-consistent recovery points which can be kept on disk (called snapshots). The Windows Volume Shadow Copy Service (VSS) and Oracle Recovery Manager (RMAN) are probably the two most widely known of these interfaces, but most enterprise data service platforms (databases, file systems, etc.) offer them.

There are two very viable options for integrating disk into your current backup infrastructure that require minimal changes. Many enterprise backup software products will optionally support disk as a backup target, allowing you to stay with your current backup schedules while you leverage some of the advantages of disk. The other option is to buy a virtual tape library (VTL). VTLs use disk as the storage media, but fool the backup software into thinking they are tapes.

While this is all interesting, these are only incremental improvements that are based almost entirely on maintaining the same processes you have today (scheduled backups that occur once a day) and merely using disk as the backup target. Because data will continue to grow at high rates, many of the same problems you have with tape today (e.g. backup window, data protection overhead) will re-surface at some point in the future with these two approaches.

What really makes disk a strategic play in data protection is the access it provides to a number of newer technologies that can transform how data protection occurs. Given current data growth rates, within 5-7 years almost all enterprises will have had to make the transition to a different recovery model. Many enterprises are already at that point today, and need to move away from point-in-time oriented backup approaches so that they can rein in the growth of server overhead, network utilization, and storage capacity.

This coming transformation will be a move from scheduled, point-in-time based data protection schemes to more continuous approaches that spread the capture of "backup" data out across the entire day (a 24 hour period) rather than attempting to capture it once a day in a scheduled, "off hours" period. With business operating 24x7 in most industries, there are no more "off hours" periods. And there just isn't enough bandwidth in many cases to complete backups within some pre-defined window.
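The application snapshot interfaces mentioned above (VSS, RMAN, and their peers) all follow roughly the same quiesce / snapshot / resume pattern. The sketch below illustrates only that generic pattern; the helper functions are hypothetical placeholders, not calls into any real VSS or RMAN API.

```python
# Generic application-consistent snapshot pattern (quiesce -> snapshot -> resume).
# The three helpers are hypothetical placeholders for whatever interface the
# application actually exposes (VSS writer, RMAN script, database hot-backup mode).
import datetime

def quiesce(app: str) -> None:
    """Ask the application to flush buffers and pause writes (hypothetical)."""
    print(f"[{app}] quiesced")

def take_disk_snapshot(volume: str) -> str:
    """Create a point-in-time image of the volume (hypothetical)."""
    snap_id = f"{volume}-{datetime.datetime.now():%Y%m%dT%H%M%S}"
    print(f"snapshot {snap_id} created")
    return snap_id

def resume(app: str) -> None:
    """Release the application so normal writes continue (hypothetical)."""
    print(f"[{app}] resumed")

def application_consistent_snapshot(app: str, volume: str) -> str:
    quiesce(app)                              # application reaches a consistent state
    try:
        return take_disk_snapshot(volume)     # image taken while writes are paused
    finally:
        resume(app)                           # keep the pause as short as possible

if __name__ == "__main__":
    application_consistent_snapshot("Exchange", "E:")
```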
Vendors of massively scalable, disk-based platforms built around scale-out architectures designed to handle very large data sets have been recommending "continuous" approaches to protecting those platforms for years. It's just a matter of time before enterprises run into the same "scale" problems with their backup data, if they are not already there.

Dealing With the Cost of Disk

The one huge negative of disk relative to tape has been its much higher cost. Two key developments provide options for end users to manage disk-based backup costs down.

The widespread availability of large capacity SATA disks with enterprise class reliability has brought the cost of a viable disk target down to somewhere in the $5/GB range (raw storage capacity cost) for many products. Tape, however, is still much cheaper, coming in at $.40 - $.50/GB (raw storage capacity cost) for many enterprise class tape libraries. Compression technologies, included on most enterprise class tape libraries, generally provide a 2:1 reduction in storage capacity requirements for most types of data, resulting in a $/GB cost for tape in the $.20 - $.25 range.

The increasing use of storage capacity optimization technologies is also helping to reduce the cost of disk significantly. Storage capacity optimization includes technologies like compression, file-level single instancing, delta differencing, and data deduplication, which can provide a 20:1 or greater reduction in storage capacity requirements for backup data. When storage capacity optimization is applied to a SATA-based disk array being used for backup purposes, the effective storage cost of that array may be reduced to the $.25 - $.50/GB range, making it only slightly more expensive than tape. Storage capacity optimization ratios vary significantly based on the algorithms used and the data set types on which they operate, but when used together with SATA disk they can narrow the price difference between disk and tape for backup purposes.

Keep in mind, however, that storage capacity optimization is not inherently a "backup" solution. It is a technology which increases the amount of information that can be stored within a given storage array, based on the achievable storage capacity optimization ratio. Its use may have implications for backup and DR in certain environments, however, so look for it to be included as a feature in many mid range and larger disk arrays within the next 1-2 years.

Decision Points: Disk vs Tape

While tape is poor at handling initial backup and object-level restore requirements, "poor" is a relative term. If you can complete your backups within the allotted time, the amount of data you lose on recovery is acceptable (assuming one or two backups per day), you are meeting recovery time requirements, and recovery reliability is not an issue affecting your restores, then by all means stay with tape.

If you are looking for a relatively easy fix for the backup window, recovery time, and recovery reliability issues you are having with tape, then you may consider just inserting disk as a backup target into your existing backup infrastructure, and managing your backup data such that most of your restores are served from disk-based data.
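The effective cost comparison above is simple division; the sketch below reproduces it using the $/GB figures and reduction ratios quoted in this section (actual ratios vary by environment and data type).

```python
# Effective $/GB after applying a capacity-reduction ratio, using the figures
# quoted above: ~2:1 tape compression and ~20:1 storage capacity optimization.

def effective_cost_per_gb(raw_cost_per_gb: float, reduction_ratio: float) -> float:
    """Cost per logical GB stored once data is reduced by reduction_ratio:1."""
    return raw_cost_per_gb / reduction_ratio

if __name__ == "__main__":
    print(f"tape:      ${effective_cost_per_gb(0.45, 2):.2f}/GB "
          f"(raw ~$0.40-0.50, 2:1 compression)")
    print(f"SATA disk: ${effective_cost_per_gb(5.00, 20):.2f}/GB "
          f"(raw ~$5.00, 20:1 capacity optimization)")
```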
Realize, however, that if you take this approach, it is a tactical rather than a strategic use of disk, postponing a problem that you (or your successor) will ultimately have to deal with anyway within a year or two (depending on your data growth rates). Other pressing problems – such as data protection overhead, available network bandwidth, data loss on recovery, better root cause analysis for data corruption problems, and complex environments with multiple, application-specific recovery solutions like log shipping – may drive you to consider the strategic use of disk.

Applying Disk Strategically

If you have decided that you want to address backup problems rather than symptoms, the key transformation that needs to occur is a change from scheduled, point-in-time backups to a "continuous" approach. With scheduled backups, it became clear long ago that you can't perform a full backup each time – there's not enough time, network bandwidth, or storage capacity to do so. Incremental and differential backups, which focus on backing up just the changes since the last backup, have been in mainstream use for at least a decade. But given the size of today's data sets and their change rates, even just backing up the changes has now run into the same time, network bandwidth, and storage capacity limitations.

The way to address this is through "continuous" backup. These technologies capture only changed data, they capture it in real time as it is created, and they spread the transmission of that data across the entire day. The difference is between bundling up 50GB of daily changes from your database and trying to send them all at once versus spreading the transmission of that 50GB out across a 24 hour period. This immediately addresses two of the three key problems mentioned above. It completely eliminates point-in-time backups, replacing them with a transparent approach that puts negligible overhead on the system being backed up at any given point in time. Backup is effectively occurring all the time, but the impact on the servers being backed up is so low that it's not noticeable. Interestingly, with continuous approaches data becomes recoverable the instant it is created, not just once it's backed up. And continuous capture significantly reduces peak bandwidth requirements: by spreading the transmission of that 50GB out across the day, the instantaneous bandwidth usage is so low that it is barely noticeable.

Figure 3. Point in time backups generate spikes in server overhead and network bandwidth utilization that continuous backup does not. (The chart plots resource usage over time for backup as a discrete operation versus "backup" as a continuous operation; backup as a point in time operation generates spikes in resource utilization that impact production performance.)

How continuous backup addresses the storage problem is not necessarily intuitive. After an initial synchronization (i.e. creating a copy of the original state of the production data), continuous backup just collects and stores the changes. So the amount of storage required will be heavily dependent on the change rate and how long the data is kept on disk. Think about how just capturing and storing changes compares to what data deduplication (discussed earlier in this white paper) does. Data deduplication takes a full backup, processes it to remove redundancies during each scheduled backup, then stores it in its compacted form. Continuous backup never operates with a full backup (after the initial sync); it only ever captures and stores change data.
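To make the 50GB example above concrete, the sketch below compares the average bandwidth needed to trickle a day's changes across 24 hours against the rate needed to push the same data through a short nightly window. The two-hour window length is an assumption for illustration.

```python
# Bandwidth needed for 50 GB of daily change data: trickled over 24 hours
# versus pushed through a nightly backup window (window length assumed).

def required_mbit_per_s(gigabytes: float, hours: float) -> float:
    """Average Mbit/s needed to move the given amount of data in the given time."""
    return gigabytes * 8 * 1024 / (hours * 3600)

if __name__ == "__main__":
    daily_change_gb = 50.0
    print(f"continuous, spread over 24 h: {required_mbit_per_s(daily_change_gb, 24):6.1f} Mbit/s")
    print(f"nightly 2-hour window:        {required_mbit_per_s(daily_change_gb, 2):6.1f} Mbit/s")
```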
Continuous backup approaches can come very close to achieving the same capacity optimization ratios that deduplication does, without having to repeatedly process full backups. This points up another important distinction between continuous backup and data deduplication: with continuous backup you are working with "capacity optimized" data sets end to end, and enjoying the benefits that provides in terms of minimizing network bandwidth utilization across every network hop (LAN or WAN). With deduplication, you must consider the impacts of source-based vs target-based deduplication on server overhead and network bandwidth requirements.

There are two forms of continuous backup available today: replication and continuous data protection (CDP). Replication basically keeps designated source and target volumes, connected across a network, in sync. But replication by itself is not a data protection solution, because it only tracks the latest data state. If the source volume somehow becomes corrupted, that corruption will be transferred to the target volume, leaving you without a way to recover. So replication is often combined with snapshot technology so that if the latest data state is not usable on the recovery volume, other good recovery points are available.

CDP, on the other hand, captures and stores all changes in a disk-based log, allowing administrators to choose any point within the log to generate a disk-based recovery point. If corruption occurs, it is still transferred to the CDP log, but all points prior to it within the log are still good and can be used as recovery points. CDP can also be integrated with application snapshot APIs so that application-consistent recovery points can be marked (using "bookmarks") in the CDP data stream. CDP by itself is a data protection solution, and it is often combined with replication so that the CDP log can be replicated to remote sites to support DR operations in exactly the same way as it supports local recovery operations.

Figure 4. In addition to being the lowest overhead way to perform data protection, CDP can minimize data loss on recovery, offer reliable application-consistent recovery options, and re-create one or more recovery points anywhere along the timeline represented by the CDP log for root cause analysis.

CDP offers the highest value for initial backups and recovery operations, not cost-effective long term retention. Most customers will size the CDP log to create a "retention window" of two to four weeks. The retention window size is generally driven by the frequency of accessing the data. CDP makes it easy to access the data, since an administrator merely has to "point and click" to select and generate a disk-based recovery point. Once the data has aged to a point where it is not likely to be accessed, it can be moved to tape for long term retention purposes. CDP integrates well with existing tape-based infrastructure to do this: to migrate data to tape from a CDP system, you generate a recovery point, mount that volume on a backup server, and back up the server represented by that volume (or volumes) just like you would have in the past.

Note also that disk-based images of application-consistent data states can be generated for purposes other than recovery. Because recovery images can effectively be mounted on any other server, they can be used for test, development, reporting and other analysis purposes, enhancing the value of CDP beyond just data protection and recovery. Think about how many other "copy creation" tools and products it may be able to replace within your existing infrastructure.
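To make the CDP log and "bookmark" concepts above concrete, here is a toy in-memory sketch. A real CDP product journals block-level writes to disk and replays them onto a baseline image, but the rewind-to-any-point idea is the same; all names here are illustrative.

```python
# Toy in-memory sketch of a CDP change log with application-consistent bookmarks.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ChangeRecord:
    timestamp: float                     # when the write happened
    offset: int                          # where on the protected volume
    data: bytes                          # the new bytes
    bookmark: Optional[str] = None       # set for application-consistent points

@dataclass
class CdpLog:
    records: list = field(default_factory=list)

    def append(self, record: ChangeRecord) -> None:
        self.records.append(record)      # captured continuously, in arrival order

    def recovery_point(self, as_of: float) -> list:
        """Return every change up to as_of; replaying them onto the baseline
        image yields a disk-based recovery point for that instant."""
        return [r for r in self.records if r.timestamp <= as_of]

    def bookmarks(self) -> list:
        """Application-consistent points marked via snapshot API integration."""
        return [(r.timestamp, r.bookmark) for r in self.records if r.bookmark]

if __name__ == "__main__":
    log = CdpLog()
    log.append(ChangeRecord(1.0, 4096, b"..."))
    log.append(ChangeRecord(2.0, 8192, b"...", bookmark="exchange-vss-ok"))
    log.append(ChangeRecord(3.0, 4096, b"corrupt"))
    print(log.bookmarks())                     # roll back to the last good bookmark
    print(len(log.recovery_point(as_of=2.0)))  # changes needed for that point
```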
Virtual Servers Argue for the Strategic Use of Disk

As enterprises deploy virtual server technology, it is generally used in server consolidation efforts to decrease energy and floorspace costs. While physical servers are often configured at 25% - 35% utilization rates to provide headroom for data and application growth, most virtual servers are configured at utilization rates of 85% or higher to maximize the benefits of server consolidation. The lack of headroom available on most virtual servers has key implications for data protection strategy that many enterprises do not realize until after they've tried to stay with the old method of deploying a backup agent in each server. Once you've come to the realization that you need a low overhead approach that supports application-consistent recovery options and can work across not just physical servers but also any virtual server platform (VMware, Microsoft, Citrix), you can really start to appreciate what CDP technology has to offer.

Virtual servers offer significant opportunities for improved recovery and lower costs in DR scenarios. Restarting applications on virtual servers at a remote site removes all the "bare metal restore" problems that exist with physical servers, and enables server consolidation on the DR side that significantly cuts infrastructure requirements, lowering energy and floorspace costs. The use of disk as a backup medium provides access not only to CDP and application snapshot API integration (for application-consistent recovery options), but also to asynchronous replication. Asynchronous replication enables long distance DR configurations that will not impact production application performance, and has already been integrated with many CDP offerings available in the market today. When replication technologies are not tied to a particular disk array vendor, they provide the flexibility to use a single product to replicate data stored on any kind of storage architecture (SAN, NAS, or DAS). Figure 5 indicates that most IT shops with virtual servers have very heterogeneous storage environments where such a feature may be valuable.

Figure 5. An IDC survey of 168 respondents done in October 2009 indicates significant heterogeneity in storage architectures in virtual server environments.

DR Implications of Disk-Based Backup

Data sitting on tapes can't be replicated. To move that data to remote locations for DR purposes, the standard approach has been to clone the tapes (create a copy of each) and ship them via ground transportation to the alternate site. Typically this has not been done for daily tape backups, just weekly full backups. And it takes several days to ship tapes to the remote location. So if recovery from the remote site is required, the data is quite old, and recovering from old data means lots of data loss. If data is sitting on disk, however, it can be replicated to remote locations automatically, either continuously or through scheduled replication. This gets the data to the remote locations much faster – on the order of hours or even minutes (if continuous replication is used) – so the achievable recovery point objectives (RPOs) from data at the remote site are much better.
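The RPO difference described above is largely a matter of how long each transport method takes to get the newest data to the remote site; the sketch below makes that comparison explicit. The delay figures are illustrative assumptions, not measurements.

```python
# Rough worst-case RPO (age of the newest recoverable data at the DR site)
# for different transport methods. All delay figures are illustrative assumptions.

TRANSPORT_DELAY_HOURS = {
    "weekly tape clone + courier":    7 * 24 + 2 * 24,  # weekly fulls plus ~2 days transit
    "nightly scheduled replication":  24,                # at most one day behind
    "continuous (async) replication": 5 / 60,            # minutes behind, bandwidth permitting
}

if __name__ == "__main__":
    for method, hours in TRANSPORT_DELAY_HOURS.items():
        print(f"{method:32s}: worst-case RPO ~ {hours:7.1f} h")
```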
Once at the remote site, data can always be migrated to tape for more cost-effective long term retention, so many enterprises using replication may keep the last couple of days of "replicated" data on disk, migrating it to tape thereafter to minimize costs.

Any time replication is considered, network bandwidth costs must also be considered. While LAN bandwidths are often 100 Mbit/sec or greater, WAN bandwidth is much more expensive and is generally much lower (as much as 100 times lower). Trying to send "backup" data across the WAN can take up much of that very limited bandwidth, imposing performance impacts on other production applications that are using the WAN. If data is on disk, however, storage capacity optimization technologies can be applied to it. WAN optimization, another form of storage capacity optimization, can be effectively deployed to minimize the amount of data that has to be sent across WANs to make the information it represents (your database tables, files, etc.) recoverable at the remote site.

When considering replication, consider also how much bandwidth you will need to meet your RPO requirements at the remote site without unduly impacting other applications using that network. Bandwidth throttling, often available as part of replication products, lets you limit the amount of bandwidth that replication takes up throughout the day, thus guaranteeing a certain percentage of your network bandwidth for other critical applications. The use of bandwidth throttling may, however, impact your RPO at the remote site.

When deploying replicated configurations, you will also want to quantify your resource utilization and capacity requirements as much as possible up front. This lets you accurately predict recovery performance and the costs associated with your selected configuration. Look for tools from vendors that will collect quantitative data about your environment before full product installation so you know these answers – just using backup logs to gauge these requirements can under-report resource requirements by 20% - 40%.

Application Recovery

While data must be recovered every day, applications must be recovered as well, just generally less frequently. But when applications are down, the impact to the business can be much larger than the impact of a few lost files. For most IT shops, the pressure is on to recover failed applications quickly and reliably. When applications must be recovered manually, the process can be time consuming, very dependent upon the skill set of the administrator, and inherently risky. Application recovery generally incorporates a set of well-known steps, however, which lend themselves very well to automation for most applications (as long as those applications are considered to be "crash tolerant"). At a high level, those steps are as follows (the sketch below shows how they map onto an automated failover):

• Identify that a failure has occurred and an application service must be re-started
• Determine whether to re-start the application service on the same or a different physical server
• Mount the data volumes representing the desired recovery point on the "target" recovery server
• Re-start the application
• Re-direct any network clients that were using that application service on the "old" server to the "new" server

Failover addresses the issue of recovering from the initial failure, but administrators will generally want to "fail back" to the original server at some point.
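The steps listed above map almost directly onto an automation script. The sketch below is a hypothetical illustration of that mapping only; every helper it calls is a placeholder for whatever monitoring, volume management, and client redirection tooling a given environment actually uses.

```python
# Hypothetical sketch of automating the application failover steps listed above.
# Every helper is a placeholder for real monitoring, storage, and redirection tooling.

def service_is_down(service: str) -> bool:
    """Step 1: detect that the application service has failed (placeholder)."""
    return True

def choose_recovery_server(service: str) -> str:
    """Step 2: pick the same or a different server to restart on (placeholder)."""
    return "standby-01"

def mount_recovery_point(service: str, target: str) -> None:
    """Step 3: mount the data volumes for the desired recovery point (placeholder)."""
    print(f"mounted recovery volumes for {service} on {target}")

def start_service(service: str, target: str) -> None:
    """Step 4: restart the application service (placeholder)."""
    print(f"{service} restarted on {target}")

def redirect_clients(service: str, target: str) -> None:
    """Step 5: point network clients at the new server (placeholder)."""
    print(f"clients of {service} redirected to {target}")

def fail_over(service: str) -> None:
    if not service_is_down(service):
        return
    target = choose_recovery_server(service)
    mount_recovery_point(service, target)
    start_service(service, target)
    redirect_clients(service, target)

if __name__ == "__main__":
    fail_over("exchange")
```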
In a disaster recovery scenario, depending on the scope of the initial "disaster", it may take several weeks or more to get the primary site ready again to support the IT infrastructure needed for business operations. During that time, business operations may be running out of the "disaster recovery" site. When it comes time to fail back, administrators will want to fail back with all of the latest data generated by business operations since the original disaster. The same issues that apply to failover apply to failback as well. If failback is not automated, it can be a lengthy, risky undertaking. When evaluating application recovery options, selecting a solution that can automate both failover and failback will minimize both the risk and the downtime associated with failure scenarios.

Figure 6. In shared disk architectures, the same set of physical disks is connected to all cluster nodes, with access to the data on the disks controlled by software. In shared nothing architectures, each cluster node owns its own disk. (The figure contrasts a shared disk cluster architecture with a shared nothing cluster architecture.)

For the last 20 years, application failover products built around shared "disk" architectures comprised the lion's share of the server "high availability" (HA) market. But in the last several years, HA configurations built around "shared nothing" architectures have started to come to the fore. Shared nothing architectures can support automated failover and failback processes between servers, but are much simpler to configure and manage than shared disk architectures. Source and target servers are connected over a network (LAN or WAN), and some form of replication technology is used to keep the source and target disks in sync. If data corruption is a concern, end users can look for replication products that are integrated with CDP technology.

Solutions that can integrate data and application recovery into a single, centrally managed product offer some distinct advantages in terms of deployment and ease of use. If they incorporate CDP, application snapshot API integration, asynchronous replication, and application failover/failback, then they can offer a comprehensive set of benefits:

• CDP provides a low overhead way to capture data transparently from physical servers, applies extremely well to virtual servers, can meet stringent RPO and RTO as well as recovery reliability requirements, and offers the industry's best approach to recovering from logical corruption

• Application snapshot API integration allows the CDP product to "bookmark" application-consistent recovery points within the CDP data stream, giving administrators a variety of application-consistent and crash-consistent recovery options to use for recovery, root cause analysis, or other administrative purposes

• Asynchronous replication extends all the benefits that CDP provides for local data recovery over long distances to remote locations for disaster recovery purposes, and does so without impacting the performance of production applications the way synchronous replication would

• Application failover and failback extend rapid, reliable recovery capabilities to applications, as well as offering easy application migration options to address maintenance and change management issues while minimizing production downtime (a failback sketch follows this list)
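Failback, as discussed above, follows essentially the same sequence as failover run in reverse once the primary site is healthy again, with an added resynchronization of the data generated at the DR site. The sketch below is a hypothetical illustration; every helper is a placeholder for real replication and service management tooling, and the resync-then-final-delta ordering is an assumption about typical practice rather than a description of any specific product.

```python
# Hypothetical sketch of an automated failback sequence, mirroring the failover
# steps above once the primary site is ready again. All helpers are placeholders.

def resync_to_primary(service: str) -> None:
    """Replicate data generated at the DR site back to the repaired primary (placeholder)."""
    print(f"{service}: DR-site changes replicated back to primary")

def stop_service(service: str, host: str) -> None:
    print(f"{service} stopped on {host}")          # brief, planned outage begins

def final_delta_sync(service: str) -> None:
    print(f"{service}: final delta synchronized")  # capture the last writes made at the DR site

def start_service(service: str, host: str) -> None:
    print(f"{service} started on {host}")

def redirect_clients(service: str, host: str) -> None:
    print(f"clients of {service} redirected to {host}")

def fail_back(service: str, primary: str, dr_host: str) -> None:
    resync_to_primary(service)        # bulk resync while the service still runs at the DR site
    stop_service(service, dr_host)    # quiesce to capture the final changes
    final_delta_sync(service)
    start_service(service, primary)
    redirect_clients(service, primary)

if __name__ == "__main__":
    fail_back("exchange", primary="prod-01", dr_host="dr-01")
```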
Data and application recovery are really just different points along the recovery continuum. If you are going to make a decision to limit your recovery capabilities to just data, make that decision consciously, with a full understanding of the implications. Approaches that cover data and applications provide more comprehensive solutions that ultimately support faster recovery and higher overall availability across the range of failure scenarios likely to be encountered. When recovery capabilities are automated, they become more predictable, easier to incrementally improve, and ultimately easier to manage against service level agreements that may be in place due to either business or regulatory mandates.

Conclusion

Due to industry trends like high data growth rates, tougher RPO and RTO requirements, and the increasing penetration of server virtualization technology, disk has become a strategic data protection technology that all enterprises should evaluate. Whether it is deployed tactically or strategically, it offers clear advantages over backup infrastructures based solely on tape. Tactical deployments will improve data protection operations and capabilities over the short term, but all enterprises should be considering how and when they will deploy disk strategically to make the move to more continuous data protection operations. Disk is a required foundation for the next generation recovery technologies – CDP, application snapshot API integration, asynchronous replication, and storage capacity optimization – that will ultimately become data protection prerequisites for most enterprises, and for some already have. When performing the strategic planning for the recovery architecture that will be required to meet your needs in the future, don't forget to extend your definition of "recovery" to include both data and applications. Rapid, reliable recovery, both locally and remotely, for both data and applications, is really the baseline requirement going forward to keep your business running optimally, regardless of what challenges may lie ahead.

100 Century Center Court, #705, San Jose, CA 95112 | p: 1.800.646.3617 | p: 408.200.3840 | f: 408.588.1590 | Web: www.inmage.com | Email: [email protected]

Copyright 2012, InMage Systems, Inc. All Rights Reserved. This document is provided for information purposes only, and the contents hereof are subject to change without notice. The information contained herein is the proprietary and confidential information of InMage and may not be reproduced or transmitted in any form for any purpose without InMage's prior written permission. InMage is a registered trademark of InMage Systems, Inc. 2012V1