VMware Primary Snapshot Recovery Use Cases

ECX 2.0
ECX Best Practices for Use Data
© Catalogic Software, Inc., 2015. All rights reserved.
This publication contains proprietary and confidential material, and is only for use by licensees of Catalogic DPX™, Catalogic BEX™, or Catalogic ECX™ proprietary software systems. This publication may not be reproduced in whole or in part, in any form, except with written permission from Catalogic Software.
Catalogic, Catalogic Software, DPX, BEX, ECX, and NSB are trademarks of Catalogic Software, Inc. Backup
Express is a registered trademark of Catalogic Software, Inc. All other company and product names used
herein may be the trademarks of their respective owners.
Table of Contents
Terminology and Concepts
Copy Data Management Principles and Practices
    Best Practices for Sites
    Best Practices for Usage
    Best Practices for Compartmental Boundaries
    Best Practices for Resources
    Best Practices for IV Modes
    Best Practices for Sizing
Use Data Workflow Use Cases
VMware Use Data Workflows
    Instant Access (IA) - VM Disks and Datastores
    Instant Virtualization (IV) - VMs, vApps, and VM folders
        Distinguishing the VMware Use Data IV modes
NetApp Use Data Workflows
Important Use Cases Not Currently Supported by ECX 2.0
    VMware Use Data
    NetApp Use Data
Terminology and Concepts
The terms Copy Data and Use Data are used throughout ECX 2.0. These two terms come from “Copy
Data Management” concepts and they refer to the processes of making data copies (Copy Data) and of
using those data copies (Use Data) to perform useful operations such as analytics, testing and recovery.
Instant Access (IA) provides instant access to Copy Data for use by the more traditional restore
scenarios. IA is used to gain instant access to specific data and then restore files, volumes, or application
data, as needed, from the instantly accessed data.
Instant Virtualization (IV) goes beyond IA to bring up the workloads (operating systems and
applications) by connecting the instantly accessed Copy Data (which includes the OS/application disks)
to test, clone, or production restored virtual machines. Copy Data Management makes heavy use of IV
to address many use cases that traditional backup and restore products avoid.
A robust catalog is implemented within ECX’s dynamic platform. The catalog manages SAN/NAS RAID storage (such as NetApp) against virtualized workloads (such as VMware). Creating copies and using them becomes automatable with ECX, and the product can be used to protect and then bring up large numbers of virtual machines (VMs) from data copies. With this ability comes the need to ensure that these masses of recovered VMs do not conflict with the original VMs at the application level. Fenced networking ensures that these VMs do not conflict.
Fenced networks are simply private or contained networks that are separate from and have no direct
access to production or other networks. Fenced networks are used by various IV use cases to fence off
test or clone virtual machines so that they do not interact with their production counterparts.
Hypervisor environments refer to virtualization environments such as VMware vCenter and Microsoft Hyper-V. ECX 2.0 supports only VMware, so this document often uses VMware-specific terms directly.
Site allows the user to identify and group the storage copies and vCenter resources by some criterion such as geographical location. Site becomes important when you want to perform mass development/recovery testing on a regular basis. It is best to perform that testing against mirror disks at a remote site, because you do not want to impact the performance of the source production machines by doubling up the load on your source RAID and source vCenter environment (CPU/memory/network). Site gives you clear insight into where the resources (storage copies and vCenter) are, as well as an easy way to configure automation jobs that are site based.
Copy Data Management Principles and Practices
One of the basic premises of “Copy Data Management” is that data copies (snapshots/backups) are
made for specific reasons to specific storage devices at specific locations (sites) to support a wide range
of specific use cases. The location and storage attributes of each copy become strategic based on how
they are used. For example, data copies that reside at a remote site would be used for site-based
disaster recovery (DR) and Dev/Test. Data copies that reside locally would be used for quick local
recovery. Further, local copies that reside on vaults would be used over copies that reside on primary
storage, as they typically have better retention policies, and using them avoids impacting the
performance of the primary storage. Cheaper or more expensive storage copies could also be employed
to adjust and control the cost vs. quality of the copies. Redundant copies can be removed. As such, it is very important to categorize, track, monitor, and control the various data copies in an enterprise to ensure that the enterprise has the appropriate copies to address the required use cases defined within its SLAs.
As a best practice, it is recommended that the user have multiple sites that each contain storage (such as NetApp) and hypervisor (such as VMware) resources. The sites should be independent of each other, at least with respect to power failure, so that they provide a level of protection for the user. It is better still to have sites at different geographical locations to provide protection against regional disasters such as a power grid failure.
Once the sites are determined, the user should place mirror copies of the source primary data at
alternate sites.
The user should then configure vaults locally to contain snapshots with higher retentions for local
recovery. If the budget allows, the user should also consider configuring vaults remotely off of the
remote mirrors so that the remote sites also have higher fidelity snapshots. The local vaults can be used
for local recovery and the remote vaults for offloaded archiving (i.e. run tape backups from remote
vault).
ECX 2.0 treats the above tasks of copy creation, placement, and control as a protection workflow, implemented via the ECX 2.0 Copy Data policies.
Use Data can only be performed after the Copy Data protection is put in place; Use Data depends entirely on how and where the data is placed by Copy Data. So, it is critically important that the user get the Copy Data protection right. Nonetheless, the process of creating automated Use Data policies that can be quickly tested helps the user discover issues with the Copy Data workflows.
It is not uncommon for a user to go through several iterations of Copy Data and Use Data policy tweaking on their way to optimizing their full solution to meet the SLAs they put in place.
The following list highlights some best practices for Copy Data Management:
Best Practices for Sites

•  Configure multiple independent sites that have independent power, network, storage, and hypervisor resources.
•  Place mirror copies at alternate sites.
•  Place vault copies locally.
•  If the budget allows, place a vault copy at a remote site off of the remote mirror for extra protection and archiving.
Best Practices for Usage

•  Vault copies should typically have higher retentions and thus higher RPO fidelity.
•  Use the local vault for local recoveries.
•  Use the remote mirrors for disaster recovery (DR) testing, snapshot validation, and Dev/Test and clone mode operations, which create cloned environments for data mining or Dev/Test (a simple copy-selection sketch follows this list).
•  Place VMs that must be recovered together within the same Copy Data jobs. This ensures that common snapshots are used for these co-dependent VMs during Use Data operations.
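
The copy-selection guidance above can be thought of as a simple preference order per use case. The following Python sketch is purely illustrative (the function, copy labels, and use-case names are hypothetical, not ECX identifiers); it encodes "local vault for local recovery, remote mirror for DR testing and Dev/Test" as a lookup.

    # Hypothetical illustration of the copy-selection guidance above.
    # Copy types, sites, and use-case names are examples, not ECX identifiers.
    PREFERENCES = {
        "local_recovery":      [("local", "vault"), ("local", "primary")],
        "dr_test":             [("remote", "mirror")],
        "dev_test_or_clone":   [("remote", "mirror"), ("local", "vault")],
        "snapshot_validation": [("remote", "mirror"), ("local", "vault")],
    }

    def pick_copy(use_case, available_copies):
        """available_copies: list of dicts like {"site": "local", "type": "vault", "id": "..."}."""
        for site, copy_type in PREFERENCES[use_case]:
            for copy in available_copies:
                if copy["site"] == site and copy["type"] == copy_type:
                    return copy
        raise LookupError("no suitable copy for use case: " + use_case)

    copies = [
        {"site": "local",  "type": "primary", "id": "snap-0042"},
        {"site": "local",  "type": "vault",   "id": "vault-0040"},
        {"site": "remote", "type": "mirror",  "id": "mirror-0041"},
    ]
    print(pick_copy("local_recovery", copies)["id"])  # vault-0040
    print(pick_copy("dr_test", copies)["id"])         # mirror-0041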
Best Practices for Compartmental Boundaries

•  It is recommended that the Copy Data and Use Data policies align on application or company compartmental boundaries. This way, each application group or compartment can manage and test its own Copy Data and Use Data policies. Common infrastructure components/applications such as DNS or AD should be in a separate set of Copy Data and Use Data jobs so that they can be leveraged by all the other application or compartmental jobs that depend upon the common applications. The common component Copy Data jobs should have snapshot frequencies and retention periods equivalent to or better than the Copy Data jobs for the dependent applications they support.
•  It is recommended that storage and datastores also align on application or company compartmental boundaries. This is related to the recommendation above. If there are application group or compartmental boundaries for the VMs, then those VMs should reside on datastores that align the same way. There should not be cross usage of datastores for applications across the boundaries; otherwise there will be inter-compartmental dependencies on common storage (a small audit sketch follows this list).
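
One quick way to audit the datastore alignment recommended above is to check whether any datastore is used by VMs from more than one compartment. The sketch below is hypothetical (the inventory format and names are invented for illustration); a real environment would populate the inventory from vCenter.

    # Hypothetical audit of compartment/datastore alignment. The inventory maps
    # each VM to its compartment and the datastores it uses.
    from collections import defaultdict

    inventory = {
        "erp-db01":  {"compartment": "ERP",    "datastores": ["ds_erp_01"]},
        "erp-app01": {"compartment": "ERP",    "datastores": ["ds_erp_01", "ds_shared"]},
        "crm-web01": {"compartment": "CRM",    "datastores": ["ds_crm_01", "ds_shared"]},
        "dns01":     {"compartment": "COMMON", "datastores": ["ds_common_01"]},
    }

    def cross_compartment_datastores(inventory):
        users = defaultdict(set)
        for vm, info in inventory.items():
            for ds in info["datastores"]:
                users[ds].add(info["compartment"])
        return {ds: comps for ds, comps in users.items() if len(comps) > 1}

    for ds, comps in cross_compartment_datastores(inventory).items():
        print("datastore %s crosses compartments: %s" % (ds, sorted(comps)))
    # -> datastore ds_shared crosses compartments: ['CRM', 'ERP']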
Best Practices for Resources

•  To avoid consumption of local resources, use alternate sites to run large test jobs against. For example, a job that recovers all VMs in an environment might be something you do not want to run in test mode locally (to the source site), as that would double up demands against almost every resource (network, CPU, and storage) within the local site. It is better to run these large jobs in test mode to an alternate site that is not busy. These jobs would, of course, be run locally in production mode in the case of a real disaster.
Best Practices for IV Modes

•  Use test mode to perform testing, snapshot validation, and any operation within a fenced network that does not require the test mode VMs to run for any extended period of time. The test mode VMs have independent UUIDs from their source, but they run off of snapshot volume clones, which are not intended for prolonged use; clone mode is generally better suited for extended use. The user can always move test mode VMs back to production to replace the source machines, or they can clone them and leave them running in the test network. The existing Copy Data protection policies do not apply to the test VMs, as they are essentially different VMs (with different UUIDs) so that they can run concurrently with the real source VMs within the fenced test network.
•  Use clone mode when extended use of the clone VMs is required and there is storage available to create the independent storage copies. As with test mode, the clone mode VMs have independent UUIDs and run within the fenced test network so as to not collide with the concurrently running production machines. The difference is that the clone VMs are moved via vMotion to permanent storage. The user can then modify the clone VMs by changing the host name, revising the Product ID (PID) and Security ID (SID) to gain a new OS identity, reconfiguring the identities of the contained applications, and applying any required OS/application licenses if they wish to expose the clones via the production networks. The existing Copy Data protection policies do not apply to the clone VMs, as the clone VMs are essentially different VMs.
•  Use Recovery mode (also referred to as Production mode or Restore mode) to perform restoration of the protected machines themselves due to a disaster. Recovery mode performs a restore which is essentially a replace/overwrite of the original VM with the restored images. Therefore, recovery mode checks to make sure that the production VMs are not running so that it can restore the images for those VMs. The existing Copy Data protection policies will continue to operate after the restore is complete.
Best Practices for Sizing

•  When trying to size the requirements for the VMware IV Use Data jobs, the user should understand, at a conceptual level, what the jobs are configured to do. They can then use the sizing/performance requirements of the existing production VMs to determine what a set of test or clone mode copies would require at the target site (a rough sizing sketch follows).
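
As a rough illustration of that exercise, the sketch below (hypothetical inventory and figures, not ECX output) totals the CPU, memory, and permanent storage footprint that a set of production VMs would demand if brought up in test or clone mode at the target site.

    # Rough, hypothetical sizing helper for a VMware IV Use Data job.
    production_vms = [
        {"name": "ad01",  "vcpus": 2, "mem_gb": 8,  "disk_gb": 80},
        {"name": "sql01", "vcpus": 8, "mem_gb": 64, "disk_gb": 500},
        {"name": "web01", "vcpus": 4, "mem_gb": 16, "disk_gb": 100},
    ]

    def estimate_target_requirements(vms, clone_mode=False):
        return {
            "vcpus":  sum(vm["vcpus"] for vm in vms),
            "mem_gb": sum(vm["mem_gb"] for vm in vms),
            # Test mode runs off snapshot clones, so permanent capacity is only
            # needed once the VMs are vMotioned off in clone (or recovery) mode.
            "permanent_disk_gb": sum(vm["disk_gb"] for vm in vms) if clone_mode else 0,
        }

    print("test mode: ", estimate_target_requirements(production_vms))
    print("clone mode:", estimate_target_requirements(production_vms, clone_mode=True))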
Use Data Workflow Use Cases
The following sections provide an understanding of the Use Data workflow use cases that were
implemented for ECX 2.0. ECX 2.0 supports both top down and bottom up workflows to address the
needs of various users such as application, hypervisor (VMware), and storage (NetApp) administrators.
At a high level, ECX 2.0 policies support two major Use Data workflows:

•  VMware Use Data workflows
•  NetApp Use Data workflows

The VMware Use Data workflows are top down, dealing with the VMware hypervisor objects such as VMs, vApps, VM folders, VM disks, and Datastores. The NetApp Use Data workflows are bottom up and storage driven, dealing with the storage objects such as NetApp volumes and files.
VMware Use Data Workflows
Instant Access (IA) - VM Disks and Datastores
The Instant Access workflows are “data focused” use cases that allow instant access to backed up data at a given time (snapshot) to be used for data recovery, such as item-, file-, or volume-level recovery. Depending on your role, you might want to look at this from a bottom up (hypervisor administrator) or top down (application administrator) perspective.

[Figure: ECX VMware IA Use Data Policy - VM disk source selection]
Two IA workflows are described here:

•  IA Mount Datastores - leveraging primary, vault, or mirror snapshot clone(s) of one or more datastores to an ESX Server or Cluster

   Datastore(s) are mounted from a volume copy (of primary, vault, or mirror) to a target ESX host or cluster. The hypervisor administrator is responsible for any recovery action to production sources. Ending the job cleans up the resources. There is also an option to make the IA’d datastores permanent using split-clone if the user used the IA datastores and wants to make them permanent.

•  IA Mount VM Disk(s) - leveraging primary, vault, or mirror snapshot clone(s) of one or more VM disk(s) to a chosen VM

   The application administrator specifies the disks (VMDKs) to recover from a VM disk list and the software mounts the appropriate datastores on mounted snapshot copies. The VMDK(s) contained within the datastores are then assigned to the source VM or to any other VM to be mounted at specified mount points. The mounted disks can be used from within the VM(s) for application item level recovery (such as Exchange mailboxes or SharePoint objects). Ending the job cleans up the resources. There is also an option to make the IA’d VM disks permanent using split-clone if they were used and the user wants to make them permanent. (A vSphere-level sketch of the disk-attach step follows the note below.)

Note that, when using OS disks for file level recovery, many operating system files may be locked, preventing file copy. IA may not sufficiently handle data copy of locked OS disks.
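
Under the covers, assigning a VMDK from an instantly accessed datastore to a VM is an ordinary vSphere disk-attach reconfiguration. ECX drives this through its own workflow; the pyVmomi sketch below only illustrates the general shape of that step (the vCenter address, credentials, datastore path, target VM, and free SCSI unit number are placeholders, and error/task handling is omitted).

    # Illustrative pyVmomi sketch: attach an existing VMDK from an IA-mounted
    # datastore clone to a helper VM for item-level recovery. All names/paths
    # below are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="***", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    def find_vm(name):
        view = content.viewManager.CreateContainerView(content.rootFolder,
                                                       [vim.VirtualMachine], True)
        try:
            return next(vm for vm in view.view if vm.name == name)
        finally:
            view.Destroy()

    vm = find_vm("recovery-helper-vm")

    # Reuse the VM's first SCSI controller and attach the cloned VMDK to it.
    controller = next(d for d in vm.config.hardware.device
                      if isinstance(d, vim.vm.device.VirtualSCSIController))
    disk = vim.vm.device.VirtualDisk()
    disk.backing = vim.vm.device.VirtualDisk.FlatVer2BackingInfo(
        fileName="[ia_clone_datastore] appvm/appvm_1.vmdk",
        diskMode="independent_persistent")
    disk.controllerKey = controller.key
    disk.unitNumber = 1      # assumes unit 1 is free on that controller
    disk.key = -100          # temporary negative key for a device being added

    spec = vim.vm.ConfigSpec(deviceChange=[vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.add, device=disk)])
    vm.ReconfigVM_Task(spec=spec)   # wait on the task, then mount inside the guest
    Disconnect(si)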
Instant Virtualization (IV) - VMs, vApps, and VM folders
Instant virtualization (IV) is different from Instant Access (IA). With IV, the user actually instantiates
running VMs and vApps as they looked at the time and state of the snapshot. This means they are
actually starting up operating systems with applications on them (workloads).

•  Test, Clone and Recovery Modes – There are many different use cases which call for supporting different modes of operation. ECX supports three modes: test, clone, and recovery. The user selects the mode when they run the job policy.

•  Network Fencing – Private networks are leveraged to isolate test and clone mode copies from the production environment. This assures that the OS’s and applications on the test/clone VMs do not interfere with their production counterparts. The user maps the source networks to test networks via the job policy.

•  Storage vMotion – Storage vMotion is leveraged to move the instantly virtualized VMs from snapshot clones to more permanent storage during Clone or Recovery (RRP). The VMs are operational during Storage vMotion so there is no downtime. Users select the target vMotion storage location (RRP datastore) via the job policy.

•  Recovery Order – The order in which workloads are recovered needs to be addressed for multi-workload (multi-machine) applications. For instance, Active Directory (AD) dependent application server workloads need to come up after AD comes up. AD requires DNS, so the DNS services need to come up prior to AD. SharePoint Web servers should come up after the configuration and content SQL Servers come up. So, recovery order may be important for VM recoveries (see the ordering sketch after this list).

   ECX 2.0 supports recovery order within the vApp and also when the user selects groups of VMs individually and sorts them in recovery order via the UI. ECX 2.0 does not support recovery order for VMs contained within VM folders chosen for recovery, as folder membership is variable in nature. We recommend that users leverage VMware vApps to manage the recovery order for groups of VMs instead of using VM folders if the recovery order is important for the VMs contained within a VM folder.
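
Recovery order is essentially a dependency-ordering problem. The sketch below uses hypothetical VM names and a hand-written dependency map (this is not an ECX interface) to show how a start order can be derived with a topological sort, which is the same reasoning a user applies when arranging VMs in a vApp or in the ECX UI.

    # Hypothetical illustration: derive a recovery (start) order from declared
    # dependencies, e.g. DNS before AD, AD before the SQL and web tiers.
    from graphlib import TopologicalSorter  # Python 3.9+

    depends_on = {
        "dns01":       [],
        "ad01":        ["dns01"],
        "sql-config":  ["ad01"],
        "sql-content": ["ad01"],
        "sp-web01":    ["sql-config", "sql-content"],
        "sp-web02":    ["sql-config", "sql-content"],
    }

    start_order = list(TopologicalSorter(depends_on).static_order())
    print(start_order)
    # e.g. ['dns01', 'ad01', 'sql-config', 'sql-content', 'sp-web01', 'sp-web02']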
Distinguishing the VMware Use Data IV modes
[Figure: ECX VMware Use Data Policy - IV Mode selection]
Test mode is used to test the recovery of a select set of VMs, vApps, etc.
•  The test VMs are brought up within a fenced test network so that they do not collide with the concurrently running production VMs. The test VMs selected contain the same applications and OS’s as the running sources (a sketch of such a fenced network follows this list).
•  The test VMs run off of snapshot clones. When the user is finished with the test, they end the job and the job cleans up all the test resources.
•  Test mode can be used to test DR, run Dev/Test, and validate snapshots.
•  Test mode can be scheduled. This allows for continual or scheduled testing.
•  Users can decide to run the test VMs locally (within the same site) or at a remote site. Users must be aware that the test VMs will consume the same CPU/memory/storage resources as the production VMs, so most users elect to perform their large scale testing at remote sites.
•  Test mode supports three actions while active: “End IV (Cleanup)”, “RRP (vMotion)”, and “Clone (vMotion)”. RRP stands for Rapid Return to Production.
•  Existing VMware Copy Data protection policies do not apply to test mode VMs.
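
A fenced test network of the kind referenced above is, in vSphere terms, typically just a standard virtual switch with no physical uplinks plus a port group for the test VMs. ECX sets this up through its job policy network mappings; the pyVmomi sketch below only illustrates the underlying idea for a single ESXi host (switch and port group names are placeholders).

    # Illustrative pyVmomi sketch: create an isolated (uplink-less) standard
    # vSwitch and port group on one ESXi host to serve as a fenced test network.
    # Names are placeholders; ECX normally handles this via its job policies.
    from pyVmomi import vim

    def create_fenced_network(host, vswitch_name="vSwitch_Fence",
                              portgroup_name="ECX_Fenced_Test"):
        net_sys = host.configManager.networkSystem

        # Omitting 'bridge' leaves the vSwitch without physical NIC uplinks,
        # so traffic on it never reaches the production network.
        vswitch_spec = vim.host.VirtualSwitch.Specification(numPorts=128)
        net_sys.AddVirtualSwitch(vswitchName=vswitch_name, spec=vswitch_spec)

        pg_spec = vim.host.PortGroup.Specification(
            name=portgroup_name, vlanId=0, vswitchName=vswitch_name,
            policy=vim.host.NetworkPolicy())
        net_sys.AddPortGroup(portgrp=pg_spec)

    # Usage: call create_fenced_network(host_system) for each ESXi host that will
    # run test-mode VMs, then map the source network to this port group in the job.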
Clone mode is used to create separate running copies of the original VMs within a fenced network
for an extended period of time.
•  The clone mode VMs are brought up within the fenced test/clone network so that they do not collide with the concurrently running production VMs.
•  Clone mode first gets the clone VMs up and running off of snapshot clones and then moves those VMs off of temporary snapshot clones (via storage vMotion) to more permanent storage (see the vMotion sketch after this list). Running off of snapshot clones is not recommended for long term use as they are differential images from the original source disk. Limits on snapshots may cause a new snapshot to cycle back on the one being used.
•  Clone mode can be used for data mining, migration, creating new VMs using a set of source VMs as “templates”, etc.
•  The user first runs a job in test mode and then decides to clone the test VMs to more permanent storage via the “Clone (vMotion)” action from test mode.
•  Clone VMs intended for permanent use should be modified before they are exposed to the production network. Revise the Product ID (PID) and Security ID (SID), and rename and relicense the OS and applications running on the clone VMs.
•  Existing VMware Copy Data protection policies do not apply to clone mode VMs. Users would need to create new policies for cloned VMs that they make permanent.
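
The move off of the temporary snapshot clones described above is a standard Storage vMotion (relocate) operation, which ECX performs as part of the “Clone (vMotion)” and “RRP (vMotion)” actions. The pyVmomi sketch below shows only the basic call; looking up the VM and target datastore objects, and waiting on the task, are left out.

    # Illustrative pyVmomi sketch: Storage vMotion a running clone VM from its
    # temporary snapshot-clone datastore to permanent storage. 'vm' and
    # 'target_datastore' are assumed to have been looked up already.
    from pyVmomi import vim

    def storage_vmotion(vm, target_datastore):
        spec = vim.vm.RelocateSpec(datastore=target_datastore)
        return vm.RelocateVM_Task(spec=spec)  # the VM stays powered on throughout

    # Usage sketch:
    # task = storage_vmotion(clone_vm, permanent_ds)
    # WaitForTask(task)   # e.g. pyVim.task.WaitForTask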
Recovery mode is used to restore the production VMs to the state contained within the selected
snapshots. Recovery mode is also referred to as Production mode or Restore mode.
•  Recovery mode performs a restore which is essentially a replace/overwrite of the original VMs with the restored images.
•  Recovery mode first gets the production VMs up and running off of snapshot clones (for speed) before moving them (while operational) via Storage vMotion to permanent storage.
•  The restored VMs are brought up on the production network.
•  Recovery mode does not proceed if it detects running instances of the production VMs it will replace. It is the user’s responsibility to shut down all production VMs that are being restored or replaced (a power-state check sketch follows this list).
•  The user can first run a job in test mode and then decide to use the test VMs to replace the production VMs via the “RRP (vMotion)” action from test mode. RRP stands for Rapid Return to Production.
•  Existing VMware Copy Data protection policies still apply to the production restored VMs. There is no need to edit or change them.
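
Because recovery mode refuses to run while the production VMs it would replace are still powered on, it can help to check their power state before starting the job. The pyVmomi sketch below (VM names are placeholders) simply lists any selected VMs that are still running.

    # Illustrative pyVmomi sketch: report which of the production VMs selected
    # for recovery are still powered on, so they can be shut down first.
    from pyVmomi import vim

    def still_running(content, vm_names):
        view = content.viewManager.CreateContainerView(content.rootFolder,
                                                       [vim.VirtualMachine], True)
        try:
            return [vm.name for vm in view.view
                    if vm.name in vm_names
                    and vm.runtime.powerState ==
                        vim.VirtualMachine.PowerState.poweredOn]
        finally:
            view.Destroy()

    # Usage sketch:
    # running = still_running(si.RetrieveContent(), {"sql01", "web01", "ad01"})
    # if running:
    #     print("Shut these down before running recovery mode:", running)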
NetApp Use Data Workflows
The following are examples of use cases that are bottom up and storage driven. These operations are fast and simple, but they do require that the user manage the application states and application-specific recoveries for those applications that have data on the volumes or files being restored.
[Figure: ECX NetApp Use Data Policy – NFS Destination Specification]

•  Instant Access (IA) volumes from any primary, vault, or mirror snapshot copies and expose them via NFS and CIFS

   The NetApp volume clone, NFS export, and CIFS share features are leveraged to provide this bottom up storage feature. The IA job simply mounts a snapshot copy and then exposes it via NFS and CIFS. The user can define the machines and the users that will have access to the exposed volumes. In addition, the user may convert the IA disk running off the temporary volume clone into a permanent disk by executing the “RRP” action, which is an option available in the ECX user interface. The “RRP” action uses the NetApp split-clone feature to convert the volume clone into a permanent disk while the disk is in use (no downtime). (A command-level sketch of these operations follows this list.)

•  Restore volumes from any primary, vault, or mirror snapshot copy

   This feature performs the NetApp SnapRestore, SnapVault restore, or SnapMirror restore under the hood, depending upon which copy snapshots are chosen for the restore. Restore from a primary snapshot (SnapRestore) was implemented by NetApp as a volume revert, which comes with two restrictions:

   o  Snapshots newer than the primary snapshot selected for restore are removed after the restore
   o  Restore to the selected primary snapshot is blocked if a newer snapshot is in use for a SnapMirror or SnapVault

   It is for this reason that we recommend that volume restores use vault or mirror snapshots instead of primary snapshots. If the Use Data job has a choice (after site selection), it will choose vault snapshots over primary snapshots.

•  Restore files (from primary snapshot only)

   Restore one or more files from a primary snapshot copy. The original source file is restored from the selected primary snapshot. Note: File restore from vault or mirror is not supported in ECX 2.0.

•  Restore volumes from secondary vault or mirror copy (SnapVault or SnapMirror)

   The NetApp snapshot mirror volume restore feature is leveraged to restore one or more volumes to the state contained within the selected mirror or vault snapshot. Restore from vault or mirror is recommended over restore from a primary snapshot because it does not result in the loss of snapshots, nor is it blocked by current mirror/vault operations based off of snapshots newer than the one selected. See the restore volumes from primary snapshot item above.
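
For reference, the NetApp features mentioned above map to ordinary Data ONTAP operations. The sketch below shows the general shape of those operations issued over SSH from Python; the cluster address, vserver, volume, and snapshot names are placeholders, and exact command syntax and options vary by ONTAP version and licensing, so treat this as illustrative rather than as what ECX executes internally.

    # Illustrative only: the kinds of clustered Data ONTAP commands behind the
    # workflows above (volume clone + export for IA, split-clone for "RRP",
    # SnapRestore for a primary-snapshot volume restore). Verify syntax against
    # your ONTAP version's documentation before use.
    import subprocess

    SSH = ["ssh", "admin@cluster-mgmt.example.com"]

    def ontap(command):
        return subprocess.run(SSH + [command], check=True,
                              capture_output=True, text=True).stdout

    # Instant Access: clone a volume from a snapshot, then mount it for NFS export.
    ontap("volume clone create -vserver svm1 -flexclone vol_app_ia "
          "-parent-volume vol_app -parent-snapshot daily.2015-06-01_0010")
    ontap("volume mount -vserver svm1 -volume vol_app_ia -junction-path /vol_app_ia")

    # "RRP": split the clone so it becomes a permanent, independent volume.
    ontap("volume clone split start -vserver svm1 -flexclone vol_app_ia")

    # Volume restore from a primary snapshot (SnapRestore volume revert).
    ontap("volume snapshot restore -vserver svm1 -volume vol_app "
          "-snapshot daily.2015-06-01_0010")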
Important Use Cases Not Currently Supported by ECX 2.0
VMware Use Data

•  Mirror reverse use cases. Mirror reverse will require the introduction of protection groups to track which sets of VMs must be operated on as a group (protection/recovery) because they reside on common or intersecting storage.
•  RDM and direct-assigned iSCSI disks are not addressed in ECX 2.0 by either the Copy Data or the Use Data workflows.
NetApp Use Data

•  File restore from vault or mirror snapshots. Data ONTAP 8.3 introduced the APIs needed to address this, but too late in the ECX 2.0 development cycle to be included.
•  Volume restore with added enhancements to also re-create the NFS/CIFS configuration that the original volume had, in case the source volume is lost. ECX 2.0 does preserve the original NFS/CIFS configurations if the source volumes exist, so this restriction is limited to the case where the source volumes are no longer present.