Anton Gostev: VP, Product Management, Veeam
- What is data corruption?
- Silent data corruption
- Why does data corruption happen?
- Where does it happen most commonly?
  - In theory
  - In practice
- Common corruption patterns
- Tips and tricks recovering from corrupted Veeam® backup files

Data corruption is an unintended data change (on write or on read).
- Unintended data changes are facts of life!
- Data management systems are designed with corruption in mind
  - Detection algorithms
    - Via parity bit, checksumming (CRC32), hashing (MD5)
    - Retransmit on detection
  - Correction algorithms (includes detection)
    - Via error-correcting code (ECC), erasure codes, etc.
    - In-place data recovery on detection
  - Both increase data footprint noticeably (performance, not so much)

- No detection in hardware and software causes SILENT data corruption
- Most software simply does not implement detection
- However, most backup software does have detection algorithms built in
  - THIS is why you start seeing lots of corruption when dealing with backups
  - It’s NOT because all backup software is bad!

- RAM and CPU electronics problems (sticky bit)
- URE (Unrecoverable Read Error) and storage wear
- Data transfer noise (wireless, Ethernet, FC, SATA)
- Firmware bugs and “optimizations” (NAS software, NIC, RAID controller)
- Software bugs
  - OS kernel
  - File system
  - Virtualization
  - Data management applications

RAM and CPU electronics
- Not just computers (but also routers, NICs, RAID controllers)
- Rated BER (Bit Error Rate) from 10^-12 (1 error per 100 GB). Still want to save money by buying non-ECC memory for your servers?
- Breaks with time due to transistor wear!
  - Normally takes at least 5-7 years, but high operating temperatures speed up the issue
- Results in:
  - Unbootable system (recent Cisco routers issue)
  - Sticky bit silently corrupting all of your data

URE (Unrecoverable Read Error)

  Type               BER      One error per
  Consumer SATA      10^-14   10 TB
  Enterprise SATA    10^-15   100 TB
  Enterprise SAS     10^-16   1 PB
  Tape (LTO)         10^-17   10 PB*
  Tape (Enterprise)  10^-19   1 EB*

  *Assuming proper handling

- Consider mechanical wear for tape and classic hard drives, and electronic wear for SSDs (SLC/MLC/TLC)

Bit Error Rate (BER)

  Type                 BER      One error per         One error every
  1 Gb Ethernet        10^-10   1 GB transferred      10 seconds
  10 Gb Ethernet       10^-12   100 GB transferred    1 minute 40 seconds
  4 Gb Fibre Channel   10^-12   250 GB transferred    ~5 minutes

- Checksummed and retransmitted* if necessary
- Wireless uses Symbol Error Rate (SER); it is affected by technology, hardware, modulation, distance, collisions, noise, etc.

Firmware bugs
- NAS firmware
  - Underlying base OS is usually solid
  - Questionable “optimizations” and tweaks
- RAID controller firmware
  - Most overlooked component in software patching
- NIC firmware
  - Poor quality from certain vendors
  - Also rarely patched, aside from performance troubleshooting cases

Software bugs
- OS kernel (Server 2012 “magic 10 bytes” or Linux SLAB bugs)
- File system (two evils: bad stability vs. bad architecture)
- Virtualization
  - Too much code in between apps and hardware
  - Too many moving pieces (hard to test)
- Data management applications
  - Algorithmic bugs (e.g., incremental without full)
  - Non-transactional I/O handling (common for immature software)
  - Data corruption bugs are extremely rare (hard to make, easy to catch)

Corruption statistics in our support
- Over six years and 120,000 customers
- Through my prism of perception!

Disclaimer
There are three types of people:
1. People who trust statistics
2. People who do not trust statistics
3. People who make up their own statistics

- Network shares
  - Windows SMB client issues
  - Low-end NAS and various appliances
- Corrupted network traffic
  - Bad NIC firmware
  - Vendor ignoring TCP/IP reference
- Storage-level corruption
  - RAID controllers writing rubbish
  - Corruptions from storage-side data processing (e.g., dedupe)

- Network shares?
  - Avoid SMB backup targets
  - Use internal, DAS or block storage
- Corrupt network traffic?
  - Network traffic verification (6.5 and later)
  - Requires locally attached storage!
- Storage-level corruption?
  - SureBackup® with full scan option (or script backup verifier)
  - Copy your backups, remember the 3-2-1 rule!

Common corruption patterns
- Single/Double Bit Errors: bad memory (RAM, cache)
- Magic 10 bytes: Windows Server 2012 kernel bug
- 2^n-sized chunks: Linux SLAB bug
- 64 KB chunks: RAID controller misbehaving
- See “Silent Corruptions” CERN research by Peter Kelemen for more info

Single/Double Bit Errors
- Simple bit flip in a byte
- Usually persistent issue (caused by hardware)
- Typically caused by bad memory
- 1-to-0 transitions are more frequent than 0-to-1*

    00000000  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
    *
    35285650  33 33 33 33 33 33 33 33  33 33 33 33 33 33 22 33  |33333333333333"3|
    35285660  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|

    0x33 = 00110011b
    0x22 = 00100010b

Magic 10 bytes
- Random 10 bytes get random content
- Specific to Windows Server 2012 and above
- Usually persistent for hours/days (until environment change)
- Typically surfaces on heavy non-sequential I/O (copying large files)
- Timing is critical (possible race condition): reduced system load or installing debug tools makes it go away
- Recent occurrence points at system cache corrupting the data before or during flush

2^n-sized chunks
- 128-512 byte chunks of data
- Usually transient
- Sometimes contain identifiable user data
- Typically observed in the vicinity of Out of Memory conditions
- Specific to Linux backup repositories
- Possible corruption in SLAB (so may impact other Unix-based OSes)

64 KB chunks
- Multiple large chunks of 64 KB containing “old” data from previous cycles (can be a few cycles old) or data from another location on disk
- Usually persistent; comes in bursts
- Typically associated with I/O command timeouts
- Size often matches RAID stripe size (64 KB is a common default)
- RAID controller is the primary suspect

Accept and be prepared for corruptions!
- Avoid detecting them at restore time:
  - Use file systems with built-in end-to-end checksumming, data scrubbing and integrity checking (ZFS, ReFS)
  - Scrub your RAID arrays regularly
  - Re-read your tapes

Infamous LZ4/ZLIB/RLE decompression error
- Restore job failed Error: Client error: Zlib decompression error: [-3].
- Windows Event Log: The device … has a bad block (Disk Event 7)
- This is actually good news!
  - Indicates point corruption of backup file (usually, a single block)
  - File-level restore is still possible (unless MFT blocks are hit)
  - Full VM restore will fail, but is still possible with support tools

This is bad news:
- All instances of storage metadata are corrupted. Failed to load metastore: Failed to load metadata partition.
- This backup file is FUBAR :(
  - Metadata store is redundant (two copies), but still gets corrupted.
  - We have no idea how blocks map to backed up files (or their order)
- Copy your backups!

Extracting data from corrupted backups:
- Manual process in the early days
- VeeamRAR support tool (v7)
- Storage Explorer (v8)

Storage Explorer (NEW)
- Analyzes impacted VM files
- Enables image-level restores from corrupt backups
- Can fix invalid summary.xml
- Includes support for encrypted backups and cloud repositories

SureBackup
- Verifies BOTH backup file integrity AND recoverability
- Fully automated (set it and forget it)
- Available in Enterprise edition (and later)

Backup Validator
- Verifies backup file integrity ONLY
- Can be scripted to automate integrity checks
- Backup files must be imported (will not work on a standalone file)
- Available in all product editions

Data corruptions are facts of life. Get them before they get you!
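To make "get them before they get you" concrete, here is a minimal sketch of a scheduled integrity check built on a hash manifest. It is a generic illustration, not how SureBackup or Backup Validator work internally: the function names (`file_digest`, `build_manifest`, `find_corrupted`) and the choice of SHA-256 are assumptions made for this example. Note that a manifest only catches changes that happen after it was recorded; it cannot see corruption that was already in the file when the digest was taken.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(directory):
    """Map each file's relative path to its digest (hypothetical helper).

    Intended to run right after the backup files are written.
    """
    root = Path(directory)
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_corrupted(directory, manifest):
    """Return relative paths that are missing or no longer match the manifest."""
    root = Path(directory)
    bad = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or file_digest(p) != expected:
            bad.append(rel)
    return bad
```

Run on a schedule, a check like this re-reads every file, so it also forces the storage to surface unreadable blocks long before a restore needs them.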
- Test your backup integrity regularly to find corruptions
  - …or find corruptions at restore time, whichever you prefer
- And last, but not least: Copy your backups!
  - And copy them to different storage!
  - Storage-based replication between identical storage devices keeps your data in the same fault domain.

Gostev @ veeam.com (put “corruption” into the subject)

Thank you!