Improvements in Lustre Data Integrity
Nathan Rutman

Topics
• Lustre Wire Checksum Improvements
  – Cleanup
  – Portability
  – Algorithms
  – Performance
• End-to-End Data Integrity
  – T10 DIF/DIX
  – Version Mirroring

Goals
• End-user assurance that their data was written to disk accurately
• Protection against all RAID (single) failure modes
• Offload the heavy checksum calculations from the servers
• Support a wider range of client/server hardware

Over-the-Wire Bulk Checksums

Bulk Checksum History
• Initially only software CRC-32 (IEEE, ANSI), 2007
• Adler-32 added in 1.6.5
  – easy to calculate
  – weak for small message sizes
• Shuichi Ihara added initial support for hardware CRC-32C (Castagnoli)
  – Intel Nehalem
  – bz 23549, landed in Lustre 2.2
• Whamcloud added a multi-threaded ptlrpcd, and bulk RPC checksums moved into ptlrpcd context: parallelized checksums
  – LU-884, LU-1019, in Lustre 2.2
• The sptlrpc implementation used a different set of functions
  – CRC-32, Adler, MD5, SHA1-512

Bulk Checksum Changes
• Cleanup of sptlrpc and bulk checksum algorithms to use the kernel crypto hash library (a code sketch follows the T10 DIF slides below)
  – simplifies future additions
  – LU-1201
• Addition of software CRC-32C support
  – e.g. if the server has hardware support and the clients don't
  – LU-1201
• Implementation of hardware CRC-32 using PCLMULQDQ
  – Intel Westmere
  – MRP-314, still testing

Bulk Checksum Speeds, MB/s
[Bar chart: throughput (0-3000 MB/s) for Adler (SW zlib, rsync), CRC-32 (IEEE 802.3 / ANSI X3.66; used by gzip, bzip2, mpeg2, png), and CRC-32C (used by iSCSI, Btrfs), comparing software and hardware implementations.]

End-to-End Data Integrity with T10 and Version Mirroring

T10 DIF
• 8 bytes of protection information (PI) appended to each 512-byte sector (a struct sketch follows below)
• HBA and disk drives must support T10-DIF in hardware
[Figure 1, DIF tuple contents: 512 bytes of data, followed at byte offset 512 by a 16-bit guard tag (GRD, a CRC of the 512-byte data portion), at offset 514 by a 16-bit application tag (APP), and at offsets 516-519 by a 32-bit reference tag (REF). The guard tag protects the data portion of the sector; the reference tag protects against out-of-order and misdirected write scenarios. Standardizing the contents of the protection information enables every node in the I/O path, including the disk itself, to verify the integrity of the data block.]

T10 DIF
[Figure, T10 DIF I/O write path: the application writes a byte stream to the OS; the filesystem writes in logical blocks that are multiples of 512-byte sectors; the HBA calculates all of the PI and sends out 520-byte sectors; the SAN switch can optionally check the PI (transport CRC); the array firmware verifies the PI, optionally remaps reference tags, and writes to disk; the disk verifies the PI (sector CRC) before storing the request.]
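As background for the LU-1201 cleanup above, here is a minimal sketch of what computing a bulk checksum through the kernel crypto hash library looks like. This is not the actual Lustre patch; the function name bulk_csum_crc32c is invented for illustration. Requesting "crc32c" by name is what makes hardware offload transparent: the crypto layer picks the highest-priority implementation (the SSE4.2 driver on Nehalem-class CPUs, the generic software table otherwise), which is why routing all algorithms through this one API simplifies future additions.

```c
#include <crypto/hash.h>
#include <linux/err.h>
#include <linux/slab.h>

static int bulk_csum_crc32c(const void *buf, unsigned int len, u32 *csum)
{
	struct crypto_shash *tfm;
	struct shash_desc *desc;
	int rc;

	/* Resolves to the HW (SSE4.2) driver when present,
	 * the generic SW implementation otherwise. */
	tfm = crypto_alloc_shash("crc32c", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	/* A shash_desc carries per-request state after the tfm. */
	desc = kmalloc(sizeof(*desc) + crypto_shash_descsize(tfm),
		       GFP_KERNEL);
	if (!desc) {
		crypto_free_shash(tfm);
		return -ENOMEM;
	}
	desc->tfm = tfm;

	/* init/update/final in one call; crc32c's digest is 4 bytes. */
	rc = crypto_shash_digest(desc, buf, len, (u8 *)csum);

	kfree(desc);
	crypto_free_shash(tfm);
	return rc;
}
```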
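The DIF tuple from Figure 1 maps onto a C structure like the following. This is a sketch of the on-wire layout (the Linux block layer defines an equivalent structure for its DIF support), not Lustre code; all fields are big-endian on the wire.

```c
#include <linux/types.h>

/* The 8 bytes of protection information appended to each
 * 512-byte sector under T10 DIF. */
struct t10_dif_tuple {
	__be16 guard_tag;	/* GRD: CRC-16 of the 512 data bytes */
	__be16 app_tag;		/* APP: free for application/OS use */
	__be32 ref_tag;		/* REF: typically the low 32 bits of the
				 * LBA; catches misdirected and
				 * out-of-order writes */
};
```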
T10 DIX
[Figure, T10 DIF + DIX I/O write path: with DIX, the user application or OS generates the guard tag, and the OS attaches protection information to the I/O if none has been provided; the HBA merges the data and PI scatterlists into 520-byte sectors rather than computing the PI itself. As with plain DIF, the SAN switch can optionally check the PI (transport CRC), the array firmware verifies the PI and optionally remaps reference tags before writing to disk, and the disk verifies the PI (sector CRC) before storing the request.]

T10 DIX with Lustre
[Figure, Lustre end-to-end with T10: the application writes a byte stream to the OS; the OSC generates the GRD tags, which double as the bulk checksum for the wire; 512-byte sectors plus 8-byte PI flow to the OSS/OST, where the GRD can be verified, then through MDRAID and the I/O controller/HBA as 520-byte sectors, with the PI checked again by the disk array and the disk drive (sector CRC).]
(A sketch of client-side GRD generation follows below.)
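A minimal sketch of the client-side guard-tag generation implied by the figure, assuming one CRC-16 guard tag per 512-byte sector computed with the kernel's crc_t10dif() helper (the standard T10-DIF CRC, polynomial 0x8BB7). The function name osc_generate_guard_tags is invented for illustration, not actual OSC code. Computing the GRD once, nearest the application, is what lets the same tag serve as the wire checksum and be re-verified by the HBA, array, and disk.

```c
#include <linux/types.h>
#include <linux/crc-t10dif.h>
#include <asm/byteorder.h>

/* Fill one big-endian guard tag per 512-byte sector of a bulk
 * buffer; APP and REF tags are filled elsewhere (e.g. versions
 * and LBAs respectively). */
static void osc_generate_guard_tags(const void *data,
				    unsigned int nr_sectors,
				    __be16 *grd)
{
	const u8 *sector = data;
	unsigned int i;

	for (i = 0; i < nr_sectors; i++, sector += 512)
		grd[i] = cpu_to_be16(crc_t10dif(sector, 512));
}
```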
Changes to Lustre
• Additional checksum data described or carried in the brw RPC
• Add PI generation and checking to the data path
• For mmap'ed pages, an early GRD failure implies the data has changed; recompute from the OSC
• Disable bulk checksums
• Optional GRD checking on the OSS can push all checksum load to the HBA/disk hardware

RAID failure modes
("Parity Lost and Parity Regained", Andrew Krioukov)
• Latent sector (reliable read) errors
• Data corruption
• Misdirected writes
• Lost writes
• Torn writes
• Parity pollution
• Outcomes
  – Data recovery
  – Data loss (detected)
  – Data corruption (silent)

RAID with Version Mirroring
[Figure: a RAID stripe of data blocks A, B, C and parity block P(ABC); each block carries per-sector GRD checksums cksum(A), cksum(B), cksum(C), cksum(P) and a version in the APP field (Ver0 on each data block; the parity block holds the version vector 0,0,0).]
• RAID stripe across disks
• Block (chunk) on one disk
• Multiple sectors per block
• Store the block version in the T10 APP field
  – sectors within a chunk store the same version
  – the parity block contains a version vector

Rebuild with Version Mirroring
Update C to C': write C' and P(ABC').
[Figure: the write of C' is lost; the disks still hold A, B, C at Ver0, but the parity is now P(ABC') with version vector 0,0,1.]
Later, attempt to update A to A':
• First, read B and C to prepare for constructing the new parity
• But first read P to verify versions before writing P(A'BC')
• Version mismatch: the latest version is in P, so reconstruct C' from P
• Write C'
• Calculate P(A'BC')
• Write A' and P(A'BC')
(A sketch of the version check appears at the end of this transcription.)

RAID failure modes
• Latent sector (reliable read) errors - drive detects
• Data corruption - GRD, lightweight, partial reads
• Misdirected writes - REF
• Lost writes - parity block version vector
• Torn writes - sector versions
• Parity pollution - GRD + versions allow safe reconstruction

Fin
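Appendix: a minimal sketch of the lost-write check from the "Rebuild with Version Mirroring" slide. All types and names here are illustrative pseudocode in C, not Lustre or MD-RAID structures: before computing new parity, each data chunk's version (mirrored into the APP tag of every one of its sectors) is compared against the version vector kept in the parity block.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative types only. */
struct data_chunk {
	uint16_t version;	/* mirrored into the APP tag of
				 * every sector in the chunk */
	/* ... sector data and per-sector GRD checksums ... */
};

struct parity_chunk {
	uint16_t version_vec[3];	/* one entry per data chunk,
					 * e.g. {0,0,1} in the example */
	/* ... parity data ... */
};

/* A mismatch means the parity saw a newer write than the chunk on
 * disk (a lost write), so the stale chunk (C in the example) must
 * first be reconstructed from parity before new parity is computed
 * and written. */
static bool chunk_is_stale(const struct data_chunk *c, unsigned int idx,
			   const struct parity_chunk *p)
{
	return c->version != p->version_vec[idx];
}
```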