Improvements in Lustre Data Integrity

Transcription

Nathan Rutman
Friday, April 20, 12
Topics
•Lustre Wire Checksum Improvements
–Cleanup
–Portability
–Algorithms
–Performance
•End-to-End Data Integrity
–T10 DIF/DIX
–Version Mirroring
Goals
•End user assurance that their data was written to disk
accurately
•Protection against all RAID (single) failure modes
•Offload the heavy calculations from the servers
•Support a wider range of client/server hardware
Over-the-Wire Bulk Checksums
Bulk Checksum History
•Initially (2007) only software CRC-32 (IEEE, ANSI)
•Adler-32 added in 1.6.5
–easy to calculate
–weak for small message sizes
•Shuichi Ihara added initial support for hardware CRC-32C
(Castagnoli)
–Intel Nehalem
–bz 23549, landed in Lustre 2.2
•WC added multi-threaded ptlrpcd, and bulk RPC checksums
moved into ptlrpcd context: parallelized checksums
–LU-884, LU-1019, in Lustre 2.2
•sptlrpc implementation used a different set of functions
–CRC-32, Adler, MD5, SHA1-512
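Both of the early wire-checksum algorithms mentioned above are available in the Python standard library, which makes the comparison easy to see; this is purely an illustrative sketch, not Lustre code:

```python
import zlib

# CRC-32 (IEEE/ANSI polynomial) and Adler-32 over the same payload.
payload = b"hello"
crc = zlib.crc32(payload) & 0xFFFFFFFF
adler = zlib.adler32(payload) & 0xFFFFFFFF

print(hex(crc))    # 0x3610a686
print(hex(adler))  # 0x62c0215
```

Adler-32 is cheap because it is just two running sums, but on short messages those sums take few distinct values, which is why the slide calls it weak for small message sizes.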
Bulk Checksum Changes
•Cleanup of sptlrpc and bulk checksum algorithms to use the
kernel crypto hash library
–simplifies future additions
–LU-1201
•Addition of Software CRC-32C support
–e.g., if the server has HW support and clients don’t
–LU-1201
•Implementation of Hardware CRC-32 using PCLMULQDQ
–Intel Westmere
–MRP-314, still testing
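CRC-32C differs from the classic CRC-32 only in its polynomial (Castagnoli, 0x1EDC6F41), which is what the SSE4.2 CRC32 instruction computes in hardware. A software fallback of the kind described here can be sketched in a few lines; this bit-serial version is illustrative only (real implementations are table- or instruction-driven):

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bit-serial CRC-32C (Castagnoli): reflected, init/final XOR 0xFFFFFFFF."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # 0x82F63B78 is the reflected form of the 0x1EDC6F41 polynomial.
            crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1))
    return crc ^ 0xFFFFFFFF

# Standard check value for the nine ASCII digits "123456789".
print(hex(crc32c(b"123456789")))  # 0xe3069283
```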
Bulk Checksum Speeds, MB/s
[Chart: software vs. hardware checksum throughput, 0–3000 MB/s, for Adler (used by Zlib, rsync), CRC-32 (IEEE 802.3, ANSI X3.66; used by gzip, bzip2, mpeg2, png), and CRC-32C (used by iSCSI, Btrfs).]
End-to-End Data Integrity with T10 and Version
Mirroring
T10 DIF
•8 bytes of PI appended to 512-byte sectors
•DIF tuple contents (Figure 1):
–bytes 512–513: 16-bit guard tag (GRD), a CRC of the 512-byte data portion
–bytes 514–515: 16-bit application tag (APP)
–bytes 516–519: 32-bit reference tag (REF), protects against misdirected and out-of-order write scenarios
•Standardizing the contents of the protection information enables all nodes in the I/O path, including the disk itself, to verify the integrity of the data block
•HBA and disk drives must support T10-DIF in hardware
[Figure: T10-DIF I/O write path. The application writes a byte stream to the OS; the filesystem writes in logical blocks that are multiples of 512-byte sectors; the HBA calculates all protection information and sends out 520-byte sectors; the SAN switch can optionally check protection information; the array firmware verifies protection information, optionally remaps reference tags, and writes to disk; the disk verifies protection information before storing the request.]
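The DIF tuple layout can be illustrated with a short Python sketch. The helper names here are hypothetical; the guard tag uses the T10-DIF CRC-16 polynomial 0x8BB7 (MSB-first, zero initial value):

```python
import struct

def crc16_t10dif(data: bytes, crc: int = 0) -> int:
    """Bit-serial CRC-16 over `data` using the T10-DIF polynomial 0x8BB7."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def make_pi(sector: bytes, app_tag: int, ref_tag: int) -> bytes:
    """Build the 8-byte DIF tuple: 16-bit GRD, 16-bit APP, 32-bit REF."""
    assert len(sector) == 512
    grd = crc16_t10dif(sector)
    return struct.pack(">HHI", grd, app_tag, ref_tag)

# A 520-byte protected sector is the 512 data bytes followed by the PI tuple.
sector = bytes(512)
protected = sector + make_pi(sector, app_tag=0, ref_tag=0)
assert len(protected) == 520
```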
T10 DIX
•User application or OS generates the GRD tag
•HBA merges the data and PI scatterlists
[Figure: T10-DIF + T10-DIX I/O write path. The application writes a byte stream to the OS, optionally including protection information; the filesystem writes in logical blocks that are multiples of 512-byte sectors; if no protection information has been generated, the OS automatically does so and attaches it to the I/O; the HBA merges the data and PI scatterlists and sends out 520-byte sectors; the SAN switch can optionally check protection information; the array firmware verifies protection information, optionally remaps reference tags, and writes to disk; the disk verifies protection information before storing the request.]
T10 DIX with Lustre
[Figure: Lustre end-to-end with T10. The application writes a byte stream to the OS; the OSC generates the bulk checksum/PI (GRD) on the client; the OSS/OST can verify GRD; the I/O controller/HBA sends out 520-byte sectors (512 bytes of data plus 8 bytes of PI); the SAN, MDRAID disk array, and disk drive each verify the protection information before the sector is stored.]
Changes to Lustre
•Additional checksum data described or carried in brw
RPC
•Add PI and checking to data path
•For mmap’ed pages, early GRD failure implies data has
changed, recompute from OSC
•Disable bulk checksums
•Optional GRD checking on OSS can push all checksum
load to HBA/disk hardware
RAID failure modes
Parity Lost and Parity Regained - Andrew Krioukov
•Latent Sector (reliable read) errors
•Data Corruption
•Misdirected Writes
•Lost Writes
•Torn Writes
•Parity Pollution
•Outcomes
–Data recovery
–Data loss (detected)
–Data corruption (silent)
RAID with Version Mirroring
[Diagram: a RAID stripe of data blocks A, B, C and parity P(ABC). Each block's GRD field holds its checksum (cksum(A), cksum(B), cksum(C), cksum(P)) and its APP field holds its version (Ver0); the parity block's APP field stores the version vector (0,0,0).]
•RAID stripe across disks
•Block (chunk) on one disk
•Multiple sectors per block
•Store block version in T10 APP field
–sectors within a chunk store the same version
–parity block contains version vector
Rebuild with Version Mirroring
Update C to C’, write C’ and P(ABC’)
[Diagram: lost write. The disks still hold A, B, C at versions Ver0, Ver0, Ver0 (GRD tags cksum(A), cksum(B), cksum(C)), but the new parity P(ABC’) carries version vector (0,0,1).]
•Later, attempt to update A to A’
•First, read B & C to prepare for constructing the new P
•But first read P to verify versions before writing P(A’BC’)
•Version mismatch: the latest version of C is in P, so reconstruct C’ from P
•Write C’
•Calculate P(A’BC’)
•Write A’ and P(A’BC’)
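The lost-write recovery sequence above can be simulated in a few lines of Python. This is a toy XOR-parity stripe with per-block version counters standing in for the T10 APP tags; all names are hypothetical:

```python
def xor(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks (toy RAID-5 parity)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

class Stripe:
    def __init__(self, blocks):
        self.data = list(blocks)        # data blocks as stored on disk
        self.ver = [0] * len(blocks)    # per-block version (APP tag)
        self.parity = xor(*blocks)
        self.pvec = list(self.ver)      # version vector stored with parity

    def write(self, i, new, lose_data_write=False):
        # Read-modify-write parity update; a "lost write" reaches the
        # parity disk but never lands on the data disk.
        self.parity = xor(self.parity, self.data[i], new)
        self.pvec[i] += 1
        if not lose_data_write:
            self.data[i] = new
            self.ver[i] += 1

    def read_verified(self, i):
        # Version mismatch: parity holds the newer version, so reconstruct
        # block i from parity and its peers, as on the slide.
        if self.ver[i] != self.pvec[i]:
            peers = [b for j, b in enumerate(self.data) if j != i]
            self.data[i] = xor(self.parity, *peers)
            self.ver[i] = self.pvec[i]
        return self.data[i]

s = Stripe([b"\x01" * 4, b"\x02" * 4, b"\x04" * 4])
s.write(2, b"\x08" * 4, lose_data_write=True)   # write of C' is lost
assert s.read_verified(2) == b"\x08" * 4        # C' recovered from parity
```

In real version mirroring the version lives in the T10 APP field of every sector and the parity block stores the full version vector; this sketch only shows the mismatch-and-reconstruct step.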
RAID failure modes
•Latent Sector (reliable read) errors - drive detects
•Data Corruption - GRD, lightweight, partial reads
•Misdirected Writes - REF
•Lost Writes - parity block version vector
•Torn Writes - sector versions
•Parity Pollution - GRD + versions allow safe
reconstruction
Fin