TSM Performance Tuning
Exploiting the full power of modern industry-standard Linux systems with TSM
Stephan Peinkofer, [email protected]

Agenda
- Network Performance
- Disk-Cache Performance
- Tape Performance
- Server Performance
- Lessons Learned
- Additional Resources

Network

The Problem with High-Speed Networks
- Current Ethernet technology can transfer up to 1.25 GB/s
- With default settings we cannot even saturate a single Gigabit link

Tuning Network Settings for Gigabit and Beyond
Utilizing (multi-)Gigabit links requires tuning of:
- TCP window size: how much can be sent/received before waiting for an ACK
- Maximum Transfer Unit (MTU): how much can be sent/received per Ethernet frame

TCP Window Size

$> cat /etc/sysctl.conf
...
net.ipv4.tcp_rmem = 4096 87389 4194304
net.ipv4.tcp_wmem = 4096 87389 4194304
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304

- Sets a limit of 4 MB for the receive and send window
- The TSM TCP window size option has to be set to 2 MB on server and client

Maximum Transfer Unit

$> ifconfig ethX mtu XXXX

$> cat /etc/sysctl.conf
...
net.ipv4.ip_no_pmtu_disc = 0

- Set the MTU to the maximum supported size
- Enable path MTU discovery for communication with hosts that do not use jumbo frames
- Only useful if every intermediate system supports jumbo frames

Measuring the Success
- IPERF was used to benchmark the network performance
- http://dast.nlanr.net/Projects/Iperf

Measuring the Success

Server:
$> iperf -s -w 1M -f M

Client:
$> iperf -c <server> -t 20 -w 1M -f M
------------------------------------------------------------
Client connecting to <server>, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local <IP> port 36484 connected with <IP> port 5001
[ 3] 0.0-20.0 sec  10665 MBytes  533 MBytes/sec

Measuring the Success
(Chart: influence of the TCP window size on a 10-Gbit Ethernet link)

Some Thoughts on Bonding/Trunking
- Great for high availability
- Mostly not suitable for increasing performance
  - a single client can utilize only a single link
  - multiple clients balance across the available links only if:
    - clients and server are in the same subnet, or
    - the balancing algorithm uses IP addresses (unlikely)
- We have to keep in mind that:
  - the switch is responsible for balancing incoming traffic
  - the server is responsible for balancing outgoing traffic

Alternatives to Bonding
- Use the next Ethernet generation
- Balance manually by using multiple IPs

Disk Storage
(Photo from Helmut Payer, gsiCom)

Main Factors for Good Disk-Cache Performance
- Stripe size
- Locality of disk accesses
- IO subsystem of the OS
- Number of FC links utilized in parallel

Stripe Size
- Rule of thumb:
  - random IO => small stripe size
  - sequential IO => large stripe size
- The TSM disk cache is rather a sequential IO workload
  - use a stripe size of 512 KB or larger
- The TSM database is rather a random IO workload
  - IBM recommends a stripe size of 256 KB

Locality of Disk Accesses
- How TSM uses disk-cache volumes cannot be influenced
- How the OS lays out the volumes can be influenced

Locality of Disk Accesses
TSM can allocate multiple disk volumes in parallel:

tsm> DEFINE VOLUME /stg/vol1.dsm FORMATSIZE=16G
ANR0984I PROCESS XX for DEFINE VOLUME started ...
...
tsm> DEFINE VOLUME /stg/vol4.dsm FORMATSIZE=16G
ANR0984I PROCESS XY for DEFINE VOLUME started ...
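One way to start several DEFINE VOLUME processes at roughly the same time is to launch them from the shell through the administrative command-line client. The following is only a minimal sketch: the credentials, the storage-pool name "backuppool", and the FORMATSIZE value (assuming it is specified in megabytes) are placeholders and do not come from the slides.

#!/bin/sh
# Sketch: issue several DEFINE VOLUME commands concurrently so that the
# file system can allocate the volume files in parallel.
# ADMIN/SECRET and the pool name "backuppool" are placeholders.
for i in 1 2 3 4; do
  dsmadmc -id=ADMIN -password=SECRET \
    "define volume backuppool /stg/vol${i}.dsm formatsize=16384" &  # 16 GB, assuming MB units
done
wait  # wait for all background dsmadmc sessions to finish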
How the volumes are placed on disk depends on the file system.

XFS
- Allocates disk blocks when the file system buffer is flushed
(Diagram: Write(1)...Write(4) -> file system cache -> flush buffers -> disk)

EXT3
- Allocates disk blocks as soon as the data hits the file system buffer
(Diagram: Write(1)...Write(4) -> file system cache -> flush buffers -> disk)

Comparing EXT3 and XFS
- XFS has no problems with parallel allocation of disk volumes
- XFS has a slight weakness with re-write workloads
- On EXT3, volumes have to be defined one after another

Linux IO Subsystem
- Linux's IO subsystem is rapidly evolving
- More and more screws to turn
- More and more complex to tune

Linux IO Subsystem
Current observation:
- Write performance is OK with default settings
- Read performance must be tuned by setting the read-ahead of the block device

$> blockdev --setra <sectors> <device>

IO Multipathing
- Typically more than one FC link is used to connect servers to storage, for HA reasons
- The available FC links can be used in parallel to gain optimal performance
- The IO-balancing algorithm depends on the IO-failover driver
- The configuration needed to exploit the performance benefit depends on the algorithm

IO Multipathing with QLogic Drivers
- The QLogic driver supports assigning individual LUNs to a specific FC link
  - performance per LUN is not increased
- Resulting configuration:
  - use at least 2 LUNs per TSM instance and stripe them with software RAID 0
  - use multiple TSM instances per server, with dedicated LUNs per instance

Measuring the Success
- IOZONE was used to benchmark the disk performance
- http://www.iozone.org

Measuring the Success

Write a file sequentially:
$> iozone -s 10g -r 512k -t 1 -i0 -w

Read a file sequentially:
$> iozone -s 10g -r 512k -t 1 -i1 -w

-s 10g  : amount to write/read is 10 GB
-r 512k : record size to write/read is 512 KB
-t 1    : write/read 1 file in parallel
-i 0|1  : perform write | perform read
-w      : don't delete the files after the benchmark

Comparison of Stripe Size
(Chart: IBM FAStT900 with 6 SATA disks in a RAID 5 volume; workload: single-file sequential read/write)

EXT3 Block Allocation
(Chart: IBM FAStT900 with 6 SATA disks in a RAID 5 volume; workload: 12 parallel sequential reads)

Comparison of Read-Ahead
(Chart: STK FlexLine 380 with 7 FC disks in a RAID 5 volume)

Tape Storage

TSM Tape Performance
- No real influence on tape performance
- Barely saw 125 MB/s for more than a few seconds with Titanium drives
- TSM v5.3 on Linux does not seem to be ready for current high-end tape drives yet
- Assumption: some buffers are too small
(Photo from Sun Microsystems)

Server
(Photo from Helmut Payer, gsiCom)

Main Factors of Server Performance
- PCI bus throughput
- Memory bandwidth
- Number of CPU cores
- Performance of a single CPU core

PCI Bus Throughput
- Data travels 4 times over the PCI bus
  - => the PCI bus is the main bottleneck
- PCI-X barely achieves half of its theoretical throughput in typical TSM workloads
- PCI Express performs much better because of its switched topology
- General rule: don't try to save money on the peripheral interconnect

Memory Bandwidth
- As long as direct IO is not used, data travels 4 times through memory
- Database operations rely on memory performance, too

Number of CPU Cores
- TSM is a multi-threaded application
  - the more CPU cores are available, the more work can be done in parallel

Lessons Learned: Tuning
- Network
  - TCP window size: always
  - MTU: if applicable
- Disk (see the sketch below)
  - read-ahead
  - define cache/DB/log volumes sequentially
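To make the disk-side lessons concrete, here is a minimal sketch of preparing two LUNs for one TSM instance as described above: stripe them with software RAID 0, create an XFS file system on top, and raise the block-device read-ahead. The device names /dev/sdb and /dev/sdc, the mount point, the chunk size, and the read-ahead value are assumptions for illustration, not values from the slides.

#!/bin/sh
# Sketch: two dedicated LUNs per TSM instance, striped with software RAID 0.
# /dev/sdb and /dev/sdc are placeholder LUN device names.
mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=512 /dev/sdb /dev/sdc

# XFS allocates blocks when its buffers are flushed, so TSM volumes can be
# allocated in parallel without interleaving them on disk.
mkfs.xfs /dev/md0
mkdir -p /stg
mount /dev/md0 /stg

# Raise the read-ahead of the block device; --setra counts 512-byte sectors.
blockdev --setra 16384 /dev/md0  # 16384 sectors = 8 MB read-ahead (example value)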
Criteria for the Next Servers
- Have the fastest peripheral interconnect available
- Have 10-Gbit Ethernet
- Have at least 4-Gbit FC HBAs
- Have at least 4 CPU cores
- Have upper-class CPU-core performance

Additional Resources
- IBM Tivoli Storage Manager Performance Tuning Guide v5.3
- IBM DS4000 Best Practices and Performance Tuning Guide

Thank you for your attention. Any questions?
Contact: [email protected]