Low Latency Tunings
LOW LATENCY SETTINGS:

1. TCP/UDP buffers (we may already have standard settings for these)
   net.ipv4.tcp_rmem
   net.ipv4.tcp_wmem
   net.ipv4.tcp_mem
   net.core.wmem_max
   net.core.rmem_max
   net.core.wmem_default
   net.core.rmem_default
   e.g.
   net.ipv4.tcp_rmem = 4096 16777216 33554432
   net.ipv4.tcp_wmem = 4096 16777216 33554432
   net.ipv4.tcp_mem = 4096 16777216 33554432 (note that tcp_mem is expressed in pages, not bytes)
   net.core.rmem_default = 16777216
   net.core.wmem_default = 16777216
   net.core.rmem_max = 16777216
   net.core.wmem_max = 16777216
   net.core.netdev_max_backlog = 30000
   net.ipv4.tcp_low_latency = 1 (optimize the TCP stack for low latency over high throughput. Faster connection times, works better for smaller pkts)
   net.ipv4.tcp_timestamps = 0
   net.ipv4.tcp_ecn = 0 (disable explicit congestion notification; helps when we see large file transfers stall in a small environment with 1 or 2 routers)
   net.ipv4.tcp_app_win = 0 (do not reserve space or handle differently for slower apps. The larger the value, the smaller the buffer for a specific window)
   net.ipv4.tcp_no_metrics_save = 1 (a value of 1 stops TCP from saving connection metrics in the route cache when connections are closed. Those cached metrics are otherwise consulted for future similar connections. May improve performance on the market access side or wherever we have many similar connections)
   rhash_entries=1024

2. Initial congestion window for the routes (default is 3, i.e. 3 pkts can be sent initially without acks. May be required on the market access side, where order sending might otherwise be delayed or bundled)
   Change it with: ip route change default via n.n.n.n dev eth0 initcwnd #
   (here I did it for the default route, but it can be done for any route)
   To check the value: ip route show

3. Disable C-states, C1E state (if it exists), HT and power management from the BIOS.

4. Disable SMIs (when supported/agreed by the HW vendors) through the BIOS. Not all vendors support this by default. Will greatly minimize jitter.

5. Disable/remove irqbalance, cpuspeed and other services that are not needed.

6. SFC (Solarflare)
   lro=0 (large receive offload off; will not bundle received pkts into larger chunks)
   lso=0 (large send offload off; will not bundle tx pkts into larger chunks)
   rx_alloc_method=1 (how memory is allocated for received pkts by the sfc driver. A value of 1 is better for smaller pkts. This is what we have)
   rss_cores=#cores (create per-core queues; helpful if dropping pkts consistently and also observing delays in receiving pkts in the application)
   sfc_affinity (to affinitize or steer pkt streams to cores/connections. Cannot be used with onload simultaneously)
   There are kernel parameters at the OS level to steer pkts as well, which I have not tested.

7. Netdev buffers
   net.core.netdev_max_backlog = # (higher than the default 2500; for 10g it should be much higher)
   net.core.netdev_budget = 300 (depends. The default of 300 is usually ok. Max number of pkts taken from the interfaces in each polling cycle)
   (I have written a simple script to monitor the netdev buffers and will send it soon)

8. Ring buffers, interrupt coalescing, moderation
   Adjust the ring buffers accordingly (SFC might have a limit of 4096 for both RX and TX; other cards have different values).
   No interrupt coalescing or moderation for any interface.

9. Bind device (NIC, RAID, disk, USB, etc.) interrupts to certain cores
   What gets bound to which cores depends on the application. We must achieve data locality for the application.
   See which cores are in which sockets. Core numbering varies between vendors even with the same Intel/AMD processors; "numactl --hardware" will show it for sure.

10. ONLOAD (if and when used)
   EF_POLL_USEC=1000 (onload spins in user space for 1 ms. If not set, it will block. Better to stay in user space rather than in kernel space. Will greatly increase performance, as it avoids interrupts and kernel context switches. By default onload uses epoll() when it goes into polling mode)
   Can check with: onload_stackdump | grep poll
   EF_RXQ_SIZE (needs to match the ring buffer RX setting)
   EF_TCP_INITIAL_CWND (may need adjusting depending on the initial congestion window set as described above in this email)
   There are other settings, which all depend on how the application performs when using onload. "onload_stackdump" is a good starting point.
   Onload can only be used with SFC cards; other cards have TOE, which needs different settings.

11. Using cpuset, tuna, isolcpus, taskset
   It is good practice to use isolcpus to isolate cores from general scheduling. I believe we are already doing this here.
   cpuset and tuna do almost the same thing, so use whatever the appdev/unix teams are comfortable with. I personally like tuna better: once I know which thread is doing what, I can bind each one to the proper core in order to achieve data locality. To do this we need to understand the apps better. Is the app running in a busy loop or not?
   Personally I prefer to bind a thread to a certain core and let it run there all the time. With cpusets the thread can migrate between cores in the same set, which sometimes may not be desirable.
   Setting thread priorities/scheduling classes might also be a good idea. But all of these depend on the application behavior.

12. Events kernel thread
   When the application is observing delays in receiving pkts/messages, or there is a longer than usual time between a pkt arriving at the NIC and the message being delivered to the application, we may need to bump up the priority of the kernel events thread. These threads process pending events in the kernel generated by device drivers. This should speed up the epoll(), poll() and select() calls, which I believe most NIC vendors use. When using ONLY onload this would not matter much.

13. MRG 1.3
   A whole new ball game that MUST be carefully planned when used. Maybe soon at a wholesale level :-) There are different things that need to be done for this.
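Before changing any of the sysctl values in items 1 and 7, it can help to see where the running kernel already differs from them. The following is a minimal read-only sketch, assuming a Linux host with /proc/sys mounted. The function names (sysctl_path, normalize, check_setting) are made up for illustration, and the values are the example values from the notes above, not recommendations.

```shell
#!/bin/sh
# Read-only sketch: report where the running kernel differs from a wanted
# sysctl value. Nothing is written; applying changes is left to sysctl(8).

# Map a dotted sysctl key to its file under /proc/sys.
sysctl_path() {
    echo "/proc/sys/$(echo "$1" | tr . /)"
}

# Squeeze tabs and repeated blanks so multi-value entries compare cleanly.
normalize() {
    echo "$1" | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# check_setting KEY WANTED: print OK, DIFF or MISSING for one setting.
check_setting() {
    cs_path=$(sysctl_path "$1")
    if [ ! -r "$cs_path" ]; then
        echo "MISSING $1"
        return 0
    fi
    cs_have=$(normalize "$(cat "$cs_path")")
    cs_want=$(normalize "$2")
    if [ "$cs_have" = "$cs_want" ]; then
        echo "OK      $1 = $cs_have"
    else
        echo "DIFF    $1: running='$cs_have' wanted='$cs_want'"
    fi
}

# Example values taken from the notes above (illustrative only).
check_setting net.core.rmem_max 16777216
check_setting net.core.wmem_max 16777216
check_setting net.ipv4.tcp_rmem "4096 16777216 33554432"
check_setting net.core.netdev_max_backlog 30000
check_setting net.ipv4.tcp_timestamps 0
```

Keys that the running kernel does not expose are reported as MISSING rather than treated as errors, since the available tunables vary between kernel versions.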
UDP PACKET LOSS:

1. Poorly sized UDP receive (rx) buffer sizes / UDP buffer socket overflows
Description: Oracle RAC Global cache block processing is 'bursty' in nature and, consequently, the OS may need to buffer receive (rx) packets while waiting for CPU. Unavailable buffer space may lead to silent packet loss and global cache block loss. `netstat -s` or `netstat -su` on most UNIX platforms will help determine UDPInOverflows, packet receive errors, dropped frames, or packets dropped due to buffer full errors.
Action: Packet loss is often attributed to inadequate (rx) UDP buffer sizing on the recipient server, resulting in buffer overflows and global cache block loss. The UDP receive (rx) buffer size for a socket is set to 128k when Oracle opens the socket if the OS setting is less than 128k. If the OS setting is larger than 128k, Oracle respects the value and leaves it unchanged. The UDP receive buffer size will automatically increase according to database block sizes greater than 8k, but will not increase beyond the OS dependent limit. UDP buffer overflows, packet loss and lost blocks may be observed in environments where there are excessive timeouts on 'global cache cr requests' due to an inadequate buffer setting when DB_FILE_MULTIBLOCK_READ_COUNT is greater than 4. To alleviate this problem, increase the UDP buffer size and decrease DB_FILE_MULTIBLOCK_READ_COUNT for the system or active session.
To determine if you are experiencing UDP socket buffer overflow and packet loss, on most UNIX platforms, execute `netstat -s` or `netstat -su` and look for either "udpInOverflows", "packet receive errors" or "fragments dropped", depending on the platform.
NOTE: UDP packet loss usually results in increased latencies, decreased bandwidth, increased CPU utilization (kernel and user), and memory consumption to deal with packet retransmission.

2. Poor interconnect performance and high CPU utilization. `netstat -s` reports packet reassembly failures
Description: Large UDP datagrams may be fragmented and sent in multiple frames based on Maximum Transmission Unit (MTU) size. These fragmented packets need to be reassembled on the receiving node. High CPU utilization (sustained or frequent spikes), inadequate reassembly buffers and UDP buffer space can cause packet reassembly failures. `netstat -s` reports a large number of Internet Protocol (IP) 'reassembles failed' and 'fragments dropped after timeout' in the "IP Statistics" section of the output on the receiving node. Fragmented packets have a time-to-live for reassembly. Packets that are not reassembled are dropped and requested again. Fragments that arrive when there is no space for reassembly are silently dropped.
`netstat -s` IP stat counters:
   3104582 fragments dropped after timeout
   34550600 reassemblies required
   8961342 packets reassembled ok
   3104582 packet reassembles failed
Action: Increase fragment reassembly buffers, allocating more space for reassembly. Increase the time allowed to reassemble packet fragments, increase UDP receive buffers to accommodate network processing latencies that aggravate reassembly failures, and identify CPU utilization that negatively impacts network stack processing.
On LINUX: To modify reassembly buffer space, change the following thresholds:
   /proc/sys/net/ipv4/ipfrag_low_thresh (default = 196608)
   /proc/sys/net/ipv4/ipfrag_high_thresh (default = 262144)
To modify packet fragment reassembly times, modify:
   /proc/sys/net/ipv4/ipfrag_time (default = 30)
See your OS documentation for the equivalent command syntax.

3. Network packet corruption resulting from UDP checksum errors and/or send (tx) / receive (rx) transmission errors
Description: UDP includes a checksum field in the packet header which is read on receipt. Any corruption of the checksum results in silently dropped packets. Checksum corruptions result in packet retransmissions, additional CPU overhead for the additional request and latencies in packet processing.
Action: Use tcpdump, snoop or another network sniffer utility to capture packet dumps to identify checksum errors and confirm checksum corruption. Engage sysadmins and network engineers for root cause. Checksum offloading on NICs has been known to create checksum errors. Consider disabling the NIC checksum offloading, if configured, and test. On LINUX, `ethtool -K <IF> rx off tx off` disables the checksum offloading.

4. Mismatched MTU sizes in the communication path
Description: Mismatched MTU sizes cause 'packet too big' failures and silent packet loss, resulting in global cache block loss and excessive packet retransmission requests.
Action: The MTU is the "Maximum Transmission Unit" or frame size configured for the interconnect interfaces. The default standard for most UNIX platforms is 1500 bytes for Ethernet. MTU definitions should be identical for all devices in the interconnect communication path. Identify and monitor all devices in the interconnect communication path. Use large, non-default sized ICMP probe packets with `ping`, `tracepath` or `traceroute` to detect mismatched MTUs in the path. Use `ifconfig` or vendor recommended utilities to determine and set MTU sizes for the server NICs. See Jumbo Frames #12 below.
Note: Mismatched MTU sizes for the interconnect will inhibit nodes joining the cluster in 10g and 11g.

5. Faulty or poorly seated cables/cards
Description: Faulty network cable connections, the wrong cable, poorly constructed cables, excessive length and wrong port assignments can result in inferior bit rates, corrupt frames, dropped packets and poor performance.
Action: CAT 5 grade cables or better should be deployed for interconnect links. All cables should be securely seated and labeled according to LAN/port and aggregation, if applicable. Cable lengths should conform to vendor Ethernet specifics.

6. Interconnect LAN non-dedicated
Description: Shared public IP traffic and/or shared NAS IP traffic configured on the interconnect LAN will result in degraded application performance, network congestion and, in extreme cases, global cache block loss.
Action: The interconnect/clusterware traffic should be on a dedicated LAN defined by a non-routed subnet. Interconnect traffic should be isolated to the adjacent switch(es), i.e. interconnect traffic should not extend beyond the access layer switch(es) to which the links are attached. The interconnect traffic should not be shared with public or NAS traffic. If Virtual LANs (VLANs) are used, the interconnect should be on a single, dedicated VLAN mapped to a dedicated, non-routed subnet, which is isolated from public or NAS traffic.

7. Lack of Server/Switch Adjacency
Description: Network devices are said to be 'adjacent' if they can reach each other with a single hop across a link layer. Multiple hops add latency and introduce unnecessary complexity and risk when other network devices are in the communication path.
Action: All GbE server interconnect links should be (OSI) layer 2 direct attach to the switch or switches (if redundant switches are configured). There should be no intermediary network device, such as a router, in the interconnect communication path. The UNIX command `traceroute` will help identify adjacency issues.

8. IPFILTER configured
Description: IPFILTER (IPF) is a host based firewall or Network Address Translation (NAT) software package that has been identified to create problems for interconnect traffic. IPF may contribute to severe application performance degradation, packet loss and global cache block loss.
Action: Disable IPFILTER.

9. Outdated Network driver or NIC firmware
Description: Outdated NIC drivers or firmware have been known to cause problems in packet processing across the interconnect. Incompatible NIC drivers in inter-node communication may introduce packet processing latencies, skewed latencies and packet loss.
Action: Server NICs should be the same make/model and have identical performance characteristics on all nodes and should be symmetrical in slot id. Firmware and NIC drivers should be at the same (latest) rev. for all server interconnect NICs in the cluster.

10. Proprietary interconnect link transport and network protocol
Description: Non-standard, proprietary protocols, such as LLT, HMP, etc., have proven to be unreliable and difficult to debug. Misconfigured proprietary protocols have caused application performance degradation, dropped packets and node outages.
Action: Oracle has standardized on 1GbE UDP as the transport and protocol. This has proven stable, reliable and performant. Proprietary protocols and substandard transports should be avoided. IP and RDS on InfiniBand are available and supported for interconnect network deployment, and 10GbE has been certified for some platforms (see OTN for details); certification in this area is ongoing.

11. Misconfigured bonding/link aggregation
Description: Failure to correctly configure NIC Link Aggregation or Bonding on the servers, or failure to configure aggregation on the adjacent switch for interconnect communication, can result in degraded performance and block loss due to 'port flapping': interconnect ports on the switch forming an aggregated link frequently change 'UP'/'DOWN' state.
Action: If using link aggregation on the clustered servers, the ports on the switch should also support and be configured for link aggregation for the interconnect links. Failure to correctly configure aggregation for interconnect ports on the switch would result in 'port flapping', with switch ports randomly dropping, resulting in packet loss. Bonding/Aggregation should be correctly configured per driver documentation and tested under load. There are a number of public domain utilities that help to test and measure link bandwidth and latency performance (see iperf).
OS, network and network driver statistics should be evaluated to determine efficiency of bonding.

12. Misconfigured Jumbo Frames
Description: Misconfigured Jumbo Frames may create the mismatched MTU sizes described above.
Action: Jumbo Frames are IEEE non-standard and, as a consequence, care should be taken when configuring them. A Jumbo Frame is a frame size of around 9000 bytes. Frame size may vary depending on network device vendor and may not be consistent between communicating devices. An identical maximum transmission unit (MTU) size should be configured for all devices in the communication path if the default is not 9000 bytes. All the network devices in operation (switches, NICs, line cards) must be configured to support the same frame size (MTU size). Mismatched MTU sizes, where the switch may be configured to be MTU:1500 but the server interconnect interfaces are configured to be MTU:9000, will lead to packet loss, packet fragmentation and reassembly errors, which cause severe performance degradation and cluster node outages. The IP stats in `netstat -s` on most platforms will identify frame fragmentation and reassembly errors. The command `ifconfig -a`, on most platforms, will identify the frame size in use (MTU:1500). See the switch vendor's documentation to identify Jumbo Frames support.

13. NIC force full duplex and duplex mode mismatch
Description: Duplex mode mismatch is when two nodes in a communication channel are operating at half-duplex on one end and full duplex on the other. This may be caused by manually misconfigured duplex modes, or by one end being configured manually to half-duplex while the communication partner is set to auto negotiate. Duplex mode mismatch results in severely degraded interconnect communication.
Action: Duplex mode should be set to auto negotiate for all server NICs in the cluster *and* line cards on the switch(es) servicing the interconnect links. Gigabit Ethernet standards require auto negotiation set to "on" in order to operate.
Duplex mismatches can cause severe network degradation, collisions and dropped packets. Auto negotiate duplex modes should be confirmed after every hardware/software upgrade affecting the network interfaces. Auto negotiate on all interfaces will operate at 1000 full duplex.

14. Flow-control mismatch in the interconnect communication path
Description: Flow control addresses the situation where a server may be transmitting data faster than a network peer (or network device in the path) can accept it. The receiving device may send a PAUSE frame requesting the sender to temporarily stop transmitting.
Action: Flow-control mismatches can result in lost packets and severe interconnect network performance degradation.
   tx flow control should be turned off
   rx flow control should be turned on
   tx/rx flow control should be turned on for the switch(es)
NOTE: Flow control definitions may change after firmware/network driver upgrades. NIC settings should be verified after any upgrade.

15. Packet loss at the OS, NIC or switch layer
Description: Any packet loss as reported by the OS, NIC or switch should be thoroughly investigated and resolved. Packet loss can result in degraded interconnect performance, CPU overhead and network/node outages.
Action: Specific tools will help identify at which layer you are experiencing the packet/frame loss (process/OS/network/NIC/switch). netstat, ifconfig, ethtool, kstat (depending on the OS) and switch port stats would be the first diagnostics to evaluate. You may need to use a network sniffer to trace end-to-end packet communication to help isolate the problem (see public domain tools such as snoop/wireshark/ethereal). Note that understanding packet loss at the lower layers may be essential to determining root cause. Undersized ring buffers or receive queues on a network interface are known to cause silent packet loss, i.e. packet loss that is not reported at any layer. See NIC Driver Issues and Kernel queue lengths below.
Engage your systems administrator and network engineers to determine root cause.

16. NIC Driver/Firmware Configuration
Description: Misconfigured or inadequate default settings for tunable NIC public properties may result in silent packet loss and increased retransmission requests.
Action: Default factory settings should be satisfactory for the majority of network deployments. However, there have been issues with some vendor NICs and the nature of interconnect traffic that have required modifying interrupt coalescence settings and the number of descriptors in the ring buffers associated with the device. Interrupt coalescence is the CPU interrupt rate for send (tx) and receive (rx) packet processing. The ring buffers hold rx packets for processing between CPU interrupts. Misconfiguration at this layer often results in silent packet loss. Diagnostics at this layer require sysadmin and OS/vendor intervention.

17. NIC send (tx) and receive (rx) queue lengths
Description: Inadequately sized NIC tx/rx queue lengths may silently drop packets when queues are full. This results in gc block loss, increased packet retransmission and degraded interconnect performance.
Action: As packets move between the kernel network subsystem and the network interface device driver, send (tx) and receive (rx) queues are implemented to manage packet transport and processing. The size of these queues is configurable. If these queues are under-configured or misconfigured for the amount of network traffic generated or the MTU size configured, full queues will cause overflow and packet loss. Depending on the driver and the quality of statistics gathered for the device, this packet loss may not be easy to detect. Diagnostics at this layer require sysadmin and OS/vendor intervention. (cf. the interface txqueuelen and netdev_max_backlog on Linux)

18. Limited capacity and over-saturated bandwidth
Description: Oversubscribed network usage will result in interconnect performance degradation and packet loss.
Action: An interconnect deployment best practice is to know your interconnect usage and bandwidth. This should be monitored regularly to identify usage trends, transient or constant. Increasing demands on the interconnect may be attributed to scaling the application or to aberrant usage such as bad SQL or unexpected traffic skew. Assess the cause of bandwidth saturation and address it.

19. Over subscribed CPU and scheduling latencies
Description: Sustained high load averages and network stack scheduling latencies can negatively affect interconnect packet processing and result in interconnect performance degradation, packet loss, gc block loss and potential node outages.
Action: Scheduling delays when the system is under high CPU utilization can cause delays in network packet processing. Excessive, sustained latencies will cause severe performance degradation and may cause cluster node failure. It is critical that sustained elevated CPU utilization be investigated. The `uptime` command will display load average information on most platforms. Excessive CPU interrupts associated with network stack processing may be mitigated through NIC interrupt coalescence and/or binding network interrupts to a single CPU. Please work with NIC vendors for these types of optimizations. Scheduling latencies can result in reassembly errors. See #2 above.

20. Switch related packet processing problems
Description: Buffer overflows on the switch port, switch congestion and switch misconfiguration such as MTU size, aggregation and Virtual LAN (VLAN) definitions can lead to inefficiencies in packet processing, resulting in performance degradation or cluster node outage.
Action: The Oracle interconnect requires a switched Ethernet network. The switch is a critical component in end-to-end packet communication of interconnect traffic. As a network device, the switch may be subject to many factors or conditions that can negatively impact interconnect performance and availability.
It is critical that the switch be monitored for abnormal packet processing events, temporary or sustained traffic congestion and efficient throughput. Switch statistics should be evaluated at regular intervals to assess trends in interconnect traffic and to identify anomalies.

21. QoS which negatively impacts the interconnect packet processing
Description: Quality of Service definitions on a switch which shares interconnect traffic may negatively impact interconnect network processing, resulting in severe performance degradation.
Action: If the interconnect is deployed on a shared switch segmented by VLANs, any QoS definitions on the shared switch should be configured such that prioritization of service does not negatively impact interconnect packet processing. Any QoS definitions should be evaluated prior to deployment and their impact assessed.

22. Spanning tree brownouts during reconvergence
Description: Ethernet networks use a Spanning Tree Protocol (STP) to ensure a loop-free topology where there are redundant routes to hosts. An outage of any network device participating in an STP topology is subject to a reconvergence of the topology, which recalculates routes to hosts. If STP is enabled in the LAN and misconfigured or unoptimized, a network reconvergence event can take up to 1 minute or more to recalculate (depending on the size of the network and the participating devices). Such latencies can result in interconnect failure and cluster wide outage.
Action: Many switch vendors provide optimized extensions to STP enabling faster network reconvergence times. Optimizations such as Rapid Spanning Tree (RSTP), Per-VLAN Spanning Tree (PVST), and Multi-Spanning Tree (MSTP) should be deployed to avoid a cluster wide outage.

23. sq_max_size inadequate for STREAMS queuing
Description: AWR reports high waits for "gc cr block lost" and/or "gc current block lost". netstat output does not reveal any packet processing errors. `kstat -p -s '*nocanput*'` returns non-zero values.
nocanput indicates that queues for streaming messages are full and packets are dropped. Customer is running STREAMS in a RAC in a Solaris env.
Action: Increasing the UDP max buffer space and defining unlimited STREAMS queuing should relieve the problem and eliminate 'nocanput' lost messages. The following are the Solaris commands to make these changes:
   `ndd -set /dev/udp udp_max_buf <NUMERIC VALUE>`
   set sq_max_size to 0 (unlimited) in /etc/system. Default = 2
udp_max_buf controls how large send and receive buffers (in bytes) can be for a UDP socket. The default setting, 262,144 bytes, may be inadequate for STREAMS applications. sq_max_size is the depth of the message queue.

TROUBLESHOOTING INFINIBAND CONNECTION ISSUES USING OFED TOOLS:

The Open Fabrics Enterprise Distribution (OFED) package has many debugging tools available as part of the standard release. This article describes the use of those tools to troubleshoot the hardware and firmware of an InfiniBand fabric deployment.

First, the /sys/class sub-system should be checked to verify that the hardware is up and connected to the InfiniBand fabric. The following command will show the InfiniBand hardware modules recognized by the system:
   ls /sys/class/infiniband
This example will use the module mlx4_0, which is typical for the Mellanox ConnectX* series of adapters. If this, or a similar module, is not found, refer to the documentation that came with the OFED package on starting the OpenIB drivers.

Next, check the state of the InfiniBand port:
   cat /sys/class/infiniband/mlx4_0/ports/1/state
This command should return "ACTIVE" if the hardware is initialized and the subnet manager has found the port and added it to the InfiniBand fabric. If this command returns "INIT", the hardware is initialized, but the subnet manager has not added the port to the fabric yet.
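The port state check can be wrapped in a small helper that walks every port found under /sys/class/infiniband and prints a hint for each. This is only a sketch: the function name ib_state_hint and the wording of the hints are made up for illustration, and states other than ACTIVE and INIT are simply flagged for manual checking.

```shell
#!/bin/sh
# Sketch: summarize the state of every InfiniBand port visible in sysfs.

# ib_state_hint STATE: translate the contents of a port "state" file
# into a short hint. Matching is by substring, so "ACTIVE" and
# "4: ACTIVE" are treated the same way.
ib_state_hint() {
    case "$1" in
        *ACTIVE*) echo "ok: port has been added to the fabric" ;;
        *INIT*)   echo "waiting: hardware is initialized, subnet manager has not added the port yet" ;;
        *)        echo "check: unexpected state '$1'" ;;
    esac
}

# On a host without InfiniBand hardware the glob matches nothing and
# the loop body is skipped.
for f in /sys/class/infiniband/*/ports/*/state; do
    [ -r "$f" ] || continue
    echo "$f -> $(ib_state_hint "$(cat "$f")")"
done
```

A port stuck in the "waiting" case is the situation where starting the subnet manager is the next thing to try.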
If necessary, start the subnet manager:
   /etc/init.d/opensmd start

Once the port on the head node is in the "ACTIVE" state, check the state of the InfiniBand port on all the compute nodes to ensure that all of the InfiniBand hardware on the compute nodes has been initialized, and that the subnet manager has added all of the compute node ports to the fabric. This article will use the pdsh tool to run the command on all nodes:
   pdsh -a cat /sys/class/infiniband/mlx4_0/ports/1/state
All nodes should report "ACTIVE". If a node reports that it cannot find the file, ensure the OpenIB drivers are loaded on that node. Refer to the documentation that came with the OFED package on starting the OpenIB drivers.

Once all of the compute nodes report that port 1 is "ACTIVE", verify the speed on each port using the following commands:
   cat /sys/class/infiniband/mlx4_0/ports/1/rate
   pdsh -a cat /sys/class/infiniband/mlx4_0/ports/1/rate
This is a good first check for a bad cable or connection. Each port should report the same speed. For example, the output for double data rate (DDR) InfiniBand cards will be similar to "20 Gb/sec (4X DDR)".

Once the above basic checks are complete, more in-depth troubleshooting can be performed. The main OFED tool for troubleshooting performance and connection problems is ibdiagnet. This tool runs multiple tests, as specified on the command line during the run, to detect errors related to the subnet, bad packets, and bad states. These errors are some of the more common ones seen during initial setup of InfiniBand fabrics. Run ibdiagnet with the following command line options:
   ibdiagnet -pc -c 1000
The output will be similar to this:
   Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
   -W- Topology file is not specified.
       Reports regarding cluster links will use direct routes.
   Loading IBDM from: /usr/lib64/ibdm1.2
   -W- A few ports of local device are up.
       Since port-num was not specified (-p option), port 1 of device 1 will be
       used as the local port.
   -I- Discovering ... 17 nodes (1 Switches & 16 CA-s) discovered.

   -I---------------------------------------------------
   -I- Bad Guids/LIDs Info
   -I---------------------------------------------------
   -I- No bad Guids were found

   -I---------------------------------------------------
   -I- Links With Logical State = INIT
   -I---------------------------------------------------
   -I- No bad Links (with logical state = INIT) were found

   -I---------------------------------------------------
   -I- PM Counters Info
   -I---------------------------------------------------
   -I- No illegal PM counters values were found

   -I---------------------------------------------------
   -I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
   -I---------------------------------------------------

   -I---------------------------------------------------
   -I- IPoIB Subnets Check
   -I---------------------------------------------------
   -I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
   -W- No members found for group

   -I---------------------------------------------------
   -I- Bad Links Info
   -I- Errors have occurred on the following links
       (for errors details, look in log file /tmp/ibdiagnet.log):
   -I---------------------------------------------------
   Link at the end of direct route "1,5"

   ----------------------------------------------------------------
   -I- Stages Status Report:
       STAGE                            Errors Warnings
       Bad GUIDs/LIDs Check             0      0
       Link State Active Check          0      0
       Performance Counters Report      0      0
       Partitions Check                 0      0
       IPoIB Subnets Check              0      1
       Link Errors Check                0      0

   Please see /tmp/ibdiagnet.log for complete log
   ----------------------------------------------------------------
   -I- Done. Run time was 9 seconds.

The warning "No members found for group" can safely be ignored. In this example, a bad link was found: Link at the end of direct route "1,5". "1,5" refers to the LID numbers associated with the individual ports.
The following commands can be used to identify the LID numbers associated with each port:
   cat /sys/class/infiniband/mlx4_0/ports/1/lid
   pdsh -a cat /sys/class/infiniband/mlx4_0/ports/1/lid
This generates a list of LIDs associated with nodes. In the output of the above commands, locate the entries for 0x1 and 0x5. 0x1 is likely the head node. For errors of this type, reseat or replace the InfiniBand cable connecting the node corresponding to LID 0x5.

Finally, run ibdiagnet one more time to verify there are no errors, and then run ibcheckerrors to check the error state of each port. Each test should pass.
   ibdiagnet -pc -c 1000
   ibcheckerrors

KICKSTART FOR DISKS LARGER THAN 2TB:

Some machines we build have large root disk arrays, bigger than 2TB. They exceed the capacity of the msdos partition table and require a GPT (GUID Partition Table) in order to use the whole disk. Only RHEL 6.1 supports it now, and it required significant customization of the kickstart configuration file. Also, it only works on machines equipped with the UEFI Boot Manager. IBM x3650 M3 and x3755 M3 do have it, but I only tested the new setup on the x3650 M3.

In order to install RHEL 6.1 on a large disk machine, do the following:
o In Opsware, pick the OS Sequence called RBCCM-NY-RHEL-6.1_GPT. It is equivalent to RBCCM-NY-RHEL-6.1 except for GPT support.
o This will cause the regular installation to be executed. Once it is over, the machine will reboot. On that reboot, go to the System Configuration and Boot Management screen (a.k.a. BIOS config) and follow this set of menu options: Boot Manager | Add Boot Option | Load File (first) | EFI | redhat | grub.efi
o Give the new boot option a short but meaningful name through the Input Description option.
o Once this is done, press Commit Changes.
o Exit the System Configuration screen; the machine should boot into the newly installed OS.

On a regular, non-Large Disk system, this OS Sequence should work like the regular one; no additional activity is required.
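Whether a given machine falls into the "larger than 2TB" case can be checked from the running system before picking the OS Sequence. The sketch below assumes 512-byte sectors, for which the 32-bit sector counts of an msdos partition table give a limit of 2^32 * 512 bytes = 2 TiB. The function name needs_gpt and the restriction to sd* devices are choices made for this illustration only.

```shell
#!/bin/sh
# Sketch: flag disks that are too large for an msdos (MBR) partition table.

MBR_MAX_SECTORS=4294967296   # 2^32, the largest sector count MBR can express

# needs_gpt BYTES [SECTOR_SIZE]: succeed (exit 0) when a disk of BYTES bytes
# exceeds what MBR can address with the given sector size (default 512).
needs_gpt() {
    [ "$1" -gt $(( MBR_MAX_SECTORS * ${2:-512} )) ]
}

# /sys/block/<dev>/size reports the device size in 512-byte units.
# On a host without sd* disks the loop body is skipped.
for f in /sys/block/sd*/size; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%/size}
    bytes=$(( $(cat "$f") * 512 ))
    if needs_gpt "$bytes"; then
        echo "$dev: $bytes bytes, needs GPT"
    else
        echo "$dev: $bytes bytes, msdos label is sufficient"
    fi
done
```

For example, `needs_gpt 3000000000000` succeeds for a 3 TB array, while `needs_gpt 1000000000000` fails for a 1 TB disk.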
# Kickstart file automatically generated by anaconda.

#version=DEVEL
install
nfs --server=159.55.104.162 --dir=/vol/XRZ0/opsware/linux/redhat/x86_64/v6.1
lang en_US.UTF-8
keyboard us
network --onboot no --device eth0 --noipv4 --noipv6
network --onboot no --device eth1 --noipv4 --noipv6
network --onboot yes --device eth2 --mtu=1500 --bootproto dhcp
network --onboot no --device eth3 --noipv4 --noipv6
network --onboot no --device usb0 --noipv4 --noipv6
rootpw --iscrypted X4qmByxwKBQbY
# Reboot after installation
reboot
firewall --disabled
authconfig --useshadow --enablemd5
selinux --disabled
timezone --utc America/New_York
bootloader --location=mbr --driveorder=sda --append="crashkernel=auto rhgb quiet"
# The following is the partition information you requested
# Note that any partitions you deleted are not expressed
# here so unless you clear all partitions first, this is
# not guaranteed to work
#clearpart --none
#part /boot/efi --fstype=efi --grow --maxsize=200 --size=20
#part /boot --fstype=ext4 --size=500
#part pv.008003 --grow --asprimary --size=25600
#volgroup vg00 --pesize=32768 pv.008003
#logvol / --fstype=ext4 --name=root --vgname=vg00 --size=10240
#logvol swap --name=swap --vgname=vg00 --size=4096
#logvol /tmp --fstype=ext4 --name=tmp --vgname=vg00 --size=2048
#logvol /var --fstype=ext4 --name=var --vgname=vg00 --size=6144
repo --name="Red Hat Enterprise Linux" --baseurl=nfs:159.55.104.162:/vol/XRZ0/opsware/linux/redhat/x86_64/v6.1 --cost=100

%packages
@Base
@Core
@base
@core
OpenIPMI
OpenIPMI-libs
audit
audit-libs
autofs
busybox
compat-openldap
crash
dos2unix
e2fsprogs
grub
kexec-tools
krb5-libs
krb5-libs.i686
ksh
libcollection
libcollection.i686
libdhash
libdhash.i686
libini_config
libini_config.i686
lm_sensors
lvm2
mailx
ncompress
net-snmp
net-snmp-libs
net-snmp-utils
nfs-utils
nss-pam-ldapd
ntp
openldap
openldap-clients
openssl098e
pam.i686
parted
perl
perl-Compress-Zlib
perl-Convert-ASN1
perl-HTML-Parser
perl-HTML-Tagset
perl-IO-Socket-SSL
perl-LDAP
perl-Net-SSLeay
perl-URI
perl-XML-NamespaceSupport
perl-XML-SAX
perl-libwww-perl
psacct
sssd
sssd-client
strace
sudo
sysstat
tcpdump
tcsh
telnet
zsh
-NetworkManager
-NetworkManager-glib
-alsa-lib
-alsa-utils
-audiofile
-authconfig-gtk
-bluez-libs
-cdparanoia-libs
-cups
-cups-libs
-dbus-python
-desktop-file-utils
-elinks
-esound
-fetchmail
-foomatic
-ftp
-gimp-print
-gnome-keyring
-gnome-python2
-gnome-python2-bonobo
-gnome-python2-canvas
-gnome-python2-gtkhtml2
-gnome-vfs2
-gnuplot
-gphoto2
-gpm
-gstreamer
-gtkhtml2
-gtksourceview
-indexhtml
-isdn4k-utils
-jpackage-utils
-jwhois
-kdemultimedia
-kernel-devel
-lftp
-lha
-libbonobo
-libbonoboui
-libglade2
-libgnome
-libgnomecanvas
-libgnomeui
-libibverbs
-libjpeg
-libmthca
-libpng
-libtiff
-libvorbis
-libwnck
-libwvstreams
-libxml2
-libxslt
-logwatch
-lokkit
-lrzsz
-minicom
-mkbootdisk
-mt-st
-mtr
-mutt
-mysql
-nc
-nmap
-numactl
-openib
-openmotif
-opensm-libs
-pango
-pinfo
-ppp
-procmail
-pyOpenSSL
-pygtk2
-pyorbit
-pyxf86config
-rdate
-rdist
-redhat-logos
-redhat-menus
-rhn-check
-rhn-client-tools
-rhn-setup
-rhnsd
-rp-pppoe
-rsh
-samba-client
-samba-common
-setuptool
-sox
-star
-stunnel
-system-config-network-tui
-talk
-tmpwatch
-tog-pegasus
-usermode-gtk
-vnc-server
-wireless-tools
-wvdial
-yp-tools
-ypbind
%end

%pre
set -x
[ -d /opsw ] && exit
mkdir /opsw
mount /dev/sda1 /opsw
cp /opsw/phase2/* /tmp
umount /opsw
/tmp/miniagent --server 159.55.104.44 8017
%end

%pre --logfile /tmp/anaconda.pre_log
#!/bin/bash
echo "Running pre-install script..."
# find out the size of the boot disk
parted="/usr/sbin/parted"
grep="/bin/grep"
awk="/bin/awk"
sed="/bin/sed"
dd="/usr/bin/dd"
sleep="/bin/sleep"

# echo 'y' because parted asks for some confirmation when hits a raw disk without mbr
sda_size=`echo 'y' | $parted -l /dev/sda | $grep ^Disk| $grep sda | $awk -F':' '{print $2}' | $sed -e 's/^ *//' | $sed -e 's/[^0-9]*$//'`
echo "/dev/sda size == $sda_size GB"
echo ""

# clean up the MBR and partition table of the root disk
/usr/bin/dd bs=512 count=10 if=/dev/zero of=/dev/sda
echo ""

if [ $sda_size -gt 2048 ]
then
    echo "Root disk is bigger than 2TB, installing gpt partition table!"
    /usr/sbin/parted --script /dev/sda mklabel gpt
else
    echo "Root disk is smaller than 2TB, installing msdos partition table."
    /usr/sbin/parted --script /dev/sda mklabel msdos
fi
echo ""
/usr/sbin/parted -l /dev/sda
/bin/sleep 30
%end

%post --nochroot
echo "Running post-install script"

# prepare UEFI boot environment
# (don't know how to make RHEL installer to do it automatically)
/usr/bin/cp /mnt/sysimage/boot/grub/grub.conf /mnt/sysimage/boot/efi/EFI/redhat/grub.conf

# necessary for grubby to update the right file during the kernel upgrades
/usr/bin/rm /mnt/sysimage/etc/grub.conf
/usr/bin/ln -s /boot/efi/EFI/redhat/grub.conf /mnt/sysimage/etc/grub.conf

# make logs available after kickstart for debugging purpose
/usr/bin/cp /tmp/*log /mnt/sysimage/root
%end

%post
touch /var/tmp/agentInstallTrigger;
while [ -f /var/tmp/agentInstallTrigger ]; if [ -f /var/tmp/doneAgentInstall ]; then break; fi; do sleep 5; done
%end

FUSION IO:

RAID1-SETUP
http://kb.fusionio.com/KB/a18/creating-a-raid1mirrored-set-using-the-iodrive-on-linux.aspx?KBSearchID=16919

Creating a RAID1 (Mirrored) Set using the ioDrive on Linux

To create a RAID1/Mirror set on Linux, enter the command:

    $ mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/fioa /dev/fiob

This creates a mirrored set using the two ioDrives fioa and fiob.
(Use the fio-status utility to view your specific device names.)

Start up the RAID1 mirror

To start the RAID1 mirrored set as /dev/md0, enter the command:

    mdadm --assemble /dev/md0 /dev/fioa /dev/fiob

Stop the RAID1 mirror

To stop the RAID1 set, enter the command:

    mdadm --stop /dev/md0

Make the mirror persistent (exists after a reboot)

Use the following commands so that your RAID1 mirror appears after a reboot.

    $ sudo sh -c 'echo "DEVICE /dev/fioa /dev/fiob" >/etc/mdadm.conf'
    $ sudo sh -c 'mdadm --detail --scan >>/etc/mdadm.conf'

On some versions of Unix, the configuration file is in /etc/mdadm/mdadm.conf, not /etc/mdadm.conf.

On most systems, the RAID1 mirror will be created automatically upon reboot. However, if you have problems accessing /dev/md0 after a reboot, run this command:

    $ sudo mdadm --assemble --scan

DRIVER-SETUP
http://kb.fusionio.com/KB/a64/loading-the-driver-via-udev-or-init-script-for-md-and-lvm.aspx?KBSearchID=16922

Loading the Driver via udev or Init Script for md and LVM

There are two main methods for loading the Fusion-io driver: udev and the iodrive script. The default method of loading is udev.

Using udev to Load the Driver

Most modern Linux distributions use udev to facilitate driver loading. Usually it just works, behind the scenes, loading drivers. udev automatically loads drivers for installed hardware on the system. It does this by walking through the devices on the PCI bus, and then loading any driver that has been properly configured to work with that device. This way, drivers can be loaded without having to use scripts or modify configuration files. The Fusion-io drivers are properly configured, and udev will find the drivers and attempt to load them.

However, there are some cases where loading by udev is not appropriate, or can have issues. udev will wait 180 seconds for the driver to load, then it will exit. In most cases, this is plenty of time, even with multiple ioDrives installed.
But if the drives were shut down improperly, loading the driver and attaching the drives can take longer than the 180 seconds. In this case, udev will exit. The driver will not exit, but will continue working on attaching the drives.

There is not always a problem when udev exits early. The drivers will eventually load, and then you will be able to use the attached block devices. But if the drivers take too long to load, udev exits, and file systems on those devices are set to be mounted in the fstab, then the system file system check (fsck) will fail and the system will stop booting. In most distributions the user will drop into single-user mode or repair mode. Again, this is normal behavior; after the driver finishes rescanning and attaching the drive, a reboot will fix things.

For most users, this will not happen often enough to be an issue. But for installations with many devices, or for server installations where dropping into single-user mode is unacceptable, there is an alternative method for driver loading that does not have these issues.

Using Init Scripts to Load the Driver

The ioDrive packages provide init scripts that are installed on all supported Linux distributions. These scripts typically reside in /etc/init.d/iodrive or /etc/rc.d/iodrive. The scripts are used to load and start the driver, and to mount filesystems, after the system is up. This method completely avoids the udev behavior described above. (It will wait as long as it takes for drives to be attached.)

NOTE: These steps assume that the logical volumes /dev/md0 (md) and /dev/vg0/fio_lv (LVM) have been set up beforehand. There are other knowledgebase articles that provide details for logical volume creation, including updating lvm.conf to work with ioDrives.

Follow the instructions based on your driver version.
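The idea behind the init script method (wait until the device is attached, however long that takes, and only then mount) can be illustrated with a simple polling loop. This is a minimal sketch of the concept only, not the code shipped in the Fusion-io init script: the function name wait_for_path, its arguments, and the default polling interval are all hypothetical.

```shell
#!/bin/bash
# Hypothetical illustration of "wait until the device is attached, then mount".
# This is not the code shipped in the Fusion-io init script.

# Poll until the given path exists.
#   $1 = path to wait for (for example /dev/md0)
#   $2 = seconds between checks (default 5)
#   $3 = maximum number of checks, 0 = wait indefinitely (default 0)
# Returns 0 once the path exists, 1 if the check limit is reached first.
wait_for_path() {
    local path=$1 interval=${2:-5} max_checks=${3:-0} checks=0
    while [ ! -e "$path" ]; do
        checks=$((checks + 1))
        if [ "$max_checks" -gt 0 ] && [ "$checks" -ge "$max_checks" ]; then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}
```

A caller could then run something like wait_for_path /dev/md0 && mount /iodrive_mountpoint, which relies on the mount point being listed in /etc/fstab with the noauto option.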
Using Init Scripts to Load the 2.x.x Driver

Step 1: Modifying /etc/modprobe.d/iomemory-vsl.conf

Edit /etc/modprobe.d/iomemory-vsl.conf, and uncomment the blacklist line:

Before:
    # To keep ioDrive from auto loading at boot, uncomment below:
    # blacklist iomemory-vsl

After:
    # To keep ioDrive from auto loading at boot, uncomment below:
    blacklist iomemory-vsl

This keeps udev from automatically loading the driver.

Step 2: Modifying /etc/fstab

Add noauto to the options in the appropriate line in the /etc/fstab file. This will keep the OS from trying to check the drive for errors on boot.

Before:
    LABEL=/            /                     ext3    defaults         1 1
    LABEL=/boot        /boot                 ext3    defaults         1 2
    tmpfs              /dev/shm              tmpfs   defaults         0 0
    devpts             /dev/pts              devpts  gid=5,mode=620   0 0
    sysfs              /sys                  sysfs   defaults         0 0
    proc               /proc                 proc    defaults         0 0
    LABEL=SWAP-sda5    swap                  swap    defaults         0 0
    /dev/md0           /iodrive_mountpoint   ext3    defaults         1 2
    /dev/vg0/fio_lv    /iodrive_mountpoint2  ext3    defaults         1 2

After:
    LABEL=/            /                     ext3    defaults         1 1
    LABEL=/boot        /boot                 ext3    defaults         1 2
    tmpfs              /dev/shm              tmpfs   defaults         0 0
    devpts             /dev/pts              devpts  gid=5,mode=620   0 0
    sysfs              /sys                  sysfs   defaults         0 0
    proc               /proc                 proc    defaults         0 0
    LABEL=SWAP-sda5    swap                  swap    defaults         0 0
    /dev/md0           /iodrive_mountpoint   ext3    defaults,noauto  0 0
    /dev/vg0/fio_lv    /iodrive_mountpoint2  ext3    defaults,noauto  0 0

Step 3: Modifying /etc/sysconfig/iomemory-vsl

Edit /etc/sysconfig/iomemory-vsl, and uncomment ENABLED=1 to enable the init script.

Before:
    # If ENABLED is not set (non-zero) then iomemory-vsl init script will not be used.
    #ENABLED=1

After:
    # If ENABLED is not set (non-zero) then iomemory-vsl init script will not be used.
    ENABLED=1

While editing /etc/sysconfig/iomemory-vsl, add the mountpoint to the MOUNTS variable so it will be automatically attached and mounted:

Before:
    # An IFS separated list of md arrays to start once the driver is
    # loaded. Arrays should be configured in the mdadm.conf file.
    # Example: MD_ARRAYS="/dev/md0 /dev/md1"
    MD_ARRAYS=""

    # An IFS separated list of LVM volume groups to start once the driver is
    # loaded. Volumes should be configured in lvm.conf.
    # Example: LVM_VGS="/dev/vg0 /dev/vg1"
    LVM_VGS=""

    # An IFS separated list of mount points to mount once the driver is
    # loaded. These mount points should be listed in /etc/fstab with
    # "noauto" as one of the mount options.
    # Example /etc/fstab:
    # /dev/fioa /mnt/fioa ext3 defaults,noauto 0 0
    # /dev/fiob /mnt/firehose ext3 defaults,noauto 0 0
    # Example: MOUNTS="/mnt/fioa /mnt/firehose"
    MOUNTS=""

After:
    # An IFS separated list of md arrays to start once the driver is
    # loaded. Arrays should be configured in the mdadm.conf file.
    # Example: MD_ARRAYS="/dev/md0 /dev/md1"
    MD_ARRAYS="/dev/md0"

    # An IFS separated list of LVM volume groups to start once the driver is
    # loaded. Volumes should be configured in lvm.conf.
    # Example: LVM_VGS="/dev/vg0 /dev/vg1"
    LVM_VGS="/dev/vg0"

    # An IFS separated list of mount points to mount once the driver is
    # loaded. These mount points should be listed in /etc/fstab with
    # "noauto" as one of the mount options.
    # Example /etc/fstab:
    # /dev/fioa /mnt/fioa ext3 defaults,noauto 0 0
    # /dev/fiob /mnt/firehose ext3 defaults,noauto 0 0
    # Example: MOUNTS="/mnt/fioa /mnt/firehose"
    MOUNTS="/iodrive_mountpoint /iodrive_mountpoint2"

Step 4: Verifying the Status of the Init Script

Make sure the iodrive script loads at run levels 1 through 5 (runlevel 0 is shutdown and runlevel 6 is reboot). Run the following commands:

    $ chkconfig iomemory-vsl on
    $ chkconfig --list iomemory-vsl
    iomemory-vsl 0:off 1:on 2:on 3:on 4:on 5:on 6:off

Using Init Scripts to Load the 1.2.x Driver

Step 1: Modifying /etc/modprobe.d/iodrive

Edit /etc/modprobe.d/iodrive, and uncomment the blacklist line:

Before:
    # To keep ioDrive from auto loading at boot, uncomment below
    # blacklist fio-driver

After:
    # To keep ioDrive from auto loading at boot, uncomment below
    blacklist fio-driver

This keeps udev from automatically loading the driver.

Step 2: Modifying /etc/fstab

Add noauto to the options in the appropriate line in the /etc/fstab file. This will keep the OS from trying to check the drive for errors on boot.

Before:
    LABEL=/            /                     ext3    defaults         1 1
    LABEL=/boot        /boot                 ext3    defaults         1 2
    tmpfs              /dev/shm              tmpfs   defaults         0 0
    devpts             /dev/pts              devpts  gid=5,mode=620   0 0
    sysfs              /sys                  sysfs   defaults         0 0
    proc               /proc                 proc    defaults         0 0
    LABEL=SWAP-sda5    swap                  swap    defaults         0 0
    /dev/md0           /iodrive_mountpoint   ext3    defaults         1 2
    /dev/vg0/fio_lv    /iodrive_mountpoint2  ext3    defaults         1 2

After:
    LABEL=/            /                     ext3    defaults         1 1
    LABEL=/boot        /boot                 ext3    defaults         1 2
    tmpfs              /dev/shm              tmpfs   defaults         0 0
    devpts             /dev/pts              devpts  gid=5,mode=620   0 0
    sysfs              /sys                  sysfs   defaults         0 0
    proc               /proc                 proc    defaults         0 0
    LABEL=SWAP-sda5    swap                  swap    defaults         0 0
    /dev/md0           /iodrive_mountpoint   ext3    defaults,noauto  0 0
    /dev/vg0/fio_lv    /iodrive_mountpoint2  ext3    defaults,noauto  0 0

Step 3: Modifying /etc/sysconfig/iodrive

Edit /etc/sysconfig/iodrive, and add the mountpoint to the MOUNTS variable, so it will be automatically attached and mounted:

Before:
    # An IFS separated list of md arrays to start once the driver is
    # loaded. Arrays should be configured in the mdadm.conf file.
    # Example: MD_ARRAYS="/dev/md0 /dev/md1"
    MD_ARRAYS=""

    # An IFS separated list of LVM volume groups to start once the driver is
    # loaded. Volumes should be configured in lvm.conf.
    # Example: LVM_VGS="/dev/vg0 /dev/vg1"
    LVM_VGS=""

    # An IFS separated list of mount points to mount once the driver is
    # loaded. These mount points should be listed in /etc/fstab with
    # "noauto" as one of the mount options.
    # Example /etc/fstab:
    #/dev/fioa /mnt/fioa ext3 defaults,noauto 0 0
    #/dev/fiob /mnt/firehose ext3 defaults,noauto 0 0
    # Example: MOUNTS="/mnt/fioa /mnt/firehose"
    MOUNTS=""

After:
    # An IFS separated list of md arrays to start once the driver is
    # loaded.
    # Arrays should be configured in the mdadm.conf file.
    # Example: MD_ARRAYS="/dev/md0 /dev/md1"
    MD_ARRAYS="/dev/md0"

    # An IFS separated list of LVM volume groups to start once the driver is
    # loaded. Volumes should be configured in lvm.conf.
    # Example: LVM_VGS="/dev/vg0 /dev/vg1"
    LVM_VGS="/dev/vg0"

    # An IFS separated list of mount points to mount once the driver is
    # loaded. These mount points should be listed in /etc/fstab with
    # "noauto" as one of the mount options.
    # Example /etc/fstab:
    #/dev/fioa /mnt/fioa ext3 defaults,noauto 0 0
    #/dev/fiob /mnt/firehose ext3 defaults,noauto 0 0
    # Example: MOUNTS="/mnt/fioa /mnt/firehose"
    MOUNTS="/iodrive_mountpoint /iodrive_mountpoint2"

Step 4: Verifying the Status of the Init Script

Make sure the iodrive script loads at run levels 1 through 5 (runlevel 0 is shutdown and runlevel 6 is reboot). Run the following commands:

    $ chkconfig iodrive on
    $ chkconfig --list iodrive
    iodrive 0:off 1:on 2:on 3:on 4:on 5:on 6:off

LVM-ENABLE
http://kb.fusionio.com/KB/a36/enabling-the-iodrive-for-lvm-use.aspx

Enabling the ioDrive for LVM Use

The Logical Volume Manager (LVM) volume group management application handles mass storage devices like the ioDrive if you add the ioDrive as a supported type:

1. Locate and edit the /etc/lvm/lvm.conf configuration file.
2. Add an entry similar to the following to that file:

    types = [ "fio", 16 ]

The parameter "16" represents the maximum number of partitions supported by the drive. For the ioDrive, this can be any number from 1 upwards, with 16 as the recommended setting. Do NOT set this parameter to 0.
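One way to confirm the edit is to search the configuration file for the new entry. The helper below is a sketch under stated assumptions: the function name has_fio_type is hypothetical, and it only checks that an uncommented types line mentions "fio"; it does not verify which section of lvm.conf the line sits in or what partition count follows it.

```shell
#!/bin/bash
# Hypothetical helper: succeed if an lvm.conf-style file contains an
# uncommented 'types = [ ... "fio" ... ]' line.
#   $1 = file to inspect (default /etc/lvm/lvm.conf)
has_fio_type() {
    local conf=${1:-/etc/lvm/lvm.conf}
    # Lines starting with '#' (after optional whitespace) do not match,
    # because the pattern requires "types" to be the first token.
    grep -Eq '^[[:space:]]*types[[:space:]]*=[[:space:]]*\[.*"fio"' "$conf"
}
```

Calling has_fio_type with no arguments checks /etc/lvm/lvm.conf; a non-zero exit status suggests the entry is missing or still commented out.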