Low Latency Tunings

LOW LATENCY SETTINGS:
1. TCP/UDP buffers ( we may already have standard settings for these )
net.ipv4.tcp_rmem
net.ipv4.tcp_wmem
net.ipv4.tcp_mem
net.core.wmem_max
net.core.rmem_max
net.core.wmem_default
net.core.rmem_default
eg.
net.ipv4.tcp_rmem = "4096 16777216 33554432"
net.ipv4.tcp_wmem = "4096 16777216 33554432"
net.ipv4.tcp_mem = "4096 16777216 33554432"
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_low_latency=1
(optimize tcp stack for low latency over high throughput. Faster connection times, works better for smaller pkts)
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_ecn=0
(disable router explicit congestion notification; helps when we see large file transfer stalls in a small environment with 1-2 routers that do not handle ECN properly)
net.ipv4.tcp_app_win=0
(do not reserve part of the window as an application buffer; the larger the value, the smaller the reserved buffer, and 0 reserves nothing)
net.ipv4.tcp_no_metrics_save=1
(TCP normally saves various connection metrics in the route cache when connections are closed, and these caches are consulted for future similar connections; disabling the caching avoids reusing stale metrics and will increase performance on the market access side or when we do have similar connections)
rhash_entries=1024
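A minimal sketch of making the example sysctls above persistent (values are the ones quoted above, not recommendations; rhash_entries is a kernel boot parameter rather than a sysctl):
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_rmem = 4096 16777216 33554432
net.ipv4.tcp_wmem = 4096 16777216 33554432
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_no_metrics_save = 1
EOF
sysctl -p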
2. Initial congestion window for the routes (default is 3: can send 3 pkts initially without acks. May be required on the market access side, where order sending might otherwise be delayed or bundled)
Change it with:
"ip route change default via n.n.n.n dev eth0 initcwnd #"
(here I did it for the default route but it can be done for any route)
To check the value:
"ip route show"
3. Disable C-states, C1E State ( if exists ), HT, power mgmt from BIOS.
4. Disable SMIs ( when supported/agreed by HW vendors ) thru BIOS. Not all vendors support this by default … Will minimize jitter greatly.
5. Disable/remove irqbalance, cpuspeed and other unneeded services
6. SFC ( Solarflare )
lro=0
( large receive offload, will not bundle recv pkts into larger chunks)
lso=0
( large send offload, will not bundle tx pkts into larger chunks )
rx_alloc_method=1
( how memory is allocated for received pkts by the sfc driver. A value of 1 is better for smaller pkts; this is what we have )
rss_cores=#cores
( create per-core queues; helpful if dropping pkts consistently and also observing delays in receiving pkts in the application )
sfc_affinity
( to affinitize or steer pkt streams to cores/connections; cannot be used with onload simultaneously )
There are kernel parameters at the OS level to steer pkts as well, which I have not tested.
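A sketch of setting these persistently as sfc module options, assuming they are exposed under the names listed above (verify the exact names for your driver build with `modinfo sfc`; the core count is illustrative):
echo "options sfc lro=0 lso=0 rx_alloc_method=1 rss_cores=8" > /etc/modprobe.d/sfc.conf
# reload the driver for the options to take effect (the interfaces will bounce)
modprobe -r sfc && modprobe sfc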
7. Netdev buffers
net.core.netdev_max_backlog=# ( higher than the default 2500; for 10g it should be much higher )
net.core.netdev_budget=300
(Depends. Default of 300 is usually ok. Max number of pkts taken from interfaces in each polling cycle)
(I have written a simple script to monitor the netdev buffers; will send it soon)
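Not the script mentioned above, but a one-liner sketch of the same kind of check: column 2 of /proc/net/softnet_stat counts packets dropped because the netdev backlog was full (values are hex):
awk '{ printf "cpu%d dropped=%d\n", NR-1, strtonum("0x" $2) }' /proc/net/softnet_stat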
8. Ring buffers, interrupt coalescing, moderation
adjust the ring buffers accordingly ( SFC might have a limitation of 4096 for both RX and TX .. other cards have different values )
no interrupt coalescing or moderation for any interfaces
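Illustrative ethtool commands for this step (interface name is hypothetical; ring limits and coalescing knobs vary by driver):
ethtool -g eth2                                   # show current/maximum RX and TX ring sizes
ethtool -G eth2 rx 4096 tx 4096
ethtool -c eth2                                   # show current coalescing settings
ethtool -C eth2 adaptive-rx off rx-usecs 0 rx-frames 1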
9. Bind device ( nic, raid, disk, usb .. etc ) interrupts to certain cores
Which interrupts to bind to which cores depends on the application .. We must achieve data locality for the application
See which cores are in which sockets .. Core numbering will vary between vendors even with the same Intel/AMD processors
"numactl --hardware" will show for sure
10. ONLOAD (If and when used)
EF_POLL_USEC=1000
(onload spins in user space for 1 ms. If not set, it will block. Better to stay in user space rather than in kernel space)
Can check with:
onload_stackdump | grep poll
(will greatly increase the performance as it'll avoid interrupts and kernel CS. By default onload uses epoll() when it goes into polling mode )
EF_RXQ_SIZE
(need to match the ring buffer RX setting )
EF_TCP_INITIAL_CWND
(may need to adjust this value depending on the initial congestion window set mentioned above in this email )
There are other settings which all depend on how the application is performing using onload.
"onload_stackdump" is a good starting point
Onload can only be used with SFC cards; other cards have TOE which needs different settings
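An illustrative way to launch an application under onload with the settings above (the binary name is hypothetical and the values are examples):
EF_POLL_USEC=1000 EF_RXQ_SIZE=4096 onload ./feed_handler
onload_stackdump | grep poll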
11. Using cpuset, tuna , isolcpus, taskset
It's a good practice to use isolcpus to isolate cores from general scheduling. I believe we are already doing this here. cpuset and tuna
both do almost the same things, so whatever the appdev/unix teams are comfortable with.
I personally like tuna better as once I know what thread is doing what I can bind them to certain cores properly in order to receive
data locality. To do this we need to understand the apps better. Is the app running on a busy loop or not? Personally I prefer to bind
the thread to a certain core and let it be running there all the time. Using cpusets will have the thread migrate to cores in the same
set which sometimes may not be desirable. Setting thread priorities/class might also be a good idea. But all of these depend on the
application behavior.
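A small sketch of the pinning described above (core numbers, PID and priority are illustrative; isolcpus is set on the kernel command line):
taskset -pc 3 12345        # bind thread/PID 12345 to isolated core 3
chrt -f -p 80 12345        # optionally give it SCHED_FIFO priority 80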
12. Events kernel thread
When we see the application is observing delays in receiving pkts/messages or there's a longer than usual time spent from the time
the pkt came to the NIC and the message delivered to the application, we may need to bump up the priority of the kernel event
thread. These threads process pending events generated in the kernel by device drivers … Raising their priority would speed up the epoll(), poll(),
select() calls, which I believe most of the NIC vendors use. When using ONLY onload this would not matter much
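A sketch of how the events threads could be located and their priority raised (thread names vary by kernel version; the priority is illustrative):
ps -eLo pid,class,rtprio,comm | grep -E "events|kevent"
chrt -f -p 50 <pid-of-events/N>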
13. MRG 1.3
A whole new ball game and MUST be carefully planned when used. Maybe soon at a wholesale level :-) There are different things
that need to be done for this.
UDP PACKET LOSS:
1. Poorly sized UDP receive (rx) buffer sizes / UDP buffer socket overflows
Description: Oracle RAC Global cache block processing is 'bursty' in nature and, consequently, the OS may need to buffer receive (rx) packets while waiting for CPU. Unavailable buffer space may lead to silent packet loss and global cache block loss. `netstat -s` or `netstat -su` on most UNIX will help determine UDPInOverflows, packet receive errors, dropped frames, or packets dropped due to buffer full errors.
Action: Packet loss is often attributed to inadequate (rx) UDP buffer sizing on the recipient server, resulting in buffer overflows and global cache block loss. The UDP receive (rx) buffer size for a socket is set to 128k when Oracle opens the socket when the OS setting is less than 128k. If the OS setting is larger than 128k Oracle respects the value and leaves it unchanged. The UDP receive buffer size will automatically increase according to database block sizes greater than 8k, but will not increase beyond the OS dependent limit. UDP buffer overflows, packet loss and lost blocks may be observed in environments where there are excessive timeouts on 'global cache cr requests' due to inadequate buffer setting when DB_FILE_MULTIBLOCK_READ_COUNT is greater than 4. To alleviate this problem, increase the UDP buffer size and decrease the DB_FILE_MULTIBLOCK_READ_COUNT for the system or active session.
To determine if you are experiencing UDP socket buffer overflow and packet loss, on most UNIX platforms, execute `netstat -s` or `netstat -su` and look for either "udpInOverflows", "packet receive errors" or "fragments dropped" depending on the platform.
NOTE: UDP packet loss usually results in increased latencies, decreased bandwidth, increased cpu utilization (kernel and user), and memory consumption to deal with packet retransmission.
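A quick Linux-side check for the overflows described above, plus an illustrative bump of the OS receive-buffer ceiling (values are examples only; Oracle sizes its own socket buffers up to the OS limit):
netstat -su | grep -i -E "receive errors|overflow"
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=262144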
2. Poor interconnect performance and high cpu utilization. `netstat -s` reports packet reassembly failures
Description: Large UDP datagrams may be fragmented and sent in multiple frames based on Maximum Transmission Unit (MTU) size. These fragmented packets need to be reassembled on the receiving node. High CPU utilization (sustained or frequent spikes), inadequate reassembly buffers and UDP buffer space can cause packet reassembly failures. `netstat -s` reports a large number of Internet Protocol (IP) 'reassembles failed' and 'fragments dropped after timeout' in the "IP Statistics" section of the output on the receiving node. Fragmented packets have a time-to-live for reassembly. Packets that are not reassembled are dropped and requested again. Fragments that arrive when there is no space for reassembly are silently dropped.
`netstat -s` IP stat counters:
3104582 fragments dropped after timeout
34550600 reassemblies required
8961342 packets reassembled ok
3104582 packet reassembles failed.
Action: Increase fragment reassembly buffers, allocating more space for reassembly. Increase the time to reassemble packet fragments, increase udp receive buffers to accommodate network processing latencies that aggravate reassembly failures and identify CPU utilization that negatively impacts network stack processing.
On LINUX:
To modify reassembly buffer space, change the following thresholds:
/proc/sys/net/ipv4/ipfrag_low_thresh (default = 196608)
/proc/sys/net/ipv4/ipfrag_high_thresh (default = 262144)
To modify packet fragment reassembly times, modify:
/proc/sys/net/ipv4/ipfrag_time (default = 30)
See your OS documentation for the equivalent command syntax.
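A minimal sketch of applying the Linux reassembly tunables above (the values shown are illustrative only):
sysctl -w net.ipv4.ipfrag_high_thresh=16777216
sysctl -w net.ipv4.ipfrag_low_thresh=15728640
sysctl -w net.ipv4.ipfrag_time=60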
3. Network packet corruption resulting from UDP checksum errors and/or send (tx) / receive (rx) transmission errors
Description: UDP includes a checksum field in the packet header which is read on receipt. Any corruption of the checksum results in silent dropped packets. Checksum corruptions result in packet retransmissions, additional cpu overhead for the additional request and latencies in packet processing.
Action: Use tcpdump/snoop/network sniffer utilities to capture packet dumps to identify checksum errors and confirm checksum corruption. Engage sysadmins and network engineers for root cause. Checksum offloading on NICs has been known to create checksum errors. Consider disabling the NIC checksum offloading, if configured, and test. On LINUX, `ethtool -K <IF> rx off tx off` disables the checksum offloading.
4. Mismatched MTU sizes in the communication path
Description: Mismatched MTU sizes cause 'packet too big' failures and silent packet loss resulting in global cache block loss and excessive packet retransmission requests.
Action: The MTU is the "Maximum Transmission Unit" or frame size configured for the interconnect interfaces. The default standard for most UNIX is 1500 bytes for Ethernet. MTU definitions should be identical for all devices in the interconnect communication path. Identify and monitor all devices in the interconnect communication path. Use large, non-default sized, ICMP probe packets for `ping`, `tracepath` or `traceroute` to detect mismatched MTUs in the path. Use `ifconfig` or vendor recommended utilities to determine and set MTU sizes for the server NICs. See Jumbo Frames #12 below. Note: Mismatched MTU sizes for the interconnect will inhibit nodes joining the cluster in 10g and 11g.
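A quick way to probe for a path MTU mismatch on Linux (the peer address is a placeholder; 8972 is a 9000-byte MTU minus IP/ICMP headers, use 1472 for a 1500-byte MTU):
ping -M do -s 8972 -c 3 <peer-ip>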
5. Faulty or poorly seated cables/cards
Description: Faulty network cable connections, the wrong cable, poorly constructed cables, excessive length and wrong port assignments can result in inferior bit rates, corrupt frames, dropped packets and poor performance.
Action: CAT 5 grade cables or better should be deployed for interconnect links. All cables should be securely seated and labeled according to LAN/port and aggregation, if applicable. Cable lengths should conform to vendor Ethernet specifics.
6. Interconnect LAN non-dedicated
Description: Shared public IP traffic and/or shared NAS IP traffic, configured on the interconnect LAN, will result in degraded application performance, network congestion and, in extreme cases, global cache block loss.
Action: The interconnect/clusterware traffic should be on a dedicated LAN defined by a non-routed subnet. Interconnect traffic should be isolated to the adjacent switch(es), e.g. interconnect traffic should not extend beyond the access layer switch(es) to which the links are attached. The interconnect traffic should not be shared with public or NAS traffic. If Virtual LANs (VLANs) are used, the interconnect should be on a single, dedicated VLAN mapped to a dedicated, non-routed subnet, which is isolated from public or NAS traffic.
7. Lack of Server/Switch Adjacency
Description: Network devices are said to be 'adjacent' if they can reach each other with a single hop across a link layer. Multiple hops add latency and introduce unnecessary complexity and risk when other network devices are in the communication path.
Action: All GbE server interconnect links should be (OSI) layer 2 direct attach to the switch or switches (if redundant switches are configured). There should be no intermediary network device, such as a router, in the interconnect communication path. The unix command `traceroute` will help identify adjacency issues.
8. IPFILTER configured
Description: IPFILTER (IPF) is a host based firewall or Network Address Translation (NAT) software package that has been identified to create problems for interconnect traffic. IPF may contribute to severe application performance degradation, packet loss and global cache block loss.
Action: disable IPFILTER
9. Outdated Network driver or NIC firmware
Description: Outdated NIC drivers or firmware have been known to cause problems in packet processing
across the interconnect. Incompatible NIC drivers in inter-node communication may introduce packet
processing latencies, skewed latencies and packet loss.
Action: Server NICs should be the same make/model and have identical performance characteristics on all
nodes and should be symmetrical in slot id. Firmware and NIC drivers should be at the same (latest) rev. for all
server interconnect NICs in the cluster.
10. Proprietary interconnect link transport and network protocol
Description: Non-standard, proprietary protocols, such as LLT, HMP, etc., have proven to be unreliable and difficult to debug. Misconfigured proprietary protocols have caused application performance degradation, dropped packets and node outages.
Action: Oracle has standardized on 1GbE UDP as the transport and protocol. This has proven stable, reliable and performant. Proprietary protocols and substandard transports should be avoided. IP and RDS on InfiniBand are available and supported for interconnect network deployment and 10GbE has been certified for some platforms (see OTN for details) – certification in this area is ongoing.
11. Misconfigured bonding/link aggregation
Description: Failure to correctly configure NIC Link Aggregation or Bonding on the servers, or failure to configure aggregation on the adjacent switch for interconnect communication, can result in degraded performance and block loss due to 'port flapping', where interconnect ports on the switch forming an aggregated link frequently change 'UP'/'DOWN' state.
Action: If using link aggregation on the clustered servers, the ports on the switch should also support and be configured for link aggregation for the interconnect links. Failure to correctly configure aggregation for interconnect ports on the switch will result in 'port flapping': switch ports randomly dropping, resulting in packet loss.
Bonding/Aggregation should be correctly configured per driver documentation and tested under load. There
are a number of public domain utilities that help to test and measure link bandwidth and latency performance
(see iperf). OS, network and network driver statistics should be evaluated to determine efficiency of bonding.
12. Misconfigured Jumbo Frames
Description: Misconfigured Jumbo Frames may create the mismatched MTU sizes described above.
Action: Jumbo Frames are not an IEEE standard and, as a consequence, care should be taken when configuring them. A Jumbo Frame is a frame size of around 9000 bytes. Frame size may vary depending on the network device vendor and may not be consistent between communicating devices. An identical maximum transmission unit (MTU) size should be configured for all devices in the communication path if the default is not 9000 bytes. All the network devices, switches/NICs/line cards, in operation must be configured to support the same frame size (MTU size). Mismatched MTU sizes, where the switch may be configured as MTU:1500 but the server interconnect interfaces are configured as MTU:9000, will lead to packet loss, packet fragmentation and reassembly errors which cause severe performance degradation and cluster node outages. The IP stats in `netstat -s` on most platforms will identify frame fragmentation and reassembly errors. The command `ifconfig -a`, on most platforms, will identify the frame size in use (MTU:1500). See the switch vendor's documentation to identify Jumbo Frames support.
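Illustrative check and set of the interface MTU on Linux (interface name is hypothetical; make the value persistent in the distro's network configuration):
ip link show eth2 | grep mtu
ip link set dev eth2 mtu 9000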
13. NIC force full duplex and duplex mode mismatch
Description: Duplex mode mismatch is when two nodes in a communication channel are operating at
half-duplex on one end and full duplex on the other. This may be due to manually misconfigured duplex modes, or one end configured manually to half-duplex while the communication partner is set to auto negotiate. Duplex
mode mismatch results in severely degraded interconnect communication.
Action: Duplex mode should be set to auto negotiate for all Server NICs in the cluster *and* line cards on the
switch(es) servicing the interconnect links. Gigabit Ethernet standards require auto negotiation set to “on” in
order to operate. Duplex mismatches can cause severe network degradation, collisions and dropped packets.
Auto negotiate duplex modes should be confirmed after every hardware/software upgrade affecting the
network interfaces. Auto negotiate on all interfaces will operate at 1000 full duplex.
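A quick way to confirm the negotiated speed/duplex on Linux (interface name is hypothetical):
ethtool eth2 | grep -E "Speed|Duplex|Auto-negotiation"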
14. Flow-control mismatch in the interconnect communication path
Description: Flow control is the situation when a server may be transmitting data faster than a network peer
(or network device in the path) can accept it. The receiving device may send a PAUSE frame requesting the
sender to temporarily stop transmitting.
Action: Flow-control mismatches can result in lost packets and severe interconnect network performance
degradation.
tx flow control should be turned off
rx flow control should be turned on
tx/rx flow control should be turned on for the switch(es)
NOTE: flow control definitions may change after firmware/network driver upgrades. NIC settings should be
verified after any upgrade
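A sketch of applying the flow-control guidance above with ethtool (interface name is hypothetical; driver support for pause settings varies):
ethtool -a eth2                  # show current pause/flow-control settings
ethtool -A eth2 tx off rx on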
15. Packet loss at the OS, NIC or switch layer
Description: Any packet loss as reported by OS, NIC or switch should be thoroughly investigated and resolved.
Packet loss can result in degraded interconnect performance, cpu overhead and network/node outages.
Action: Specific tools will help identify which layer you are experiencing the packet/frame loss
(process/OS/Network/NIC/switch). netstat, ifconfig, ethtool, kstat (depending on the OS) and switch port stats
would be the first diagnostics to evaluate. You may need to use a network sniffer to trace end-to-end packet
communication to help isolate the problem (see public domain tools such as snoop/wireshark/ethereal). Note, understanding packet loss at the lower layers may be essential to determining root cause. Undersized ring
buffers or receive queues on a network interface are known to cause silent packet loss, e.g. packet loss that is
not reported at any layer. See NIC Driver Issues and Kernel queue lengths below. Engage your systems
administrator and network engineers to determine root cause.
16. NIC Driver/Firmware Configuration
Description: Misconfigured or inadequate default settings for tunable NIC public properties may result in
silent packet loss and increased retransmission requests.
Action: Default factory settings should be satisfactory for the majority of network deployments. However,
there have been issues with some vendor NICs and the nature of interconnect traffic that have required
modifying interrupt coalescence settings and the number of descriptors in the ring buffers associated with the
device. Interrupt coalescence is the CPU interrupt rate for send (tx) and receive (rx) packet processing. The
ring buffers hold rx packets for processing between CPU interrupts. Misconfiguration at this layer often results
in silent packet loss. Diagnostics at this layer require sysadmin and OS/Vendor intervention.
17. NIC send (tx) and receive (rx) queue lengths
Description: Inadequately sized NIC tx/rx queue lengths may silently drop packets when queues are full. This
results in gc block loss, increased packet retransmission and degraded interconnect performance.
Action: As packets move between the kernel network subsystem and the network interface device driver,
send (tx) and receive (rx) queues are implemented to manage packet transport and processing. The size of these queues is configurable. If these queues are under-configured or misconfigured for the amount of
network traffic generated or MTU size configured, full queues will cause overflow and packet loss. Depending
on the driver and quality of statistics gathered for the device, this packet loss may not be easy to detect.
Diagnostics at this layer require sysadmin and OS/Vendor intervention. (c.f. iftxtqueue and
netdev_max_backlog on linux)
18. Limited capacity and over-saturated bandwidth
Description: Oversubscribed network usage will result in interconnect performance degradation and packet
loss.
Action: An interconnect deployment best practice is to know your interconnect usage and bandwidth. This
should be monitored regularly to identify usage trends, transient or constant. Increasing demands on the
interconnect may be attributed to scaling the application or aberrant usage such as bad sql or unexpected
traffic skew. Assess the cause of bandwidth saturation and address it.
19. Over subscribed CPU and scheduling latencies
Description: Sustained high load averages and network stack scheduling latencies can negatively affect
interconnect packet processing and result in interconnect performance degradation, packet loss, gc block loss
and potential node outages.
Action: Scheduling delays when the system is under high CPU utilization can cause delays in network packet
processing. Excessive, sustained latencies will cause severe performance degradation and may cause cluster
node failure. It is critical that sustained elevated CPU utilization be investigated. The `uptime` command will
display load average information on most platforms. Excessive CPU interrupts associated with network stack
processing may be mitigated through NIC interrupt coalescence and/or binding network interrupts to a single
CPU. Please work with NIC vendors for these types of optimizations. Scheduling latencies can result in
reassembly errors. See #2 above.
20. Switch related packet processing problems
Description: Buffer overflows on the switch port, switch congestion and switch misconfiguration such as MTU
size, aggregation and Virtual LAN definitions (VLANs) can lead to inefficiencies in packet processing resulting
in performance degradation or cluster node outage.
Action: The Oracle interconnect requires a switched Ethernet network. The switch is a critical component in
end-to-end packet communication of interconnect traffic. As a network device, the switch may be subject to
many factors or conditions that can negatively impact interconnect performance and availability. It is critical
that the switch be monitored for abnormal, packet processing events, temporary or sustained traffic
congestion and efficient throughput. Switch statistics should be evaluated at regular intervals to assess trends
in interconnect traffic and to identify anomalies.
21. QoS which negatively impacts the interconnect packet processing
Description: Quality of Service definitions on the switch which shares interconnect traffic may negatively
impact interconnect network processing resulting in severe performance degradation
Action: If the interconnect is deployed on a shared switch segmented by VLANs, any QoS definitions on the
shared switch should be configured such that prioritization of service does not negatively impact interconnect
packet processing. Any QoS definitions should be evaluated prior to deployment and impact assessed.
22. Spanning tree brownouts during reconvergence.
Description: Ethernet networks use a Spanning Tree Protocol (STP) to ensure a loop-free topology where
there are redundant routes to hosts. An outage of any network device participating in an STP topology is
subject to a reconvergence of the topology which recalculates routes to hosts. If STP is enabled in the LAN and
misconfigured or unoptimized, a network reconvergence event can take up to 1 minute or more to recalculate
(depending on size of network and participating devices). Such latencies can result in interconnect failure and
cluster wide outage.
Action: Many switch vendors provide optimized extensions to STP enabling faster network reconvergence
times. Optimizations such as Rapid Spanning Tree (RSTP), Per-VLAN Spanning Tree (PVST), and Multi-Spanning
Tree (MSTP) should be deployed to avoid a cluster wide outage.
23. sq_max_size inadequate for STREAMS queing
Description: AWR reports high waits for “gc cr block lost” and/or “gc current block lost”. netstat output does
not reveal any packet processing errors. `kstat -p -s '*nocanput*'` returns non-zero values. nocanput indicates
that queues for streaming messages are full and packets are dropped. Customer is running STREAMS in a RAC
in a Solaris env.
Action: Increasing the udp max buffer space and defining unlimited STREAMS queuing should relieve the
problem and eliminate ‘nocanput’ lost messages. The following are the Solaris commands to make these
changes:
`ndd -set /dev/udp udp_max_buf <NUMERIC VALUE>`
set sq_max_size to 0 (unlimited) in /etc/system. Default = 2
udp_max_buf controls how large send and receive buffers (in bytes) can be for a UDP socket. The default
setting, 262,144 bytes, may be inadequate for STREAMS applications. sq_max_size is the depth of the message
queue.
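An illustrative application of the two Solaris changes above (the udp_max_buf value is an example only; the /etc/system change takes effect after a reboot):
ndd -set /dev/udp udp_max_buf 2097152
echo "set sq_max_size=0" >> /etc/system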
Troubleshooting InfiniBand connection issues using OFED tools
The Open Fabrics Enterprise Distribution (OFED) package has many debugging tools available as part of the standard release. This article
describes the use of those tools to troubleshoot the hardware and firmware of an InfiniBand fabric deployment.
First, the /sys/class sub-system should be checked to verify that the hardware is up and connected to the InfiniBand fabric. The following
command will show the InfiniBand hardware modules recognized by the system:
ls /sys/class/infiniband
This example will use the module mlx4_0, which is typical for Mellanox ConnectX* series of adapters. If this, or a similar module, is not found,
refer to the documentation that came with the OFED package on starting the OpenIB drivers.
Next, check the state of the InfiniBand port:
cat /sys/class/infiniband/mlx4_0/ports/1/state
This command should return “ACTIVE” if the hardware is initialized, and the subnet manager has found the port and added the port to the
InfiniBand fabric. If this command returns “INIT” the hardware is initialized, but the subnet manager has not added the port to the fabric yet.
If necessary, start the subnet manager:
/etc/init.d/opensmd start
Once the port on the head node is in the “ACTIVE” state, check the state of the InfiniBand port on all the compute nodes to ensure that all of
the Infiniband hardware on the compute nodes has been initialized, and the subnet manager has added all of the compute nodes ports on to
the fabric. This article will use the pdsh tool to run the command on all nodes:
pdsh -a cat /sys/class/infiniband/mlx4_0/ports/1/state
All nodes should report "ACTIVE". If a node reports it cannot find the file, ensure the OpenIB drivers are loaded on that node. Refer to the
documentation that came with the OFED package on starting the OpenIB drivers.
Once all of the compute nodes report that port 1 is “ACTIVE”, verify the speed on each port using the following commands:
cat /sys/class/infiniband/mlx4_0/ports/1/rate
pdsh -a cat /sys/class/infiniband/mlx4_0/ports/1/rate
This is a good first check for a bad cable or connection. Each port should report the same speed. For example, the output for double data rate
(DDR) InfiniBand cards will be similar to “20 Gb/sec (4X DDR)”.
Once the above basic checks are complete, more in-depth troubleshooting can be performed. The main OFED tool for troubleshooting
performance and connection problems is ibdiagnet. This tool runs multiple tests, as specified on the command line during the run, to detect
errors related to the subnet, bad packets, and bad states. These errors are some of the more commonly seen during initial setup of InfiniBand fabrics.
Run ibdiagnet with the following command line options:
ibdiagnet -pc -c 1000
The output will be similar to this:
Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
Loading IBDM from: /usr/lib64/ibdm1.2
-W- A few ports of local device are up.
    Since port-num was not specified (-p option), port 1 of device 1 will be used as the local port.
-I- Discovering ... 17 nodes (1 Switches & 16 CA-s) discovered.
-I---------------------------------------------------
-I- Bad Guids/LIDs Info
-I---------------------------------------------------
-I- No bad Guids were found
-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found
-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-I- No illegal PM counters values were found
-I---------------------------------------------------
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- No members found for group
-I---------------------------------------------------
-I- Bad Links Info
-I- Errors have occurred on the following links
    (for errors details, look in log file /tmp/ibdiagnet.log):
-I---------------------------------------------------
    Link at the end of direct route "1,5"
----------------------------------------------------------------
-I- Stages Status Report:
    STAGE                           Errors  Warnings
    Bad GUIDs/LIDs Check            0       0
    Link State Active Check         0       0
    Performance Counters Report     0       0
    Partitions Check                0       0
    IPoIB Subnets Check             0       1
    Link Errors Check               0       0
Please see /tmp/ibdiagnet.log for complete log
----------------------------------------------------------------
-I- Done. Run time was 9 seconds.
The warning “No members found for group” can safely be ignored.
In this example, a bad link was found: “Link at the end of direct route “1,5”.” "1,5" refers to the LID numbers associated with the individual
ports. The following commands can be used to identify the LID numbers associated with each port:
cat /sys/class/infiniband/mlx4_0/ports/1/lid
pdsh -a cat /sys/class/infiniband/mlx4_0/ports/1/lid
This command generates a list of LIDs associated with nodes. In the output of the above command, locate the entries for 0x1 and 0x5. 0x1 is
likely the head node. For errors of this type, reseat or replace the InfiniBand cable connecting the node corresponding to LID 0x5.
Finally, run ibdiagnet one more time to verify there are no errors, and then check the error state of each port. Each test should pass.
ibdiagnet -pc -c 1000
ibcheckerrors
KICKSTART FOR DISKS LARGER THAN 2TB:
Some machines we build have large root disk arrays, bigger than 2TB. They exceed the capacity of the msdos partition table and require a GPT (GUID
Partition Table) to be installed in order to use the whole disk. Only RHEL 6.1 supports this now, and it required significant customization of the
kickstart configuration file. Also, it only works on machines equipped with the UEFI Boot Manager. IBM x3650 M3 and x3755 M3 do have it, but I
only tested the new setup on the x3650 M3.
In order to install RHEL 6.1 to a large disk machine, do the following:
o In Opsware, pick the OS Sequence called RBCCM-NY-RHEL-6.1_GPT. It is equivalent to RBCCM-NY-RHEL-6.1 except for GPT support.
o This will cause a regular installation to be executed. Once it is over, the machine will reboot. On that reboot, go to the System Configuration and Boot Management screen (a.k.a. BIOS config) and follow this set of menu options:
Boot Manager | Add Boot Option | Load File (first) | EFI | redhat | grub.efi
o Provide a short but meaningful name to the new boot option through the Input Description option.
o Once this is done, press Commit Changes.
o Exit the System Configuration screen; the machine should boot into the newly installed OS.
On a regular, non-Large Disk system, this OS Sequence should work the same as the regular one; no additional activity is required.
# Kickstart file automatically generated by anaconda.
#version=DEVEL
install
nfs --server=159.55.104.162 --dir=/vol/XRZ0/opsware/linux/redhat/x86_64/v6.1
lang en_US.UTF-8
keyboard us
network --onboot no --device eth0 --noipv4 --noipv6
network --onboot no --device eth1 --noipv4 --noipv6
network --onboot yes --device eth2 --mtu=1500 --bootproto dhcp
network --onboot no --device eth3 --noipv4 --noipv6
network --onboot no --device usb0 --noipv4 --noipv6
rootpw --iscrypted X4qmByxwKBQbY
# Reboot after installation
reboot
firewall --disabled
authconfig --useshadow --enablemd5
selinux --disabled
timezone --utc America/New_York
bootloader --location=mbr --driveorder=sda --append="crashkernel=auto rhgb quiet"
# The following is the partition information you requested
# Note that any partitions you deleted are not expressed
# here so unless you clear all partitions first, this is
# not guaranteed to work
#clearpart --none
#part /boot/efi --fstype=efi --grow --maxsize=200 --size=20
#part /boot --fstype=ext4 --size=500
#part pv.008003 --grow --asprimary --size=25600
#volgroup vg00 --pesize=32768 pv.008003
#logvol / --fstype=ext4 --name=root --vgname=vg00 --size=10240
#logvol swap --name=swap --vgname=vg00 --size=4096
#logvol /tmp --fstype=ext4 --name=tmp --vgname=vg00 --size=2048
#logvol /var --fstype=ext4 --name=var --vgname=vg00 --size=6144
repo --name="Red Hat Enterprise Linux" --baseurl=nfs:159.55.104.162:/vol/XRZ0/opsware/linux/redhat/x86_64/v6.1 --cost=100
%packages
@Base
@Core
@base
@core
OpenIPMI
OpenIPMI-libs
audit
audit-libs
autofs
busybox
compat-openldap
crash
dos2unix
e2fsprogs
grub
kexec-tools
krb5-libs
krb5-libs.i686
ksh
libcollection
libcollection.i686
libdhash
libdhash.i686
libini_config
libini_config.i686
lm_sensors
lvm2
mailx
ncompress
net-snmp
net-snmp-libs
net-snmp-utils
nfs-utils
nss-pam-ldapd
ntp
openldap
openldap-clients
openssl098e
pam.i686
parted
perl
perl-Compress-Zlib
perl-Convert-ASN1
perl-HTML-Parser
perl-HTML-Tagset
perl-IO-Socket-SSL
perl-LDAP
perl-Net-SSLeay
perl-URI
perl-XML-NamespaceSupport
perl-XML-SAX
perl-libwww-perl
psacct
sssd
sssd-client
strace
sudo
sysstat
tcpdump
tcsh
telnet
zsh
-NetworkManager
-NetworkManager-glib
-alsa-lib
-alsa-utils
-audiofile
-authconfig-gtk
-bluez-libs
-cdparanoia-libs
-cups
-cups-libs
-dbus-python
-desktop-file-utils
-elinks
-esound
-fetchmail
-foomatic
-ftp
-gimp-print
-gnome-keyring
-gnome-python2
-gnome-python2-bonobo
-gnome-python2-canvas
-gnome-python2-gtkhtml2
-gnome-vfs2
-gnuplot
-gphoto2
-gpm
-gstreamer
-gtkhtml2
-gtksourceview
-indexhtml
-isdn4k-utils
-jpackage-utils
-jwhois
-kdemultimedia
-kernel-devel
-lftp
-lha
-libbonobo
-libbonoboui
-libglade2
-libgnome
-libgnomecanvas
-libgnomeui
-libibverbs
-libjpeg
-libmthca
-libpng
-libtiff
-libvorbis
-libwnck
-libwvstreams
-libxml2
-libxslt
-logwatch
-lokkit
-lrzsz
-minicom
-mkbootdisk
-mt-st
-mtr
-mutt
-mysql
-nc
-nmap
-numactl
-openib
-openmotif
-opensm-libs
-pango
-pinfo
-ppp
-procmail
-pyOpenSSL
-pygtk2
-pyorbit
-pyxf86config
-rdate
-rdist
-redhat-logos
-redhat-menus
-rhn-check
-rhn-client-tools
-rhn-setup
-rhnsd
-rp-pppoe
-rsh
-samba-client
-samba-common
-setuptool
-sox
-star
-stunnel
-system-config-network-tui
-talk
-tmpwatch
-tog-pegasus
-usermode-gtk
-vnc-server
-wireless-tools
-wvdial
-yp-tools
-ypbind
%end
%pre
set -x
[ -d /opsw ] && exit
mkdir /opsw
mount /dev/sda1 /opsw
cp /opsw/phase2/* /tmp
umount /opsw
/tmp/miniagent --server 159.55.104.44 8017
%end
%pre --logfile /tmp/anaconda.pre_log
#!/bin/bash
echo "Running pre-install script..."
# find out the size of the boot disk
parted="/usr/sbin/parted"
grep="/bin/grep"
awk="/bin/awk"
sed="/bin/sed"
dd="/usr/bin/dd"
sleep="/bin/sleep"
# echo 'y' because parted asks for some confirmation when hits a raw disk without mbr
sda_size=`echo 'y' | $parted -l /dev/sda | $grep ^Disk| $grep sda | $awk -F':' '{print $2}' | $sed -e 's/^ *//' | $sed -e 's/[^0-9]*$//'`
echo "/dev/sda size == $sda_size GB"
echo ""
# clean up the MBR and partition table of the root disk
/usr/bin/dd bs=512 count=10 if=/dev/zero of=/dev/sda
echo ""
if [ $sda_size -gt 2048 ]
then
echo "Root disk is bigger than 2TB, installing gpt partition table!"
/usr/sbin/parted --script /dev/sda mklabel gpt
else
echo "Root disk is smaller than 2TB, installing msdos partition table."
/usr/sbin/parted --script /dev/sda mklabel msdos
fi
echo ""
/usr/sbin/parted -l /dev/sda
/bin/sleep 30
%end
%post --nochroot
echo "Running post-install scritp"
# prepare UEFI boot environment
# (don't know how to make the RHEL installer do it automatically)
/usr/bin/cp /mnt/sysimage/boot/grub/grub.conf /mnt/sysimage/boot/efi/EFI/redhat/grub.conf
# necessary for grubby to update the right file during the kernel upgrades
/usr/bin/rm /mnt/sysimage/etc/grub.conf
/usr/bin/ln -s /boot/efi/EFI/redhat/grub.conf /mnt/sysimage/etc/grub.conf
# make logs available after kickstart for debugging purpose
/usr/bin/cp /tmp/*log /mnt/sysimage/root
%end
%post
touch /var/tmp/agentInstallTrigger; while [ -f /var/tmp/agentInstallTrigger ]; do if [ -f /var/tmp/doneAgentInstall ]; then break; fi; sleep 5; done
%end
FUSION IO:
RAID1-SETUP
http://kb.fusionio.com/KB/a18/creating-a-raid1mirrored-set-using-the-iodrive-on-linux.aspx?KBSearchID=16919
Creating a RAID1 (Mirrored) Set using the ioDrive on Linux
To create a RAID1/Mirror set on Linux, enter the command:
$ mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/fioa /dev/fiob
This creates a mirrored set using the two ioDrives fioa and fiob. (Use the fio-status utility to view your specific device
names.)
Start up the RAID1 mirror
To start the RAID1 mirrored set and mount it as md0, enter the command:
mdadm --assemble /dev/md0 /dev/fioa /dev/fiob
Stop the RAID1 mirror
To stop the RAID1 set, enter the command:
mdadm --stop /dev/md0
Make the mirror persistent (exists after a reboot)
Use the following commands so that your RAID1 mirror will appear after a reboot.
$ sudo sh -c 'echo "DEVICE /dev/fioa /dev/fiob" >/etc/mdadm.conf'
$ sudo sh -c 'mdadm --detail --scan >>/etc/mdadm.conf'
On some versions of Unix, the configuration file is in /etc/mdadm/mdadm.conf,
not /etc/mdadm.conf.
On most systems, the RAID1 mirror will be created automatically upon reboot. However, if you have problems
accessing /dev/md0 after a reboot, run this command:
$ sudo mdadm --assemble --scan
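Once the mirror is assembled, a filesystem can be created and mounted on it; a minimal sketch (ext3 and the mountpoint are illustrative, matching the fstab examples later in this document):
$ mkfs.ext3 /dev/md0
$ mkdir -p /iodrive_mountpoint
$ mount /dev/md0 /iodrive_mountpoint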
DRIVER-SETUP
http://kb.fusionio.com/KB/a64/loading-the-driver-via-udev-or-init-script-for-md-and-lvm.aspx?KBSearchID=16922
Loading the Driver via udev or Init Script for md and LVM
There are two main methods for loading the Fusion-IO driver: udev and the iodrive script. The default method of loading is udev.
Using udev to Load the Driver
Most modern linux distributions use udev to facilitate driver loading. Usually it just works, behind the scenes, loading drivers. udev automatically loads drivers for installed hardware on the system. It does this by walking through the devices on the PCI bus, and then loading any driver that has been properly configured to work with that device. This way, drivers can be loaded without having to use scripts or modify configuration files. The Fusion-io drivers are properly configured, and udev will find the drivers and attempt to load them.
However, there are some cases where loading by udev is not appropriate, or can have issues.
Udev will wait 180 seconds for the driver to load, then it will exit. In most cases, this is plenty of time, even with multiple ioDrives installed. But
if the drives were shut down improperly, loading the driver and attaching the drives takes longer than the 180 seconds. In this case, udev will
exit. The driver will not exit, but will continue working on attaching the drives.
There is not always a problem when udev exits early. The drivers will eventually load, and then you will be able to use the attached block
devices. But, if the drivers do take too long to load, and udev does exit, and file systems are set to be mounted in the fstab, then the system file
system check (fsck) will fail, and the system will stop booting. In most distributions the user will drop into a single-user mode, or repair mode.
Again, this is normal behavior; after the driver finishes rescanning and attaching the drive, a reboot will fix things. For most users, this will not
happen often enough to be an issue.
But for installations with many devices, or for server installations where dropping into single-user mode is unacceptable, there is an alternative
method for driver loading that does not have these issues.
Using Init Scripts to Load the Driver
The ioDrive packages provide init scripts that are installed on all supported Linux distributions. These scripts typically reside in /etc/init.d/iodrive or /etc/rc.d/iodrive. The scripts are used to load and start the driver, and to mount filesystems, after the system is up. This method completely avoids the udev behavior described above. (It will wait as long as it takes for drives to be attached.)
NOTE: These steps assume that the logical volumes /dev/md0 (md) and /dev/vg0/fio_lv (LVM) have been set up beforehand. There are other knowledgebase articles that provide details for logical volume creation, including updating lvm.conf to work with ioDrives. Follow the instructions based on your driver version. (jump to: Using Init Scripts to Load the 1.2.x Driver)
Using Init Scripts to Load the 2.x.x Driver
Step 1: Modifying /etc/modprobe.d/iomemory-vsl.conf
Edit /etc/modprobe.d/iomemory-vsl.conf, and uncomment the blacklist line:
Before:
# To keep ioDrive from auto loading at boot, uncomment below
# blacklist iomemory-vsl
After:
# To keep ioDrive from auto loading at boot, uncomment below
blacklist iomemory-vsl
This keeps udev from automatically loading the driver.
Step 2: Modifying /etc/fstab
Add noauto to the options in the appropriate line in the /etc/fstab file. This will keep the OS from trying to check the drive for errors on boot.
Before:
LABEL=/ / ext3 defaults 1 1
LABEL=/boot /boot ext3 defaults 1 2
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
LABEL=SWAP-sda5 swap swap defaults 0 0
/dev/md0 /iodrive_mountpoint ext3 defaults 1 2
/dev/vg0/fio_lv /iodrive_mountpoint2 ext3 defaults 1 2
After:
LABEL=/ / ext3 defaults 1 1
LABEL=/boot /boot ext3 defaults 1 2
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
LABEL=SWAP-sda5 swap swap defaults 0 0
/dev/md0 /iodrive_mountpoint ext3 defaults,noauto 0 0
/dev/vg0/fio_lv /iodrive_mountpoint2 ext3 defaults,noauto 0 0
Step 3: Modifying /etc/sysconfig/iomemory-vsl
Edit /etc/sysconfig/iomemory-vsl, and uncomment ENABLED=1 to enable the init script.
Before:
# If ENABLED is not set (non-zero) then the iomemory-vsl init script will not be used.
#ENABLED=1
After:
# If ENABLED is not set (non-zero) then the iomemory-vsl init script will not be used.
ENABLED=1
While editing /etc/sysconfig/iomemory-vsl, add the mountpoint to the MOUNTS variable so it will be automatically attached and mounted:
Before:
# An IFS separated list of md arrays to start once the driver is loaded.
# Arrays should be configured in the mdadm.conf file.
# Example: MD_ARRAYS="/dev/md0 /dev/md1"
MD_ARRAYS=""
# An IFS separated list of LVM volume groups to start once the driver is loaded.
# Volumes should be configured in lvm.conf.
# Example: LVM_VGS="/dev/vg0 /dev/vg1"
LVM_VGS=""
# An IFS separated list of mount points to mount once the driver is loaded.
# These mount points should be listed in /etc/fstab with "noauto" as one of the mount options.
# Example /etc/fstab:
# /dev/fioa /mnt/fioa ext3 defaults,noauto 0 0
# /dev/fiob /mnt/firehose ext3 defaults,noauto 0 0
# Example: MOUNTS="/mnt/fioa /mnt/firehose"
MOUNTS=""
After:
# An IFS separated list of md arrays to start once the driver is loaded.
# Arrays should be configured in the mdadm.conf file.
# Example: MD_ARRAYS="/dev/md0 /dev/md1"
MD_ARRAYS="/dev/md0"
# An IFS separated list of LVM volume groups to start once the driver is loaded.
# Volumes should be configured in lvm.conf.
# Example: LVM_VGS="/dev/vg0 /dev/vg1"
LVM_VGS="/dev/vg0"
# An IFS separated list of mount points to mount once the driver is loaded.
# These mount points should be listed in /etc/fstab with "noauto" as one of the mount options.
# Example /etc/fstab:
# /dev/fioa /mnt/fioa ext3 defaults,noauto 0 0
# /dev/fiob /mnt/firehose ext3 defaults,noauto 0 0
# Example: MOUNTS="/mnt/fioa /mnt/firehose"
MOUNTS="/iodrive_mountpoint /iodrive_mountpoint2"
Step 4: Verifying the Status of the Init Script
Make sure the iomemory-vsl script loads at run levels 1 through 5 (runlevel 0 is shutdown and runlevel 6 is reboot). Run the following commands:
$ chkconfig iomemory-vsl on
$ chkconfig --list iomemory-vsl
iomemory-vsl 0:off 1:on 2:on 3:on 4:on 5:on 6:off
Using Init Scripts to Load the 1.2.x Driver
Step 1: Modifying /etc/modprobe.d/iodrive
Edit /etc/modprobe.d/iodrive, and uncomment the blacklist line:
Before:
# To keep ioDrive from auto loading at boot, uncomment below
# blacklist fio-driver
After:
# To keep ioDrive from auto loading at boot, uncomment below
blacklist fio-driver
This keeps udev from automatically loading the driver.
Step 2: Modifying /etc/fstab. Add noauto to the options in the appropriate line in the /etc/fstab file. This will keep the OS from trying to check
the drive for errors on boot.
Before:
LABEL=/ / ext3 defaults 1 1
LABEL=/boot /boot ext3 defaults 1 2
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
LABEL=SWAP-sda5 swap swap defaults 0 0
/dev/md0 /iodrive_mountpoint ext3 defaults 1 2
/dev/vg0/fio_lv /iodrive_mountpoint2 ext3 defaults 1 2
After:
LABEL=/ / ext3 defaults 1 1
LABEL=/boot /boot ext3 defaults 1 2
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
LABEL=SWAP-sda5 swap swap defaults 0 0
/dev/md0 /iodrive_mountpoint ext3 defaults,noauto 0 0
/dev/vg0/fio_lv /iodrive_mountpoint2 ext3 defaults,noauto 0 0
Step 3: Modifying /etc/sysconfig/iodrive
Edit /etc/sysconfig/iodrive, and add the mountpoint to the MOUNTS variable, so it will be automatically attached and mounted:
Before:
# An IFS separated list of md arrays to start once the driver is loaded.
# Arrays should be configured in the mdadm.conf file.
# Example: MD_ARRAYS="/dev/md0 /dev/md1"
MD_ARRAYS=""
# An IFS separated list of LVM volume groups to start once the driver is loaded.
# Volumes should be configured in lvm.conf.
# Example: LVM_VGS="/dev/vg0 /dev/vg1"
LVM_VGS=""
# An IFS separated list of mount points to mount once the driver is loaded.
# These mount points should be listed in /etc/fstab with "noauto" as one of the mount options.
# Example /etc/fstab:
# /dev/fioa /mnt/fioa ext3 defaults,noauto 0 0
# /dev/fiob /mnt/firehose ext3 defaults,noauto 0 0
# Example: MOUNTS="/mnt/fioa /mnt/firehose"
MOUNTS=""
After:
# An IFS separated list of md arrays to start once the driver is loaded.
# Arrays should be configured in the mdadm.conf file.
# Example: MD_ARRAYS="/dev/md0 /dev/md1"
MD_ARRAYS="/dev/md0"
# An IFS separated list of LVM volume groups to start once the driver is loaded.
# Volumes should be configured in lvm.conf.
# Example: LVM_VGS="/dev/vg0 /dev/vg1"
LVM_VGS="/dev/vg0"
# An IFS separated list of mount points to mount once the driver is loaded.
# These mount points should be listed in /etc/fstab with "noauto" as one of the mount options.
# Example /etc/fstab:
# /dev/fioa /mnt/fioa ext3 defaults,noauto 0 0
# /dev/fiob /mnt/firehose ext3 defaults,noauto 0 0
# Example: MOUNTS="/mnt/fioa /mnt/firehose"
MOUNTS="/iodrive_mountpoint /iodrive_mountpoint2"
Step 4: Verifying the Status of the Init Script
Make sure the iodrive script loads at run levels 1 through 5 (runlevel 0 is shutdown and runlevel 6 is reboot). Run the following commands:
$ chkconfig iodrive on
$ chkconfig --list iodrive
iodrive 0:off 1:on 2:on 3:on 4:on 5:on 6:off
LVM-ENABLE
http://kb.fusionio.com/KB/a36/enabling-the-iodrive-for-lvm-use.aspx
Enabling the ioDrive for LVM Use
The Logical Volume Manager (LVM) volume group management application handles mass storage devices like the ioDrive if you add the ioDrive as a supported type:
1. Locate and edit the /etc/lvm/lvm.conf configuration file.
2. Add an entry similar to the following to that file:
types = [ "fio", 16 ]
The parameter “16” represents the maximum number of partitions supported by the drive. For the ioDrive, this can be any number from 1
upwards, with 16 as the recommended setting.
Do NOT set this parameter to 0.
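A minimal sketch of creating the /dev/vg0/fio_lv volume referenced in the fstab examples above, once LVM recognizes the ioDrive (device, volume group and logical volume names are illustrative):
$ pvcreate /dev/fioa
$ vgcreate vg0 /dev/fioa
$ lvcreate -n fio_lv -l 100%FREE vg0
$ mkfs.ext3 /dev/vg0/fio_lv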