4. Miscellaneous:
network virtualization
Protocols for Data Networks
(aka Advanced Computer Networks)
Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015
Lecture plan
1. B. Pfaff et al., “Design and implementation of
Open vSwitch,” NSDI’15
and B. Pfaff et al., “Extending Networking into the
Virtualization Layer,” HotNets’09
2. T. Koponen et al., “Network Virtualization in
Multi-tenant Datacenters”, NSDI’14
Context
• Virtualization has changed the way we do
computing
– The goal of most data centers is to have all hosts
virtualized
– The number of virtual machines has already exceeded the number of physical servers
• With the proliferation of virtualization, a new
network layer is emerging
– Within the hypervisor
Motivation
• Virtualization imposes new requirements
– Network mobility, as VMs can migrate between hosts
– Scaling limits, with datacenters hosting hundreds of thousands of VMs
– Isolation is required for multi-tenant environments
• But it also provides features that make networking easier
– The virtualization layer can provide information about host arrivals and movements
– The topology becomes more tractable
• The network edge is composed entirely of leaf nodes
• However, this placement complicates scaling
– It is very flexible, as it is implemented in software
Contribution
• The typical model of internetworking in virtualized environments is L2 switching
– The primary concern is providing basic network connectivity
– Hard to address the challenges that exist in virtualized environments
• The authors present the design and implementation of Open vSwitch (OvS), a capable virtual switch for virtualized environments
– A software switch that resides within the hypervisor or management domain
– Exports interfaces for fine-grained control of forwarding (via OpenFlow) and of configuration (via OVSDB: to configure queues, create/destroy switches, add/remove ports, etc.)
– Open-source
– Multi-platform
Where is Open vSwitch Used?
• Broad support:
– Linux, FreeBSD, NetBSD, Windows, ESX
– KVM, Xen, Docker, VirtualBox, Hyper-V, …
– OpenStack, CloudStack, OpenNebula, …
• Widely used:
– Most popular OpenStack networking backend
– Default network stack in XenServer
– 1,440 hits in Google Scholar
– Thousands of subscribers to OVS mailing lists
source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf
Challenge
• Performance...
– ...without the luxury of specialization
• Design goals
– flexibility and general-purpose use, and
– high performance
• “the primary function of a hypervisor: running user workloads”
• The paper essentially shows how OvS obtains high performance without sacrificing generality
– Most of it details design optimizations based on flow caching and other forms of caching that reduce CPU usage and increase forwarding rates
OVS architecture
• The largest component is ovs-vswitchd, a userspace daemon
– Essentially the same across operating systems
• The datapath kernel module is usually written specially for the host OS, for performance
– It remains unaware of OpenFlow, which simplifies the module
• This separation is invisible to the controller
Use case: network virtualization
• Implication for performance:
– 100+ hash lookups per packet for tuple space search!
– (Note: packet classification uses tuple space search, with one hash table per type of match; the key is the tuple, i.e., the set of fields for that type of match)
based on: http://openvswitch.org/support/slides/nsdi2015-slides.pdf
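To make the cost concrete, here is a minimal sketch of tuple space search in Python (my illustration, not OVS code; all rule and field names are made up): each type of match gets its own hash table, so classifying a packet costs one hash lookup per table, and with 100+ tables that is 100+ lookups per packet.

```python
# Illustrative tuple space search: one hash table per "tuple"
# (the set of fields a class of rules matches on).

class TupleSpaceClassifier:
    def __init__(self):
        # tuple of field names -> {field values -> (priority, action)}
        self.tables = {}

    def add_rule(self, fields, values, priority, action):
        table = self.tables.setdefault(tuple(fields), {})
        table[tuple(values)] = (priority, action)

    def classify(self, packet):
        best = None
        for fields, table in self.tables.items():      # one lookup per tuple
            key = tuple(packet.get(f) for f in fields)  # extract the key
            hit = table.get(key)
            if hit is not None and (best is None or hit[0] > best[0]):
                best = hit
        return best[1] if best else "miss: send to userspace"

cls = TupleSpaceClassifier()
cls.add_rule(["ip_dst"], ["10.0.0.1"], priority=10, action="output:2")
cls.add_rule(["eth_dst", "tcp_dst"], ["aa:bb:cc:dd:ee:ff", 80],
             priority=20, action="output:3")
pkt = {"eth_dst": "aa:bb:cc:dd:ee:ff", "ip_dst": "10.0.0.1", "tcp_dst": 80}
print(cls.classify(pkt))  # highest-priority match wins: output:3
```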
Non-solutions
• These helped:
– Multithreading
• Flow setup distributed across multiple threads/cores
– Optimistic concurrency techniques
• Such as userspace RCU (Read-Copy-Update)
– Batching packet processing
• Increasing performance for flow setup
– Classifier optimisations
– Microoptimisations
• But none was really enough
– “Classification is expensive on general-purpose CPUs!”
based on: http://openvswitch.org/support/slides/nsdi2015-slides.pdf
OVS Cache v1: microflow cache
source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf
Speed up with microflow cache
• From 100+ hash lookups per packet to just 1!
• In practice:
– Tremendous speedup for most workloads
– Problematic for traffic patterns with many short-lived microflows
• The fundamental caching problem: low hit rate
source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf
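The idea, sketched below under the same illustrative assumptions as the classifier above: cache the full header of each microflow as an exact-match key, so only the first packet pays for full classification. The sketch also shows why short-lived flows hurt: every new connection is a new key.

```python
# Minimal microflow (exact-match) cache sketch; not OVS's kernel code.

class MicroflowCache:
    def __init__(self, classify):
        self.classify = classify   # full (expensive) classification function
        self.cache = {}            # exact header tuple -> action

    def forward(self, packet):
        key = tuple(sorted(packet.items()))   # full-header exact match
        if key not in self.cache:             # miss: 100+ lookups, once
            self.cache[key] = self.classify(packet)
        return self.cache[key]                # hit: a single hash lookup

# Short-lived flows defeat the cache: each new connection is a new key,
# so every first packet takes the slow path (the low-hit-rate problem).
mc = MicroflowCache(lambda pkt: "output:2")
print(mc.forward({"ip_dst": "10.0.0.1", "tcp_src": 1234}))
```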
Solution: more expensive cache
If kc << k0 + k1 + ... + k24: benefit! (kc = lookups in the new cache; ki = lookups in original table i)
source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf
Naive approach to populating cache
• Combine all tables into one!
• Result:
– up to n1 × n2 × ∙∙∙ × n24 flows
– the “crossproduct problem” (e.g., 24 tables of just 10 rules each could yield up to 10^24 cached entries)
source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf
Lazy approach to populating cache
• Solution: build the cache of combined “megaflows” lazily, as packets arrive
• Same (or better!) number of table lookups as the naive approach
• Traffic locality yields a practical cache size
source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf
OVS Cache v2: “Megaflow” Cache
• Megaflows are more effective when they match fewer fields
– Megaflows that match TCP ports are almost like microflows!
• Contribution: megaflow generation improvements
– Tuple priority sorting: when a match occurs, there is no need to search lower priorities
– Staged lookup: 4 hash tables instead of 1, with matching in stages (metadata only; metadata + L2; metadata + L2 + L3; metadata + L2 + L3 + L4)
– Prefix tracking: an optimization for IP prefixes, using a trie that allows matching only the necessary higher-order bits
source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf
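A rough sketch of the staged-lookup idea (the field groupings below are my assumption, not OVS's exact sets): each tuple is searched in stages of increasing specificity, and a miss at an early stage means the generated megaflow only needs to match the fields examined so far, keeping it wide.

```python
# Illustrative staged lookup for one tuple: four nested stages.
STAGES = [
    ("metadata",),                                 # stage 1
    ("metadata", "eth_dst"),                       # stage 2: + L2
    ("metadata", "eth_dst", "ip_dst"),             # stage 3: + L3
    ("metadata", "eth_dst", "ip_dst", "tcp_dst"),  # stage 4: + L4
]

def staged_lookup(stage_tables, packet):
    """Returns (rule_or_None, fields the generated megaflow must match)."""
    for fields, table in zip(STAGES, stage_tables):
        hit = table.get(tuple(packet.get(f) for f in fields))
        if hit is None:
            # Early miss: the megaflow need only match these fields,
            # so it stays wider and caches more traffic.
            return None, fields
    return hit, STAGES[-1]

stage_tables = [
    {("md0",): True},
    {("md0", "aa:bb"): True},
    {("md0", "aa:bb", "10.0.0.1"): True},
    {("md0", "aa:bb", "10.0.0.1", 80): "output:2"},  # full match holds the rule
]
pkt = {"metadata": "md0", "eth_dst": "aa:bb", "ip_dst": "10.0.0.1", "tcp_dst": 80}
print(staged_lookup(stage_tables, pkt))                          # full hit
print(staged_lookup(stage_tables, {**pkt, "ip_dst": "10.9.9.9"}))  # miss at stage 3
```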
Megaflow vs. Microflow
• Microflow cache:
– k0 + k1 + ∙∙∙ + k24 lookups for the first packet in a microflow
– 1 lookup for later packets in the microflow
• Megaflow cache:
– kc lookups for (almost) every packet
– kc > 1 is normal, so megaflows perform worse in the common case!
• The best of both worlds would be:
– kc lookups for the first packet in a microflow
– 1 lookup for later packets in the microflow
source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf
OVS Cache v3: Dual Caches
source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf
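Under the same illustrative assumptions as the sketches above, the dual-cache composition looks roughly like this: an exact-match microflow cache is consulted first, then the megaflow cache on a miss, and only then the full userspace pipeline.

```python
# Illustrative dual-cache lookup path (not OVS's actual data structures).
class DualCache:
    def __init__(self, megaflow_lookup, slow_path):
        self.microflows = {}                     # exact header -> action
        self.megaflow_lookup = megaflow_lookup   # kc hash lookups
        self.slow_path = slow_path               # full userspace pipeline

    def forward(self, packet):
        key = tuple(sorted(packet.items()))
        action = self.microflows.get(key)        # 1 lookup (common case)
        if action is None:
            action = self.megaflow_lookup(packet)   # kc lookups
            if action is None:
                action = self.slow_path(packet)  # rare: kernel -> userspace
            self.microflows[key] = action        # later packets: 1 lookup
        return action

dc = DualCache(megaflow_lookup=lambda p: None, slow_path=lambda p: "output:2")
print(dc.forward({"ip_dst": "10.0.0.1"}))   # takes the slow path once...
print(dc.forward({"ip_dst": "10.0.0.1"}))   # ...then a single exact-match hit
```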
Evaluation: in production
• Cache size
– 99th percentile = 7k flows
– OvS limit = 200k entries
• So it's fine
• Cache hit rate
– 97.7% overall
– The cache is effective
• CPU usage
– 80% of hypervisors average 5% or less
• Their traditional target
– They individually examined the outliers and realised there was a previously unknown bug, which (they believe) was corrected in OvS 2.3
Evaluation: microbenchmarks
• For the tests they ran Netperf's TCP_CRR test
– It repeatedly establishes a TCP connection, sends and receives one byte of traffic, and disconnects
– Results use a simple flow table design to illustrate the benefits of the optimizations, measured in transactions per second (tps)
• Each optimization
– reduces the number of kernel flows needed to run the test (each representing a kernel-to-userspace trip), and so reduces userspace CPU usage
– while increasing the number of masks/tuples (which increases packet classification cost)
– however, the tradeoff is overall positive
Lecture plan
1. B. Pfaff et al., “Design and implementation of
Open vSwitch,” NSDI’15
and B. Pfaff et al., “Extending Networking into the
Virtualization Layer,” HotNets’09
2. T. Koponen et al., “Network Virtualization in
Multi-tenant Datacenters”, NSDI’14
Context and motivation
• Server virtualization has become the dominant approach for managing computational infrastructures
• What is lacking to achieve full virtualization?
– Virtualizing the network
• What network aspects are important to virtualize?
– Network topology
• Different workloads require different topologies
• How has this problem been solved traditionally?
– Simple: build multiple physical networks
– Address space
• Virtualized workloads operate in the same address space as the physical network
• Problems?
– Cannot move VMs to arbitrary locations
– Cannot change the addressing type (if the physical network is IPv4, the VMs are IPv4)
Alternatives
• Wait, but we’ve had network virtualization for
ages!
– VLANs
• Virtualize L2 (Ethernet) networks
Opening parentheses: VLAN
Motivation
• Problem 1: what if a CS user moves office to Chemistry, but wants to connect to the CS switch?
– Need to move all the cabling…
• Problem 2: one LAN = a single broadcast domain
– all layer-2 broadcast traffic (ARP, DHCP, frames with an unknown destination MAC address) must cross the entire LAN; no isolation
– security/privacy issues, efficiency issues (hard to scale)
– one possibility to solve this would be to replace the center switch with a router
• Problem 3: inefficient use of switches
– if you have many groups with a small number of users each, then you will have many unused ports
[Figure: three departmental LANs (Computer Science, Physics, Chemistry) interconnected by a central switch. Source: [Kurose2009]]
VLANs
• port-based VLAN: switch ports are grouped (by switch management software) so that a single physical switch…
• …operates as multiple virtual switches
[Figure: a single 16-port switch partitioned into two port-based VLANs, Chemistry (VLAN ports 1-8) and CS (VLAN ports 9-16), operating as two virtual switches. Source: [Kurose2009]]
Port-based VLAN
• traffic isolation: frames to/from ports 1-8 can only reach ports 1-8
– VLANs can also be defined based on the MAC addresses of endpoints, rather than switch ports
• dynamic membership
– ports can be dynamically assigned among VLANs
• How is forwarding done between VLANs?
– via routing (just as with separate switches)
– in practice, vendors sell combined switches plus routers
[Figure: the Chemistry and CS VLANs on one switch, with a router attached to forward traffic between them. Source: [Kurose2009]]
VLANs spanning multiple switches
• trunk port: carries frames between VLANs defined over multiple physical switches
• frames forwarded within a VLAN between switches can't be vanilla 802.1 frames (they must carry VLAN ID info)
• the 802.1Q protocol adds/removes additional header fields for frames forwarded between trunk ports
[Figure: two switches connected via trunk ports; Electrical Engineering (VLAN ports 1-8) and Computer Science (VLAN ports 9-15) span both switches. On the second switch, ports 2, 3, 5 belong to the EE VLAN and ports 4, 6, 7, 8 to the CS VLAN. Source: [Kurose2009]]
802.1Q VLAN frame format
• An 802.1Q frame is an 802.1 frame with a 4-byte tag inserted between the source address and type fields (the CRC is recomputed):
– 2-byte Tag Protocol Identifier (value: 0x8100)
– 2-byte Tag Control Information (12-bit VLAN ID field, 3-bit priority field, like IP TOS)
Source: [Kurose2009]
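The tag layout is easy to see in code. Here is a small sketch of mine (not from the slides) that inserts an 802.1Q tag into a raw Ethernet frame, following the format described above:

```python
# Insert a 4-byte 802.1Q tag after the 12 address bytes
# (6-byte dst MAC + 6-byte src MAC). CRC recomputation is left to the NIC.
import struct

TPID = 0x8100  # Tag Protocol Identifier

def add_vlan_tag(frame: bytes, vlan_id: int, priority: int = 0) -> bytes:
    assert 0 <= vlan_id < 4096 and 0 <= priority < 8
    tci = (priority << 13) | vlan_id      # 3-bit priority, DEI=0, 12-bit VID
    tag = struct.pack("!HH", TPID, tci)
    return frame[:12] + tag + frame[12:]

# Example: tag a minimal frame (dst MAC, src MAC, EtherType 0x0800) for VLAN 42.
frame = bytes.fromhex("ffffffffffff" "aabbccddeeff" "0800") + b"payload"
print(add_vlan_tag(frame, vlan_id=42).hex())
```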
Closing parentheses
Alternatives
• Wait, but we’ve had network virtualization for
ages!
– VLANs
• Virtualize L2 (Ethernet) networks
– NAT
• Virtualize IP address space
– MPLS
• Virtualize physical paths
Opening parentheses: MPLS
Multiprotocol label switching (MPLS)
• Initial goal: high-speed IP forwarding using a fixed-length label (instead of the IP address)
– fast lookup using a fixed-length identifier (rather than longest prefix matching)
– borrowing ideas from the Virtual Circuit (VC) approach
– but the datagram still keeps its IP address!
[Figure: the MPLS header sits between the link-layer (PPP or Ethernet) header and the IP header, followed by the remainder of the link-layer frame. Fields: label (20 bits), Exp (3 bits), S (1 bit), TTL (8 bits). Source: [Kurose2009]]
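As a quick sketch (mine, not from the slides), packing the 32-bit MPLS header per RFC 3032 makes the field widths concrete:

```python
# MPLS header: label (20) | Exp (3) | S (1) | TTL (8) = 32 bits.
import struct

def mpls_header(label: int, exp: int = 0, s: int = 1, ttl: int = 64) -> bytes:
    assert 0 <= label < 2**20 and 0 <= exp < 8 and s in (0, 1) and 0 <= ttl < 256
    word = (label << 12) | (exp << 9) | (s << 8) | ttl
    return struct.pack("!I", word)

print(mpls_header(label=6).hex())  # '00006140': label 6, S=1, TTL=64
```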
MPLS capable routers
• a.k.a. label-switched routers
• forward packets to an outgoing interface based only on the label value (they don't inspect the IP address)
– the MPLS forwarding table is distinct from the IP forwarding tables
• flexibility: MPLS forwarding decisions can differ from those of IP
– e.g., use destination and source addresses to route flows to the same destination differently (traffic engineering)
– re-route flows quickly if a link fails, using pre-computed backup paths
Source: [Kurose2009]
MPLS versus IP paths
• IP routing: path to destination determined by
destination address alone
[Figure: example topology (IP routers R2-R6, hosts A and D); with destination-based IP routing there is a single path toward A, regardless of source. Source: [Kurose2009]]
MPLS versus IP paths
• MPLS routing: the path to a destination can be based on both source and destination address
– the entry router (R4) can use different MPLS routes to A based, e.g., on the source address
[Figure: the same topology, now with a mix of IP-only routers and MPLS-and-IP routers; R4 steers traffic to A over different label-switched paths. Source: [Kurose2009]]
MPLS signaling
• Need to modify the OSPF and IS-IS link-state flooding protocols to carry info used by MPLS routing
– e.g., link bandwidth, amount of “reserved” link bandwidth
• The entry MPLS router uses the RSVP-TE signaling protocol to set up MPLS forwarding at downstream routers
[Figure: entry router R4 signals with RSVP-TE toward downstream routers (R6, R5) while modified link-state flooding distributes the extra link info; hosts A and D as before. Source: [Kurose2009]]
MPLS forwarding tables
[Figure: per-router MPLS forwarding tables over the R1-R6 topology, each with columns (in label, out label, dest, out interface). For example, one router swaps in-label 8 for out-label 6 toward A on interface 0, and R1, which attaches host A, holds (in label 6, out label -, dest A, out interface 0); the "-" indicates the label is popped at the last hop. Source: [Kurose2009]]
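A toy label-switched forwarding loop makes the figure's table semantics concrete. This is a sketch of mine; the table entries echo the figure's style but are illustrative:

```python
# (in_label) -> (out_label, dest, out_interface); out_label None = pop.
R1_TABLE = {6: (None, "A", 0)}    # last hop toward A: pop the label
R5_TABLE = {8: (6, "A", 0)}       # swap label 8 -> 6, forward on interface 0

def forward(table, in_label, payload):
    out_label, dest, iface = table[in_label]   # one exact lookup, no LPM
    if out_label is None:
        return iface, payload                  # pop: deliver the plain IP datagram
    return iface, (out_label, payload)         # swap label and forward

print(forward(R5_TABLE, 8, "IP datagram to A"))  # (0, (6, 'IP datagram to A'))
print(forward(R1_TABLE, 6, "IP datagram to A"))  # (0, 'IP datagram to A')
```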
Closing parentheses
Alternatives
• Wait, but we’ve had network virtualization for
ages!
– VLANs
• Virtualize L2 (Ethernet) networks
– NAT
• Virtualize IP address space
– MPLS
• Virtualize physical paths
• What is the problem with these solutions?
– VLANs don’t scale
– Point solutions, requiring box-by-box configuration
– No global, unifying abstractions
Contribution
• NVP, a Network Virtualization Platform
• A complete network virtualization solution
– Allows the creation of virtual networks, each with
independent…
• Service models
• Topologies
• Addressing architectures
– …over the same physical network
Network hypervisor abstractions
• Control abstraction
– Tenants define logical datapaths, which are configured by their control planes
• Logical datapath = a set of logical network elements
– How are logical datapaths defined?
• As a packet forwarding pipeline (similar to forwarding ASICs) that contains a sequence of lookup tables
• The pipeline results in a forwarding decision
– How are logical datapaths implemented?
• In the software virtual switches
– Forwarding decisions are made solely at the end hosts!
• Advantages over ASIC implementations?
– More flexibility
– Can match over arbitrary packet header fields
Network hypervisor abstractions
• Packet abstraction
– Packets sent by endpoints are given the same
treatment (switching, routing, filtering) as in the
tenant’s home network
Network hypervisor architecture
• What happens when a logical datapath reaches a forwarding decision?
– The packet is tunneled over the physical network to the receiving host's hypervisor
• Using one of several encapsulation mechanisms, such as GRE or STT
• Allowing, for example, the encapsulation of Ethernet frames inside IP packets
– The host hypervisor decapsulates the packet and sends it to the destination VM
– The physical network sees nothing but ordinary IP traffic
Opening parentheses: tunneling
Generic Routing Encapsulation
• Tunneling
– Encapsulation with a delivery header
– The addresses in the delivery header are the addresses of the head-end and the tail-end of the tunnel
[Figure: two private network sites (10.1.0.0/16 and 10.2.0.0/16) connected by a GRE tunnel across a public network. The inner packet (10.1.1.1 → 10.2.1.1) is wrapped with a GRE header and a delivery header addressed from 20.1.1.1 (head-end) to 30.1.1.1 (tail-end).]
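The encapsulation step can be sketched in a few lines of Python (my illustration, reusing the figure's example addresses; a real stack would also compute the outer IP checksum, which is left at zero here):

```python
# GRE encapsulation: inner IP packet -> GRE header -> outer "delivery" IP header.
import struct

def gre_encapsulate(inner_packet: bytes) -> bytes:
    # Minimal GRE header (RFC 2784): flags/version = 0, protocol 0x0800 = IPv4.
    gre = struct.pack("!HH", 0, 0x0800)
    total_len = 20 + len(gre) + len(inner_packet)
    outer = struct.pack("!BBHHHBBH4s4s",
                        0x45, 0, total_len,    # version/IHL, TOS, total length
                        0, 0,                  # id, flags/fragment offset
                        64, 47, 0,             # TTL, protocol 47 = GRE, checksum 0
                        bytes([20, 1, 1, 1]),  # head-end 20.1.1.1
                        bytes([30, 1, 1, 1]))  # tail-end 30.1.1.1
    return outer + gre + inner_packet

print(gre_encapsulate(b"inner IP packet bytes").hex())
```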
Closing parentheses
Discussion
• What network entity configures the software switches?
– An SDN controller
• Tunnels work for point-to-point communication. What about multicast and broadcast?
– A simple multicast overlay is used, adding physical forwarding elements (service nodes) for that purpose
– Service nodes replicate the packets they receive
• How are logical networks interconnected with physical networks?
– A gateway is used for this purpose
Design challenges
• How to accelerate software switching?
• How to compute all that forwarding state and
disseminate it to the switches, avoiding
inconsistencies?
• How to scale the controller cluster?
Logical datapath implementation
• NVP uses Open vSwitch (OVS) to forward packets
• The NVP controller cluster configures OVS remotely using two protocols
– OpenFlow, to inspect and modify the flow tables
– OVSDB, to create and manage overlay tunnels and to discover which VMs are hosted at a hypervisor
• How is the logical pipeline created?
– NVP augments each logical flow entry in OVS with a match over the packet's metadata for the logical table identifier
– NVP extends each flow entry's actions to write the ID of the next logical flow table into the metadata and to resubmit the packet back to the OVS flow table
• This creates the logical pipeline (sketched below)
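The resubmit trick can be modeled in a few lines. This is an illustrative model of mine, not NVP's actual flow rules: every entry also "matches" a metadata register holding the logical table ID, and its action rewrites that register and resubmits.

```python
# (logical_table_id, (field, value)) -> (next_table_id or None, action)
FLOWS = {
    (0, ("in_port", 1)):         (1, "set_logical_port"),
    (1, ("eth_dst", "aa:bb")):   (2, "l2_forward"),
    (2, ("ip_dst", "10.2.1.1")): (None, "tunnel_to_remote_hypervisor"),
}

def run_pipeline(packet):
    table, actions = 0, []          # metadata register starts at table 0
    while table is not None:        # the "resubmit" loop
        for (tid, (field, value)), (nxt, act) in FLOWS.items():
            if tid == table and packet.get(field) == value:
                actions.append(act)
                table = nxt         # write next table ID into metadata
                break
        else:
            return actions + ["drop"]   # no match in this logical table
    return actions

pkt = {"in_port": 1, "eth_dst": "aa:bb", "ip_dst": "10.2.1.1"}
print(run_pipeline(pkt))
# ['set_logical_port', 'l2_forward', 'tunnel_to_remote_hypervisor']
```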
Forwarding performance
• Traditional physical switches classify packets using TCAMs
– How can we classify packets quickly with software switches, such as OVS?
– What techniques are explored in NVP?
• Flow caching
– Exploits traffic locality
• All packets belonging to the same flow (say, one VM TCP connection) traverse exactly the same set of flow entries
– The first packet of a flow is sent from the kernel module to userspace
• The userspace program then installs exact-match flows into the kernel flow table, so future packets don't leave the kernel
• Use of hardware offloading techniques
– TCP segmentation offload (TSO) allows the OS to send TCP packets larger than the physical MTU; the NIC takes care of the rest
– Large Receive Offload (LRO) does the opposite (again, work offloaded to the NIC)
– Problem: current Ethernet NICs do not support offloading in the presence of IP encapsulation
– Solution: use STT as the encapsulation method
• STT adds a fake TCP header, so the NIC can perform the standard offloading mechanisms
Forwarding state computation
• Forwarding state is computed based on vNIC location info and system configuration, and is pushed to transport nodes via OpenFlow
• The computational model is entirely proactive
– Is this different from the “traditional” SDN model?
• Yes: here the controllers push all forwarding state down and do not process any packets
– Is that good or bad?
• It simplifies scaling of the controller cluster
• Failure isolation: fewer problems if connectivity to the controller cluster is lost
• Full recomputation after every change is computationally inefficient, so incremental computation is necessary
– Problem: very hard to code and to test
– Solution: they implemented nlog, a domain-specific declarative language that separates the logic specification from its implementation
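To give the flavor of the declarative approach (this is my loose illustration in Python, not nlog's actual syntax; all table and predicate names are invented): forwarding state is declared as a join over input tables, so a runtime can recompute only the outputs affected when an input tuple changes.

```python
# Datalog-style rule written as a set comprehension. In a declarative
# language, the rule alone is the program; incremental evaluation is the
# runtime's job, not the programmer's.

logical_port_host = {("lp1", "hostA"), ("lp2", "hostB")}  # vNIC locations
same_logical_net  = {("lp1", "lp2"), ("lp2", "lp1")}      # configuration

def tunnels():
    # tunnel(H1, H2) :- port_host(P1, H1), port_host(P2, H2),
    #                   same_logical_net(P1, P2), H1 != H2.
    return {(h1, h2)
            for (p1, h1) in logical_port_host
            for (p2, h2) in logical_port_host
            if (p1, p2) in same_logical_net and h1 != h2}

print(tunnels())  # {('hostA', 'hostB'), ('hostB', 'hostA')}
# When a VM migrates, only tuples joining with its logical port change,
# which is what makes incremental recomputation tractable.
```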
Controller cluster
• What techniques are used to scale computation?
– Controllers are arranged in a two-layer hierarchy
– This separation of concerns eases computation and allows more parallelization
• What techniques are used to guarantee high availability?
– There are hot-standbys at both layers
Evaluation: cold start
• Simulates bringing the entire system back online after a major datacenter disaster
– It takes around one hour…
– Comments?
Evaluation: tunnel performance
• Why is GRE throughput so low?
– It is incapable of using hardware offloading
• STT, on the other hand, achieves throughput equivalent to having no encapsulation at all
Discussion
• What were, in your opinion, the seeds of NVP's success?
• Making logical networks look exactly like current network configurations
– despite current networks' many flaws, they represent a large installed base, and can be used without modification
• The purpose-built programming language (nlog)
– easing development while assuring correctness
• Leveraging the flexibility of software switching
– software enables much faster innovation
• SDN control centralization
– important to have a centralized global view
Next lecture: fast networking
• Mandatory
– L. Rizzo, “netmap: A Novel Framework for Fast Packet
I/O,” USENIX ATC’12
• [Optional]
– S. Garzarella et al., “Virtual device passthrough for
high speed VM networking,” IEEE/ACM ANCS 2015