Sistemas Operativos - Moodle
Transcription
Sistemas Operativos - Moodle
4. Miscellaneous: network virtualization Protocols for Data Networks (aka Advanced Computer Networks) Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Lecture plan 1. B. Pfaff et al., “Design and implementation of Open vSwitch,” NSDI’15 and B. Pfaff et al., “Extending Networking into the Virtualization Layer,” HotNets’09 2. T. Koponen et al., “Network Virtualization in Multi-tenant Datacenters”, NSDI’14 2 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Lecture plan 1. B. Pfaff et al., “Design and implementation of Open vSwitch,” NSDI’15 and B. Pfaff et al., “Extending Networking into the Virtualization Layer,” HotNets’09 2. T. Koponen et al., “Network Virtualization in Multi-tenant Datacenters”, NSDI’14 3 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Context • Virtualization has changed the way we do computing – The goal of most data centers is to have all hosts virtualized – The number of virtual machines has already exceeded the number of servers • With the proliferation of virtualization, a new network layer is emerging – Within the hypervisor 4 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Motivation • Virtualization imposes new requirements – Network mobility, as VMs can migrate between hosts – Scaling limits with datacenters with hundreds of thousands of VMs – Isolation is required for joint-tenant environments • But also provides features that make networking easier – Virtualization layer can provide information about host arrivals and movements – Topology becomes more tractable • Networking is composed entirely of leaf nodes • However, this placement complicates scaling – It is very flexible, as it is in software 5 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Contribution • Typical model of internetworking in virtualized environments is L2 switching – Primary concern is providing basic network connectivity – Hard to address the challenges that exist in virtualized environments • The authors present the design and implementation of Open vSwitch (OvS), a capable virtual switch for virtualized environments – A software switch that resides within the hypervisor or management domain – Exports interface for fine grained control of the forwarding (via OpenFlow) and of configuration (via OVSDB: to configure queues, create/destroy switches, add/remove ports, etc.) – Open-source – Multi-platform 6 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Where is Open vSwitch Used? • Broad support: – Linux, FreeBSD, NetBSD, Windows, ESX – KVM, Xen, Docker, VirtualBox, Hyper-V, … – OpenStack, CloudStack, OpenNebula, … • Widely used: – – – – Most popular OpenStack networking backend Default network stack in XenServer 1,440 hits in Google Scholar Thousands of subscribers to OVS mailing lists source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 7 Challenge • Performance... – ...without the luxury of specialization • Design goals – flexibility and general-purpose use and – high-performance • “the primary function of a hypervisor: running user workloads” • The paper basically shows how OvS obtains high performance without sacrificing generality – Most of it details their design optimizations through flow caching and other types of caching to reduce CPU usage and increase forwarding rates 8 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 OVS architecture • Largest component is ovs-vswitchd, a userspace daemon – Essentially the same across operating systems • The data path kernel module is usually written specially for the host OS for performance – Remains unaware of OpenFlow, simplifying the module • This separation is invisible to the controller 9 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Use case: network virtualization • Implication for performance: – 100+ hash lookups per packet for tuple space search! – (Note: for packet classification tuple space search is used, with one hash table per type of match; the tuple is the key, with the set of fields for the type of match) based on: http://openvswitch.org/support/slides/nsdi2015-slides.pdf Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 10 Non-solutions • These helped: – Multithreading • Flow setup distributed to multiple threads/cores – Optimistic concurrent techniques • Such as userspace RCU (Read-Copy Update) – Batching packet processing • Increasing performance for flow setup – Classifier optimisations – Microoptimisations • But none really enough – “Classification is expensive on general-purpose CPUs!” based on: http://openvswitch.org/support/slides/nsdi2015-slides.pdf Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 11 OVS Cache v1: microflow cache source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 12 Speed up with microflow cache • From 100+ hash lookups per packet, to just 1! • In practice: – Tremendous speedup for most workloads – Problematic for traffic patterns with short-lived microflows • Fundamental caching problem: low hit rate source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 13 Solution: more expensive cache If kc << k0+k1+...+k24: benefit! source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 14 Naive approach to populating cache • Combine all tables into 1! • Result: – up to n1 × n2 × ∙∙∙ × n24 flows – “Crossproduct problem” source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 15 Lazy approach to populating cache • Solution: Build cache of combined “megaflows” lazily as packets arrive • Same (or better!) table lookups as naive approach. • Traffic locality yields practical cache size source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 16 OVS Cache v2: “Megaflow” Cache • Megaflows are more effective when they match fewer fields – Megaflows that match TCP ports are almost like microflows! • Contribution: megaflow generation improvements – Tuple priority sorting: when a match occurs, no need to search lower priorities – Staged lookup: 4 hash tables instead of 1, with matching in multiple stages (metadata only; metadata + L2; metadata + L2 + L3; metadata + L2 + L3 + L4) – Prefix tracking: optimization for IP prefixes using a trie data structure that allows matching of only the higher order bits source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 17 Megaflow vs. Microflow • Microflow cache: – k0 + k1 + ∙∙∙ + k24 lookups for first packet in microflow – 1 lookup for later packets in microflow • Megaflow cache: – kc lookups for (almost) every packet – kc > 1 is normal, so megaflows perform worse in common case! • Best of both worlds would be: – kc lookups for first packet in microflow – 1 lookup for later packets in microflow source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 18 OVS Cache v3: Dual Caches source: http://openvswitch.org/support/slides/nsdi2015-slides.pdf Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 19 Evaluation: in production • Cache size – 99th percentile = 7k flows – OvS limit = 200k entries • So it’s fine • Cache hit rate – 97.7% overall – Cache is effective • CPU usage – 80% hypervisors average 5% or less • Their traditional target – They individually examined the outliers and realised they had a previously unknown bug, which (they believe) was corrected in OvS 2.3 20 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Evaluation: microbenchmarks • For the tests they ran Netperf’s TCP_CRR test – It repeatedly establishes a TCP connection, sends and receives one byte of traffic, and disconnects – Results from a simple flow table design to illustrate benefits of optimizations, in transactions per second (tps) • Each optimization – reduces the number of kernel flows needed to run the test (representing trips kernel-userspace), so reduces userspace CPU usage – While it increases number of masks/tuples (which increase packet classification cost) – However, the tradeoff is overall positive 21 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Lecture plan 1. B. Pfaff et al., “Design and implementation of Open vSwitch,” NSDI’15 and B. Pfaff et al., “Extending Networking into the Virtualization Layer,” HotNets’09 2. T. Koponen et al., “Network Virtualization in Multi-tenant Datacenters”, NSDI’14 22 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Context and motivation • Server virtualization has become the dominant approach for managing computational infrastructures • What is lacking to achieve full virtualization? – Virtualizing the network • What network aspects are important to virtualize? – Network topology • Different workloads require different topologies • How has this problem been solve traditionally? – Simple, build multiple physical networks – Address space • Virtualized workloads operate in the same address space as the physical network • Problems? – Cannot move VMs to arbitrary locations – Cannot change addressing type (if physical is IPv4, VMs are IPv4) Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 23 Alternatives • Wait, but we’ve had network virtualization for ages! – VLANs • Virtualize L2 (Ethernet) networks 24 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Opening parentheses: VLAN 25 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Motivation • Problem 1: what if a CS user moves office to Chemistry, but wants connect to the CS switch? – Need to move all cabling… • Problem 2: one LAN = a single broadcast domain – all layer-2 broadcast traffic (ARP, DHCP, unknown location of destination MAC address) must cross entire LAN; no isolation – security/privacy issues, efficiency issues (hard to scale) – One possibility to solve this problem would be to replace center switch with router • Problem 3: inefficient use of switches – If you have many groups with a small number of users each, then you will have many ports unused Computer Science Physics Chemistry Fonte: [Kurose2009] 26 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 VLANs • port-based VLAN: switch ports grouped (by switch management software) so that single physical switch… • … operates as multiple virtual switches 1 7 9 15 2 8 10 16 … … Chemistry (VLAN ports 1-8) CS (VLAN ports 9-15) 1 7 9 15 2 8 10 16 … Chemistry (VLAN ports 1-8) … CS (VLAN ports 9-16) Fonte: [Kurose2009] 27 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Port-based VLAN • traffic isolation: frames to/from ports 1-8 can only reach ports 1-8 – can also define VLAN based on MAC addresses of endpoints, rather than switch port router 1 7 9 15 2 8 10 16 • dynamic membership – ports can be dynamically assigned among VLANs • How is forwarding done between VLANS? … Chemistry … CS – via routing (just as with separate switches) – in practice vendors sell combined switches plus routers Fonte: [Kurose2009] 28 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 VLANS spanning multiple switches • trunk port carries frames between VLANS defined over multiple physical switches • Frames forwarded within VLAN between switches can’t be vanilla 802.1 frames (must carry VLAN ID info) • 802.1q protocol adds/removed additional header fields for frames forwarded between trunk ports 1 7 9 15 1 3 5 7 2 8 10 16 2 4 6 8 … Electrical Engineering (VLAN ports 1-8) … Computer Science (VLAN ports 9-15) Ports 2,3,5 belong to EE VLAN Ports 4,6,7,8 belong to CS VLAN Fonte: [Kurose2009] 29 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 802.1Q VLAN frame format Type 802.1 frame 802.1Q frame 2-byte Tag Protocol Identifier (value: 81-00) Recomputed CRC Tag Control Information (12 bit VLAN ID field, 3 bit priority field like IP TOS) Fonte: [Kurose2009] 30 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Closing parentheses 31 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Alternatives • Wait, but we’ve had network virtualization for ages! – VLANs • Virtualize L2 (Ethernet) networks – NAT • Virtualize IP address space – MPLS • Virtualize physical paths 32 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Opening parentheses: MPLS 33 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Multiprotocol label switching (MPLS) • Initial goal: high-speed IP forwarding using fixed length label (instead of IP address) – fast lookup using fixed length identifier (rather than shortest prefix matching) – borrowing ideas from Virtual Circuit (VC) approach – but IP datagram still keeps IP address! PPP or Ethernet header MPLS header label 20 IP header remainder of link-layer frame Exp S TTL 3 1 5 Fonte: [Kurose2009] 34 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 MPLS capable routers • a.k.a. label-switched router • forward packets to outgoing interface based only on label value (don’t inspect IP address) – MPLS forwarding table distinct from IP forwarding tables • flexibility: MPLS forwarding decisions can differ from those of IP – e.g, use destination and source addresses to route flows to same destination differently (traffic engineering) – re-route flows quickly if link fails: pre-computed backup paths Fonte: [Kurose2009] 35 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 MPLS versus IP paths • IP routing: path to destination determined by destination address alone R6 D R4 R3 R5 A R2 IP router Fonte: [Kurose2009] 36 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 MPLS versus IP paths • MPLS routing: path to destination can be based on source and destination address entry router (R4) can use different MPLS routes to A based, e.g., on source address R6 D R4 R3 R5 A R2 IP-only router MPLS and IP router Fonte: [Kurose2009] 37 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 MPLS signaling • Need to modify OSPF, IS-IS link-state flooding protocols to carry info used by MPLS routing – e.g., link bandwidth, amount of “reserved” link bandwidth • Entry MPLS router uses RSVP-TE signaling protocol to set up MPLS forwarding at downstream routers RSVP-TE R6 D R4 R5 modified link state flooding A Fonte: [Kurose2009] 38 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 MPLS forwarding tables in label out label dest 10 12 8 out interface A D A 0 0 1 in label out label dest out interface 10 6 A 1 12 9 D 0 R6 0 0 D 1 1 R3 R4 R5 0 0 R2 in label 8 out label dest 6 A out interface in label 6 outR1 label dest - A A out interface 0 0 Fonte: [Kurose2009] 39 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Closing parentheses 40 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Alternatives • Wait, but we’ve had network virtualization for ages! – VLANs • Virtualize L2 (Ethernet) networks – NAT • Virtualize IP address space – MPLS • Virtualize physical paths • What is the problem with these solutions? – VLANs don’t scale – Point solutions, requiring box-by-box configuration – No global, unifying abstractions 41 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Contribution • NVP, a Network Virtualization Platform • A complete network virtualization solution – Allows the creation of virtual networks, each with independent… • Service models • Topologies • Addressing architectures – …over the same physical network 42 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Network hypervisor abstractions • Control abstraction – Tenants define logical datapaths that are configured with their control planes • Logical datapath = set of logical network elements – How are logical datapaths defined? • A packet forwarding pipeline (similar to forwarding ASICs) that contains a sequence of lookup tables • The pipeline results in a forwarding decision – How are logical datapaths implemented? • In the software virtual switches – Forwarding decisions are done solely on the end hosts! • Advantages over ASIC implementations? – More flexibility – Can match over arbitrary packet header fields 43 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Network hypervisor abstractions • Packet abstraction – Packets sent by endpoints are given the same treatment (switching, routing, filtering) as in the tenant’s home network 44 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Network hypervisor architecture • What happens when the logical datapaths reaches a forwarding decision? – The packet is tunneled over the physical network to the receiving host hypervisor • Using several encapsulation mechanisms, such as GRE or STT • Allowing the encapsulation of Ethernet frames inside IP packets, for example – Host hypervisor decapsulates the packet and sends it to destination VM – The physical network sees nothing but ordinary IP traffic 45 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Opening parentheses: tunneling 46 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Generic Routing Encapsulation • Tunneling – – Encapsulation with delivery header The addresses in the delivery header are the addresses of the head-end and the tail-end of the tunnel Delivery header 20.1.1.1/30.1.1.1 GRE 10.1.1.1/10.2.1.1 20.1.1.1 30.1.1.1 10.1.1.1/10.2.1.1 tunnel Private network site 10.1.0.0/16 10.1.1.1 Public Network Private network site 10.2.0.0/16 10.2.1.1 47 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Closing parentheses 48 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Discussion • What network entity configures the software switches? – An SDN controller • Tunnels work for point-to-point communication. How about multicast and broadcast? – A simple multicast overlay is used, adding physical forwarding elements for that purpose (service nodes) – Service nodes replicate the packets received • How are logical networks interconnected with physical networks? – A gateway is used for this purpose 49 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Design challenges • How to accelerate software switching? • How to compute all that forwarding state and disseminate it to the switches, avoiding inconsistencies? • How to scale the controller cluster? 50 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Logical datapath implementation • NVP uses Open vSwitch (OVS) to forward packets • The NVP controller cluster configures the OVS remotely using two protocols – OpenFlow to inspect and modify the flow tables – OVSDB to create and manage overlay tunnels and to discover which VMs are hosted at a hypervisor • How is the logical pipeline created? – NVP augments the logical flow table in OVS to include a match over the packet’s metadata for the logical table identifier – NVP modifies each action of a flow entry to write the ID of the next logical flow table and to resubmit the packet back to the OVS flow table • This creates the logical pipeline 51 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Forwarding performance • Traditional physical switches classify packets using TCAMs – How can we classify packets quickly with software switches, such as OVS? – What techniques are explored in NVP? • Flow caching – Exploits traffic locality • All packets belonging to same flow (say, one VM TCP connection) traverse exactly the same set of flow entries – The first packet of the flow is sent from the kernel module to userspace • But userspace program installs exact-match flows into the flow table in the kernel, so future packets don’t leave the kernel • Use of hardware offloading techniques – TCP segment offloading (TSO) allows the OS to send TCP packets larger than the physical MTU, and then the NIC takes care of the rest – Large Received Offload (LRO) does the opposite (again, work offloaded to the NIC) – Problem: current Ethernet NICs do not support offloading in the presence of IP encapsulation – Solution: use TSS as encapsulation method • Add fake TCP header, and then the NIC is capable of performing the standard offloading mechanisms 52 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Forwarding state computation • Forwarding state is computed based on vNICs location info and system configuration, and is pushed to transport nodes via OpenFlow • Computational model is entirely proactive – Is this different from the “traditional” SDN model? • Different, here controllers push all forwarding state down and do not process any packets – Is it good or bad? • Simplifies scaling of the controller cluster • Failure isolation – less problems if connectivity to the controller cluster is lost • Full computation after every change is computationally inefficient, so incremental computation necessary – Problem: very hard to code and to test – Solution: they implemented nlog, a domain-specific, declarative language that allows the separation of logic specification from its implementation 53 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Controller cluster • What techniques are used to scale computation? – Controllers are arranged in a two-layer hierarchy – Separation of concerns eases computation and allows more parallelization • What techniques are used to guarantee high availability? – There are hot-standbys at both layers 54 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Evaluation: cold start • Simulates bringing the entire system back online after major datacenter disaster – Takes around one hour… – Comments? 55 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Evaluation: tunnel performance • Why is GRE throughput so low? – It is incapable of using hardware offloading • STT, on the other hand, is capable of having a throughput equivalent to having no encapsulation 56 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Discussion • What were, in your opinion, the seeds of NVP’s success? • Make logical networks look exactly like current network configurations – despite current networks’ many flaws, they represent a large installed base, and can be used without modification • The purpose-built programming language (nlog) – easing development while assuring correctness • Leveraging the flexibility of software switching – Software enabling much faster innovation • SDN control centralization – Important to have a centralized global view 57 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015 Next lecture: fast networking • Mandatory – L. Rizzo, “netmap: A Novel Framework for Fast Packet I/O,” USENIX ATC’12 • [Optional] – S. Garzarella et al., “Virtual device passthrough for high speed VM networking,” IEEE/ACM ANCS 2015 Fernando M. V. Ramos, [email protected], PRD-RAC, 2014-2015