Transcription
1 Abstract

Energy-efficient computing is a hot topic in times of rising energy costs and environmental awareness. As the demand for computational resources grows unabated, so does the total amount of energy spent to store, aggregate, transform, and enrich the world's digital assets. Researchers and companies alike strive for novel ways to reduce the energy consumption of today's warehouse-sized data centers. The talk starts with an overview of the topic before showcasing possible solutions. We will touch on all aspects, from rethinking the entire data center, tailored infrastructure, workload analysis and migration, to server design. In the second part of the talk I will cover my own contributions so far. The talk will conclude with an outlook on what I would still like to accomplish as part of my PhD.

2 Motivation

Who cares about IT's energy consumption?
• Environmental Protection Agency1
• Greenpeace2
• Greenpeace cares a lot3
• Jevons paradox – resource consumption increases due to reduced cost
• The New York Times4: 6-12% average utilization
• SFB 912: Highly Adaptive Energy-Efficient Computing (HAEC)

The American Environmental Protection Agency (EPA) published a report titled "Server and Data Center Energy Efficiency" in 2007. In this report the EPA estimated that data center energy consumption doubled between 2000 and 2006 and projected another two-fold increase between 2007 and 2011. To highlight just one other number: the report estimates the US federal government's electricity cost at $740 million in 2011.

Greenpeace also published two reports: "How Dirty is Your Data?" in 2011 and "How Clean is Your Cloud?" in 2012. The reports call for greater efficiency as well as transparency. According to Greenpeace, however, green computing does not stop at attaining greater efficiency. It must be accompanied by a move away from "dirty" energy sources, e.g., fossil-based and nuclear, to renewables.

A New York Times article from 2012 also called attention to the pollution caused by inefficient data centers, though reactions to the article have been mixed. Some of the claimed inefficiencies are already being addressed in modern data centers. For example, Google built data centers which use only natural cooling, reducing the energy required to maintain a safe operating temperature range.

1 Report to Congress on Server and Data Center Energy Efficiency. U.S. Environmental Protection Agency, ENERGY STAR Program, 2007.
2 Gary Cook and Jodie Van Horn. How Dirty is Your Data? url: http://www.greenpeace.org/international/en/publications/reports/How-dirty-is-your-data/.
3 Greenpeace. How Clean is Your Cloud? url: http://www.greenpeace.org/international/en/publications/Campaign-reports/Climate-Reports/How-Clean-is-Your-Cloud/.
4 James Glanz. The Cloud Factories – Power, Pollution and the Internet. url: http://www.nytimes.com/2012/09/23/technology/data-centers-waste-vast-amounts-of-energy-belying-industry-image.html/.
Also, many data centers are run by corporations whose main goal is to maximize profit. As long as the inefficiency is bounded and does not hurt the corporation's bottom line, businesses have no real incentive to address it.

Cloud computing: 5th largest "country" in terms of energy consumption5

Figure 1: IT's energy consumption compared to national energy consumption (2007 electricity consumption in billion kWh; "cloud computing" at 662 billion kWh ranks 5th behind the US, China, Russia, and Japan).

To illustrate the problem, Greenpeace included a graph in their report which compares the energy spent on "cloud computing" to national energy budgets. If "cloud computing" were a country, it would have ranked 5th (in 2007) among the world's nations in terms of kWh consumed.

In their book "The Datacenter as a Computer"6 Barroso and Hölzle present a breakdown of data center operating costs. They present two case studies to demonstrate how operating costs can vary based on energy price and choice of server. Case A uses expensive, professional-grade server hardware consuming less peak power (300 W), and electricity is cheap at around 6 c/kWh. The server-related costs, including server amortization, server interest, and server opex, accumulate to 70% of the monthly costs. Case study B assumes commodity-grade compute hardware which is cheaper to acquire but draws more power (peak power is 500 W). In this scenario, server-related costs drop to 29% (23% amortization, 5% interest, 1% opex). On the other hand, due to higher electricity prices and higher-power servers, energy-related costs rise to 22% from previously 6%. The trend of rising energy-related data center costs will continue as energy costs keep rising.

5 Gary Cook and Jodie Van Horn. How Dirty is Your Data? url: http://www.greenpeace.org/international/en/publications/reports/How-dirty-is-your-data/.
6 L.A. Barroso and U. Hölzle. "The datacenter as a computer: An introduction to the design of warehouse-scale machines". In: Synthesis Lectures on Computer Architecture 4.1 (2009).

Figure 2: Power consumption and energy-efficiency for a typical server across the utilization spectrum.

Servers are not power-proportional7

Looking at the individual consumers inside the data center, about 59% of the energy is consumed by the actual servers8. Two factors aggravate the energy inefficiency inside the data center. First, servers have a high baseline consumption at low utilization levels: a typical server consumes about 30% of its peak power at 0% utilization. Second, the average server utilization is low, between 10-50%9. Because energy-proportional hardware will not be available in the near future, researchers concentrate on how to increase average utilization levels.
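To make the baseline problem concrete, here is a minimal sketch assuming a simple linear power model: a server draws 30% of its peak power when idle and scales linearly up to peak power at full utilization. The peak power value and the sampled utilization levels are illustrative assumptions only.

PEAK_POWER_W = 300.0    # assumed peak power draw of one server
IDLE_FRACTION = 0.30    # roughly 30% of peak power at 0% utilization (see text)

def power_draw(utilization):
    # Linear model: fixed idle baseline plus a utilization-proportional part.
    return PEAK_POWER_W * (IDLE_FRACTION + (1.0 - IDLE_FRACTION) * utilization)

def relative_efficiency(utilization):
    # Useful work per watt, normalized to the efficiency at 100% utilization.
    if utilization == 0.0:
        return 0.0
    return (utilization / power_draw(utilization)) / (1.0 / PEAK_POWER_W)

for u in (0.1, 0.3, 0.5, 1.0):
    print(f"utilization {u:4.0%}: {power_draw(u):5.1f} W, "
          f"{relative_efficiency(u):3.0%} of peak efficiency")

Under these assumptions a server at 10% utilization still draws 111 W but delivers only about 27% of its peak work-per-watt, which is why low average utilization is so costly.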
Overview
• Tailored Infrastructure
• Dynamic Scaling
• Workload Analysis
• Server Design
• Data Center Design
• Contribution

7 L.A. Barroso and U. Hölzle. "The datacenter as a computer: An introduction to the design of warehouse-scale machines". In: Synthesis Lectures on Computer Architecture 4.1 (2009).
8 J. Hamilton. "Cooperative Expendable Micro-Slice Servers (CEMS): Low Cost, Low Power Servers for Internet-Scale Services". In: Conference on Innovative Data Systems Research. Citeseer. 2009.
9 L.A. Barroso and U. Hölzle. "The datacenter as a computer: An introduction to the design of warehouse-scale machines". In: Synthesis Lectures on Computer Architecture 4.1 (2009).

Figure 3: Turducken: the system (left) and the real thing (right).

3 Tailored Infrastructure

Hardware heterogeneity bridges the gap between low and high utilization levels
• combine hardware with different power/performance characteristics
• examples: Turducken10, eBond, FAWN11, Somniloquy12

If a single piece of hardware does not offer enough flexibility with respect to energy/performance trade-offs, combining multiple components is one solution. One system which exploited the concept of heterogeneous hardware is Turducken [21]. Turducken combined a laptop, a Palm handheld, and a mote (sensor node) to extend the laptop's overall battery runtime.

The Fast Array of Wimpy Nodes (FAWN) [2] co-designs hardware and software to optimize for operations per joule. It works around the memory wall by using low-power, (relatively) low-performance CPUs. The utilized components, low-power Atom boards and flash memory, are a better fit for the I/O-intensive workload: a key-value store is I/O-bound, so it makes sense to use low-power processors. It is a reasonable assumption that by matching the hardware components to the workload, more work can be done per unit of energy. The resulting system is, however, constrained to the targeted workload. In a modern data center multiple workloads with different characteristics, e.g., I/O-intensive vs. CPU-intensive, run in parallel. The infrastructure must be flexible enough to handle workload changes. As such, a general-purpose infrastructure may still be the first choice. Developing software designed to run on special-purpose hardware costs money too. While a higher ops/joule metric is a worthy goal, it may make little sense business-wise.
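The ops/joule argument can be illustrated with a toy calculation; the throughput and power numbers below are invented for illustration and are not taken from the FAWN paper. For an I/O-bound key-value workload both node types are capped by the same storage device, so the low-power node completes almost the same number of operations at a fraction of the power.

def ops_per_joule(requests_per_second, power_w):
    # FAWN's figure of merit: completed operations per joule of energy.
    return requests_per_second / power_w

IO_LIMIT = 20_000   # requests/s the flash storage can sustain (assumed)

nodes = {
    "brawny": {"cpu_limit": 200_000, "power_w": 250.0},  # fast CPU, high power
    "wimpy":  {"cpu_limit":  30_000, "power_w":  20.0},  # slow CPU, low power
}

for name, node in nodes.items():
    throughput = min(node["cpu_limit"], IO_LIMIT)   # the bottleneck decides
    print(f"{name}: {ops_per_joule(throughput, node['power_w']):.0f} ops/joule")

The comparison flips for CPU-bound workloads, which is exactly the caveat above: such systems are constrained to the targeted workload.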
10 J. Sorber et al. "Turducken: hierarchical power management for mobile devices". In: Proceedings of the 3rd International Conference on Mobile Systems, Applications, and Services. ACM. 2005, pp. 261–274.
11 D.G. Andersen et al. "FAWN: A Fast Array of Wimpy Nodes". In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM. 2009, pp. 1–14.
12 Y. Agarwal et al. "Somniloquy: Augmenting Network Interfaces to Reduce PC Energy Usage". In: Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation. USENIX Association. 2009, pp. 365–380.

4 Virtual Machines

Figure 4: Architecture of the consolidation manager Entropy.

Vacate machines by redistributing the load
• Entropy13, a consolidation manager for virtualized environments

The utilization of a set of virtual machines is monitored continuously. Periodically, the assignment of each virtual machine to physical machines is updated based on changes in resource utilization. The problem of how to update the assignment is split into two sub-problems: the virtual machine packing problem (VMPP) and the virtual machine reconfiguration problem (VMRP). A solution to the packing problem assigns virtual machines to a minimal number of physical machines subject to a set of constraints; it is related to the bin-packing problem. Transitioning from the current assignment to the new assignment with a minimal number of virtual machine migrations is the reconfiguration problem. The study considers only a sample of cluster benchmarks (NASGrid) where overall turn-around time is the most sensible metric. It is left open whether such an approach is viable for latency-sensitive interactive web services. The study uses virtual machines as the basic unit for resource allocation and computation. Virtual machines are appealing because they can be migrated between physical servers.
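The packing step can be illustrated with the first-fit decreasing (FFD) heuristic that Entropy is compared against. The sketch below is a simplification with a single CPU dimension and invented demands; Entropy also respects memory constraints and uses constraint programming to find provably minimal packings.

def first_fit_decreasing(vm_demands, node_capacity):
    # Pack VM resource demands onto as few identical nodes as possible.
    nodes = []   # remaining capacity of each node already in use
    for demand in sorted(vm_demands, reverse=True):
        for i, free in enumerate(nodes):
            if demand <= free:
                nodes[i] = free - demand
                break
        else:
            nodes.append(node_capacity - demand)   # open a new node
    return len(nodes)

# CPU demands as shares of one node's capacity (illustrative values).
print(first_fit_decreasing([60, 50, 40, 30, 20, 20], node_capacity=100))   # -> 3 nodes

A packing found this way may use more nodes than necessary; Entropy's constraint-programming search trades longer solver time for tighter packings and fewer migrations.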
Power down servers by load-based scaling of stateless compute components
• example: NapSAC14
• load-based scaling of the stateless web-application tier
• heterogeneous hardware (Nehalem and Atom)

13 F. Hermenier et al. "Entropy: a Consolidation Manager for Clusters". In: International Conference on Virtual Execution Environments. ACM. 2009, pp. 41–50.
14 A. Krioukov et al. "NapSAC: Design and Implementation of a Power-Proportional Web Cluster". In: ACM SIGCOMM Computer Communication Review 41.1 (2011), pp. 102–108.
Figure 5: Berkeley Energy-Efficient MapReduce (BEEMR) architecture.

• power density problem: need more of the less powerful hardware for the same computational power
• basic infrastructure further eats into the savings gained by less powerful hardware15

A similar strategy of load-based adaptation was employed by the NapSAC system. The goal was to approximate energy-proportional computing by scaling the application tier of a web service. NapSAC combined load-based resource provisioning with heterogeneous hardware. The hardware was a mix of powerful Nehalem servers and less powerful Intel Atom-based machines. The powerful servers handled the base load while the Atom machines were used for load spikes. Scaling the application tier is comparatively easy because it is stateless; the application state and data are kept in a separate always-on data store.
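The provisioning idea can be sketched roughly as follows; the capacities, the number of always-on servers, and the load values are illustrative assumptions, not NapSAC's actual parameters.

import math

NEHALEM_CAPACITY = 3000   # requests/s one brawny, always-on server handles (assumed)
ATOM_CAPACITY = 400       # requests/s one wimpy server handles (assumed)
ALWAYS_ON_NEHALEMS = 2

def atoms_to_power_on(predicted_load):
    # Brawny servers cover the base load; wimpy servers absorb the spike above it.
    base_capacity = ALWAYS_ON_NEHALEMS * NEHALEM_CAPACITY
    spike = max(0, predicted_load - base_capacity)
    return math.ceil(spike / ATOM_CAPACITY)

for load in (4000, 6000, 7500):        # samples from a made-up diurnal load curve
    print(f"{load} req/s -> {atoms_to_power_on(load)} Atom servers powered on")

Because the application tier is stateless, newly woken Atom servers can start serving requests as soon as they join the load balancer.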
The question is whether the concept can be extended to stateful components where, potentially, gigabytes of state must be loaded into memory before execution can resume. Our DreamServer project attempts to answer this question.

5 Workload Analysis

Flexible deadlines can be exploited to save energy16

The virtual machine approach works irrespective of the application. However, making applications "energy-aware" potentially yields higher savings, because application-specific knowledge exists. The popular Map/Reduce framework Hadoop has received much interest. With BEEMR (Figure 5), Chen et al. exploit the fact that many queries only operate on a very tiny subset of the entire data (100 MB or less). The cluster is split into two zones: interactive and batch. The small interactive zone is always on, whereas machines in the batch zone are activated on demand; batch jobs are delayed if there are insufficient resources. Because the cluster size is storage-constrained, the small interactive zone can only hold a subset of the entire data. In fact, the interactive zone works as a least-recently-used (LRU) cache: only the most recently accessed data is present. If a job accesses uncached data, it will be delayed until the batch zone is powered up again to fetch the missing data.

15 Urs Hölzle. "Brawny cores still beat wimpy cores, most of the time". In: IEEE Micro 30.4 (2010).
16 Y. Chen et al. "Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis". In: EuroSys. 2012.

Figure 6: Parasol installation.

Combining flexible deadlines with solar energy is even greener17

With GreenHadoop, Goiri et al. use a similar classification of "urgent" and "non-urgent" jobs. The basic classification allows more flexibility in scheduling the jobs. In addition to saving power, GreenHadoop uses flexible deadlines to build a schedule around the availability of solar energy. The system incorporates energy prices and weather predictions to build the execution schedule. While this works well for a comparatively small Hadoop cluster, it remains to be seen whether the concept can be scaled up to the size of an entire data center. Also, as with the BEEMR system, only Map/Reduce-style workloads are considered. Whether a mix of batch and interactive workloads can be run successfully on such an infrastructure remains an open question.
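A highly simplified sketch of the scheduling idea follows; the hourly solar forecast, the job list, and the urgent/deferrable split are invented, and the real GreenHadoop scheduler additionally factors in energy prices and job deadlines in a more principled way.

# Predicted solar energy per hour (kWh); invented forecast for one day slice.
SOLAR_FORECAST = [0, 0, 1, 3, 5, 6, 5, 3, 1, 0, 0, 0]

def schedule(jobs):
    # jobs: (name, energy_kwh, deadline_hour, urgent) tuples.
    # Urgent jobs run immediately; deferrable jobs are greedily placed into the
    # hour with the most remaining solar energy before their deadline.
    remaining = list(SOLAR_FORECAST)
    plan = {}
    for name, energy_kwh, deadline_hour, urgent in jobs:
        if urgent:
            hour = 0
        else:
            hour = max(range(deadline_hour + 1), key=lambda h: remaining[h])
        plan[name] = hour
        remaining[hour] -= energy_kwh   # may go negative: brown energy fills the gap
    return plan

print(schedule([("report", 1, 0, True), ("etl", 4, 8, False), ("backup", 3, 11, False)]))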
Which power states are worthwhile?

17 Í. Goiri et al. "GreenHadoop: Leveraging Green Energy in Data-Processing Frameworks". In: EuroSys, April (2012).

Existing low-power states have transition times that are too coarse-grained for some data center workloads. A different direction of research was thus to identify characteristics of sleep states which would make them more broadly applicable. Chip and system designers would use this information to guide the development of the next generation of server hardware.

Meisner, Gold, and Wenisch argue for only a single idle-power state. Combined with a fast transitioning mechanism between active and idle states, the two-state solution simplifies the optimization problem: transitioning between two states (active and idle) is easier than deciding between a set of power states, each with varying performance/power trade-offs. The examined workloads include web servers, mail (POP, SMTP, IMAP), DNS, DHCP, backup, and a scientific compute cluster. Because of the system's increased power range, an alternative power supply design is presented too; power supply units only operate efficiently at high loads.

Gandhi, Harchol-Balter, and Kozuch18 explore the usability and characteristics of sleep states in servers. In contrast to the PowerNap system, they include low-power active states. They argue that transition times are shorter for low-power active states. Servers usually have very short idle periods, and exploiting them requires short transition times. The design choices are active or idle low-power modes. Active low-power modes allow the system to continue to operate, though with reduced performance. In idle low-power modes the system cannot do any processing, though the potential savings are usually higher than with active low-power modes.
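Whether entering a sleep state pays off at all can be estimated with a simple break-even calculation; the power values and transition times below are assumptions for illustration, not measurements from the cited papers.

def break_even_idle_period(p_idle_w, p_sleep_w, t_transition_s, p_transition_w):
    # Shortest idle period for which entering the sleep state saves energy,
    # i.e. where the savings while asleep outweigh the transition overhead.
    extra_transition_energy = (p_transition_w - p_sleep_w) * t_transition_s
    savings_per_second = p_idle_w - p_sleep_w
    return extra_transition_energy / savings_per_second

# Assumed: 100 W idle, 5 W asleep, 200 W while suspending/resuming.
for t in (0.01, 1.0, 10.0):
    t_min = break_even_idle_period(100.0, 5.0, t, 200.0)
    print(f"transition time {t:5.2f} s -> idle period must exceed {t_min:5.2f} s")

With these numbers an idle period has to be roughly twice as long as the transition time before sleeping helps, which is why short idle periods demand correspondingly short transitions.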
Use cases for current idle, low-power systems exist
• approximating energy-proportionality at the aggregate level using non-energy-proportional components19
• powering down machines serving long-lived TCP connections has its own set of challenges20

18 Anshul Gandhi, Mor Harchol-Balter, and Michael A. Kozuch. "The case for sleep states in servers". In: HotPower Workshop. 2011.
19 N. Tolia et al. "Delivering energy proportionality with non energy-proportional systems: optimizing the ensemble". In: Proceedings of the 2008 Conference on Power Aware Computing and Systems. USENIX Association. 2008.
20 G. Chen et al. "Energy-aware Server Provisioning and Load Dispatching for Connection-Intensive Internet Services". In: Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation. USENIX Association. 2008, pp. 337–350.

Figure 7: "Cloud computing is hot, literally."

6 Data Center Design

Mini/Micro Data Centers21

Recently, instead of warehouse-sized data centers housing containers packed with servers, researchers have proposed to build many small micro data centers [18]. The excess heat generated by the micro data centers can be reused to heat the structures, e.g., homes, they are set up in. Reusing the waste heat has been done before for large installations, e.g., IBM's hot-water supercomputer in Zurich. Another appealing point of many small, decentralized micro data centers is the latency advantage: moving the infrastructure closer to the end user results in higher fidelity for interactive services. One example is video on-demand [23].

How much energy can be saved?

system name   savings
Turducken     10x battery lifetime
Somniloquy    38-85%
FAWN          ≥ 100x (queries/joule)
NapSAC        70%
BEEMR         50%

21 J. Liu et al. "The Data Furnace: Heating Up with Cloud Computing". In: Proceedings of the 3rd USENIX Workshop on Hot Topics in Cloud Computing. USENIX Association. 2011.

7 Contribution

Contributions
• Timed instances
• Export storage over USB 3.0
• DreamServer
• dsync

My contributions are four-fold and touch different parts of the data center infrastructure. First, with timed instances I explore the possibility to improve scheduling within the data center: the cloud computing customer provides the lease time as part of the resource request, and the cloud provider uses the lease time to optimize the virtual-to-physical machine mapping. Second, I investigated how to power down servers without losing access to the locally attached disks. Consumer-grade motherboards are equipped with USB 3.0 ports which offer a bandwidth of 300-400 MB/s. Instead of attaching the hard disks directly to the server, we would introduce a low-power controller which exports the disks via USB 3.0. Third, I am developing DreamServer, an energy-aware web cluster proxy. DreamServer is intended for small to medium-sized hosting data centers. An Apache web server acts as a proxy between the client and the origin server; if the (virtualized) origin server is suspended, DreamServer wakes it up before forwarding the request. Suspension policies are configurable, e.g., a simple strategy is to suspend backend servers after a fixed period of inactivity. Fourth, dsync was a by-product of the DreamServer system. dsync helps with repeated synchronization of large binary data blobs, e.g., virtual machine images. Instead of determining modifications after the fact, dsync tracks modifications online. Accompanying user-space programs use the modification status to perform the synchronization efficiently.

Timed instances
• spot vs. on-demand vs. timed reservation
• can we use knowledge of resource reservation length to improve scheduling?
• assume diurnal load curve
• use simulation
• savings between 2% and 21% depending on setting22
• additional constraints negatively impact savings23

22 Thomas Knauth and Christof Fetzer. "Spot-on for timed instances: striking a balance between spot and on-demand instances". In: International Conference on Cloud and Green Computing. IEEE. 2012. url: http://se.inf.tu-dresden.de/pubs/papers/knauth2012spot.pdf.
23 Thomas Knauth and Christof Fetzer. "Energy-aware Scheduling for Infrastructure Clouds". In: International Conference on Cloud Computing and Science. IEEE. 2012. url: http://se.inf.tu-dresden.de/pubs/papers/knauth2012scheduling.pdf.

We proposed timed instances as an alternative to the established purchase options of on-demand and spot instances. On-demand instances guarantee the availability of resources until the end of the usage period, but are usually more expensive than spot instances. Spot instances may be terminated at any time if the current spot price exceeds the user's maximum bid. We wanted to find out if and how well the provider can optimize the scheduling if the lease time is known a priori. Reservations which expire at the same time are co-located. This increases the probability of physical machines becoming vacant, and vacant machines can be powered down. In our simulations the cumulative machine uptime is reduced by 2-21%, depending on the simulation parameters.
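The intuition behind co-locating reservations with the same expiry time can be sketched as follows; the slot count per machine and the example requests are invented, and the actual evaluation used a simulator with diurnal load curves rather than this toy placement.

from collections import defaultdict

SLOTS_PER_MACHINE = 4   # assumed: identical machines hosting identical VM slots

def place_by_expiry(requests):
    # requests: (vm_id, announced_end_hour) tuples.  Grouping VMs that expire
    # together makes it likely that whole machines become vacant at once and
    # can then be powered down.
    grouped = defaultdict(list)
    for vm_id, end_hour in requests:
        grouped[end_hour].append(vm_id)
    machines = []
    for end_hour, vms in sorted(grouped.items()):
        for i in range(0, len(vms), SLOTS_PER_MACHINE):
            machines.append((end_hour, vms[i:i + SLOTS_PER_MACHINE]))
    return machines

demo = [(f"vm{i}", 8) for i in range(6)] + [(f"vm{i}", 17) for i in range(6, 10)]
for end_hour, vms in place_by_expiry(demo):
    print(f"machine can power down at hour {end_hour}: {vms}")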
Export storage over USB 3.0
• leverage the new USB SuperSpeed mode to exchange data between two servers24
• goodput of 4 Gigabit per second
• use for accessing stable storage and for transferring virtual machine state during migration
• cost-effective alternative to 10 Gigabit Ethernet and other high-end data center interconnects

Whenever a server is switched off, all the data residing on its locally attached disks becomes unavailable. We are still investigating if and how USB 3.0 can be used as a cost-effective way to share disks between multiple servers. The idea is to attach the disks to an always-on controller instead of the server; the controller exports the disks via USB 3.0. Due to the asymmetric nature of the Universal Serial Bus, a direct connection between two servers is impossible: one host governs the access to the bus, and the host can only talk to devices. We have an initial working prototype, but more work is required before we can present actual measurements.

24 Thomas Knauth and Christof Fetzer. An Inexpensive Energy-proportional General-Purpose Computing Platform. SOSP, Work-in-progress. 2011.

Figure 8: Relationship between virtual machine suspend time and dump size (suspend time in seconds vs. dump size in MB).

DreamServer
• premise: small and medium-sized web sites/services have significant idle times
• shut down virtualized web services if not utilized
• challenge is fast suspend/resume to/from disk of a stateful service
• suspend is CPU-bound whereas resume is I/O-bound

DreamServer is our resource-conserving HTTP proxy. Its purpose is to suspend unused virtual machines and ultimately power off idle servers. The challenge we are facing with DreamServer at the moment is slow suspend and resume performance: resuming or suspending a virtual machine with multiple gigabytes of state takes on the order of minutes. The slow transition times are one hurdle we need to overcome for such a system to be practically relevant.

dsync
• efficient, fast, periodic transfer of binary data, e.g., virtual machine disks
• copy everything vs. identify differences post-hoc vs. track modifications online
• simple copy is wasteful (network bandwidth, disk bandwidth, cache pollution)
• identifying differences trades CPU for network bandwidth
• tracking is only supported for certain applications, e.g., VMware ESX ≥ 4, or specific file systems, e.g., btrfs and zfs

Figure 9: Synchronization time for copy, rsync, and dsync depending on changed block count.

dsync was born out of necessity in the DreamServer project. We wanted to be able to repeatedly synchronize virtual machine disks between servers. Copying them in their entirety is wasteful, and existing tools, e.g., rsync, are slow. A tool which combines the speed of a simple copy with the efficiency of rsync was needed. rsync uses many CPU cycles to compute the checksums which are required to identify differences. If there existed a way to track modifications, no checksums would be needed anymore. We implemented block-level modification tracking in the Linux kernel by extending the device mapper module. The modification state is exposed to user space via a proc-fs interface. For one workload, shown in Figure 9, we saw an over 100x improvement in synchronization speed compared to rsync.
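The core idea of dsync, recording which blocks were written and copying only those, can be sketched in user space as follows. This is only an illustration: the actual implementation tracks writes inside the Linux device mapper and exports the dirty state through a proc-fs interface, and the block size chosen here is an assumption.

BLOCK_SIZE = 4096   # bytes; illustrative choice, not necessarily dsync's

class TrackedDevice:
    # Toy stand-in for a block device whose writes are tracked.
    def __init__(self, size_in_blocks):
        self.blocks = [bytes(BLOCK_SIZE)] * size_in_blocks
        self.dirty = set()      # indices of blocks modified since the last sync

    def write_block(self, index, data):
        self.blocks[index] = data
        self.dirty.add(index)   # in dsync, the kernel records this on every write

def synchronize(source, target):
    # Copy only the modified blocks, then reset the tracking state.
    for index in sorted(source.dirty):
        target.blocks[index] = source.blocks[index]
    copied = len(source.dirty)
    source.dirty.clear()
    return copied

src, dst = TrackedDevice(1024), TrackedDevice(1024)
src.write_block(7, b"x" * BLOCK_SIZE)
src.write_block(512, b"y" * BLOCK_SIZE)
print(synchronize(src, dst), "of 1024 blocks copied")

Because no checksums have to be computed, the synchronization cost scales with the number of modified blocks rather than with the size of the whole image, which matches the behaviour shown in Figure 9.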
Critical remarks
• business incentives to push for greater efficiency may be missing
• energy costs may be only a small fraction of operational costs
• CPU utilization values are misleading; a server has other resources too

Summarizing the contributions
• timed instances: purchase option with a predefined lease time
• sharing storage over USB 3.0
• DreamServer: resource-conserving HTTP proxy
• dsync: modification tracking for block devices