
Transcription

1
Abstract
Energy-efficient computing is a hot topic in times of rising energy costs and
environmental awareness. As the demand for computational resources grows
unabated, so does the total amount of energy spent to store, aggregate, transform, and enrich the world’s digital assets. Researchers and companies alike
strive for novel ways to reduce the energy consumption of today's warehouse-sized data centers.
The talk starts with an overview of the topic before showcasing possible solutions. We will touch on all aspects, from rethinking the entire data center, tailored infrastructure, workload analysis and migration, to server design. In the second part of the talk I will cover my own contributions so far. The talk will conclude with an outlook on what I would still like to accomplish as part of my PhD.
2
Motivation
Who cares about IT’s energy consumption?
• Environmental Protection Agency1
• Greenpeace2
• Greenpeace cares a lot3
• Jevons paradox – resource consumption increases due to reduced cost
• The New York Times4: 6-12% average utilization
• SFB 912: Highly Adaptive Energy-Efficient Computing (HAEC)
The American Environmental Protection Agency (EPA) published a report titled "Server and Data Center Energy Efficiency" in 2007. In this report the EPA estimated that data center energy consumption doubled between 2000 and 2006 and projected another two-fold increase between 2007 and 2011. To highlight just one other number: the report estimates the US federal government's electricity cost at $740 million in 2011. Greenpeace also published two reports: "How Dirty is Your Data?" in 2011 and "How Clean is Your Cloud?" in 2012. The reports call for greater efficiency as well as transparency. According to Greenpeace, however, green computing does not stop at attaining greater efficiency. It must be accompanied by a move away from "dirty" energy sources, e.g., fossil-based and nuclear, to renewables. A New York Times article from 2012 also called attention to the pollution caused by inefficient data centers, though reactions to the article have been mixed. Some of the claimed inefficiencies are already being addressed in modern data centers. For example, Google built data centers which use only natural cooling, reducing the energy required to maintain a safe operating temperature range. Also, many data centers are run by corporations whose main goal is to maximize profit. As long as the inefficiency is bounded and does not hurt the corporation's bottom line, businesses have no real incentive to address it.
1 Report to Congress on Server and Data Center Energy Efficiency. U.S. Environmental Protection Agency, ENERGY STAR Program, 2007.
2 Gary Cook and Jodie Van Horn. How Dirty is Your Data? url: http://www.greenpeace.org/international/en/publications/reports/How-dirty-is-your-data/.
3 Greenpeace. How Clean is Your Cloud? url: http://www.greenpeace.org/international/en/publications/Campaign-reports/Climate-Reports/How-Clean-is-Your-Cloud/.
4 James Glanz. The Cloud Factories – Power, Pollution and the Internet. url: http://www.nytimes.com/2012/09/23/technology/data-centers-waste-vast-amounts-of-energy-belying-industry-image.html.
Cloud computing 5th largest "country" in terms of energy consumption5
2007 electricity consumption [billion kWh]
US                3923
China             3438
Russia            1023
Japan              925
Cloud computing    662
India              568
Germany            547
Canada             536
France             447
Brazil             404
UK                 345
Figure 1: IT's energy consumption compared to national energy consumption.
To illustrate the problem, Greenpeace included a graph in their report, which
compares the energy spent on "cloud computing" to national energy budgets.
If “cloud computing” were a country, it would have ranked 5th (in 2007) among
the world’s nations in terms of kWh consumed.
In their book “The Datacenter as a Computer”6 Barroso and Hölzle present
a breakdown of data center operating costs. They present two case studies to
demonstrate how operating costs can vary based on energy price and choice of
server. Case A uses expensive, professional-grade server hardware with a lower peak power draw (300 W), and electricity is cheap at around 6 c/kWh. The server-related costs, including server amortization, server interest, and server opex, accumulate to 70% of the monthly costs. Case study B assumes commodity-grade compute hardware which is cheaper to acquire but draws more power (peak power is at 500 W). In this scenario, server-related costs drop to 29% (23% amortization, 5% interest, 1% opex). On the other hand, due to higher electricity prices and higher-power servers, energy-related costs rise to 22% from previously 6%. Energy-related costs will claim an even larger share of the data center budget as energy prices keep rising.
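To make the arithmetic behind such case studies concrete, here is a minimal back-of-the-envelope sketch in Python. The peak power, load factor, PUE, and price values are illustrative assumptions and not the exact figures used by Barroso and Hölzle.

def monthly_energy_cost(peak_watts, avg_load, pue, price_per_kwh, hours=730):
    """Estimate a server's monthly electricity bill.

    avg_load: average power draw as a fraction of peak (servers rarely run at 100%).
    pue: power usage effectiveness; multiplies IT power by cooling/distribution overhead.
    """
    avg_watts = peak_watts * avg_load * pue
    kwh = avg_watts / 1000.0 * hours
    return kwh * price_per_kwh

# Illustrative numbers only: a 300 W server (case A style) at cheap electricity
# versus a 500 W server (case B style) at a higher electricity price.
case_a = monthly_energy_cost(peak_watts=300, avg_load=0.75, pue=1.7, price_per_kwh=0.06)
case_b = monthly_energy_cost(peak_watts=500, avg_load=0.75, pue=1.7, price_per_kwh=0.10)
print(f"case A: ${case_a:.2f}/month, case B: ${case_b:.2f}/month")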
5 Gary Cook and Jodie Van Horn. How Dirty is Your Data? url: http://www.greenpeace.org/international/en/publications/reports/How-dirty-is-your-data/.
6 L.A. Barroso and U. Hölzle. “The datacenter as a computer: An introduction to the design of warehouse-scale
machines”. In: Synthesis Lectures on Computer Architecture 4.1 (2009).
Figure 2: Power consumption and energy-efficiency for a typical server across
the utilization spectrum.
Servers are not power-proportional7
Looking at the individual consumers inside the data center, about 59% of
the energy is consumed by the actual servers8 . Two factors aggravate the energy
inefficiency inside the data center. First, servers have a high baseline consumption at low utilization levels: a typical server consumes about 30% of its peak power at 0% utilization. Second, the average server utilization level is low, between 10-50%9. Because energy-proportional hardware will not be available in
the near future, researchers concentrate on how to increase average utilization
levels.
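The gap between power draw and useful work can be made concrete with a simple linear power model. The sketch below assumes an idle draw of 30% of peak, as quoted above; the specific wattage is an illustrative assumption.

PEAK_WATTS = 400.0      # illustrative server peak power
IDLE_FRACTION = 0.30    # ~30% of peak consumed at 0% utilization (see text)

def power_draw(utilization):
    """Linear interpolation between idle and peak power for a utilization in 0..1."""
    return PEAK_WATTS * (IDLE_FRACTION + (1.0 - IDLE_FRACTION) * utilization)

def efficiency(utilization):
    """Work done per watt, relative to a perfectly energy-proportional server."""
    if utilization == 0.0:
        return 0.0
    return (utilization * PEAK_WATTS) / power_draw(utilization)

for u in (0.0, 0.1, 0.3, 0.5, 1.0):
    print(f"utilization {u:>4.0%}: {power_draw(u):5.0f} W, relative efficiency {efficiency(u):.2f}")

At 10% utilization such a server still draws 148 W while delivering only a tenth of its peak work, which is exactly the mismatch Figure 2 illustrates.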
Overview
• Tailored Infrastructure
• Dynamic Scaling
• Workload Analysis
• Server Design
• Data Center Design
• Contribution
7 L.A. Barroso and U. Hölzle. “The datacenter as a computer: An introduction to the design of warehouse-scale
machines”. In: Synthesis Lectures on Computer Architecture 4.1 (2009).
8 J. Hamilton. "Cooperative Expendable Micro-Slice Servers (CEMS): Low Cost, Low Power Servers for Internet-Scale Services". In: Conference on Innovative Data Systems Research. Citeseer. 2009.
9 L.A. Barroso and U. Hölzle. “The datacenter as a computer: An introduction to the design of warehouse-scale
machines”. In: Synthesis Lectures on Computer Architecture 4.1 (2009).
Figure 3: Turducken: the system (left) and the real thing (right).
3
Tailored Infrastructure
Hardware heterogeneity bridges the gap between low and high utilization levels
• combine hardware with different power/performance characteristics
• examples: Turducken10, eBond, FAWN11, Somniloquy12
If a single piece of hardware does not offer enough flexibility with respect
to energy/performance trade-offs, combining multiple components is one solution. One system which exploited the concept of heterogeneous hardware is
Turducken [21]. Turducken combined a laptop, a Palm handheld, and a mote
(sensor node) to extend the laptop’s overall battery runtime.
Fast Array of Wimpy Nodes (FAWN) [2] co-designs hardware and software to optimize for operations per joule. It works around the memory wall by using (relatively) low-power, low-performance CPUs. The utilized components, low-power Atom boards and flash memory, are a better fit for the I/O-intensive workload. The key-value store workload is I/O-bound, so it makes sense to use low-power processors. It is a reasonable assumption that by matching the hardware components to the workload more work per unit of energy can be done. The resulting system is, however, constrained to the targeted workload. In a modern data center multiple workloads with different characteristics, e.g., I/O-intensive vs. CPU-intensive, run in parallel. The infrastructure must be flexible enough to handle workload changes. As such, a general purpose infrastructure may still be the first choice. Developing software designed to run on special purpose hardware costs money too. While a higher ops/joule metric is a worthy goal, it may make little sense business-wise. A toy ops/joule comparison is sketched below.
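As a toy illustration of the ops/joule argument, the following sketch compares a brawny and a wimpy node on an I/O-bound workload; the throughput and wattage numbers are made up for illustration and are not taken from the FAWN paper.

def ops_per_joule(ops_per_second, watts):
    """Energy-efficiency metric used by FAWN-style systems: work per unit of energy."""
    return ops_per_second / watts  # 1 W = 1 J/s, so ops/s divided by W gives ops/J

# Hypothetical nodes serving an I/O-bound key-value workload.
brawny = ops_per_joule(ops_per_second=50_000, watts=250)  # fast CPU mostly waiting on I/O
wimpy = ops_per_joule(ops_per_second=20_000, watts=20)    # Atom board with flash storage

print(f"brawny: {brawny:.0f} ops/J, wimpy: {wimpy:.0f} ops/J")
# The wimpy node does less work per second but far more work per joule,
# which is the trade-off FAWN exploits for I/O-bound workloads.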
10 J. Sorber et al. “Turducken: hierarchical power management for mobile devices”. In: Proceedings of the 3rd International Conference on Mobile Systems, Applications, and Services. ACM. 2005, pp. 261–274.
11 D.G. Andersen et al. “FAWN: A Fast Array of Wimpy Nodes”. In: Proceedings of the ACM SIGOPS 22nd Symposium
on Operating Systems Principles. ACM. 2009, pp. 1–14.
12 Y. Agarwal et al. “Somniloquy: Augmenting Network Interfaces to Reduce PC Energy Usage”. In: Proceedings of
the 6th USENIX symposium on Networked systems design and implementation. USENIX Association. 2009, pp. 365–380.
4
Virtual Machines
Figure 4: Architecture of consolidation manager Entropy.
Vacate machines by redistributing the load13
Entropy is a consolidation manager for virtualized environments. The utilization of a set of virtual machines is monitored continuously. Periodically, the assignment of each virtual machine to a physical machine is updated based on changes in resource utilization. The problem of how to update the assignment is split into two sub-problems: the virtual machine packing problem (VMPP) and the virtual machine reconfiguration problem (VMRP). A solution to the packing problem assigns virtual machines to a minimal number of physical machines subject to a set of constraints; it is related to the bin-packing problem (a sketch follows below). Transitioning from the current assignment to the new assignment with a minimal number of virtual machine migrations is the reconfiguration problem. The study considers cluster benchmarks (NASGrid) where overall turnaround time is the most sensible metric. It is left open whether such an approach is viable for latency-sensitive interactive web services. The study uses virtual machines as the basic unit for resource allocation and computation. Virtual machines are appealing because they can be migrated between physical servers.
Power down servers by load-based scaling of stateless compute components
• example NapSAC14
• load-based scaling of stateless web-application tier
• heterogeneous hardware (Nehalem and Atom)
13 F. Hermenier et al. "Entropy: a Consolidation Manager for Clusters". In: International Conference on Virtual Execution Environments. ACM. 2009, pp. 41–50.
14 A. Krioukov et al. "NapSAC: Design and Implementation of a Power-Proportional Web Cluster". In: ACM SIGCOMM Computer Communication Review 41.1 (2011), pp. 102–108.
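The sketch below illustrates the flavor of the packing step with a plain first-fit-decreasing heuristic over a single CPU dimension. It is not Entropy's constraint-programming formulation, which also handles memory constraints and minimizes migrations; node capacity and VM demands are illustrative assumptions.

def pack_first_fit_decreasing(vm_cpu_demands, node_capacity):
    """Assign VMs to as few nodes as possible (greedy FFD heuristic).

    Returns a list of nodes, each node being a list of (vm_id, demand) tuples.
    Entropy replaces this heuristic with an exhaustive constraint-programming
    search and additionally minimizes the number of migrations.
    """
    nodes = []  # each entry: [remaining_capacity, [(vm_id, demand), ...]]
    # Place big VMs first; this tends to leave fewer awkward gaps.
    for vm_id, demand in sorted(vm_cpu_demands.items(), key=lambda kv: -kv[1]):
        for node in nodes:
            if node[0] >= demand:
                node[0] -= demand
                node[1].append((vm_id, demand))
                break
        else:  # no existing node fits, power up a new one
            nodes.append([node_capacity - demand, [(vm_id, demand)]])
    return [node[1] for node in nodes]

# Illustrative demands (fraction of one node's CPU capacity of 1.0).
demands = {"vm1": 0.6, "vm2": 0.5, "vm3": 0.4, "vm4": 0.3, "vm5": 0.2}
for i, placement in enumerate(pack_first_fit_decreasing(demands, node_capacity=1.0)):
    print(f"node {i}: {placement}")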
A similar strategy of load-based adaptation was employed by the NapSAC system. The goal was to approximate energy-proportional computing by scaling the application tier of a web service. NapSAC combined load-based resource provisioning with heterogeneous hardware. The hardware was a mix of powerful Nehalem servers and less powerful Intel Atom-based machines. The powerful servers handled the base load while the Atom machines were used for load spikes. Scaling the application tier is comparatively easy because it is stateless. The application state and data is kept in a separate always-on data store. The question is if the concept can be extended to stateful components where, potentially, gigabytes of state must be loaded into memory before execution can resume. Our DreamServer project attempts to answer this question.
• power density problem: need more of the less powerful hardware for the same power
• basic infrastructure further eats into savings gained by less powerful hardware15
5
Workload Analysis
Flexible deadlines can be exploited to save energy16
Figure 5: Berkeley Energy-Efficient MapReduce (BEEMR) architecture.
The virtual machine approach works irrespective of the application. However, making applications "energy-aware" potentially yields higher savings, because application-specific knowledge exists. The popular Map/Reduce framework Hadoop has received much interest. Chen et al. exploit the fact that many queries only operate on a very tiny sub-set of the entire data (100 MB or less). Their system, Berkeley Energy-Efficient MapReduce (BEEMR), serves these small jobs from a small, always-on interactive zone (Figure 5). Because the cluster size is storage constrained, the small interactive zone can only hold a subset of the entire data. In fact, the interactive zone works as a least-recently-used (LRU) cache: only the most recently accessed data is present.
15 Urs Hölzle. “Brawny cores still beat wimpy cores, most of the time”. In: IEEE Micro 30.4 (2010).
16 Y. Chen et al. “Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis”.
In: EuroSys. 2012.
Figure 6: Parasol installation.
If a job accesses uncached data, it will be delayed until the batch zone is powered
up again to fetch the missing data. The cluster is split into two zones: interactive and batch. The small interactive zone is always on whereas machines in
the batch zone are activated on-demand. Batch jobs are delayed if there are
insufficient resources.
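BEEMR classifies incoming jobs by input size and task duration: small jobs run immediately in the interactive zone, jobs with very long tasks become interruptible, and everything else waits for the next batch. The following is a minimal sketch of that classification; the threshold values and queueing details are illustrative assumptions, not the parameters from the BEEMR paper.

from collections import namedtuple

Job = namedtuple("Job", "name input_bytes longest_task_seconds")

# Illustrative thresholds; BEEMR derives its actual values empirically.
INTERACTIVE_INPUT_LIMIT = 100 * 1024 * 1024  # jobs touching <100 MB run immediately
INTERRUPTIBLE_TASK_LIMIT = 3600              # jobs with very long tasks span batches

def classify(job):
    """Assign a job to the zone/queue that will service it."""
    if job.input_bytes < INTERACTIVE_INPUT_LIMIT:
        return "interactive"    # served by the always-on interactive zone
    if job.longest_task_seconds > INTERRUPTIBLE_TASK_LIMIT:
        return "interruptible"  # checkpointed and resumed across batches
    return "batch"              # waits for the next batch interval

batch_queue = []
for job in [Job("ad-hoc query", 20 * 2**20, 30), Job("daily ETL", 500 * 2**30, 600)]:
    zone = classify(job)
    if zone == "interactive":
        print(f"{job.name}: run now in interactive zone")
    else:
        batch_queue.append(job)  # powered-down batch zone is woken at the next interval
        print(f"{job.name}: queued as {zone} job for the next batch")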
Combining flexible deadlines with solar energy is even greener17
With GreenHadoop Goiri et al. use a similar classification of “urgent” and
“non-urgent” jobs. The basic classification allows more flexibility in scheduling
the jobs. In addition to saving power, GreenHadoop uses flexible deadlines to
build a schedule around the availability of solar energy. The system incorporates
energy price and weather prediction to build the execution schedule. While this
works well for a comparatively small Hadoop cluster, it remains to be seen
whether the concept can be scaled up to the size of an entire data center. Also,
as with the BEEMR system, only Map/Reduce-style workloads are considered.
Whether a mix of batch and interactive workloads can be run successfully on
such an infrastructure remains an open question.
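As a rough sketch of the scheduling idea (not GreenHadoop's actual algorithm, which also factors in energy prices, battery state, and weather forecasts), a scheduler can defer non-urgent jobs into hours where solar energy is predicted to be available, as long as their deadlines still hold. All numbers below are illustrative.

def schedule(jobs, green_forecast_kwh):
    """Greedily place job energy demand into hours with predicted solar energy.

    jobs: list of (name, energy_kwh, deadline_hour).
    green_forecast_kwh: predicted solar energy per hour, indexed by hour.
    Returns {job name: start hour}; falls back to hour 0 (brown energy) if no
    green window before the deadline can supply the job.
    """
    remaining = list(green_forecast_kwh)
    plan = {}
    for name, energy, deadline in sorted(jobs, key=lambda j: j[2]):  # earliest deadline first
        green_hours = [h for h in range(deadline + 1) if remaining[h] >= energy]
        hour = green_hours[0] if green_hours else 0
        if green_hours:
            remaining[hour] -= energy
        plan[name] = hour
    return plan

forecast = [0, 0, 0, 0, 0, 0, 1, 3, 5, 6, 6, 5]  # kWh of solar per hour, illustrative
jobs = [("urgent report", 2, 1), ("log crunching", 4, 11), ("backup", 3, 11)]
print(schedule(jobs, forecast))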
Which power states are worthwhile?
17 Í. Goiri et al. “GreenHadoop: Leveraging Green Energy in Data-Processing Frameworks”. In: EuroSys, April
(2012).
Existing low power states have transition times that are too coarse grained
for some data center workloads. A different direction of research was thus to
identify characteristics of sleep states which would make them more broadly
applicable. Chip and system designers would use this information to guide the
development of the next generation of server hardware. Meisner, Gold, and
Wenisch argue for only a single idle-power state. Combined with a fast transitioning mechanism between active and idle states the two-state solution simplifies the optimization problem. Transitioning between two states (active and
idle) is easier than deciding between a set of power states each with varying performance/power trade-offs. The examined workloads include web servers, mail
(pop, smtp, imap), DNS, DHCP, backup, and a scientific compute cluster. Because of the system's increased dynamic power range, an alternative power supply design is presented too; conventional power supply units only operate efficiently at high loads.
Gandhi, Harchol-Balter, and Kozuch18 explore the usability and characteristics of sleep states in servers. In contrast to the Powernap system, they include
low-power active states. They argue that transition times are shorter for low-power active states. Servers usually have very short idle periods. Exploiting
them requires short transition periods. The design choices are active or idle low
power modes. Active low-power modes allow the system to continue to operate,
though with reduced performance. In idle low-power modes the system cannot
do any processing, though the potential savings are usually higher than with
active low-power modes.
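Whether an idle low-power state pays off can be framed as a break-even calculation: the idle period must be long enough that the energy saved while asleep exceeds the energy spent entering and leaving the state. The numbers below are illustrative, not measurements from the cited papers.

def break_even_seconds(p_idle, p_sleep, p_transition, transition_seconds):
    """Shortest idle period for which sleeping saves energy.

    Energy without sleeping over a period T: T * p_idle.
    Energy with sleeping: transition_seconds * p_transition + (T - transition_seconds) * p_sleep.
    Setting both equal and solving for T gives the break-even point.
    """
    return transition_seconds * (p_transition - p_sleep) / (p_idle - p_sleep)

# Illustrative server: 120 W idle, 10 W asleep, 200 W during a 5 s suspend/resume cycle.
print(f"break-even idle period: {break_even_seconds(120, 10, 200, 5):.1f} s")

Idle periods shorter than the break-even point waste energy on the transition, which is why short transition times matter so much for bursty server workloads.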
Use cases for current idle, low-power systems exist
• approximating energy-proportionality at the aggregate level using non-energy-proportional components19
18 Anshul Gandhi, Mor Harchol-Balter, and Michael A. Kozuch. “The case for sleep states in servers”. In: HotPower
Workshop. 2011.
19 N. Tolia et al. “Delivering energy proportionality with non energy-proportional systems: optimizing the ensemble”. In: Proceedings of the 2008 Conference on Power Aware Computing and Systems. USENIX Association. 2008.
Figure 7: “Cloud computing is hot, literally.”
• powering down machines serving long-lived TCP connections has its own
set of challenges20
6
Data Center Design
Mini/Micro Data Centers21
Recently, instead of warehouse-sized data centers housing containers packed
with servers, researchers have proposed to build many small micro data centers [18]. The excess heat generated by the micro data centers can be reused
to heat the structures, e.g., homes, they are set up in. Reusing the waste heat
has been done before for large installations, e.g., IBM's hot-water-cooled supercomputer in Zurich. The appealing point of many small, decentralized micro data centers is the latency advantage. Moving the infrastructure closer to the end-user results in higher fidelity for interactive services. One example is video on-demand [23].
How much energy can be saved?
system name   savings
Turducken     10x battery lifetime
Somniloquy    38-85%
FAWN          ≥ 100x (queries/joule)
NapSAC        70%
BEEMR         50%
20 G. Chen et al. “Energy-aware Server Provisioning and Load Dispatching for Connection-Intensive Internet Services”. In: Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation. USENIX Association.
2008, pp. 337–350.
21 J. Liu et al. “The Data Furnace: Heating Up with Cloud Computing”. In: Proceedings of the 3rd USENIX Workshop
on Hot Topics in Cloud Computing. USENIX Association. 2011.
7
Contribution
Contributions
• Timed instances
• Export storage over USB 3.0
• DreamServer
• dsync
My contributions are four-fold and touch different parts of the data center
infrastructure. First, with timed instances I explore the possibility of improving
the scheduling within the data center. The cloud computing customer provides
the lease time as part of the resource request. The cloud provider uses the lease
time to optimize the virtual-to-physical machine mapping. Second, I investigated how to power down servers without losing access to the locally attached disks. Consumer-grade motherboards are equipped with USB 3.0 ports which offer a bandwidth of 300-400 MB/s. Instead of attaching the hard
disks directly to the server, we would introduce a low-power controller which
exports the disks via USB 3.0. Third, I am developing DreamServer, an energy-aware web cluster proxy. DreamServer is intended for small to medium-sized hosting data centers. An Apache web server acts as a proxy between the client and the
origin server. If the (virtualized) origin server is suspended, DreamServer wakes
it up before forwarding the request. Suspension policies are configurable, e.g.,
a simple strategy is to suspend backend servers after a fixed period of inactivity. Fourth, dsync was a by-product of the DreamServer system. dsync helps
with repeated synchronization of large binary data blobs, e.g., virtual machine
images. Instead of determining modifications after the fact, dsync tracks modifications online. Accompanying user space programs use the modification status
to perform the synchronization efficiently.
Timed instances
• spot vs. on-demand vs. timed reservation
• can we use knowledge of resource reservation length to improve scheduling?
• assume diurnal load curve
• use simulation
• savings between 2% and 21% depending on setting22
• additional constraints negatively impact savings23
22 Thomas Knauth and Christof Fetzer. "Spot-on for timed instances: striking a balance between spot and on-demand instances". In: International Conference on Cloud and Green Computing. IEEE. 2012. url: http://se.inf.tu-dresden.de/pubs/papers/knauth2012spot.pdf.
23 Thomas Knauth and Christof Fetzer. "Energy-aware Scheduling for Infrastructure Clouds". In: International Conference on Cloud Computing and Science. IEEE. 2012. url: http://se.inf.tu-dresden.de/pubs/papers/knauth2012scheduling.pdf.
We proposed timed instances as an alternative to the established purchase
options on-demand and spot instances. On-demand instances guarantee the
availability of resources until the end of the usage period, but are usually more
expensive than spot instances. Spot instances may be terminated any time if
the current spot price exceeds the user’s maximum bid. We wanted to find out
if and how well the provider can optimize the scheduling if the lease time is
known a priori. Reservations which expire at the same time are co-located. This increases the probability of physical machines becoming vacant. Vacant machines
can be powered down. In our simulations the cumulative machine uptime is
reduced by 2-21% depending on the simulation parameters.
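A minimal sketch of the co-location idea follows: place each new reservation on the powered-on machine whose existing leases expire closest to the new lease's expiry, so machines empty out together and can be powered down. This illustrates the principle only and is not the simulator used in the cited papers; the slot count is an illustrative assumption.

def place(reservations, new_vm, slots_per_machine=4):
    """Greedy placement by lease expiry.

    reservations: {machine_id: [expiry times of VMs already placed]}
    new_vm: (vm_id, expiry_time)
    Returns the chosen machine id (possibly a newly powered-up one).
    """
    vm_id, expiry = new_vm
    best, best_gap = None, None
    for machine, leases in reservations.items():
        if len(leases) >= slots_per_machine:
            continue
        gap = abs(max(leases) - expiry)  # how well the expiry times line up
        if best_gap is None or gap < best_gap:
            best, best_gap = machine, gap
    if best is None:                     # no machine has room: power one up
        best = f"machine-{len(reservations)}"
        reservations[best] = []
    reservations[best].append(expiry)
    return best

cluster = {}
for vm in [("a", 10), ("b", 11), ("c", 24), ("d", 9), ("e", 23)]:
    print(vm[0], "->", place(cluster, vm, slots_per_machine=2))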
Export storage over USB 3.0
• leverage new USB SuperSpeed to exchange data between two servers24
• goodput of 4 Gigabit per second
• use for accessing stable storage, transfer virtual machine state during migration
• cost effective alternative to 10 Gigabit LAN and other high-end data center
interconnects
Whenever a server is switched off all the data residing on the locally attached
disks becomes unavailable. We are still investigating if and how USB 3.0 can
be used as a cost effective way to share disks between multiple servers. The
idea is to attach the disks to an always-on controller instead of the server. The controller exports the disks via USB 3.0. Due to the asymmetric nature of
the Universal Serial Bus a direct connection between two servers is impossible.
One host governs the access to the bus and the host can only talk to devices.
We have an initial working prototype, but more work is required before we can
present actual measurements.
24 Thomas Knauth and Christof Fetzer. An Inexpensive Energy-proportional General-Purpose Computing Platform, SOSP,
Work-in-progress. 2011.
Figure 8: Relationship between virtual machine suspend time and dump size.
DreamServer
• premise: small and medium-sized web sites/services have significant idle
times
• shut down virtualized web service if not utilized
• challenge is fast suspend/resume to/from disk of stateful service
• suspend is CPU-bound whereas resume is I/O bound
DreamServer is our resource-conserving HTTP proxy. Its purpose is to suspend unused virtual machines and ultimately power off idle servers. The main challenge we are facing with DreamServer at the moment is slow suspend and resume performance. Resuming or suspending a virtual machine with multiple gigabytes of state takes on the order of minutes. The slow transition times
are one hurdle we would need to overcome for such a system to be practically
relevant.
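To make the proxy's role concrete, here is a minimal sketch of the request path, assuming a hypothetical resume_vm() hook into the hypervisor; it is a simplification of the intended DreamServer behaviour, not its actual implementation, and the host names and addresses are placeholders.

import time
import urllib.request

SUSPENDED = {"shop.example.com"}  # bookkeeping of suspended backends (illustrative)
BACKENDS = {"shop.example.com": "http://10.0.0.5:8080"}

def resume_vm(host):
    """Hypothetical hypervisor hook: resume the suspended VM hosting `host`."""
    print(f"resuming VM for {host} ...")
    time.sleep(2)  # stands in for the (currently slow) resume from disk
    SUSPENDED.discard(host)

def handle_request(host, path):
    """Proxy logic: wake the origin server first if necessary, then forward."""
    if host in SUSPENDED:
        resume_vm(host)  # the request is held, not dropped, while the backend resumes
    with urllib.request.urlopen(BACKENDS[host] + path) as resp:
        return resp.read()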
dsync
• efficient, fast, periodic transfer of binary data, e.g., virtual machine disks
• copy everything vs. identify differences post-hoc vs. track modification
online
• simple copy is wasteful (network bandwidth, disk bandwidth, cache pollution)
Figure 9: Synchronization time for copy, rsync, and dsync depending on changed
block count.
• identify differences trades CPU for network bandwidth
• tracking only supported by certain applications, e.g., VMware ESX ≥ 4, or specific file systems, e.g., btrfs and zfs
dsync was born out of necessity from the DreamServer project. We wanted
to be able to repeatedly synchronize virtual machine disks between servers.
Copying them in their entirety is wasteful and existing tools, e.g., rsync, are
slow. A tool which combines the speed of a simple copy with the efficiency of
rsync was needed. rsync uses many CPU cycles to compute checksums which are
required to identify differences. If there existed a way to track modifications, no
checksum would be needed anymore. We implemented block-level modification
tracking in the Linux kernel by extending the device mapper module. The
modification state is exposed to user space via a proc-fs interface. For one
workload, shown in Figure 9, we saw over 100x improvement in synchronization
speed compared to rsync.
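The following sketch illustrates how a user-space tool might consume such per-block modification state to copy only dirty blocks; the proc-fs path, block size, and device names are illustrative placeholders, not dsync's actual interface.

BLOCK_SIZE = 4096
TRACKING_FILE = "/proc/example-dsync/vm1/dirty"  # hypothetical proc-fs export

def dirty_blocks(path=TRACKING_FILE):
    """Read the set of modified block numbers exposed by the kernel-side tracking."""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]

def sync_dirty(src_dev, dst_dev, blocks):
    """Copy only the blocks marked dirty instead of copying or checksumming the whole device."""
    with open(src_dev, "rb") as src, open(dst_dev, "r+b") as dst:
        for blk in blocks:
            src.seek(blk * BLOCK_SIZE)
            dst.seek(blk * BLOCK_SIZE)
            dst.write(src.read(BLOCK_SIZE))

# Usage idea: sync_dirty("/dev/mapper/vm1", "/backup/vm1.img", dirty_blocks())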
Critical remarks
• business incentives may be missing to push for greater efficiency
• energy costs may only be a small fraction of operational costs
• CPU utilization values are misleading; server has other resources too
Summarizing the contributions
• timed instances: purchase option with predefined lease time
• sharing storage over USB 3.0
• DreamServer: resource-conserving HTTP proxy
• dsync: modification tracking for block devices