How to Avoid Herd: A Novel Stochastic Algorithm in Grid Scheduling
Qinghua Zheng1,2, Haijun Yang3 and Yuzhong Sun1
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100080
2 Graduate School of the Chinese Academy of Sciences, Beijing, 100049
3 School of Economics and Management, BEIHANG University
[email protected], [email protected], [email protected]
Abstract
Grid technologies promise high performance to grid users; consequently, scheduling is becoming a crucial problem. Herd behavior is a common phenomenon that causes severe performance degradation in grid environments as a result of poor scheduling decisions. In this paper, on the basis of the theoretical results of the homogeneous balls and bins model, we propose a novel stochastic algorithm to avoid herd behavior. Our experiments show that the multi-choice strategy, combined with the advantages of DHT, can reduce herd behavior in a large-scale sharing environment while providing better scheduling performance at a much lower scheduling overhead than greedy algorithms. In the case of 1000 resources, the simulations show that, under heavy load (i.e., system utilization rate 0.5), the multi-choice algorithm reduces the number of incurred herds by a factor of 36, the average job waiting time by a factor of 8, and the average job turn-around time by 12% compared to the greedy algorithms.
1. Introduction
Scheduling is one of the key issues in grid computing, and various scheduling systems can be found in many grid projects, such as GrADS[1], Ninf[2], etc. Among all the challenges in grid scheduling, "herd behavior" (meaning that "all tasks arriving in an information update interval go to a same subset of servers")[3], which causes imbalance and drastic contention on resources, is a critical one. The large scale of grid resources partitions the network into many autonomous sites, each making its own scheduling decisions. However, there has been theoretical evidence that systems in which resource allocation is performed by many independent entities can exhibit herd
behavior and thus degrade performance[3]. Herd behavior is caused by independent entities, with imperfect and stale system information, allocating multiple requests onto a resource simultaneously. Here, "imperfect" means that an entity does not know the others' decisions when it performs resource allocation.
This herd behavior can be found in supercomputer scheduling[4], where many SAs (Supercomputer AppLeS) scheduled jobs on one supercomputer. Each SA estimated the turn-around time of each request based on the current state of the supercomputer, and then forwarded the request with the smallest expected turn-around time to the supercomputer. The decision made by one SA affected the state of the system, thereby impacting the other SA instances, and the global behavior of the system thus came from the aggregate behavior of all SAs. We can abstract the model shown in Fig. 1 for the above system, in which herd behavior might happen. In this model, jobs are first submitted to one of the SIs, and then forwarded to one of the shared resources, which is allocated independently by the SI. When multiple SIs, with imperfect and stale information, perform the same resource allocation simultaneously, herd behavior will happen, leaving some resources overloaded while others sit idle.
[Figure] Fig. 1. Job Scheduling Model (SI: Scheduling Instance; R: Resource).
This abstract model still holds in the grid environment. The autonomy of each site and the network latency make the information available for scheduling imperfect and stale, and they even worsen these characteristics compared to the supercomputer environment. With
imperfect and stale information, however, the existing grid scheduling methods may well cause herd behavior and thus degrade system performance. How to prevent herd behavior in grid scheduling is therefore an important problem for improving system performance.
The main contributions of this paper are as follows:
1) On the basis of the balls and bins model, we propose a novel stochastic algorithm, named the multi-choice algorithm, that reduces the herds incurred in grid scheduling by at least one order of magnitude.
2) By combining the advantages of DHT and data replication techniques, we bridge the gap between the theoretical achievement (the homogeneous balls and bins model) and a complex and realistic world (the heterogeneous distributed grid environment).
3) Simulations demonstrate that the multi-choice algorithm provides better scheduling performance while incurring much less overhead than conventional algorithms. In the case of 1000 resources, the simulations show that, under heavy load (i.e., system utilization rate 0.5), the multi-choice algorithm reduces the average job waiting time by a factor of 8 and the average job turn-around time by 12% compared to the greedy algorithms.
The rest of this paper is structured as follows. Section 2 compares our method to related work. Section 3 presents our scheme to prevent herd behavior. Section 4 describes our implementation and algorithms. Section 5 presents simulations supporting our claims. Finally, we summarize our conclusions in Section 6.
2. Related work
The goal of the Grid Application Development Software (GrADS) Project is to provide programming tools and an execution environment to ease program development for the Grid. Its scheduling procedure involves the following general steps: (i) identify a large number of sets of resources that might be good platforms for the application, (ii) use the application-specific mapper and performance model specified in the COPs (configurable object programs) to generate a data map and predict execution time for those resource sets, and (iii) select the resource set that results in the best rank value (e.g. the smallest predicted execution time).
The goal of Ninf is to develop a platform for global scientific computing with computational resources distributed in a worldwide network. One of its
components, named the metaserver, provides the scheduler and the resource monitor. The scheduler obtains the computing servers' load information from the resource information database and decides on the most appropriate server depending on the characteristics of the requested function (usually the server that would yield the best response time while maintaining reasonable overall system throughput).
Although the details of the above scheduling systems vary dramatically, they share a trait that can cause herd behavior: all of them are in nature greedy algorithms, selecting the best resource set according to some criteria. When jobs arrive simultaneously at multiple scheduling instances, the identical schedules taken by these scheduling instances may pile up on the system, and herd behavior occurs.
Condor[5] is a distributed batch system for sharing the workload of compute-intensive jobs in a pool of UNIX workstations connected by a network. In a pool, the Central Manager is responsible for job scheduling, finding matches between various ClassAds[6,7], in particular between resource requests and resource offers, choosing the one with the highest Rank value (non-integer values are treated as zero) among compatible matches and breaking ties according to the provider's rank value[6]. To allow multiple Condor pools to share resources by sending jobs to each other, Condor employs a distributed, layered flocking mechanism, whose basis is formed by the Gateway machines (at least one in every participating pool)[8]. The purpose of these Gateway machines is to act as resource brokers between pools. A Gateway chooses at random a machine from the availability lists received from the other pools to which it is connected, and represents that machine to the Central Manager for job allocation. When a job is assigned to the Gateway, the Gateway scans the availability lists in random order until it encounters a machine satisfying the job requirements. It then sends the job to the pool to which this machine belongs. Although DHT can be used to automatically organize Condor pools[9], the fashion of scheduling in a Condor flock remains unchanged.
Scheduling in Condor would not cause herd behavior, because: (1) it defines a scope (the Condor pool) within which only one scheduling instance is running, so that scheduling in each scope does not affect the others; and (2) stochastic selection is used when scheduling in Condor flocking. Some disadvantages are obvious: a large scope raises concerns about performance bottlenecks and a single point of failure, while a small scope prevents certain optimizations due to the layered Condor flocking mechanism[8]. Its scope leads to a lack of efficiency and flexibility to
share grid resources. Nevertheless, its stochastic approach is a key to preventing herd behavior.
Randomization has been widely reported to be useful for scheduling. In [10], it was pointed out that the parameters of performance models may exhibit a range of values because, in non-dedicated distributed systems, execution performance may vary widely as other users share resources. In [11], experience showed that introducing some degree of randomness into several of the heuristics for host selection can lead to better scheduling decisions in certain cases. When multiple task choices achieve, or are very close to, the required minimum or maximum of some value (e.g. minimum completion time, maximum sufferage, etc.), the idea is to pick one of these tasks at random rather than letting the choice be guided by the implementation of the heuristics. The paper [12] described a methodology that allows schedulers to take advantage of stochastic information (choosing a value from the range given by the stochastic prediction of performance) to improve application execution performance. In the next section, we present a novel stochastic algorithm based on the balls-and-bins model.
3. How to avoid herd behavior
Recall that herd behavior is caused by identical decisions simultaneously piling up on a system. It is the imperfect and stale information that leads multiple individuals to take the same action. With perfect and up-to-date information (the current state of the resources and all the decisions made by others are known), each scheduling instance can properly use greedy algorithms, making appropriate decisions to maximize system performance, and herd behavior will not take place. In a grid environment, however, the information available for scheduling is inevitably imperfect and stale. In this case, stochastic algorithms can be used instead of greedy algorithms to keep individuals from taking the same action. In this paper, we introduce a stochastic algorithm, based on the balls and bins model, into grid scheduling. However, grid scheduling cannot immediately benefit from the model because the model has very strong constraints. Therefore, we propose some techniques to overcome these constraints.
3.1. Balls and bins model
The balls and bins model is a theoretical model for studying load balance. Here, we introduce some of its results.
Suppose that we sequentially place n balls into n boxes by putting each ball into a randomly chosen box. It is well known that when we are done, the fullest box has, with high probability, (1 + o(1)) ln n / ln ln n balls in it. Suppose instead that for each ball we choose d (d >= 2) boxes at random and place the ball into the one that is less full at the time of placement. Azar[13] has shown that, with high probability, the fullest box then contains only ln ln n / ln d + Theta(1) balls, exponentially fewer than before. Further, Vöcking[14] found that choosing the bins in a non-uniform way results in better load balancing than choosing the bins uniformly.
Mitzenmacher[3] demonstrated the importance of using randomness to break symmetry. For systems relying on imperfect information to make scheduling decisions, "deterministic" heuristics that choose the lightest-loaded server do poorly and significantly hurt performance (even for "continuously updated but stale" load information) due to "herd behavior", but a small amount of randomness (e.g. the "d-random" strategy) suffices to break the symmetry and give performance comparable to that possible with perfect information.
This multiple-choice idea, that is, the concept of first selecting a small subset of alternatives at random and then placing the ball into one of these alternatives, has several applications in randomized load balancing, data allocation, hashing, routing, PRAM simulation, and so on[15]. For example, it has recently been employed in DHT load balancing[16].
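As an illustration of the gap between single-choice and d-choice placement, the following short Python sketch (our own illustration, not part of the paper's toolchain) estimates the maximum bin load under both policies for the homogeneous model described above.

import random

def max_load_single_choice(n):
    # Place n balls, each into one uniformly random bin.
    load = [0] * n
    for _ in range(n):
        load[random.randrange(n)] += 1
    return max(load)

def max_load_d_choice(n, d):
    # Place n balls; for each ball sample d bins and pick the least loaded.
    load = [0] * n
    for _ in range(n):
        candidates = [random.randrange(n) for _ in range(d)]
        best = min(candidates, key=lambda b: load[b])
        load[best] += 1
    return max(load)

if __name__ == "__main__":
    n = 1000
    print("single choice:", max_load_single_choice(n))  # roughly ln n / ln ln n
    print("d = 2 choices:", max_load_d_choice(n, 2))    # roughly ln ln n / ln 2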
3.2. How to employ this model
When we use such an algorithm for load balancing in grid scheduling systems, balls represent jobs or tasks, bins represent machines or servers, and the maximum load corresponds to the maximum number of jobs assigned to a single machine. However, there are three essential problems which we must tackle before employing this model.
Organization of grid resources
In the original theory, it is assumed that all of the n bins are known and that we can find any bin when placing a ball. In grid environments, however, we must realize this assumption by organizing all grid resources in the resource management layer.
Constraints imposed by jobs on resources
In the original theory, there is no constraint on the choice of the d bins; that is, a ball can be placed into any one of the n bins if we do not care about load balance. In grid scheduling applications, however, some jobs may have certain resource requirements (e.g.
CPU clock rate, MEM size, etc.), which limits the scope of the d choices. Hence, we must make appropriate resource choices for each job.
Probability of choosing a resource
In the original theory, all bins are homogeneous and each one is chosen with the same probability (1/n); it does not matter which bin carries the maximum load. In grid scheduling applications, however, resources are drastically heterogeneous and their capabilities (e.g. CPU, MEM) differ greatly from one another. Therefore, we must carefully consider how to adjust the probability of choosing a resource, because it now does matter which resource is allocated the maximum load.
We will bridge these gaps in the following
subsections.
3.2.1. Organization of grid resources. There are three typical ways of organizing resources, and each can be found in an existing project.
Flat organization: a flock of Condor. As stated in Section 2, sharing resources between Condor pools requires the bi-connection of their Gateways. All of these Gateways are placed in the same plane, each connecting directly to the others when flocking. The disadvantage of a flat organization is obvious: for a global sharing of resources, a fully connected network is created by the Gateways, which might be unsuitable for a large-scale environment such as a grid because of the burden placed on these Gateways; otherwise, it lacks the efficiency to share a few rare resources among all users.
Hierarchical organization: MDS[17] of Globus[18]. In MDS, aggregate directories provide often specialized, VO-specific views of federated resources, services, and so on. Like a Condor pool, each aggregate directory defines a scope within which search operations take place, allowing users and other services to perform efficient discovery without scaling to large numbers of distributed information providers within a VO. Unlike the Condor flocking mechanism, MDS builds an aggregate directory in a hierarchical way, as a directory of directories. One disadvantage of MDS is that each directory must be maintained by an organization. Another disadvantage is that when more and more directories are added (and we envision a root directory consisting of a very large number of directories) and many searches are directed to the root directory, concerns about performance bottlenecks and a single point of failure arise.
P2P organization: SWORD[19] of PlanetLab[20]. SWORD is a scalable resource discovery service for wide-area distributed systems.
The implementations of SWORD use the Bamboo DHT[21], but the approaches generalize to any DHT because only the key-based routing functionality of the DHT is used. One disadvantage of this decentralized DHT-based resource discovery infrastructure is the relatively poor performance of collecting information compared to a centralized architecture.
All three methods above have their advantages as well as disadvantages. Nevertheless, we adopt DHT-based resource management for the following reasons.
(1) The advantages of using DHT. A DHT inherits the fast search and load balance merits of the hash table, and its key properties include self-organization, decentralization, routing efficiency, robustness, and scalability. Moreover, as a middleware with simple APIs, DHT features simplicity and ease of large-scale application development. Besides, by using it as an infrastructure, applications automatically inherit the properties of the underlying DHT.
(2) The small overhead of the d-random strategy. As mentioned above, the disadvantage of using DHT is the relatively poor performance of collecting information compared to a centralized architecture. However, our d-choice strategy only submits d queries to the DHT, each taking a time complexity of O(log n). Hence, our query complexity is only O(d log n), far less than that of collecting all information, which requires traversing the whole DHT and takes a time complexity of O(n).
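As a rough back-of-the-envelope check (our own arithmetic, using the parameters from the experiments later in the paper): with n = 1000 resources and d = 15, the d-choice strategy issues on the order of d * log2(n) = 15 * 10 = 150 routing hops per scheduling decision, whereas collecting the complete resource list would require on the order of n = 1000 node visits.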
Based on the above, the powerful theoretical results of the d-choice strategy, combined with the fast lookup of DHT, can produce a practical scheduling system that provides good performance even with imperfect information while incurring a small overhead.
In this paper, we use Chord[22] to organize
resources. Note that other DHTs can also be employed,
and the advanced research on DHT can benefit our
scheduling system too.
The application using Chord is responsible for
providing desired authentication, caching, replication,
and user-friendly naming of data. Chord’s flat keyspace eases the implementation of these features.
Chord provides support for just one operation: given a
key, it maps the key onto a node. Depending on the
application using Chord, that node might be
responsible for storing a value associated with the key.
Data location can be easily implemented on top of
Chord by associating a key with each data item, and
storing the <key, data> pair at the node to which the
key maps.
As SWORD of PlanetLab has successfully designed and implemented a DHT-based resource discovery system that can answer multi-attribute range queries, we simply follow its key-mapping design in this part. In this way, our approach can readily be deployed on top of SWORD.
For simplicity, we use multiple Chord instances, one per resource attribute. For example, if a resource defines two attributes, <CPU, MEM>, we create two Chord rings, each corresponding to one attribute.
Chord places no constraint on the structure of the
keys it looks up: the Chord key-space is flat. This
gives applications a large amount of flexibility in how
they map their own names to Chord keys. Fig. 2 shows
how we map a measurement of one attribute to a DHT
key. The length of value bits is allocated according to
the range of attribute value. The random bits spread
measurements of the same value among multiple DHT
nodes, to load balance attributes that take on a small
number of values. In this way, we can get an ordered
DHT ring.
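The following Python sketch illustrates one plausible way to build such a key, assuming 32-bit keys with 8 high-order value bits and 24 low-order random bits; the exact bit widths are not specified in the paper, so these are illustrative assumptions.

import random

VALUE_BITS = 8      # assumed width for the measured attribute value
RANDOM_BITS = 24    # assumed width for the randomizing suffix

def attribute_to_key(value, max_value=255):
    """Map a measured attribute value (e.g. CPU = 86 MFLOPS) to a DHT key.

    The high-order bits encode the (clamped) value so that keys are ordered
    by attribute value; the low-order bits are random so that equal values
    spread over several DHT nodes."""
    clamped = min(int(value), max_value)
    suffix = random.getrandbits(RANDOM_BITS)
    return (clamped << RANDOM_BITS) | suffix

print(hex(attribute_to_key(86)))   # e.g. 0x56...... for CPU = 86, as in Fig. 2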
[Figure] Fig. 2. Mapping a measured value to a DHT key: the high-order bits hold the attribute value and the low-order bits are random (e.g. CPU = 86 MFLOPS maps to key 0x563A81D9).
[Figure] Fig. 3. Resource organization for the CPU attribute: resources A-D (CPU capabilities 86, 120, 220, and 135 MFLOPS) are placed on an ordered Chord ring; each node keeps a Chord routing table (predecessor, successor, Finger[0:m]) and a resource list storing the replicas whose keys fall into its key space (e.g. Node B stores A, B, and D, while Nodes C and D store none).
Fig. 3 shows an example of how to organize resources for an attribute. When a resource joins the DHT, it is assigned a random node key and holds the key space between its predecessor's node key and its own. It is responsible for storing the data whose keys fall into the key space it holds, answering queries whose keys fall into this space, and forwarding to another node any query whose key lies outside this space.
3.2.2. Constraints imposed by jobs on resources. Next, we focus on how to choose d resources that satisfy the job requirements.
We take Fig. 3 as an example. Suppose that a job requires a resource with CPU >= 100 (MFLOPS) and MEM >= 700 (MB) to run. We query resources on the DHT instance corresponding to the CPU attribute. Note that it is a big challenge to perform this query on multiple DHT instances
simultaneously, because the intersection of these query results may well be a null set. We first generate d queries, each having a random CPU_Query_Key in the interval [0x64000000, 0xFFFFFFFF]. Then all these d queries are launched on the corresponding DHT instance. With the help of the Chord routing algorithms, each query reaches a proper DHT node whose key space contains the CPU_Query_Key. This DHT node then checks its stored data to see whether there is a valid resource satisfying the job requirements. If so, it sends the valid resource to the querying node for further use; if not, it alters the CPU_Query_Key to fit the key space held by its successor and then forwards the query to its successor. This process iterates until a valid resource is returned, or the query reaches the first node in the DHT key space (the one having the minimum node key), for example, Node A in Fig. 3. Therefore, all resources returned by the d queries, if not nil, satisfy the job requirements.
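A minimal Python sketch of this d-query procedure follows; the ring is modeled as a sorted list of (node_key, resource_list) pairs, and all names are our own, chosen only to mirror the description above.

import random
from bisect import bisect_left

def issue_d_queries(ring, d, min_key, satisfies, max_key=0xFFFFFFFF):
    """ring: sorted list of (node_key, resources) pairs on one attribute's Chord ring.
    Returns up to d resources that satisfy the job requirements."""
    results = []
    node_keys = [key for key, _ in ring]
    for _ in range(d):
        query_key = random.randint(min_key, max_key)
        # Chord routes the query to the node responsible for query_key
        # (here simplified to the first node whose key is >= query_key).
        start = bisect_left(node_keys, query_key)
        # Forward along successors; stop after the minimum-key node, which is
        # reached by wrapping around once, as described in the text above.
        for idx in list(range(start, len(ring))) + [0]:
            valid = [r for r in ring[idx][1] if satisfies(r)]
            if valid:
                results.append(valid[0])
                break
    return results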
3.2.3. Probability of choosing a resource. From the point of view of load balance, the number of jobs assigned to a resource should be proportional to its capability when the system is moderately or heavily loaded, whereas, when the system is lightly loaded, users will favor fast machines to run their jobs. To achieve both goals, intuitively, we should appropriately increase the probability of choosing high-capability resources.
There are two existing techniques for this purpose. The first is the virtual server used in DHT load balancing[23], and the second is data replication. In [23], each host acts as one or several virtual servers, each behaving as an independent peer in the DHT. The host's load is thus the sum of its virtual servers' loads. By adjusting the number of working virtual servers (for example, making the number of virtual servers of a host proportional to its capacity), we can achieve DHT load balance among all hosts.
From the point of view of our algorithm, however, one of its disadvantages is that it requires the DHT middleware itself to support this technique. Some DHT middleware can, but certainly not all.
To make our system run over different platforms with flexible parameter settings, we also adopt an application-layer method. Data replication has been used extensively in wide-area distributed systems (Content Distribution Networks). It aims at decreasing the accesses to hot-spots and increasing the access QoS. Here, we instead use data replication to appropriately increase the accesses to high-capability resources.
To preserve the constraints jobs place on resources, we place replicas of a resource on DHT nodes whose node keys are not greater than the maximum key corresponding to the capability of that resource. Taking Fig. 3 as an example, we create a replica for Resource C. The CPU_Key of this replica should be in the interval [0x00000000, 0xDCFFFFFF], and hence this replica is stored on a DHT node whose node key is not greater than 0xDCFFFFFF. Therefore, any query reaching this DHT node can possibly obtain Resource C, because with high probability the CPU capability of Resource C is greater than that required by the query.
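Continuing the earlier key-layout assumption (8 value bits, 24 random bits), a replica key bounded by the resource's capability could be drawn as in the following sketch; a uniform draw is only one plausible choice, and the helper name is ours.

import random

RANDOM_BITS = 24   # assumed, as in the earlier key-mapping sketch

def replica_key(capability, max_value=255):
    """Draw a replica key no larger than the maximum key implied by the
    resource's capability (e.g. <= 0xDCFFFFFF for CPU = 220)."""
    upper = (min(int(capability), max_value) << RANDOM_BITS) | ((1 << RANDOM_BITS) - 1)
    return random.randint(0, upper)

print(hex(replica_key(220)))   # always <= 0xdcffffff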
The number of replicas of a resource, as well as the distribution of their keys, affects the probability of choosing this resource. As the number increases, the possibility of herd behavior increases too. How to determine and adjust the number of replicas and the distribution of their keys is an open issue in this scheduling application.
In the above, several techniques have been introduced to prevent herd behavior in grid scheduling. Firstly, the balls and bins model is a powerful tool for this problem. Secondly, we employ DHT and data replication to bridge the gap between the homogeneity of the model and the heterogeneity of the grid environment. We believe that the d-random strategy, combined with the fast lookup of DHT, can provide a good system for grid job scheduling.
4. Our implementation
In this section, we describe our system architecture
and related algorithms. Each resource acts as a
scheduling instance and independently performs job
scheduling.
Fig.4 shows the modules in a DHT node.
[Figure] Fig. 4. Modules in a DHT node: a job proxy (job submission and launch), a scheduling module, a resource index, and a resource update module, all built on the DHT module (Chord) over TCP/IP; the circled numbers mark the six scheduling steps described below.
When a resource joins the system, it first joins the Chord DHT; the Chord join algorithm can be found in [22]. Then it places a certain number of resource replicas on the DHT and updates these replicas periodically to keep them alive, using the algorithm shown in Fig. 5. The resource update module in Fig. 4 is responsible for collecting the information about the local resource and for generating and updating replicas. The resource index module uses a list to store the replicas that other resources place on the DHT.
The scheduling process includes the following six steps:
STEP1: The job proxy module receives a job from a user, extracts the job's running requirements and performance model, and requests scheduling with these requirements and the performance model.
STEP2: The scheduling module generates d queries according to the job requirements and then places these queries on the DHT. The job requirements as well as the performance model are included in each of these d queries. Here, d is a system parameter that is predefined or dynamically adjusted according to the system utilization rate. These d query keys must be no less than a certain minimum value to keep the job constraints, as stated in Section 3.2.2, and are drawn from a uniform or non-uniform, independent or dependent distribution between this minimum value and the maximum key of the DHT key space.
STEP3: With the help of the Chord routing algorithms, these queries reach their respective destination DHT nodes. In the destination node, the query is processed by the resource index module.
STEP4: The resource index module checks its stored resource list for a resource satisfying the job's running requirements. If there is none, the query is forwarded to the successor node (until it reaches the first node of the DHT key space, as stated in Section 3.2.2) and the process returns to STEP3. If multiple resources satisfy the requirements, a competition algorithm is performed to return one of the valid resources. The valid resource is then returned to the source node of the query.
STEP5: When a query result returns, we can use one-more-hop to collect up-to-date information about the resource indicated in the returned query result.
STEP6: When all d queries have returned, the resource with the best rank among these d resources according to the performance model is chosen and reported to the job proxy module. The job proxy module then launches the job on that resource.
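A condensed Python sketch of steps 2-6 follows; the performance-model call anticipates the PM formula given in Section 5, the injected query_dht callable stands in for one DHT query routed as in Section 3.2.2, and all names are illustrative, not the paper's implementation.

def performance_model(resource):
    # Rank candidates as in Section 5: PM = capability / (1 + load).
    return resource["capability"] / (1.0 + resource["load"])

def schedule_job(job, query_dht, d=15):
    """STEP2-STEP6: issue d queries, then pick the best-ranked returned resource.

    query_dht(min_key) returns a resource dict (with 'capability' and 'load')
    or None when no valid resource is found."""
    min_key = job["min_query_key"]                       # encodes the job constraints
    candidates = [query_dht(min_key) for _ in range(d)]  # STEP2-STEP4
    candidates = [r for r in candidates if r is not None]
    if not candidates:
        return None
    # STEP5 (the one-more-hop refresh) is omitted here; STEP6 picks the best rank.
    return max(candidates, key=performance_model)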
5. Experiment
A performance model, which can be expressed in, for example, ClassAd, is widely used to rank candidate resources when scheduling. In this paper, we use only one resource attribute (CPU capability, expressed as an integer meaning, for example, the number of millions of floating-point operations it can process per second) to simulate the performance of our scheduling system. Figueira[24] modeled the effect of contention on a single-processor machine, and a performance model can be given as
PM = CPU_capability / (1 + load),
where load is the number of jobs being executed on the single-processor machine.
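For instance, under this model an idle 100 MFLOPS resource ranks as PM = 100 / (1 + 0) = 100, while a 128 MFLOPS resource already running one job ranks as PM = 128 / (1 + 1) = 64.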
All algorithms use the one-more-hop, as mentioned in Section 4, to get up-to-date resource information and reduce the herd effect, which means that the information update interval T is set to zero. Nevertheless, the information we collect with the one-more-hop is still "stale" due to network delays, as the experimental results later show.
5.1. Environment
First we introduce our simulation tools. BRITE[25] (Boston University Representative Internet Topology gEnerator) is used to generate network topologies, and NS2[26] (Network Simulator) is used to simulate network behavior.
5.1.1. BRITE. Based on an examination of the origin of the power laws in Internet topologies shown by empirical studies, Medina[25] built a topology generator named BRITE that does a fairly good job of reproducing some properties of Internet topologies. We use BRITE to generate a top-down network with 1000 nodes.
These 1000 nodes act as 1000 resources, each scheduling its jobs independently. Using a normal distribution with mean 128 and standard deviation 32, we assign a stochastic value to the capability of each resource.
FUNCTION Replica_Creation_Algorithm (Input Parameter: r)
// Number of replicas ~= Capability / r; r = 0 means no replicas
BEGIN
    IF (r == 0) THEN RETURN;
    Local_Replica_List.set_nil();
    rep_num = local_resource.capability DIV r;
    remainder = local_resource.capability MOD r;
    // round the fractional part up with probability remainder / r
    IF (RANDOM(r) <= remainder) THEN rep_num = rep_num + 1;
    FOR (i = 0; i < rep_num; i++)
    BEGIN
        NEW replica;
        replica.key = HASH_REPLICA(local_resource.CPU); // key bounded by the capability (Section 3.2.3)
        replica.resource = local_resource;
        PLACE_DHT_ITEM(replica);            // place replica on the DHT
        Local_Replica_List.insert(replica); // kept for periodic update
    END
END
Fig. 5. Replica creation algorithm.
5.1.2. NS2. The NS2 is a discrete event simulator
targeted at networking research. It provides substantial
support for simulation of TCP, routing, and multicast
protocols over wired and wireless (local and satellite)
networks. We use the release ns-allinone-2.28, and
program and simulate our system based on NS2.
We run our experiments on a Dawning 4000A server with 20 nodes, each having dual 1.8 GHz AMD Opteron processors and 4.0 GB of memory. The OS is SUSE Linux Enterprise Server 9.0 and the kernel is 2.6.5-7.97-smp.
5.2. Data and analysis
In this paper, we assume that the job size is fixed and that the execution time of such a job on a resource with 128 MFLOPS capability is 30 minutes. The minimum CPU capability required to run a job is a stochastic value uniformly distributed in [0, 200]. Moreover, we assume that the arriving jobs form a Poisson process at every node. Note that ρ = 0.1 means every node receives 0.2 jobs per hour; similarly, ρ = 0.3 means every node receives 0.6 jobs per hour, and ρ = 0.5 means every node receives one job per hour.
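A small Python sketch of this workload generator, under the stated assumptions (arrival rate 2ρ jobs per hour per node, i.e. exponential inter-arrival times, plus the uniform CPU requirement), might look as follows; the function name is ours.

import random

def generate_node_workload(rho, horizon_hours):
    """Poisson job arrivals at one node: rate 2*rho jobs/hour (rho = 0.1 -> 0.2 jobs/h).
    Each job carries a uniform CPU requirement in [0, 200] MFLOPS."""
    rate = 2.0 * rho                     # jobs per hour
    jobs, t = [], 0.0
    while True:
        t += random.expovariate(rate)    # exponential inter-arrival time
        if t > horizon_hours:
            break
        jobs.append({"arrival_h": t, "min_cpu": random.uniform(0, 200)})
    return jobs

print(len(generate_node_workload(rho=0.5, horizon_hours=24)))   # about 24 jobs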
5.2.1. Algorithm comparison. We compare our algorithm with greedy algorithms, which schedule a job to the resource with the best rank among all resources according to the performance model and which are used by many projects, as stated in the related work.
Our scheduling system is affected mostly by three factors: the number of choices (i.e. d), the capability distribution of the resources, and the number of replicas and the distribution of their keys.
The replica policy is listed in Fig. 5. We use the parameter r to control the number of replicas that a resource generates, in order to study the relation between replicas and scheduling performance. All of the replicas' keys obey the uniform distribution.
The d-choice algorithm has been shown in Section 4 (six steps). The competition algorithm, used to return one of the valid resources from the resource list in a DHT node as mentioned in STEP4 of Section 4, is listed in Fig. 6. Here, in order to reduce herd behavior, it employs a stochastic greedy selection.
Fig. 9.a shows the results of the common greedy algorithm listed in Fig. 7. From this figure, we can see that almost all points congregate into three parallel bands. The greedy algorithm always chooses the fastest machine when scheduling. However, for most jobs (Job ID < 7500), there is a blank area at the bottom of the figure (turn-around time < 1100 sec.), because the
nodes with high capability were assigned multiple jobs at the same time, which caused the turn-around time of these jobs to be two or three or even more times longer than expected. This figure demonstrates that the common greedy algorithm causes heavy herd behavior in the grid environment. Fig. 9.b shows the results of the stochastic greedy algorithm listed in Fig. 8. From this figure, it can be seen that herd behavior happens only occasionally (only a small number of points congregate around 2400 sec.). The results reveal that the stochastic method can greatly reduce herd occurrences and decrease the average turn-around time of jobs. Compared with Fig. 9.a, the average turn-around time of jobs drops from 1969.76 sec. to 1311.37 sec., which is only 66.58% of the former. Thus, the improved greedy algorithm makes a dramatic improvement in performance. Fig. 9.c depicts the results of the d-choice algorithm. In this figure, we can hardly find any herd occurrences. This algorithm further reduces the average turn-around time from 1311.37 sec. to 1296.68 sec. with much lower overheads, as shown in Tab. 1 and Tab. 2.
From the above results, we see that the d-choice algorithm is much better than the common greedy algorithm, and gains a slight performance improvement over the stochastic greedy algorithm while having much lower overheads. Next, we shall investigate which factors impact our algorithm and how.
[Figure] Fig. 9. Detailed turn-around time of each job when ρ = 0.1; the X-axis stands for jobs submitted in time sequence. (a) the common greedy algorithm; (b) the stochastic greedy algorithm; (c) the d-choice algorithm with d = 15, r = 128.
FUNCTION Competition_Algorithm ()
BEGIN
    FOR each resource satisfying the job requirements:
        Calculate its PM according to the performance model;
    RETURN the resource having MAX{RANDOM(PM)};
END
Fig. 6. Competition algorithm used in the resource index module.
FUNCTION Greedy_Algorithm ()
BEGIN
    Collect all resources that satisfy the job requirements;
    Calculate these resources' PMs according to the performance model;
    RETURN the resource having MAX{PM};
END
Fig. 7. Common greedy algorithm.
FUNCTION Stochastic_Greedy_Algorithm ()
BEGIN
    Collect all resources that satisfy the job requirements;
    Calculate these resources' PMs according to the performance model;
    RETURN the resource having MAX{RANDOM(PM)};
END
Fig. 8. Stochastic greedy algorithm.
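The listings above leave RANDOM(PM) unspecified; one plausible reading, used in the Python sketch below, is an independent uniform draw in [0, PM] for each candidate, so that higher-PM resources are favored without being chosen deterministically. This is our interpretation, not a detail stated in the paper.

import random

def stochastic_greedy_select(resources, performance_model):
    """Pick the resource maximizing a random draw in [0, PM(resource)].

    Deterministic greedy (max PM) would send simultaneous jobs to the same
    machine; the random scaling breaks that symmetry while still favoring
    high-PM resources."""
    scored = [(random.uniform(0.0, performance_model(r)), r) for r in resources]
    return max(scored, key=lambda pair: pair[0])[1] if scored else None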
5.2.2. Performance of the d-choice algorithm. In the following figures, which correspond to various parameter settings, we give only the average turn-around time instead of the detailed turn-around time of each job, in order to make the results clearer.
Fig. 11 shows the average turn-around times with different parameter settings (parameters d and r). It is clear that the replica policy does improve the performance of the scheduling system, because every d-choice strategy with replicas (r < ∞) outperforms the one without replicas (r = ∞), even if the number of replicas is small (r = 128). On the other hand, as the parameter d increases, the average turn-around time steadily decreases. When d reaches 15, the d-choice strategy outperforms the stochastic greedy algorithm in all cases: in the heavily loaded case (ρ = 0.5), a maximum improvement of 12.71% is achieved by the d-choice strategy with d = 15 and r = 8, and even in the lightly loaded case (ρ = 0.1), an improvement of 1.07% is achieved by the d-choice strategy with d = 15 and r = 128. These figures also show that, as the parameter d increases, the average turn-around time decreases only slowly. That is to say, the parameter d might be related to, for example, the number of resources, and we deduce that making d too large does not help to improve the scheduling performance.
Compared to the parameter d, the parameter r exerts less influence on the performance. What surprises us most is that increasing the number of replicas has only a small impact on the job turn-around time, possibly because we use a stochastic algorithm to select a resource from the resource list in a DHT node. In any case, the existence of replicas does improve system performance.
Tab. 1 shows the number of query messages processed by the DHT. For the d-choice strategy, the number of query messages is about twice the theoretical number, (d · log(n)) · NumOfJob / n, because, as analyzed above, the resource lists in some DHT nodes may be nil, which causes additional query messages to be forwarded along the DHT ring until a valid resource is found or the query message reaches the first node of the DHT key space. Nevertheless, it is less than that of the stochastic greedy algorithm, which collects all resources to perform scheduling. Above all, Tab. 2 shows that the scheduling waiting (query) time of the d-choice strategy is far less than that of the stochastic greedy algorithm, only about 12% of the latter (2 sec. vs. 17 sec.). For a large-scale
network, this advantage in both DHT query load and scheduling waiting (query) time becomes even more significant. The two tables also show that the overheads of the algorithms are hardly affected by ρ, because they depend on the query processes rather than on the system load.
From Tab. 3, it can also be seen that herd behavior is effectively prevented by the d-choice strategy, because the number of its unexpected competitions stays stable as ρ increases, while the stochastic greedy algorithm suffers greatly from herding (the number of its unexpected competitions increases from 249 to 1191).
In fact, herd behavior is not a direct metric for evaluating a system, but it impacts both the job turn-around time and its standard deviation. From Tab. 3, we can see that the number of unexpected competitions caused by the stochastic greedy algorithm is almost proportional to ρ. When the percentage of herd jobs (jobs with unexpected competitions) among all jobs is small (i.e. 2.5% for ρ = 0.1), herding has very little impact on the average turn-around time; however, when the percentage increases to 11.9% (ρ = 0.5), the impact on the average turn-around time can easily be seen in Fig. 11.c.
On the other hand, Tab. 4 shows the standard deviation of the job turn-around times for the algorithms. The standard deviation shows how widely the job turn-around times are dispersed; a system with a small standard deviation promises more stable performance than one with a large standard deviation. The results in Tab. 4 show that our algorithm is superior to the stochastic greedy algorithm.
Tab. 1. Avg. query load of a DHT node
                      ρ = 0.1   ρ = 0.3   ρ = 0.5
Stochastic greedy       6050      6050      6059
d=6,  r=8               1324      1327      1322
d=9,  r=8               2047      2021      2023
d=12, r=8               2747      2725      2717
d=14, r=8               3196      3193      3184
d=15, r=8               3418      3427      3431
Tab. 2. Avg. waiting (query) time (sec.)
                      ρ = 0.1   ρ = 0.3   ρ = 0.5
Stochastic greedy      16.67     16.68     16.68
d=6,  r=8               1.33      1.33      1.33
d=9,  r=8               1.64      1.61      1.61
d=12, r=8               1.82      1.81      1.81
d=14, r=8               1.92      1.92      1.92
d=15, r=8               1.96      1.96      1.97
[Figure] Fig. 10. Turn-around time under different parameter settings, with panels (a) ρ = 0.1, (b) ρ = 0.3, and (c) ρ = 0.5. The dashed line marks the average job turn-around time of the stochastic greedy algorithm, which is not affected by the replica policy; dots below this dashed line indicate that the corresponding algorithms achieve better performance than the stochastic greedy algorithm.
276
Tab. 3. Number of unexpected competitions¹
                      ρ = 0.1   ρ = 0.3   ρ = 0.5
Stochastic greedy        249       782      1191
d=6,  r=8                  5         9         7
d=9,  r=8                 17        19        17
d=12, r=8                 21        24        31
d=14, r=8                 29        30        36
d=15, r=8                 29        36        33
Tab. 4. Standard deviation of job turn-around time
                      ρ = 0.1   ρ = 0.3   ρ = 0.5
Stochastic greedy        288       570       863
d=6,  r=8                318       473       906
d=9,  r=8                230       395       663
d=12, r=8                209       364       655
d=14, r=8                202       347       650
d=15, r=8                199       357       617
5.3. Discussion
Fig. 10 shows the cumulative distribution of jobs over resources. It can be seen that the stochastic greedy algorithm allocates all jobs to only a small portion of the high-capability resources, which leads to a total of 249 unexpected competitions in the lightly loaded case and 1191 in the heavily loaded case, and thus degrades system performance. On the other hand, the d-choice algorithm disperses jobs over a wide range of resources, and the smaller the parameter d is, the wider the range of resources is.
From the above experimental results, it is clear that scheduling jobs only onto high-capability resources may cause herd behavior and thus degrade system performance. On the other hand, spreading jobs widely over heterogeneous resources may also degrade performance (because low-capability resources are allocated jobs while high-capability resources remain idle). Therefore, it is important to adjust the parameters d and r to trade off overburdening (herd behavior) against insufficient use of high-capability resources. Our experiments show that d = 15 is, in all three cases (lightly, moderately, and heavily loaded), one of the good tradeoffs that preserves both advantages: preventing herd behavior and providing high scheduling performance.
[Figure] Fig. 12. Cumulative allocation of jobs over resources, with panels (a) ρ = 0.1, (b) ρ = 0.3, and (c) ρ = 0.5. It shows the number of jobs allocated to resources ordered by descending capability. For example, panel (a) shows that all jobs are allocated to resources whose aggregate capability is the first 21.28%, 52.34%, and 85.41% of the total resource capability for the stochastic greedy algorithm, the d-choice strategy with d=15, r=8, and the d-choice strategy with d=6, r=8, respectively.
¹ For a job, if the submission load (the resource load at the time the job is submitted to the resource for running) is larger than the query load (the load used to make the resource selection when scheduling), we increase the number of unexpected competitions by 1 and call this job a herd job.
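A minimal sketch of this counter, under the definition in the footnote above (the field names are ours):

def count_unexpected_competitions(jobs):
    """A job is a 'herd job' if the load seen at submission exceeds the load
    that was observed when its scheduling query was answered."""
    return sum(1 for job in jobs if job["submission_load"] > job["query_load"])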
6. Conclusion
On the basis of the theoretical results of the homogeneous balls and bins model, we proposed a novel stochastic algorithm and employed it in the grid environment. Our analysis and experiments have demonstrated that the multi-choice strategy, combined with the advantages of DHT, can prevent herd behavior in a large-scale sharing environment such as a grid, thereby providing better scheduling performance with much less incurred overhead than the common greedy algorithm and the stochastic greedy algorithm.
The balls and bins model has been intensively investigated in the past, so many powerful theoretical results can be used to further improve this system, and further performance gains may be possible in the future.
Acknowledgement
We would like to thank the anonymous reviewers who gave us much valuable advice for revising this manuscript. This work was supported in part by the National Natural Science Foundation of China under grants 90412010 and 90412013, China's National Basic Research and Development 973 Program (2003CB317008, 2005CB321807), and the Theory Foundation of ICT (No. 20056130).
References
[1] F. Berman, A. Chien, K. Cooper, J. Dongarra, I. Foster, L.
J. Dennis Gannon, K. Kennedy, C. Kesselman, D. Reed, L.
Torczon, and R. Wolski, “The GrADS Project: Software
Support for High-Level Grid Application Development,”
International Journal of High-performance Computing
Applications, 15(4), Winter 2001.
[2] M. Sato, H. Nakada, S. Sekiguchi, S. Matsuoka, U. Nagashima, and H. Takagi, "Ninf: A Network based Information Library for a Global World-Wide Computing Infrastructure," Proc. of HPCN'97 (LNCS-1225), pages 491-502, 1997.
[3] M. Mitzenmacher, “How Useful is Old Information?”
IEEE Trans. Parallel and Distributed Systems, vol. 11, no. 1,
pp. 6-20, Jan. 2000.
[4] W. Cirne and F. Berman, “When the Herd Is Smart: The
Aggregate Behavior in the Selection of Job Request,” IEEE
Trans. Parallel and Distributed Systems, vol. 14, no. 2, pp.
181-192, Feb. 2003.
[5] M. J. Litzkow, M. Livny, and M. W. Mutka, “Condor – A
Hunter of Idle Workstations,” Proc. 8th International
Conference on Distributed Computing Systems (ICDCS
1988), pages 104–111, San Jose, CA, 1988.
[6] R. Raman, M. Livny, and M. Solomon, “Matchmaking:
Distributed Resource Management for High Throughput
Computing,” HPDC-7, pages 140–146, Chicago, IL, 1998.
[7] R. Raman, M. Livny, and M. Solomon, "Resource Management through Multilateral Matchmaking," HPDC-9, pages 290-291, Pittsburgh, PA, 2000.
[8] D. H. J. Epema, M. Livny, R. van Dantzig, X. Evers, and J. Pruyne, "A Worldwide Flock of Condors: Load Sharing among Workstation Clusters," Future Generation Computer Systems, 12, 1996.
[9] A. R. Butt, R. Zhang, and Y. C. Hu, "A Self-Organizing Flock of Condors," ACM/IEEE SuperComputing '03.
[10] D. Zagorodnov, F. Berman, and R. Wolski, "Application Scheduling on the Information Power Grid," International Journal of High-Performance Computing, 1998.
[11] H. Casanova, A. Legrand, and D. Zagorodnov et al., "Using Simulation to Evaluate Scheduling Heuristics for a Class of Applications in Grid Environments," Ecole Normale Superieure de Lyon, RR1999-46, 1999.
[12] J. M. Schopf and F. Berman, "Stochastic Scheduling," ACM/IEEE SuperComputing '99.
[13] Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal, "Balanced Allocations," Proc. 26th Annual ACM Symposium on the Theory of Computing (STOC 94), pp. 593-602, 1994.
[14] B. Vöcking, "How Asymmetry Helps Load Balancing," Journal of the ACM, Vol. 50, No. 4, pp. 568-589, 2003.
[15] B. Vöcking, "Symmetric vs. Asymmetric Multiple-Choice Algorithms (invited paper)," Proc. of 2nd ARACNE Workshop (Aarhus, 2001), pp. 7-15.
[16] J. Ledlie and M. Seltzer, "Distributed, Secure Load Balancing with Skew, Heterogeneity, and Churn," IEEE INFOCOM 2005, March 2005.
[17] K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman, "Grid Information Services for Distributed Resource Sharing," HPDC-10, August 2001.
[18] The Globus Alliance, http://www.globus.org/
[19] D. Oppenheimer, J. Albrecht, D. Patterson, and A. Vahdat, "Scalable Wide-Area Resource Discovery," UC Berkeley Technical Report UCB//CSD-04-1334, July 2004.
[20] The PlanetLab Home, http://www.planet-lab.org/
[21] S. Rhea, D. Geels, T. Roscoe, and J. Kubiatowicz, "Handling Churn in a DHT," Proc. of the USENIX Annual Technical Conference, 2004.
[22] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications," ACM SIGCOMM 2001, San Diego, CA, August 2001, pp. 149-160.
[23] P. B. Godfrey and I. Stoica, "Heterogeneity and Load Balance in Distributed Hash Tables," IEEE INFOCOM 2005, March 2005.
[24] S. M. Figueira and F. Berman, "Modeling the Effects of Contention on the Performance of Heterogeneous Application," HPDC, 1996.
[25] A. Medina, A. Lakhina, I. Matta, and J. Byers, "BRITE: An Approach to Universal Topology Generation," Proc. of the International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunications Systems (MASCOTS '01), Cincinnati, Ohio, Aug. 2001.
[26] The Network Simulator NS2, http://www.isi.edu/nsnam/ns/