An Experimental Study on How to Build Efficient Multi-Core Clusters for High Performance Computing

2008 11th IEEE International Conference on Computational Science and Engineering

Luiz Carlos Pinto, Luiz H. B. Tomazella, M. A. R. Dantas
Distributed Systems Research Laboratory (LaPeSD)
Department of Informatics and Statistics (INE)
Federal University of Santa Catarina (UFSC)
{luigi, tomazella, mario}@inf.ufsc.br
Nowadays, multi-processor (SMP) and multi-core (CMP) technologies are increasingly finding their way into cluster computing. Inevitably, clusters built with SMP and also CMP-SMP nodes will become more and more common. Lacking a widely accepted term for the CMP-SMP cluster design, we refer to both architectures as CLUMPS, the usual term for a cluster of SMP nodes.
Traditional MPI programs follow the SPMD (Single Program, Multiple Data) parallel programming model, which was designed essentially for cluster architectures built from nodes with a single processing unit, that is, single-core nodes. In a modern cluster built with multi-core, multi-processor nodes, by contrast, access to the interconnect fabric is shared by the locally executing processes. Main memory access and the typically deeper cache hierarchy of CMP-SMP nodes may also slow down inter-process communication, since the bus and memory subsystem of each node are shared. Thus, in such a modern cluster, the cost of moving data between communicating cores is a function not only of their physical distance (inside a processor socket, inside a node, or across nodes) but also of shared memory and network bandwidth limitations.
Our motivation stems from the importance of realizing, from the point of view of an architectural designer, that modern multi-core cluster designs create a different scenario for performance prediction. This pressing need to understand the trade-offs among such architectural cluster designs guided our research and finally led to a new approach for setting up more efficient clusters of commodities. Thus, as an alternative to the use of non-commodity interconnects such as Myrinet and Infiniband, we propose a way to build economically more accessible commodity clusters with higher performance.
Abstract
Multi-core technology produces a new scenario for communicating processes in an MPI cluster environment, and the trade-offs involved consequently need to be uncovered. This motivation guided our research and led to a new approach for setting up more efficient clusters built with commodities. Thus, as an alternative to the use of non-commodity interconnects such as Myrinet and Infiniband, we present a proposal based on leaving cores idle relative to application processing in order to build economically more accessible commodity clusters with higher performance. Execution of the fine-grained IS algorithm from the NAS Parallel Benchmark revealed a speedup of up to 25%. Interestingly, a cluster organized according to the proposed setup was able to outperform a single multi-core SMP host in which all processes communicate inside the host. Empirical results therefore indicate that our proposal is successful for medium- and fine-grained algorithms.
1. Introduction
Scientific applications used to be executed mostly on expensive, proprietary massively parallel processing (MPP) machines. As processing power and communication speed increasingly become off-the-shelf products, building clusters of commodities [27] has been taking a large share of the high performance computing (HPC) world [5].
Not long ago, identical single-processor computing nodes used to be aggregated to form a cluster, also known as a NoW (Network of Workstations). Such a parallel architecture demands a distributed-memory programming interface such as MPI [1] for inter-process communication. As each computing node has its own memory subsystem and its own path to the interconnect fabric, each MPI process executes largely independently of the others.
Lastly, other works focus on characterizing the NPB algorithms. Kim and Lilja [20], Tabe and Stout [17], Martin [15], and Faraj and Yuan [16] concentrate on MPI-based NPB algorithms in order to determine the types of communication, the size of messages, and the quantity and frequency of communication phases. Additionally, Sun, Wang and Xu [19], and Subhlok, Venkataramaiah and Singh [18] take into account the amount of transferred data as well as processor and memory usage in order to characterize the NPB.
A preliminary evaluation of the impact of multi-core technology on cluster performance has shown quite surprising results for scientific computing. For applications with small computation-to-communication ratios, the potentially advantageous characteristics of “many-core” hosts yield little superiority over the performance of “few-core” hosts. Furthermore, depending on the specific application, even a loss of performance and efficiency is observed.
We investigated the intra-node and inter-node communication behavior of four distinct cluster setups, described in Table 1, with MPI micro-benchmarks. Although a hybrid programming model, with MPI for inter-node parallelism and OpenMP for intra-node parallelism, is often proposed as the most efficient way to use multi-core computing nodes within a cluster [3, 4], traditional MPI programming is likely to remain important for portability reasons and to cope with the huge set of existing MPI-based applications.
Moreover, in order to bring this study closer to a “real-world” application environment, all five MPI-based kernel algorithms of the NAS Parallel Benchmark suite [2] (NPB) were run on the same four cluster setups and analyzed altogether. The NPB is derived from real computational fluid dynamics (CFD) applications required by NASA.
This paper is organized as follows: related work is discussed in Section 2, our proposal of cluster setups is described in Section 3, and experimental results are analyzed in Section 4. Conclusions and future work are presented in Section 5, and acknowledgements are found in Section 6.
3. Proposed setup approach
High performance computing based on commodities has become feasible with the growing popularity of multi-core technology and of Gigabit Ethernet interconnects. Moreover, a computing host with more than one core offers the possibility of leaving, for instance, one core idle relative to application processing. Our proposal consists of leaving cores idle on some or all hosts of a cluster so that they can absorb the communication overhead of a running application.
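One simple way to realize such a placement in practice is an ordinary MPI machine file that requests fewer ranks per host than there are cores. The sketch below is only illustrative: the host names and core counts are hypothetical and this is not necessarily the mechanism used in this study.

```
# Hypothetical machine file for a 2-host cluster whose nodes have 4 cores each.
# Requesting only 3 ranks per host leaves one core per host idle with respect
# to application processing, free to absorb MPI communication overhead.
node01:3
node02:3
```

The file is then handed to mpiexec or mpirun together with the desired total number of ranks; the exact option name depends on the MPI process manager in use.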
First, we define a few terms. A core is the atomic processing unit of a computing system. A socket contains one or more cores. A host, or node, is a single machine containing one or more sockets that share resources such as main memory and interconnect access. A cluster, also referred to as a system, is a set of interconnected hosts.
Table 1. Cluster architectures and hardware setups
2. Related work
Works related to this study fall into three streams: the impact of multi-core technology on cluster performance, the impact of dedicated network processors, and the characterization of the NAS Parallel Benchmark.
The impact of multi-core technology on cluster performance is investigated in several works. Chai, Hartono and Panda [26] focus on intra-node communication between MPI processes, whereas Pourreza and Graham [25] also take into account the advantages of resource sharing; both rely on communication micro-benchmarks only. Differently, Alam et al. [21] base their investigation on the characterization of scientific applications.
Moreover, some other works relate to this study in that they investigate the impact of dedicated processors for communication processing, although with non-commodity interconnects such as Myrinet and Infiniband. These are the works of Lobosco, Costa and de Amorim [24], of Brightwell and Underwood [22], and of Pinto, Mendonça and Dantas [23], all of which focus on broadly used scientific applications such as the NAS Parallel Benchmark (NPB).
Table 1 describes all four clusters: computing nodes, setups and interconnects. The Xeon-based cluster [6] runs Linux kernel 2.6.8.1 and the Opteron-based cluster [7] runs Linux kernel 2.6.22.8; both have SMP support enabled.
All systems use at most 8 cores for application processing. Systems A and C have idle cores whereas systems B and D do not. System A has one idle core per host, whereas in system C all 4 idle cores reside in the same host, so that the second host has no idle cores.
Moreover, the latency results for systems A and C are similar, whereas system D shows the lowest latency for all message lengths. In the specific case of two-way communication, the random ordering of the ring has no effect; it behaves as if it were naturally ordered. Processes on systems B and D run on the same host and thus use the host bus to communicate. Processes on systems A and C, on the other hand, run on different hosts and thus communicate over the Gigabit Ethernet LAN.
MPICH2 [8, 9], version 1.0.6, is used as the MPI library implementation on all systems. It is important to emphasize that Gigabit Ethernet places all communication processing on a host processor: system calls as well as protocol and packet processing are performed by a host CPU. Differently, Myrinet [14] and Infiniband [12] NICs are equipped with a dedicated network processor which is in charge of protocol processing; furthermore, the communication flow bypasses the OS via DMA data transfers. Figure 1 presents the distinct data flows of the Ethernet technology and of the VI Architecture [28].
Figure 2. Latency for one-way and two-way
communication between 2 processes
From Figure 3, we can state that (1) bandwidth for either one-way or two-way communication on systems B and D is greater than on systems A and C for any message length. Moreover, (2) the bandwidth behavior of two-way communication on system D and of one-way communication on system B are quite similar. (3) Two-way communication bandwidth on system B is similar to its one-way pattern for small and medium-sized messages, but (4) for messages larger than 32 KB its pattern becomes flat and of lower performance. (5) The bandwidth patterns of one-way and two-way communication on systems A and C are very similar. However, (6) bandwidth is greater for two-way than for one-way communication for messages of up to 8 KB on system C and up to 64 KB on system A. That is in part because (7) b_eff calculates one-way bandwidth based on maximum latency, while two-way bandwidth is based on average latency. In any case, (8) for larger messages, two-way bandwidth reaches up to 80% of one-way bandwidth.
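Assertion (7) can be written out explicitly. With L the message length and t_i the latencies measured over the process pairs, b_eff reports, as described above, approximately

\[
b_{\text{one-way}} \approx \frac{L}{\max_i t_i}, \qquad
b_{\text{two-way}} \approx \frac{L}{\operatorname{avg}_i t_i},
\]

so the one-way figure is penalized by the slowest pair, which partly accounts for observation (6).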
Based on assertions (4) and (6) above, the idle cores of systems A and C do not seem to have a great positive effect on the performance of either one-way or two-way inter-process communication.
Figure 1. Dataflow of Ethernet and VI Architecture
The systems are kept idle while awaiting the experiments to be run.
4. Results
4.1. Communication benchmark: b_eff
In order to characterize the bandwidth and latency of all four systems, we ran the b_eff communication benchmark [10], a version of which is part of the HPC Challenge Benchmark [11]. However, that version only tests bandwidth and latency for messages of 8 and 2,000,000 bytes. We therefore adapted b_eff to evaluate communication over a wider range of message sizes, from 2 bytes to 16 megabytes.
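For readers who want to reproduce a comparable measurement without the full b_eff harness, the following self-contained MPI ping-pong sketch sweeps the same range of message sizes. It is a simplified stand-in written for this transcription, not the adapted b_eff code itself.

```c
/* Minimal MPI ping-pong sweep over message sizes from 2 bytes to 16 MB.
 * Illustrative stand-in only; NOT the adapted b_eff benchmark used here. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 100;
    const size_t max_len = 16UL * 1024 * 1024;
    char *buf = malloc(max_len);

    for (size_t len = 2; len <= max_len; len *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {            /* rank 0 sends and waits for the echo */
                MPI_Send(buf, (int)len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)len, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {     /* rank 1 echoes the message back */
                MPI_Recv(buf, (int)len, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        /* Half the round-trip time approximates the one-way latency. */
        double t = (MPI_Wtime() - t0) / (2.0 * reps);
        if (rank == 0)
            printf("%10zu bytes  %10.2f us  %10.2f MB/s\n",
                   len, t * 1e6, (double)len / t / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Run with two ranks (for example, mpiexec -n 2 ./pingpong), it prints one latency and bandwidth estimate per message size.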
In Figure 2, the latency for 2 communicating processes is shown as a function of message length. Results show that the Ethernet-based systems A and C exhibit higher latency than systems B and D when communicating medium and large messages, in either one-way or two-way mode.
On the one hand, the bandwidth of system C shows a decreasing pattern, and its overall inter-process communication performance is considerably worse than that of system A. On the other hand, the bandwidth of system A shows an increasing pattern, and its overall inter-process communication performance is the second best of all.
This is explained by the fact that system A has one idle core per host, whereas system C has four idle cores on only one host. For system C, overall performance is therefore held back by the slowest host, the one with all of its cores busy. For system A, the opposite is observed: each idle core offloads some of the communication overhead. Roughly speaking, one core is in charge of the user-level MPI process while the second one executes the MPI communication operations.
Figure 3. Bandwidth for one-way and two-way
communication between 2 processes
However, the one-way and two-way communication patterns alone do not provide a complete behavioral overview, so results for 8 simultaneously communicating processes are additionally presented in Figure 4.
The NAS Parallel Benchmark consists of 8 benchmark programs. Five of them are kernel benchmarks (EP, FT, IS, CG and MG) and the other three are considered simulated application benchmarks (SP, BT and LU) [2]. The NPB version used is 2.3, and this study focuses on the five kernel benchmarks only.
The algorithms were compiled identically for all systems, using the O3 optimization directive, with mpif77 for the Fortran codes and mpicc for IS, the only algorithm written in C. All kernel benchmarks were run for class B size. Experiments were repeated 5 times for each case in order to obtain a fair mean execution time [13].
First, we present the NPB algorithms that perform predominantly collective communication, followed by the algorithms dominated by point-to-point communication. The presentation also follows a descending order of granularity, i.e. the ratio of computation to the amount of communication the algorithm performs. Greater demands for communication among processes characterize lower granularity; conversely, a small amount of data communicated infrequently characterizes a coarse-grained algorithm and therefore a higher computation-to-communication ratio.
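Stated compactly, and merely restating the definition above, the granularity of an algorithm can be read as

\[
G = \frac{T_{\text{computation}}}{T_{\text{communication}}},
\]

so coarse-grained kernels such as EP have a large G, while fine-grained kernels such as IS, CG and MG have a small G.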
The EP (Embarrassingly Parallel) algorithm consists of a great deal of computation and negligible communication. It provides an estimate of the upper limit of floating-point computational power. Communication occurs only to distribute the initial data and to gather the partial results into a final result. Thus, EP is coarse-grained.
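Schematically, an embarrassingly parallel kernel amounts to independent local work followed by a single collective that combines the partial results. The sketch below illustrates only this pattern and is not the NPB EP code.

```c
/* Schematic of an embarrassingly parallel kernel: heavy local computation,
 * one reduction at the end (not the actual NPB EP implementation). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process works on its own slice; no communication in this phase. */
    double partial = 0.0;
    for (long i = rank; i < 100000000L; i += size)
        partial += 1.0 / (1.0 + (double)i * (double)i);

    /* Communication happens only once, to combine the partial results. */
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("result = %.12f\n", total);

    MPI_Finalize();
    return 0;
}
```

The single MPI_Reduce at the end is the only communication, which is what keeps the computation-to-communication ratio so high.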
Although systems A and B run at a 10% higher clock frequency (Table 1), they are outperformed by systems C and D for the EP algorithm, as shown in Figure 5. In fact, this positive impact is driven mostly by the more efficient memory subsystem of systems C and D.
Figure 4. Latency and bandwidth for 8 communicating
processes in a randomly ordered ring.
In Figure 4, the results indicate that the best overall inter-process communication performance is that of system D; note, however, that no network access is needed because it is a single SMP host. Still, there is a great negative impact on bandwidth for messages larger than 64 KB. The worst overall inter-process communication performance is that of system B, which has no idle cores: each pair of processes competes for access to main memory and to the network card, because the host bus is shared.
An interesting issue now concerns the communication performance of systems A and C. Both systems are set up according to our proposal, with idle cores that can take charge of communication processing. This gives the impression that their performance should be better and, in some way, similar. The results in Figure 4, however, show otherwise.
In conclusion, we assume higher per-core performance of systems C and D compared to systems A and B.
As we can see in Figure 6, systems A and B scale better than systems C and D. The bandwidth requirements of systems A and B are lower than those of systems C and D because of their lower per-core performance [19]. Note that with 8 processes on systems B and D, all cores are busy running the application. As a result, execution times on systems A and D turn out to be practically the same, despite the higher per-core performance of system D. That is basically because the idle cores of system A act as if they were dedicated network processors, allowing more computing power for the application when communication is asynchronous, and also decreasing accesses to main memory and the overhead due to context switches between application, OS and MPI communication processing.
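The benefit attributed above to asynchronous communication comes from posting a transfer and continuing to compute while it progresses, which is exactly where a core left idle for communication processing can help. A minimal nonblocking pattern is sketched below; it is illustrative only and not taken from the NPB sources.

```c
/* Overlapping computation with communication via nonblocking MPI calls
 * (illustrative pattern only). The receive and send are posted up front,
 * independent local work runs while the transfer is in flight, and
 * MPI_Waitall blocks only when the remote data is actually needed. */
#include <mpi.h>

void exchange_and_compute(const double *sendbuf, double *recvbuf, int n,
                          double *work, int m, int peer, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* Post both transfers without blocking. Neither buffer may be touched
     * until MPI_Waitall below confirms completion. */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
    MPI_Isend((void *)sendbuf, n, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

    /* Computation that depends on neither buffer proceeds meanwhile. */
    for (int i = 0; i < m; i++)
        work[i] = work[i] * 0.5 + 1.0;

    /* Block only when the communicated data is required. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```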
The IS (Integer Sort) benchmark is the only NPB algorithm that does not focus on floating-point computation, as it is an integer bucket sort. It is dominated by reductions and unbalanced all-to-all communications, relying on a random key distribution for load balancing, which means the communication pattern depends on the data set [15, 17]. In any case, the granularity of IS is smaller than that of EP and FT, and it is characterized as fine-grained. Additionally, messages are smaller than in FT: the average message size is medium, since messages range from small to large although only a few are actually mid-sized [20].
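The communication skeleton of such a distributed bucket sort is essentially an exchange of per-destination counts followed by an unbalanced all-to-all of the keys themselves. The fragment below sketches that pattern only; it is not the NPB IS source, and the function and variable names are ours.

```c
/* Skeleton of the unbalanced all-to-all key exchange typical of a
 * distributed bucket sort (schematic only, not the NPB IS code).
 * sendcounts[i] = number of keys this rank routes to rank i. */
#include <mpi.h>
#include <stdlib.h>

void exchange_keys(const int *keys_out, const int *sendcounts,
                   int **keys_in, int *total_in, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    int *recvcounts = malloc(size * sizeof(int));
    int *sdispl = malloc(size * sizeof(int));
    int *rdispl = malloc(size * sizeof(int));

    /* Every rank learns how many keys it will receive from every other rank. */
    MPI_Alltoall((void *)sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

    sdispl[0] = rdispl[0] = 0;
    for (int i = 1; i < size; i++) {
        sdispl[i] = sdispl[i - 1] + sendcounts[i - 1];
        rdispl[i] = rdispl[i - 1] + recvcounts[i - 1];
    }
    *total_in = rdispl[size - 1] + recvcounts[size - 1];
    *keys_in = malloc(*total_in * sizeof(int));

    /* The key exchange itself: message sizes depend on the key distribution,
     * hence the "unbalanced" all-to-all mentioned in the text. */
    MPI_Alltoallv((void *)keys_out, (int *)sendcounts, sdispl, MPI_INT,
                  *keys_in, recvcounts, rdispl, MPI_INT, comm);

    free(recvcounts); free(sdispl); free(rdispl);
}
```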
Figure 5. Mean execution time for EP class B.
The FT (FFT 3D PDE) algorithm solves a 3D partial differential equation using a series of 1D FFTs. It requires a large amount of floating-point computation as well as communication, although mostly very large messages (in the range of megabytes) sent at a low frequency. That is because the NPB authors have put effort into aggregating messages at the application level in order to minimize message cost [15]. The result is a mid-grained algorithm with a perfectly balanced all-to-all communication pattern, which means each process sends and receives the same amount of data. Additionally, although the required bandwidth increases proportionally to per-core performance, it does not increase as the number of cores is scaled up [19].
Figure 7. Mean execution time for IS class B.
Once again, as shown in Figure 7, systems A and B outperform systems C and D as the number of processes is scaled up. Note, however, that with 8 processes the loss of efficiency of systems C and D is even greater than for the FT algorithm (compare with Figure 6). This loss of efficiency is due to a higher frequency of inter-process communication, even though messages are mid-sized on average.
Figure 6. Mean execution time for FT class B.
Moreover, when frequent communication phases alternate with short computation phases, overlapping communication and computation becomes difficult.
The CG (Conjugate Gradient) benchmark applies a conjugate gradient method; it consists of floating-point computation and tests frequent and irregular long-distance point-to-point communication [2]. Although it is also computing-intensive, CG is characterized as a fine-grained algorithm because of the large number of messages communicated. The average message size is smaller than that of FT and IS, with a predominant number of small messages, only a few bytes long, and the rest mostly large messages [20].
Figure 9. Mean execution time for MG class B.
The better performance of FT, IS, CG and MG on system A, which has one core per host idle with respect to application processing, when compared to system D, a single SMP host with 8 cores, indicates that the proposed cluster setup is advantageous for gaining performance on clusters of commodities.
Furthermore, Table 2 presents the impact of not leaving one core idle in each host: the NPB algorithms were executed with 16 processes on system A. Coarse-grained EP achieved a considerable speedup, whereas the medium- and fine-grained algorithms lost performance when compared with the execution with 8 processes, also on system A. These results confirm and quantify the benefits of the proposed cluster setup for medium- and fine-grained applications.
Figure 8. Mean execution time for CG class B.
Fine-grained algorithms do not take as much advantage of greater per-core performance as mid-grained and coarse-grained applications. Compared to FT and IS, Figure 8 shows that the execution times of CG on systems A and B for an increasing number of processes are closer to those of systems C and D. Moreover, systems A and B even surpass systems C and D with 8 running processes.
The MG (Multi-Grid) algorithm is a simplified multi-grid method that approximates a solution to the discrete Poisson problem. It tests both short- and long-distance communication in a frequent manner, with predominantly point-to-point communication between neighboring processes. Communication phases are somewhat evenly distributed throughout the execution, so MG is a fine-grained algorithm with small computation phases. Message size is medium on average because processes exchange messages of many sizes, from small to large, in a uniform pattern [15, 20].
Such characteristics hold back the higher per-core performance of systems C and D. As shown in Figure 9, the execution time of MG with 8 processes on system A is even shorter than on systems B, C and D.
Table 2. Speedup on System A

Exec.      EP        FT        IS        CG        MG
8 proc.    47.696    53.232    3.234     45.268    6.798
16 proc.   25.860    57.492    4.066     49.988    7.724
Speedup    45.78%    -8.00%    -25.73%   -10.43%   -13.62%
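Read together with Table 2, the speedup column is consistent with comparing the two mean execution times relative to the 8-process run:

\[
\text{speedup} = \frac{T_{8\,\text{proc}} - T_{16\,\text{proc}}}{T_{8\,\text{proc}}} \times 100\%.
\]

For EP, for instance, (47.696 - 25.860) / 47.696 ≈ 45.78%, matching the table; the negative entries for FT, IS, CG and MG correspond to the losses of performance discussed above.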
5. Conclusion and Future Work
This research allowed us to identify that there is no simple relation for predicting the performance of multi-core clusters of commodities. Depending on the application, cluster behavior and performance may vary unexpectedly. Additionally, a detailed analysis of four distinct multi-core cluster systems helped to better characterize the trade-offs involved. Corroborating our preliminary evaluation, the performance of multi-core clusters tends to converge as application granularity decreases.
Based on our experiments, a detailed analysis allowed us to point out considerable benefits of the proposed cluster setup, in which one processing core per host is left idle with respect to application processing. The proposed approach may therefore introduce considerable performance gains. However, if no core is left idle in one host of a cluster, that host may hold back overall performance. This was evidenced by the results from b_eff and from the medium- and fine-grained NPB algorithms.
A cluster set up according to the proposed approach was able to outperform a single eight-core SMP host in which all communication occurs over the host bus and thus no networking is required. Since an Ethernet interconnect overloads host processors with communication overhead, efficiency is penalized when all processors of a host are busy running the application, because communication and application processing compete for them. The resulting performance, however, depends on application granularity and behavior.
Finally, this paper has succeeded in indicating economically more accessible alternatives, based on commodities only, for achieving better performance in clusters of small and medium size.
For future work, we plan to extend this study to other benchmarking suites and to full “real-world” applications, towards a broader analysis of the trade-offs of multi-core clusters. Ongoing research focuses on quantifying the benefits of the proposed approach compared to clusters interconnected with Myrinet and Infiniband. It is also important to verify the scalability of our proposal against the currently increasing number of cores within a single host.
6. Acknowledgements

This research was supported with cluster environments by OMEGATEC and Epagri, in collaboration with CAPES.

7. References

[1] Message Passing Interface Forum, “MPI: A Message-Passing Interface Standard, Rel. 1.1”, June 1995, www.mpi-forum.org.
[2] D. Bailey, H. Barszcz, et al., “The NAS Parallel Benchmarks”, International Journal of Supercomputer Applications, Vol. 5, No. 3, 1991, pp. 63-73.
[3] R. Rabenseifner and G. Wellein, “Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures”, International Journal of High Performance Computing Applications, Sage Science Press, Vol. 17, No. 1, 2003, pp. 49-62.
[4] F. Cappello and D. Etiemble, “MPI versus MPI+OpenMP on the IBM SP for the NAS benchmarks”, Supercomputing'00, Dallas, TX, 2000.
[5] H. Meuer, E. Strohmaier, J. Dongarra and H. D. Simon, Universities of Mannheim and Tennessee, “TOP500 Supercomputer Sites”, www.top500.org.
[6] Intel, “Intel® Xeon® Processor with 533 MHz FSB at 2GHz to 3.20GHz Datasheet”, Publication 252135, 2004.
[7] AMD, “AMD Opteron™ Processor Product Data Sheet”, Publication 23932, 2007.
[8] W. Gropp, E. Lusk, N. Doss and A. Skjellum, “A high-performance, portable implementation of the MPI message passing interface standard”, Parallel Computing, Vol. 22, No. 6, 1996, pp. 789-828.
[9] Message Passing Interface Forum, “MPI-2: Extensions to the Message-Passing Interface”, July 1997.
[10] R. Rabenseifner and A. E. Koniges, “The Parallel Communication and I/O Bandwidth Benchmarks: b_eff and b_eff_io”, Cray User Group Conference, CUG Summit, 2001.
[11] P. Luszczek, D. Bailey, J. Dongarra, J. Kepner, R. Lucas, R. Rabenseifner and D. Takahashi, “The HPC Challenge (HPCC) Benchmark Suite”, SC06 Conference Tutorial, IEEE, Tampa, Florida, 2006.
[12] D. Cassiday, “InfiniBand Architecture Tutorial”, Hot Chips 12, 2000.
[13] H. Jordan and G. Alaghband, “Fundamentals of Parallel Processing”, Prentice Hall, 2003.
[14] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic and W. Su, “Myrinet: A Gigabit-per-second Local Area Network”, IEEE Micro, 1995.
[15] R. Martin, “A Systematic Characterization of Application Sensitivity to Network Performance”, PhD thesis, University of California, Berkeley, 1999.
[16] A. Faraj and X. Yuan, “Communication Characteristics in the NAS Parallel Benchmarks”, Parallel and Distributed Computing and Systems, 2002.
[17] T. Tabe and Q. Stout, “The use of MPI communication library in the NAS parallel benchmarks”, Technical Report CSE-TR-386-99, University of Michigan, 1999.
[18] J. Subhlok, S. Venkataramaiah and A. Singh, “Characterizing NAS Benchmark Performance on Shared Heterogeneous Networks”, International Parallel and Distributed Processing Symposium, IEEE, 2002, pp. 86-94.
[19] Y. Sun, J. Wang and Z. Xu, “Architectural Implications of the NAS MG and FT Parallel Benchmarks”, Advances in Parallel and Distributed Computing, 1997, pp. 235-240.
[20] J. Kim and D. Lilja, “Characterization of Communication Patterns in Message-Passing Parallel Scientific Application Programs”, Communication, Architecture, and Applications for Network-Based Parallel Computing, 1998, pp. 202-216.
[21] S. R. Alam, R. F. Barrett, J. A. Kuehn, P. C. Roth and J. S. Vetter, “Characterization of Scientific Workloads on Systems with Multi-Core Processors”, International Symposium on Workload Characterization, IEEE, 2006, pp. 225-236.
[22] R. Brightwell and K. Underwood, “An Analysis of the Impact of MPI Overlap and Independent Progress”, International Conference on Supercomputing, 2004.
[23] L. C. Pinto, R. P. Mendonça and M. A. R. Dantas, “Impact of interconnects to efficiently build computing clusters”, ERRC, 2007.
[24] M. Lobosco, V. S. Costa and C. L. de Amorim, “Performance Evaluation of Fast Ethernet, Giganet and Myrinet on a Cluster”, International Conference on Computational Science, 2002, pp. 296-305.
[25] H. Pourreza and P. Graham, “On the Programming Impact of Multi-core, Multi-Processor Nodes in MPI Clusters”, High Performance Computing Systems and Applications, 2007.
[26] L. Chai, A. Hartono and D. Panda, “Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters”, IEEE International Conference on Cluster Computing, 2006.
[27] A. G. M. Rossetto, V. C. M. Borges, A. P. C. Silva and M. A. R. Dantas, “SuMMIT - A framework for coordinating applications execution in mobile grid environments”, GRID, 2007, pp. 129-136.
[28] D. Dunning et al., “The Virtual Interface Architecture”, IEEE Micro, 1998, pp. 66-76.