The Impact of Virtualization on High Performance
Computing Clustering in the Cloud
Master Thesis Report
Submitted in
Fall 2013
In partial fulfillment of the requirements for the degree of
Master of Science in Software Engineering at the School of
Science and Engineering of Al Akhawayn University in Ifrane
By
Ouidad ACHAHBAR
Supervised by
Dr. Mohamed Riduan ABID
Ifrane, Morocco
January, 2014
Acknowledgment
I would like to express my deepest and sincere gratitude to ALLAH for giving me guidance
and strength to complete this work, and for having the chance to study and accomplish my
master's degree with high support from my family, friends and professors. Thank you ALLAH.
I would also like to deeply thank my supervisor Dr. Abid for trusting me to conduct this
research, providing me with valuable feedback and overseeing my progress on a weekly basis.
Thank you Dr. Abid for your motivation and support.
My gratitude also goes to Dr. Haitouf who provided me with valuable comments and shared
with me his knowledge in cloud computing and distributed systems. Thank you Dr. Haitouf.
I am most thankful to my dear parents, brothers, sisters, nephews and fiancé for their
continuous support, encouragement and love. There are no words to express my gratitude to
all of you.
Many thanks go to my very close friends: Nora El Bakraoui Alaoui, Inssaf El Boukari, Sara
El Alaoui, Aida Tahditi, Jamila Barroug, Wafa Bouya and Chahrazad Touzani. Thank you for
being always by my side; thank you for sharing enjoyable moments with me, and thank you
for being my friends.
Last but not least, special acknowledgements go to all my professors for their support, respect
and encouragement. Thank you Ms. Hanaa Talei, Ms. Asmaa Mourhir, Dr. Naeem Nizar
Sheikh, Mr. Omar Iraqui, Dr. Violetta Cavalli Sforza, Dr. Kevin Smith and Dr. Harroud.
Ouidad Achahbar
Abstract
The ongoing pervasiveness of Internet access is greatly increasing big data production. This,
in turn, increases the demand for compute power to process the massive data, and thus renders
High Performance Computing (HPC) a highly solicited service.
Based on the paradigm of providing computing as a utility, the cloud offers user-friendly
infrastructures for processing these big data, e.g., High Performance Computing as a Service
(HPCaaS). Still, HPCaaS performance is tightly coupled with the underlying virtualization
technique, since the latter controls the creation of the virtual machine instances that carry
out data processing jobs.
In this thesis, we characterize and evaluate the impact of machine virtualization on HPCaaS.
We track HPC performance under different cloud virtualization platforms, namely KVM and
VMware ESXi, and compare it to the performance of a physical computing cluster
infrastructure. The virtualized environment is deployed using Hadoop on top of OpenStack.
The resulting HPCaaS runs MapReduce algorithms on benchmarked big data samples using a
granularity of 8 physical machines per cluster.
We obtained several interesting results when running the selected benchmarks on the virtualized
and physical clusters. Each tested cluster showed different performance trends. Yet, the overall
analysis of the research findings showed that the selection of virtualization technology can
lead to significant improvements when running and handling HPCaaS.
Abstract (translated from Arabic)
The continuing spread of Internet access and use is a main cause of the growing production of
big data. This, in turn, increases the demand for high computing capabilities to process these
data. These factors have made "High Performance Computing" a service of great interest.
Based on the model of providing computing as a utility, cloud computing offers easy-to-use
infrastructures for processing big data, for example "High Performance Computing as a
Service". However, the performance of the latter is closely tied to the virtualization technique,
since it controls the creation of the virtual machines (virtual computers) that carry out data
processing jobs.
In this thesis, we characterize and evaluate the impact of virtualization on "High Performance
Computing as a Service". We also track the performance of "High Performance Computing" on
different virtualized cloud platforms and on a physical cluster made up of eight computers. We
used OpenStack to build "High Performance Computing as a Service", and Hadoop to run
MapReduce algorithms on big data.
From the results of this research, we observed a significant change in the performance of
"High Performance Computing" as the data size, the cluster type (infrastructure: physical or
virtual) and the cluster size change. Nevertheless, the conclusion we reached shows that the
virtualization technique plays an important and considerable role in improving the performance
of "High Performance Computing".
Table of Contents
Acknowledgment
Abstract
Abstract (Arabic)
Table of Contents
List of Figures
List of Tables
List of Appendices
List of Acronyms
PART I: THESIS OVERVIEW
  Chapter 1: Introduction
    1.1. Background
    1.2. Motivation
    1.3. Problem Statement
    1.4. Research Question
    1.5. Research Objective
    1.6. Research Approach
    1.7. Thesis Organization
PART II: THEORETICAL BASELINES
  Chapter 2: Cloud Computing
    3.1. Cloud Computing Definition
    3.2. Cloud Computing Characteristics
    3.3. Cloud Computing Service Models
    3.4. Cloud Computing Deployment Models
    3.5. Cloud Computing Benefits
    3.6. Cloud Computing Providers
  Chapter 3: Virtualization
    4.1. Definition of Virtualization
    4.2. History of Virtualization
    4.3. Benefits of Virtualization
    4.4. Virtualization Approaches
    4.5. Virtual Machine Manager
  Chapter 4: Big Data and High Performance Computing as a Service
    5.1. Big Data
    5.2. High Performance Computing as a Service (HPCaaS)
  Chapter 5: Literature Review and Research Contribution
    5.1. Related Work
    5.2. Contribution
PART III: TECHNOLOGY ENABLERS
  Chapter 6: Technology Enablers Selection
    6.1. Cloud Platform Selection
    6.2. Distributed and Parallel System Selection
  Chapter 7: OpenStack
    7.1. OpenStack Overview
    7.2. OpenStack History
    7.3. OpenStack Components
    7.4. OpenStack Supported Hypervisors
  Chapter 8: Hadoop
    8.1. Hadoop Overview
    8.2. Hadoop History
    8.3. Hadoop Architecture
    8.4. Hadoop Implementation
    8.5. Hadoop Cluster Connectivity
PART III: RESEARCH CONTRIBUTION
  Chapter 9: Research Methodology
    9.1. Research Approach
    9.2. Research Steps
  Chapter 10: Experimental Setup
    10.1. Experimental Hardware
    10.2. Experimental Software and Network
    10.3. Clusters Architecture
    10.4. Experimental Performance Benchmarks
    10.5. Experimental Datasets Size
    10.6. Experiment Execution
  Chapter 11: Experimental Results
    11.1. Hadoop Physical Cluster Results
    11.2. Hadoop Virtualized Cluster - KVM Results
    11.3. Hadoop Virtualized Cluster - VMware ESXi Results
    11.4. Results Comparison
  Chapter 12: Discussion
    12.1. TeraSort
    12.2. TestDFSIO
    12.3. Conclusion
PART IV: CONCLUSION
  Chapter 13: Conclusion and Future Work
Bibliography
Appendix A: OpenStack with KVM Configuration
Appendix B: OpenStack with VMware ESXi Configuration
Appendix C: Hadoop Configuration
Appendix D: TeraSort and TestDFSIO Execution
Appendix E: Data Gathering for TeraSort
Appendix F: Data Gathering for TestDFSIO
List of Figures
Figure 1: Thesis organization
Figure 2: NIST visual model of cloud computing definition
Figure 3: Services provided in cloud computing environment
Figure 4: Full virtualization architecture
Figure 5: Paravirtualization architecture
Figure 6: Hardware assisted virtualization architecture
Figure 7: a) Type 1 hypervisor b) Type 2 hypervisor
Figure 8: Xen hypervisor architecture
Figure 9: KVM hypervisor architecture
Figure 10: VMware ESXi architecture
Figure 11: Data growth over 2008 and 2020
Figure 12: Active cloud community population
Figure 13: Active distributed systems population
Figure 14: OpenStack conceptual architecture
Figure 15: Nova subcomponents
Figure 16: Glance subcomponents
Figure 17: Keystone subcomponents
Figure 18: Swift subcomponents
Figure 19: Cinder subcomponents
Figure 20: Quantum subcomponents
Figure 21: Apache Hadoop subprojects
Figure 22: Hadoop Architecture
Figure 23: HDFS and MapReduce representation
Figure 24: Word count MapReduce example
Figure 25: Research steps
Figure 26: Hadoop Physical Cluster
Figure 27: Hadoop Physical Cluster architecture
Figure 28: Hadoop virtualized cluster - KVM
Figure 29: Hadoop virtualized cluster - VMware ESXi (a)
Figure 30: Hadoop virtualized cluster - VMware ESXi (b)
Figure 31: Experimental execution
Figure 32: TeraSort performance on Hadoop Physical Cluster
Figure 33: TeraSort performance for 100 MB on Hadoop Physical Cluster
Figure 34: TeraSort performance for 1 GB on Hadoop Physical Cluster
Figure 35: TeraSort performance for 10 GB on Hadoop Physical Cluster
Figure 36: TeraSort performance for 30 GB on Hadoop Physical Cluster
Figure 37: TestDFSIO-Write performance on Hadoop Physical Cluster
Figure 38: TestDFSIO-Write performance for 1 GB on Hadoop Physical Cluster
Figure 39: TestDFSIO-Write performance for 100 MB on Hadoop Physical Cluster
Figure 40: TestDFSIO-Write performance for 10 GB on Hadoop Physical Cluster
Figure 41: TestDFSIO-Write performance for 100 GB on Hadoop Physical Cluster
Figure 42: TestDFSIO-Read performance on Hadoop Physical Cluster
Figure 43: TestDFSIO-Read performance for 100 MB on Hadoop Physical Cluster
Figure 44: TestDFSIO-Read performance for 1 GB on Hadoop Physical Cluster
Figure 45: TestDFSIO-Read performance for 10 GB on Hadoop Physical Cluster
Figure 46: TestDFSIO-Read performance for 100 GB on Hadoop Physical Cluster
Figure 47: TeraSort performance on Hadoop KVM Cluster
Figure 48: TeraSort performance for 100 MB on Hadoop KVM Cluster
Figure 49: TeraSort performance for 1 GB on Hadoop KVM Cluster
Figure 50: TeraSort performance for 10 GB on Hadoop KVM Cluster
Figure 51: TeraSort performance for 30 GB on Hadoop KVM Cluster
Figure 52: TestDFSIO-Write performance on Hadoop KVM Cluster
Figure 53: TestDFSIO-Write performance for 100 MB on Hadoop KVM Cluster
Figure 54: TestDFSIO-Write performance for 1 GB on Hadoop KVM Cluster
Figure 55: TestDFSIO-Write performance for 10 GB on Hadoop KVM Cluster
Figure 56: TestDFSIO-Write performance for 100 GB on Hadoop KVM Cluster
Figure 57: TestDFSIO-Read performance on Hadoop KVM Cluster
Figure 58: TestDFSIO-Read performance for 100 MB on Hadoop KVM Cluster
Figure 59: TestDFSIO-Read performance for 1 GB on Hadoop KVM Cluster
Figure 60: TestDFSIO-Read performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 61: TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster
Figure 62: TeraSort performance on Hadoop VMware ESXi Cluster
Figure 63: TeraSort performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 64: TeraSort performance for 1 GB on Hadoop VMware ESXi Cluster
Figure 65: TeraSort performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 66: TeraSort performance for 30 GB on Hadoop VMware ESXi Cluster
Figure 67: TestDFSIO-Write performance on Hadoop VMware ESXi Cluster
Figure 68: TestDFSIO-Write performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 69: TestDFSIO-Write performance for 1 GB on Hadoop VMware ESXi Cluster
Figure 70: TestDFSIO-Write performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 71: TestDFSIO-Write performance for 100 GB on Hadoop VMware ESXi Cluster
Figure 72: TestDFSIO-Read performance on Hadoop VMware ESXi Cluster
Figure 73: TestDFSIO-Read performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 74: TestDFSIO-Read performance for 1 GB on Hadoop VMware ESXi Cluster
Figure 75: TestDFSIO-Read performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 76: TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster
Figure 77: Average time for sorting 100 MB on HPhC, HVC with KVM and
Figure 78: Average time for sorting 1 GB on HPhC, HVC with KVM and VMware ESXi
Figure 79: Average time for sorting 10 GB on HPhC, HVC with KVM and
Figure 80: Average time for sorting 30 GB on HPhC, HVC with KVM and
Figure 81: Average time for writing 1 GB on HPhC, HVC with KVM and VMware ESXi
Figure 82: Average time for writing 10 GB on HPhC, HVC with KVM and VMware ESXi
Figure 83: Average time for writing 100 GB on HPhC, HVC with KVM
Figure 84: Average time for reading 1 GB on HPC, HVC with KVM and VMware ESXi
Figure 85: Average time for reading 1 GB on HPC, HVC with KVM and HVC VMware ESXi
Figure 86: Average time for reading 10 GB on HPhC, HVC with KVM and HVC VMware ESXi
Figure 87: Average time for reading 100 GB on HPhC, HVC with KVM and HVC VMware ESXi
Figure 88: Memory overhead when running 30 GB (started at 4.55PM) on 8 VMware ESXi VMs
Figure 89: System latency reaches its peak (at 12.28PM) when running 30 GB on 8 VMware ESXi VMs
Figure 90: OpenStack warning statistics about system resources usage
List of Tables
Table 1: A comparison of cloud deployment models
Table 2: Cloud IaaS selection
Table 3: Parallel and distributed platform selection
Table 4: OpenStack releases
Table 5: OpenStack projects
Table 6: Apache Hadoop subprojects
Table 7: Dell OptiPlex 755 computer features (used for Hadoop physical cluster)
Table 8: Dell PowerEdge server used for building OpenStack & Hadoop virtualized cluster
Table 9: OpenStack virtual machines’ features
Table 10: Experimental performance metrics
Table 11: Datasets size used for Hadoop benchmarks
Table 12: Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes - Hadoop Physical Cluster
Table 13: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes - Hadoop Physical Cluster
Table 14: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes - Hadoop Physical Cluster
number of nodes - Hadoop KVM Cluster
different number of nodes - Hadoop KVM Cluster
Table 17: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes - Hadoop KVM Cluster
different number of nodes - Hadoop VMware ESXi Cluster
Table 19: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and
List of Appendices
Appendix A: OpenStack with KVM Configuration
Appendix B: OpenStack with VMware ESXi Configuration
Appendix C: Hadoop Configuration
Appendix D: TeraSort and TestDFSIO Execution
Appendix E: Data Gathering for TeraSort
Appendix F: Data Gathering for TestDFSIO
List of Acronyms
HPC: High Performance Computing
HPCaaS: High Performance Computing as a Service
VM: Virtual Machine
VMM: Virtual Machine Manager
EMC: American Multinational Corporation
DCI: Digital Communications Inc.
GFS: Google File System
HDFS: Hadoop Distributed File System
NDFS: Nutch Distributed File System
DOE: Department of Energy National Laboratories
NIST: National Institute of Standards and Technology
SaaS: Software as a Service
PaaS: Platform as a Service
IaaS: Infrastructure as a Service
NoSQL: Not Only Structured Query Language
SNIA: Storage Networking Industry Association
ACID: Atomicity, Consistency, Isolation and Durability
AWS: Amazon Web Services
HPhC: Hadoop Physical Cluster
HVC: Hadoop Virtualized Cluster
SSH: Secure Shell
JSON: JavaScript Object Notation
XML: Extensible Markup Language
API: Application Programming Interface
Amazon EC2: Amazon Elastic Compute Cloud
Amazon S3: Amazon Simple Storage Service
VLAN: Virtual Local Area Network
DHCP: Dynamic Host Configuration Protocol
Part I: Thesis Overview
This part introduces the key points to understand the purpose of the present research. It
provides an introduction of the research starting with its background, motivation, problem
statement, research question, objective and research methodology.
Chapter 1: Introduction
In this chapter, we first present the background of the present research, and then describe the
motivation and the problem behind conducting this study. After that, the questions, objectives,
and methodology of the research are stated. Finally, an outline of the thesis is given at the
end of this chapter.
1.1. Background
During the last decades, the demand for computing power has steadily increased as the data
generated by social networks, web pages, sensors, online transactions, etc. keeps growing.
A 2012 study by EMC, an American multinational corporation, estimated that from 2005 to
2020 data will grow by a factor of 300 (from 130 exabytes to 40,000 exabytes), meaning that
digital data will roughly double every two years [1]. This growth of data constitutes the
“Big Data” phenomenon.
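As a quick sanity check of the figures quoted above, the short Python sketch below recomputes the growth factor and the doubling period it implies. The variable names are illustrative only; the sole inputs are the 130 and 40,000 exabyte values and the 2005 to 2020 period cited from [1], and steady exponential growth is assumed.

```python
import math

# Figures quoted in the text (2005 to 2020), taken from reference [1].
start_exabytes = 130
end_exabytes = 40_000
years = 2020 - 2005

# Overall growth factor over the whole period.
growth_factor = end_exabytes / start_exabytes

# Number of doublings needed to reach that factor, assuming steady exponential growth.
doublings = math.log2(growth_factor)

# Implied time for the data volume to double once.
doubling_time_years = years / doublings

print(f"growth factor: {growth_factor:.0f}x")
print(f"doublings over {years} years: {doublings:.2f}")
print(f"implied doubling time: {doubling_time_years:.2f} years")
```

Under this assumption, the implied doubling time comes out slightly below two years, which is consistent with the "doubling every two years" statement.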
As Big Data grows in terms of volume, velocity and value, the current technologies for
storing, processing and analyzing data become inefficient and insufficient. A Gartner survey
stated that data growth is considered the largest challenge for organizations [2]. To address
this issue, High Performance Computing (HPC) has started to be widely integrated in managing
and handling Big Data. In this case, HPC is used to process and analyze Big Data related to
different problems, including scientific, engineering and business problems, that require high
computation capabilities, high bandwidth, and low-latency networks [3].
However, HPC still lacks the toolsets that fit the current growth of data. Therefore, new
paradigms and storage tools were integrated with HPC to deal with the current challenges
related to data management. These include providing computing as a
utility (cloud computing) and introducing new parallel and distributed paradigms.
Cloud computing plays an important role as it provides organizations with the ability to
analyze and store data economically and efficiently. Performing HPC in the cloud was
introduced as data started to be migrated to and managed in the cloud. Digital
Communications Inc. (DCI) stated that by 2020, a significant portion of digital data will be
managed in the cloud, and even if a byte in the digital universe is not stored in the cloud, it
will pass, at some point, through the cloud [4]. Performing HPC in the cloud is known as
High Performance Computing as a Service (HPCaaS). In short, HPCaaS offers a
high-performance, on-demand, and scalable HPC environment that can handle the complexity and
challenges related to Big Data [5].
One of the best-known and most widely adopted parallel and distributed systems is the MapReduce
model, which was developed by Google to keep up with the growth of its web search indexing
process [6]. MapReduce computations are performed with the support of a data storage system
known as Google File System (GFS). The success of both Google File System and MapReduce inspired
the development of Hadoop, a distributed and parallel system that implements
MapReduce and the Hadoop Distributed File System (HDFS) [7]. Nowadays, Hadoop is widely
adopted by big players in the market because of its scalability, reliability and low cost of
implementation. For these reasons, Hadoop has also been proposed for integration with HPC as an
underlying technology that distributes the work across an HPC cluster [8, 9].
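To make the MapReduce model more concrete, the following is a minimal single-process sketch of a word count in Python. It is not Hadoop's API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are hypothetical and chosen for illustration only. In a real cluster, each input split would be mapped on a different node and the grouped keys would be distributed across reducers.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits):
    """Run map, shuffle and reduce sequentially over all input splits."""
    mapped = []
    for split in splits:  # on a cluster, each split would be handled by a separate mapper
        mapped.extend(map_phase(split))
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    sample_splits = ["big data needs big clusters", "clusters process data"]
    print(word_count(sample_splits))
```

The sketch keeps the three phases as separate functions so that the data flow (key/value pairs out of map, grouped lists into reduce) stays visible, even though everything runs in one process here.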
1.2. Motivation
Many solutions have been proposed and developed to improve the computation performance of
Big Data. Some of them tend to improve algorithm efficiency, provide new distributed
paradigms or develop powerful clustering environments. However, few of those solutions have
addressed the whole picture of integrating HPC with the current emerging technologies in terms
of storage and processing.
As stated before, some of the most popular technologies currently used in hosting and
processing Big Data are cloud computing, HDFS and Hadoop MapReduce [10]. At present,
the use of HPC in cloud computing is still limited. The first step towards this research was
taken by the Department of Energy National Laboratories (DOE), which started exploring the
use of cloud services for scientific computing [11]. Besides, in 2009, Yahoo Inc. launched
partnerships with major universities in the United States to conduct more research on cloud
computing, distributed systems and high computing applications.
HPCaaS still needs more investigation to decide on appropriate environments that can fit high
computing requirements. One aspect of HPCaaS that has not yet been investigated is the impact
of different virtualization technologies on HPC in the cloud. Therefore, the motivation of this
research lies in the need to evaluate HPCaaS performance using MapReduce and
different virtualization techniques. This motivation is backed by a strong rationale:
MapReduce and cloud computing platforms are freely accessible as open-source software.
1.3. Problem Statement
Cloud computing offers a set of services for processing Big Data; one of these services is
HPCaaS. Still, HPCaaS performance is highly affected by the underlying virtualization
techniques, which are considered the heart of cloud computing. Accordingly, the problem
addressed in this research is formulated as follows: “HPCaaS is still facing poor performance
and still doesn’t fit Big Data requirements”.
1.4. Research Question
To address the problem statement, this thesis aims to answer the following
research questions:
1. What is the performance of HPC on Hadoop Physical Cluster (HPhC)?
2. Is it worth moving HPC to the cloud?
3. How do virtualization techniques affect HPCaaS performance?
4. Is there an optimal virtualization technique that can ensure good performance?
1.5. Research Objective
The purpose of the present research is to address the issues and questions raised
in the previous sections. Hence, this research introduces a new architecture that can handle
HPC complexity and increase its performance. The proposed architecture consists of building
a Hadoop Virtualized Cluster (HVC) in a private cloud using OpenStack. The first goal
of this research is to investigate the added value of adopting a virtualized cluster, and the
second goal is to evaluate the impact of virtualization techniques on HPCaaS.
1.6. Research Approach
To evaluate HPCaaS over different virtualization technologies, we followed both qualitative
and quantitative research methodologies. The qualitative approach was adopted to select
the appropriate technology enablers used to build an architecture that addresses
the issues raised in this study. The quantitative approach, on the other hand, was adopted to
conduct different experiments on three different clusters: Hadoop Physical Cluster (HPhC),
Hadoop Virtualized Cluster using KVM (HVC-KVM) [12] and Hadoop Virtualized Cluster
using VMware ESXi (HVC-VMware ESXi) [13]. Each experiment measures the
performance of HPC.
1.7. Thesis Organization
The rest of this thesis is structured as follows (Figure 1):
• Part I covers chapter 1 (the current chapter), which introduces the present research.
• Part II covers chapters 2, 3, 4 and 5. Chapter 2 provides a basic understanding of cloud
computing; chapter 3 introduces virtualization; chapter 4 presents the concept of Big
Data and HPCaaS, and chapter 5 lists some related work and clearly states our
contribution.
• Part III covers chapters 6, 7 and 8. Chapter 6 explains the steps we followed in selecting
the technology enablers of this research, and chapters 7 and 8 present OpenStack
and Hadoop in detail, respectively.
• Part IV covers chapters 9, 10, 11 and 12. Chapter 9 presents the methodology adopted in
conducting this research; chapter 10 describes the environment prepared to run the
needed experiments; chapter 11 introduces the results, and chapter 12 discusses the
research findings.
• Part V covers chapter 13, which summarizes the research findings and proposes some future
work; further, this part includes the bibliography and appendices of this study.
Figure 1: Thesis organization
Part II: Theoretical Baselines
The objective of this part is to elaborate and shed light on some scientific concepts, theories
and topics that serve as a foundation to understand the whole picture of the present research.
Hence, this part is structured as follows: chapter 2 gives basic background on cloud
computing; chapter 3 introduces a technology closely related to cloud computing, namely
virtualization; chapter 4 presents Big Data and HPCaaS, and chapter 5 situates this research by
introducing previous research done in the domain of evaluating HPC.
Chapter 2: Cloud Computing
Cloud computing has become an innovative and emerging trend in delivering IT services,
attracting the interest of both academia and industry. Using advanced technologies,
cloud computing provides end users with a variety of services, ranging from hardware-level
services to application-level services. Cloud computing is understood as utility computing
over the Internet. That is, computing services have moved from local data centers to hosted
services which are offered over the Internet and paid for on a pay-per-use basis [14]. This
chapter provides an overview of the cloud computing concept. It gives a definition of
cloud computing, defines its characteristics, describes cloud service and
deployment models, discusses some cloud computing benefits, and finally lists
some cloud computing providers.
3.1. Cloud Computing Definition
In the late 1960s, John McCarthy brought a new concept into the computer science field, which
predicted that technology would not be provided only as tangible products [14]. That is,
computer resources would be provided as a service, like water and electricity. The concept was
known as utility computing, and nowadays it is known as cloud computing.
Cloud computing is defined by NIST (National Institute of Standards and Technology) [15] in
2009 as:
“Cloud computing is a model for enabling ubiquitous,
convenient, on-demand network access to a shared
pool of configurable computing resources (e.g.,
networks, servers, storage, applications, and services)
that can be rapidly provisioned and released with
minimal management effort or service provider
interaction. This cloud model is composed of five
essential characteristics, three service models, and
four deployment models. ”
The NIST definition of cloud computing sheds light on the effective use of the cloud in terms of
requiring minimal management effort for the shared resources. It sets five characteristics
that define cloud computing: on-demand self-service, broad network access, resource pooling,
rapid elasticity and measured service. Concerning the deployment models, NIST has
classified them into private, public, community and hybrid clouds. More details about cloud
characteristics, service models and deployment models are provided in the upcoming subsections.
The NIST definition of cloud is summarized in Figure 2 which encapsulates cloud computing
characteristics, service models, and deployment models.
Figure 2: NIST visual model of cloud computing definition [14]
3.2.Cloud Computing Characteristics
NIST has listed five main characteristics that precisely describe cloud computing
[15]:

On-demand self-service: end users can use and change computing capabilities as desired
without the need for human interaction with each service provider.

Broad network access: resources are accessed over the network using standard mechanisms.

Resource pooling: the provider’s computing resources are pooled to serve multiple
consumers; these resources are dynamically assigned and reassigned according to
consumer demand. Examples of resources include storage, processing, memory, and
network bandwidth.

Rapid elasticity: cloud providers can elastically scale in and scale out resources
depending on current end users’ demand. Therefore, resources can be available for
provisioning in any quantity at any time.

Measured service: resource usage can be monitored, controlled and measured;
therefore, end users can pay using the pay-as-you-go model.
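As a minimal illustration of how rapid elasticity and measured service interact, the following Python sketch scales a pool of instances to follow demand and bills only for the instance-hours actually used. The capacity per instance, the hourly price and the demand figures are hypothetical values chosen for this example; they do not come from NIST or from any cloud provider.

```python
import math

# Hypothetical values, chosen only for illustration.
REQUESTS_PER_INSTANCE = 100      # load one instance can serve per hour
PRICE_PER_INSTANCE_HOUR = 0.10   # pay-as-you-go price per instance-hour


def instances_needed(demand):
    """Rapid elasticity: scale out or in to match the current demand."""
    return max(1, math.ceil(demand / REQUESTS_PER_INSTANCE))


def metered_cost(hourly_demand):
    """Measured service: bill only the instance-hours actually consumed."""
    instance_hours = sum(instances_needed(d) for d in hourly_demand)
    return instance_hours * PRICE_PER_INSTANCE_HOUR


demand = [50, 250, 900, 120]     # requests per hour over four hours
pool = [instances_needed(d) for d in demand]
print(pool)                              # instances running in each hour
print(round(metered_cost(demand), 2))    # total bill for the four hours
```

In this sketch the pool grows during the demand peak and shrinks afterwards, so the bill reflects usage rather than the peak capacity that a fixed data center would have to provision.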
Other characteristics were investigated in [16]; they are listed as follows:

Reliability: this feature is ensured by implementing and providing multiple redundant
sites. With this feature, cloud computing is considered an ideal solution for disaster
recovery and business-critical tasks.

Customization: cloud computing allows customization of infrastructure and applications
based on end users’ demand.

Efficient resource utilization: this feature ensures that resources are delivered only for as long as they
are needed.
3.3. Cloud Computing Service Models
Based on the NIST definition of cloud computing, cloud service models are classified as
follows:

Software as a Service (SaaS)
Software as a Service (SaaS) represents application software, operating system and computing
resources. End users can view the SaaS model as a web-based application interface where
services and complete software applications are delivered over the Internet. Some examples of
SaaS applications are: Google Docs, Microsoft Office Live, Salesforce Customer Relationship
Management, etc.

Platform as a Service (PaaS)
This service allows end users to create and deploy applications on provider’s cloud
infrastructure. In this case, end users do not manage or control the underlying cloud
infrastructure like network, servers, operating systems, or storage. However, they do have
control over the deployed applications by being allowed to design, model, develop and test
them. Examples of PaaS are: Google App Engine, Microsoft Azure, Salesforce, etc.

Infrastructure as a Service (IaaS)
This service consists of a set of virtualized computing resources such as network bandwidth,
storage capacity, memory, and processing power. These resources can be used to deploy and
run arbitrary software which can include operating systems and applications. Examples of
IaaS providers are Dropbox, Amazon Web Services, etc.
Cloud services are summarized in Figure 3.
Figure 3: services provided in cloud computing environment [16]
3.4.Cloud Computing Deployment Models

Private Cloud
Private cloud computing is provisioned for exclusive use by an organization. The cloud in this
case is owned, managed and operated by the organization, a third party, or both of them. The
advantage of private cloud consists in providing high security since the cloud is accessed by
trusted entities within the organization [15].

Public Cloud
The cloud infrastructure is provisioned for general public use. It may be owned, managed, and
operated by a cloud service provider who offers services based on a pay-per-use model. In
contrast to the private cloud, the public cloud is regarded as an untrustworthy environment [15].

Community Cloud
The cloud infrastructure is provisioned for exclusive use by a specific community of
consumers from different organizations that share some goals (e.g., mission, security
requirements, policy, and compliance considerations). In this case, the cloud may be owned,
managed, and operated by one or more organizations in the community, a third party, or
a combination of them [15].

Hybrid Cloud
This cloud is a combination of both private and public cloud computing environments. Hybrid
cloud provides high flexibility and more choices for organizations; for instance, critical core
activities of an organization can be run under the control of the private part of the hybrid
cloud while other tasks may be outsourced to the public part [17].
Table 1 summarizes cloud deployment models discussed above [17].
Table 1 : A Comparison of cloud deployment models [17]
3.5.Cloud Computing Benefits
Nowadays, cloud is widely used because of the benefits it provides to end users. Some of the
key benefits offered by the cloud include [17, 18]:

Initial Cost Savings
Organizations or individuals can avoid the large initial investment needed to launch new hardware,
products and services; in this case, the cloud computing platform offers the needed resources in
terms of infrastructure, platform and applications.

Scalability
Cloud computing ensures high computing scalability by scaling up resources as needed.
Therefore, when usage increases, resources increase accordingly to respond to end users’
demand.

Availability
Cloud providers have the infrastructure and bandwidth to accommodate business
requirements for high speed access, storage and systems.

Reliability
Cloud computing implements redundant paths to support business continuity and disaster
recovery.

Maintenance
End users are not concerned with resource maintenance since it is done by the cloud
service provider.
3.6.Cloud Computing Providers
There are many providers who offer cloud services with different features and pricing. Some
of them are listed as follows [16, 19]:

Amazon Web Services
Amazon Web Services (AWS) [20] offers a number of cloud services for businesses of all sizes. AWS applies
advanced data privacy techniques to protect users’ data. For that reason, AWS has obtained various
security certifications and audits such as ISO 27001, FISMA Moderate and SAS 70 Type II.
Some AWS services are: Elastic Compute Cloud, Simple Storage Service, SimpleDB
(a non-relational data storage service that stores, processes and queries data sets in the cloud), etc.

Google
Google [21] offers high accessibility and usability in its cloud services. Some of Google’s
services include: Google App Engine, Gmail, Google Docs, Google Analytics, Picasa (a tool
used to exhibit products and upload their images to the cloud), etc.

Microsoft
Microsoft [22] offers a famous cloud platform called Windows Azure which runs Windows
applications. Some other services include: SQL Azure, Windows Azure Marketplace (an
online market to buy and sell applications and data), etc.

OpenStack
OpenStack [23] is an open source platform for public and private cloud computing that aims
at ensuring scalability and flexibility. It was founded by Rackspace Hosting and NASA.
Some other organizations that invest in the cloud are: Dell, IBM, Oracle, HP, Salesforce, etc.
[16].
Chapter 3: Virtualization
Cloud providers use many different technologies and practices; some of
them are Internet protocols for communication, virtual private cloud provisioning, load
balancing and scalability, distributed processing, high performance computing technologies
and virtualization [24]. This chapter focuses on virtualization technology,
as it is considered the core of cloud computing. It describes in detail the history, benefits,
types and abstraction layer of virtualization.
4.1.Definition of Virtualization
Virtualization is a widely used term; it has been known for many years as a powerful
technology in computer science. The definition of virtualization can change depending on
which component of the computer system it is applied to. However, it is broadly defined as an
abstraction layer between physical resources and their logical representation [25]. NIST has
defined virtualization as [26]:
“The simulation of the software and/or hardware upon which other
software runs. This simulated environment is called a virtual machine
(VM). There are many forms of virtualization, distinguished primarily by
computing architecture layer. For example, application virtualization
provides a virtual implementation of the application programming
interface (API) that a running application expects to use, allowing
applications developed for one platform to run on another without
modifying the application itself. The Java Virtual Machine (JVM) is an
example of application virtualization; it acts as an intermediary between
the Java application code and the operating system (OS). Another form of
virtualization, known as operating system virtualization, provides a
virtual implementation of the OS interface that can be used to run
applications written for the same OS as the host, with each application in
a separate VM container”.
Furthermore, virtualization is defined by SNIA (Storage Networking Industry Association) as
follows [27]:
“The act of abstracting, hiding, or isolating the internal functions of a
storage (sub) system or service from applications, host computers, or general
network resources, for the purpose of enabling application and network-independent
management of storage or data”.
From both definitions, we can say that virtualization is a methodology of dividing a physical
machine into multiple execution environments that allow multiple tasks to run
simultaneously. This is done by providing a software abstraction layer called the Virtual
Machine Manager (VMM) or hypervisor. The VMM is therefore designed to hide the physical
resources from the operating system. In this case, the VMM allows creating multiple guest
Operating Systems (OS), each of which is run by a software unit called a Virtual Machine (VM)
[28].
4.2.History of Virtualization
The roots of virtualization go back to the first virtualized IBM mainframes, which were designed
in the 1960s and which allowed the company to run multiple applications and processes
simultaneously. In fact, the main drivers behind introducing virtualization were the high cost
of hardware and the need for running and isolating applications on the same hardware. During
the 1970s, the adoption of virtualization technology increased sharply in many organizations
because of its cost effectiveness. However, in the 1980s and 1990s, hardware prices dropped
and multitasking operating systems emerged. With these developments, there was no longer a
need to ensure high CPU utilization, and therefore the need for virtualization
technology declined. Yet, later in the 1990s, virtualization technology was brought back to the market with
the introduction of VMware Inc., which originated at Stanford University. Nowadays, virtualization is widely used to
reduce management costs by replacing a number of under-utilized servers with a single server [29].
4.3.Benefits of Virtualization
There are several reasons that push many organizations to adopt virtualization technology;
some of them are listed in [24, 29, 30] as follows:

Server Consolidation
It condenses multiple servers into one physical server that hosts many virtual machines.
This feature allows the physical server to run at a high rate of utilization, and at the
same time it reduces the costs of hardware maintenance, power and cooling.

Application Consolidation
Legacy applications might require newer hardware and/or operating systems. In this case,
virtualization can be used to virtualize the new requirements.

Sandboxing
Virtualization can provide secure and isolated environment by running virtual machines that
can be used to run foreign or less-trusted applications.

Multiple Simultaneous OS
It can provide the facility of having multiple simultaneous operating systems that can run
different types of applications.

Reducing Cost
Virtualization reduces deployment and configuration costs by requiring less hardware, less
space and less staffing. Furthermore, virtualization reduces the cost of networking by
requiring less wiring and fewer switches and hubs.
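The effect of server consolidation can be illustrated with simple arithmetic. The sketch below estimates how many equally sized hosts would be needed to carry the combined load of several under-utilized servers. The utilization figures and the 80% target are hypothetical, and the estimate assumes that load can be divided freely across hosts, which real virtual machine placement does not guarantee.

```python
import math


def hosts_after_consolidation(utilizations, target=0.80):
    """Estimate the number of equally sized hosts needed to carry the
    combined load while keeping each host at or below the target."""
    total_load = sum(utilizations)
    return max(1, math.ceil(total_load / target))


# Ten hypothetical servers, each using 10-15% of its CPU capacity.
servers = [0.10, 0.15, 0.12, 0.10, 0.14, 0.11, 0.13, 0.10, 0.15, 0.10]
print(len(servers), "servers ->", hosts_after_consolidation(servers), "hosts")
```

Under these assumptions, ten lightly loaded machines collapse onto two hosts, which is the source of the hardware, power and cooling savings listed above.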
4.4. Virtualization Approaches
Virtualization can take different forms depending on which component of the computer system it is
applied to [31]. In this section, we shed light on three well-known virtualization techniques:
Full Virtualization, Paravirtualization, and Hardware Assisted Virtualization.
4.4.1. Full Virtualization
In full virtualization, the guest OS is fully abstracted from the hardware level by adding a
virtualization layer: the VMM or hypervisor. In this case, the guest OS is not aware that it is being
virtualized, and it requires no modifications. This approach provides each VM with all
services of the physical system, including virtual BIOS, virtual devices and virtualized
memory management. To manage the communication between different layers, full
virtualization provides both binary translation and direct execution techniques (Figure 4).
Binary translation is used to convert guest OS instructions into host instructions. On the other
hand, application or user level instructions are directly executed on the processor to ensure
high performance [32]. Microsoft Virtual Server is an example of full virtualization.
Figure 4: Full virtualization architecture [32]
4.4.2. Paravirtualization
The fundamental issue with full virtualization is the emulation of devices within the
hypervisor. This issue was addressed by developing the paravirtualization technique, which allows
the guest OS to be aware that it is being virtualized and to have direct access to the underlying
hardware. In paravirtualization, the actual guest code is modified to use a different interface
that accesses the hardware directly or the virtual resources controlled by the hypervisor [32].
In more detail, paravirtualization changes the OS kernel to replace non-virtualizable
instructions with hypercalls that communicate directly with the hypervisor. Thus, when a
privileged command is to be executed on the guest OS, it is delivered to the hypervisor
(instead of the OS) by using a hypercall; the hypervisor receives this hypercall and accesses
the hardware to return the needed result (Figure 5). Xen is one of the systems that adopt
paravirtualization technology.
Figure 5: Paravirtualization architecture [32]
The downside of paravirtualization is that the guest must be modified to integrate hypervisor
awareness. This is a limitation as some operating systems do not allow such modifications
(e.g. Windows 2000/XP), and even the ones that can be modified may need additional
resources for maintenance/troubleshooting [32].
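The hypercall mechanism described in this section can be sketched with a toy model. The Python code below is only a conceptual illustration: the class names, the `update_page_table` operation and the guest name are hypothetical, and the code does not reflect the actual interface of Xen or of any other hypervisor.

```python
class Hypervisor:
    """Toy hypervisor that owns the hardware state for every guest."""

    def __init__(self):
        self.page_tables = {}   # guest name -> {virtual page: physical frame}
        self.log = []           # record of hypercalls received

    def hypercall(self, guest, operation, *args):
        """Single entry point through which guests request privileged work."""
        self.log.append((guest, operation))
        if operation == "update_page_table":
            virtual_page, frame = args
            self.page_tables.setdefault(guest, {})[virtual_page] = frame
            return True
        raise ValueError(f"unsupported hypercall: {operation}")


class ParavirtualizedGuest:
    """Guest kernel modified to call the hypervisor instead of the hardware."""

    def __init__(self, name, hypervisor):
        self.name = name
        self.hypervisor = hypervisor

    def map_page(self, virtual_page, frame):
        # An unmodified kernel would execute a privileged instruction here;
        # the paravirtualized kernel issues a hypercall in its place.
        return self.hypervisor.hypercall(
            self.name, "update_page_table", virtual_page, frame
        )


vmm = Hypervisor()
guest = ParavirtualizedGuest("guest-1", vmm)
guest.map_page(0x10, 0x2A)
print(vmm.page_tables)
print(vmm.log)
```

The point of the sketch is that the guest kernel itself contains the call to the hypervisor, which is why paravirtualization requires a modifiable guest operating system.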
4.4.3. Hardware Assisted Virtualization
Hardware Assisted Virtualization allows the VMM to run directly on the hardware. In this case,
the VMM controls the access of the guest OS to the hardware resources. As depicted in Figure 6,
privileged and sensitive calls are sent directly to the hypervisor, removing the need for binary
translation and paravirtualization. VMware ESX Server is one of the main competing VMMs
that use this approach [29].
Figure 6: Hardware assisted virtualization architecture [32]
4.5.Virtual Machine Manager
As defined before, the hypervisor or VMM is the layer between the host operating system and a guest
operating system, or the layer between the hardware and the guest operating systems. In [25],
the author sets out three main properties that a VMM needs to maintain. First, the
VMM has to provide an environment that is identical to the original
machine being virtualized. Second, programs running on a VM should show the
same performance as on the original machine, or only a minor decrease. Finally,
the VMM needs to control all system resources provided to VMs.
4.5.1. Hypervisor Types
Hypervisors are classified into Type 1 and Type 2 hypervisors. Type 1 hypervisors run
directly on the system hardware; they therefore monitor the guest operating systems and
allocate all the needed resources, including disk, memory, CPU and I/O peripherals.
Having no intermediary between the Type 1 hypervisor and the physical layer leads to
efficient performance in terms of hardware access, and to a high security level (Figure 7-a). On the
other hand, a Type 2 hypervisor runs on a host operating system that provides virtualization
services such as I/O and memory management (Figure 7-b). Having an intermediary layer
between the hypervisor and the hardware makes the installation process easier than for a Type 1
hypervisor, since the operating system is in charge of hardware configuration such as
networking and storage [33].
Figure 7: a) Type 1 hypervisor b) Type 2 hypervisor [33]
The differences between Type 1 and Type 2 hypervisors can lead to different performance
results. The layer between the hardware and the hypervisor in Type 2 makes its performance
less efficient than that of Type 1. A sample scenario that illustrates this difference is when a virtual
machine requires a hardware interaction (e.g., reading from disk); in this case, a Type 2 hypervisor
first needs to pass the request to the host operating system, which then passes it to the hardware layer. Besides
performance efficiency, the reliability of a Type 1 hypervisor is higher than that of a Type 2
hypervisor. For instance, a failure in the host operating system can directly affect the hosted guests in
a Type 2 hypervisor; therefore, the availability of a Type 2 hypervisor is closely tied to the
availability of the host operating system. However, Type 2 hypervisors have the advantage
of fewer hardware/driver issues, as the host operating system is responsible for
interfacing with the hardware [34].
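The disk-read scenario and the availability argument above can be made concrete with a small model. In the Python sketch below, the layer names and the two helper functions are hypothetical and serve only to count the hand-offs a request goes through; the sketch says nothing about the actual overhead of any specific hypervisor.

```python
# Layers a guest disk read traverses, following the description above.
REQUEST_PATH = {
    "type1": ["guest OS", "hypervisor", "hardware"],
    "type2": ["guest OS", "hypervisor", "host OS", "hardware"],
}


def layers_crossed(hypervisor_type):
    """Number of hand-offs between adjacent layers for one I/O request."""
    return len(REQUEST_PATH[hypervisor_type]) - 1


def is_available(hypervisor_type, failed_layers):
    """Guests stay reachable only if no layer on their path has failed."""
    path = REQUEST_PATH[hypervisor_type]
    return not any(layer in failed_layers for layer in path)


for kind in ("type1", "type2"):
    print(kind, layers_crossed(kind), is_available(kind, {"host OS"}))
```

In this model a Type 2 request crosses one more boundary than a Type 1 request, and a host operating system failure takes down only the Type 2 guests.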
4.5.2. Examples of Hypervisors
a) Xen Hypervisor
Xen hypervisor is a Type 1 or bare metal hypervisor that is widely used for paravirtualization
[35]. It is managed by a specific privileged guest (privileged VM) called Domain-0 (Dom0).
Dom0 runs on the hypervisor, and it is responsible for managing all aspects of the other,
unprivileged virtual machines, which are known as DomainU (DomU). Furthermore, Dom0 has
direct access to the resources of the physical machine, which is not the case for DomU guests [36].
The overall architecture of the Xen hypervisor is shown in Figure 8.
Figure 8: Xen hypervisor architecture
Xen uses paravirtualization as well as full virtualization. In paravirtualization, DomU guests are
referred to as DomU PV Guests, and they can be modified Linux operating systems, Solaris,
FreeBSD, and other UNIX operating systems [37]. DomU PV Guests are aware that they are
running in a virtualized environment, and they do not have direct access to the hardware
resources. In this case, the guest operating system is modified to make special calls
(hypercalls) to the hypervisor for privileged operations, instead of the regular system calls in a
traditional unmodified operating system. On the other hand, in full virtualization, DomU guests are
referred to as DomU HVM Guests and run any standard, unmodified operating system [37].
A DomU HVM Guest is not aware that it is sharing processing time on the hardware, and it is not
aware of the presence of other virtual machines. In this case, DomU HVM Guests require processors
that specifically support hardware virtualization extensions (Intel VT or AMD-V).
Virtualization extensions allow many of the privileged kernel instructions (which in PV
were converted to "hypercalls") to be handled by the hardware using the trap-and-emulate
technique.
b) KVM Hypervisor
KVM hypervisor provides a full virtualization solution based on the Linux operating system. It
works by reusing the hardware assisted virtualization extensions that were already developed.
In this case, KVM requires the presence of Intel VT or AMD-V extensions on the host
system. When KVM is loaded, it converts the kernel into a bare metal hypervisor. As a result,
it takes full advantage of many components that are already present
within the kernel, such as memory management and scheduling [38]. KVM is implemented
using two main components; the first one is the KVM-loadable module that, when installed in
the Linux kernel, provides management of the virtualization hardware (Figure 9). The second
component provides PC platform emulation, which is offered by a modified version of
QEMU. QEMU is executed as a user-space process, coordinating with the kernel for guest
operating system requests [39].
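Since KVM depends on the Intel VT or AMD-V extensions, a common first step on a Linux host is to check whether the processor advertises them. On Linux, the `vmx` flag in `/proc/cpuinfo` indicates Intel VT-x and the `svm` flag indicates AMD-V. The following Python sketch parses text in that format; the function name and the sample excerpt are hypothetical and are used only for illustration.

```python
def virtualization_extension(cpuinfo_text):
    """Return 'Intel VT-x', 'AMD-V', or None based on /proc/cpuinfo flags."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            if "vmx" in flags:
                return "Intel VT-x"
            if "svm" in flags:
                return "AMD-V"
    return None


# Hypothetical excerpt in the format of /proc/cpuinfo on an x86 Linux host.
sample = "processor\t: 0\nflags\t\t: fpu vme de pse msr vmx sse sse2\n"
print(virtualization_extension(sample))

# On a real Linux host, one could instead read the file itself:
# with open("/proc/cpuinfo") as f:
#     print(virtualization_extension(f.read()))
```

A result of `None` suggests that the extensions are either absent or disabled in the firmware, in which case KVM cannot provide hardware assisted virtualization on that host.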
Figure 9: KVM hypervisor architecture
c) VMware ESXi Hypervisor
VMware was the first leading company to contribute to virtualization technology. One of its
virtualization products is VMware ESXi, which is installed directly on top of the physical
machine [40]. VMware ESXi was introduced in 2007 to provide high levels of
reliability and performance to companies of all sizes. The overall architecture of VMware
ESXi is illustrated in Figure 10. The main component is the vmkernel, which contains all the
necessary processes to manage VMs. It provides certain functionality similar to that found in
other operating systems, such as process creation and control, signals, file system, and process
threads. In addition, the vmkernel supports running multiple virtual machines and provides
core functionalities such as resource scheduling, I/O stacks and device drivers [24].
Figure 10: VMware ESXi architecture [40]
Chapter 4: Big Data and High Performance Computing as a Service
As big companies such as Google, Amazon, Facebook, LinkedIn and Twitter grow in terms of
users and data generated, the limited capacity and computing power of current data tools lead to
inefficient and insufficient data processing, analysis, management, and storage. IBM estimates
that every day 2.5 quintillion bytes of data are created, and that 90% of the data in the world today
has been created in the last two years [41]. Besides, Oracle estimated that 2.5 zettabytes of
data were generated in 2012, and that this amount will grow significantly every year (Figure 11) [42]. The
increase in data size to many terabytes and petabytes is known as Big Data. To handle the
complexity of Big Data, HPC is adopted to provide high computation capabilities, high
bandwidth, and low-latency networking. This chapter provides an overview of the Big Data
phenomenon and the HPCaaS concept.
Figure 11: Data growth between 2008 and 2020 [54]
5.1.Big Data
5.1.1. Big Data Definition
Big Data is defined as large and complex datasets that are generated from different sources
including social media, online transactions, sensors, smart meters and administrative services
[43]. With all these sources, the size of Big Data goes beyond the ability of typical tools for
storing, analyzing and processing data. The literature on Big Data divides the concept
into four dimensions: Volume, Velocity, Variety and Value [43].
 Volume: the size of data generated is very large, and it goes from terabytes to petabytes.

Velocity: data grows continuously at an exponential rate.

Variety: data are generated in different forms: structured data, semi-structured and
unstructured data. These forms require new techniques that can handle data
heterogeneity.

Value: the challenge in Big Data is to identify what is valuable so as to be able to capture,
transform and extract data for analysis.
5.1.2. Big Data Technologies
With the Big Data phenomenon, there is an increasing demand for new technologies that can
support the volume, velocity, variety and value of data. Some of the new technologies are
NoSQL, parallel and distributed paradigms and new cloud computing trends that can support
the four dimensions of big data.
NoSQL (Not Only Structured Query Language) marks the transition from relational databases to
non-relational databases [44]. It is characterized by the ability to scale horizontally, the ability
to replicate and partition data over many servers, and the ability to provide high
performance operations. However, moving from relational to NoSQL systems has meant giving up
some of the ACID transactional properties (Atomicity, Consistency, Isolation and Durability)
[45]. In this context, NoSQL properties are defined by the CAP theorem [46], which states that
developers must make trade-off decisions between consistency, availability and partition tolerance.
Some examples of NoSQL tools are: Cassandra [47], HBase [48], MongoDB [49] and
CouchDB [50].
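To illustrate what partitioning and replicating data over many servers means in practice, the sketch below assigns each key to a primary server by hashing it and places copies on the servers that follow. This is a deliberately simple modulo-based scheme with hypothetical server names; it is not the partitioning algorithm of Cassandra, HBase, MongoDB or CouchDB, which use more elaborate strategies.

```python
import hashlib


def replicas_for(key, servers, replication_factor=2):
    """Pick the servers that store a key: hash the key to choose a primary,
    then place copies on the servers that follow it in the list."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    primary = int(digest, 16) % len(servers)
    count = min(replication_factor, len(servers))
    return [servers[(primary + i) % len(servers)] for i in range(count)]


servers = ["node-a", "node-b", "node-c"]   # hypothetical server names
for key in ("user:1", "user:2", "user:3"):
    print(key, "->", replicas_for(key, servers))
```

Because every key lives on more than one server, a read can still be served when one server is unreachable; deciding whether that read may return a stale copy is exactly the kind of consistency versus availability trade-off that the CAP theorem describes.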
Other supporting technologies for Big Data are parallel and distributed paradigms (e.g.
Hadoop) and cloud computing services (e.g. OpenStack). These technologies are discussed in
the upcoming chapters (Part III- Chapter 8, 9).
5.2. High Performance Computing as a Service (HPCaaS)
5.2.1. HPCaaS Overview
High Performance Computing (HPC) is used to process and analyze large and complex
problems, including scientific, engineering and business problems that require high
computation capabilities, high bandwidth, and low-latency networking [3]. HPC meets these
requirements by implementing large physical clusters. However, traditional HPC faces a set
of challenges, namely peak demand, high capital cost, and the high level of expertise needed to acquire and
operate HPC systems [51]. To deal with these issues, HPC experts have leveraged the benefits of
new technology trends, including cloud technologies, parallel processing paradigms and large
storage infrastructures. Merging HPC with these new technologies has produced a new HPC
model, called HPC as a service (HPCaaS).
HPCaaS is an emerging computing model where end users have on-demand access to pre-existing
technologies that provide a high performance and scalable HPC computing
environment [52]. HPCaaS provides many benefits because of the better quality of
service provided by cloud technologies, and the better parallel processing and storage
provided by, for example, the Hadoop Distributed System and the MapReduce paradigm. Some
HPCaaS benefits are stated in [51] as follows:

High Scalability: resources are scaled up to ensure that sufficient resources are available to meet users’
demand in terms of processing large and complex datasets.

Low Cost: End-users can eliminate the initial capital outlay, time and complexity to
procure HPC.

Low Latency: this is achieved by implementing the placement group concept, which ensures that data
is processed in the same rack or on the same server.
5.2.2. HPCaaS Providers
There are many HPCaaS providers in the market. One example is Penguin
Computing [53], which has been a leader in designing and implementing high performance
environments for over a decade. Nowadays, it provides HPCaaS with different options: on-demand,
HPCaaS as private services and hybrid HPCaaS services. Amazon Web Services
(AWS) [3] is also an active HPCaaS provider in the market; it provides simplified tools to perform
HPC over the cloud. AWS allows end users to benefit from HPCaaS features with different
pricing models: On-Demand, Reserved [54] or Spot Instances [55]. HPCaaS on AWS is
currently used for Computer Aided Engineering, molecular modeling, genome analysis, and
numerical modeling across many industries including Oil and Gas, Financial Services and
Manufacturing [3]. Other leaders of HPCaaS in the market are Microsoft (Windows Azure
HPC) [56] and Google (Google Compute Engine) [57].
Chapter 5: Literature Review and Research Contribution
In order to bridge the gap between the present research and previous studies, a review was
conducted on the current state of HPC and virtualization. Therefore, this chapter situates the
research in relation to previous research publications and states clearly the research
contribution.
5.1. Related Work
There have been several studies that evaluated the performance of high performance computing in the
cloud. Most of these studies used Amazon EC2 [20] as a cloud environment [58-63]. Besides,
only a few studies have evaluated the performance of high performance computing using the combination of
both newly emerging distributed paradigms and a cloud environment [64].
In [58], the authors evaluated HPC on three different cloud providers: Amazon EC2, GoGrid
Cloud and IBM Cloud. For each cloud platform, they ran HPC on Linux virtual machines
(VMs), and they came to the conclusion that the tested public clouds do not seem to be
optimized for running HPC applications. This was explained by the fact that public cloud
platforms have slow network connections between virtual machines. Furthermore, the authors in
[13] evaluated the performance of HPC applications in today's cloud environments (Amazon
EC2) to understand the tradeoffs in migrating to the cloud. Overall results indicated that
running HPC on the EC2 cloud platform limits performance and causes significant variability.
Besides Amazon EC2, research done in [63] evaluated the performance-cost tradeoffs of
running HPC applications on three different platforms. The first and second platforms consist of
two physical clusters (Taub and Open Cirrus cluster), and the third platform consists of
Eucalyptus. Running HPC on these platforms led the authors to conclude that the cloud is more
cost-effective for low communication-intensive applications.
In order to understand the performance implications on HPC of using virtualized resources and
distributed paradigms, the authors in [64] performed an extensive analysis using Eucalyptus (16
nodes) and other technologies such as Hadoop [7], Dryad and DryadLINQ [65], and
MapReduce [6]. The conclusion of this research suggested that most parallel applications can
be handled fairly easily when using cloud technologies (Hadoop, MapReduce,
and Dryad); however, scientific applications, which require complex communication patterns,
still require more efficient runtime support.
Evaluating HPC without relating it to new cloud technologies was also performed using
different virtualization technologies [66, 67, 68, 69]. In [66], the authors performed an analysis of
virtualization techniques including VMware, Xen, and OpenVZ. Their findings showed that
none of the techniques matches the performance of the base system perfectly; yet, OpenVZ
demonstrates high performance in both file system performance and industry-standard
benchmarks. In [67], the authors compared the performance of KVM and VMware. Overall
findings showed that VMware performs better than KVM. Still, in a few cases KVM gave
better results than VMware. In [68], the authors conducted a quantitative analysis of two leading
open source hypervisors, Xen and KVM. Their study evaluated the performance isolation,
overall performance and scalability of virtual machines for each virtualization technology. In
short, their findings showed that KVM has substantial problems with guests crashing (when
increasing the number of guests); however, KVM still has better performance isolation than
Xen. Finally, in [69] the authors extensively compared four hypervisors: Hyper-V, KVM,
VMware, and Xen. Their results demonstrated that there is no perfect hypervisor.
5.2.Contribution
So far, only a few studies have compared different virtualization techniques and their
impact on HPC in the cloud. The only study we found is [70], where the authors
compared the performance of Xen, KVM and VirtualBox. Each virtualization
technology was compared with bare metal using a set of high performance benchmarking
tools. The results demonstrated that KVM is the best choice for HPC in the
cloud because of its rich features and near-native performance.
The present research contributes to filling this gap in the literature by examining the impact of
virtualization techniques on HPCaaS using OpenStack as a cloud platform and Hadoop as a
distributed and parallel system.
Part III: Technology Enablers
This part explains the use of OpenStack and Hadoop as the underlying technologies for this
research. Its first chapter provides a qualitative study for selecting an appropriate cloud
platform and distributed system; its second chapter introduces the OpenStack components in
detail, and its third chapter presents Hadoop and its main aspects.
Chapter 6: Technology Enablers Selection
The architecture we adopted to evaluate the impact of virtualization on HPCaaS was built
after conducting a qualitative study of the tools available in the market. We mainly targeted
open source tools in order to select an appropriate cloud computing platform and distributed
system. This chapter presents the analysis we followed in making both selections.
6.1.Cloud Platform Selection
To compare the available open source cloud platforms, we chose the most popular ones. The
selection of competing platforms was based on a study that compares the popularity of
OpenStack, OpenNebula, Eucalyptus and CloudStack in 2013 [71]. As depicted in Figure 12,
the study showed that OpenStack has the largest total population index, followed by
Eucalyptus, CloudStack, and OpenNebula.
Figure 12: Active cloud community population [71]
Based on Figure 12, we chose to compare and study OpenStack, OpenNebula and
Eucalyptus. To adopt one of these open source cloud platforms, we relied on other studies that
compare their performance and quality [72-75].
In [72], the authors compared several open and commercial cloud platforms. Among open
platforms, they compared OpenNebula and Eucalyptus using a set of criteria that includes
storage, virtualization, network, management, security and vendor support. The results
showed that open-source and commercial solutions can have comparable features, and that
OpenNebula is more feature-complete than Eucalyptus.
[73] and [74] compare OpenStack and OpenNebula. In [73], the authors
compared the performance of both cloud platforms by measuring the time between the moment
the cloud starts instantiating VMs and the moment they are ready to accept SSH connections.
The findings demonstrate that OpenStack is slightly better than OpenNebula
due to its smaller instantiation time. Moreover, the results showed that OpenStack is more
suitable for high performance computing due to its faster instantiation of large numbers of
VMs. In [74], the authors used qualitative and quantitative analysis to compare OpenStack and
OpenNebula. For the qualitative analysis, they adopted criteria that include security,
supported virtualization, access, image support, resource selection, storage support,
high-availability support and API support. Based on the results of the qualitative study, the
authors concluded that OpenStack is preferable when auto-scaling is needed, while OpenNebula
is preferable when persistent storage support is needed. For the quantitative analysis, the
authors measured the deployment time, network overhead and clean-up time of VMs. The results
showed that either platform can be used, depending on user requirements and
specifications.
In [75], the authors provided a comparative study of four solutions: Eucalyptus, OpenStack,
OpenNebula and CloudStack. To perform the comparison, they adopted the following
criteria: storage, network, security, hypervisor, scalability, installation and code openness. In
short, the results of this study showed that OpenStack is the preferred open source cloud platform.
Table 2 summarizes the preferred cloud IaaS in [72-75]. Based on this table, we decided to go
for OpenStack as it is known for its flexibility and total openness.
Table 2 : Cloud IaaS selection
6.2.Distributed and Parallel System Selection
To compare the distributed and parallel systems available in the market, we again relied on
a popularity index. The selection of competing systems was based on the study
in [76], summarized in Figure 13, which compares the popularity of
Hadoop, MongoDB, Cassandra, CouchDB, Redis, VoltDB, Neo4j, Riak and Infinispan. The
study was done in 2012 and reports the total downloads between January 2011 and
March 2012. Figure 13 shows that Hadoop is the most popular distributed system, followed
by MongoDB and Cassandra.
Figure 13: Active distributed systems population [76]
Based on Figure 13, we performed a qualitative analysis of both Hadoop and MongoDB in
order to end up with one selected system for the present research.
MongoDB is a document-oriented database that stores data in a binary form of JSON called
Binary JSON (BSON) rather than in tables with columns and rows. To provide high redundancy
and make data highly available, MongoDB offers replication across multiple servers. While
replication keeps data synchronized between servers, MongoDB also supports scaling out through
sharding, which partitions a collection and stores the different portions on different machines.
MongoDB also offers built-in MapReduce to process data in parallel on each shard [62]. Hadoop,
on the other hand, is an open source framework that supports processing,
analyzing and storing large data sets across large clusters using the MapReduce paradigm and
HDFS [7]. More details about Hadoop are included in Chapter 8.
The study in [77] compares MongoDB and Hadoop and reaches three
main conclusions: first, MongoDB is not appropriate as an analytics platform;
second, using Hadoop for MapReduce jobs is several times faster than using the built-in
MongoDB MapReduce capability; and third, MongoDB is much slower than HDFS. In addition,
the study in [78] compared the MapReduce performance of Hadoop and
MongoDB. In short, it showed that MongoDB is roughly four times slower than
Hadoop in fully-distributed mode.
Table 3 summarizes the selected distributed system in [77] and [78]. Based on this table, we
decided to go for Hadoop as an analytical and storage tool for the present research.
Table 3 : Parallel and distributed platform selection
Chapter 7: OpenStack
OpenStack is an open source platform for public and private cloud computing that aims at
ensuring scalability and flexibility. It was developed by a wide range of developers and
contributors, mainly in Python (68%), XML (16%) and JavaScript (5%) [79]. This chapter
provides a detailed description of OpenStack, including a brief history, its components, the
corresponding architecture, and finally some supported hypervisors.
7.1.OpenStack Overview
The formal definition of OpenStack is stated in [80], which defines OpenStack as “a cloud
operating system that controls large pools of compute, storage, and networking resources
throughout a datacenter, all managed through a dashboard that gives administrators control
while empowering their users to provision resources through a web interface”. According to
this definition, OpenStack is considered an Infrastructure as a Service (IaaS) platform.
An important feature of OpenStack is that, besides its web interface (the dashboard), it
exposes its services through APIs that are compatible with Amazon EC2 and S3. This feature
ensures that existing tools that work with Amazon’s cloud platform can also work with
the OpenStack platform [81].
7.2.OpenStack History
OpenStack started as a collaboration between Rackspace Hosting and NASA. Both
organizations planned to release their internal cloud projects for object storage and compute.
Rackspace contributed its Cloud Files platform to support the storage part of OpenStack, while
NASA contributed its Nebula platform to support the compute part [82]. In July 2010,
both organizations released the first version of OpenStack under the Apache 2.0 License. In
September 2012, the OpenStack Foundation was established as an independent entity with the
mission of protecting, empowering, and promoting OpenStack software. The OpenStack
project is currently supported by more than 150 companies including AMD, Intel, Canonical,
Red Hat, Cisco, Dell, HP, IBM and Yahoo! [83].
7.1.OpenStack Releases
Each OpenStack release brings new improvements and contributions. All
OpenStack releases since 2010 are listed in Table 4 [79].
Table 4 : OpenStack releases [79]
7.3.OpenStack Components
The core components of OpenStack are OpenStack Compute Infrastructure (Nova),
OpenStack Object Storage Infrastructure (Swift) and OpenStack Image Service Infrastructure
(Glance). Besides these components, OpenStack includes the Identity Service (Keystone),
Network Service (Quantum), Dashboard Service (Horizon) and Block Storage (Cinder).
Table 5 summarizes the main components of OpenStack and their corresponding code names.
Table 5 : OpenStack projects
Taking into consideration the previously mentioned OpenStack components, a conceptual
architecture of OpenStack is provided in Figure 14, which shows how the components
are interconnected [79].
Figure 14: OpenStack conceptual architecture [79]
7.3.1. OpenStack Compute (Nova)
Nova provides flexible management for virtual machines by allowing users to create, update,
and terminate virtual machines. The overall architecture of Nova (Figure 15) is composed of
the following sub-components: nova-api, nova-scheduler, nova-compute, nova-volume, queue
and database [82].
Figure 15: Nova subcomponents
Nova-api is responsible for accepting and fulfilling API requests. A request consists of
actions that will be performed by the Nova subcomponents. To accept API requests,
nova-api provides an endpoint for all API queries and enforces some policies. If the request
is about managing virtual machines, nova-compute is in charge of creating
or terminating the virtual machine instances. Normally, nova-compute receives requests from
the queue sub-component. To manage virtual machine instances, nova-compute relies on
different drivers, such as the libvirt software package, the Xen API and the vSphere API, to
support the various virtualization technologies. To decide where a request should go,
nova-scheduler retrieves the request from the queue and determines which compute host it
should run on. When persistent storage is needed, nova-volume handles the creation,
attachment, and detachment of persistent volumes to virtual machine instances [82].
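The request path described above (nova-api, queue, nova-scheduler, nova-compute) can be illustrated with a toy simulation. This is only a sketch and not OpenStack code: the names `ComputeHost`, `schedule` and `handle_request` are hypothetical, and the "most free RAM" placement rule is an assumption chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ComputeHost:
    """Hypothetical stand-in for a host running nova-compute."""
    name: str
    free_ram_mb: int
    instances: list = field(default_factory=list)


def schedule(hosts, ram_mb):
    """Toy nova-scheduler: pick the host with the most free RAM (assumed rule)."""
    candidates = [h for h in hosts if h.free_ram_mb >= ram_mb]
    if not candidates:
        raise RuntimeError("no host can fit the requested instance")
    return max(candidates, key=lambda h: h.free_ram_mb)


def handle_request(queue, hosts):
    """Take one request off the queue and let the chosen host create the VM."""
    request = queue.popleft()
    host = schedule(hosts, request["ram_mb"])
    host.free_ram_mb -= request["ram_mb"]
    host.instances.append(request["name"])
    return host.name


# Toy nova-api: accepted requests are placed on the message queue.
queue = deque()
hosts = [ComputeHost("host-a", 4096), ComputeHost("host-b", 8192)]
queue.append({"name": "vm-1", "ram_mb": 2048})
queue.append({"name": "vm-2", "ram_mb": 2048})
placements = [handle_request(queue, hosts) for _ in range(2)]
```

The queue decouples the component that accepts a request from the one that executes it, which is the role the text attributes to the Nova queue.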
Nova also provides network management through its subcomponent nova-network. The latter
accepts networking tasks from the queue and then performs system commands to manipulate
the network. Nova-network defines two types of IP addresses: fixed IPs and floating IPs.
A fixed IP is a private IP that is assigned to an instance for its whole life cycle. A
floating IP, on the other hand, is a public IP that is used for external
connectivity. The networks managed by nova-network can be classified into three
categories: Flat, FlatDHCP and VLAN [82].
• Flat assigns a fixed IP address to an instance and attaches that IP to a common bridge
(created by the administrator).
• FlatDHCP builds upon the Flat manager by providing DHCP services to handle
instance addressing and creation of bridges.
• VLAN provides a subnet and a separate bridge for each project. The range of IPs of a
given project is only accessible within the VLAN.
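The difference between fixed and floating IPs can be made concrete with a small model. This is a sketch under stated assumptions, not nova-network code: the class `ToyNetwork`, the subnet `10.0.0.0/29` and the public addresses (taken from a documentation-only range) are all hypothetical.

```python
import ipaddress


class ToyNetwork:
    """Hypothetical model of fixed and floating IP handling."""

    def __init__(self, fixed_cidr, floating_pool):
        # hosts() skips the network and broadcast addresses of the subnet.
        self._fixed_free = list(ipaddress.ip_network(fixed_cidr).hosts())
        self._floating_free = [ipaddress.ip_address(ip) for ip in floating_pool]
        self.fixed = {}      # instance name -> private (fixed) IP
        self.floating = {}   # instance name -> public (floating) IP

    def boot(self, instance):
        """A fixed IP is assigned once and kept for the instance's lifetime."""
        self.fixed[instance] = self._fixed_free.pop(0)
        return self.fixed[instance]

    def associate_floating(self, instance):
        """A floating IP is attached on demand for external connectivity."""
        if instance not in self.fixed:
            raise KeyError(f"unknown instance: {instance}")
        self.floating[instance] = self._floating_free.pop(0)
        return self.floating[instance]

    def disassociate_floating(self, instance):
        """Return the floating IP to the pool; the fixed IP is unaffected."""
        self._floating_free.append(self.floating.pop(instance))


net = ToyNetwork("10.0.0.0/29", ["203.0.113.10", "203.0.113.11"])
private_ip = net.boot("vm-1")
public_ip = net.associate_floating("vm-1")
```

In this model, releasing the floating IP leaves the instance reachable only through its fixed (private) address.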
The last subcomponents of Nova are the queue and the database. The queue is responsible for
passing messages between Nova sub-components to facilitate the communication between them. It
is implemented using RabbitMQ. The Nova database stores most of the configuration and run-time
state of the cloud infrastructure; it contains a set of tables such as instance types, instances in
use, networks available, fixed IPs, projects and virtual interfaces [82].
7.3.2. OpenStack Image Service (Glance)
Glance manages virtual disk images. It consists of three main sub-components: glance-api,
glance-registry and glance-database (Figure 16). Glance-api accepts incoming API requests
and then communicates them to the other components (glance-registry and the image store). All
information about images is stored in glance-database. The last component, glance-registry,
is responsible for retrieving and storing metadata about images [82].
Figure 16: Glance subcomponents
7.3.3. OpenStack Identity Service (Keystone)
Keystone authorizes users’ access to OpenStack components. It supports multiple forms of
authentication including standard username and password credentials and token-based
systems. Keystone architecture is represented by the following subcomponents (Figure 17):
token backend, catalog backend, policy backend and identity backend [82].
Figure 17: Keystone subcomponents
7.3.4. OpenStack Object Store (Swift)
Swift is the oldest project within OpenStack, and it is the underlying technology that powers
Rackspace’s Cloud Files service [82]. Swift provides a massively scalable and redundant
object store by writing multiple copies of each object to multiple, separate storage
servers so as to handle failures efficiently. Swift consists of a Proxy Server, an Account
Server, a Container Server, and an Object Server (Figure 18).
Figure 18: Swift subcomponents
Swift-proxy accepts incoming requests, which consist of uploading files, modifying
metadata and creating containers. Requests are served by the account server, the container
server or the object server. The object server handles requests to manage pre-existing objects
or files in the storage; the account server manages the accounts defined with the object storage
service, and the container server manages the mapping of containers (folders) within the object
store service [82].
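The idea of writing several copies of each object to separate storage servers can be sketched as follows. This is a deliberately simplified, hash-based placement written for illustration; it is not Swift's actual placement mechanism, and the function name `place_replicas`, the server names and the replica count of three are assumptions.

```python
import hashlib


def place_replicas(object_name, servers, replicas=3):
    """Pick `replicas` distinct servers for an object (simplified placement).

    The starting server is derived from an MD5 hash of the object name, and
    the remaining copies go to the following servers in list order.
    """
    if replicas > len(servers):
        raise ValueError("not enough storage servers for the replica count")
    digest = hashlib.md5(object_name.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(replicas)]


servers = ["storage-1", "storage-2", "storage-3", "storage-4", "storage-5"]
targets = place_replicas("photos/cat.jpg", servers)
```

Because the placement depends only on the object name and the server list, any proxy can recompute where the copies of an object live, and losing one server still leaves two copies available.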
7.3.5. OpenStack Block Storage Service (Cinder)
Cinder allows block devices to be connected to virtual machine instances for better
performance. It consists of the following sub-components: cinder-api, cinder-volume,
cinder-database and cinder-scheduler (Figure 19).
Cinder-api accepts incoming requests and directs them to cinder-volume, which reads from and
writes to the cinder database to maintain state and interacts with other processes.
Cinder-scheduler is responsible for selecting the optimal block storage node on which to create
the volume. A message queue is used to maintain communication between the Cinder
components.
Figure 19: Cinder subcomponents
7.3.6. OpenStack Network Service (Quantum)
Quantum allows users to create their own networks and then attach interfaces to them. It
consists of quantum-server, quantum-agent, quantum-plugin and quantum-database
(Figure 20). Quantum-server accepts incoming API requests and then directs them to the
correct quantum-plugin. Plugins and agents perform the actual actions, such as plugging and
unplugging ports, creating networks and subnets, and IP addressing. Finally, quantum-database
stores the networking state for particular plugins.
Figure 20: Quantum subcomponents
7.4.OpenStack Supported Hypervisors
The abstraction layer provided by OpenStack Compute allows it to support various existing
hypervisors, including KVM, LXC, QEMU,
UML, VMware ESX/ESXi, Xen, PowerVM and Hyper-V [79]. However, KVM is still the most
widely used hypervisor in OpenStack deployments. Besides KVM, some existing deployments
run Xen, LXC, VMware or Hyper-V, but each of these hypervisors either lacks support for some
features or is poorly documented with regard to its use with OpenStack.
Chapter 8: Hadoop
Hadoop has been adopted by big players in the market such as Google, Yahoo, LinkedIn,
Facebook, New York Times, IBM, etc. [84]. This chapter provides a detailed overview of
Hadoop, covering a brief history of the project, its architecture, its
implementation and some related features.
8.1.Hadoop Overview
Hadoop is an open source Apache framework, written in Java, that supports processing,
analyzing and storing large data sets across large clusters using the MapReduce paradigm and
HDFS, its distributed file system [85]. Hadoop has been designed to be a reliable,
fault-tolerant and scalable project that can scale up from one single machine to thousands of
machines.
8.2.Hadoop History
Hadoop originates from Nutch, an open source project for web crawling and indexing that Doug
Cutting created in 2002. Nutch was developed to handle searching
issues, but it faced a scalability problem, as it could not scale up to billions of web pages. To
deal with this issue, the Nutch team drew inspiration from the Google File System (GFS). By
adopting the GFS architecture in 2004, the Nutch team delivered an open source implementation
called the Nutch Distributed Filesystem (NDFS) [86].
When Google published its paper on the MapReduce algorithm, the Nutch team took
advantage of that work by introducing MapReduce alongside NDFS. Implementing both
NDFS and MapReduce made Nutch a powerful system for web crawling and indexing.
This success pushed the Nutch team to spin off an independent project in 2006, named
Hadoop. Around that time, Doug Cutting joined Yahoo!, which provided enough resources to
improve Hadoop's performance. Although Yahoo! developed and contributed 80% of the
Hadoop project, Hadoop became its own top-level project at Apache in January 2008 [87].
Besides implementing MapReduce and HDFS, the Hadoop project includes other
subprojects, which are listed in Table 6 [85].
Table 6: Apache Hadoop subprojects
Hadoop subprojects are grouped and named Hadoop Ecosystem. The overall picture of
Hadoop Ecosystem is illustrated in Figure 21.
[Figure 21 content: ETL Tools, BI Reporting, RDBMS, Pig (Data Flow), Hive (SQL), Sqoop, HBase, Avro, Zookeeper, MapReduce (Job Scheduling / Execution System), HDFS (Hadoop Distributed File System)]
Figure 21: Apache Hadoop subprojects [85]
8.3.Hadoop Architecture
Hadoop implements a master/slave architecture, where the master is named NameNode and the
slave is named DataNode. The NameNode manages the file system namespace, which consists of a
hierarchy of files and directories used for data storage. When a file is created by a client
application, it is divided into blocks; each block is replicated and stored in DataNodes. The
number of replicas (block copies) and the mapping between replicas
and blocks are stored in the NameNode. Each DataNode, on the other hand, is in charge of
managing the storage attached to the node on which it runs. Furthermore, each DataNode
handles read and write operations, as well as block creation, deletion, and replication,
following instructions from the NameNode [86].
Besides the NameNode and DataNodes, a Hadoop cluster includes a Secondary NameNode
(backup node for the NameNode), a JobTracker and TaskTrackers. The JobTracker is located in
the master node, and it is responsible for distributing MapReduce tasks to the other nodes in the
cluster. Each TaskTracker, on the other hand, locally runs the tasks distributed by the
JobTracker; each slave in the cluster hosts one TaskTracker, and a TaskTracker can also run on
the master node [86].
The overall architecture of Hadoop is illustrated in Figure 22.
Figure 22: Hadoop Architecture
8.4.Hadoop Implementation
Hadoop is mainly implemented using HDFS and MapReduce paradigm. HDFS is used to
store large data sets while MapReduce is used to analyze and process data across Hadoop
cluster. Taking into consideration the architecture provided in Figure 22, HDFS concept is
represented by the NameNode, Secondary NameNode and DataNodes, while MapReduce is
represented by the JobTracker and TaskTracker (Figure 23).
Figure 23: HDFS and MapReduce representation
8.4.1. HDFS Overview
HDFS is designed as a hierarchy of files and directories. Each file is divided into blocks that
are stored in different DataNodes. NameNode stores only the metadata that includes
information about blocks’ locations and the number of copies of each block. Furthermore,
HDFS allows NameNode to perform the namespace operations such as opening, closing and
renaming files and directories. As stated before, HDFS performs data replication to ensure
fault-tolerance. The replication factor is set when a file is created, and it can be modified later
[85].
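The block metadata kept by the NameNode can be pictured as a mapping from block index to the DataNodes that hold its replicas. The sketch below is illustrative only: it uses the Hadoop 1.x defaults of a 64 MB block size and a replication factor of three, while the function name `build_block_map` and the round-robin placement are assumptions (real HDFS placement is rack-aware).

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the Hadoop 1.x default block size
REPLICATION = 3                # default HDFS replication factor


def build_block_map(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} for a file of `file_size` bytes.

    Replicas are spread round-robin here, which is a simplification.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds the number of DataNodes")
    num_blocks = math.ceil(file_size / block_size)
    block_map = {}
    for block in range(num_blocks):
        block_map[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return block_map


datanodes = [f"datanode-{i}" for i in range(1, 9)]  # eight nodes, as in a small cluster
block_map = build_block_map(200 * 1024 * 1024, datanodes)  # a 200 MB file
```

With these values, a 200 MB file occupies four blocks, each stored on three different DataNodes.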
The read, write and creation operations illustrate how HDFS works.
During a read operation, the HDFS client requests from the NameNode the list of DataNodes that
host replicas of the blocks of a given file. The list is sorted by the network topology distance
from the client. After deciding on the DataNode from which to fetch the data, the HDFS client
contacts that DataNode directly and requests the desired block. During a
write operation, on the other hand, the HDFS client asks the NameNode to choose the DataNodes
that will store replicas of the first block of the file, then of the second block, and so on. For
each block, the client organizes a node-to-node pipeline and sends the data. When the first
block is filled, the client requests new DataNodes to be chosen to host replicas of the next
block. Concerning the creation operation, when there is a request to create a file, the HDFS
client first caches the file data in a temporary local file. When the latter accumulates data up
to the HDFS block size, the client contacts the NameNode to insert the file name into the file
system namespace and allocate a data block for it. After that, the NameNode selects the
DataNodes that will host the data blocks. At this stage, the client moves the block of data from
the local temporary file to the specified DataNode [85].
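The read path can be sketched in a few lines: the client orders the replica holders by distance and fetches the block from the nearest one. This is a simplified illustration rather than HDFS client code; the function names, the distance values and the fallback to the next replica when a copy is missing are assumptions.

```python
def order_replicas(replica_hosts, distance_to_client):
    """Sort the DataNodes holding a block from nearest to farthest."""
    return sorted(replica_hosts, key=lambda host: distance_to_client[host])


def read_block(block_id, block_locations, distance_to_client, storage):
    """Read one block from the closest DataNode that still holds a replica."""
    for host in order_replicas(block_locations[block_id], distance_to_client):
        data = storage.get(host, {}).get(block_id)
        if data is not None:
            return host, data
    raise IOError(f"no live replica found for block {block_id}")


# Hypothetical cluster state: dn-b is nearest but has lost its copy of the block.
block_locations = {"blk_1": ["dn-a", "dn-b", "dn-c"]}
distance = {"dn-a": 4, "dn-b": 0, "dn-c": 2}
storage = {"dn-a": {"blk_1": b"data"}, "dn-c": {"blk_1": b"data"}}
host, data = read_block("blk_1", block_locations, distance, storage)
```

Here the client ends up reading from `dn-c`, the nearest DataNode that can actually serve the block.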
8.4.2. MapReduce Overview
Hadoop MapReduce is a programming paradigm for processing very large data sets in
parallel on large clusters. The paradigm was first introduced by Google in 2004 [6]. The core
idea of MapReduce is to split the input data set into chunks that are processed by map tasks in
parallel. The output of each map task is sorted and then passed as input to the
reduce task. Accordingly, MapReduce can be divided
into two steps: the map step and the reduce step [88].
The map task is itself divided into five phases: read, map, collect, spill and merge. The
read phase reads the data chunk from HDFS and creates the input
key-value pairs. The map phase executes the user-defined map function to generate the
map-output data. The collect phase collects the intermediate (map-output) data into a
buffer before spilling. The spill phase sorts the data, compresses it if specified, and writes it
to local disk to create file spills. Finally, the merge phase merges
all file spills into one single map output file [88].
The reduce task is divided into four phases: shuffle, merge, reduce and write. The shuffle
phase transfers the intermediate data (map output) from the mapper nodes to a reducer's node,
decompressing it if needed. The merge phase merges the sorted outputs that
come from different mappers and passes them as input to the reduce phase. The reduce phase
executes the user-defined reduce function to produce the final output data. Finally, the write
phase compresses the final output, if needed, and writes it to HDFS [88].
A popular example that illustrates MapReduce execution is the Word Count example,
which counts the number of occurrences of each individual word in a given file (Figure 24)
[89].
Figure 24: Word count MapReduce example [89]
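The word count flow can be reproduced on a single machine with plain Python. This sketch mirrors the map, shuffle and reduce steps described in this section; it does not use the Hadoop API, and the function names and the three input chunks are arbitrary choices made for illustration.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    """Shuffle/merge: group all emitted values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


# Each string stands for one input split handled by a separate map task.
chunks = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
word_counts = reduce_phase(shuffle_phase(map(map_phase, chunks)))
```

On a real cluster, each chunk would be mapped on a different node and the grouped keys would be partitioned across several reducers; the data flow, however, is the same.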
8.5.Hadoop Cluster Connectivity
When a Hadoop cluster starts up, each DataNode performs a handshake with the NameNode.
The purpose of this operation is to verify the namespace ID and the software version of the
DataNode. The namespace ID is assigned to the filesystem instance when it is formatted, and
it is stored on all nodes of the cluster. Nodes with a different namespace ID cannot
join the cluster. If the namespace ID matches, the handshake between the DataNode and the
NameNode succeeds. At this point, each
DataNode stores its unique storage ID, which is an internal identifier of the DataNode. The
main purpose of this ID is to make the DataNode recognizable even if it is restarted with a
different IP address or port [87].
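The handshake logic can be summarized with a toy model. It is a sketch and not Hadoop code: the class `ToyNameNode`, the `DS-` prefix, the example addresses and the assumption that the storage ID is generated on first registration are all illustrative.

```python
import uuid


class ToyNameNode:
    """Hypothetical model of the registration handshake."""

    def __init__(self, namespace_id, version):
        self.namespace_id = namespace_id
        self.version = version
        self.registered = {}  # storage ID -> last known (ip, port)

    def handshake(self, namespace_id, version, storage_id, address):
        """Admit a DataNode only if namespace ID and software version match."""
        if namespace_id != self.namespace_id or version != self.version:
            return None  # the node cannot join this cluster
        # Assumption: a first-time node has no storage ID, so one is generated.
        storage_id = storage_id or f"DS-{uuid.uuid4()}"
        self.registered[storage_id] = address
        return storage_id


namenode = ToyNameNode(namespace_id=1234, version="1.2.1")
first_id = namenode.handshake(1234, "1.2.1", None, ("10.0.0.2", 50010))
# The same DataNode restarts with a new IP but presents its stored storage ID.
second_id = namenode.handshake(1234, "1.2.1", first_id, ("10.0.0.9", 50010))
rejected = namenode.handshake(9999, "1.2.1", None, ("10.0.0.3", 50010))
```

Because the node is keyed by its storage ID rather than its address, the restart with a different IP is recognized as the same DataNode.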
During normal operation, DataNodes send heartbeats to the NameNode to confirm that the
DataNode is operating and that the block replicas it hosts are available. The default heartbeat
interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode
for ten minutes, it considers the DataNode dead and
creates new replicas of the blocks hosted by the dead DataNode on other DataNodes. Heartbeats
are not only used to ensure NameNode-DataNode connectivity; they also carry
statistical information such as the total storage capacity and the fraction of storage in use. In
addition, the NameNode uses its replies to heartbeats to send instructions to DataNodes. Those
instructions include commands to replicate blocks to other DataNodes, remove local block
replicas, re-register and send an immediate block report, and shut down the node. These
commands are important for maintaining the overall system integrity, and it is therefore
critical to keep heartbeats frequent even on big clusters. The NameNode can process
thousands of heartbeats per second without affecting other NameNode operations [87].
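The dead-node rule described above (a heartbeat every three seconds, a DataNode declared dead after ten minutes of silence) can be sketched as simple bookkeeping on the NameNode side. The class `HeartbeatMonitor` and its method names are hypothetical, and time is passed in explicitly so that the example stays deterministic.

```python
HEARTBEAT_INTERVAL_S = 3        # default heartbeat interval stated above
DEAD_NODE_TIMEOUT_S = 10 * 60   # a node silent for ten minutes is considered dead


class HeartbeatMonitor:
    """Hypothetical NameNode-side bookkeeping of DataNode heartbeats."""

    def __init__(self, timeout_s=DEAD_NODE_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_seen = {}  # datanode name -> timestamp of last heartbeat

    def heartbeat(self, datanode, now):
        """Record a heartbeat received at time `now` (in seconds)."""
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """Return the DataNodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor()
monitor.heartbeat("datanode-1", now=0)
monitor.heartbeat("datanode-2", now=0)
# datanode-1 keeps reporting every three seconds; datanode-2 goes silent.
for t in range(HEARTBEAT_INTERVAL_S, 601, HEARTBEAT_INTERVAL_S):
    monitor.heartbeat("datanode-1", now=t)
dead = monitor.dead_nodes(now=601)
```

Once a node appears in `dead`, the NameNode would schedule re-replication of the blocks that node was hosting.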
Part IV: Research Contribution
To clarify the steps we followed in this study, we divided this part into four chapters.
Chapter 9 defines the research methodology; Chapter 10 describes the experimental
setup that we used to measure the performance of HPCaaS; Chapter 11 presents the results of
each experiment, and finally, Chapter 12 discusses and analyzes the research findings.
Chapter 9: Research Methodology
The choice of research methodology depends mainly on the nature of the research question.
This chapter discusses the methodology that was followed in conducting the present study. It
explains first the choice of the selected methodology, and then it demonstrates an overall
picture of the research steps.
9.1.Research Approach
The present research is based on a combination of qualitative and quantitative approaches.
The qualitative approach was followed to compare and select appropriate technology enablers for
this research (Part III, Chapter 6), whereas the quantitative approach was adopted to provide
numeric measurements of HPC on the physical cluster and the virtualized clusters (Part IV,
Chapters 10, 11 and 12).
9.2.Research Steps
Figure 25 summarizes the steps followed in conducting the present research.
Figure 25 : Research steps
Chapter 10: Experimental Setup
In order to investigate the research question, we have conducted three main experiments. The
first experiment evaluates the performance of HPC on Hadoop Physical Cluster (HPhC); the
second experiment evaluates the performance of HPC using Hadoop Virtualized Cluster
(HVC) with KVM, and the last experiment evaluates HPC using Hadoop virtualized cluster
with VMware ESXi virtualization technology.
This chapter describes the experimental setup used in this research. It provides an overall
picture of the three adopted clusters; it specifies the hardware, software and network
specifications; it introduces the benchmarks used to evaluate the performance of HPC on each
cluster; it lists the dataset sizes used in each experiment, and finally, it explains the
experimental execution of the present research.
10.1.Experimental Hardware
In our performance study, we have built 3 different clusters: Hadoop Physical Cluster,
Hadoop Virtualized Cluster using KVM and Hadoop Virtualized Cluster using VMware
ESXi. Each cluster is composed of eight machines.
For the physical cluster, we used 8 Dell OptiPlex 755 desktop computers with the specifications
listed in Table 7. For both Hadoop virtualized clusters (KVM and VMware ESXi), we used a
Dell PowerEdge server with the features listed in Table 8. On top of the server, we installed
OpenStack to create eight virtual machines, first with the KVM hypervisor and then with the
VMware ESXi hypervisor. Because of the limited flexibility of OpenStack, we could only create
VMs with the features described in Table 9.
Table 7 : Dell OptiPlex 755 computer features (used for Hadoop physical cluster)
Table 8 : Dell PowerEdge server used for building OpenStack & Hadoop virtualized cluster
Table 9 : OpenStack virtual machines’ features
10.2.Experimental Software and Network
As stated in Chapter 6, we opted for Hadoop to process and store small and large datasets; we
chose to install Hadoop version 1.2.1. Concerning OpenStack, we adopted the
Folsom release, which supports KVM, Xen, VMware and other hypervisors. The network
provided a bandwidth of 100 Mbps per port.
10.3.Clusters Architecture
In this section, we will conceptualize each individual cluster in terms of its layers and
components.
10.3.1. Hadoop Physical Cluster
Figures 26 and 27 show an overall picture of the Hadoop Physical Cluster. The configuration was
done in the Linux Lab at AUI. The lab is connected to a 1 Gbps switch (which provides 100 Mbps
per port) that is also connected to other offices in the building where the lab is located. As
both figures depict, the cluster contains eight machines, one of which was selected to be
both the master and a slave node. The reason for making the master node
serve as a slave node as well is to increase the cluster performance when processing
and storing datasets.
Figure 26 : Hadoop Physical Cluster
Figure 27: Hadoop Physical Cluster architecture
10.3.2. Hadoop Virtualized Cluster – KVM
The second cluster we built in this research is the Hadoop Virtualized Cluster with KVM
technology. As Figure 28 shows, the first step in configuring the cluster is to install an
operating system on the Dell PowerEdge server; the OS that was selected is Ubuntu Precise 12.04
LTS (64-bit). The next step is to install and configure the KVM packages, which are loaded into
the Linux OS as the KVM driver. After preparing the system with the OS and the KVM hypervisor,
the next step is to install OpenStack on top of the OS (OpenStack with KVM documentation is
provided in Appendix A). Finally, the last step is to configure Hadoop on top of each OpenStack
VM instance (Hadoop documentation is provided in Appendix C).
Figure 28: Hadoop virtualized cluster - KVM
The first OpenStack component that needs to be installed is Keystone, which manages
authentication to OpenStack resources. After downloading and installing the keystone
package, the next step is to create tenants (OpenStack projects) and OpenStack users that are
associated with one or more tenants. Each user can be a member or an admin in a given project;
roles therefore need to be created in order to set the rights and privileges of each user. After
creating users, tenants and roles, the next step is to create the OpenStack services (nova,
keystone, and glance), each of which provides one or more endpoints (URLs) through which users
can access OpenStack resources. The second component to install is OpenStack Glance, which
allows creating and managing images of different operating systems (Ubuntu, Fedora, Windows,
etc.). The Glance packages include glance-api, which accepts incoming API requests;
glance-database, which stores all information about images; and glance-registry, which is
responsible for retrieving and storing metadata about images. The third component to deploy is
the Nova package, which includes nova-compute, nova-scheduler, nova-network,
nova-objectstore, nova-api, rabbitmq-server, novnc and nova-consoleauth. All these components
collaborate and communicate with each other to create and manage instances, networks and, if
needed, volumes. Finally, to access instances through a user-friendly interface, the
OpenStack dashboard (Horizon) can be installed and configured. After logging in to the
OpenStack dashboard, the user can launch instances and specify the number of
CPUs, disk space, total RAM per VM, etc.
After creating the VM instances (with the requirements listed in Table 9), we installed Hadoop
1.2.1 on each VM. Hadoop configuration starts with identifying the master node and the slave
nodes. For the master node, six files need to be configured: core-site, hadoop-env, hdfs-site,
mapred-site, masters and slaves. For the slave nodes, the only files that need to be
configured are hadoop-env, core-site, hdfs-site and mapred-site. Once the nodes are connected,
the cluster's file system needs to be formatted in order to start from a clean file namespace.
After formatting, the cluster can be started to run jobs.
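As an illustration of the configuration files mentioned above, the following Python sketch renders minimal contents for the three site files and the slaves file. It is not the configuration used in this research (that one is documented in Appendix C): the hostname `master`, the ports 9000 and 9001, the replication factor and the helper names are assumptions chosen for the example, while the property names are the Hadoop 1.x ones.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of Hadoop properties as the text of a *-site.xml file."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


def slaves_file(hostnames):
    """The slaves file lists one worker hostname per line."""
    return "\n".join(hostnames) + "\n"


MASTER = "master"  # assumed hostname of the master node

# core-site: where the HDFS NameNode listens (port is an assumption).
core_site = hadoop_site_xml({"fs.default.name": f"hdfs://{MASTER}:9000"})
# mapred-site: where the JobTracker listens (port is an assumption).
mapred_site = hadoop_site_xml({"mapred.job.tracker": f"{MASTER}:9001"})
# hdfs-site: number of block replicas (value is an assumption).
hdfs_site = hadoop_site_xml({"dfs.replication": 3})

if __name__ == "__main__":
    print(core_site)
    print(slaves_file(["slave1", "slave2"]), end="")
```

The same three site files are copied to every node; only the master additionally needs the masters and slaves files.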
10.3.3. Hadoop Virtualized Cluster – VMware ESXi
The third cluster built in this research is the Hadoop Virtualized Cluster using VMware
ESXi technology (Figure 29). The first step in configuring this cluster is to install VMware
ESXi on top of the Dell PowerEdge server. Then, OpenStack is configured on top of the
hypervisor (OpenStack with VMware ESXi documentation is provided in Appendix B). After
configuring OpenStack, instances can then be created to build the Hadoop cluster.
Figure 29: Hadoop virtualized cluster – VMware ESXi (a)
In fact, when installing OpenStack with VMware ESXi, OpenStack is installed as a VM on top
of the VMware ESXi hypervisor. Then, through the OpenStack dashboard, instances can be created
as VMs on top of the VMware ESXi hypervisor (Figure 30).
Figure 30 : Hadoop virtualized cluster – VMware ESXi (b)
10.4. Experimental Performance Benchmarks
To evaluate the impact of machine virtualization on HPCaaS, we adopted two well-known
benchmarks: TeraSort and TestDFSIO [90]. The TeraSort performance metric is the average
time to sort a given dataset, while the TestDFSIO performance metrics are the execution
times to write and read datasets. Table 10 summarizes the performance metrics used in
evaluating HPCaaS.
Table 10 : Experimental performance metrics
10.4.1. TeraSort Description
TeraSort was developed by Owen O’Malley and Arun Murthy at Yahoo Inc. [90]. It won the
annual general purpose terabyte sort benchmark in 2008 and 2009. It does considerable
computation, networking, and storage I/O, and is often considered to be representative of real
Hadoop workloads [90]. The benchmark is divided into three main steps: TeraGen, TeraSort and
TeraValidate.
TeraGen generates the random data that will be sorted by TeraSort. It writes the generated
data as a file of n rows, where each row is 100 bytes long. Each row is formatted as follows:
a 10-byte key, a 10-byte rowid and a 78-byte filler, followed by a 2-byte line terminator,
where keys are random characters from the set ' ' .. '~', rowid is an integer that specifies
the row id, and the filler consists of 7 runs of 10 characters from 'A' to 'Z'. Once the data
is generated, TeraSort sorts it using the quicksort algorithm. The sort is integrated with
map/reduce tasks through a partitioner that uses a sorted list of N−1 sampled keys, which
define the key range of each of the N reduces [9]. Finally, TeraValidate ensures that the
output data of TeraSort is sorted. It creates one map task per file in TeraSort’s output
directory; each map task ensures that each key is greater than or equal to the previous one.
Furthermore, each map task generates records with the first and last keys of its file; the
reduce task then ensures that the first key of file i is greater than the last key of file i−1.
If there are any unordered keys, TeraValidate reports them in the output of the reduce task
[90]. (The TeraSort benchmark is documented in Appendix D.)
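The partitioning and validation steps described above can be sketched in a few lines of Python. This is a single-process illustration of the idea, not Hadoop's implementation: the function names, the sample size and the use of Python's built-in sort in place of the per-task sort are assumptions made for the example.

```python
import bisect
import random


def sample_split_points(keys, num_reduces, sample_size=1000, seed=0):
    """Pick num_reduces - 1 split points from a sorted random sample of keys.

    Assumes `keys` is a non-empty list.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_reduces
    return [sample[int(step * i)] for i in range(1, num_reduces)]


def reduce_index(key, split_points):
    """Index of the reduce whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


def sort_with_range_partitioner(keys, num_reduces):
    """Route keys to reduces by range, then sort each reduce's keys."""
    split_points = sample_split_points(keys, num_reduces)
    outputs = [[] for _ in range(num_reduces)]
    for key in keys:
        outputs[reduce_index(key, split_points)].append(key)
    # Built-in sort stands in for the sort done inside each Hadoop task.
    return [sorted(part) for part in outputs]


def validate(outputs):
    """Return (file_index, key) pairs that break the global order."""
    errors = []
    last_key = None
    for i, part in enumerate(outputs):
        # Within a file: each key must be >= the previous one.
        for prev, cur in zip(part, part[1:]):
            if cur < prev:
                errors.append((i, cur))
        # Across files: first key of file i must not precede last key of file i-1.
        if part:
            if last_key is not None and part[0] < last_key:
                errors.append((i, part[0]))
            last_key = part[-1]
    return errors


def random_keys(count, seed=1):
    """Generate 10-character keys drawn from ' ' .. '~'."""
    rng = random.Random(seed)
    alphabet = [chr(c) for c in range(ord(" "), ord("~") + 1)]
    return ["".join(rng.choice(alphabet) for _ in range(10)) for _ in range(count)]


if __name__ == "__main__":
    parts = sort_with_range_partitioner(random_keys(500), num_reduces=4)
    print([len(p) for p in parts], validate(parts))
```

Because every key in one range is smaller than every key in the next, concatenating the per-reduce outputs in order yields a globally sorted result without any final merge.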
10.4.2. TestDFSIO Description
The TestDFSIO benchmark is used to check the I/O rate of a Hadoop cluster with write and read
operations. Such a benchmark can be helpful for testing HDFS by checking network
performance, and for testing the hardware, OS and Hadoop setup [90]. TestDFSIO is written in
Java, and its source code can be found in [91]. TestDFSIO is composed of TestDFSIO-Write and
TestDFSIO-Read. Both operations are performed by specifying the number of files and the
size of each file in megabytes [90]. (The TestDFSIO benchmark is documented in Appendix D.)
10.5 Experimental Datasets Size
In each experiment, we measured the performance of Hadoop cluster using different dataset
sizes. For TeraSort, we used 100 MB, 1 GB, 10 GB and 30 GB datasets, and for TestDFSIO,
we used 100 MB, 1 GB, 10 GB and 100 GB datasets. Table 11 summarizes the dataset sizes
used in this research.
Table 11 : Datasets size used for Hadoop benchmarks
10.6 Experiment Execution
We conducted each experiment by scaling the cluster from three machines up to eight
machines. In other words, we tested each benchmark on three machines, four machines, and so
on until we reached eight machines. Furthermore, for each individual benchmark, we performed
three tests on 100MB, 1GB, 10 GB and 30 GB (TeraSort) and 100MB, 1GB, 10 GB and 100 GB
(TestDFSIO), then we calculated the mean to limit the effect of outliers and to provide more
accurate results. Figure 31 summarizes the steps of running experiment 1 on HPhC using the
TeraSort benchmark.
Figure 31 : Experimental execution
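The execution procedure described above amounts to three nested loops followed by an average. The Python sketch below shows that structure under stated assumptions: `run_once` is a hypothetical callback that launches one benchmark run and returns its duration in seconds, and the constant names are illustrative.

```python
from statistics import mean

NODE_COUNTS = range(3, 9)  # clusters of 3, 4, ..., 8 machines
DATASETS = {
    "TeraSort": ["100 MB", "1 GB", "10 GB", "30 GB"],
    "TestDFSIO": ["100 MB", "1 GB", "10 GB", "100 GB"],
}
REPEATS = 3  # three tests per scenario


def run_experiments(run_once, benchmark):
    """Return {(nodes, dataset): mean time in seconds} for one benchmark.

    `run_once(benchmark, dataset, nodes)` is a hypothetical function that
    executes a single run and returns the elapsed time in seconds.
    """
    results = {}
    for nodes in NODE_COUNTS:
        for dataset in DATASETS[benchmark]:
            times = [run_once(benchmark, dataset, nodes) for _ in range(REPEATS)]
            results[(nodes, dataset)] = mean(times)
    return results
```

Each benchmark therefore produces 6 x 4 = 24 averaged values per cluster, one for every combination of cluster size and dataset size.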
Chapter 11: Experimental Results
This chapter presents the findings from running each experiment. It presents the results
of running HPC on HPhC, on HVC with KVM, and on HVC with VMware ESXi. The last section
compares the results of the three experiments. (The results we got from running the
experiments are listed in Appendices E and F.)
11.1. Hadoop Physical Cluster Results
11.1.1. TeraSort Performance on HPhC
Running TeraSort benchmark showed that it needs much time to sort large datasets like 10
GB and 30 GB. Yet, scaling the cluster to more nodes led to significant time reduction in
sorting datasets. The results we got from running this benchmark on Hadoop Physical Cluster
are listed in Table 12 and conceptualized in Figure 32.
Table 12: Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes - Hadoop Physical Cluster
Figure 32: TeraSort performance on Hadoop Physical Cluster
Figures 33 and 34 clearly illustrate the benefit of scaling the cluster. For instance, sorting
100MB with 3 nodes takes around 21.33 seconds, while with 8 nodes it takes 19.97 seconds
(a reduction of about 6%). In the case of 1GB, the average time was reduced by 4% when scaling
from 3 to 8 nodes.
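The percentage reductions quoted in this chapter can be reproduced from the measured averages. As a worked example, the sketch below applies the usual relative-reduction formula to the two 100 MB values given above; the function name is illustrative.

```python
def percent_reduction(before, after):
    """Relative reduction of `after` with respect to `before`, in percent."""
    return (before - after) / before * 100.0


# 100 MB TeraSort on the physical cluster: 3 nodes vs. 8 nodes.
reduction = percent_reduction(21.33, 19.97)
print(f"{reduction:.1f}%")  # prints 6.4%
```

A positive value means the run became faster; a negative value would indicate that the average time increased.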
Figure 33: TeraSort performance for 100 MB on Hadoop Physical Cluster
Figure 34: TeraSort performance for 1 GB on Hadoop Physical Cluster
Concerning 10GB, the results were somewhat different (Figure 35). The time to sort 10 GB was
reduced by 18.55% when scaling from 3 to 6 machines. Yet, increasing the number of machines to
8 led to a significant reduction in sorting performance. This can be explained by the impact
of a network bottleneck, especially since Hadoop is highly sensitive to this issue. In
contrast, the impact of 8 nodes was important when running large datasets like 30 GB
(Figure 36). In this case, the average time to sort the dataset was reduced by 24.77% (a
difference of 42 minutes) when increasing the number of nodes from 3 to 8.
Figure 35: TeraSort performance for 10 GB on Hadoop Physical Cluster
11.1.2. TestDFSIO- Write Performance on HPhC
Running TestDFSIO-Write on the Hadoop physical cluster follows, in general, one pattern:
as the number of nodes increases, the average time to write the different dataset sizes
decreases. Table 13 and Figure 37 list and illustrate the results we got from running
TestDFSIO-Write on HPhC.
Table 13: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes - Hadoop Physical Cluster
Figure 37: TestDFSIO-Write performance on Hadoop Physical Cluster
Zooming in on TestDFSIO-Write for the 100MB dataset (Figure 38), the average time decreased
as the number of slaves increased. In this case, scaling the cluster from 3 machines
(including the master) to 8 machines led to a reduction of 11.25% in the overall average
writing time. The same observation applies when running TestDFSIO-Write for the 1GB dataset
(Figure 39), where the average time was reduced by 46.5% when scaling from 3 to 8 slaves.
Figure 38: TestDFSIO-Write performance for 100 MB on Hadoop Physical Cluster
Figure 39: TestDFSIO-Write performance for 1 GB on Hadoop Physical Cluster
Figure 41: TestDFSIO-Write performance for 100 GB on Hadoop Physical Cluster
When writing 100 GB (Figure 41), we observe a sharp reduction in the TestDFSIO-Write time
when scaling from 3 to 8 slaves; this reduction was 12.53%. However, the average time
unexpectedly increased when scaling from 4 to 5 machines. Again, this unexpected result can
be explained by the overall network performance.
11.1.3. TestDFSIO- Read Performance on HPhC
Running TestDFSIO-Read also led to a significant performance improvement when the
physical cluster was scaled up to 8 machines (Table 14 and Figure 42). In general, this
observation applies to all dataset sizes.
Table 14: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and
different number of nodes- Hadoop Physical Cluster
Figure 42: TestDFSIO-Read performance on Hadoop Physical Cluster
When the cluster was scaled from 3 to 7 nodes, the average time was reduced by 4.36% for
reading 100MB (Figure 43) and by 2.46% for reading 1GB (Figure 44). However, when scaling
the cluster from 7 to 8 machines, the average time increased suddenly when reading both
100MB and 1GB. The same observation was made when reading 10GB and 100GB (Figures
45 and 46).
Figure 43: TestDFSIO-Read performance for
100 MB on Hadoop Physical Cluster
Figure 44 : TestDFSIO-Read performance for 1
GB on Hadoop Physical Cluster
Figure 45: TestDFSIO-Read performance for 10 GB on Hadoop Physical Cluster
Figure 46: TestDFSIO-Read performance for 100 GB on Hadoop Physical Cluster
11.2. Hadoop Virtualized Cluster - KVM Results
11.2.1. TeraSort Performance on HVC-KVM
Running TeraSort on the Hadoop KVM Cluster showed an important improvement in sorting
various dataset sizes. Yet, this observation applies only when scaling the KVM cluster from 3
to 5 VMs. The results we got from running this benchmark on the Hadoop KVM Cluster are listed
in Table 15 and illustrated in Figure 47.
Table 15: Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes - Hadoop KVM Cluster
Figure 47: TeraSort performance on Hadoop KVM Cluster
From Figure 48, sorting 100MB on 3 VMs takes around 15 seconds, and it decreases by 2.2%
and 5.5% when sorting the dataset on 4 and 5 VMs respectively.
Figure 48: TeraSort performance for 100 MB on Hadoop KVM Cluster
Figure 49: TeraSort performance for 1 GB on Hadoop KVM Cluster
When sorting 1GB, 10 GB and 30 GB (Figures 49, 50 and 51), the performance improved slightly
as the number of VMs increased. For example, the sorting time of 10GB decreased by 0.3%, and
the sorting time of 30 GB decreased by 5% when scaling from 3 to 4 nodes. However, when the
cluster was scaled to 5, 6, 7 and 8 nodes, the overall performance of sorting 1GB, 10 GB and
30 GB degraded sharply.
Figure 50: TeraSort performance for 10 GB on Hadoop KVM Cluster
Figure 51: TeraSort performance for 30 GB on Hadoop KVM Cluster
11.2.2. TestDFSIO-Write Performance on HVC-KVM
The performance of TestDFSIO-Write on the Hadoop KVM cluster improved slightly as the number
of VMs increased. The results of running TestDFSIO-Write are listed in Table 16 and
illustrated in Figure 52.
Table 16: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes - Hadoop KVM Cluster
Figure 52 : TestDFSIO-Write performance on Hadoop KVM Cluster
For all dataset sizes (Figures 53, 54, 55 and 56), as stated before, the overall performance
improved slightly as the number of VMs increased from 3 to 4 and 5. For instance, writing
10GB improved by 1.6% when scaling from 3 to 5 VMs. Furthermore, when trying to
write 100GB, the system crashed because of the overall system overhead (Figure 56).
Figure 53: TestDFSIO-Write performance for 100
MB on Hadoop KVM Cluster
Figure 54 : TestDFSIO-Write performance for
1GB on Hadoop KVM Cluster
11.2.3. TestDFSIO- Read Performance on HVC-KVM
TestDFSIO-Read has the same behavior as TestDFSIO-Write: the performance of reading
different dataset sizes increases as the number of VMs increases from 3 to 5. The
results we got from running TestDFSIO-Read are listed in Table 17 and illustrated in Figure 57.
Table 17: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes - Hadoop KVM Cluster
Figure 57: TestDFSIO-Read performance on Hadoop KVM Cluster
As Figures 58, 59, 60 and 61 depict, the overall performance of reading different dataset
sizes increases as the number of VMs increases from 3 to 5. For example, the average time for
reading 100GB decreased slightly, by 3%, when scaling from 3 to 5 VMs.
Figure 58: TestDFSIO-Read performance for 100 MB on Hadoop KVM Cluster
Figure 59: TestDFSIO-Read performance for 1GB on Hadoop KVM Cluster
Figure 60: TestDFSIO-Read performance for 10
GB on Hadoop VMware ESXi Cluster
100 GB on Hadoop VMware ESXi Cluster
11.3. Hadoop Virtualized Cluster - VMware ESXi Results
11.3.1. TeraSort Performance on HVC-VMware ESXi
Table 18 and Figure 62 present the performance of running TeraSort on the Hadoop VMware
ESXi Cluster; the overall observation shows significant improvement in sorting various
dataset sizes. In contrast to the KVM cluster, the VMware ESXi cluster keeps decreasing the
average sorting time as the number of VMs increases from 3 to 6 (for large datasets).
Table 18: Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes - Hadoop VMware ESXi Cluster
Figure 62 : TeraSort performance on Hadoop VMware ESXi Cluster
As Figure 63 depicts, the time to sort 1 GB decreased by 23% when scaling
the cluster from 3 to 6 VMs. Yet, the performance starts degrading as the number of VMs
increases from 6 to 7 and 8.
Figure 63: TeraSort performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 64: TeraSort performance for 1GB on Hadoop VMware ESXi Cluster
Sorting performance was particularly high for 30GB (Figure 66). Compared with 3 VMs, the
performance improved by 34% with 6 VMs, 25% with 7 VMs and 3% with 8 VMs.
Figure 65: TeraSort performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 66: TeraSort performance for 30GB on Hadoop VMware ESXi Cluster
11.3.2. TestDFSIO-Write Performance on HVC-VMware ESXi
The performance of TestDFSIO-Write on the Hadoop VMware ESXi cluster improved as the number
of VMs increased up to 7. The results of running TestDFSIO-Write are listed in Table 19 and
illustrated in Figure 67.
Table 19: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes - Hadoop VMware ESXi Cluster
Figure 67 : TestDFSIO-Write performance on Hadoop VMware ESXi Cluster
For all dataset sizes (Figures 68, 69, 70 and 71), the overall performance improved as the
number of VMs increased from 3 to 7. For instance, writing 100 MB improved by 37%
when scaling from 3 to 7 VMs. Furthermore, when writing a large dataset like 10GB, the
overall performance increased by 12% when scaling from 3 to 7 VMs. However, in the case
of 100GB, the performance started degrading when scaling from 6 to 7 and 8 VMs.
Figure 68: TestDFSIO-Write performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 69: TestDFSIO-Write performance for 1GB on Hadoop VMware ESXi Cluster
11.3.3. TestDFSIO- Read Performance on HVC-VMware ESXi
TestDFSIO-Read behaves like TestDFSIO-Write: the performance of reading different
dataset sizes increases as the number of VMs increases from 3 to 7. However, the average
time for reading the different datasets was lower than that of the write operation (by more
than half). The results we got from running TestDFSIO-Read on VMware ESXi are listed in
Table 20 and illustrated in Figure 72.
Figure 72 : TestDFSIO-Read performance on Hadoop VMware ESXi Cluster
Figures 73, 74, 75 and 76 show the performance of running TestDFSIO-Read on each
individual dataset. For most dataset sizes, the performance improved as the number
of VMs increased up to 7. For instance, the performance of reading 100GB improved
by 36% when scaling from 3 to 7 VMs. However, reading 1GB behaved differently, as the
corresponding performance started to decline at 6 VMs.
Figure 73: TestDFSIO-Read performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 74: TestDFSIO-Read performance for 1 GB on Hadoop VMware ESXi Cluster
Figure 75: TestDFSIO-Read performance for 10 GB on Hadoop VMware ESXi Cluster
11.4. Results Comparison
11.4.1. TeraSort Performance
The overall performance of the three clusters varies depending on the dataset size and the
number of nodes in each cluster. Yet, the Hadoop VMware ESXi cluster performed much better
than the other clusters when running the TeraSort benchmark on large datasets.
Starting with 100MB (Figure 77), TeraSort showed high performance when virtualized
with VMware ESXi and KVM. With 3 nodes, the virtualized clusters were 25% (VMware ESXi) and
30% (KVM) faster than the physical cluster. Further, when scaling the cluster to 4, 5 and 6
nodes, both KVM and VMware ESXi remained faster than the physical cluster.
After increasing the number of nodes to 7 and 8, VMware ESXi performance decreases by
33% and becomes slower than the physical cluster by 18% (when scaling from 3 to 8 nodes).
On the other hand, the average time of sorting the 100MB dataset on the KVM cluster declined
as the number of nodes increased to 7 and 8; the sorting time improved from 15 to 14 seconds.
Further, the virtualized cluster (KVM) performed better than the physical cluster by 29.5%
and 27% for 7 and 8 nodes respectively.
Figure 77 : Average time for sorting 100 MB on HPhC, HVC with KVM and
VMware ESXi
When increasing the dataset size, the performance changes in each scenario (dataset size and
number of nodes). In the case of 1GB (Figure 78), the virtualized clusters kept the best
performance when compared with the physical cluster. When the cluster was composed of 3-5
nodes, the virtualized clusters sorted the 1GB dataset in 87-90 seconds, while the
physical cluster sorted the same dataset in 182-187 seconds. When increasing the
number of nodes from 5 to 8, VMware ESXi was faster than the other clusters; however, KVM
performance declined, both when compared with the KVM cluster of 3-4 nodes and
when compared with the physical cluster. For instance, in the case of 8 machines, the
physical cluster was faster than the KVM cluster by 89%.
Figure 78 : Average time for sorting 1 GB on HPhC, HVC with KVM and VMware ESXi
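A statement such as "cluster A was faster than cluster B by X%" can be computed in more than one way, and this report does not spell out its formula. The sketch below shows two common definitions, applied to the endpoints of the 1 GB time ranges quoted above as a worked example; the function names, and the choice of which values to pair, are assumptions.

```python
def time_saved_percent(slow_time, fast_time):
    """Share of the slower cluster's time that the faster cluster saves."""
    return (slow_time - fast_time) / slow_time * 100.0


def speed_gain_percent(slow_time, fast_time):
    """How much more work per unit of time the faster cluster completes."""
    return (slow_time / fast_time - 1.0) * 100.0


# 1 GB TeraSort, 3-5 nodes: physical 182-187 s, virtualized 87-90 s.
for physical, virtualized in [(182, 87), (187, 90)]:
    saved = time_saved_percent(physical, virtualized)
    gain = speed_gain_percent(physical, virtualized)
    print(f"{physical}s vs {virtualized}s: {saved:.1f}% time saved, {gain:.1f}% speed gain")
```

The two definitions diverge quickly as the gap grows, so the formula should be stated whenever such percentages are compared across benchmarks.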
The same observation made for 1GB applies when sorting the 10GB dataset (Figure 79). Yet, in
this case, the performance of the virtualized clusters was much higher than that of the
physical cluster. For instance, in the case of 5 VMs, the VMware ESXi cluster was faster than
the physical cluster by 60%, and KVM was faster than the physical cluster by 51%.
Figure 79 : Average time for sorting 10 GB on HPhC, HVC with KVM and
VMware ESXi
When moving to larger datasets, the VMware ESXi cluster showed significant performance in
sorting the 30 GB dataset (Figure 80). For instance, in the case of 4 nodes, VMware ESXi
was faster than the KVM cluster by 28% and faster than the physical cluster by 61%. Moreover,
KVM performed better than the physical cluster when the cluster was composed of 3, 4, 5
and 6 nodes. Afterward, when increasing the cluster size to 7 and 8 nodes, the KVM cluster's
performance decreased and it became slower than the physical cluster.
Figure 80: Average time for sorting 30 GB on HPhC, HVC with KVM and VMware ESXi
The last observation concerns VMware ESXi performance on the 8-node cluster. For all
datasets, we observed that VMware ESXi performance degraded; for example, for 10 GB,
the performance decreased by 51%. Even so, VMware ESXi kept performing better than the
other clusters.
11.4.2. TestDFSIO- Write Performance
The results we got from TestDFSIO were different from the ones from the TeraSort benchmark.
The overall observation from Figures 81 and 82 is that virtualization still performs better
than the physical cluster.
In the case of a 3-5 node cluster, we can observe that the KVM cluster performs much
better than the VMware ESXi and physical clusters. For instance, when writing 100 MB using 5
nodes, the KVM cluster was 11% faster than the physical cluster and 24% faster than the VMware
ESXi cluster (Figure 81). However, we observed that the physical cluster performed better than
VMware ESXi, with a difference of 48% (100 MB using 5 nodes).
When scaling the cluster from 5 to 8 nodes, the KVM cluster suffered a sharp performance
degradation. Again, this is due to system overhead. In this case, the physical cluster showed
better results than the virtualized clusters.
Figure 81: Average time for writing 1 GB on
HPhC, HVC with KVM and VMware ESXi
Figure 82 : Average time for writing 10 GB on
The same observation applies when writing 100 GB (Figure 83). The only difference is that
the KVM cluster with 8 nodes was unable to write the 100 GB dataset.
Figure 83: Average time for writing 10 GB on
Figure 84 : Average time for writing 100 GB on
11.4.3. TestDFSIO- Read Performance
As illustrated in Figures 84 and 85, reading small datasets (100MB and 1GB) showed that
the virtualized cluster is faster than the physical cluster. Yet, this applies to the KVM
cluster only when it is composed of 3-5 nodes. Afterwards, when the KVM cluster scaled to 6, 7
and 8 nodes, the performance of reading all datasets degraded. On the other hand, the physical
cluster performed better than VMware ESXi in all cases (100MB and 1GB on different numbers of
nodes).
Figure 85: Average time for reading 1 GB on HPC,
HVC with KVM and VMware ESXi
Figure 86 : Average time for reading 1 GB on
HPC, HVC with KVM and HVC VMware ESXi
When increasing the dataset size to 10 GB and 100GB (Figures 86 and 87), we can see
different performance trends. When the cluster is composed of 3-5 nodes, the KVM cluster kept
better performance than the other clusters. For instance, for 100 GB and 3 nodes, the KVM
cluster was faster than VMware ESXi by 12% and faster than the physical cluster by 44%.
However, as with the other benchmarks (TeraSort and TestDFSIO-Write), the KVM cluster showed a
sharp degradation in reading 100GB when the cluster scaled to 6, 7 and 8 nodes. When reading
10GB and 100 GB, in contrast to the TestDFSIO-Write results, the VMware ESXi cluster was
faster than the physical cluster in all scenarios (number of nodes). For instance, using a
cluster of 3 nodes, VMware ESXi was faster than the physical cluster by 36% and 55.5% in the
case of 7 and 8 nodes respectively.
An important observation is that the KVM cluster with 8 VMs was able neither to write nor to
read the 100GB dataset (Figure 87).
Figure 87: Average time for reading 10 GB on
HPhC, HVC with KVM and HVC VMware ESXi
Figure 88 : Average time for reading 100 GB on
HPhC, HVC with KVM and HVC VMware ESXi
Chapter 12: Discussion
The results of this research showed significant improvements when virtualizing HPC,
especially when it was tested with the TeraSort benchmark; in this case, we found that
both virtualized clusters (KVM and VMware ESXi) performed better than the physical
cluster.
12.1. TeraSort Performance
When running the TeraSort benchmark, the VMware ESXi cluster proved to sort large datasets
(1GB, 10 GB and 30 GB) quickly. For instance, sorting 30GB using a
cluster of 4 nodes showed that VMware ESXi is faster than KVM by 64% and faster than the
physical cluster by 84% (Figure 80). The KVM cluster also proved to be
faster than the physical cluster. However, when the number of nodes increases in the
virtualized clusters, the performance of TeraSort degrades significantly.
In the case of the KVM cluster, when the number of nodes increases to 6, 7 and 8, the overall
performance of running TeraSort became slower. This degradation is explained by system
overhead, especially disk overhead. The study in [92] analyzed KVM scalability on the
OpenStack platform and states that KVM is not recommended when many virtual hard disks will
be accessed at the same time. Therefore, since TeraSort has both computational and I/O jobs,
the KVM VMs affected the overall performance when they were scaled to 6, 7 and 8. Moreover,
another study [93] states that KVM has substantial problems with guests crashing once it
reaches a certain number of VMs (4 in that study [93]); hence, scalability is considered a
source of system overhead when using KVM virtualization.
In the case of the VMware ESXi cluster, the performance of running TeraSort declines when the
cluster is scaled to 8 nodes. As with KVM, the reason is system overhead.
However, this overhead is not related to a scalability issue, because VMware ESXi is
known to be scalable [94]. To identify the cause of the system overhead, we
tracked the performance of sorting the 30GB dataset on 8 VMware ESXi VMs (using VMware
vSphere Client), and we found that, at some point, the memory required to sort the dataset
exceeds the available memory offered by the cluster. This can be observed in Figure 89, which
illustrates that active memory (in red, memory currently consumed by VMs) is higher than the
granted memory (in grey, memory provided by the hosting hardware) between 5:05 and 5:10
PM. Further evidence of the system overhead is the latency rate; we tracked the latency of
running 30 GB on 8 VMs, and we observed that system latency
reaches its peak (Figure 90) when sorting this dataset. Thus, latency impacts the overall
performance when the number of VMs increases to 8. The last piece of evidence was reported by
the OpenStack Dashboard (Figure 91), which showed a warning state for resource usage after
creating 8 VMware ESXi instances. In short, VMware ESXi cluster performance declines at 8
VMs because of a resource shortage.
Figure 89: Memory overhead when running 30 GB (started at 4.55PM) on 8 VMware ESXi VMs
Figure 90 : System latency reaches its peak (at 12.28PM) when running 30 GB on 8 VMware ESXi
VMs
Figure 91: OpenStack warning statistics about system resource usage
In short, even if TeraSort performance decreases when the number of VMware ESXi VMs
increases to 8, the results we got still confirm that the Hadoop VMware ESXi cluster performs
better than the Hadoop KVM Cluster and the Hadoop Physical Cluster.
12.2. TestDFSIO Performance
The performance behavior of each cluster changed when running the TestDFSIO benchmark. For
all dataset sizes, the KVM cluster proved to have higher performance than the other clusters
when performing both TestDFSIO-Write and TestDFSIO-Read (Figures 81-87). On the other hand,
VMware ESXi showed the lowest performance when compared to the KVM and physical clusters.
The good results we got from running TestDFSIO on KVM are related to the virtio API, which
is integrated in the KVM hypervisor to provide an efficient
abstraction for I/O operations [95]. Virtio was studied in [96] and shown to enhance
KVM performance for I/O operations; the authors of that study [96] tested the performance of
KVM (with the virtio API) for I/O operations and compared it with VMware vSphere 5.1
performance. They concluded that KVM with the virtio API achieves I/O rates that are 49%
higher than VMware vSphere 5.1.
When running TestDFSIO, we observed again that the performance of both virtualized
clusters decreases as the number of VMs goes beyond 6 (KVM) and 7 (VMware ESXi).
12.3. Conclusion
In brief, the overall performance of TeraSort and TestDFSIO showed that, first, virtualization
has better performance than the physical cluster and, second, the selection of the underlying
virtualization technology can lead to significant improvements when performing HPCaaS.
In this research, VMware ESXi proved to have the best performance, especially
when running computational jobs (TeraSort).
To deal with the issue of system overhead in virtualized clusters, HPCaaS needs to be run in a
cloud environment that has a balanced number of VMs. For this research, the number of VMs
that provided high performance was 7 for the VMware ESXi cluster and 5 for the KVM
cluster.
Part IV: Conclusion
This part summarizes the research objectives and findings and suggests some related future
work. Bibliography of this report is listed after the conclusion, and finally, a set of appendices
(OpenStack Documentation, Hadoop Documentation, Benchmarks Execution and Data
Gathering) are provided at the end of this report.
Chapter 13
Conclusion and Future Work
This project aimed at demonstrating the impact of running HPCaaS on different virtualization
technologies, namely KVM and VMware ESXi.
To that end, we built three Hadoop clusters: Hadoop Physical Cluster, Hadoop
Virtualized Cluster with KVM and Hadoop Virtualized Cluster with VMware ESXi. For the
virtualized clusters, we proposed to build the Hadoop cluster on top of the OpenStack
platform. On each cluster, we ran two well-known benchmarks: TeraSort and TestDFSIO. Each
benchmark was tested on different dataset sizes and on different numbers of machines (from 3
to 8 machines). To ensure the credibility and reliability of the research, we performed three
tests on each scenario; for instance, we tested TeraSort for 30GB on each cluster three times,
and then we took the mean to limit the effect of outliers.
The findings of this research clearly demonstrate that virtualized clusters can perform much
better than the physical cluster when processing and handling HPC, especially when there is
less overhead on the virtualized cluster. We found that the Hadoop VMware ESXi cluster performs
better at sorting big datasets (more computations), and the Hadoop KVM cluster performs better
at I/O operations.
Finally, this report includes detailed installation guides of OpenStack and Hadoop that will
save time and facilitate the work for future students who want to work on related research.
As future work, this research can be extended in different directions. The
first proposed work is to conduct the experiments using real HPC applications that
can show precisely the impact of virtualization on HPCaaS. The second is to conduct this
research using other virtualization technologies such as Xen and
Hyper-V. The third is to study the impact of cloud platforms on improving
HPCaaS; for example, further research could examine whether replacing
OpenStack with another cloud infrastructure leads to better results. Finally, since we got
positive results about the impact of virtualization on HPCaaS, this research can be
investigated further by integrating its findings in other domains such as Smart Grid.
Bibliography
[1] J. Gantz and D. Reinsel, “The Digital Universe in 2020: Big Data, Bigger Digital
Shadows, and Biggest Growth in the Far East”, IDC IVIEW, pp. 1-16, 2012
[2] Gartner, Inc., “Hunting and Harvesting in a Digital World”, in Gartner CIO Agenda
Report, pp. 1-8, 2013
[3] Amazon Web Services, “High Performance Computing (HPC) on AWS”,
http://aws.amazon.com/hpc-applications/
[4] J. Gantz and D. Reinsel, “The Digital Universe Decade – Are You Ready?”, IDC IVIEW,
pp. 1-15, 2010
[5] C. Vecchiola, S. Pandey, R. Buyya, “High-Performance Cloud Computing: A View of
Scientific Applications”, in the 10th International Symposium on Pervasive Systems,
Algorithms and Networks I-SPAN, IEEE Computer Society, pp. 4-16, 2009
[6] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, in
OSDI, pp. 1-12, 2004
[7] Hadoop: http://hadoop.apache.org/
[8] S. Krishnan, M. Tatineni, and C. Baru, “myHadoop – Hadoop-on-Demand on Traditional
HPC Resources”, in the National Science Foundation’s Cluster Exploratory Program, pp.
1-7, 2011
[9] E. Molina-Estolano, M. Gokhale, C. Maltzahn, J. May, J. Bent, S. Brandt, “Mixing
Hadoop and HPC Workloads on Parallel Filesystems”, in the 4th Annual Workshop on
Petascale Data Storage, pp. 1-5, 2009
[10] C. Cranor, M. Polte, and G. Gibson, “HPC Computation on Hadoop Storage with
PLFS”, Parallel Data Laboratory at Carnegie Mellon University, pp. 1-9, 2012
[11] Y. Xiaotao, L. Aili, and Z. Lin, “Research of High Performance Computing with
Clouds”, in the Third International Symposium on Computer Science and Computational
Technology (ISCSCT), pp. 289-293, 2010
[12] KVM:http://www.linux-kvm.org/page/Main_Page
[13] VMware ESXi: http://www.vmware.com/
[14] D. Boulter, “Simplify Your Journey to the Cloud”, Capgemini and SOGETI, pp. 18, 2010.
[15] P. Mell and T. Grance, “The NIST Definition of Cloud Computing”, National Institute of
Standards and Technology, pp. 1-3, 2011
[16] A. E. Youssef, “Exploring Cloud Computing Services and Applications”, Journal of
Emerging Trends in Computing and Information Sciences, vol. 3, no. 6, pp. 838847, 2012
[17] T. Korri, “Cloud Computing: Utility Computing over the Internet”, Seminar on
94
Internetworking, pp. 1-5, 2009
[18] ISACA, “Cloud Computing: Business Benefits with Security, Governance and Assurance
Perspectives”, pp. 1-10, 2009
[19] A. T. Velte, T. J. Velte, R. C. Elsenpeter,”Cloud Computing, A practical approach”,1st
ed., USA : McGraw-Hills, 2009
[20] Amazon Web Services: http://aws.amazon.com/
[21] Google Cloud Platform: https://cloud.google.com/
[22] Microsoft Cloud Services: http://www.microsoft.com/enterprise/it- trends/cloudcomputing/default.aspx?Search=true#fbid=33S2kMNT99z
[23] Open Source Software for Building Private and Public Clouds:
http://www.openstack.org
[24] I. Menken, and G. Blokdijk, “Cloud Computing Virtualization Specialist Complete
Certification Kit - Study Guide Book and Online Course”, Emereo Pty Ltd, 2009
[25] M. Portnoy, Virtualization Essentials, John Wiley & Sons, 2012
[26] K. Scarfone, M. Souppaya, and P. Hoffman, “Guide to Security for Full Virtualization
Technologies”, National Institute of Standards and Technology, 2011
[27] D. Dale, “Server and Storage Virtualization with IP Storage”, Storage Networking
Industry Association (SNIA), 2008
[28] D. Marinescu and R. Kroger; “State of the Art in Autonomic Computing and
Virtualization”, Wiesbaden University of Applied Sciences, pp. 1-21,2007
[29] K. Koganti, E. Patnala2, S. Narasingu, J. Chaitanya,Virtualization Technology in Cloud
Computing Environment, in International Journal of Emerging Technology and Advanced
Engineering, vol. 3, no. 3, 2013
[30] N. Susanta and T. Chiueh, “A Survey on Virtualization Technologies”, Department of
Computer Science at Stony Brook, 2006
[31] Virtualization: A Key to Virtualization World: http://isa.unomaha.edu/wpcontent/uploads/2012/08/Virtualization.pdf
[32] “Virtualization Overview”, white paper, VMware, 2006
[33] N. Alam, “Survey on Hypervisors”, School of Informatics and Computing at Indiana
University, 2011
[34] C. D. Graziano, “A Performance Analysis of Xen and KVM Hypervisors for Hosting the
Xen Worlds Project”, Digital Repository at Iowa State University, pp. 12-39, 2011
[35] N. Yaqub, “Comparison of Virtualization Performance: VMWare and KVM”, Master
Thesis, pp. 30-44, 2012
[36] “How Does Xen Work?”, white paper, Xen, 2009
[37] O. Kulkarmi, N. Xinli, and P. K. Swamy, “Cutting-Edge Perspective of Security
Analysis for Xen Virtual Machines”, International Journal of Engineering Research and
95
Development, vo. 2, no. 3, pp. 40-45, 2012
[38] T. Hirt, “KVM – The Kernel-based Virtual Machine”, Red Hat Inc., 2010
[39] M. T. Jones, “Anatomy of a Linux Hypervisor”, IBM Corporation, 2009
[40] “VMware ESXi 5.0 Operations Guide”, white paper, VMware, 2011
[41] M. K. Kakhani, S. Kakhani, and S. R. Biradar, “Research Issues in Big Data Analytics,”
Vol. 2, No. 8, pp. 228–232, 2013
[42] C. Hagen, “Big Data and the Creative Destruction of Today’s”, ATKearney, 2012
[43] “Oracle : Big Data for the Enterprise”, white paper, Oracle Corp., 2013
[44] “Oracle NoSQL Database”, white paper, Oracle Corp., 2011
[45] S. Yu, “ACID Properties in Distributed Databases”, Advanced eBusiness Transactions
for B2B-Collaborations, 2009
[46] S. Gilbert and N. Lynch, “Brewer’s conjecture and the feasibility of consistent, available,
partition-tolerant web services,” ACM SIGACT News, vol. 33, no. 2, p. 51, 2002
[47] A. Lakshman, P. Malik, “Cassandra - A Decentralized Structured Storage System”, ACM
SIGOPS Operating Systems Review, vol. 44, no.2, pp. 35-40, 2010
[48] G. Lars., “Introduction,” in HBase: The Definitive Guide, USA: O'Reilly Media, 2011
[49] MongoDB: http://www.mongodb.org/
[50] Apache CouchDB™: http://couchdb.apache.org/
[51] J.Bernstein, K. McMahon, “Computing on Demand—HPC as a Service: High
Performance Computing for High Performance Business”, white paper, Penguin Computing
& McMahon Consulting.
[52] Y. Xiaotao, L. Aili, Z. Lin, “Research of High Performance Computing With Clouds,”
International Symposium Computer Science and Computational Technology, pp. 289–
293, 2010
[53] Self-service POD Portal: http://www.penguincomputing.com/services/hpccloud/pod
[54] Amazon Cloud Storage: http://aws.amazon.com/ec2/reserved-instances/
[55] Amazon Cloud Drive: http://aws.amazon.com/ec2/spot-instances/
[56] Microsoft High Performance Computing for Developers:
http://msdn.microsoft.com/en-us/library/ff976568.aspx
[57] Google Cloud Storage: https://cloud.google.com/products/compute-engine
[58] S. Zhou, B. Kobler, D. Duffy, and T. McGlynn, “Case Study for Running HPC
Applications in Public Clouds”, in Science Cloud '10, 2012
[59] K. R. Jackson, “Performance Analysis of High Performance Computing Applications on
the Amazon Web Services Cloud”, in Cloud Computing Technology and Science
96
(CloudCom), 2010 IEEE Second International Conference on, pp. 159-168, 2010
[60] E. Walker, “Benchmarking Amazon EC2 for High-Performance Scientific Computing”,
Texas Advanced Computing Center at the University of Texas, pp. 18-23, 2008
[61] J. Ekanayake and G. Fox, “High Performance Parallel Computing with Clouds and Cloud
Technologies”, School of Informatics and Computing at Indiana University, pp. 120, 2009.
[62] Y. Gu and R. L. Grossman, “Sector and Sphere: The Design and Implementation of a
High Performance Data Cloud”, National Science Foundation, pp. 1-11, 2008
[63] A. Gupta and D. Milojicic, “Evaluation of HPC Applications on Cloud”, HelwettPackard Development Company, pp. 1-6, 2011
[64] C. Evangelinos and C. N. Hill. “Cloud Computing for parallel Scientific HPC
Applications: Feasibility of running Coupled Atmosphere-Ocean Climate Models on
Amazon’s EC2”, Department of Earth, Atmospheric and Planetary Sciences at
Massachusetts Institute of Technology, pp. 1-6, 2009
[65] “Dryad and DryadLINQ for Data Intensive Research”:
http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx
[66] C. Fragni, M. Moreira, D. Mattos, L. Costa, and O. Duarte, “Evaluating Xen, VMware,
and OpenVZ Virtualization Platforms for Network Virtualization”, Federal University of
Rio de Janeiro, pp. 1-1, 2010
[67] N. Yaqub, “Comparison of Virtualization Performance: VMWare and KVM”, Master
Thesis, pp. 30-44, 2012
[68] T. Deshane, M. Ben-Yehuda, A. Shah, and B. Rao, “Quantitative Comparison of Xen
and KVM”, in Xen Summit, pp. 1-3, 2008
[69] J. Hwang, S. Wu, and T. Wood, “A Component-Based Performance Comparison of Four
Hypervisors”, George Washington University and IBM T.J. Watson Research Center, pp.
1-8, 2012
[70] A. J. Younge, R. Henschel, J. T. Brown, G. Laszewski, J. Qiu, and G. C. Fox, “Analysis
of Virtualization Technologies for High Performance Computing Environments”,
Pervasive Technology Institute, pp. 1-8, 2012
[71] Q. Jiang. “Open Source Iaas Community Analysis”, Eucalyptus Systems Inc., 2012
[72] I. Voras, M. Orlic, and B. Mihaljevié, “An Early Comparison of Commercial and OpenSpurce Cloud P¨latforms for Scientific Environments”, University of Zagreb Faculty of
Electrical Engineering and Computing, Zagreb, Croatia, 2012
[73] E. Caron, L. Toch, and J. Rouzaud-Cornabas, “Performance Comparison between
OpenStack and OpenNebula and the multi-Cloud Architecture: Application to
Cosmology”, Research Report N° 8421, 2013
[74] K. Kostantos, A. Kapsalis, D. Kyriazis, M. Themistocleous, and P. Cunha, “Open-Source
IAAS Fit for Purpose: A Comparison between Openbula and OpenStack”, International
Journal of Electronic Business Management, Vol. 11, No. 3, 2013
97
[75] O. Sefraoui, M. Aissaoui, and M. Eleuldj, “Comparison of Multiple IaaS Cloud Platform
Solutions”, Mohamed I University, 2012
[76] “Donnie Berkholz’s Story of Data3:
http://redmonk.com/dberkholz/2012/03/26/nosql-database-popularity-according-tojaspersoft/
[77] E. Dede, M. Govindaraju, D. Gunter, R. Canon, and L. Ramakrishnan, “Performance
Evaluation of a MongoDB and Hadoop Platform for Scientiﬁc Data Analysis”, SUNY
Binghamton and Lawrence Berekely National Lab, 2012
[78] J. H. Lee, “Log Analysis System Using Hadoop and MongoDB”, CUBRID, 2012.
[79] OpenStack: http://www.openstack.org/
[80] “OpenStack Training Guides”, white paper, OpenStack Foundation, 2013
[81] A. Sehgal, “Introduction to OpenStack: Running a Cloud Computing Infrastructure with
Openstack”, in the 6th International Conference on Autonomous Infrastructure,
Management and Security, University of Luxembourg, 2012
[82] K. Pepple, Deploying OpenStack, O'Reilly Median, 2011
[83] OpenStack, “Companies Supporting the OpenStack Foundation”,
http://www.openstack.org/foundation/companies/
[84] G. Sasiniveda and N. Revathi, “Data Analysis using Mapper and Reducer with Optimal
Configuration in Hadoop", International Journal of Computer Trends and Technology,
vol. no. 3, 2013
[85] D. Borthakur, “The Hadoop Distributed File System: Architecture and Design”, The
Apache Software Foundation, 2007
[86] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2010
[87] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File
System”, Sunnyvale, 2010
[88] H. Herodotu, “Hadoop Performance Models”, Computer Science Department at Duke
University, 2011
[89] Blogclub Tworkshops,”Hadoop and MapReduce”,
http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/
[90] M. G. Noll, “Benchmarking and Stress Testing and Hadoop Cluster with TeraSort, Test
DFSIO & Co.”, 2011
[91] Apache Hadoop, “TestDFSIO Apache Hadoop Code Source”,
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoopmapreduce- client-jobclient/0.23.9/org/apache/hadoop/fs/TestDFSIO.java
[92] F. Rahma*, T. Adji, Widyawan, “Scalability Analysis of KVM-Based Private Cloud For
IaaS”, in International Journal of Cloud Computing and Services Science, Vol.2, No.4,
ppt. 288-295, 2013
98
[93] T.Deshane, M. Yehuda, A. Shah, B. Rao, “Quantitative Comparison of Xen and KVM”, in
Journal of Physics: Conference, 2010
[94] “Virtualizing Resource intensive Applications”, white paper, VMware, 2009
[95] “Scale-up Virtualization with Red Hat Enterprise Linux 5.4 on an HP ProLiant DL785
G6”, white paper, Redhat, 2009
[96] “KVM Virtualized I/O Performance”, white paper, IBM & Redhat, 2013.
99
Appendix A: OpenStack with KVM Configuration
Pre-configuration
1. Update your machine
sudo apt-get update
sudo apt-get upgrade
2. Install bridge-utils
sudo apt-get install bridge-utils
3. NTP Server
3.1. Install the NTP Server
sudo apt-get install ntp
3.2. Open the file /etc/ntp.conf
Add the following lines to make sure that the time on the server stays in sync with an external
server.
server ntp.ubuntu.com
server 127.127.1.0
fudge 127.127.1.0 stratum 10
3.3. Restart NTP Service
sudo service ntp restart
4. Network Configuration
Because the public IP address changes periodically, you need to set a static IP address that
will be used in the OpenStack configuration. In this case, we have two network interfaces,
eth0 and eth1. Eth0 was chosen as the management network interface; as a result, it was set
to a static IP address (in this guide, we used 10.60.62.12 as the management IP).
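As an illustration of the static address step, the following sketch prints an /etc/network/interfaces stanza of the kind used on Ubuntu. The helper name static_iface_stanza is hypothetical, and the netmask and gateway in the usage line are placeholders, since the guide only states the address 10.60.62.12.

```shell
# static_iface_stanza IFACE ADDRESS NETMASK GATEWAY  (hypothetical helper)
# Print an /etc/network/interfaces stanza for a statically addressed interface.
# Nothing is written to disk; redirect the output yourself after reviewing it.
static_iface_stanza() {
    printf 'auto %s\n' "$1"
    printf 'iface %s inet static\n' "$1"
    printf '    address %s\n' "$2"
    printf '    netmask %s\n' "$3"
    printf '    gateway %s\n' "$4"
}
```

A call such as static_iface_stanza eth0 10.60.62.12 255.255.255.0 10.60.62.1 prints the stanza for review; the last two values are assumptions and must be replaced by those of your own network.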
Hypervisor Configuration
1. KVM Configuration
If you want to install OpenStack with the KVM hypervisor, follow these steps:
1.1. Check if your machine supports virtualization
ouidad@ouidad:~$ egrep -c '(vmx|svm)' /proc/cpuinfo
8
ouidad@ouidad:~$
If the output is 0, then your machine does not support virtualization; otherwise, if the output
is greater than 0, the machine supports virtualization technology.
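The check above can be wrapped in a small script so that later steps stop early on unsupported hardware. This is a minimal sketch; the function names count_virt_flags and supports_virt are hypothetical, and the file to inspect is passed as an argument so the logic can be tried on any cpuinfo-style file.

```shell
# count_virt_flags FILE: print the number of lines in FILE that mention vmx or svm.
count_virt_flags() {
    # grep -c prints 0 but exits non-zero when nothing matches; mask that exit status.
    grep -E -c '(vmx|svm)' "$1" || true
}

# supports_virt FILE: succeed when at least one matching line was found.
supports_virt() {
    [ "$(count_virt_flags "$1")" -gt 0 ]
}
```

For example, supports_virt /proc/cpuinfo || echo "no hardware virtualization support" reports the negative case described above.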
1.2. Check if KVM can be supported
ouidad@ouidad:~$ kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used
ouidad@ouidad:~$
If the output is as shown above, then your machine supports KVM virtualization.
1.3. Install KVM and libvirt
sudo apt-get install kvm libvirt-bin
1.4. KVM configuration
You can check the following website to configure the necessary files for KVM support:
https://help.ubuntu.com/community/KVM/Installation
1.5. Reboot your machine
OpenStack Databases Configuration
1. MySQL
1.1. Install MySQL server and related packages
sudo apt-get install mysql-server python-mysqldb
1.2. Create the root password for MySQL
The password used in this guide is "secret"
1.3. Open /etc/mysql/my.cnf
Change the bind address from bind-address = 127.0.0.1 to bind-address = 0.0.0.0
1.4. Restart MySQL server
sudo restart mysql
2. Nova Database
2.1. Create Nova database “nova”
sudo mysql -uroot -psecret -e 'CREATE DATABASE nova;'
2.2. Create a nova user named “novadbadmin”
sudo mysql -uroot -psecret -e 'CREATE USER novadbadmin;'
2.3. Grant all privileges for novadbadmin on the database "nova"
sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON nova.* TO 'novadbadmin'@'%';"
2.4. Create a password for the user "novadbadmin"; the password in this case is “novasecret”
sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'novadbadmin'@'%' = PASSWORD('novasecret');"
3. Glance Database
3.1. Create a glance database named “glance”
sudo mysql -uroot -psecret -e 'CREATE DATABASE glance;'
3.2. Create a user named “glancedbadmin”
sudo mysql -uroot -psecret -e 'CREATE USER glancedbadmin;'
3.3. Grant all privileges for glancedbadmin on the database "glance"
sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON glance.* TO 'glancedbadmin'@'%';"
3.4. Create a password for the user "glancedbadmin"
sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'glancedbadmin'@'%' = PASSWORD('glancesecret');"
4. Keystone Database
4.1. Create a database named “keystone”
sudo mysql -uroot -psecret -e 'CREATE DATABASE keystone;'
4.2. Create a user named “keystonedbadmin”
sudo mysql -uroot -psecret -e 'CREATE USER keystonedbadmin;'
4.3. Grant all privileges for keystonedbadmin on the database "keystone"
sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON keystone.* TO 'keystonedbadmin'@'%';"
4.4. Create a password for the user "keystonedbadmin"
sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'keystonedbadmin'@'%' = PASSWORD('keystonesecret');"
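The Nova, Glance and Keystone databases are all created with the same four statements, so the sequence can be generated from three parameters. The sketch below only prints the SQL; the helper name service_db_sql is hypothetical and is not part of MySQL or OpenStack.

```shell
# service_db_sql DB USER PASSWORD  (hypothetical helper)
# Print the four SQL statements used in this guide for one service database.
# The function does not contact MySQL; pipe its output to the mysql client to apply it.
service_db_sql() {
    db=$1 user=$2 pass=$3
    printf "CREATE DATABASE %s;\n" "$db"
    printf "CREATE USER %s;\n" "$user"
    printf "GRANT ALL PRIVILEGES ON %s.* TO '%s'@'%%';\n" "$db" "$user"
    printf "SET PASSWORD FOR '%s'@'%%' = PASSWORD('%s');\n" "$user" "$pass"
}
```

For example, service_db_sql glance glancedbadmin glancesecret | sudo mysql -uroot -psecret would apply the Glance statements in one call, assuming the root password "secret" chosen earlier.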
Keystone Configuration
1. Install Keystone
sudo apt-get install keystone python-keystone python-keystoneclient
2. Open /etc/keystone/keystone.conf
Make the following changes:
- Change admin_token = ADMIN to admin_token = admin
- Change connection = sqlite:////var/lib/keystone/keystone.db
to connection = mysql://keystonedbadmin:[email protected]/keystone
3. Restart keystone
sudo service keystone restart
4. Create keystone schema in MySQL database
sudo keystone-manage db_sync
5. Export environment variables
export SERVICE_ENDPOINT="http://localhost:35357/v2.0"
export SERVICE_TOKEN=admin
Note: you can also add these variables to ~/.bashrc to avoid exporting them each time.
6. Create tenants
Create admin and service tenants
keystone tenant-create --name admin
keystone tenant-create --name service
7. Create users
Create OpenStack users by executing the following commands. In this case, we are creating
four users - admin, nova, glance and swift
keystone user-create --name admin --pass admin --email [email protected]
keystone user-create --name nova --pass nova --email [email protected]
keystone user-create --name glance --pass glance --email [email protected]
keystone user-create --name swift --pass swift --email [email protected]
8. Create roles
Create the roles by executing the following commands. In this case, we are creating two roles
- admin and Member.
keystone role-create --name admin
keystone role-create --name Member
Sample output:
9. List tenants, users and roles
keystone tenant-list
keystone user-list
keystone role-list
Sample output:
10. Add roles to users in tenants
10.1. Add the role “admin” to the user “admin” of the tenant “admin”
keystone user-role-add --user 4e77ea930bf944efadfb79f5fc8789a2 \
  --role 8af19783ac784e0397e0346c7f1ec \
  --tenant_id ee14adbd1ac84445921819cf7a5b7f5f
10.2. Add the role “admin” to the user “nova” of the tenant “service”
keystone user-role-add --user 5ce6dd40bf2249e5ab35a95da63d7930 \
  --role 8af19783ac784e0397e0346c7f1ec \
  --tenant_id 11824c8169924b098f41dae1fa726c6
10.3. Add the role “admin” to the user “glance” of the tenant “service”
keystone user-role-add --user 9967843ee4aa421189f3382849700cad \
  --role 8af19783ac784e0397e0346c7f1ec \
  --tenant_id 11824c8169924b098f41dae1fa726c6
10.4. Add the role “admin” to the user “swift” of the tenant “service”
keystone user-role-add --user 24979d9ac31e4b83a58a89c1ad842ffa \
  --role 8af19783ac784e0397e0346c7f1ec \
  --tenant_id 11824c8169924b098f41dae1fa726c6
10.5. The “Member” role is used by Horizon and Swift, so add the “Member” role
accordingly.
(user: admin, role: Member, tenant: admin)
keystone user-role-add --user 4e77ea930bf944efadfb79f5fc8789a2 \
  --role c2860fd6f3fd4538a07161bdb2691f60 \
  --tenant_id ee14adbd1ac84445921819cf7a5b7f5f
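The ids passed to user-role-add are specific to each installation and have to be read from the tenant, user and role listings. The sketch below looks an id up by name instead of copying it by hand. It assumes the listing is an ASCII table with "|" separators whose first column is the id; the helper name id_by_name is hypothetical.

```shell
# id_by_name NAME  (hypothetical helper)
# Read an ASCII table on stdin and print the first column (assumed to be the id)
# of the first row that has another cell exactly equal to NAME.
id_by_name() {
    awk -F'|' -v name="$1" '
        {
            for (i = 2; i < NF; i++) {
                cell = $i
                gsub(/^[ \t]+|[ \t]+$/, "", cell)   # trim the padding around the cell
                if (i == 2) id = cell               # remember the id column of this row
                if (i > 2 && cell == name) { print id; exit }
            }
        }'
}
```

Under that layout assumption, step 10.1 could be written as keystone user-role-add --user "$(keystone user-list | id_by_name admin)" --role "$(keystone role-list | id_by_name admin)" --tenant_id "$(keystone tenant-list | id_by_name admin)".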
11. Create services
Create the required services which the users can authenticate with: nova-compute, nova-volume, glance, swift, keystone and ec2 are some of the services that we create.
11.1. Nova Compute Service
keystone service-create --name nova --type compute --description 'OpenStack Compute Service'
11.2. Volume Service
keystone service-create --name volume --type volume --description 'OpenStack Volume Service'
11.3. Image Service
keystone service-create --name glance --type image --description 'OpenStack Image Service'
11.4. Object Store Service
keystone service-create --name swift --type object_store --description 'OpenStack Storage Service'
11.5. Identity Service
keystone service-create --name keystone --type identity --description 'OpenStack Identity Service'
11.6. EC2 Service
keystone service-create --name ec2 --type ec2 --description 'EC2 Service'
12. List keystone services
keystone service-list
Sample output:
13. Create endpoints
Create endpoints for each of the services that have been created above (service id is displayed
using keystone service-list command).
13.1. Endpoint for the identity service
keystone endpoint-create --region RegionOne \
  --service_id 207bf81ddfe1481aa242148f246d091f \
  --publicurl http://localhost:5000/v2.0 \
  --internalurl http://localhost:5000/v2.0 \
  --adminurl http://localhost:35357/v2.0
13.2. Endpoint for the nova service
keystone endpoint-create --region RegionOne \
  --service_id 72b9d125eaa84aaf9c8ce734027eea21 \
  --publicurl 'http://localhost:8774/v2/%(tenant_id)s' \
  --internalurl 'http://localhost:8774/v2/%(tenant_id)s' \
  --adminurl 'http://localhost:8774/v2/%(tenant_id)s'
13.3. Endpoint for the image service
keystone endpoint-create --region RegionOne \
  --service_id 581f6a8e337642a0a39090ffe6947e2d \
  --publicurl 'http://localhost:9292/v1' \
  --internalurl 'http://localhost:9292/v1' \
  --adminurl 'http://localhost:9292/v1'
13.4. Endpoint for the EC2 compatibility service
keystone endpoint-create --region RegionOne \
  --service_id 4b1619d4f9f34cc9aaf473282c2340f0 \
  --publicurl http://localhost:8773/services/Cloud \
  --internalurl http://localhost:8773/services/Cloud \
  --adminurl http://localhost:8773/services/Admin
13.5. Endpoint for the volume service
keystone endpoint-create --region RegionOne \
  --service_id 6afe27a1768b403b9521418a87646ec4 \
  --publicurl 'http://localhost:8776/v1/%(tenant_id)s' \
  --internalurl 'http://localhost:8776/v1/%(tenant_id)s' \
  --adminurl 'http://localhost:8776/v1/%(tenant_id)s'
13.6. Endpoint for the object storage service
keystone endpoint-create --region RegionOne \
  --service_id 2ec242420a114671a4fe15e745b45d3f \
  --publicurl 'http://localhost:8888/v1/AUTH_%(tenant_id)s' \
  --adminurl 'http://localhost:8888/v1' \
  --internalurl 'http://localhost:8888/v1/AUTH_%(tenant_id)s'
Glance Configuration
1. Install Glance packages
sudo apt-get install glance glance-api glance-client glance-common glance-registry python-glance
2. Open /etc/glance/glance-api-paste.ini
Replace the following lines:
admin_tenant_name = %SERVICE_TENANT_NAME%
admin_user = %SERVICE_USER%
admin_password = %SERVICE_PASSWORD%
with:
admin_tenant_name = service
admin_user = glance
admin_password = glance
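The placeholder substitution in the paste file can also be scripted. The following is a minimal sketch using sed; the helper name fill_service_credentials is hypothetical. It writes the result to standard output rather than editing in place, and it assumes that none of the values contains a "/" or "&" character, which sed would treat specially.

```shell
# fill_service_credentials FILE TENANT USER PASSWORD  (hypothetical helper)
# Print FILE with the three %SERVICE_...% placeholders replaced by the given values.
fill_service_credentials() {
    file=$1 tenant=$2 user=$3 pass=$4
    sed -e "s/%SERVICE_TENANT_NAME%/$tenant/" \
        -e "s/%SERVICE_USER%/$user/" \
        -e "s/%SERVICE_PASSWORD%/$pass/" "$file"
}
```

For example, fill_service_credentials /etc/glance/glance-api-paste.ini service glance glance prints the edited file, which can be reviewed before it replaces the original.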
3. Now open /etc/glance/glance-registry-paste.ini
Change the admin_user and admin_password lines to:
admin_user = glance
admin_password = glance
4. Open the file /etc/glance/glance-registry.conf
Change the line which contains the option "sql_connection =" to this:
sql_connection = mysql://glancedbadmin:[email protected]/glance
Add the following lines at the end of the file to allow glance to use keystone for
authentication.
[paste_deploy]
flavor = keystone
5. Open /etc/glance/glance-api.conf
Add the following lines at the end of the file.
[paste_deploy]
flavor = keystone
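Because the same two lines are appended to both glance-registry.conf and glance-api.conf, re-running the step by hand can duplicate the section. The sketch below appends it only once; the helper name enable_keystone_flavor is hypothetical, and it deliberately leaves the file untouched if a [paste_deploy] section already exists, even one without a flavor line.

```shell
# enable_keystone_flavor FILE  (hypothetical helper)
# Append the [paste_deploy] section with flavor = keystone unless FILE already
# contains a [paste_deploy] header.
enable_keystone_flavor() {
    if ! grep -q '^\[paste_deploy\]' "$1"; then
        printf '[paste_deploy]\nflavor = keystone\n' >> "$1"
    fi
}
```

It would be called once per file, for example sudo sh -c with the function defined, or as root: enable_keystone_flavor /etc/glance/glance-api.conf.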
6. Create glance schema in MySQL database
sudo glance-manage version_control 0
sudo glance-manage db_sync
7. Restart glance-api and glance-registry
sudo restart glance-api
sudo restart glance-registry
8. Export the following environment variables.
export SERVICE_TOKEN=admin
export OS_TENANT_NAME=admin
export OS_USERNAME=admin
export OS_PASSWORD=admin
export OS_AUTH_URL="http://localhost:5000/v2.0/"
export SERVICE_ENDPOINT=http://localhost:35357/v2.0
Note: you can add these variables to ~/.bashrc.
9. Check if glance was successfully configured
glance index
On a fresh installation the above command displays nothing, because no images have been added yet; if it reports an error instead, check the troubleshooting section.
Nova Configuration
1. Install Nova packages
sudo apt-get install nova-api nova-cert nova-compute nova-compute-kvm nova-doc nova-network nova-objectstore nova-scheduler nova-volume rabbitmq-server novnc nova-consoleauth
2. Edit the /etc/nova/nova.conf file
--dhcpbridge_flagfile=/etc/nova/nova.conf
--dhcpbridge=/usr/bin/nova-dhcpbridge
--logdir=/var/log/nova
--state_path=/var/lib/nova
--lock_path=/run/lock/nova
--allow_admin_api=true
--use_deprecated_auth=false
--auth_strategy=keystone
--scheduler_driver=nova.scheduler.simple.SimpleScheduler
--s3_host=10.60.62.12
--ec2_host=10.60.62.12
--rabbit_host=10.60.62.12
--cc_host=10.60.62.12
--nova_url=http://10.60.62.12:8774/v1.1/
--routing_source_ip=10.60.62.12
--glance_api_servers=10.60.62.12:9292
--image_service=nova.image.glance.GlanceImageService
--iscsi_ip_prefix=192.168.4
--sql_connection=mysql://novadbadmin:[email protected]/nova
--ec2_url=http://10.60.62.12:8773/services/Cloud
--keystone_ec2_url=http://10.60.62.12:5000/v2.0/ec2tokens
--api_paste_config=/etc/nova/api-paste.ini
--libvirt_type=kvm
--libvirt_use_virtio_for_bridges=true
--start_guests_on_host_boot=true
--resume_guests_state_on_host_boot=true
--novnc_enabled=true
--novncproxy_base_url=http://10.60.62.12:6080/vnc_auto.html
--vncserver_proxyclient_address=10.60.62.12
--vncserver_listen=10.60.62.12
--network_manager=nova.network.manager.FlatDHCPManager
--public_interface=eth0
--flat_interface=eth0
--flat_network_bridge=br100
--network_size=32
--flat_injected=False
--force_dhcp_release
--iscsi_helper=tgtadm
--connection_type=libvirt
--root_help
Important Note: “10.60.62.12” has to be replaced by the static IP address of your own machine.
Moreover, you need to set the “libvirt_type” option to the hypervisor you are using.
3. Change the ownership of the /etc/nova folder and permissions for
/etc/nova/nova.conf
sudo chown -R nova:nova /etc/nova
sudo chmod 644 /etc/nova/nova.conf
4. Open /etc/nova/api-paste.ini
Change the admin_user and admin_password lines to:
admin_user = nova
admin_password = nova
5. Create nova schema in the MySQL database.
sudo nova-manage db sync
6. Provide a range of IPs to be associated to the instances.
sudo nova-manage network create private --fixed_range_v4=10.60.62.0/27 --bridge=br100 --bridge_interface=eth0 --network_size=32
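The value of --network_size follows from the prefix length of --fixed_range_v4: a /27 prefix leaves 32 - 27 = 5 host bits, and 2^5 = 32 addresses. The sketch below does this arithmetic for any prefix length; the helper name addresses_in_prefix is hypothetical.

```shell
# addresses_in_prefix PREFIX_LENGTH  (hypothetical helper)
# Print the number of IPv4 addresses contained in a /PREFIX_LENGTH block.
addresses_in_prefix() {
    echo $(( 1 << (32 - $1) ))
}
```

Calling addresses_in_prefix 27 gives the 32 used in this guide; a different range, for example a /24, would need --network_size adjusted to the value the function prints for 24.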
7. Export the following environment variables.
export OS_TENANT_NAME=admin
export OS_USERNAME=admin
export OS_PASSWORD=admin
export OS_AUTH_URL="http://localhost:5000/v2.0/"
Note: you can add the environment variables at the end of ~/.bashrc file.
8. Manage nova volumes
Create a Physical Volume:
sudo pvcreate /dev/sda3
Create a Volume Group named nova-volumes:
sudo vgcreate nova-volumes /dev/sda3
Note: to create a physical volume, you first need to create a primary partition (in this guide,
the partition is /dev/sda3).
9. Restart nova services
sudo service libvirt-bin restart
sudo service nova-network restart
sudo service nova-compute restart
sudo service nova-api restart
sudo service nova-objectstore restart
sudo service nova-scheduler restart
sudo service nova-volume restart
sudo service nova-consoleauth restart
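Since every service is restarted with the same command form, the list can be generated by a loop, which avoids leaving out the action on one of the lines. This is a sketch only: the helper name nova_restart_commands is hypothetical, and it prints the commands instead of running them so they can be inspected first.

```shell
# nova_restart_commands  (hypothetical helper)
# Print one "sudo service NAME restart" line per service restarted in this guide.
nova_restart_commands() {
    for svc in libvirt-bin nova-network nova-compute nova-api \
               nova-objectstore nova-scheduler nova-volume nova-consoleauth; do
        printf 'sudo service %s restart\n' "$svc"
    done
}
```

Piping the output to a shell, as in nova_restart_commands | sh, runs the restarts in the listed order.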
10. Check if nova services are running
sudo nova-manage service list
Sample output:
Note: if the state of a given service is not :-), then try to run the following commands in
separate terminals:
sudo /usr/bin/nova-compute
sudo /usr/bin/nova-network
…
OpenStack Dashboard
1. Install OpenStack Dashboard
sudo apt-get install openstack-dashboard
2. Restart apache service
sudo service apache2 restart
3. Open a browser and enter the IP address of your machine
If you followed this tutorial, then the possible logins are:
Username: admin     Password: admin
Username: nova      Password: nova
Username: glance    Password: glance
Username: swift     Password: swift
Figure 1: Dashboard authentication page
Image Configuration
In order to create an image, you can use the following links to download the needed
images:
http://smoser.brickies.net/ubuntu/ttylinux-uec/old/
http://uec-images.ubuntu.com/
 Example: Ubuntu Precise i386 Image
1. Download Ubuntu Precise Version (12.04 LTS)
Download the Ubuntu Precise archive (precise-server-cloudimg-i386.tar.gz) from http://uec-images.ubuntu.com/precise/current/, using the following command:
wget http://uec-images.ubuntu.com/precise/current/precise-server-cloudimg-i386.tar.gz
2. Extract the downloaded package
sudo tar xvzf precise-server-cloudimg-i386.tar.gz
The extracted files are:
- precise-server-cloudimg-i386-vmlinuz-virtual
- precise-server-cloudimg-i386-loader
- precise-server-cloudimg-i386.img
3. Add the Ubuntu image into glance database
3.1. Add the kernel file
glance add name="precise32-kernel" disk_format=aki container_format=aki < precise-server-cloudimg-i386-vmlinuz-virtual
3.2. Add the loader file
glance add name="precise32-ramdisk" disk_format=ari container_format=ari < precise-server-cloudimg-i386-loader
3.3. Add the image file
Get the id of both the kernel and loader using: glance index
glance index
Sample output:
In this case, the id of Ubuntu kernel is 8386c173-cd90-4c7d-8540-da484abd0c1a and the id
of Ubuntu loader is 5e0f8ceb-8fcd-4fc7-9b2b-1bcd3e3d8c9d.
Now, add the image file using the kernel and loader ids:
glance add name="Precise32_image" disk_format=ami container_format=ami \
  kernel_id=8386c173-cd90-4c7d-8540-da484abd0c1a \
  ramdisk_id=5e0f8ceb-8fcd-4fc7-9b2b-1bcd3e3d8c9d < precise-server-cloudimg-i386.img
4. In Horizon, you can find the uploaded image (Precise32_image)
Figure 2: List of OpenStack images
Keypair Configuration
1. Generate a key for your local machine
If you have not yet generated a key for your local machine, run the following command:
ssh-keygen -t rsa -P ""
2. Create keypair
The following command can be used to either generate a new keypair, or to upload an existing
public key.
cd .ssh
nova keypair-add --pub_key id_rsa.pub mykey
nova keypair-list
3. List keypairs
nova keypair-list
Sample output:
4. Check the created keypair
Confirm that the uploaded keypair matches the local key by checking your key's fingerprint
with the ssh-keygen command:
ssh-keygen -l -f ~/.ssh/id_rsa.pub
Sample output:
Note: You can use OpenStack Dashboard to perform all operations related to keypair
generation.
Security Groups Configuration
1. List default security groups
nova secgroup-list
Sample output:
2. Enable access to TCP port 22
Allow access to port 22 from all IP addresses (specified in CIDR notation as 0.0.0.0/0) with
the following command:
nova secgroup-add-rule default tcp 22 22 0.0.0.0/0
Sample output:
3. Enable pinging to virtual machine instance by allowing ICMP traffic
nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0
Sample output:
Flavors Configuration
1. Flavor overview
Flavors are used to specify the properties of an instance. The following table illustrates the
needed arguments to define a flavor.
Column: Description
ID: A unique numeric id.
Name: A descriptive name. The xx.size_name form is conventional but not required, though
some third-party tools may rely on it.
Memory_MB: Virtual machine memory in megabytes.
Disk: Virtual root disk size in gigabytes. This is an ephemeral disk the base image is copied
into. When booting from a persistent volume it is not used. The "0" size is a special case
which uses the native base image size as the size of the ephemeral root volume.
Ephemeral: Specifies the size of a secondary ephemeral data disk. This is an empty,
unformatted disk and exists only for the life of the instance.
Swap: Optional swap space allocation for the instance.
VCPUs: Number of virtual CPUs presented to the instance.
RXTX_Factor: Optional property that allows created servers to have a different bandwidth cap
than that defined in the network they are attached to. This factor is multiplied by the
rxtx_base property of the network. Default value is 1.0 (that is, the same as the attached
network).
Is_Public: Boolean value, whether the flavor is available to all users or private to the
tenant it was created in. Defaults to True.
extra_specs: Additional optional restrictions on which compute nodes the flavor can run on.
This is implemented as key/value pairs that must match against the corresponding key/value
pairs on compute nodes. Can be used to implement things like special resources (such as
flavors that can only run on compute nodes with GPU hardware).
Table 1: Flavor arguments
2. List available flavors
Use nova flavor-list command to view the list of available flavors:
nova flavor-list
3. Create a flavor
Create a flavor with the following suggested specifications:
sudo nova-manage instance_type create --name=m1.cluster --memory=975 --cpu=2 --root_gb=100 --ephemeral_gb=10 --flavor=8
Instances Management
Instances can be created either by using the dashboard interface or using command line.
1. Create instances with no specifications
nova boot --flavor ID --image Image-ID MyInstanceName
2. Create an instance with an associated keypair
To associate a key with an instance on boot add --key_name Mykey to your command line:
nova boot --image Image-ID --flavor ID --key_name Mykey MyInstanceName
3. Create an instance with a security group
It is also possible to add and remove security groups when an instance is running.
nova add-secgroup MyInstanceName MysecurityGroup
nova remove-secgroup MyInstanceName MysecurityGroup
4. Create an instance with a given keypair and security group
nova boot --flavor ID --image Image-ID --key_name Mykey --security_groups MysecurityGroup MyInstanceName
5. Display instance details
nova show MyInstanceName
6. Access an instance
You can connect to an instance console via VNC. The latter can be accessed either by the
Horizon interface, command line or other tools such as virt-manager.

Using command line
nova get-vnc-console host_name novnc
Sample output:
The link displayed above can be used to access the instance console.

Using virt-manager
If you cannot connect to the VNC console, you can use virt-manager; in this case, use the
following command to install the virt-manager package:
sudo apt-get install virt-manager
To open the virt-manager interface, run the following command:
sudo virt-manager

Using local machine terminal
If the instance you created asks for a login name and password, you can access it from your
local machine over SSH. To do so, first copy your public key to the instance:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub username@Instance_ip_address
For Ubuntu, the user name is root or ubuntu.
Example: if you want to access an Ubuntu instance with IP address 10.60.62.8, you can
run the following commands:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub [email protected]
ssh [email protected]
Sample output:
7. Connecting Instances
The following steps connect two OpenStack instances (assumption: the instance with
hostname host1 needs to connect to another instance with hostname host2):
- Generate a keypair on host1 and host2 to run ssh (ssh-keygen -t rsa)
- On host2:
  o Check the sshd_config file on that instance (it is located at /etc/ssh/sshd_config)
  o Uncomment the following two lines in sshd_config:
    RSAAuthentication yes
    PubkeyAuthentication yes
  o Append the contents of the id_rsa.pub file of host1 to the authorized_keys file of host2
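The last step of the list above (appending host1's public key on host2) can be sketched as a small shell helper. This is a minimal sketch, not the procedure used in the thesis: the function name authorize_key and the path in the usage note are assumptions made for illustration. The helper appends the key only if that exact line is not already present, so it can be run more than once.

```shell
# authorize_key PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Appends the public key to the authorized_keys file unless that exact
# line is already there (hypothetical helper name, for illustration only).
authorize_key() {
    pubkey_file=$1
    auth_file=$2
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"
    # -x: whole-line match, -F: fixed strings, -f: read the pattern from the key file
    if ! grep -qxFf "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
}
```

On host2, assuming host1's id_rsa.pub was first copied to /tmp/host1_id_rsa.pub (an illustrative path), the call would be: authorize_key /tmp/host1_id_rsa.pub "$HOME/.ssh/authorized_keys"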
8. Delete an instance
nova delete MyInstanceName
OpenStack Troubleshooting
Glance Exceptions
1. Exception 1: “glance index “ error
ouidad@ouidad:~$ glance index
Failed to show index. Got error:
There was an error connecting to a server
Details: [Errno 111] Connection refused
 Solution
In most cases, the above exception occurs because the glance-api service is not running.
Therefore, you need to check why glance-api is not running.
In this case, the output reported an error in glance-api-paste.ini, so you need to open that
file and fix the error.
ouidad@ouidad:~$ sudo gedit /etc/glance/glance-api-paste.ini
After fixing the error, you need to restart the glance-api service:
ouidad@ouidad:~$ sudo /usr/bin/glance-api
Nova Exceptions
1. Exception 1: nova services not running (“sudo nova-manage service list”)
When running “sudo nova-manage service list”, if a service is in the “XXX” state, you
need to start that service in a separate terminal.
 Solution
For example, if nova-compute is in the “XXX” state, you need to run the following command:
sudo /usr/bin/nova-compute
The same solution can be applied for other services:
sudo /usr/bin/nova-network
sudo /usr/bin/nova-scheduler
sudo /usr/bin/nova-consoleauth
sudo /usr/bin/nova-cert
sudo /usr/bin/nova-volume
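To see at a glance which services need to be started, the service list can be filtered for the failed state. The following is a minimal sketch, assuming the output columns are Binary, Host, Zone, Status, State and Updated_At, and that a dead service is reported as XXX in the State column. The helper name down_services and the sample rows (zone, timestamps) are illustrative, not taken from the thesis.

```shell
# down_services: reads `nova-manage service list` output on stdin and prints
# the binary name of every service whose State column is "XXX".
# Assumes the column order Binary, Host, Zone, Status, State, Updated_At.
down_services() {
    awk 'NR > 1 && $5 == "XXX" { print $1 }'
}

# Illustrative sample input (hypothetical zone and timestamps).
sample='Binary           Host    Zone  Status   State  Updated_At
nova-compute     ouidad  nova  enabled  XXX    2013-09-02 19:40:11
nova-network     ouidad  nova  enabled  :-)    2013-09-02 19:46:20
nova-scheduler   ouidad  nova  enabled  :-)    2013-09-02 19:46:21'

printf '%s\n' "$sample" | down_services
```

On a live node the same filter would be fed from the real command: sudo nova-manage service list | down_services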
2. Exception 2: “sudo nova-manage service list” doesn’t display the expected output
ouidad@ouidad:~$ sudo nova-manage service list
Command failed, please check log for more info
2013-09-02 19:46:28.050 15999 CRITICAL nova [-] No module named
quantumclient.common
 Solution
ouidad@ouidad:~$ sudo apt-get install python-quantumclient
3. Exception 3: Unable to start nova compute “libvirtError: operation failed: domain
'instance-…..‘ already exists with uuid …”
Sample output:
 Solution
You need to login to nova database and delete the instance id from instances table. Moreover,
you need to delete the instance id from related tables such as
security_group_instance_association and instance_info_caches.
Example: we want to delete an instance with id=3
From the tables displayed above, delete the instance id = 3 from
security_group_instance_association and instance_info_caches, as well as from the
virtual_interfaces table.
Dashboard Exceptions
1. Exception 1: “Unable to retrieve images/instances…”
Sample output
 Solution
If you get one of these exceptions, the only fix that worked for me was to drop the
endpoints and re-create them, and then to reboot the local machine.
References for Appendix A



http://docs.openstack.org/folsom/openstack-ops/content/flavors.html
http://www.hastexo.com/resources/docs/installing-openstack-essex-20121-ubuntu-1204-precise-pangolin
http://docs.openstack.org/essex/openstack-compute/starter/content/Introduction_to_OpenStack_and_its_components-d1e59.html
Appendix B: OpenStack with VMware ESXi Configuration
1. Downloading VMware ESXi
Download VMware ESXi (vSphere 5.5) from:
https://my.vmware.com/web/vmware/evalcenter?p=vsphere-55
2. Installing VMware ESXi
After burning VMware ESXi software into a CD, install it on top of your hardware.
3. Download vSphere Client
To manage your VMware ESXi host:

Install vSphere Client in another machine with Windows OS.

After opening the software, login to VMware ESXi machine with your username and
password.

After login, you will get access to VMware ESXi machine resources. In our case,
VMware ESXi machine has an IP address of 10.50.1.166 (Figure 1)
Figure 1: vSphere Client interface: access to VMware ESXi 10.50.1.166
4. Create “Openstack VM”
 Create a virtual machine on top of VMware ESXi using vSphere Client. The VM will
be used to host OpenStack.

Create the VM with Ubuntu 12.04 LTS (Precise) 64-bit as the guest operating system.
5. Download VMware vSphere Web Services SDK

Download the appropriate SDK from: http://www.vmware.com/support/developer/vc-sdk/

Copy the SDK to the /openstack/vmware directory.

Make sure that the WSDL is available by checking that this path exists:
/openstack/vmware/SDK/wsdl/vim25/vimService.wsdl

This path will be specified in nova.conf.
6. Configure OpenStack on “VMware ESXi”

You need to follow the same steps provided in the OpenStack-KVM documentation.

The main difference here is the nova.conf configuration.
7. Nova.conf Configuration
In this case, you need to specify the compute_driver, host_ip (the VMware ESXi machine),
host_username, host_password and wsdl_location (for the SDK) as follows:
[DEFAULT]
compute_driver = vmwareapi.VMwareESXDriver
[vmware]
host_ip = 10.50.1.166
host_username = root
host_password = 12357890
wsdl_location = file:///openstack/vmware/SDK/wsdl/vim25/vimService.wsdl
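Before restarting the compute service, it can help to confirm that the WSDL file named in nova.conf is actually readable. The sketch below is only an illustration under stated assumptions: the helper names wsdl_path_from_url and check_wsdl are hypothetical, and the URL is the file:// value used in this appendix with the stray space removed.

```shell
# wsdl_path_from_url URL
# Strips the leading "file://" scheme from a WSDL location value and prints
# the local path (hypothetical helper, for illustration only).
wsdl_path_from_url() {
    printf '%s\n' "${1#file://}"
}

# check_wsdl URL: succeeds only if the file behind the URL is readable.
check_wsdl() {
    [ -r "$(wsdl_path_from_url "$1")" ]
}

wsdl_path_from_url "file:///openstack/vmware/SDK/wsdl/vim25/vimService.wsdl"
```

A typical use would be: check_wsdl "file:///openstack/vmware/SDK/wsdl/vim25/vimService.wsdl" || echo "WSDL not found"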
8. Dashboard access
Access OpenStack resources from the Horizon using the IP address of “Openstack VM”.
9. Make sure that OpenStack is installed with VMware ESXi
This can be checked from the Horizon interface.
Example:
Figure 2: OpenStack with VMware ESXi hypervisor
10. Manage OpenStack with VMware ESXi

After configuring OpenStack, you can now download images and create instances.
Each time you create an instance, it will be displayed in vSphere Client interface as
depicted in Figure 1.

Concerning images, you need to add images with vmdk extension. You can find them
in the following website (you can download them from the free images section):
http://stacklet.com
Figure 3: access to VMs (OpenStack instances) through vSphere Client interface
References



http://docs.openstack.org/trunk/config-reference/content/vmware.html
https://my.vmware.com/web/vmware/evalcenter?p=vsphere-55
https://www.vmware.com/support/developer/vc-sdk/
Appendix C: Hadoop Configuration
Prerequisites for Installing Hadoop
1. Adding a dedicated Hadoop system user (all machines)
Create a Hadoop user account (hduser) for running Hadoop using the following commands:
ouidad@host1:~$ sudo addgroup hadoop
ouidad@host1:~$ sudo adduser --ingroup hadoop hduser
2. Configuring SSH
2.1. Hadoop requires SSH access to manage the cluster nodes. You therefore need to
generate an SSH key for the hduser user.
ouidad@host1:~$ su hduser
Password:
hduser@host1:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
44:f5:7b:85:32:f7:69:c7:d7:fc:75:38:63:32:be:d7 hduser@host1
The key's randomart image is:
+--[ RSA 2048]----+
|       ...       |
|            . . .|
|          . + o .|
|           . = *o|
|          S + *oX|
|           . =.o*|
|             . ..|
|             .. E|
|              .. |
+-----------------+
2.2. The RSA key pair above was created with an empty passphrase so that Hadoop can
interact with its nodes without being prompted. Next, enable SSH access to your local
machine with this newly created key:
hduser@host1:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
3. Install JAVA
3.1. Download jdk-6u45-linux-i586.bin (for 32-bit architecture) from:
http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html
3.2. JDK installation
chmod +x jdk-6u45-linux-i586.bin
sudo ./jdk-6u45-linux-i586.bin
3.3. Make sure that the JDK is installed
ouidad@host1:~$ java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)
3.4. Copy the JDK folder from its current location to /home/hduser
ouidad@host1:~$ sudo cp -r ~/Downloads/jdk1.6.0_45 /home/hduser
3.5. Change the JDK ownership
ouidad@host1:~$ sudo chown -R hduser:hadoop /home/hduser/jdk1.6.0_45/
Installing Hadoop
1. Download Hadoop version 1.2.1 (hadoop-1.2.1.tar.gz) from
http://www.apache.org/dyn/closer.cgi/hadoop/core
2. Extract the downloaded version
ouidad@host1:~/Downloads$ tar -zxvf hadoop-1.2.1.tar.gz
3. Move the extracted folder (hadoop-1.2.1) from Downloads folder to /home/hduser
ouidad@host1:~/Downloads$ sudo cp hadoop-1.2.1 /home/hduser/ -r
4. Change the ownership
ouidad@host1:~/Downloads$ sudo chown -R hduser:hadoop /home/hduser/hadoop-1.2.1
5. Bashrc file configuration (All machines)
Log in to the hduser account first, then run the following command:
hduser@host1:~$ sudo gedit ~/.bashrc
At the end of the file, add the following lines:
export JAVA_HOME=~/jdk1.6.0_45
export PATH=$JAVA_HOME/bin:$PATH
6. Hdfs folder creation (All machines)
You need first to login to the hduser account, then create the following folder:
hduser@host1:~$ sudo mkdir -p /home/hduser/hdfs/temp
hduser@host1:~$ sudo chown hduser:hadoop /home/hduser/hdfs/temp
hduser@host1:~$ sudo chmod 777 /home/hduser/hdfs/temp/
7. Hadoop Files Configuration (Slave machines)
Move to the ~/hadoop-1.2.1/conf folder to edit the following files:
7.1. hadoop-env.sh File
hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit hadoop-env.sh
Replace the following two lines:
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
by (uncomment the second line):
# The java implementation to use. Required.
export JAVA_HOME=~/jdk1.6.0_45
Then, add at the end of the file:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
7.2. core-site.xml File
hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit core-site.xml
Add the following lines between the <configuration> tags:
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/hdfs/temp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>dfs.name.dir</name>
</property>
7.3. mapred-site.xml File
hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
7.4. hdfs-site.xml File
hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
Note: The value 3 is the number of replicas kept for each block. If you have a cluster of 3-10
nodes, set the replication factor to 3.
8. Hadoop Files Configuration (Master)
8.1. core-site.xml File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hduser/hdfs/temp</value>
</property>
8.2. mapred-site.xml File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
8.3. hdfs-site.xml File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
8.4. slaves File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit slaves
Comment out localhost and add the host names of your slaves. You can make the master node
act as both master and slave by adding the master host name to the slaves file.
master
host1
host2
...
8.5. masters File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit masters
Comment out localhost and add the host name of your master node.
master
Connecting Nodes
1. IP address configuration (All machines)
1.1. Find out the IP address of each machine
hduser@host1:~$ ifconfig
eth0      Link encap:Ethernet  HWaddr 00:23:ae:b0:89:ae
          inet addr:10.50.0.170  Bcast:10.50.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:198693 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9134 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:30871002 (30.8 MB)  TX bytes:1334436 (1.3 MB)
          Interrupt:21 Memory:fe6e0000-fe700000
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:58 errors:0 dropped:0 overruns:0 frame:0
          TX packets:58 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:9306 (9.3 KB)  TX bytes:9306 (9.3 KB)
1.2. Find out the host name of each machine
hduser@host1:~$ sudo gedit /etc/hostname
1.3. Open the hosts file (on each machine)
hduser@host1:~$ sudo gedit /etc/hosts
Replace the content of the file with the IP addresses and host names of all machines in the cluster.
10.50.0.197 master
10.50.0.94 slave
...
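A quick way to confirm that every node listed in the slaves and masters files has an entry in the hosts file is sketched below. It is a minimal sketch under stated assumptions: the helper name missing_hosts is hypothetical, and it only inspects the file itself (one IP address followed by one or more names per line), not DNS.

```shell
# missing_hosts HOSTS_FILE NAME...
# Prints every NAME that has no entry in HOSTS_FILE (hypothetical helper).
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Skip comment lines; compare the name against every field after the IP.
        if ! awk -v n="$name" '
                /^[[:space:]]*#/ { next }
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit !found }' "$hosts_file"; then
            printf '%s\n' "$name"
        fi
    done
}
```

For example, missing_hosts /etc/hosts master host1 host2 prints nothing when all three names are present, and otherwise prints the names that still need an entry.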
2. Connect the master hduser with the hduser on slaves
Example: For machine with hostname host1
hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@host1
Example: For machine with hostname host2
hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@host2
3. Test the connection between each slave and master machine
hduser@master:~$ ssh host1
Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.5.0-23-generic i686)
* Documentation: https://help.ubuntu.com/
System information as of Sun Jun 30 19:44:28 WEST 2013
System load:  0.08               Processes:           159
Usage of /:   77.7% of 228.23GB  Users logged in:     2
Memory usage: 35%                IP address for eth0: 10.50.0.170
Swap usage:   0%
=> There is 1 zombie process.
Graph this data and manage this system at https://landscape.canonical.com/
97 packages can be updated.
66 updates are security updates.
Last login: Sun Jun 30 18:39:15 2013 from ip6-localhost
If the connection is set up, close it before continuing your installation:
hduser@host1:~$ exit
logout
Connection to host1 closed.
Formatting the HDFS & Starting Multi-node Cluster
1. Format the HDFS filesystem via the NameNode
hduser@master:~/hadoop-1.2.1$ bin/hadoop namenode -format
Here is the output:
13/06/30 20:00:42 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = master/10.50.0.197
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
13/06/30 20:00:42 INFO util.GSet: VM type = 32-bit
13/06/30 20:00:42 INFO util.GSet: 2% max memory = 19.33375 MB
13/06/30 20:00:42 INFO util.GSet: capacity = 2^22 = 4194304 entries
13/06/30 20:00:42 INFO util.GSet: recommended=4194304, actual=4194304
13/06/30 20:00:42 INFO namenode.FSNamesystem: fsOwner=hduser
13/06/30 20:00:42 INFO namenode.FSNamesystem: supergroup=supergroup
13/06/30 20:00:42 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/06/30 20:00:42 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/06/30 20:00:42 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/06/30 20:00:42 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/06/30 20:00:42 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/06/30 20:00:42 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/home/hduser/hdfs/temp/dfs/name/current/edits
13/06/30 20:00:42 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/home/hduser/hdfs/temp/dfs/name/current/edits
13/06/30 20:00:43 INFO common.Storage: Storage directory /home/hduser/hdfs/temp/dfs/name has been successfully formatted.
13/06/30 20:00:43 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/10.50.0.197
************************************************************/
2. Start the multi-node cluster
hduser@master:~/hadoop-1.2.1$ bin/start-all.sh
Alternatively, start the DFS and the Hadoop Map/Reduce daemons separately:
hduser@master:~/hadoop-1.2.1$ bin/start-dfs.sh
hduser@master:~/hadoop-1.2.1$ bin/start-mapred.sh
3. On the master machine, check that the following Java processes are running:
hduser@master:~$ jps
5721 SecondaryNameNode
6738 DataNode
5243 NameNode
6047 TaskTracker
8423 Jps
5805 JobTracker
4. On slave machines, check that the following Java processes are running:
hduser@host1:~$ jps
1902 DataNode
4002 Jps
2108 TaskTracker
If you get the following output:
hduser@host1:~/hadoop-1.2.1/conf$ jps
The program 'jps' can be found in the following packages:
* openjdk-6-jdk
* openjdk-7-jdk
Ask your administrator to install one of them
Then install one of the suggested packages:
hduser@host1:~/hadoop-1.2.1/conf$ sudo apt-get install openjdk-7-jdk
Note: if you didn’t get the same services, follow the suggestion provided for exception 2.
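The jps checks in steps 3 and 4 can also be scripted. The sketch below is illustrative only: the helper name missing_daemons is an assumption, and the sample is the slave listing shown in step 4. It compares the second jps column exactly, so that NameNode is not mistaken for SecondaryNameNode.

```shell
# missing_daemons "JPS_OUTPUT" NAME...
# Prints every expected daemon NAME that is absent from the jps output.
# The second jps column is compared exactly, so "NameNode" is not
# satisfied by "SecondaryNameNode" (hypothetical helper).
missing_daemons() {
    jps_output=$1
    shift
    for daemon in "$@"; do
        if ! printf '%s\n' "$jps_output" |
                awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            printf '%s\n' "$daemon"
        fi
    done
}

# Sample taken from the slave listing in step 4: nothing is reported missing.
slave_jps='1902 DataNode
4002 Jps
2108 TaskTracker'
missing_daemons "$slave_jps" DataNode TaskTracker
```

On a live master node, the call would be: missing_daemons "$(jps)" NameNode SecondaryNameNode JobTracker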
Hadoop Troubleshooting
1. Formatting the Namenode Exception: “Cannot lock storage…”
13/06/30 19:57:35 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = master/10.50.0.197
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782;
compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
…
….
13/06/30 19:57:38 ERROR namenode.NameNode: java.io.IOException: Cannot lock storage
/home/hduser/hdfs/temp/dfs/name. The directory is already locked.
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:599)
at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1327)
at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1345)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1207)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1398)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1419)
13/06/30 19:57:38 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/10.50.0.197
************************************************************/
 Solution
Step 1: Stop all processes
hduser@master:~/hadoop-1.2.1$ bin/stop-all.sh
Step 2: Move to the ~/hdfs/temp folder and delete its content with the following command
hduser@master:~/hdfs/temp$ sudo rm -rf *
Step 3 : Restart your work by formatting the namenode
2. Formatting the Namenode Exception: “Cannot create directory
/home/hduser/hdfs…”
 Solution
In this case, make sure that you set the ownership and permissions of the
/home/hduser/hdfs/temp folder when you created it (chown hduser:hadoop and chmod 777, as
in step 6 of Installing Hadoop).
3. Exception in log file: hadoop-hduser-datanode-host1.log or when Hadoop DataNode
doesn’t show up in slave nodes
hduser@host1:~/hadoop-1.2.1/logs$ sudo gedit hadoop-hduser-datanode-host1.log
2013-06-30 19:01:09,078 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException:
Incompatible namespaceIDs in /home/hduser/hdfs/temp/dfs/data: namenode namespaceID = 1345454277;
datanode namespaceID = 1875045188
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:232)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:147)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:399)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:309)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1651)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1590)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1608)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1734)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1751)
 Solution 1
1. From master machine, open VERSION file under /hdfs/temp/dfs/name/current folder:
hduser@master:~/hdfs/temp/dfs/name/current$ sudo gedit VERSION
Here is the content of VERSION file:
#Sun Jun 30 20:00:43 WEST 2013
namespaceID=1289101159
cTime=0
storageType=NAME_NODE
layoutVersion=-32
Note the value of the namespaceID variable (1289101159 in this case); you will need it in
the next step.
2. On every slave machine where you found the above exception, open the VERSION file
under the /hdfs/temp/dfs/data/current folder:
hduser@host1:~/hdfs/temp/dfs/data/current$ sudo gedit VERSION
Here is the content of VERSION file:
#Fri Jun 14 09:22:08 WET 2013
storageID=DS-1900366223-127.0.1.1-50010-1371201728420
cTime=0
storageType=DATA_NODE
layoutVersion=-32
Replace the value of the namespaceID variable with the value you found in the VERSION file
of the master.
The content of the VERSION file under the /hdfs/temp/dfs/data/current folder is then:
#Fri Jun 14 09:22:08 WET 2013
storageID=DS-1900366223-127.0.1.1-50010-1371201728420
cTime=0
storageType=DATA_NODE
layoutVersion=-32
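The manual edit described in Solution 1 can be sketched with two small shell helpers. This is a minimal sketch, not the thesis's procedure: the function names namespace_id and set_namespace_id are assumptions, the paths in the usage note follow this tutorial's layout, and it assumes the DataNode is stopped while its VERSION file is rewritten.

```shell
# namespace_id VERSION_FILE: prints the value of the namespaceID line.
namespace_id() {
    sed -n 's/^namespaceID=//p' "$1"
}

# set_namespace_id VERSION_FILE NEW_ID
# Rewrites the existing namespaceID line; all other lines are kept as they are.
# The result is written back with cat so the file keeps its owner and mode.
set_namespace_id() {
    version_file=$1
    new_id=$2
    tmp_file=$(mktemp)
    sed "s/^namespaceID=.*/namespaceID=$new_id/" "$version_file" > "$tmp_file"
    cat "$tmp_file" > "$version_file"
    rm -f "$tmp_file"
}
```

On the master, namespace_id ~/hdfs/temp/dfs/name/current/VERSION prints the ID to copy; on each affected slave, set_namespace_id ~/hdfs/temp/dfs/data/current/VERSION 1289101159 applies it (1289101159 being the value from the example above).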
 Solution 2
1. Stop the whole cluster
2. Delete the data directory on the problematic DataNode: the directory is specified
by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory
is /hdfs/temp/dfs/data.
3. Reformat the NameNode.
4. Restart the cluster.
4. Safe mode exception when running MapReduce examples
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.SafeModeException:
Cannot delete /benchmarks/TestDFSIO. Name node is in safe mode.
The reported blocks is only 3601 but the threshold is 0.9990 and the total blocks 3748. Safe mode will be turned
off automatically.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:2111)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:2088)
at org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:832)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 Solution
hduser@master:~/hadoop-1.2.1$ bin/hadoop dfsadmin -safemode leave
Safe mode is OFF
hduser@master:~/hadoop-1.2.1$ bin/hadoop jar hadoop-*test*.jar TestDFSIO -clean
TestDFSIO.0.0.4
References for Appendix C
[1] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
[2] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/#solution-2-manually-update-the-namespaceid-of-problematic-datanodes
Appendix D: TeraSort and TestDFSIO Execution
1. TeraSort
1.1. Generate the TeraSort input data using TeraGen
TeraGen generates random data that can be conveniently used as input data for a subsequent
TeraSort run. The command to run TeraGen in order to generate 100 MB of input data is:
bin/hadoop jar hadoop-*examples*.jar teragen 1000000 /home/hduser/terasort-input
1000000 specifies the number of rows of input data to generate, each of which is 100 bytes
in size.
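The row count passed to TeraGen follows directly from the target size divided by the 100-byte row size. The following is a minimal sketch of that arithmetic, assuming 1 MB is counted as 1,000,000 bytes (which is what makes 1000000 rows equal 100 MB); the helper name teragen_rows is hypothetical.

```shell
# teragen_rows SIZE_MB
# Prints the number of 100-byte rows needed for SIZE_MB megabytes
# (1 MB is taken as 1,000,000 bytes, matching the 100 MB example).
teragen_rows() {
    size_mb=$1
    echo $(( size_mb * 1000000 / 100 ))
}

teragen_rows 100     # rows for 100 MB
teragen_rows 1000    # rows for 1 GB, taken as 1000 MB
```

The first call prints 1000000, the value used in the TeraGen command; the second prints 10000000.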
1.2. Run the actual TeraSort benchmark using TeraSort
The syntax to run the TeraSort benchmark is as follows:
bin/hadoop jar hadoop-*examples*.jar terasort /home/hduser/terasort-input /home/hduser/terasort-output
1.3. Validate the sorted output data of TeraSort using TeraValidate
The syntax to run TeraValidate is as follows (the first argument is the sorted output, the
second is the directory where the validation report is written):
bin/hadoop jar hadoop-*examples*.jar teravalidate /home/hduser/terasort-output /home/hduser/terasort-validate
1.4. Check TeraSort Analysis
To check the average time to generate 100 MB, you need to run the following command:
bin/hadoop job -history /home/hduser/terasort-input
To check the average time to sort 100 MB, you need to run the following command:
bin/hadoop job -history /home/hduser/terasort-output
1.5. Clean up your temporary files
Before re-running the TeraSort benchmark, you need to clean up all files generated by the
previous TeraSort test.
bin/hadoop dfs -rmr /home/hduser/terasort-input
bin/hadoop dfs -rmr /home/hduser/terasort-output
bin/hadoop dfs -rmr /home/hduser/terasort-validate
2. TestDFSIO
2.1. Write data using TestDFSIO-Write
To generate a 100 MB dataset, you need to specify an input of 10 files of 10 MB each. The
following command performs this write operation:
bin/hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 10
A sample output of TestDFSIO-write operation provides information about the throughput,
average I/O rate, I/O rate standard deviation and test execution time.
13/11/07 15:37:27 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
13/11/07 15:37:27 INFO fs.TestDFSIO: Date & time: Thu Nov 07 15:37:27 UTC 2013
13/11/07 15:37:27 INFO fs.TestDFSIO: Number of files: 10
13/11/07 15:37:27 INFO fs.TestDFSIO: Total MBytes processed: 100
Throughput mb/sec: 5.680527152919791
13/11/07 15:37:27 INFO fs.TestDFSIO: Average IO rate mb/sec: 9.899490356445312
IO rate std deviation: 7.567628183406918
Test exec time sec: 17.568
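When several runs have to be compared, the individual figures can be pulled out of this output with a small filter. The sketch below is an illustration only: the helper name dfsio_metric is hypothetical, and the sample lines are copied from the write output above. It prints whatever follows the given label on the first line that contains it.

```shell
# dfsio_metric LABEL
# Reads TestDFSIO output on stdin and prints the value that follows
# "LABEL: " on the first line containing it (hypothetical helper).
dfsio_metric() {
    awk -v label="$1: " '
        { pos = index($0, label) }
        pos > 0 { print substr($0, pos + length(label)); exit }'
}

# Sample lines copied from the TestDFSIO write output.
sample='13/11/07 15:37:27 INFO fs.TestDFSIO: Total MBytes processed: 100
Throughput mb/sec: 5.680527152919791
13/11/07 15:37:27 INFO fs.TestDFSIO: Average IO rate mb/sec: 9.899490356445312
Test exec time sec: 17.568'

printf '%s\n' "$sample" | dfsio_metric "Throughput mb/sec"
```

The same call with "Average IO rate mb/sec" or "Test exec time sec" extracts the other figures, whether or not the line carries the log prefix.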
2.2. Read data using TestDFSIO-Read
After getting the results of the TestDFSIO-write command, the next step is to run the
TestDFSIO-read operation. To read the previously generated data, the following command
needs to be executed:
bin/hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 10
A sample output of the read operation provides information about the throughput, average I/O
rate, I/O rate standard deviation and test execution time.
13/11/07 15:38:11 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/11/07 15:38:11 INFO fs.TestDFSIO: Date & time: Thu Nov 07 15:38:11 UTC 2013
Number of files: 10
13/11/07 15:38:11 INFO fs.TestDFSIO: Total MBytes processed: 100
Throughput mb/sec: 70.57163020465772
13/11/07 15:38:11 INFO fs.TestDFSIO: Average IO rate mb/sec: 73.69004821777344
IO rate std deviation: 16.249892929638822
Test exec time sec: 15.51
2.3. Clean your cluster
The last step is to clean up the generated data using the following command:
bin/hadoop jar hadoop-*test*.jar TestDFSIO -clean
Appendix E: Data Gathering for TeraSort
1. Hadoop Physical Cluster
Number of Machines    Dataset Size  Phase      Test 1    Test 2    Test 3    Mean
3                     100 MB        Map        6         7         6         6.33
3                     100 MB        Shuffling  10        10        11        10.33
3                     100 MB        Reduce     5         5         4         4.67
3                     1 GB          Map        13        14        16        14.33
3                     1 GB          Shuffling  83        81        85        83.00
3                     1 GB          Reduce     99        77        93        89.67
3                     10 GB         Map        31        22        19        24.00
3                     10 GB         Shuffling  1065      921       930       972.00
3                     10 GB         Reduce     1511      1841      1679      1677.00
3                     30 GB         Map        26        25        28        26.33
3                     30 GB         Shuffling  2971      2312      3081      2788.00
3                     30 GB         Reduce     9522      7544      8434      8500.00
4                     100 MB        Map        5         7         7         6.33
4                     100 MB        Shuffling  10        10        10        10.00
4                     100 MB        Reduce     4         6         4         4.67
4                     1 GB          Map        14        16        15        15.00
4                     1 GB          Shuffling  81        79        80        80.00
4                     1 GB          Reduce     99        87        82        89.33
4                     10 GB         Map        19        21        19        19.67
4                     10 GB         Shuffling  951       921       881       917.67
4                     10 GB         Reduce     1680      1714      1421      1605.00
4                     30 GB         Map        21        22        20        21.00
4                     30 GB         Shuffling  2860      2912      3120      2964.00
4                     30 GB         Reduce     5908      6412      6109      6143.00
5                     100 MB        Map        5         6         6         5.67
5                     100 MB        Shuffling  10        10        10        10.00
5                     100 MB        Reduce     5         5         5         5.00
5                     1 GB          Map        14        14        13        13.67
5                     1 GB          Shuffling  79        87        80        82.00
5                     1 GB          Reduce     84        93        83        86.67
5                     10 GB         Map        21        19        20        20.00
5                     10 GB         Shuffling  937       900       857       898.00
5                     10 GB         Reduce     1729      1611      1360      1566.67
5                     30 GB         Map        19        23        22        21.33
5                     30 GB         Shuffling  2446      2710      2650      2602.00
5                     30 GB         Reduce     5437      7118      6821      6458.67
6                     100 MB        Map        6         6         5         5.67
6                     100 MB        Shuffling  10        10        11        10.33
6                     100 MB        Reduce     4         4         5         4.33
6                     1 GB          Map        16        18        14        16.00
6                     1 GB          Shuffling  86        89        83        86.00
6                     1 GB          Reduce     91        75        73        79.67
6                     10 GB         Map        18        18        16        17.33
6                     10 GB         Shuffling  885       929       906       906.67
6                     10 GB         Reduce     1147      1515      1097      1253.00
6                     30 GB         Map        20        19        20        19.67
6                     30 GB         Shuffling  2731      2694      2725      2716.67
6                     30 GB         Reduce     6419      6210      5877      6168.67
7                     100 MB        Map        6         5         6         5.67
7                     100 MB        Shuffling  10        10        10        10.00
7                     100 MB        Reduce     5         5         4         4.67
7                     1 GB          Map        16        18        12        15.33
7                     1 GB          Shuffling  83        81        87        83.67
7                     1 GB          Reduce     85        83        80        82.67
7                     10 GB         Map        23        27        25        25.00
7                     10 GB         Shuffling  985       910       979       958.00
7                     10 GB         Reduce     1681      1591      1514      1595.33
7                     30 GB         Map        37        23        40        33.33
7                     30 GB         Shuffling  2983      2796      2882      2887.00
7                     30 GB         Reduce     6514      5891      5338      5914.33
8                     100 MB        Map        5         5         5         5.00
8                     100 MB        Shuffling  10        10        10        10.00
8                     100 MB        Reduce     5         4         5         4.67
8                     1 GB          Map        15        11        10        12.00
8                     1 GB          Shuffling  92        91        88        90.33
8                     1 GB          Reduce     80        76        75        77.00
8                     10 GB         Map        20        25        29        24.67
8                     10 GB         Shuffling  925       1020      893       946.00
8                     10 GB         Reduce     1043      1679      2092      1604.67
8                     30 GB         Map        27        24        30        27.00
8                     30 GB         Shuffling  2812      2777      2834      2807.67
8                     30 GB         Reduce     5319      6317      5395      5677.00
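The Mean column is the arithmetic mean of the three test runs of each row. A minimal sketch of that calculation follows, using two rows of the physical-cluster data as input; the helper name mean3 is hypothetical.

```shell
# mean3 A B C: prints the arithmetic mean of three runs with two decimals.
mean3() {
    awk -v a="$1" -v b="$2" -v c="$3" 'BEGIN { printf "%.2f\n", (a + b + c) / 3 }'
}

mean3 6 7 6          # 3 machines, 100 MB, Map
mean3 1065 921 930   # 3 machines, 10 GB, Shuffling
```

The two calls print 6.33 and 972.00, which agree with the Mean values reported for those rows.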
2. Hadoop Virtualized Cluster- KVM
Number of KVM VMs     Dataset Size  Phase      Test 1    Test 2    Test 3    Mean
3                     100 MB        Map        4         6         5         5
3                     100 MB        Shuffling  7         7         7         7
3                     100 MB        Reduce     3         3         3         3
3                     1 GB          Map        12        14        12        12.67
3                     1 GB          Shuffling  37        37        38        37.33
3                     1 GB          Reduce     41        41        40        40.67
3                     10 GB         Map        24        20        23        22.33
3                     10 GB         Shuffling  781       737       718       745.33
3                     10 GB         Reduce     336       345       392       357.67
3                     30 GB         Map        24        24        23        23.67
3                     30 GB         Shuffling  2150      2220      2172      2180.67
3                     30 GB         Reduce     1559      1542      1539      1546.67
4                     100 MB        Map        5         5         5         5.00
4                     100 MB        Shuffling  6         7         7         6.67
4                     100 MB        Reduce     3         3         3         3.00
4                     1 GB          Map        12        15        16        14.33
4                     1 GB          Shuffling  28        34        38        33.33
4                     1 GB          Reduce     38        40        40        39.33
4                     10 GB         Map        28        29        23        26.67
4                     10 GB         Shuffling  657       672       657       662.00
4                     10 GB         Reduce     438       442       419       433.00
4                     100 GB        Map        25        28        25        26.00
4                     100 GB        Shuffling  1952      2046      1887      1961.67
4                     100 GB        Reduce     1616      1517      1605      1579.33
5                     100 MB        Map        5         5         5         5.00
5                     100 MB        Shuffling  6         7         6         6.33
5                     100 MB        Reduce     3         3         3         3.00
5                     1 GB          Map        61        64        85        70.00
5                     1 GB          Shuffling  113       109       139       120.33
5                     1 GB          Reduce     51        41        42        44.67
5                     10 GB         Map        33        29        32        31.33
5                     10 GB         Shuffling  746       632       877       751.67
5                     10 GB         Reduce     445       477       358       426.67
5                     100 GB        Map        37        66        51        51.33
5                     100 GB        Shuffling  3446      3332      2816      3198.00
5                     100 GB        Reduce     1413      1597      1788      1599.33
6                     100 MB        Map        5         5         4         4.67
6                     100 MB        Shuffling  6         6         6         6.00
6                     100 MB        Reduce     3         4         4         3.67
6                     1 GB          Map        224       343       266       277.67
6                     1 GB          Shuffling  511       464       492       489.00
6                     1 GB          Reduce     56        48        63        55.67
6                     10 GB         Map        45        37        42        41.33
6                     10 GB         Shuffling  1652      1387      1745      1594.67
6                     10 GB         Reduce     404       412       532       449.33
6                     100 GB        Map        140       180       50        123.33
6                     100 GB        Shuffling  7402      10197     5710      7769.67
6                     100 GB        Reduce     1717      1565      1206      1496.00
7                     100 MB        Map        5         5         5         5.00
7                     100 MB        Shuffling  6         6         6         6.00
7                     100 MB        Reduce     4         3         3         3.33
7                     1 GB          Map        124       245       365       244.67
7                     1 GB          Shuffling  1083      958       1344      1128.33
7                     1 GB          Reduce     102       121       81        101.33
7                     10 GB         Map        61        63        58        60.67
7                     10 GB         Shuffling  1024      1984      2062      1690.00
7                     10 GB         Reduce     985       1101      1024      1036.67
7                     100 GB        Map        185       163       154       167.33
7                     100 GB        Shuffling  12112     10197     12024     11444.33
7                     100 GB        Reduce     1987      1851      2106      1981.33
8                     100 MB        Map        5         5         5         5.00
8                     100 MB        Shuffling  6         6         6         6.00
8                     100 MB        Reduce     4         3         3         3.33
8                     1 GB          Map        162       193       167       174.00
8                     1 GB          Shuffling  1201      1320      1259      1260.00
8                     1 GB          Reduce     545.4     244.42    163.62    317.81
8                     10 GB         Map        104       121       97        107.33
8                     10 GB         Shuffling  2489.52   2440.32   2536.26   2488.70
8                     10 GB         Reduce     1211.55   1354.23   2283.52   1616.43
8                     100 GB        Map        201       195       168       188
8                     100 GB        Shuffling  11087     14587     13214     12962.667
8                     100 GB        Reduce     3088      3145      2906      3046.3333
3. Hadoop Virtualized Cluster - VMware ESXi
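| Number of VMware VMs | Dataset Size | Phase | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 3 | 100 MB | Map | 5 | 5 | 5 | 5 |
| 3 | 100 MB | Shuffling | 8 | 7 | 7 | 7 |
| 3 | 100 MB | Reduce | 4 | 4 | 4 | 4 |
| 3 | 1 GB | Map | 18 | 16 | 16 | 17 |
| 3 | 1 GB | Shuffling | 42 | 49 | 41 | 44 |
| 3 | 1 GB | Reduce | 40 | 38 | 39 | 39 |
| 3 | 10 GB | Map | 24 | 22 | 23 | 23 |
| 3 | 10 GB | Shuffling | 660 | 636 | 645 | 647 |
| 3 | 10 GB | Reduce | 492 | 483 | 493 | 489 |
| 3 | 30 GB | Map | 44 | 44 | 43 | 44 |
| 3 | 30 GB | Shuffling | 4108 | 3952 | 3891 | 3984 |
| 3 | 30 GB | Reduce | 2278 | 2315 | 2101 | 2231 |
| 4 | 100 MB | Map | 5 | 5 | 5 | 5 |
| 4 | 100 MB | Shuffling | 7 | 7 | 8 | 7 |
| 4 | 100 MB | Reduce | 4 | 4 | 4 | 4 |
| 4 | 1 GB | Map | 19 | 15 | 15 | 16 |
| 4 | 1 GB | Shuffling | 38 | 39 | 42 | 40 |
| 4 | 1 GB | Reduce | 42 | 41 | 40 | 41 |
| 4 | 10 GB | Map | 25 | 24 | 25 | 24.66667 |
| 4 | 10 GB | Shuffling | 672 | 691 | 682 | 682 |
| 4 | 10 GB | Reduce | 486 | 425 | 411 | 440.6667 |
| 4 | 30 GB | Map | 35 | 51 | 43 | 43 |
| 4 | 30 GB | Shuffling | 2657 | 3257 | 3214 | 3042.667 |
| 4 | 30 GB | Reduce | 1985 | 1852 | 1865 | 1900.667 |
| 5 | 100 MB | Map | 7 | 5 | 5 | 6 |
| 5 | 100 MB | Shuffling | 8 | 7 | 7 | 7 |
| 5 | 100 MB | Reduce | 4 | 3 | 3 | 3 |
| 5 | 1 GB | Map | 19 | 21 | 18 | 19 |
| 5 | 1 GB | Shuffling | 35 | 30 | 32 | 32 |
| 5 | 1 GB | Reduce | 39 | 35 | 37 | 37 |
| 5 | 10 GB | Map | 31 | 26 | 28 | 28 |
| 5 | 10 GB | Shuffling | 553 | 514 | 503 | 523 |
| 5 | 10 GB | Reduce | 418 | 432 | 421 | 424 |
| 5 | 30 GB | Map | 39 | 36 | 45 | 40 |
| 5 | 30 GB | Shuffling | 2540 | 2412 | 2286 | 2413 |
| 5 | 30 GB | Reduce | 2310 | 2245 | 2101 | 2219 |
| 6 | 100 MB | Map | 5 | 6 | 5 | 5 |
| 6 | 100 MB | Shuffling | 7 | 7 | 6 | 7 |
| 6 | 100 MB | Reduce | 5 | 4 | 4 | 4 |
| 6 | 1 GB | Map | 18 | 18 | 17 | 18 |
| 6 | 1 GB | Shuffling | 28 | 29 | 27 | 28 |
| 6 | 1 GB | Reduce | 32 | 29 | 34 | 32 |
| 6 | 10 GB | Map | 59 | 42 | 41 | 47 |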
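| Number of VMware VMs | Dataset Size | Phase | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 6 (continued) | 10 GB | Shuffling | 536 | 552 | 529 | 539 |
| 6 | 10 GB | Reduce | 369 | 385 | 336 | 363 |
| 6 | 30 GB | Map | 30 | 32 | 28 | 30 |
| 6 | 30 GB | Shuffling | 2412 | 2254 | 2114 | 2260 |
| 6 | 30 GB | Reduce | 2098 | 1671 | 1658 | 1809 |
| 7 | 100 MB | Map | 10 | 10 | 8 | 9 |
| 7 | 100 MB | Shuffling | 12 | 11 | 8 | 10 |
| 7 | 100 MB | Reduce | 4 | 4 | 4 | 4 |
| 7 | 1 GB | Map | 24 | 29 | 26 | 26 |
| 7 | 1 GB | Shuffling | 35 | 32 | 39 | 35 |
| 7 | 1 GB | Reduce | 26 | 34 | 25 | 28 |
| 7 | 10 GB | Map | 52 | 56 | 52 | 53 |
| 7 | 10 GB | Shuffling | 536 | 520 | 511 | 522 |
| 7 | 10 GB | Reduce | 298 | 290 | 302 | 297 |
| 7 | 30 GB | Map | 84 | 76 | 87 | 82 |
| 7 | 30 GB | Shuffling | 3210 | 2687 | 2968 | 2955 |
| 7 | 30 GB | Reduce | 1743 | 1523 | 1621 | 1629 |
| 8 | 100 MB | Map | 17 | 16 | 11 | 15 |
| 8 | 100 MB | Shuffling | 15 | 16 | 14 | 15 |
| 8 | 100 MB | Reduce | 4 | 4 | 4 | 4 |
| 8 | 1 GB | Map | 81 | 79 | 81 | 80 |
| 8 | 1 GB | Shuffling | 92 | 93 | 82 | 89 |
| 8 | 1 GB | Reduce | 36 | 36 | 37 | 36 |
| 8 | 10 GB | Map | 128 | 102 | 127 | 119 |
| 8 | 10 GB | Shuffling | 1340 | 1102 | 1021 | 1154 |
| 8 | 10 GB | Reduce | 509 | 562 | 554 | 542 |
| 8 | 30 GB | Map | 144 | 137 | 142 | 141 |
| 8 | 30 GB | Shuffling | 4481 | 4251 | 4012 | 4248 |
| 8 | 30 GB | Reduce | 1753 | 1578 | 1697 | 1676 |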
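The Mean column in the tables above appears to be the arithmetic mean of the three test runs, in most cases rounded to two decimals or to a whole number. For example, the runs 12, 14 and 12 are reported with a mean of 12.67, and the runs 781, 737 and 718 with a mean of 745.33. The sketch below reproduces that calculation. The function name `mean_of_runs` and the two-decimal default are illustrative assumptions, not part of the original tooling.

```python
from statistics import mean


def mean_of_runs(runs, ndigits=2):
    """Return the arithmetic mean of repeated measurements, rounded for reporting.

    `runs` is a non-empty sequence of numbers, one per test run.
    `ndigits` (an assumed default) controls the rounding of the result.
    """
    if not runs:
        raise ValueError("at least one run is required")
    return round(mean(runs), ndigits)


# Values taken from the tables above: three runs of the same configuration.
example_runs = [12, 14, 12]
example_mean = mean_of_runs(example_runs)
print(example_mean)
```

Rounding only at the end, rather than rounding each run first, keeps the reported mean as close as possible to the raw measurements.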
Appendix F: Data Gathering for TestDFSIO
1. Hadoop Physical Cluster
Number of Nodes = 3
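| Dataset Size | Operation | Criteria | Test1 | Test2 | Test3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 2.867 | 2.861 | 2.421 | 2.72 |
| 100 MB | Write | Average IO rate (mb/sec) | 2.903 | 2.957 | 2.517 | 2.79 |
| 100 MB | Write | IO rate standard deviation | 0.363 | 0.505 | 0.486 | 0.45 |
| 100 MB | Write | Execution time (sec) | 17.786 | 16.717 | 18.8 | 17.77 |
| 100 MB | Read | Throughput (mb/sec) | 7.645 | 6.309 | 6.558 | 6.84 |
| 100 MB | Read | Average IO rate (mb/sec) | 19.509 | 11.442 | 31.255 | 20.74 |
| 100 MB | Read | IO rate standard deviation | 26.167 | 14.595 | 40.655 | 27.14 |
| 100 MB | Read | Execution time (sec) | 14.72 | 16.721 | 14.705 | 15.38 |
| 1 GB | Write | Throughput (mb/sec) | 2.507 | 2.713 | 2.204 | 2.47 |
| 1 GB | Write | Average IO rate (mb/sec) | 2.889 | 2.866 | 2.498 | 2.75 |
| 1 GB | Write | IO rate standard deviation | 1.2632 | 0.765 | 0.929 | 0.99 |
| 1 GB | Write | Execution time (sec) | 77.129 | 74.47 | 83.658 | 78.42 |
| 1 GB | Read | Throughput (mb/sec) | 6.037 | 7.297 | 5.068 | 6.13 |
| 1 GB | Read | Average IO rate (mb/sec) | 10.231 | 31.235 | 8.779 | 16.75 |
| 1 GB | Read | IO rate standard deviation | 10.784 | 39.149 | 9.712 | 19.88 |
| 1 GB | Read | Execution time (sec) | 43.468 | 35.947 | 42.779 | 40.73 |
| 10 GB | Write | Throughput (mb/sec) | 2.503 | 2.589 | 3.288 | 2.79 |
| 10 GB | Write | Average IO rate (mb/sec) | 2.671 | 2.761 | 3.318 | 2.92 |
| 10 GB | Write | IO rate standard deviation | 0.796 | 0.817 | 0.317 | 0.64 |
| 10 GB | Write | Execution time (sec) | 674.535 | 641.232 | 363.144 | 559.64 |
| 10 GB | Read | Throughput (mb/sec) | 7.956 | 7.799 | 5.458 | 7.07 |
| 10 GB | Read | Average IO rate (mb/sec) | 11.289 | 12.452 | 5.786 | 9.84 |
| 10 GB | Read | IO rate standard deviation | 6.421 | 12.916 | 1.485 | 6.94 |
| 10 GB | Read | Execution time (sec) | 241.896 | 296.722 | 257.708 | 265.44 |
| 100 GB | Write | Throughput (mb/sec) | 3.544 | 3.275 | 3.275 | 3.36 |
| 100 GB | Write | Average IO rate (mb/sec) | 3.546 | 3.284 | 3.282 | 3.37 |
| 100 GB | Write | IO rate standard deviation | 0.089 | 0.165 | 0.148 | 0.13 |
| 100 GB | Write | Execution time (sec) | 3315.61 | 3343.122 | 3338.37 | 3332.37 |
| 100 GB | Read | Throughput (mb/sec) | 4.746 | 5.109 | 5.603 | 5.15 |
| 100 GB | Read | Average IO rate (mb/sec) | 4.791 | 5.238 | 12.875 | 7.63 |
| 100 GB | Read | IO rate standard deviation | 0.478 | 0.852 | 18.333 | 6.55 |
| 100 GB | Read | Execution time (sec) | 2387.659 | 2467.634 | 2734.46 | 2529.92 |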
Number of Nodes = 4
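| Dataset Size | Operation | Criteria | Test1 | Test2 | Test3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 2.65 | 3.303 | 3.639 | 3.20 |
| 100 MB | Write | Average IO rate (mb/sec) | 2.661 | 3.543 | 3.932 | 3.38 |
| 100 MB | Write | IO rate standard deviation | 0.173 | 0.796 | 1.212 | 0.73 |
| 100 MB | Write | Execution time (sec) | 17.665 | 17.039 | 15.674 | 16.79 |
| 100 MB | Read | Throughput (mb/sec) | 6.405 | 9.038 | 5.827 | 7.09 |
| 100 MB | Read | Average IO rate (mb/sec) | 19.631 | 31.433 | 19.547 | 23.54 |
| 100 MB | Read | IO rate standard deviation | 28.056 | 37.351 | 29.466 | 31.62 |
| 100 MB | Read | Execution time (sec) | 15.35 | 13.684 | 14.676 | 14.57 |
| 1 GB | Write | Throughput (mb/sec) | 2.556 | 2.79 | 2.786 | 2.71 |
| 1 GB | Write | Average IO rate (mb/sec) | 2.669 | 2.954 | 2.885 | 2.84 |
| 1 GB | Write | IO rate standard deviation | 0.582 | 0.747 | 0.536 | 0.62 |
| 1 GB | Write | Execution time (sec) | 59.677 | 58.02 | 61.536 | 59.74 |
| 1 GB | Read | Throughput (mb/sec) | 12.133 | 6.031 | 8.264 | 8.81 |
| 1 GB | Read | Average IO rate (mb/sec) | 27.419 | 7.751 | 25.182 | 20.12 |
| 1 GB | Read | IO rate standard deviation | 34.001 | 4.998 | 35.168 | 24.72 |
| 1 GB | Read | Execution time (sec) | 33.087 | 40.004 | 32.861 | 35.32 |
| 10 GB | Write | Throughput (mb/sec) | 3.713 | 3.325 | 3.201 | 3.41 |
| 10 GB | Write | Average IO rate (mb/sec) | 3.735 | 3.341 | 3.22 | 3.43 |
| 10 GB | Write | IO rate standard deviation | 0.283 | 0.236 | 0.248 | 0.26 |
| 10 GB | Write | Execution time (sec) | 315.636 | 347.593 | 367.294 | 343.51 |
| 10 GB | Read | Throughput (mb/sec) | 5.045 | 5.738 | 5.006 | 5.26 |
| 10 GB | Read | Average IO rate (mb/sec) | 5.205 | 11.437 | 5.138 | 7.26 |
| 10 GB | Read | IO rate standard deviation | 1.779 | 15.779 | 0.884 | 6.15 |
| 10 GB | Read | Execution time (sec) | 258.009 | 261.24 | 276.283 | 265.18 |
| 100 GB | Write | Throughput (mb/sec) | 3.533 | 3.354 | 3.366 | 3.42 |
| 100 GB | Write | Average IO rate (mb/sec) | 3.538 | 3.356 | 3.37 | 3.42 |
| 100 GB | Write | IO rate standard deviation | 0.136 | 0.085 | 0.111 | 0.11 |
| 100 GB | Write | Execution time (sec) | 3557.813 | 3507.078 | 3184.76 | 3416.5 |
| 100 GB | Read | Throughput (mb/sec) | 7.009 | 6.716 | 4.349 | 6.02 |
| 100 GB | Read | Average IO rate (mb/sec) | 7.6966 | 12.229 | 10.179 | 10.03 |
| 100 GB | Read | IO rate standard deviation | 5.129 | 9.796 | 12.546 | 9.16 |
| 100 GB | Read | Execution time (sec) | 2098.422 | 2700.035 | 3046.01 | 2614.8 |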
Number of Nodes = 5
Data
Size
Operation
Write
100 MB
Read
Write
1 GB
Read
Write
10 GB
Read
Write
100 GB
Read
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
2.597
2.623
0.28
16.672
8.019
32.097
40.452
14.584
2.477
2.572
0.533
59.372
7.659
11.029
7.049
36.214
3.309
3.329
0.264
346.239
6.309
9.178
9.178
263.224
3.103
3.115
0.191
2.791
2.841
0.406
15.708
7.213
35.821
47.846
14.501
2.896
3.032
0.64
56.319
5.617
8.651
8.984
35.04
3.337
3.367
0.335
340.622
5.741
13.109
23.064
256.89
3.081
3.092
0.183
3552.118
3478.991
Throughput (mb/sec)
4.737
6.078
5.198
5.292
2462.739
2243.086
2.597
2.791
3.804
3.941
0.772
16.679
10.68
46.053
40.287
14.579
2.676
2.757
0.533
54.271
8.738
25.868
41.954
30.18
3.382
3.415
0.331
361.257
4.771
4.839
0.609
254.85
3.343
3.349
0.1386
3177.00
1
4.478
4.512
2558.05
1
3.804
3.06
3.14
0.49
16.35
8.64
37.99
42.86
14.55
2.68
2.79
0.57
56.65
7.34
15.18
19.33
33.81
3.34
3.37
0.31
349.37
5.61
9.04
10.95
258.32
3.18
3.19
0.17
3402.7
0
4.80
5.29
2421.2
9
3.06
Number of Nodes = 6
Dataset
Size
Operation
Write
100 MB
Read
Write
1 GB
Read
Write
10 GB
Read
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
3.603
3.726
0.714
16.679
7.017
37.162
49.103
14.165
3.089
3.369
1.13
55.472
7.809
8.239
1.988
33.23
3.366
3.386
0.267
347.497
5.681
10.302
13.222
269.214
3.343
3.352
0.178
4.19
4.329
0.865
15.703
6.681
24.101
36.636
13.942
3.155
3.239
0.595
51.491
7.593
20.751
34.499
34.396
3.133
3.139
0.14
353.804
6.327
14.173
18.079
270.797
3.252
3.26
0.173
3254.674
3329.312
5.435
7.827
8.045
5.169
5.465
3.505
2369.118
2481.531
3.603
3.726
0.714
16.679
7.017
37.162
49.103
14.165
4.19
4.329
0.865
15.703
6.681
24.101
36.636
13.942
3.536
3.949
1.337
15.792
8.877
33.456
41.63
14.833
3.0178
3.088
0.522
51.749
5.651
6.391
2.169
32.392
3.782
3.796
0.229
297.105
14.756
27.573
22.233
176.225
3.268
3.275
6.127
3313.77
3
6.126
11.738
13.987
2168.30
4
3.536
3.949
1.337
15.792
8.877
33.456
41.63
14.833
3.78
4.00
0.97
16.06
7.53
31.57
42.46
14.31
3.09
3.23
0.75
52.90
7.02
11.79
12.89
33.34
3.43
3.44
0.21
332.80
8.92
17.35
17.84
238.75
3.29
3.30
2.16
3299.2
5
5.58
8.34
8.51
2339.6
5
3.78
4.00
0.97
16.06
7.53
31.57
42.46
14.31
Throughput (mb/sec)
100 GB
300 GB
Write
Read
Throughput (mb/sec)
Throughput (mb/sec)
Number of Nodes = 7
Data
Size
Operation
Write
100 MB
Read
Write
1 GB
Read
Write
10 GB
Read
100 GB
Write
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
3.475
3.928
1.605
15.793
9.034
29.731
35.058
14.071
3.771
3.814
0.402
44.285
6.069
13.227
19.929
42.883
3.377
3.395
0.248
342.034
5.909
8.364
6.168
273.925
2.698
2.77
0.508
3.028
3.263
0.905
15.679
6.669
14.642
25.436
14.688
3.837
3.887
0.441
41.408
6.664
19.04
38.007
37.004
3.548
3.568
0.28
313.38
7.832
18.227
22.808
238.699
3.49
3.493
0.083
3.475
3.928
1.605
15.793
9.034
29.731
35.058
14.071
3.203
3.509
1.118
52.404
6.644
7.797
3.689
32.181
3.636
3.646
0.194
311.647
7.661
14.755
17.955
242.805
3.609
3.611
0.075
2987.432
2972.533
2849.33
3.9676
6.569
4.499
9.992
4.804
6.072
3.33
3.71
1.37
15.76
8.25
24.70
31.85
14.28
3.60
3.74
0.65
46.03
6.46
13.35
20.54
37.36
3.52
3.54
0.24
322.35
7.13
13.78
15.64
251.81
3.27
3.29
0.22
2936.4
3
4.42
7.54
8.425
14.613
3.837
8.96
1846.735
1952.279
3.475
3.928
1.605
15.793
9.034
29.731
35.058
14.071
3.028
3.263
0.905
15.679
6.669
14.642
25.436
14.688
2653.41
4
3.475
3.928
1.605
15.793
9.034
29.731
35.058
14.071
2150.8
1
3.33
3.71
1.37
15.76
8.25
24.70
31.85
14.28
Throughput (mb/sec)
Read
300 GB
Write
Read
Throughput (mb/sec)
Throughput (mb/sec)
Number of Nodes = 8
Dataset
Size
Operation
Write
100 MB
Read
Write
1 GB
Read
Write
10 GB
Read
Write
100 GB
Read
300 GB
Write
Read
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
3.229
3.485
1.161
15.854
2.197
2.235
0.324
16.137
1.828
2.198
1.507
16.931
5.521
18.754
40.071
14.75
3.977
4.067
0.614
38.578
6.079
14.054
24.771
40.959
3.718
3.733
0.244
305.018
7.692
14.645
12.945
292.329
3.208
3.216
0.154
3386.666
4.959
4.941
1.292
2508.348
2.641
2.657
0.190
7882.599
4.202
5.188
4.432
5386.197
5.721
18.857
40.038
15.846
3.701
3.826
0.758
46.594
6.579
36.377
67.854
40.594
3.441
3.466
0.306
332.807
8.208
16.074
16.939
230.154
3.568
3.571
0.086
2908.484
4.648
4.694
0.495
2343.769
2.629
2.644
0.178
7862.87
4.111
5.044
3.660
5796.385
5.361
16.446
32.486
15.278
4.010
4.061
0.473
43.285
5.672
14.545
26.687
42.45
3.432
3.460
0.320
337.531
6.057
13.983
19.665
289.889
3.592
3.595
0.095
2918.7
5.303
6.544
4.823
2358.37
2.638
2.654
0.196
8000.19
4.307
8.211
12.171
5551.46
3.51
3.78
1.2115
15.764
5
7.767
30.20
41.78
14.72
3.82
3.91
0.63
43.11
6.49
20.03
35.06
39.94
3.52
3.54
0.23
323.16
6.43
11.94
12.82
286.00
3.54
3.55
0.13
2988.4
5.23
7.66
6.73
2471.1
2.63
2.65
0.19
7917.8
4.28
6.04
6.15
5546.3
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
2. Hadoop Virtualized Cluster - KVM
Number of KVM VMs = 3
Data
Size
Operation
100 MB
Write
Read
1 GB
Write
Read
10 GB
Write
Read
100 GB
Write
Read
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
6.804
11.989
9.399
15.439
101.833
102.428
7.777
13.479
7.764
9.173
3.808
40.681
22.912
30.131
16.609
20.441
7.409
7.837
1.917
283.681
15.179
15.23
0.899
9.417
20.359
15.769
15.405
96.618
102.200
22.154
14.881
7.231
7.397
1.115
40.126
19.046
19.969
4.422
20.518
7.429
7.68
1.43
283.554
16.526
17.456
5.7934
6.152
15.808
16.049
13.405
104.275
105.770
12.986
13.399
8.201
11.249
6.539
38.515
25.187
40.464
34.112
19.44
7.323
7.616
1.55
288.894
16.663
17.96
5.753
7.39
7.71
1.63
14.75
16.12
16.88
4.15
13.92
7.73
9.27
3.82
39.77
22.38
30.19
18.38
20.13
7.39
7.71
1.63
285.38
16.12
16.88
4.15
148.455
133.574
128.987
137.01
Throughput (mb/sec)
6.704
7.621
7.621
6.883
7.557
7.247
1.147
2929.379
1.554
2666.541
1.512
2812.221
7.32
7.23
1.40
2802.71
Throughput (mb/sec)
15.959
15.79
15.79
15.85
16.413
15.831
15.831
0.717
0.818
0.724
16.03
0.75
1316.845
1486.787
1438.554
1414.06
Data
Size
Operation
100 MB
Write
Read
1 GB
Write
Read
10 GB
Write
Criteria
4.721
12.213
5.25
6.71
5.865
13.164
16.688
4.41
14.37
15.692
15.473
15.18
Throughput (mb/sec)
95.419
77.042
96.061
11.37
100.145
84.183
97.716
11.97
21.111
23.277
12.857
2.92
14.387
14.364
15.534
14.76
Throughput (mb/sec)
5.825
5.557
5.677
7.556
7.236
7.601
5.199
40.198
38.079
40.489
Throughput (mb/sec)
26.314
33.061
23.697
45.562
52.421
31.314
42.962
41.111
18.684
19.474
15.461
19.475
Throughput (mb/sec)
5.817
5.188
5.182
7.263
6.567
6.457
4.535
4.212
3.955
270.133
296.114
301.458
Throughput (mb/sec)
14.008
11.722
11.517
12.759
18.052
4.447
3.54
16.12
118.331
144.184
130.603
Throughput (mb/sec)
5.149
6.361
3.833
2778.663
11.655
12.193
2.891
1369.266
5.339
5.259
6.886
6.886
4.625
4.78
2780.868
2824.785
11.181
11.269
12.002
11.724
Throughput (mb/sec)
Read
Mean
5.13
12.451
Write
Test3
4.842
8.131
100 GB
Test2
Throughput (mb/sec)
Read
Test1
15.293
4.868
3.319
1318.89
5.323
2.549
1520.755
5.69
7.46
5.13
39.59
27.69
43.10
34.25
18.14
5.40
6.76
4.23
289.24
12.42
15.37
8.04
131.04
5.25
6.71
4.41
2794.77
11.37
11.97
2.92
1402.97
Data
Size
Operation
Write
100 MB
Read
1 GB
Write
Read
10 GB
Write
Read
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
5.796
4.807
2.949
4.52
6.55
5.447
3.682
5.23
2.342
2.171
2.446
14.444
14.696
14.398
Throughput (mb/sec)
42.481
54.171
54.083
52.311
65.455
63.057
21.799
23.039
20.053
14.39
14.466
14.534
Throughput (mb/sec)
3.962
2.168
2.552
4.422
2.215
2.626
1.699
0.375
0.527
42.716
37.287
37.65
Throughput (mb/sec)
4.883
7.708
5.135
6.698
9.251
5.884
4.412
4.452
2.42
18.364
17.669
18.061
Throughput (mb/sec)
3.369
3.495
3.421
3.374
3.497
3.294
0.123
0.081
0.057
262.581
287.531
291.531
Throughput (mb/sec)
8.792
7.17
8.27
8.558
7.3
7.211
1.058
0.906
0.906
128.409
125.356
133.347
Throughput (mb/sec)
5.149
6.361
4.811
6.847
5.2
6.121
5.677
5.75
2.32
14.51
50.25
60.27
21.63
14.46
2.89
3.09
0.87
39.22
5.91
7.28
3.76
18.03
3.43
3.39
0.09
280.55
8.08
7.69
0.96
129.04
5.73
6.05
5.27
2824.512
2784.82
11.724
100 GB
Write
Read
2679.211
Throughput (mb/sec)
11.655
5.255
2850.74
4
11.181
12.193
12.002
2.891
1475.121
3.319
1214.15
11.269
2.549
1420.575
11.97
2.92
1369.95
Data
Size
Operation
Write
100 MB
Read
Write
1 GB
Read
Write
10 GB
Write
100 GB
Read
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
5.054
4.318
3.31
4.23
6.463
4.769
3.984
5.07
2.986
2.163
2.446
2.53
Throughput (mb/sec)
14.474
62.035
69.145
22.529
14.468
14.44
56.085
65.138
23.831
14.441
14.42
24.337
62.606
38.026
15.151
14.44
47.49
65.63
28.13
14.69
Throughput (mb/sec)
3.089
3.155
2.982
3.08
3.369
3.239
3.262
3.29
1.13
0.595
1.115
0.95
55.472
51.491
57.861
54.94
Throughput (mb/sec)
7.809
7.593
9.488
8.30
8.239
20.751
29.577
19.52
1.988
34.499
36.679
24.39
34.23
34.396
31.08
33.24
Throughput (mb/sec)
1.138
0.393
0.862
0.80
1.326
0.393
0.875
0.86
0.105
0.015
0.112
0.08
310.523
372.186
359.615
347.44
Throughput (mb/sec)
0.881
1.437
1.568
1.30
3.091
1.639
1.721
2.15
5.442
0.666
0.645
2.25
144.278
115.98
155.58
138.61
Throughput (mb/sec)
2.597
2.898
2.581
2.69
2.516
2.606
2.625
2.58
0.155
0.157
0.21
0.17
Throughput (mb/sec)
4130.984 4322.184
4.365
4.125
4.744
4.994
1.235
1.352
4179.124
4.335
3.951
1.228
4210.76
4.28
4.56
1.27
3115.787 3411.599
3954.8
3494.06
Data
Size
Operation
Criteria
Test1
Test2
Test3
Mean
2.81
2.419
2.604
2.61
Throughput (mb/sec)
3.285
1.731
16.788
36.311
2.562
0.719
17.535
39.541
2.68
0.477
19.524
40.404
2.84
0.98
17.95
38.75
42.668
52.131
52.211
49.00
16.107
22.127
24.376
20.87
15.563
15.498
15.573
15.54
Throughput (mb/sec)
2.027
2.263
2.234
2.17
2.072
2.342
2.273
2.23
0.357
0.484
2.273
1.04
Write
69.088
66.069
70.888
68.68
Read
Throughput (mb/sec)
9.77
20.969
26.933
24.537
28.656
13.725
11.26
25.211
23.079
15.19
24.95
21.25
32.326
38.813
35.21
35.45
Throughput (mb/sec)
3.052
3.505
3.727
3.43
3.073
3.519
3.239
3.28
0.267
0.226
0.216
0.24
390.563
7.934
18.501
6.414
167.311
2.214
2.421
0.195
8303.277
3.218
4.421
1.521
318.818
8.232
8.33
0.9
153.892
2.632
2.412
0.157
8990.14
3.625
5.114
1.235
301.551
10.567
11.837
4.159
163.909
2.325
2.514
0.21
8776.16
5.024
2.125
1.095
336.98
8.91
12.89
3.82
161.70
2.39
2.45
0.19
8689.86
3.96
3.89
1.28
7820.6253
7573.74
9175.136
8189.84
Throughput (mb/sec)
100 MB
Write
Read
1 GB
10 GB
Write
Read
Throughput (mb/sec)
Throughput (mb/sec)
100 GB
Write
Throughput (mb/sec)
Read
Data Size
Operation
Write
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
7.45965
7.00618
2.206
26.14782
49.132
34.014
13.245
24.96277
2.002
2.105
0.311
93.2688
9.02
20.123
20.923
38.7912
3.052
3.009
0.213
515.5432
7.934
18.501
5.621
830.37
2.58245
3.8086
2.365
43.34817
46.789
43.115
13.154
31.83195
2.365
2.211
0.12
83.90763
11.417
19.296
18.665
22.5756
3.505
3.157
0.215
420.8398
8.232
8.33
6.211
249.305
3.80997
4.14562
2.323
29.21525
52.311
66.94
14.774
31.4689
2.004
2.106
2.185
150.2826
11.352
20.011
23.001
77.462
3.727
3.562
0.2
729.7534
10.567
11.837
1.529
368.7953
4.62
4.99
2.30
32.90
49.41
48.02
13.72
29.42
2.12
2.14
0.87
109.15
10.60
19.81
20.86
46.28
3.43
3.24
0.21
555.38
8.91
12.89
4.45
482.82
100 MB
Throughput (mb/sec)
Read
Throughput (mb/sec)
Write
1 GB
Throughput (mb/sec)
Read
Throughput (mb/sec)
Write
10 GB
Throughput (mb/sec)
Read
Throughput (mb/sec)
Write
100 GB
Throughput (mb/sec)
Read
3. Hadoop Virtualized Cluster - VMware ESXi
Number of VMware ESXi VMs = 3
Dataset
Size
Operation
100 MB
Write
Read
1 GB
Write
Read
10 GB
Write
Read
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
1.534
4.382
5.396
3.77
5.854
6.476
8.613
6.98
5.995
4.094
4.688
4.93
32.586
39.459
33.961
35.34
Throughput (mb/sec)
15.813
15.489
11.664
14.32
36.691
40.070
35.419
37.39
17.121
18.054
17.575
17.58
27.836
29.189
31.256
29.43
Throughput (mb/sec)
2.796
2.843
2.267
2.64
3.25
3.036
2.63
2.97
1.284
0.661
0.925
0.96
98.707
99.748
105.382
101.28
Throughput (mb/sec)
14.873
16.918
15.707
15.83
17.528
18.735
17.787
18.02
5.826
5.519
5.377
5.57
45.231
45.825
44.245
45.10
Throughput (mb/sec)
16.154
17.254
16.259
16.56
29.400
29.484
28.354
29.08
0.002
0.003
0.03
0.01
477.380
467.431
467.431
470.75
Throughput (mb/sec)
17.214
16.213
16.254
16.56
90.557
87.254
90.264
89.36
0.0219
0.0211
0.003
0.02
138.808
8.215
153.864
8.255
162.121
6.923
151.60
7.80
6.874
6.254
7.257
6.80
0.952
0.961
1.021
0.98
4630.131
4766.423
4621.21
4672.59
Throughput (mb/sec)
12.214
12.214
12.214
12.21
15.24
15.24
15.24
15.24
2.721
2.745
2.847
2.77
1621.001
1569.541
1642.21
1610.92
Throughput (mb/sec)
100 GB
Write
Read
Dataset
Size
Operation
Write
100 MB
Read
Write
1 GB
Read
Write
10 GB
Read
Write
100 GB
Read
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
Throughput (mb/sec)
3.621
6.094
3.863
32.953
22.207
35.199
11.124
27.896
3.256
6.509
4.442
38.188
13.139
33.706
17.761
29.868
4
5.994
4.69
33.315
11.465
30.311
22.259
29.815
3.63
6.20
4.33
34.82
15.60
33.07
17.05
29.19
Throughput (mb/sec)
Throughput (mb/sec)
2.652
2.808
0.718
91.877
18.332
24.537
17.041
43.546
3.716
4.021
1.157
90.642
19.887
37.693
40.033
43.877
4.593
5.233
2.367
82.769
12.121
21.528
20.724
41.49
3.65
4.02
1.41
88.43
16.78
27.92
25.93
42.97
Throughput (mb/sec)
Throughput (mb/sec)
16.211
24.756
0.004
474.891
13.254
22.644
0.001
16.001
29.481
0.002
457.717
12.354
21.14
0.014
15.251
25.328
0.002
415.126
16.321
23.214
0.003
15.82
26.52
0.00
449.24
13.98
22.33
0.01
151.35
120.893
139.212
Throughput (mb/sec)
Throughput (mb/sec)
4.215
6.214
0.617
4384.964
12.214
16.241
2.125
1573.197
4.101
5.214
1.002
4514.001
13.12
15.214
2.155
1203.144
4.259
6.254
0.658
3913.98
12.542
15.24
2.314
1503.98
137.15
4.19
5.89
0.76
4270.98
12.63
15.57
2.20
1426.78
Dataset
Size
Operation
Write
100 MB
Read
Write
1 GB
Read
Write
10 GB
Read
Write
100 GB
Read
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
6.787
5.783
5.797
6.12
7.487
6.311
6.085
6.63
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
2.357
21.832
32.553
33.458
9.214
20.689
1.926
2.031
0.526
76.754
18.093
23.024
11.61
1.683
19.816
30.599
33.203
8.962
17.884
2.032
2.133
0.569
79.927
9.005
9.845
3.008
1.291
22.687
31.699
19.672
9.374
18.865
2.497
2.648
0.793
68.664
14.98
17.178
7.886
1.78
21.45
31.62
28.78
9.18
19.15
2.15
2.27
0.63
75.12
14.03
16.68
7.50
26.606
34.991
32.247
31.28
Throughput (mb/sec)
Throughput (mb/sec)
3.065
3.079
0.217
421.213
10.624
10.701
0.912
3.131
3.143
0.203
417.535
10.144
10.24
1.023
3.03
3.048
0.238
427.938
10.566
10.709
1.236
3.08
3.09
0.22
422.23
149.57
10.50
4.21
124.088
137.512
131.46
131.02
Throughput (mb/sec)
Throughput (mb/sec)
3.202
3.298
0.617
3584.964
11.951
12.24
2.372
3.147
3.182
0.335
3607.653
11.709
11.877
2.201
3.144
3.249
0.778
3595.375
12.024
2.255
1.667
3.16
3.24
0.58
3596.00
11.89
8.79
2.08
1163.197
1085.013
1054.679
1100.96
Dataset
Size
Operation
Write
100 MB
Read
1 GB
Write
Read
10 GB
Write
Read
100 GB
Write
Read
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
2.528
4.331
1.161
2.67
2.767
4.794
1.924
3.16
Throughput (mb/sec)
0.803
30.602
18.91
23.593
10.874
20.554
1.448
23.713
24.085
29.105
12.232
18.235
1.256
30.648
20.934
30.76
13.055
18.076
1.17
28.32
21.31
27.82
12.05
18.96
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
3.035
1.663
1.578
4.024
1.683
1.655
2.173
0.185
0.389
58.777
86.469
75.434
3.201
7.975
8.673
5.086
7.729
9.918
4.142
1.629
3.371
32.292
44.313
44.413
3.132
3.012
2.693
3.163
3.053
2.718
0.329
0.366
0.261
375.74
408.223 462.919
8.422
9.66
9.21
8.489
9.26
9.32
0.774
2.301
1.254
122.793
125.499
133.985
27.459
26.086
26.888
27.459
26.086
26.888
7.555
0.005
0.002
3669.984 3881.374 3752.234
92.256
98.351
96.926
92.256
98.351
96.926
0.009
0.0156
0.017
1106.825 1042.597 1049.995
2.09
2.45
0.92
73.56
6.62
7.58
3.05
40.34
2.95
2.98
0.32
415.63
9.10
9.02
1.44
127.43
26.81
26.81
2.52
3767.86
95.84
95.84
0.01
1066.47
Dataset
Size
100 MB
Operation
Write
Read
1 GB
Write
Read
10 GB
Write
Read
100 GB
Write
Read
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
Throughput (mb/sec)
16.815
16.835
0.021
23.757
112.524
103.235
0.019
18.002
16.87
15.22
0.003
21.82
137.741
135.542
0.027
16.993
22.75
22.18
0.005
20.727
124.069
131.096
0.015
16.189
18.81
18.08
0.01
22.10
124.78
88.89
6.01
18.81
Throughput (mb/sec)
Throughput (mb/sec)
21.989
21.989
0.003
66.387
66.366
66.366
0.011
44.061
22.135
22.187
0.004
69.104
72.849
62.325
0.007
31.754
18.241
18.215
1.526
78.357
76.694
79.241
0.012
42.002
20.79
20.80
0.51
71.28
71.97
69.31
0.01
39.27
22.587
25.638
0.005
417.404
80.214
92.541
0.004
132.544
23.574
27.036
0.004
3727.394
65.963
84.254
0.014
980.919
23.25
25.08
0.00
410.29
86.34
96.22
0.005
128.75
22.70
27.23
0.004
3733.40
55.58
91.38
0.02
1016.58
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
25.951
21.215
23.465
26.124
0.005
0.003
412.5
400.975
92.125
86.671
98.851
97.256
0.006
0.004
121.16
132.544
19.274
25.261
26.332
28.315
0.002
0.005
3826.909 3645.909
45.215
55.547
95.214
94.686
0.019
0.023
1074.606 994.225
Dataset
Size
100 MB
Operation
Write
Read
1 GB
Write
Read
10 GB
Write
Read
100 GB
Write
Read
Criteria
Test1
Test2
Test3
Mean
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
Throughput (mb/sec)
6.352
8.359
0.001
42.097
69.215
93.721
0.018
27.748
7.652
13.067
0.003
94.415
36.678
28.352
0.003
55.273
18.124
26.557
0.004
400.564
89.268
69.361
0.008
130.975
18.214
27.006
0.001
3737.64
93.514
68.325
0.143
1090.37
6.214
16.072
0.035
22.322
82.254
146.511
0.02
27.962
6.521
16.873
4.231
94.711
60.214
99.265
0.021
30.137
19.254
24.477
0.004
438.626
119.348
80.541
0.007
101.048
19.421
24.451
0.002
4138.379
90.291
78.245
0.012
1130.456
5.214
9.325
0.001
34.282
68.325
95.328
0.014
26.957
6.241
17.89
0.002
79.465
62.124
78.019
0.006
62.601
18.625
24.955
0.004
432.917
78.019
59.013
0.007
171.601
19.566
25.324
0.002
3981.254
91.157
65.247
0.102
1105.645
5.93
11.25
0.01
32.90
73.26
111.85
0.02
27.56
6.80
15.94
1.41
89.53
53.01
68.55
0.01
49.34
18.67
25.33
0.00
424.04
95.55
69.64
0.0073
134.54
19.07
25.59
0.00
3952.43
91.65
70.61
0.09
1108.82