Nesting Virtual Machines in Virtualization Test Frameworks
Dissertation submitted in May 2010 to the Department of Mathematics and Computer Science of the Faculty of Sciences, University of Antwerp, in partial fulfillment of the requirements for the degree of Master of Science.

Supervisor: Prof. Dr. Jan Broeckhove
Co-supervisor: Dr. Kurt Vanmechelen
Mentors: Sam Verboven & Ruben Van den Bossche

Olivier Berghmans
Research Group Computational Modelling and Programming
Contents

List of Figures
List of Tables
Nederlandstalige samenvatting (Dutch Summary)
Preface
Abstract

1 Introduction
  1.1 Goals
  1.2 Outline

2 Virtualization
  2.1 Applications
  2.2 Taxonomy
    2.2.1 Process virtual machines
    2.2.2 System virtual machines
  2.3 x86 architecture
    2.3.1 Formal requirements
    2.3.2 The x86 protection level architecture
    2.3.3 The x86 architecture problem

3 Evolution of virtualization for the x86 architecture
  3.1 Dynamic binary translation
    3.1.1 System calls
    3.1.2 I/O virtualization
    3.1.3 Memory management
  3.2 Paravirtualization
    3.2.1 System calls
    3.2.2 I/O virtualization
    3.2.3 Memory management
  3.3 First generation hardware support
  3.4 Second generation hardware support
  3.5 Current and future hardware support
  3.6 Virtualization software
    3.6.1 VirtualBox
    3.6.2 VMware
    3.6.3 Xen
    3.6.4 KVM
    3.6.5 Comparison between virtualization software

4 Nested virtualization
  4.1 Dynamic binary translation
  4.2 Paravirtualization
  4.3 Hardware supported virtualization

5 Nested virtualization in practice
  5.1 Software solutions
    5.1.1 Dynamic binary translation
    5.1.2 Paravirtualization
    5.1.3 Overview software solutions
  5.2 First generation hardware support
    5.2.1 Dynamic binary translation
    5.2.2 Paravirtualization
    5.2.3 Hardware supported virtualization
    5.2.4 Overview first generation hardware support
  5.3 Second generation hardware support
    5.3.1 Dynamic binary translation
    5.3.2 Paravirtualization
    5.3.3 Hardware supported virtualization
    5.3.4 Overview second generation hardware support
  5.4 Nested hardware support
    5.4.1 KVM
    5.4.2 Xen

6 Performance results
  6.1 Processor performance
  6.2 Memory performance
  6.3 I/O performance
    6.3.1 Network I/O
    6.3.2 Disk I/O
  6.4 Conclusion

7 Conclusions
  7.1 Nested virtualization and performance results
  7.2 Future work

Appendices

Appendix A Virtualization software
  A.1 VirtualBox

Appendix B Details of the nested virtualization in practice
  B.1 Dynamic binary translation
    B.1.1 VirtualBox
    B.1.2 VMware Workstation
  B.2 Paravirtualization
  B.3 First generation hardware support
    B.3.1 Dynamic binary translation
    B.3.2 Paravirtualization
  B.4 Second generation hardware support
    B.4.1 Dynamic binary translation
    B.4.2 Paravirtualization
  B.5 KVM's nested SVM support

Appendix C Details of the performance tests
  C.1 sysbench
  C.2 iperf
  C.3 iozone
List of Figures

2.1 Implementation layers in a computer system.
2.2 Taxonomy of virtual machines.
2.3 The x86 protection levels.
3.1 Memory management in x86 virtualization using shadow tables.
3.2 Execution flow using virtualization based on Intel VT-x.
3.3 Latency reductions by CPU implementation [30].
4.1 Layers in a nested virtualization setup with hosted hypervisors.
4.2 Memory architecture in a nested situation.
5.1 Layers for nested paravirtualization in dynamic binary translation.
5.2 Layers for nested Xen paravirtualization.
5.3 Layers for nested dynamic binary translation in paravirtualization.
5.4 Layers for nested dynamic binary translation in a hypervisor based on hardware support.
5.5 Layers for nested paravirtualization in a hypervisor based on hardware support.
5.6 Nested virtualization architecture based on hardware support.
5.7 Execution flow in nested virtualization based on hardware support.
6.1 CPU performance for native with four cores and L1 guest with one core.
6.2 CPU performance for native, L1 and L2 guest with four cores.
6.3 CPU performance for L1 and L2 guests with one core.
6.4 Memory performance for L1 and L2 guests.
6.5 Threads performance for native, L1 guests and L2 guests with sysbench benchmark.
6.6 Network performance for native, L1 guests and L2 guests.
6.7 File I/O performance for native, L1 guests and L2 guests with sysbench benchmark.
6.8 File I/O performance for native, L1 guests and L2 guests with iozone benchmark.
List of Tables

3.1 Comparison between a selection of the most popular hypervisors.
5.1 Index table containing directions in which subsections information can be found about a certain nested setup.
5.2 The nesting setups with dynamic binary translation as the L1 hypervisor technique.
5.3 The nesting setups with paravirtualization as the L1 hypervisor technique.
5.4 Overview of the nesting setups with a software solution as the L1 hypervisor technique.
5.5 The nesting setups with first generation hardware support as the L1 hypervisor technique and DBT as the L2 hypervisor technique.
5.6 The nesting setups with first generation hardware support as the L1 hypervisor technique and PV as the L2 hypervisor technique.
5.7 The nesting setups with first generation hardware support as the L1 and L2 hypervisor technique.
5.8 Overview of the nesting setups with first generation hardware support as the L1 hypervisor technique.
5.9 The nesting setups with second generation hardware support as the L1 hypervisor technique and DBT as the L2 hypervisor technique.
5.10 The nesting setups with second generation hardware support as the L1 hypervisor technique and PV as the L2 hypervisor technique.
5.11 The nesting setups with first generation hardware support as the L1 and L2 hypervisor technique.
5.12 Overview of the nesting setups with second generation hardware support as the L1 hypervisor technique.
5.13 Overview of all nesting setups.
Nederlandstalige samenvatting (Dutch Summary)

Virtualization has grown into a widespread technology that is used to abstract, combine or divide computing resources. Requests for these resources thereby depend only minimally on the underlying physical layer. The x86 architecture was not specifically designed for virtualization and contains a number of non-virtualizable instructions. Several software solutions and hardware support have provided an answer to this. The growing number of applications means that users increasingly want to employ virtualization. Among other things, the need for complete physical setups for research purposes can be avoided by using virtualization. In order to virtualize components that may themselves use virtualization, it must be possible to nest virtual machines inside one another. Little information about nested virtualization was available, and this dissertation elaborates on what is possible with the current techniques.

We test the nesting of hypervisors based on the different virtualization techniques. The techniques used are dynamic binary translation, paravirtualization and hardware support. For hardware support, a distinction was made between first generation and second generation hardware support. Successful nested setups use software solutions for the second hypervisor and hardware support for the first hypervisor. Only one working nested setup uses a software solution for both.

Benchmarks were run to determine whether working nested setups perform adequately. The performance of the processor, the memory and I/O was tested and compared across the different levels of virtualization.

We found that nested virtualization works for certain setups, especially with a software solution on top of a hypervisor with hardware support. Setups with hardware support for the upper hypervisor are not yet possible. Nested hardware support will become available soon, but for now the only option is to use a software solution for the upper hypervisor. The benchmark results showed that the performance of nested setups is promising.
Preface
In this section I give some insight into the creation of this thesis. It was submitted in partial fulfillment of the requirements for a Master's degree in Computer Science.
I have always been fascinated by virtualization and during the presentation of open
thesis subjects I stumbled upon the subject of nested virtualization. Right from
the start I found the subject very interesting so I made an appointment for more
information and I eventually got it!
I had already used some virtualization software but I did not know much about
the underlying techniques. During the first semester I followed a course on virtualization, which helped me to learn the fundamentals. It took time to become familiar
with the installation and use of the different virtualization packages. At first, it
took a long time to test one nested setup and it seemed that all I was doing was
installing operating systems in virtual machines. Predefined images can save a lot
of work but I had to find this out the hard way! But even with these predefined
images, a nested setup can take a long time to test and re-test since there are so
many possible configurations.
After the first series of tests, I was quite disappointed about the obtained results.
Due to some setbacks in December and January, I also fell behind on schedule leading
to a hard second semester. It was hard combining this thesis with other courses and
with extracurricular responsibilities during this second semester. I am pleased that
I got back on track and finished the thesis on time! This would not have been
possible without the help from the people around me. I want to thank my girlfriend
Anneleen Wislez for supporting me, not only during this year but during the last
few years. She also helped me with creating the figures for this thesis and reading
the text.
Further, I would like to show appreciation to my mentors Sam Verboven and
Ruben Van den Bossche for always pointing me in the right direction and for their help during this thesis. Additionally, I want to thank my supervisor Prof. Dr. Jan Broeckhove and co-supervisor Dr. Kurt Vanmechelen for giving me the opportunity to write this thesis.
A special thank you goes out to all my fellow students and especially to Kristof
Overdulve for the interesting conversations and the laughter during the past years.
And last but not least I want to thank my parents and sister for supporting me
throughout my education; my dad for buying his new computer earlier than planned and lending it to me so I could run a second series of tests on a new processor, and my mom for the excellent care and her interest in what I was doing.
Abstract
Virtualization has become a widespread technology that is used to abstract, combine
or divide computing resources to allow resource requests to be described and fulfilled
with minimal dependence on the underlying physical delivery. The x86 architecture
was not designed with virtualization in mind and contains certain non-virtualizable
instructions. This has resulted in the emergence of several software solutions and has
led to the introduction of hardware support. The expanding range of applications
ensures that users increasingly want to use virtualization. Among other things,
the need for entire physical setups for research purposes can be avoided by using
virtualization. For components that already use virtualization, executing a virtual machine inside a virtual machine is necessary; this is called nested virtualization.
There has been little related work on nested virtualization and this thesis elaborates
on what is possible with current techniques.
We tested the nesting of hypervisors based on the different virtualization techniques. The techniques that were used are dynamic binary translation, paravirtualization and hardware support. For hardware support, a distinction was made
between first generation and second generation hardware support. Successful nested
setups use a software solution for the inner hypervisor and hardware support for the
bottom layer hypervisor. Only one working nested setup uses software solutions for
both hypervisors.
Performance benchmarks were conducted to find out if the performance of working nested setups is reasonable. The performance of the processor, the memory and
I/O was tested and compared with the different levels of virtualization.
We found that nested virtualization on the x86 architecture works for certain
setups, especially with a software solution on top of a hardware supported hypervisor. Setups with hardware support for the inner hypervisor are not yet possible.
The nested hardware support will be coming soon but until then, the only option is
the use of a software solution for the inner hypervisor. Results of the performance
benchmarks showed that performance of the nested setups is promising.
Chapter 1: Introduction
Within the research surrounding grid and cluster computing there are many developments at different levels that make use of virtualization. Virtualization can be
used for all, or a selection of, the components in grid or cluster middleware. Grids or
clusters are also using virtualization to run separate applications in a sandbox environment. Both developments bring advantages concerning security, fault tolerance,
legacy support, isolation, resource control, consolidation, etc.
Complete test setups are not available or desirable for many development and
research purposes. If certain performance limitations do not pose a problem, virtualization of all components in a system can avoid the need for physical grid or cluster
setups. This thesis focusses on the latter, the consolidation of several physical cluster machines by virtualizing them on a single physical machine. The virtualization
of cluster machines that use virtualization themselves leads to a combination of the
above mentioned levels.
1.1 Goals
The goal of this thesis is to find out whether different levels of virtualization are possible with current virtualization techniques. The research question is whether nested
virtualization works on the x86 architecture. In cases where nested virtualization
works we want to find out what the performance degradation is when compared
to a single level of virtualization or to a native solution. For cases where nested
virtualization does not work, we investigate the reasons for the failure and what needs to be changed in order for it to work. The experiments are conducted with some of
the most popular virtualization software to find an answer to the posed question.
1.2 Outline
The outline of this thesis is as follows. Chapter 2 contains an introduction to virtualization: a brief history of virtualization is given, followed by a few definitions
and a taxonomy of virtualization in general. The chapter ends with the formal requirements needed for virtualization on a computer architecture and how the x86
architecture compares to these requirements.
Chapter 3 describes the evolution of virtualization for the x86 architecture. Virtualization software first used software techniques, at a later stage processor vendors
provided hardware support for virtualization. The last section of the chapter provides an overview of a selection of the most popular virtualization software.
Chapter 4 provides a theoretical view for the requirements of nested virtualization on the x86 architecture. For each technique described in chapter 3, a detailed
explanation of the theoretical requirements gives more insight in whether nested
virtualization can work for the given technique.
Chapter 5 investigates the actual nesting of virtual machines using some of the
most popular virtualization software solutions. The different virtualization techniques are combined to get an overview of which nested setup works best. Chapter 6 presents performance results of the working nested setups in chapter 5. System
benchmarks are executed on each setup and the results are compared.
Chapter 7 summarizes the results in this thesis and gives directions for future
work.
Chapter 2: Virtualization
In recent years virtualization has become a widespread technology that is used to
abstract, combine or divide computing resources to allow resource requests to be described and fulfilled with minimal dependence on the underlying physical delivery.
The origins of virtualization can be traced back to the 1960s [1, 2] in research
projects that provided concurrent, interactive access to mainframes. Each virtual
machine (VM) gave the user the illusion of working directly on a physical machine.
By partitioning the system into virtual machines, multiple users could concurrently
use the system each within their own operating system. The projects provided an
elegant way to enable time- and resource-sharing on expensive mainframes. Users
could execute, develop, and test applications within their own virtual machine without interfering with other users. At that time, virtualization was used to reduce the
cost of acquiring new hardware and to improve the productivity by letting more
users work simultaneously.
In the late 1970s and early 1980s virtualization became unpopular because of
the introduction of cheaper hardware and multiprocessing operating systems. The
popular x86 architecture lacked the power to run multiple operating systems at the
same time. But since this hardware was so cheap, a dedicated machine was used for
each separate application. The use of these dedicated machines led to a decrease in
the use of virtualization.
The ideas of virtualization became popular again in the late 1990s with the
emergence of a wide variety of operating systems and hardware configurations. Virtualization was used for executing a series of applications, targeted for different
hardware or operating systems, on a given machine. Instead of buying dedicated
machines and operating systems for each application, the use of virtualization on
one machine offers the ability to create virtual machines that are able to run these
applications.
Virtualization concepts can be used in many areas of computer science. Large
variations in the abstraction level and underlying architecture lead to many definitions of virtualization. In "A survey on virtualization technologies", S. Nanda and
T. Chiueh define virtualization by the following relaxed definition [1]:
Definition 2.1 Virtualization is a technology that combines or divides computing
resources to present one or many operating environments using methodologies like
hardware and software partitioning or aggregation, partial or complete machine simulation, emulation, time-sharing, and many others.
The definition mentions the aggregation of resources but in this context the focus
lies on the partitioning of resources. Throughout the rest of this thesis, virtualization
provides infrastructure used to abstract lower-level, physical resources and to create
multiple independent and isolated virtual machines.
2.1 Applications
The expanding range of computer applications and their varied requirements for
hardware and operating systems increases the need for users to start using virtualization. Most people will have already used virtualization without realizing it
because there are many applications where virtualization can be used in some form.
This section elaborates on some practical applications where virtualization can be
used. S. Nanda and T. Chiueh enumerate some of these applications in “A survey
on virtualization technologies” but the list is not complete and one can easily think
of other applications [1].
A first practical application that benefits from using virtualization is server consolidation [3]. It allows system administrators to consolidate workloads of multiple
under-utilized machines to a few powerful machines. This saves hardware, management, administration of the infrastructure, space, cooling and power. A second
application that also involves consolidation is application consolidation. A legacy
application might require faster and newer hardware but might also require a legacy
operating system. The need for such legacy applications could be served well by
virtualizing the newer hardware.
Virtual machines can be used for providing secure, isolated environments to
run foreign or less-trusted applications. This form of sandboxing can help build
secure computing platforms. Besides sandboxing, virtualization can also be used
for debugging purposes. It can help debug complicated software such as operating
systems or device drivers by letting the user execute them on an emulated PC
with full software controls. Moreover, virtualization can help produce arbitrary test
scenarios that are hard to produce in reality and thus eases the testing of software.
Virtualization provides the ability to capture the entire state of a running virtual
machine, which creates new management possibilities. Saving the state of a virtual
machine, also called a snapshot, offers the user the capability to roll back to the saved
state when, for example, a crash occurs in the virtual machine. The saved state can
also be used to package an application together with its required operating system,
this is often called an “appliance”. This eases the installation of that application on
a new server, lowering the entry barrier for its use. Another advantage of snapshots
is that the user can copy the saved state to other physical servers and use the new
instance of the virtual machine without having to install it from scratch. This is
useful for migrating virtual machines from one physical server to other physical
servers when needed.
Another practical application is the use of virtualization within distributed network computing systems [4]. Such a system must deal with the complexity of decoupling local administration policies and configuration characteristics of distributed
resources from the quality of service expected from end users. Virtualization can
simplify or eliminate this complex decoupling because it offers functionality like
consolidation of physical resources, security and isolation, flexibility and ease of
management.
It is not difficult to see that the practical applications given in this section are just
a few examples of the many possible uses for virtualization. The number of possible
advantages that virtualization can provide continues to rise, making it more and
more popular.
2.2 Taxonomy
Virtual machines can be divided into two main categories, namely process virtual
machines and system virtual machines. In order to describe the differences, this
section starts with an overview of the different implementation layers in a computer
system, followed by the characteristics of process virtual machines. Finally, the
characteristics of system virtual machines are explained. Most information in this
section is deduced from the book “Virtual machines: Versatile platforms for systems
and processes” by J. E. Smith and R. Nair [5].
Figure 2.1: Implementation layers in a computer system.
The complexity in computer systems is tackled by the division into levels of
abstraction separated by well-defined interfaces. Implementation details at lower
levels are ignored or simplified by introducing levels of abstraction. In both hardware and software in a computer system, the levels of abstraction correspond to
implementation layers. A typical architecture of a computer system consists of several implementation layers. Figure 2.1 shows the key implementation layers in a
typical computer system. At the base of the computer system we have the hardware
layer consisting of all the different components of a modern computer. Just above
the hardware layer, we find the operating system layer which exploits the hardware
resources to provide a set of services to system users [6]. The libraries layer allows
application calls to invoke various services available on the system, including those
provided by the operating system. At the top, the application layer consists of the
applications running on the computer system.
Figure 2.1 also shows the three interfaces between the implementation layers –
the instruction set architecture (ISA), the application binary interface (ABI), and the
application programming interface (API) – which are especially important for virtual
machine construction [7]. The division between hardware and software is marked
by the instruction set architecture. The ISA consists of two interfaces, the user ISA
and the system ISA. The user ISA includes the aspects visible to the libraries and
application layers. The system ISA is a superset of the user ISA which also includes
those aspects visible to supervisor software, such as the operating system.
The application binary interface provides a program or library access to the
hardware resources and services available in the system. This interface consists of
the user ISA and a system call interface which allows application programs to interact
with the shared hardware resources indirectly. The ABI allows the operating system
to perform operations on behalf of a user program.
The application programming interface allows a program to invoke various services available on the system and is usually defined with respect to a high-level
language (HLL). An API enables applications written to the API to be ported easily to other systems that support the same API. The interface consists of the user ISA and of HLL library calls.
Using the three interfaces, virtual machines can be divided into two main categories: process virtual machines and system virtual machines. A process VM runs a
single program, supporting only an individual process. It provides a user application
with a virtual ABI or API environment. The process virtual machine is created when
the corresponding process is created and terminates when the process terminates.
The system virtual machines provide a complete system environment in which many
processes can coexist. System VMs do this by virtualizing the ISA layer.
2.2.1 Process virtual machines
Process virtual machines virtualize the ABI or API and can run only a single user
program. Each virtual machine thus supports a single process, possibly consisting
of multiple threads. The most common process VM is an operating system. It
supports multiple user processes to run simultaneously by time-sharing the limited
hardware resources. The operating system provides a replicated process VM for
each executing program so that each program thinks that it has its own machine.
Program binaries that are compiled for a different instruction set are also supported by process VMs. There are two approaches for emulating the instruction set.
Interpretation is a simple but slow approach; an interpreter fetches, decodes and
emulates each individual instruction. A more efficient approach is dynamic binary
translation, which is explained in section 3.1.
The emulation between different instruction sets provides cross-platform compatibility only on a case-by-case basis and requires considerable programming effort. Designing a process-level VM together with an HLL application development environment is an easier way to achieve full cross-platform portability. The HLL virtual machine does not correspond to any real platform, but is designed for ease of portability. The Java programming language is a widely used example of an HLL VM.
2.2.2 System virtual machines
System virtual machines provide a complete system environment by virtualizing the
ISA layer. They allow a physical hardware system to be shared among multiple,
isolated guest operating system environments simultaneously. The layer that provides the hardware virtualization is called the virtual machine monitor (VMM) or
hypervisor. It manages the hardware resources so that multiple guest operating
system environments and their user programs can execute simultaneously. A first subdivision is based on the supported ISAs of the guest operating systems, i.e. whether virtualization or emulation is used. Virtualization can be further subdivided based
on the location where the hypervisor is executed: native or hosted. The following
two paragraphs clarify the subdivision according to the supported ISAs.
Emulation: Guest operating systems with a different ISA from the host ISA
can be supported through emulation. The hypervisor must emulate both the application and operating system code by translating each instruction to the ISA of the
physical machine. The translation is applied to each instruction so that the hypervisor can easily manage all hardware resources. Using emulation for guest operating
systems with the same ISA as the host ISA, performance will be severely lower than
using virtualization.
Virtualization: When the ISA of the guest operating system is the same as
the host ISA, virtualization can be used to improve performance. It treats non-privileged instructions and privileged instructions differently. A privileged instruction is an instruction that traps when executed in user mode instead of in kernel
mode and will be discussed in more detail in section 2.3. Non-privileged instructions
are executed directly on the hardware without intervention of the hypervisor. Privileged instructions are caught by the hypervisor and translated in order to guarantee
correct results. When guest operating systems primarily execute non-privileged instructions, the performance is comparable to native speed.
Thus, when the ISA of the guest and the host are the same, the best performing
technique is virtualization. It improves performance in terms of execution speed by
running non-privileged instructions directly on the hardware. If the ISA of the guest
and the host are different, emulation is the only way to execute the guest operating
system. The subdivision of virtualization based on the location of the hypervisor is
clarified in the next two paragraphs.
Native, bare-metal hypervisor: A native, bare-metal hypervisor, also referred to as a Type 1 hypervisor, is the first layer of software installed on a clean
system. The hypervisor runs in the most privileged mode, while all the guests run
in a less privileged mode. It runs directly on the hardware and executes the intercepted instructions directly on the hardware. According to J. E. Smith and R.
Nair, a bare-metal hypervisor is more efficient than a hosted hypervisor in many
respects since it has direct access to hardware resources, enabling greater scalability, robustness and performance [5]. There are some variations of this architecture
where a privileged guest operating system handles the intercepted instructions. The
disadvantage of a native, bare-metal hypervisor is that a user must remove any existing operating system in order to install the hypervisor.
Hosted hypervisor: An alternative to a native, bare-metal hypervisor is
the hosted or Type 2 hypervisor. It runs on top of a standard operating system
and supports the broadest range of hardware configurations [3]. The installation
of the hypervisor is similar to the installation of an application within the host
operating system. The hypervisor relies on the host OS for device support and
physical resource management. Privileged instructions cannot be executed directly
on the hardware but are modified by the hypervisor and passed down to the host
OS.
The implementation specifics of Type 1 and Type 2 hypervisors can be separated
into several categories: dynamic binary translation, paravirtualization and hardware
assisted virtualization. These approaches are discussed in more detail in chapter 3,
which elaborates on virtualization within system virtual machines. An overview of
the taxonomy of virtual machines is shown in figure 2.2.
Figure 2.2: Taxonomy of virtual machines.
2.3 x86 architecture
The taxonomy given in the previous section provides an overview of different virtual
machines and different implementation approaches. This section gives detailed information about the requirements associated with virtualization and the problems that
occur when virtualization technologies are implemented on the x86 architecture.
2.3.1 Formal requirements
In order to provide insight into the problems and solutions for virtualization on top
of the x86 architecture, the formal requirements for a virtualizable architecture are
given first. These requirements describe what is needed in order to use virtualization on a computer architecture. In “Formal requirements for virtualizable third
generation architectures”, G. J. Popek and R. P. Goldberg defined a set of formal
requirements for a virtualizable computer architecture [8]. They divided the ISA
instructions into several groups. The first group contains the privileged instructions:
Definition 2.2 Privileged instructions are all the ISA instructions that only work in
kernel mode and trap when executed in user mode instead of in kernel mode.
Another important group of instructions that will have a big influence on the virtualizability of a particular machine are the sensitive instructions. Before defining
sensitive instructions, the notions of behaviour sensitive and control sensitive are
explained.
Definition 2.3 An instruction is behaviour sensitive if the effect of its execution
depends on the state of the hardware, i.e. upon its location in real memory, or on
the mode.
Definition 2.4 An instruction is control sensitive if it changes the state of the
hardware upon execution, i.e. it attempts to change the amount of resources available
or affects the processor mode without going through the memory trap sequence.
With these notions, instructions can be separated into sensitive instructions and
innocuous instructions.
Definition 2.5 Sensitive instructions are the group of instructions that are either control sensitive or behaviour sensitive.
Definition 2.6 Innocuous instructions are the group of instructions that are not sensitive.
According to Popek and Goldberg, there are three properties of interest when any
arbitrary program is executed while the control program (the virtual machine monitor) is resident: efficiency, resource control, and equivalence.
The efficiency property: All innocuous instructions are executed by the hardware directly, with no intervention at all on the part of the control program.
The hypervisor should not intervene for instructions that do no harm. These instructions do not change the state of the hardware and should be executed by the
hardware directly in order to preserve performance. The more instructions are
executed directly, the better the performance of the virtualization will be. This
property highlights the contrast between emulation - where every single instruction
is analyzed - and virtualization.
The resource control property: It must be impossible for that arbitrary program to affect the system resources, i.e. memory, available to it; the allocator of the
control program is to be invoked upon any attempt.
The hypervisor is in full control of the hardware resources. A virtual machine should
not be able to access the hardware resources directly. It should go through the hypervisor to ensure correct results and isolation from other virtual machines.
The equivalence property: Any program K executing with a control program
resident, with two possible exceptions, performs in a manner indistinguishable from
the case when the control program did not exist and K had whatever freedom of access to privileged instructions that the programmer had intended.
A program running on top of a hypervisor should perform the identical behaviour
as in the case where the program would run on the hardware directly. As mentioned, there are two exceptions: timing and resource availability problems. The
hypervisor will occasionally intervene and instruction sequences may take longer to
execute. This can lead to incorrect results in the assumptions about the length of
the program. The second exception, the resource availability problem, might occur
when the hypervisor does not satisfy a particular request for space. The program
may then be unable to function in the same way as if the space were made available.
The problem could easily occur, since the virtual machine monitor itself and other
possible virtual machines take space as well. A virtual machine environment can
be seen as a “smaller” version of the actual hardware: logically the same, but with
a smaller quantity of certain resources.
Given the categories of instructions and the properties, they define the hypervisor
and a virtualizable architecture as:
Definition 2.7 We say that a virtual machine monitor, or hypervisor, is any control program that satisfies the three properties of efficiency, resource control and
equivalence. Then functionally, the environment which any program sees when running with a virtual machine present is called a virtual machine. It is composed of
the original real machine and the virtual machine monitor.
Definition 2.8 For any conventional third generation computer, a virtual machine
monitor may be constructed, i.e. it is a virtualizable architecture, if the set of sensitive instructions for that computer is a subset of the set of privileged instructions.
2.3.2 The x86 protection level architecture
The x86 architecture recognizes four privilege levels, numbered from 0 to 3 [9].
Figure 2.3 shows how the privilege levels can be interpreted as rings of protection.
The center ring, ring 0, is reserved for the most privileged code and is used for
the kernel of an operating system. When the processor is running in kernel mode,
the code is executing in ring 0. Rings 1 and 2 are less privileged and are used
for operating system services. These two are rarely used but some techniques in
virtualization will run the guests inside ring 1. The outermost ring is used for applications and has the least privileges. The code of applications running in user mode will execute in ring 3.
Figure 2.3: The x86 protection levels.
These rings are used to prevent a program operating in a less privileged ring from accessing more privileged system routines. A call gate is used to allow an outer ring to access an inner ring's resource in a predefined manner.
2.3.3 The x86 architecture problem
A computer architecture can support virtualization if it meets the formal requirements described in subsection 2.3.1. The x86 architecture, however, does not meet the
requirements posed above. The x86 instruction set architecture contains sensitive
instructions that are non-privileged, called non-virtualizable instructions. In other
words, these instructions will not trap when executed in user mode and they depend
on or change the hardware state. This is not desirable because the hypervisor cannot simulate the effect of the instruction. The current hardware state could belong
to another virtual machine, producing an incorrect result for the current virtual
machine.
The non-virtualizable instructions make virtualization on the x86 architecture
more difficult. Virtualization techniques will need to deal with these instructions.
Applications will only run at near native speed when they contain a minimum
amount of non-virtualizable instructions. Approaches that overcome the limitations
of the x86 architecture are discussed in the next chapter.
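As an illustration, SMSW (store machine status word) is one of the commonly cited sensitive but unprivileged x86 instructions: executed in ring 3 it simply returns the low bits of the CR0 control register instead of trapping, so a hypervisor never gets the chance to intervene. The following sketch uses GCC-style inline assembly and only runs on an x86 processor; on recent processors with the UMIP feature enabled, the operating system may forbid the instruction in user mode, in which case the program faults instead of printing a value.

    #include <stdio.h>

    /* SMSW stores the "machine status word" (the low bits of CR0) into a
       register. It is sensitive (it exposes privileged processor state) but not
       privileged: in user mode it executes silently instead of trapping, which
       is exactly the kind of instruction that breaks the criterion of
       Definition 2.8. Requires an x86 CPU and GCC or Clang. */
    int main(void)
    {
        unsigned long msw = 0;
        __asm__ volatile("smsw %0" : "=r"(msw));
        printf("CR0 low bits as seen from ring 3: 0x%lx\n", msw);
        printf("PE (protected mode enable) bit: %lu\n", msw & 1UL);
        return 0;
    }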
Chapter 3: Evolution of virtualization for the x86 architecture
Developers of virtualization software did not wait until processor vendors solved the
x86 architecture problem. They introduced software solutions like binary translation and, when virtualization became more popular, paravirtualization. Processor
vendors then introduced hardware support to solve the design problem of the x86
architecture and at a later stage to improve the performance. The next generation
hardware support was introduced to improve performance concerning the memory
management. This chapter gives an overview of the evolution towards hardware
supported virtualization on x86 architectures. Dynamic binary translation, a software solution that tries to circumvent the design problem of the x86 architecture,
is explained in the first section. The second section explains paravirtualization, a
software solution which tries to improve the binary translation concept. It has some
advantages and disadvantages over dynamic binary translation. The third section
gives details on the first generation hardware support and its advantages and disadvantages over software solutions. In many cases the software solutions outperform
the hardware support. The next generation hardware support tries to further close
the performance gap by eliminating major sources of virtualization overhead. The
second generation hardware support focusses on memory management and is discussed in the fourth section. The last section gives an overview of VirtualBox, KVM
and Xen, which are virtualization products, and of VMware, a company providing multiple virtualization products.
3.1 Dynamic binary translation
In full virtualization, the guest OS is not aware that it is running inside a virtual machine and requires no modifications [10]. Dynamic binary translation is a
technique that implements full virtualization. It requires no hardware assisted or
operating system assisted support while other techniques, like paravirtualization,
need modifications to either the hardware or the operating system.
Dynamic binary translation is a technique which works by translating code from
one instruction set to another. The word “dynamic” indicates that the translation
is done on the fly and is interleaved with execution of the generated code [11]. The
word “binary” indicates that the input is binary code and not source code. To
improve performance, the translation is mostly done on blocks of code instead of
single instructions [12]. A block of code is defined by a sequence of instructions
that end with a jump or branch instruction. A translation cache is used to avoid
retranslating code blocks multiple times.
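The core loop of such a translator can be pictured with the toy C sketch below. It is only an illustration of the caching idea: the "guest" is imaginary, and "translating" a block merely records where the block branches next, standing in for the real decoder and code generator of an actual hypervisor. Running it shows that a loop over a handful of basic blocks triggers only a handful of translations, no matter how many blocks are executed.

    #include <stdint.h>
    #include <stdio.h>

    #define TCACHE_SIZE 1024            /* translation-cache slots (power of two) */

    struct tblock {
        uint64_t guest_pc;              /* guest address of the basic block */
        uint64_t next_pc;               /* stand-in for the generated host code */
        int      valid;
    };

    static struct tblock tcache[TCACHE_SIZE];
    static int translations;            /* how often we had to (re)translate */

    /* Stand-in for the real decoder/code generator: "translating" a block here
       just records where the block branches next (pc + 4 in this toy guest). */
    static struct tblock *translate_block(uint64_t guest_pc)
    {
        struct tblock *slot = &tcache[(guest_pc >> 2) & (TCACHE_SIZE - 1)];
        slot->guest_pc = guest_pc;
        slot->next_pc  = (guest_pc + 4) % 64;   /* toy guest loops over 16 blocks */
        slot->valid    = 1;
        translations++;
        return slot;
    }

    /* Look the block up in the translation cache, translating only on a miss. */
    static struct tblock *lookup_or_translate(uint64_t guest_pc)
    {
        struct tblock *slot = &tcache[(guest_pc >> 2) & (TCACHE_SIZE - 1)];
        if (!slot->valid || slot->guest_pc != guest_pc)
            return translate_block(guest_pc);   /* miss: translate and cache */
        return slot;                            /* hit: reuse translated code */
    }

    int main(void)
    {
        uint64_t pc = 0;
        for (int i = 0; i < 1000; i++)          /* "execute" 1000 basic blocks */
            pc = lookup_or_translate(pc)->next_pc;
        printf("blocks executed: 1000, translations performed: %d\n", translations);
        return 0;
    }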
In x86 virtualization, dynamic binary translation is not used to translate between different instruction set architectures. Instead, the translation is done from
x86 instructions to x86 instructions. This makes the translation a lot lighter than
previous binary translation technologies [13]. Since it is a translation between the
same ISA, a copy of the original instructions often suffices. In other words, generally
no translation is needed and the code can be executed as is. In particular, whenever the guest OS is executing code in user mode, no translation will be carried out
and the instructions are executed directly, which is comparable in performance to
execution of the code natively. Code that the guest OS wants to execute in kernel
mode will be translated on the fly and is saved in the translation cache.
Even when the guest OS is running kernel code, most of the time no translation is
needed and the code is copied as is. Only in some cases will the hypervisor need to
translate instructions of the kernel code to guarantee the integrity of the guest. The
kernel of the guest is executed in ring 1 instead of ring 0 when using software virtualization. As explained in section 2.3, the x86 instruction set architecture contains
sensitive instructions that are non-privileged. If the kernel of the guest operating
system wants to execute privileged instructions or one of these non-virtualizable
instructions, the dynamic binary translation will translate the instructions into a
safe equivalent. The safe equivalent will not harm other guests or the hypervisor.
For example, if access to the physical hardware is needed, the performed translation assures that the code will use the virtual hardware instead. In these cases, the
translation ensures that the safe code is also less costly than the code with privileged
instructions. The code with privileged instructions would trap when running in ring 1 and the hypervisor would have to handle these traps. The dynamic binary translation thus avoids the traps by replacing the privileged instructions, so that there are fewer interrupts and the safe code is less costly.
The translation of code into safer equivalents is less costly than letting the privileged instructions trap, but the translation itself should also be taken into account.
Luckily, the translation overhead is rather low and will decrease over time since
translated pieces of code are cached in order to avoid retranslation in case of loops
in the code. Yet, dynamic binary translation has a few cases it cannot fully solve:
system calls, I/O, memory management and complex code. The latter is the set
of code that, for example, does self-modification or has indirect control flows. This
code is complex to execute, even on an operating system that runs natively. The
other cases are now described in more detail in the next subsections.
3.1.1 System calls
A system call is a mechanism used by processes to access the services provided by the
operating system. This involves a transition to the kernel where the required function
is then performed [6, 14]. The kernel of an operating system is also a process, but it
differs from other processes in that it has privileged access to processor instructions.
The kernel does not execute on its own initiative but only when it receives an interrupt from the processor or a system call from another process running in the operating system.
There are many different techniques for implementing system calls. One way is to
use a software interrupt and trap, but for x86 a faster technique was chosen [13, 15].
Intel and AMD have come up with the instructions SYSCALL/SYSENTER and
SYSRET/SYSEXIT for a process to do a system call. These instructions transfer
control to the kernel without the overhead of an interrupt.
In software virtualization the kernel of the guest will run inside ring 1 instead
of ring 0. This implies that the hypervisor should intercept a SYSENTER (or
SYSCALL), translate the code and hand over control to the kernel of the guest.
This kernel then executes the translated code and executes a SYSEXIT (or SYSRET) to return control back to the process that requested the service of the kernel.
Because the kernel of the guest is running inside ring 1, it does not have the privilege to perform the SYSEXIT. This will cause an interrupt at the processor and the
hypervisor has to emulate the effect of this instruction.
System calls will cause a significant amount of overhead when using software
virtualization. In a virtual machine, a system call costs about 10 times the cycles
needed for a system call on a native machine. In “A comparison of software and
hardware techniques for x86 virtualization”, the authors measured that a system
call on a 3.8 GHz Pentium 4 takes 242 cycles [11]. On the same machine, a system
call in a virtual machine, virtualized with dynamic binary translation and the kernel
running in ring 1, takes 2308 cycles. In an environment where virtualization is used
there will most likely be more than one virtual machine on a physical machine.
In this case, the overhead of the system calls can become a significant part of the
virtualization overhead. As we will see later, hardware support for virtualization
offers a solution for this.
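A measurement in the spirit of the numbers quoted above can be sketched as follows. This is not the benchmark used in the cited paper, only a minimal Linux/x86 illustration of how the time-stamp counter can be used to time a cheap system call; the reported cycle counts are indicative only (no pipeline serialization, no control of frequency scaling), but running the program natively and inside a guest makes the system call overhead visible.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <x86intrin.h>          /* __rdtsc(), GCC/Clang on x86 */

    #define ITERATIONS 100000

    int main(void)
    {
        unsigned long long start, end;

        start = __rdtsc();
        for (int i = 0; i < ITERATIONS; i++)
            syscall(SYS_getpid);     /* a cheap system call: enter and leave the kernel */
        end = __rdtsc();

        printf("average cycles per getpid() system call: %llu\n",
               (end - start) / ITERATIONS);
        return 0;
    }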
3.1.2 I/O virtualization
When creating a virtual machine, not only the processor needs to be virtualized
but also all the essential hardware like memory and storage. Each I/O device type
has its own characteristics and needs to be controlled in its own special way [5].
There are often a large number of devices for an I/O device type and this number
continues to rise. The strategy consists of constructing a virtual I/O device and
then virtualizing the I/O activity that is directed at the device. Every access to
this virtual hardware must be translated to the real hardware. The hypervisor
must intercept all I/O operations issued by the guest operating system and it must
emulate these instructions using software that understands the semantics of the
specific I/O port accessed [16]. The I/O devices are emulated because of the ease of
migration and multiplexing advantages [17]. Migration is easy because the virtual
device exists in memory and can easily be transferred. The hypervisor can present
a virtual device to each guest while performing the multiplexing.
Emulation has the disadvantage of poor performance. The hypervisor must
perform a significant amount of work to present the illusion of a virtual device.
The great number of physical devices makes the emulation of the I/O devices in the
hypervisor complex. The hypervisor needs drivers for every physical device in order
to be usable on different physical systems. A hosted hypervisor has the advantage
that it can reuse the device drivers provided by the host operating system. Another
problem is that the virtual I/O device is often a device model which does not match
the full power of the underlying physical devices [18]. This means that optimizations
implemented by specific devices can be lost in the process of emulation.
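Conceptually, a hypervisor that emulates I/O keeps a table mapping the port (or memory) ranges of each virtual device to a device model, and every trapped guest access is routed through that table. The sketch below is a hypothetical, heavily simplified dispatcher for port writes with a single serial-port-like device model; a real hypervisor adds reads, DMA, interrupts and many more devices.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* One emulated device: a port range plus read/write callbacks (the device model). */
    struct io_device {
        const char *name;
        uint16_t    base, size;
        uint32_t  (*read)(uint16_t port);
        void      (*write)(uint16_t port, uint32_t value);
    };

    /* A toy "serial port" model: writes are printed, reads report "transmitter ready". */
    static uint32_t serial_read(uint16_t port)              { (void)port; return 0x20; }
    static void     serial_write(uint16_t port, uint32_t v) { (void)port; putchar((int)v); }

    static const struct io_device devices[] = {
        { "serial0", 0x3f8, 8, serial_read, serial_write },
    };

    /* Called when a guest OUT instruction has trapped into the hypervisor:
       find the device model that owns the port and emulate the access. */
    static void handle_port_write(uint16_t port, uint32_t value)
    {
        for (size_t i = 0; i < sizeof devices / sizeof devices[0]; i++)
            if (port >= devices[i].base && port < devices[i].base + devices[i].size) {
                devices[i].write(port, value);
                return;
            }
        /* unassigned port: real hardware would silently ignore the write */
    }

    int main(void)
    {
        /* Pretend the guest executed a series of OUT instructions to port 0x3f8;
           the hypervisor intercepts each one and forwards it to the virtual device. */
        const char *msg = "Hello from the guest\n";
        for (const char *p = msg; *p; p++)
            handle_port_write(0x3f8, (uint32_t)(unsigned char)*p);
        return 0;
    }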
3.1.3 Memory management
In an operating system, every application has the illusion that it is working with a
piece of contiguous memory. In reality, however, the memory used by applications can be dispersed across the physical memory. The application is working with virtual
addresses that are translated to physical addresses. The operating system manages
a set of tables to do the translation of the virtual memory to the physical addresses.
The x86 architecture provides support for paging in the hardware. Paging is the
process that translates virtual addresses of a process to a system physical address.
The hardware that translates the virtual addresses to physical addresses is called
the memory management unit or MMU.
The page table walker performs address translation using the page tables and
uses a hardware page table pointer, the CR3 register, to start the page walk [19].
It will traverse several page table entries which point to the next level of the walk.
The memory hierarchy will be traversed many times when the page walker performs
address translation. To keep this overhead within limits, a translation look-aside
buffer (TLB) is used. The most recent translation will be saved in this buffer. The
processor will first check the TLB to see whether the translation is located in the
cache. When the translation is found in the buffer this translation is used, otherwise
a page walk is performed and this result is saved in the TLB. The operating system
and the processor must cooperate in order to assure that the TLB stays consistent.
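The interplay between the TLB and the page walker can be summarized in a few lines of C. The sketch below is a software model with made-up structures (a direct-mapped TLB and a flat, single-level "page table" array), not the real multi-level x86 page walk; it only illustrates that a translation is first looked up in the TLB and that a page walk fills the TLB on a miss.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                  /* 4 KiB pages */
    #define TLB_SLOTS  64                  /* direct-mapped toy TLB */
    #define NUM_PAGES  1024                /* toy single-level "page table" */

    struct tlb_entry { uint64_t vpn, pfn; int valid; };

    static struct tlb_entry tlb[TLB_SLOTS];
    static uint64_t page_table[NUM_PAGES]; /* vpn -> pfn, filled by the "OS" */
    static unsigned tlb_misses;

    /* The page walker: consult the in-memory page table (slow path). */
    static uint64_t page_walk(uint64_t vpn)
    {
        tlb_misses++;
        return page_table[vpn % NUM_PAGES];
    }

    /* Address translation as done by the MMU: TLB first, page walk on a miss. */
    static uint64_t translate(uint64_t vaddr)
    {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        struct tlb_entry *e = &tlb[vpn % TLB_SLOTS];
        if (!e->valid || e->vpn != vpn) {          /* TLB miss */
            e->vpn   = vpn;
            e->pfn   = page_walk(vpn);
            e->valid = 1;
        }
        return (e->pfn << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1));
    }

    int main(void)
    {
        for (uint64_t vpn = 0; vpn < NUM_PAGES; vpn++)  /* invent a mapping */
            page_table[vpn] = vpn + 100;

        /* Touch the same few pages repeatedly: only the first touch misses. */
        for (int i = 0; i < 10000; i++)
            (void)translate((uint64_t)(i % 8) << PAGE_SHIFT);
        printf("translations: 10000, TLB misses (page walks): %u\n", tlb_misses);
        return 0;
    }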
Inside a virtual machine the guest operating system manages its own page tables.
The task of the hypervisor is to virtualize the memory but also virtualize the virtual
memory so that the guest operating system can use virtual memory [20]. This
introduces an extra level of translation which maps physical addresses of the guest
to real physical addresses of the system. The hypervisor must manage the address
translation on the processor using software techniques. It derives a shadow version
of the page table from the guest page table, which holds the translations of the
virtual guest addresses to the real physical addresses. This shadow page table will
be used by the processor when the guest is active and the hypervisor manages this
shadow table to keep it synchronized with the guest page table. The guest does
not have access to these shadow page tables and can only see its guest page tables, which run on an emulated MMU. It has the illusion that it can translate the virtual
addresses to real physical ones. In the background, the hypervisor will deal with the
real translation using the shadow page tables.
Figure 3.1: Memory management in x86 virtualization using shadow tables.
Figure 3.1 shows the translations needed for translating a virtual guest address
into a real physical address. Without the shadow page tables, the virtual guest
memory (orange area) will be translated into physical guest memory (blue area)
and the latter is translated into real physical memory (white area). The shadow
page tables avoid the double translation by immediately translating the virtual guest
memory (orange) into real physical memory (white) as shown by the red arrow.
In software, several techniques can be used to keep the shadow page tables and guest page tables consistent. These techniques use the page fault exception mechanism of the processor: an exception is thrown when a page fault occurs, allowing the hypervisor to update the current shadow page table. This introduces extra page faults due to the shadow paging. The shadow page tables introduce an overhead because of these extra page faults and the extra work needed to keep the shadow tables up to date. The shadow page tables also consume additional memory. Maintaining shadow page tables for SMP guests introduces yet another overhead. Each virtual processor in the guest can use the same guest page table instance. The hypervisor could maintain a separate shadow page table instance for each virtual processor, which results in memory overhead. Another possibility is to share the shadow page table between the virtual processors, which leads to synchronization overheads.
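The synchronization just described can be illustrated with a small, self-contained toy model. The sketch below assumes single-level page tables stored as plain arrays with hypothetical frame numbers; it only shows the decision a hypervisor has to make on a page fault, namely whether the fault is hidden (the shadow entry is missing but the guest mapping exists) or must be forwarded to the guest.

#include <stdint.h>
#include <stdio.h>

#define NPAGES 16          /* toy address space: 16 pages of guest virtual memory */
#define INVALID UINT64_MAX

/* Toy model, all names hypothetical: the guest page table maps guest-virtual
 * pages to guest-physical pages; the hypervisor knows how guest-physical
 * pages map to real (host) physical frames; the shadow table caches the
 * composition and is the one actually used by the MMU. */
static uint64_t guest_pt[NPAGES];    /* guest virtual  -> guest physical */
static uint64_t host_map[NPAGES];    /* guest physical -> real physical  */
static uint64_t shadow_pt[NPAGES];   /* guest virtual  -> real physical  */

/* Invoked on a page fault against the shadow table. If the guest's own table
 * has a valid mapping, the fault is hidden: the shadow entry is filled in and
 * the guest is resumed. Otherwise the fault is forwarded to the guest OS. */
static void shadow_page_fault(uint64_t gvpn)
{
    uint64_t gppn = guest_pt[gvpn];
    if (gppn != INVALID)
        shadow_pt[gvpn] = host_map[gppn];     /* sync shadow with guest table */
    else
        printf("inject #PF for guest page %llu\n", (unsigned long long)gvpn);
}

int main(void)
{
    for (int i = 0; i < NPAGES; i++) {
        guest_pt[i] = INVALID;
        shadow_pt[i] = INVALID;
        host_map[i] = 100 + i;                /* arbitrary host frame numbers */
    }
    guest_pt[3] = 7;                          /* guest maps virtual page 3    */

    shadow_page_fault(3);                     /* hidden fault: shadow filled  */
    shadow_page_fault(5);                     /* true guest fault: injected   */
    printf("shadow_pt[3] -> host frame %llu\n", (unsigned long long)shadow_pt[3]);
    return 0;
}

Every hidden fault of this kind is pure overhead introduced by shadow paging; it does not occur when the guest runs on bare hardware.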
3.2 Paravirtualization
Paravirtualization is in many ways comparable to dynamic binary translation. It is
also a software technique designed to enable virtualization on the x86 architecture.
As explained in “Denali: Lightweight Virtual Machines for Distributed and Networked Applications,” and used in Denali [21], paravirtualization exposes a virtual
architecture to the guest that is slightly different from the physical architecture.
Dynamic binary translation translates “critical” code into safe code on the fly. Paravirtualization does the same thing but requires changes in the source code of the
operating system in advance. The operating systems built for the x86 architecture
are by default not compatible with the paravirtualized architecture. This is a major
disadvantage for existing operating systems because extra effort is needed in order
to run these operating systems inside a paravirtualized guest. In the case of Denali, which provides lightweight virtual machines, this allowed the designers to co-design the virtual architecture with the operating system.
The advantages of successful paravirtualization are a simpler hypervisor implementation and reduced performance degradation compared to the physical system. Better performance is achieved because many unnecessary traps into the hypervisor are eliminated. The hypervisor provides hypercall interfaces for
critical kernel operations such as memory management, interrupt handling and time
keeping [10]. The guest operating system is adapted so that it is aware of the virtualization. The kernel is modified to replace non-virtualizable instructions with
hypercalls that communicate directly with the hypervisor. The binary translation
overhead is completely eliminated since the modifications are done in the operating system at design time. The implementation of the hypervisor is much simpler
because it does not contain the binary translator.
3.2.1 System calls
The overhead of system calls can be reduced somewhat. The dynamic binary translation technique intercepts each SYSENTER/SYSCALL instruction and translates
the instruction to hand over the control to the kernel of the guest operating system.
Afterwards, the guest operating system’s kernel executes a SYSEXIT/SYSRET instruction to return to the application. This instruction is again intercepted and
translated by the dynamic binary translation. The paravirtualization technique allows guest operating systems to install a handler for system calls, permitting direct
calls from an application into its guest OS and avoiding indirection through the
hypervisor on every call [22]. This handler is validated before installation and is
accessed directly by the processor without indirection via ring 0.
3.2.2 I/O virtualization
Paravirtualization software mostly uses a different approach for I/O virtualization
compared to the emulation used with dynamic binary translation. The guest operating system utilizes a paravirtualized driver that operates on a simplified abstract
device model exported by the hypervisor [23]. The real device driver can reside in
the hypervisor, but often resides in a separate device driver domain which has privileged access to the device hardware. The latter one is attractive since the hypervisor
does not need to provide the device drivers but the drivers of a legacy operating system can be used. Separating the address space of the device drivers from guest and
hypervisor code also prevents buggy device drivers from causing system crashes.
The paravirtualized drivers remove the need to emulate devices. They free up
processor time and resources which would otherwise be needed to emulate hardware.
Since there is no emulation of the device hardware, the overhead is significantly re-
duced. In Xen, well-known for its use of paravirtualization, the real device drivers
reside in a privileged guest known as domain 0. A description of Xen can be found
in subsection 3.6.3. However, Xen is not the only hypervisor that uses paravirtualization for I/O. VMware has a paravirtualized I/O device driver, vmxnet, that
shares data structures with the hypervisor [10]. “A Performance Comparison of
Hypervisors” states that by using the paravirtualized vmxnet network driver they
can now run network I/O intensive datacenter applications with very acceptable
network performance [24].
3.2.3 Memory management
Paravirtual interfaces can be used by both the hypervisor and guest to reduce hypervisor complexity and overhead in virtualizing x86 paging [19]. When using a
paravirtualized memory management unit, the guest operating system page tables
are registered directly with the MMU [22]. To reduce the overhead and complexity
associated with the use of shadow page tables, the guest operating system has read-only access to the page tables. A page table update is passed to Xen via a hypercall
and validated before being applied. Guest operating systems can locally queue page
table updates and apply the entire batch with a single hypercall. This minimizes
the number of hypercalls needed for the memory management.
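The batching idea can be sketched as follows. The request structure mirrors the spirit of Xen's mmu_update interface (a pointer to a page table entry plus its new value), but the function and constant names used here are hypothetical and the hypercall is replaced by a stub so the example is self-contained.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* One queued page table update: which entry to change and its new value. */
struct pt_update {
    uint64_t ptr;   /* machine address of the page table entry */
    uint64_t val;   /* new value for that entry                */
};

#define BATCH_SIZE 32
static struct pt_update queue[BATCH_SIZE];
static size_t queued;

/* Stand-in for the hypercall that hands a whole batch to the hypervisor,
 * which validates each update before applying it. */
static long hypercall_mmu_update(const struct pt_update *req, size_t count)
{
    (void)req;
    printf("hypercall: applying %zu queued page table updates\n", count);
    return 0;
}

static void flush_pt_updates(void)
{
    if (queued > 0) {
        hypercall_mmu_update(queue, queued);   /* one guest/hypervisor switch */
        queued = 0;
    }
}

static void queue_pt_update(uint64_t pte_machine_addr, uint64_t new_val)
{
    if (queued == BATCH_SIZE)
        flush_pt_updates();                    /* batch full: flush first */
    queue[queued].ptr = pte_machine_addr;
    queue[queued].val = new_val;
    queued++;
}

int main(void)
{
    for (uint64_t i = 0; i < 4; i++)
        queue_pt_update(0x1000 + 8 * i, 0x80000000u + i);
    flush_pt_updates();    /* four updates, a single transition to the hypervisor */
    return 0;
}

Queuing the four updates and flushing them once replaces four guest/hypervisor transitions by a single one, which is exactly the saving the paravirtualized memory management interface aims for.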
3.3 First generation hardware support
In the meantime, processor vendors noticed that virtualization was becoming increasingly popular and they created a solution that solves the virtualization problem on the x86 architecture by introducing hardware assisted support. Hardware
support for processor virtualization enables simple, robust and reliable hypervisor
software [25]. It eliminates the need for the hypervisor to listen, trap and execute
certain instructions for the guest OS [26]. Both Intel and AMD provide these hardware extensions in the form of Intel VT-x and AMD SVM respectively [11, 27, 28].
The first generation hardware support introduces a data structure for virtualization, together with specific instructions and a new execution flow. In AMD SVM,
the data structure is called the virtual machine control block (VMCB). The VMCB
combines control state with the guest’s processor state. Each guest has its own
VMCB with its own control state and processor state. The VMCB contains a list of
which instructions or events in the guest to intercept, various control bits and the
guest’s processor state. The various control bits specify the execution environment
of the guest or indicate special actions to be taken before running guest code. The
VMCB is accessed by reading and writing to its physical address. The execution
environment of the guest is referred to as guest mode. The execution environment of
the hypervisor is called host mode. The new VMRUN instruction transfers control
from host to guest mode. The instruction saves the current processor state and loads
the corresponding guest state from the VMCB. The processor now runs the guest
code until an intercept event occurs. This results in a #VMEXIT at which point
the processor writes the current guest state back to the VMCB and resumes host
execution at the instruction following the VMRUN. The processor is then executing
the hypervisor again. The hypervisor can retrieve information from the VMCB to
handle the exit. When the effect of the exiting operation is emulated, the hypervisor
can execute VMRUN again to return to guest mode.
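Conceptually, the hypervisor side of this mechanism is a loop around VMRUN. The sketch below models that loop only; svm_vmrun() is a hypothetical stand-in for the real instruction and simply replays a canned sequence of exit reasons, and the exit codes are simplified, so the code illustrates the control flow rather than the actual AMD SVM programming interface.

#include <stdio.h>
#include <stdbool.h>

/* Simplified stand-ins for intercept reasons; real SVM defines many more. */
enum exit_reason { EXIT_CPUID, EXIT_IO, EXIT_HLT };

struct vmcb {
    enum exit_reason exitcode;   /* written by the processor on #VMEXIT      */
    /* ... intercept bits, control fields, guest processor state ...         */
};

/* Hypothetical wrapper around VMRUN: load guest state from the VMCB, run
 * guest code until an intercepted event causes a #VMEXIT, then return with
 * the exit reason stored in the VMCB. Here it just replays a fixed script
 * so the example is self-contained. */
static void svm_vmrun(struct vmcb *vmcb)
{
    static const enum exit_reason script[] = { EXIT_CPUID, EXIT_IO, EXIT_HLT };
    static int step;
    vmcb->exitcode = script[step++];
}

int main(void)
{
    struct vmcb vmcb = { EXIT_CPUID };
    bool running = true;

    while (running) {
        svm_vmrun(&vmcb);                 /* enter guest mode                 */
        switch (vmcb.exitcode) {          /* back in host mode: handle exit   */
        case EXIT_CPUID: printf("emulate CPUID for the guest\n"); break;
        case EXIT_IO:    printf("emulate the I/O access\n");      break;
        case EXIT_HLT:   printf("guest halted, stop the loop\n"); running = false; break;
        }
    }
    return 0;
}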
Intel has implemented its own version of hardware support, which has many similarities with AMD's implementation although the terminology is
somewhat different. Intel uses a virtual machine control structure (VMCS) instead
of a VMCB. A VMCS can be manipulated by the new instructions VMCLEAR,
VMPTRLD, VMREAD and VMWRITE which clears, loads, reads from, and writes
to a VMCS respectively. The hypervisor runs in “VMX root operation” and the guest in “VMX non-root operation” instead of host and guest mode. Software enters VMX operation by executing the VMXON instruction. From then on, the hypervisor can use a VMEntry to transfer control to one of its guests. There are two
instructions available for triggering a VMEntry: VMLAUNCH and VMRESUME.
As with AMD SVM, the hypervisor regains control using VMExits. Eventually, the
hypervisor can leave the VMX operation with the instruction VMXOFF.
Figure 3.2: Execution flow using virtualization based on Intel VT-x.
The execution flow of a guest, virtualized by hardware support, can be seen
in figure 3.2. The VMXON instruction starts and the VMXOFF stops the VMX
operation. The guest is started using a VMEntry which loads the VMCS of the
guest into the hardware. The hypervisor regains control using a VMExit when a
guest tries to execute a privileged instruction. After intervention of the hypervisor,
a VMEntry transfers control back to the guest. In the end, the guest can shut down
and control is handed back to the hypervisor with a VMExit.
The basic idea behind the first generation hardware support is to fix the problem
that the x86 architecture cannot be virtualized. The VMExit forces a transition
from guest to hypervisor, following the philosophy of trapping all exceptions and privileged instructions. Nevertheless, each transition between the hypervisor and a
virtual machine requires a fixed amount of processor cycles. When the hypervisor has
to handle a complex operation, the overhead is relatively low. However, for a simple
operation the overhead of switching from guest to hypervisor and back is relatively
high. Creating processes, context switches and small page table updates are all simple
operations that will have a large overhead. In these cases, software solutions like
binary translation and paravirtualization perform better than hardware supported
virtualization.
The overhead can be reduced by lowering the number of processor cycles required for a transition between guest and hypervisor. The exact number of extra processor cycles depends on the processor architecture. For Intel, the format and layout of the VMCS in memory is not architecturally defined, allowing implementation-specific optimizations to improve performance in VMX non-root operation and to reduce the latency of a VMEntry and VMExit [29]. Intel and AMD are improving these latencies in successive processor generations, as figure 3.3 shows for Intel.
Figure 3.3: Latency reductions by CPU implementation [30].
System calls are an example of complex operations having a low transition overhead. System calls do not automatically transfer control from the guest to the
hypervisor in hardware supported virtualization. A hypervisor intervention is only
needed when the system call contains critical instructions. The overhead when a system call requires intervention is relatively low since a system call is rather complex
and already requires a lot of processor cycles.
First generation hardware support does not include support for I/O virtualization and memory management unit virtualization. Hypervisors that use the first
generation hardware extensions will need to use a software technique for virtualizing the I/O devices and the MMU. For the MMU, this can be done using shadow
tables or paravirtualization of the MMU.
3.4 Second generation hardware support
First generation hardware support has made the x86 architecture virtualizable, but a performance improvement can only be measured in some cases [11]. Maintaining the shadow tables can be an intensive task, as was pointed out in subsection 3.1.3. The next step for the processor vendors was to provide hardware MMU
support. This second generation hardware support adds memory management support so the hypervisor does not have to maintain the integrity of the shadow page
table mappings [17].
The shadow page tables remove the need to translate the virtual memory of the
process to the guest OS physical memory and then translate the latter into the real
physical memory, as can be seen in figure 3.1. They provide the ability to immediately
translate the virtual memory of the guest process into real physical memory. On the
other hand, the hypervisor must do the bookkeeping to keep the shadow page table
up to date when an update occurs to the guest OS page table. In existing software
solutions like binary translation, this bookkeeping introduces overhead, and the overhead is even worse with first generation hardware support. The hypervisor must maintain
the shadow page tables and every time a guest tries to translate a memory address,
the hypervisor must intervene. In software solutions this intervention is an extra
page fault, but in the first generation hardware support this will result in a VMExit
and VMEntry roundtrip. As shown in figure 3.3, the latencies of such a roundtrip
are improving but the second generation hardware support removes the need for the
roundtrip.
Intel and AMD introduced their own hardware MMU support. Like the first
generation hardware support, this results in two different implementations with
similar characteristics. Intel proposed the extended page tables (EPT) and AMD
proposed their nested page tables (NPT). In Intel’s EPT, the page tables translate
from virtual memory to guest physical addresses while a separate set of page tables,
the extended page tables, translate from guest physical addresses to the real physical
addresses [29]. The guest can modify its page tables without hypervisor intervention.
The new extended page tables remove the VMExits associated with page table
virtualization.
AMD's nested paging also uses additional page tables, the nested page tables
(nPT), to translate guest physical addresses to real physical addresses [19]. The
guest page tables (gPT) map the virtual memory addresses to guest physical addresses. The gPT are set up by the guest and the nPT by the hypervisor. When
nested paging is enabled and a guest attempts to reference memory using a virtual
address, the page walker performs a two dimensional walk using the gPT and nPT
to translate the guest virtual address to the real physical address. Like Intel’s EPT,
nested paging removes the overheads associated with software shadow paging.
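The two-dimensional walk can again be illustrated with a toy model. A single table level is used for both dimensions and the tables are plain arrays with hypothetical frame numbers; in real hardware every access to a guest page table entry is itself a guest physical address that must go through the nested page tables, which is what makes a TLB miss more expensive under nested paging.

#include <stdio.h>
#include <stdint.h>

#define NPAGES 16
#define INVALID UINT64_MAX

/* Toy model of the two dimensions: the guest page table (gPT) maps
 * guest-virtual to guest-physical pages and is set up by the guest; the
 * nested page table (nPT) maps guest-physical to real physical frames and
 * is set up by the hypervisor. */
static uint64_t gpt[NPAGES];   /* guest virtual  -> guest physical */
static uint64_t npt[NPAGES];   /* guest physical -> real physical  */

static uint64_t nested_walk(uint64_t gvpn)
{
    uint64_t gppn = gpt[gvpn];            /* first dimension: the guest's own walk   */
    if (gppn == INVALID)
        return INVALID;                   /* guest page fault                         */
    return npt[gppn];                     /* second dimension: hypervisor's mapping   */
}

int main(void)
{
    for (int i = 0; i < NPAGES; i++) { gpt[i] = INVALID; npt[i] = INVALID; }
    gpt[2] = 9;     /* guest maps virtual page 2 to guest-physical page 9        */
    npt[9] = 42;    /* hypervisor backs guest-physical page 9 with host frame 42 */

    printf("guest virtual page 2 -> host frame %llu\n",
           (unsigned long long)nested_walk(2));
    return 0;
}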
Another feature introduced by both Intel and AMD in the second generation
hardware support is tagged TLBs. Intel uses Virtual-Processor Identifiers (VPIDs)
that allow a hypervisor to assign a different identifier to each virtual processor. The
zero VPID is reserved for the hypervisor itself. The processor then uses the VPIDs
to tag translations in the TLB. AMD calls these identifiers the Address Space IDs
(ASIDs). During a TLB lookup, the VPID or ASID value of the active guest is
matched against the ID tag in the TLB entry. In this way, TLB entries belonging
to different guests and to the hypervisor can coexist without causing incorrect address translations. The tagged TLBs eliminate the need for TLB flushes on every
VMEntry and VMExit and thereby eliminate the performance impact of those flushes. The tagged TLBs are an improvement over the other virtualization techniques, which need to flush the TLB every time a guest
switches to the hypervisor or back. The drawback of the extended page tables or
nested paging is that a TLB miss has a larger performance hit for guests because it
introduces an additional level of address translation. This is mitigated by making the
TLBs much larger than before. Previous techniques like shadow page tables immediately translate the virtual guest address to the real physical address eliminating
the additional level of address translation.
The second generation hardware support is completely focussed on the improvement of the memory management. It eliminates the need for the hypervisor to
maintain the shadow tables and eliminates the TLB flushes. The EPT and NPT
help to improve performance for memory intensive workloads.
3.5 Current and future hardware support
Intel and AMD are still working on support for virtualization. They are improving
the latencies of the VMEntry and VMExit instructions, but are also working on new
hardware techniques for supporting virtualization on the x86 architecture. The first
generation hardware support for virtualization was based primarily on the processor
and the second generation focusses on the memory management unit. The final
component required next to CPU and memory virtualization is device and I/O
virtualization [10]. Recent techniques are Intel VT-d and AMD IOMMU.
There are three general techniques for I/O virtualization. The first technique is
emulation and is described in subsection 3.1.2. The second technique, explained in
subsection 3.2.2, is paravirtualization. The last technique is direct I/O. The device is
not virtualized but assigned directly to a guest virtual machine. The guest’s device
drivers are used for the dedicated device.
In order to improve the performance for I/O virtualization, Intel and AMD are
looking at allowing virtual machines to talk to the device hardware directly. With
Intel VT-d and AMD IOMMU, hardware support is introduced to support assigning
I/O devices to virtual machines. In such cases, the ability to multiplex the I/O device
is lost. Depending on the I/O device, this does not need to be an issue. For example,
network card interfaces can easily be added to the hardware in order to provide a
NIC for each virtual machine.
3.6 Virtualization software
There are many different virtualization implementations. This section gives an
overview of some well-known virtualization software. Each implementation can be
placed in the categories explained throughout the previous sections.
3.6.1 VirtualBox
VirtualBox is a hosted hypervisor that performs full virtualization. It started as proprietary software but currently comes under a Personal Use and Evaluation License
(PUEL). The software is free of charge for personal and educational use. VirtualBox was initially created by Innotek and was released as an Open Source Edition
in January 2007. The company was later purchased by Sun Microsystems, which in turn was recently purchased by Oracle Corporation. VirtualBox software runs on Windows, Linux, Mac OS X and Solaris hosts. In-depth information can be found
on the wiki of their site [31], more specifically in the technical documentation [32].
Appendix A.1 presents an overview of VirtualBox, which is largely based on the
technical documentation. A short summary is given in the following paragraph.
VirtualBox started as a pure software solution for virtualization. The hypervisor
used dynamic binary translation to fix the problem of virtualization in the x86
architecture. With the arrival of hardware support for virtualization, VirtualBox
now also supports Intel VT-x and AMD SVM. The host operating system runs each
VirtualBox virtual machine as an application, i.e. just another process in the host
operating system. A ring 0 driver needs to be loaded in the host OS for VirtualBox
to work. It only performs a few tasks: allocating physical memory for the virtual
machine, saving and restoring CPU registers and descriptor tables, switching from
host ring 3 to guest context and enabling or disabling hardware support. The guest
operating system is manipulated to execute its ring 0 code in ring 1. This could
result in poor performance since there is a possibility of generating a large amount
of additional instruction faults. To address these performance issues, VirtualBox has
come up with a Patch Manager (PATM) and Code Scanning and Analysis Manager
(CSAM). The PATM will scan code recursively and replace problematic instructions
with a jump to hypervisor memory where a more suitable implementation is placed.
Every time a fault occurs, the CSAM will analyze the fault’s cause and determine if
it is possible to patch the offending code to prevent it from causing more expensive
faults.
3.6.2 VMware
VMware [33] provides several virtualization products. The company was founded in
1998 and they released their first product, VMware Workstation, in May 1999. In
2001, they also entered the server market with VMware GSX Server and VMware
ESX Server. Currently, VMware provides a variety of products for datacenter and
desktop solutions together with management products. VMware software runs on
Windows and Linux, and since the introduction of VMware Fusion it also runs
on Mac OS X. Like VirtualBox, VMware started with a software only solution
for their hypervisors. In contrast with VirtualBox, VMware does not release the
source code of their products. VMware now supports both full virtualization with
binary translation and hardware assisted virtualization, and has a paravirtualized
I/O device driver, vmxnet, that shares data structures with the hypervisor [10].
VMware Server is a free product based on the VMware virtualization technology.
It is a hosted hypervisor that can be installed on Windows or Linux hosts. A web-based user interface provides a simple way to manage virtual machines. Another free
datacenter product is VMware ESXi. It provides the same functionality but uses a
native, bare-metal architecture for its hypervisor. VMware ESXi needs a dedicated
server but has better performance. VMware makes these products available at no
cost in order to help companies of all sizes experience the benefits of virtualization.
The desktop product is VMware Player. It is free for personal non-commercial
use and allows users to create and run virtual machines on a Windows or Linux
host. It is a hosted hypervisor since this is common practice for desktop products.
If users need developer-centric features, they can upgrade to VMware Workstation.
3.6.3 Xen
Xen [34] is an open source example of virtualization software that uses paravirtualization. It is a native, bare-metal hypervisor for the x86 architecture and was
initially created by the University of Cambridge Computer Laboratory in 2003 [22].
Xen is designed to allow multiple commodity operating systems to share conventional hardware. In 2007, Citrix Systems acquired XenSource, the company behind Xen, and intended to license Xen freely to all vendors and projects that implement the Xen hypervisor.
Since 2010, the Xen community maintains and develops Xen. The Xen hypervisor
is licensed under the GNU General Public License.
After installation of the Xen hypervisor, the user can boot into Xen. When the
hypervisor is started, it automatically boots a guest, domain 0, that has special
management privileges and direct access to the physical hardware [35].
I/O devices are not emulated but Xen exposes a set of clean and simple device
abstractions. There are two possibilities to run device drivers. In the first one,
domain 0 is responsible for running the device drivers for the hardware. It will
run a BackendDriver which queues requests from other domains and relays them to
the real hardware driver. Each domain communicates with domain 0 through the
FrontendDriver to access the devices. To the applications and the kernel, this driver
looks like a normal device. The other possibility is that a driver domain has been
given the responsibility for a particular piece of hardware. It runs the hardware
driver and the backend driver for that device class. When the hardware driver fails,
only this domain is affected and all other domains will survive.
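At its core, the frontend/backend split is a shared-memory producer/consumer queue combined with an event notification. The toy ring below only illustrates that idea; Xen's real I/O rings (request/response pairs, grant tables and event channels) are considerably more involved, and all names in the sketch are hypothetical.

#include <stdio.h>
#include <stdint.h>

#define RING_SIZE 8   /* power of two so the indices can wrap with a mask */

/* Hypothetical request: "write this many bytes at sector N". */
struct blk_request { uint64_t sector; uint32_t length; };

/* A toy shared ring: the frontend (guest) produces requests, the backend
 * (driver domain or domain 0) consumes them and forwards them to the real
 * hardware driver. */
struct ring {
    struct blk_request req[RING_SIZE];
    unsigned int prod;   /* written by the frontend */
    unsigned int cons;   /* written by the backend  */
};

static int frontend_submit(struct ring *r, struct blk_request rq)
{
    if (r->prod - r->cons == RING_SIZE)
        return -1;                               /* ring full, try again later */
    r->req[r->prod & (RING_SIZE - 1)] = rq;
    r->prod++;                                   /* then notify the backend    */
    return 0;
}

static void backend_drain(struct ring *r)
{
    while (r->cons != r->prod) {
        struct blk_request rq = r->req[r->cons & (RING_SIZE - 1)];
        printf("backend: write %u bytes at sector %llu\n",
               (unsigned)rq.length, (unsigned long long)rq.sector);
        r->cons++;                               /* real backend would queue this to the hardware driver */
    }
}

int main(void)
{
    struct ring r = {0};
    frontend_submit(&r, (struct blk_request){ .sector = 128, .length = 4096 });
    frontend_submit(&r, (struct blk_request){ .sector = 129, .length = 4096 });
    backend_drain(&r);
    return 0;
}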
Apart from running paravirtualized guests, Xen supports Intel VT-x and AMD
SVM since version 3.0.0 and 3.0.2 respectively. This allows users to run unmodified
guest operating systems in Xen.
3.6.4 KVM
KVM [36], short for Kernel-based Virtual Machine, is a virtualization product that
uses hardware support exclusively. Instead of creating major portions of an operating
system kernel, as other hypervisors have done, the KVM developers turned the
standard Linux kernel into a hypervisor. By developing KVM as a loadable module,
the virtualized environment can benefit from all the ongoing work on the Linux
kernel itself and reduce redundancy [37]. KVM uses a driver ("/dev/kvm") that
communicates with the kernel and acts as an interface for userspace virtual machines.
The initial version of KVM was released in November 2006 and it was first included
in the Linux kernel 2.6.20 in February 2007.
The recommended way of installing KVM is through the packaging system of a
Linux distribution. The latest version of the KVM kernel modules and supporting
userspace can be found on their website. The kernel modules are found in the kvm-kmod-<kernel version> releases and the userspace components in the qemu-kvm-<version> releases. The latter is the stable branch of KVM, based on QEMU [38] with the KVM extras on top. QEMU is a machine emulator and can run an unmodified target operating system and all its applications in a virtual machine. The kvm-<version> releases are development releases, but they are outdated.
Every virtual machine is a Linux process, scheduled by the standard Linux scheduler [39]. A normal Linux process has two modes of execution: kernel and user
mode. KVM adds a third mode of execution, guest mode. Processes that are run
from within the virtual machine run in guest mode. Hardware virtualization is used
to virtualize the processor, memory management is handled by the host kernel and
I/O is handled in user space through QEMU.
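The /dev/kvm interface is driven from userspace with ioctl() calls. The fragment below sketches only the first steps, opening the device and creating a virtual machine with one virtual CPU, using the documented KVM ioctls; everything a real guest needs, such as registering guest memory with KVM_SET_USER_MEMORY_REGION, initializing registers and the KVM_RUN loop, is omitted.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (kvm < 0) { perror("open /dev/kvm"); return 1; }

    /* Sanity check: the stable KVM API reports version 12. */
    int version = ioctl(kvm, KVM_GET_API_VERSION, 0);
    printf("KVM API version: %d\n", version);

    /* Create a virtual machine and one virtual CPU inside it. Each is
     * represented by a file descriptor that accepts further ioctls. */
    int vmfd = ioctl(kvm, KVM_CREATE_VM, 0);
    int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);

    /* A real user of the API would now set up guest memory, initialize the
     * vCPU registers, mmap the kvm_run structure and loop on KVM_RUN,
     * handling exits such as KVM_EXIT_IO and KVM_EXIT_HLT in userspace,
     * which is essentially what QEMU does on behalf of KVM. */
    close(vcpufd);
    close(vmfd);
    close(kvm);
    return 0;
}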
In this text, KVM is considered a hosted hypervisor, but there is some discussion1 about whether KVM is rather a native, bare-metal hypervisor. One side argues that KVM turns Linux into a native, bare-metal hypervisor because Linux becomes the hypervisor and runs directly on top of the hardware. The other side argues that KVM runs on top of Linux and should be considered a hosted hypervisor.
Regardless of what type of hypervisor KVM actually is, this text will consider KVM
to be a hosted hypervisor.
3.6.5 Comparison between virtualization software
A high-level comparison is given in table 3.1. All virtualization products in the
table, except Xen, are installed within a host operating system. Xen is installed
directly on the hardware. Most products provide two techniques for virtualization
on x86 architectures. Hardware support for virtualization on x86 architectures is
supported by all virtualization software in the table.
1 http://virtualizationreview.com/Blogs/Mental-Ward/2009/02/KVM-BareMetal-Hypervisor.aspx
                             VirtualBox   VMware Workstation   Xen                  KVM
Hypervisor type              Hosted       Hosted               Native, bare-metal   Hosted
Dynamic binary translation   ✓            ✓                    -                    -
Paravirtualization           -            -                    ✓                    -
Hardware support             ✓            ✓                    ✓                    ✓

Table 3.1: Comparison between a selection of the most popular hypervisors.
Chapter 4: Nested virtualization
The focus of this thesis lies with nested virtualization on x86 architectures. Nested
virtualization is executing a virtual machine inside a virtual machine. In case of
multiple nesting levels, one can also talk about recursive virtual machines. In 1973
and 1975 initial research was published about properties of recursive virtual machine
architectures [40, 41]. These works refer to virtualization that was used in mainframes so that users could work simultaneously on a single mainframe. Multiple use
cases come to mind for using nested virtualization.
• A possible use case for nested x86 virtualization is the development of test
setups for research purposes. Research in cluster1 and grid2 computing requires
extensive test setups, which might not be available. The latest developments
in the research of grid and cluster computing make use of virtualization at
different levels. Virtualization can be used for all, or certain, components of
a grid or cluster. It can also be used to run applications within the grid or
cluster in a sandbox environment. If certain performance limitations are not
an issue, virtualizing all components of such a system can eliminate the need
to acquire the entire test setup. Because these virtualized components, e.g.
Eucalyptus3 or OpenNebula4 , might use virtualization for running applications
in a sandbox environment, two levels of virtualization are used. Nesting the
physical machines of a cluster or grid as virtual machines on one physical
machine can offer security, fault tolerance, legacy support, isolation, resource
control, consolidation, etc.
1 A cluster is a group of interconnected computers working together as a single, integrated computer resource [42, 43].
2 There is no strict definition of a grid. In [44], Bote-Lorenzo et al. listed a number of attempts to create a definition. Ian Foster created a three-point checklist that combines the common properties of a grid [45].
3 http://www.eucalyptus.com
4 http://www.opennebula.org
• A second possible use case is the creation of a test framework for hypervisors. As virtualization allows testing and debugging an operating system by
deploying the OS in a virtual machine, nested virtualization allows testing and
debugging a hypervisor inside a virtual machine. It eliminates the need for a
separate physical machine where a developer can test and debug a hypervisor.
• Another possible use case is the use of virtual machines inside a server rented
from the cloud5. Such a server is itself virtualized so that the cloud vendor
can make optimal use of its resources. For example, Amazon EC26 offers
virtual private servers which are virtual machines using the Xen hypervisor.
Hence, if a user wants to use virtualization software inside this server, nested
x86 virtualization is needed in order to make that setup work.
As explained in chapter 2, virtualization on the x86 architecture is not straightforward. This has resulted in the emergence of several techniques that are given
in chapter 3. These different techniques produce many different combinations to
nest virtual machines. A nested setup can consist of the same technique for both
hypervisors, but it can also consist of a different technique for either the first level
hypervisor or the nested hypervisor. Hence, if we divide the techniques into three major groups: dynamic binary translation, paravirtualization and hardware support,
there are nine possible combinations for nesting a virtual machine inside another
virtual machine. In the following sections, the theoretical possibilities and requirements for each of these combinations are given. The results of nested virtualization
on x86 architectures are given in chapter 5.
Figure 4.1: Layers in a nested virtualization setup with hosted hypervisors.
To prevent confusion about which hypervisor or guest is meant, some terms are
introduced. In a nested virtualization setup, there are two levels of virtualization, see
5 Two widely accepted definitions of the term cloud can be found in [46] and [47].
6 http://aws.amazon.com/ec2/
figure 4.1. The first level, referred to as L1, is the layer of virtualization that is used
in a non-nested setup. Thus, this level is the virtualization layer that is closest to the
hardware. The terms L1 or bottom layer indicate the first level of virtualization, e.g.
the L1 hypervisor is the hypervisor that is used in the first level of virtualization.
The second level, referred to as L2, is the new layer of virtualization, introduced by
the nested virtualization. Hence, the terms L2, nested or inner indicate the second
level of virtualization, e.g. the L2 hypervisor is the hypervisor that will be installed
inside the L1 guest.
4.1 Dynamic binary translation
This section focusses on L1 hypervisors that use dynamic binary translation for
nested virtualization on x86 architectures. Such a hypervisor can run within a host operating system or directly on the hardware. It can be VirtualBox (see subsection 3.6.1), a VMware product (see subsection 3.6.2) or any other hypervisor using
dynamic binary translation. The nested hypervisor can be any hypervisor, resulting
in three major combinations. Each combination uses a nested hypervisor that allows
virtualization through a different technique. The nested hypervisor will be installed
in a guest virtualized by the L1 hypervisor. The first combination is again a hypervisor using dynamic binary translation. In the second combination a hypervisor
using paravirtualization is installed in the guest. The last combination is a nested
hypervisor that uses hardware support.
It should be theoretically possible to nest virtual machines using dynamic binary
translation as L1. When using dynamic binary translation, no modifications are
needed to the hardware or to the operating system, as pointed out in section 3.1.
Code running in ring 0 will actually run in ring 1, but the guest is not aware of this.
Dynamic binary translation: The first combination nests a L2 hypervisor
inside a guest virtualized by a L1 hypervisor where both hypervisors are based on
dynamic binary translation. The L2 hypervisor will be running in guest ring 0.
Since the hypervisor will not be aware that its code is actually running in ring 1, it
should be possible to run a hypervisor in this guest.
The nested hypervisor will have to take care of the memory management in
the L2 guest. It will have to maintain the shadow page tables for its guests, see
subsection 3.1.3. The hypervisor uses these shadow page tables to translate the
L2 virtual memory addresses to what it believes are real physical addresses.
But actually these translated addresses are in the virtual memory range of the L1
guest and can be converted to real memory addresses by the shadow page tables
maintained by the L1 hypervisor. The memory architecture in a nested setup is
illustrated in figure 4.2. For a L1 guest, there are two levels of address translation
as shown in figure 3.1. A nested guest has three levels of address translation resulting
in the need for shadow tables in the L2 hypervisor.
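In the same spirit as the toy models of chapter 3, the three levels of translation can be written down as the composition of three lookups; the arrays and frame numbers below are of course hypothetical. Shadow page tables at each level collapse parts of this composition, but the end-to-end mapping stays the same.

#include <stdio.h>
#include <stdint.h>

#define NPAGES 16
#define INVALID UINT64_MAX

/* An address used inside the L2 guest is translated by the L2 guest's page
 * tables, the result is translated by the L1 guest (where the L2 hypervisor
 * runs), and that result is translated once more by the L1 hypervisor to a
 * real frame. */
static uint64_t l2_pt[NPAGES];   /* L2 guest virtual    -> L2 guest "physical" */
static uint64_t l1_pt[NPAGES];   /* L2 guest "physical" -> L1 guest "physical" */
static uint64_t host_pt[NPAGES]; /* L1 guest "physical" -> real physical frame */

static uint64_t translate_nested(uint64_t l2_vpn)
{
    uint64_t a = l2_pt[l2_vpn];
    uint64_t b = (a == INVALID) ? INVALID : l1_pt[a];
    return (b == INVALID) ? INVALID : host_pt[b];
}

int main(void)
{
    for (int i = 0; i < NPAGES; i++) { l2_pt[i] = l1_pt[i] = host_pt[i] = INVALID; }
    l2_pt[1] = 4;  l1_pt[4] = 6;  host_pt[6] = 77;
    printf("L2 virtual page 1 -> real frame %llu\n",
           (unsigned long long)translate_nested(1));
    return 0;
}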
Paravirtualization: The second combination uses paravirtualization as technique for the L2 hypervisor. This situation is the same as the situation with dynamic
binary translation for the L2 hypervisor.

Figure 4.2: Memory architecture in a nested situation.

The hypervisor using paravirtualization
will be running in guest ring 0 and is not aware that it is actually running in ring
1. This should make it possible to nest a L2 hypervisor based on paravirtualization
within a guest virtualized by a L1 hypervisor using dynamic binary translation.
Hardware supported virtualization:
The virtualized processor that is
available to the L1 guest is based on the x86 architecture in order to allow current operating systems to work in the virtualized environment. However, are the
extensions (see section 3.3 and 3.4) for virtualization on x86 architectures also included? In order to use a L2 hypervisor based on hardware support within the L1
guest, the L1 hypervisor should virtualize or emulate the virtualization extensions
of the processor. A virtualization product that is based on hardware supported
virtualization needs these extra extensions. If the extensions are not available, the
hypervisor cannot be installed or activated. If the L1 hypervisor provides these extensions, chances are that it requires a physical processor with the same extensions.
It might be possible for hypervisors based on dynamic binary translation to provide
the extensions without having a processor that supports the hardware virtualization.
However, all current processors have these extensions. Therefore it is very unlikely
that developers will incorporate functionality that provides the hardware support to
the guest without a processor with hardware support for x86 virtualization.
Memory management in the L2 guest based on hardware support is not possible
because the second generation hardware support only provides two levels of address
translation. The L1 hypervisor should provide the EPT or NPT functionality to the
guest together with the first generation hardware support, but it will have to use a
software technique for the implementation of the MMU.
4.2 Paravirtualization
The situation for nested virtualization is quite different when using paravirtualization as the bottom layer hypervisor. The most popular example of a hypervisor
based on paravirtualization is Xen (see subsection 3.6.3). There are again three
combinations. A nested hypervisor can be the same as the bottom layer hypervisor,
based on paravirtualization. The second combination is the case where a dynamic
binary translation based hypervisor is used as the nested hypervisor. In the last combination a hypervisor based on hardware support is nested in the paravirtualized
guest. The main difference is that the L1 guest is aware of the virtualization.
Dynamic binary translation and paravirtualization: The paravirtualized guest is aware of the virtualization and should use the hypercalls provided by
the hypervisor. The guest's operating system should be modified to use these hypercalls; thus, all code in the guest that runs in kernel mode needs these modifications
in order to work in the paravirtualized guest. This has major consequences for a
nested virtualization setup. A nested hypervisor can only work in a paravirtualized
environment if it is modified to work with these hypercalls. A native, bare-metal
hypervisor should be adapted so that all ring 0 code is changed. For a hosted hypervisor this indicates that the module, that is loaded into the kernel of the host
operating system, is modified to work in the paravirtualized environment. Hence,
companies that develop virtualization products need to actively make their hypervisors compatible for running inside a paravirtualized guest.
Memory management of the L2 guests is done by the nested hypervisor. The
page tables of the L1 guests are directly registered with the MMU, so the nested
hypervisor can use the hypercalls to register its page tables with the MMU. A nested
hypervisor based on paravirtualization might allow a L2 guest to register its page
tables directly with the MMU, while a nested hypervisor based on dynamic binary
translation will maintain shadow tables.
Hardware supported virtualization: Hardware support for x86 virtualization is an exceptional case for paravirtualization as well. The L1 hypervisor should
provide the extensions for the hardware support to the guests, probably by means
of hypercalls. Modified hypervisors based on hardware support can then use the
hardware extensions. Second generation hardware support can also only be used if
it is provided by the L1 hypervisor, together with first generation hardware support.
In conclusion, nested virtualization with paravirtualization as a bottom layer
needs modifications to the nested hypervisor, whereas nested virtualization with
dynamic binary translation as bottom layer did not need these changes. On the
other hand, the guests know that they are virtualized which might influence the
performance of the L2 guests in a positive way. The nested virtualization will not
work unless support is actively introduced. It is unlikely that virtualization software developers are willing to incorporate these modifications in their hypervisors, since the benefits do not outweigh the implementation cost.
4.3 Hardware supported virtualization
The last setup is to use a hypervisor based on hardware support for x86 virtualization as the bottom layer. For this configuration a processor is required that has the
extensions for hardware support. KVM (see subsection 3.6.4) is a popular example
of such a hypervisor but the latest versions of VMware, VirtualBox and Xen can
also use hardware support. As with the previous configurations, there are three
combinations: one where the nested hypervisor is based on the same technique as the L1 hypervisor, one where a hypervisor based on dynamic binary translation is nested, and one where a paravirtualization-based hypervisor is the nested hypervisor.
Dynamic binary translation and paravirtualization: These combinations are similar to the combinations where a hypervisor based on dynamic binary
translation is used as bottom layer. A guest or its operating system does not need
modifications, hence it should in theory be possible to nest virtual machines in a
setup where the bottom layer hypervisor is based on hardware support. The nested
hypervisor thinks its code is running in ring 0, but actually it is running in the guest
mode of the processor, which is a result of a VMRUN or VMEntry instruction.
The memory management depends on whether the processor supports the second generation hardware support. If the processor does not support this, the L1
hypervisor uses a software technique for virtualizing the MMU. In this case, memory
management will be the same as with dynamic binary translation where both L1
and L2 hypervisors maintain shadow tables for virtualizing the MMU. If, on the other hand,
the processor does support the hardware MMU, then the L1 hypervisor does not
need to maintain these shadow tables which can improve the performance.
Hardware supported virtualization: As for the other configurations, hardware support for nested hypervisors is a special case. The virtualized processor that
is provided to the L1 guest is based on the x86 processor but needs to contain the
hardware extensions for virtualization if the nested hypervisor uses hardware support. If the L1 hypervisor does not provide these hardware extensions to its guests,
only the combinations with a nested hypervisor that uses dynamic binary translation or paravirtualization can work. The KVM and Xen developers are working on exposing the hardware extensions for x86 virtualization to their guests.
More details are given in section 5.4.
The hardware support for EPT or NPT (see section 3.4) in the guest, which
can also be referred to as nested EPT or nested NPT, deserves special attention
according to Avi Kivity [48]. Avi Kivity is a lead developer and maintainer of KVM
and posted some interesting information about nested virtualization on his blog.
Nested EPT or nested NPT can be critical for obtaining reasonable performance.
The guest hypervisor needs to trap and service context switches and writes to guest
tables. A trap in the guest hypervisor is multiplied by quite a large factor into KVM
traps. Since the hardware only supports two levels of address translation, nested
EPT or NPT must be implemented in software.
Chapter 5: Nested virtualization in Practice
The previous chapter gave some insight into the theoretical requirements of nested x86
virtualization. The division into three categories resulted in nine combinations. This
chapter presents how nested x86 virtualization behaves in practice. Each of the nine
combinations is tested and performance tests are executed on working combinations.
The results of these tests are discussed in the following chapter. The combinations
that fail to run are analyzed in order to find the reason for the failure.
A selection of the currently popular virtualization products are tested. These
products are VirtualBox, VMware Workstation, Xen and KVM as discussed in section 3.6. Table 3.1 shows a summary of these hypervisors and the supported virtualization techniques. There are seven different hypervisors if we consider that the
products with multiple techniques consist of different hypervisors. For each L1 hypervisor we can nest each of the seven hypervisors, including itself. Thus, nesting these hypervisors results
in 49 different setups, which will be described in the following sections. Details
of the tests are given in appendix B. It lists the used configuration for each setup
together with version information of the hypervisors and the result of the setup.
The subsection in which each nested setup can be found is summarized in table 5.1. The columns of the table represent the L1 hypervisors and the rows represent
the L2 hypervisors, i.e. the hypervisor represented by the row is nested inside the
hypervisor represented by the column. For example, information about the nested
setup where VirtualBox based on dynamic binary translation is nested inside Xen
using paravirtualization, can be found in subsection 5.1.2. For the setups with a L1 hypervisor based on hardware support, each cell lists two subsections: the first applies to the setup tested on a processor with first generation hardware support, the second to the setup tested on a processor with second generation hardware support.

L2 hypervisor       VirtualBox    VirtualBox    VMware        VMware        Xen      Xen           KVM
                    (DBT)         (HV)          (DBT)         (HV)          (PV)     (HV)          (HV)
VirtualBox (DBT)    5.1.1         5.2.1/5.3.1   5.1.1         5.2.1/5.3.1   5.1.2    5.2.1/5.3.1   5.2.1/5.3.1
VirtualBox (HV)     5.1.1         5.2.3/5.3.3   5.1.1         5.2.3/5.3.3   5.1.2    5.2.3/5.3.3   5.2.3/5.3.3
VMware (DBT)        5.1.1         5.2.1/5.3.1   5.1.1         5.2.1/5.3.1   5.1.2    5.2.1/5.3.1   5.2.1/5.3.1
VMware (HV)         5.1.1         5.2.3/5.3.3   5.1.1         5.2.3/5.3.3   5.1.2    5.2.3/5.3.3   5.2.3/5.3.3
Xen (PV)            5.1.1         5.2.2/5.3.2   5.1.1         5.2.2/5.3.2   5.1.2    5.2.2/5.3.2   5.2.2/5.3.2
Xen (HV)            5.1.1         5.2.3/5.3.3   5.1.1         5.2.3/5.3.3   5.1.2    5.2.3/5.3.3   5.2.3/5.3.3
KVM (HV)            5.1.1         5.2.3/5.3.3   5.1.1         5.2.3/5.3.3   5.1.2    5.2.3/5.3.3   5.2.3/5.3.3

Table 5.1: Index table indicating in which subsection information about each nested setup can be found.
5.1 Software solutions
5.1.1 Dynamic binary translation
In this subsection, we will give the results of actually nesting the virtual machines inside a L1 hypervisor based on dynamic binary translation, as discussed in section 4.1.
The nested hypervisors should not need modifications. Only the nested hypervisors
based on hardware support for virtualization need a virtual processor that contains
the hardware extensions. The L1 hypervisors are VirtualBox and VMware Workstation using dynamic binary translation for virtualizing guests. Since we test two L1
hypervisors, this subsection describes 14 setups. These setups are described in the
following paragraphs categorized by their technique for the L2 hypervisor. The first
paragraphs elaborate on the setups that use dynamic binary translation on top of
dynamic binary translation. The next paragraph presents the setups that use paravirtualization as the L2 hypervisor, followed by a paragraph that presents the setups
that use hardware support as the L2 hypervisor. The last paragraph concludes this
subsection with an overview.
Dynamic binary translation: Each setup that used dynamic binary translation as both the L1 and L2 hypervisor resulted in failure. The setups either hung
or crashed when starting the inner guest. In the two setups where VMware Workstation was nested inside VMware Workstation and VirtualBox was nested inside
VirtualBox, the L2 guest became unresponsive when started. After a few hours, the
nested guests were still trying to start, so these setups could be marked as failures.
In both setups the L1 and L2 hypervisors were the same, so the developers know what instructions and functionality are used by the nested hypervisor and may have foreseen this situation. However, the double layer of dynamic binary translation seems
to be inoperative or too slow for a working nested setup with the same hypervisor
for both L1 and L2 hypervisors.
The other two setups, where VMware Workstation is nested in VirtualBox and
VirtualBox is nested in VMware Workstation, resulted in a crash. In the former
setup the L1 VirtualBox guest crashed which indicates that the L2 guest tried to use
functionality that is not fully supported by VirtualBox. This can be functionality
that was left out in order to improve performance or a simple bug. In the other
setup, with VMware Workstation as the L1 hypervisor and VirtualBox as the L2,
the VirtualBox guest crashed but the VMware Workstation guest stayed operational.
The L2 guest noticed that some conditions were not met and crashed with an assertion
failure. In both setups, it seems that the L2 guest does not see a fully virtualized
environment and one of the guests, in particular VirtualBox, reports a crash. More
information about the reported crash is given in section B.1. A possible reason that
in both cases VirtualBox reports the crash is that VirtualBox is open source and
can allow more information to be viewed by its users.
Paravirtualization: Of the two setups that use paravirtualization on top of
dynamic binary translation, one worked and the other crashed. Figure 5.1 shows the
layers of these setups, where the L1 guest and the L2 hypervisor are represented by
the same layer.

Figure 5.1: Layers for nested paravirtualization in dynamic binary translation.

The setup with VMware Workstation as the L1 hypervisor allowed a Xen guest to boot successfully. In the other setup, using VirtualBox, the L1 guest
crashed and reported a message similar to the setup with VMware Workstation
inside VirtualBox (see section B.1). The result, one setup that works and one
that does not, gives some insight into the implementation of VMware Workstation
and VirtualBox. The latter contains one or more bugs which make the L1 guest
crash when a nested hypervisor starts a guest. The functionality could be left out
deliberately because such a situation might not be very common. Leaving out these
exceptional situations allows developers to focus on more important functionality for
allowing virtualization. On the other hand, VMware Workstation does provide the
functionality and could be considered more mature for nested virtualization using
dynamic binary translation as the L1 hypervisor.
Hardware supported virtualization: VirtualBox and VMware Workstation do not provide the x86 virtualization processor extensions to their guests. This
means that there is no hardware support available in the guests, neither for the
processor, nor the memory management. Since four of the hypervisors are based
on hardware support, there are eight setups that contain such a hypervisor. The
lack of hardware support causes the failure of these eight setups. Implementing the
hardware support in the L1 hypervisor using software, without underlying support
from the processor, could result in bad performance. However, if performance is not
an issue, such a setup could be useful to simulate a processor with hardware support
on an incompatible processor.
Only one out of 14 setups worked with dynamic binary translation as the L1
hypervisor. The successful combination is the Xen hypervisor using paravirtualization within the VMware Workstation hypervisor. Other setups hung or crashed
and VirtualBox reported the most crashes. VirtualBox seems to contain some bugs
that VMware Workstation does not have, resulting in crashes in the guest being
virtualized by VirtualBox. Hardware support for virtualization is not present in
the L1 guest using VMware Workstation or VirtualBox, which eliminates the eight
setups with a nested hypervisor that needs the hardware extensions. Table 5.2 gives
a summary of the setups described in this subsection. The columns represent the
L1 hypervisors and the rows represent the L2 hypervisors.

                    VirtualBox (DBT)   VMware (DBT)
VirtualBox (DBT)           ×                 ×
VirtualBox (HV)            ×                 ×
VMware (DBT)               ×                 ×
VMware (HV)                ×                 ×
Xen (PV)                   ×                 ✓
Xen (HV)                   ×                 ×
KVM (HV)                   ×                 ×

Table 5.2: The nesting setups with dynamic binary translation as the L1 hypervisor technique. DBT stands for dynamic binary translation, PV for paravirtualization and HV for hardware virtualization.
5.1.2 Paravirtualization
The previous subsection described the setups that use dynamic binary translation as the
L1 hypervisor. The following paragraphs elaborate on the use of a L1 hypervisor
based on paravirtualization. In section 4.2, we concluded that nested virtualization
with paravirtualization as a bottom layer needs modifications to the nested hypervisor. The L1 hypervisor used for the tests is Xen. In all the nested setups, the
L2 hypervisor should be modified to use the paravirtual interfaces offered by Xen
instead of executing ring 0 code. We discuss the problems for each hypervisor technique in the following paragraphs, together with what the setup would look like if the
nested virtualization works. The last paragraph summarizes the setups described in
this subsection.
Paravirtualization: The paravirtualized guest does not allow the start of a
Xen hypervisor within the guest. The kernel loaded in the paravirtualized guest is
a kernel adapted for paravirtualization. The Xen hypervisor is not adapted to use
the provided interface and hence the paravirtualized guest removes the other kernels
from the bootloader. The complete setup, see figure 5.2, consists of Xen as the L1
hypervisor which automatically starts domain 0. This domain 0 is a L1 privileged
guest. Another domain would run the nested hypervisor, which in turn would run
its automatically started domain 0 and a nested virtual machine.
Dynamic binary translation: The hypervisor of VMware Workstation and
VirtualBox based on dynamic binary translation could not be loaded in the paravirtualized guest. The reason is that the ring 0 code is not adapted for the paravirtualization. In practice this expresses itself as the inability to compile the driver or
module that needs to be loaded. It should be compiled against the kernel headers
but fails to compile since it does not recognize the version of the adapted kernel and its headers.

Figure 5.2: Layers for nested Xen paravirtualization.

The setup for dynamic binary translation as technique for the nested
hypervisor (see figure 5.3) differs from the previous setup (figure 5.2) in that the
L2 hypervisor is on top of a guest operating system. Xen is a native, bare-metal
hypervisor which runs directly on the hardware, i.e. in this case the virtual hardware. VMware Workstation and VirtualBox are hosted hypervisors and do need an
operating system between the hypervisor and the virtual hardware.
Figure 5.3: Layers for nested dynamic binary translation in paravirtualization.
Hardware supported virtualization: The other four setups, where a nested
hypervisor based on hardware support is used, have the same problem. None of the
hypervisors are modified to run in a paravirtualized environment. In addition, the
virtualization extensions are not provided in the paravirtualized guest. Even if the
hypervisors were adapted for the paravirtualization, they would still need these
extensions. These setups look like figure 5.2 or figure 5.3, depending on whether the
nested hypervisor is hosted or native, bare-metal.
None of the seven setups with paravirtualization as bottom layer worked. The
results of the setups are shown in table 5.3. The column with the header “Xen”
represents the L1 hypervisor. The main problem is the adaptation of the hypervisors.
                    Xen (PV)
VirtualBox (DBT)        ×
VirtualBox (HV)         ×
VMware (DBT)            ×
VMware (HV)             ×
Xen (PV)                ×
Xen (HV)                ×
KVM (HV)                ×

Table 5.3: The nesting setups with paravirtualization as the L1 hypervisor technique.
Unless these hypervisors are modified, paravirtualization is not a good choice as L1
hypervisor technique. Nested virtualization will always depend on a hypervisor being adapted, and only that adapted hypervisor could then be used. When using paravirtualization, the best one
could do is hope that developers adapt their hypervisors or modify the hypervisor
oneself.
5.1.3 Overview software solutions
Previous subsections explain the results of nested virtualization with software solutions for the bottom layer hypervisor. This subsection gives an overview of all the
possible setups described in the previous subsections. All these setups are gathered
in table 5.4. The columns of the table represent the setups belonging to the same
L1 hypervisor. The rows in the table indicate a different nested hypervisor, i.e. the
hypervisor represented by the row is nested inside the hypervisor represented by the
column.

                   VirtualBox (DBT)   VMware (DBT)   Xen (PV)
Subsection              5.1.1            5.1.1         5.1.2
VirtualBox (DBT)          ×                ×             ×
VirtualBox (HV)           ×                ×             ×
VMware (DBT)              ×                ×             ×
VMware (HV)               ×                ×             ×
Xen (PV)                  ×                ✓             ×
Xen (HV)                  ×                ×             ×
KVM (HV)                  ×                ×             ×

Table 5.4: Overview of the nesting setups with a software solution as the L1 hypervisor technique (✓ = working, × = not working, ∼ = unstable).
Nested x86 virtualization using a L1 hypervisor based on a software solution is largely unsuccessful. Out of the 21 setups that were tested, only one allowed a L2 guest to boot successfully: nesting Xen inside VMware Workstation. Note that 12 setups are unsuccessful simply because hardware support for x86 virtualization is not available in the L1 guest.
5.2 First generation hardware support
The setups with a bottom layer hypervisor based on hardware support are described
in this section. The theoretical possibilities and requirements needed for these setups are discussed in section 4.3. The conclusion was that it should be possible to
nest virtual machines without modifying the guest operating systems, given that
the physical processor provides the hardware extensions for x86 virtualization. In
chapter 3, the hardware support for x86 virtualization was divided into the first generation and second generation hardware support. The second generation hardware
support adds a hardware supported memory management unit so that the hypervisor does not need to maintain shadow tables. The original research was done
on a processor (an Intel Core 2 Quad Q9550) that did not have second generation hardware support. Detailed
information about the hypervisor versions is listed in section B.3. To make a comparison between first generation and second generation hardware support for x86
virtualization, the setups were also tested on a newer processor (an Intel Core i7-860) that does provide
the hardware supported MMU. The results of the tests on the newer processor are
given in section 5.3.
The tested L1 hypervisors using the hardware extensions for virtualization are
VirtualBox, VMware Workstation, Xen and KVM. We nested the seven hypervisors
(see table 3.1) within these four hypervisors, resulting in 28 setups. In the first
subsection the nested hypervisor is based on dynamic binary translation. The second
subsection describes the setups with Xen paravirtualization as the L2 hypervisor.
The last subsection handles the setups with a nested hypervisor based on hardware
support for x86 virtualization.
5.2.1 Dynamic binary translation
Using dynamic binary translation as the nested hypervisor technique, there are
eight setups. Three of these setups are able to successfully boot and run a nested
virtual machine. The layout of these setups can be seen in figure 5.4 where the L1
hypervisor is based on hardware support and the L2 hypervisor is based on dynamic
binary translation. When Xen is used as the L1 hypervisor, the host OS layer can
be left out and a domain 0 is started next to VM1, which still uses hardware support
for its virtualization.
Figure 5.4: Layers for nested dynamic binary translation in a hypervisor based on
hardware support.
VirtualBox: When VirtualBox based on hardware support is used as the bottom layer hypervisor, none of the setups worked. Nesting VirtualBox inside VirtualBox resulted in the L2 guest becoming unresponsive. The same result occurred when VirtualBox was nested in VirtualBox using dynamic binary translation at both levels. When trying to nest a VMware Workstation guest inside VirtualBox, the configuration of that setup was very unstable: each minor change resulted in a setup that refused to start the L2 guest. There was one working configuration, which is listed in section B.3.
VMware Workstation: If the L1 hypervisor in figure 5.4 is VMware Workstation, the setups were successful in nesting virtual machines. Both VirtualBox
and VMware Workstation as nested hypervisors based on dynamic binary translation were able to start the L2 guest which booted and ran correctly.
Xen: VMware Workstation (version 6.5.3 build-185404 and newer) checks whether there is an underlying hypervisor running. It noticed that Xen was running and refused to start a nested guest. This prevents a L2 VMware guest from starting within a Xen guest. In the other setup, where VirtualBox is used as the inner hypervisor, the L2 guest again became unresponsive after starting. There was no crash, error message or warning, which might indicate that the L2 guest was simply booting at a very slow pace.
KVM: The third and last working setup for nesting a hypervisor based on
dynamic binary translation within one based on hardware support is nesting VMware
Workstation inside KVM. In newer versions of VMware Workstation (7.0.1 build-227600 and newer), a check for
an underlying hypervisor noticed that KVM was running and refused to boot a
nested guest. The setup with VirtualBox as the nested hypervisor crashed while
booting. The L2 guest showed an error indicating a kernel panic because it could
not synchronize. The guest became unresponsive after displaying the message.
                   VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
VirtualBox (DBT)          ×               ✓            ×          ×
VMware (DBT)              ∼               ✓            ×          ✓

Table 5.5: The nesting setups with first generation hardware support as the L1 hypervisor technique and DBT as the L2 hypervisor technique.
Table 5.5 gives a summary of the eight setups discussed in this subsection. VMware Workstation is the best option since it allows nesting other hypervisors based on dynamic binary translation, but it is also the most likely to work when used as the nested hypervisor based on dynamic binary translation. In comparison to nesting inside a software solution, VirtualBox is able to nest within VMware Workstation when hardware support is used for the L1 hypervisor. VirtualBox is still not able to nest within KVM, Xen or itself, while VMware Workstation is able to nest within KVM and itself. It is regrettable that VMware Workstation checks for an underlying hypervisor other than VMware itself and so prevents the use of VMware Workstation within other hypervisors.
5.2.2 Paravirtualization
In this subsection, we discuss the setups that nest a paravirtualized guest inside
a guest virtualized using hardware support. Figure 5.5 shows the layers in these
setups. The main differences with the setups in the previous subsection are that
the L1 guest and the L2 hypervisor are represented by the same layer and that Xen automatically starts domain 0.

Figure 5.5: Layers for nested paravirtualization in a hypervisor based on hardware support.
There are just four setups tested in this subsection since only Xen is nested within
the four hypervisors based on hardware support. All four setups could successfully
nest a paravirtualized guest inside the L1 guest. However, the setup where Xen is
nested inside VirtualBox was not very stable. Sometimes during the start-up of the
privileged domain several segmentation faults occurred. Domain 0 was able to boot
and run successfully but the creation of another paravirtualized guest was sometimes
impossible. Xen reported that the guest was created; however, it did not show up in the list of virtual machines, indicating that the guest crashed immediately.
            VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
Xen (PV)           ∼               ✓            ✓          ✓

Table 5.6: The nesting setups with first generation hardware support as the L1 hypervisor technique and PV as the L2 hypervisor technique.
An overview of the four setups is shown in table 5.6. It is clear that using paravirtualization as the technique for the nested hypervisor can be recommended. The
only setup that does not completely work is the one with VirtualBox. Since the
other three setups work and since previous conclusions were also not in favor of
VirtualBox, VirtualBox is probably the reason for the instability.
5.2.3 Hardware supported virtualization
The remaining setups, which attempt to nest a hypervisor based on hardware supported virtualization, are discussed in this subsection. Nesting the four hypervisors
based on hardware support within each other results in 16 setups. The layout of the setups is similar to figure 5.4 and figure 5.5, depending on which hypervisor is used. None of the hypervisors provide the x86 virtualization processor extensions to their guests, indicating that none of the setups will work.
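As an illustration of how this can be verified in practice (a hedged sketch; the relevant CPU flag is "vmx" for Intel VT-x and "svm" for AMD-V), one can inspect /proc/cpuinfo inside the L1 guest:

    # inside the L1 guest: count virtual CPUs that advertise hardware virtualization extensions
    egrep -c '(vmx|svm)' /proc/cpuinfo

A result of 0 confirms that the L1 hypervisor does not expose the extensions to its guest, which is exactly the situation described above.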
Developers of both KVM and Xen are working on nested hardware support. Detailed information can be found in section 5.4. KVM has already released
initial patches for nested hardware support on AMD processors and is working on
patches for the nested support on Intel processors. Xen is also researching the ability
to nest the hardware support so that nested virtual machines can use the hardware
extensions.
                  VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
VirtualBox (HV)          ×               ×            ×          ×
VMware (HV)              ×               ×            ×          ×
Xen (HV)                 ×               ×            ×          ×
KVM (HV)                 ×               ×            ×          ×

Table 5.7: The nesting setups with first generation hardware support as the L1 and L2 hypervisor technique.
The results of this subsection are summarized in table 5.7. It is regrettable that currently none of the setups work because the L1 hypervisors do not yet provide the hardware support for virtualization to the guests. Nonetheless, it is encouraging that
KVM and Xen are doing research and work in this area. Their work can motivate
managers and developers of other hypervisors to provide these hardware extensions
to their guests as well.
We would like to note that VMware and VirtualBox guests with a 64 bit operating system need hardware support to execute. If we were to use a 64 bit operating system for the nested guest, the result would be the same as the results in this section since there is currently no nested hardware support.
5.2.4 Overview first generation hardware support
In this subsection, we summarize the results of the setups that are described in
the previous subsections. All the setups were tested on a processor that had first
generation hardware support for virtualization on x86 architectures. The results of
all the setups are gathered in table 5.8. The columns indicate the L1 hypervisor and
the rows indicate the L2 hypervisor, i.e. the hypervisor represented by the row is
nested inside the hypervisor represented by the column.
Nested x86 virtualization using a L1 hypervisor based on hardware support is
more successful than using a L1 hypervisor based on software solutions (see section 5.1.3). For nesting dynamic binary translation, results suggest that VMware
Workstation was the best option and that VirtualBox works although it showed
some instabilities. Nesting paravirtualization is the most suitable solution when using a L1 hypervisor based on hardware support on a processor that only supports
first generation hardware support. Nested hardware support is not present yet but
KVM and Xen are working on it. The future will tell whether they will be successful
or not. The number of working setups increased when using hardware support in
the L1 hypervisor so the future looks promising for nested hardware support.
                   VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
VirtualBox (DBT)          ×               ✓            ×          ×
VirtualBox (HV)           ×               ×            ×          ×
VMware (DBT)              ∼               ✓            ×          ✓
VMware (HV)               ×               ×            ×          ×
Xen (PV)                  ∼               ✓            ✓          ✓
Xen (HV)                  ×               ×            ×          ×
KVM (HV)                  ×               ×            ×          ×

Table 5.8: Overview of the nesting setups with first generation hardware support as the L1 hypervisor technique.

For now, VMware Workstation is the most suitable choice for the L1 hypervisor,
directly followed by KVM, since it can nest three different hypervisors. The advisable
choice for the L2 hypervisor is a paravirtualized guest using Xen since it allows
nesting in all the hypervisors. VirtualBox as the L1 hypervisor has two unstable
setups which makes it rather unsuitable for nested virtualization.
5.3 Second generation hardware support
In section 4.3 we concluded that it should be possible to nest virtual machines
without modifying the guest operating system, given that the hardware extensions
for virtualization on x86 architectures are provided by the physical processor. In
section 5.2, the setups with a L1 hypervisor based on hardware support were tested
on a processor that only provided first generation hardware support. In this section,
we test the same setups but on a newer processor (an Intel Core i7-860) that provides second generation hardware support. Comparing the results presented in both sections gives insight into the influence of the hardware supported MMU on nested virtualization.
Section B.4 lists detailed information about the hypervisor versions.
The second generation hardware support offers a hardware supported MMU.
The hardware supported MMU provides the Extended Page Tables for Intel and the
Nested Page Tables for AMD (see section 3.4). The memory management in nested
virtualization needs three levels of address translation, as can be seen in figure 4.2,
while the hardware supported MMU only offers two levels of address translation.
This problem is solved by reusing the existing code for the shadow tables. The
L2 hypervisor will maintain shadow tables that translate the nested virtual guest
address to the physical guest address. These shadow tables are used together with
the EPT or NPT to translate the nested virtual guest address into a real physical
address. So in a nested setup, the L2 guest maintains its own page tables that
translate nested virtual guest addresses into nested physical guest addresses. The L2
hypervisor maintains shadow tables for these page tables that immediately translate
nested virtual guest addresses into physical guest addresses. The L1 hypervisor
maintains the EPT or NPT that translate the physical guest addresses into real
physical addresses.
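Written out schematically, in our own shorthand (gva2/gpa2 for nested guest virtual and physical addresses, gpa1 for L1 guest physical addresses and hpa for real physical addresses), the translation chain is:

    gva2 --(L2 guest page tables)--> gpa2 --(L2 hypervisor)--> gpa1 --(EPT/NPT in L1 hypervisor)--> hpa

The shadow tables maintained by the L2 hypervisor collapse the first two steps into a direct gva2 to gpa1 mapping, which the hardware MMU then combines with the EPT or NPT to reach the real physical address hpa.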
The setups in this section remain unchanged; VirtualBox, VMware Workstation,
Xen and KVM are used as L1 hypervisor, supporting the hardware extensions for
virtualization on x86 architectures. The first subsection elaborates on the results of
nesting a L2 hypervisor based on dynamic binary translation. The second subsection
discusses the results of a nested hypervisor based on paravirtualization. The results
of nesting a hypervisor based on hardware support within the L1 hypervisor are
explained in the third subsection. Subsection 5.3.4 gives an overview of the results
in this section and compares these results with the results obtained in section 5.2.
5.3.1 Dynamic binary translation
Eight setups were tested using a L2 hypervisor based on dynamic binary translation.
Compared to the three working setups in subsection 5.2.1, there are six working
setups when using a L1 hypervisor based on the second generation hardware support
for virtualization on x86 architectures. The layout of the setups in this subsection
is shown in figure 5.4.
VirtualBox: Using VirtualBox based on hardware support as the bottom
layer hypervisor results in a different outcome for one of the setups. If the L2
hypervisor is VMware Workstation, the result was very unstable comparable to subsection 5.2.1. The other setup, which uses VirtualBox as the nested hypervisor, was
able to boot and run a L2 guest successfully. In the tests without second generation hardware support, this setup became unresponsive. The use of the hardware
supported MMU affects the outcome of the test for this setup.
VMware Workstation: Nothing has changed for these results. Both setups,
with VMware Workstation as L1 hypervisor, were still successful in running a L2
guest.
Xen: The setup with VirtualBox as the L2 hypervisor also has a different
outcome with Xen based on hardware support as the L1 hypervisor. A L2 guest was
able to boot and run successfully, while in the test with first generation hardware
support, the setup became unresponsive. The setup with VMware Workstation as
the L2 hypervisor still does not work because the hypervisor checks for an underlying L1 hypervisor. The Xen hypervisor was detected and VMware Workstation (version 6.5.3 build-185404 and newer) reported that the user should disable the other hypervisor.
KVM: Nesting VirtualBox or VMware Workstation within KVM now worked
for both setups. The setup with VMware Workstation as the inner hypervisor already worked without second generation hardware support. Newer versions of VMware Workstation (7.0.1 build-227600 and newer) check whether there is an underlying hypervisor and noticed that KVM was running; in those versions the check prevented the setup from working.
The setup using VirtualBox as L2 hypervisor, which showed a kernel panic in the
previous test, now booted and ran successfully.
                   VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
VirtualBox (DBT)          ✓               ✓            ✓          ✓
VMware (DBT)              ∼               ✓            ×          ✓

Table 5.9: The nesting setups with second generation hardware support as the L1 hypervisor technique and DBT as the L2 hypervisor technique.
The new results are gathered in table 5.9. In subsection 5.2.1, VMware Workstation was the recommended option to use as both L1 and L2 hypervisor. The
conclusion is different with the results in this subsection. VMware Workstation now
shares the most suitable choice for the bottom layer hypervisor with KVM. The
most suitable choice for the L2 hypervisor is no longer VMware Workstation but
VirtualBox since it could be nested in all setups. The check for an underlying hypervisor in VMware Workstation prevented it from being nested in certain setups. The
setup that nests VMware inside VirtualBox is very unstable, preventing VirtualBox
from being the advisable choice for the L1 hypervisor.
5.3.2 Paravirtualization
In this subsection, we replace dynamic binary translation as the L2 hypervisor with
paravirtualization. The layout of the setups is shown in figure 5.5. There were
three setups that completely worked in subsection 5.2.2. The fourth setup was very
unstable because segmentation faults could occur during the start-up of domain 0.
Using second generation hardware support these segmentation faults disappeared
and the fourth setup successfully passed the test.
There is little difference from the previous results on a processor with first generation hardware support. The new results are collected in table 5.10. Xen using paravirtualization remains an excellent choice for nesting inside a virtual machine.

            VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
Xen (PV)           ✓               ✓            ✓          ✓

Table 5.10: The nesting setups with second generation hardware support as the L1 hypervisor technique and PV as the L2 hypervisor technique.
5.3.3 Hardware supported virtualization
None of the setups in subsection 5.2.3 worked. The results are the same for this
subsection since the problem is not the hardware supported MMU. The layout of the
setups with the L1 and L2 hypervisors based on hardware support for virtualization
on x86 architectures is similar to figure 5.4 and figure 5.5, depending on which
hypervisor is used. The problem is that there is no nested hardware support. There
is work in progress on this subject by KVM and Xen, see section 5.4.
                  VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
VirtualBox (HV)          ×               ×            ×          ×
VMware (HV)              ×               ×            ×          ×
Xen (HV)                 ×               ×            ×          ×
KVM (HV)                 ×               ×            ×          ×

Table 5.11: The nesting setups with second generation hardware support as the L1 and L2 hypervisor technique.
For completeness, the results are shown in table 5.11, but they are the same as the results of the tests on a processor without second generation hardware support.
5.3.4 Overview second generation hardware support
The intermediate results of the previous subsections are gathered in this subsection.
The setups were tested on a processor that provides second generation hardware
support for virtualization. Table 5.12 shows a summary of the results obtained in
the previous subsections. The columns indicate the L1 hypervisor and the rows
represent the L2 hypervisor, i.e. the hypervisor indicated by the row is nested inside
the hypervisor represented by the column.

                   VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
VirtualBox (DBT)          ✓               ✓            ✓          ✓
VirtualBox (HV)           ×               ×            ×          ×
VMware (DBT)              ∼               ✓            ×          ✓
VMware (HV)               ×               ×            ×          ×
Xen (PV)                  ✓               ✓            ✓          ✓
Xen (HV)                  ×               ×            ×          ×
KVM (HV)                  ×               ×            ×          ×

Table 5.12: Overview of the nesting setups with second generation hardware support as the L1 hypervisor technique.
Nested x86 virtualization with a L1 hypervisor based on hardware support is
even more successful when the processor provides second generation hardware support. Both dynamic binary translation and paravirtualization are capable of being
nested inside a hypervisor based on hardware support. The two setups that did not work had problems with configuration instability and with the check for an underlying hypervisor in VMware Workstation. KVM and VMware Workstation are the advisable choices for the L1 hypervisor since all dynamic binary translation and paravirtualization setups worked for these hypervisors. VirtualBox using dynamic binary
translation and Xen using paravirtualization are the most suitable choices for the L2 hypervisor.
Many setups that were unresponsive in section 5.2 became responsive when using
a hardware supported MMU. The use of EPT or NPT improves the performance for
the memory management and relieves the L1 hypervisor from maintaining shadow tables. The maintenance of the shadow tables is done in software and can contain bugs. It must also be implemented in a performance oriented way since it is a crucial part of the hypervisor. After some research (see http://www.mail-archive.com/[email protected]/msg29779.html), it was clear that hypervisors normally take shortcuts in order to improve the performance of the memory management. Thus, the main issue is the shadow tables, which optimize the MMU virtualization but, for performance reasons, do not exactly follow architectural equivalence. Two levels of shadow page tables seemed to be the cause of unresponsiveness in several setups. Replacing the shadow tables in the L1 hypervisor with EPT or NPT removes the inaccurate virtualization of the memory management unit. The second generation hardware support inserts an accurate hardware MMU with two levels of address translation in the L1 hypervisor, allowing L2 hypervisors and L2 guests to run successfully.
5.4 Nested hardware support
Nested hardware support is the support of hardware extensions for virtualization
on x86 architectures within a guest. The goal of nested hardware support is mainly
supporting nested virtualization for L2 hypervisors based on that hardware support.
In section 4.3, we concluded that in order to nest a hypervisor based on hardware
support, the virtualized processor should provide the hardware extensions. In subsection 5.2.3 and subsection 5.3.3, we noticed that none of the hypervisors provide a
virtualized processor with hardware extensions, resulting in none of the setups being
able to nest a hypervisor. Recently, KVM and Xen started research in this domain
in order to develop nested hardware support. In the following subsections, the work in progress of both KVM and Xen is presented.
5.4.1 KVM
Nested hardware support was not available by default in KVM. The virtualized processor provided to the guest is similar to the host processor, but lacks the hardware
extensions for virtualization. These extensions are needed in order to use KVM or
any other hypervisor based on hardware support. The introduction of nested hardware support should allow these hypervisors to be nested inside a virtual machine.
The first announcement of nested hardware support was made in September 2008 in a blog post by Avi Kivity [48]. He writes about an e-mail from Alexander Graf and
Joerg Roedel presenting a patch for nested SVM support [49], i.e. nested hardware
support for AMD processors with SVM support, and about the relative simplicity
of this patch. More information on AMD SVM itself can be found in section 3.3.
Alexander Graf and Joerg Roedel are both developers working on new features for
KVM. The patch was eventually included in development version kvm-82 and allows
the guest on an AMD processor, with hardware extensions for virtualization, to run
a nested hypervisor based on hardware support. The implementation of the patch
stayed relatively simple by exploiting the design of the SVM instruction set.
A year later, in September 2009, Avi Kivity announced that support for nested
VMX, i.e. nested hardware support for Intel processors with Intel VT-x extensions, is coming. The bad news is that it will take longer to implement this feature
since nested VMX is more complex than nested SVM. In section 3.3, we explained
that Intel VT-x and AMD SVM are very similar but the terminology is somewhat
different. Besides the similarities, there are some fundamental differences in their
implementation that make VMX support more complex.
A first difference is the manipulation of the data structure used by the hypervisor
to communicate with the processor. For Intel VT-x, this data structure is called the VMCS; the equivalent in AMD SVM is called the VMCB. Intel uses two instructions,
VMREAD and VMWRITE, to manipulate the VMCS, while AMD allows manipulation of the VMCB by reading and writing in a memory region. The drawback of the
two extra instructions is that KVM must trap and emulate the special instructions.
For SVM, KVM could just allow the guest to read and write to the memory region
of the VMCB without intervention.
A second difference is the number of fields used in the data structure. Intel uses
a lot more fields to allow hypervisor-processor intercommunication. AMD SVM has 91 fields in the VMCB, while Intel VT-x has no less than 144 fields in the VMCS. KVM needs to
virtualize all these fields and make sure that the guest, running a hypervisor, can
use those fields in a correct way.
Besides the differences in the implementation of Intel VT-x and AMD SVM,
5.4. NESTED HARDWARE SUPPORT
52
another reason for the longer development time for the nested VMX support is
that the patch will immediately support nested EPT. This means that not only
the hypervisor in the host can use Extended Page Tables, see section 3.4, but the
hypervisor in the guest also benefits from EPT support. As already pointed out in
section 4.3, nested EPT or nested NPT could be critical for obtaining reasonable
performance. With the nested VMX support, a KVM guest must support the 32 bit and 64 bit page table formats and the EPT format.
In practice
The nested hardware support was tested on an AMD processor (a Quad-Core AMD Opteron 2350) since the nested SVM patch was already released. The installation is the same as a regular install, but in order to use the patch one must set a flag when loading the kernel modules. We can do this using the following commands:
    modprobe kvm
    modprobe kvm-amd nested=1
“nested=1” indicates that we want to use the nested SVM. The tested setup was
KVM as both L1 and L2 hypervisor. After installing and booting the L1 guest, KVM
was installed inside the guest in exactly the same way as a normal installation of
KVM. The nested hypervisor’s modules do not need to be loaded with “nested=1”.
In subsection 5.2.3 and subsection 5.3.3, we could not install KVM within the guest.
Installing KVM within the guest is a promising step towards nested virtualization
with KVM, or any other hypervisor based on hardware support, as a nested hypervisor. When starting the L2 guest for installation of an operating system or for
booting an existing operating system, some “handle exit” messages occurred. On
KVM’s mailing list, Joerg Roedel replied in March 2010 (http://www.mail-archive.com/[email protected]/msg31096.html) that the messages result from a difference between real hardware SVM and the SVM emulated by KVM. A patch should fix this issue, but since it still needs more testing, the current setup was not able to boot. Nonetheless, developers are constantly improving the nested SVM support by means of new patches and tests, so it is just a matter of time before the current setup
will work.
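As a quick sanity check (a sketch; the sysfs path assumes the standard kvm-amd module parameter layout and has not been verified against the exact KVM version used here), one can confirm that the nested flag is active on the host and that the virtual processor in the L1 guest advertises SVM:

    # on the host: should print 1 when nested SVM is enabled
    cat /sys/module/kvm_amd/parameters/nested
    # inside the L1 guest: should be non-zero if the virtual CPU exposes the svm flag
    egrep -c svm /proc/cpuinfo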
5.4.2 Xen
Xen is also working on nested virtualization with an emphasis on virtualization based
on hardware support. On November 2009, during the Xen Summit in Asia, Qing He
presented his work on nested virtualization [50]. Qing He has been working on Xen
since 2006 and is a software engineer from the Intel Open Source Technology Center.
His work focusses on hardware support based virtualization and more specifically
on Intel’s VT-x hardware support. The current progress is a proof of concept for a
simple scenario with a single processor and one nested guest. The nested guest is
able to boot to an early stage successfully with KVM as the L2 hypervisor. Before
releasing the current version, it still needs some stabilization and refinement.
The main target is the virtualization of VMX in order to present a virtualized
VMX to the guest. This means that all of the hardware support must be
available in the guest. The guest should be able to use the data structures and the
instructions to manipulate the VMCS. The guest should also be able to control the
execution flow of the VMX with VMEntry and VMExit instructions.
Figure 5.6: Nested virtualization architecture based on hardware support.
The data structures are shown in figure 5.6. The L1 guest has a VMCS that is
loaded into the hardware when this guest is running. The VMCS is maintained by
the L1 hypervisor. If the L2 guest wants to execute, it needs to have a corresponding
VMCS. That corresponding VMCS is maintained by the L2 hypervisor running in
the L1 guest and is called the virtual VMCS, or vVMCS. The L2 hypervisor sees
the virtual VMCS as the controlling VMCS of the L2 guest but it is called virtual
because the L1 hypervisor maintains a corresponding shadow VMCS, or sVMCS.
This shadow VMCS is not a complete duplicate of the virtual VMCS but contains
translations, similar to the shadow tables (see subsection 3.1.3). It is the shadow
VMCS that is loaded to the hardware when the L2 guest is running. Thus, each
nested guest has a virtual VMCS in the L2 hypervisor and a corresponding shadow
VMCS in the L1 hypervisor. The general idea is to treat the L2 guests as a guest
of the L1 hypervisor using the shadow VMCS.
Figure 5.7 shows the execution flow in a nested virtualization scenario based
on hardware support. On the left side of the figure, the L1 guest is running and
wants to start a nested guest. The guest does this by executing a VMEntry with the
instruction VMLAUNCH or VMRESUME. The virtual VMEntry cannot directly switch to the L2 guest because this is not supported by the hardware. The L1 guest is already running in VMX guest mode and can only trigger a VMExit. The VMExit results in a transition to the L1 hypervisor, which intercepts the VMEntry call and tries to switch to the shadow VMCS indicated by the VMEntry. This results in the transition to the L2 guest, and the L2 guest can run from then on.
Figure 5.7: Execution flow in nested virtualization based on hardware support.

Similar to a virtual VMEntry, the virtual VMExit will transition to the L1 hypervisor. The L1 hypervisor does not know whether the VMExit is a virtual VMExit
or whether the VMExit happened due to the L2 guest executing a privileged instruction. When the L2 guest tries to run a privileged instruction, the L1 hypervisor can
fix this without having to forward the VMExit to the L2 hypervisor. An algorithm in the L1 hypervisor determines whether this is a virtual VMExit that should be forwarded to the L2 hypervisor, or whether it is another type of VMExit that can be handled by the L1 hypervisor. For a virtual VMExit, the L1 hypervisor forwards it to the L2 hypervisor and the shadow VMCS of the L2 guest is unloaded. The L1 hypervisor switches the controlling VMCS to the VMCS of the L1 guest. In the figure, there are three VMExits which result in a transition to the L1 hypervisor. The first and the last VMExit are forwarded by the L1 hypervisor to the L2 hypervisor and the second VMExit is handled by the L1 hypervisor itself.
There is no special handling in place for the memory management. The nested
EPT, as described in the previous subsection, is also very helpful in this case because
it significantly reduces the number of virtual VMExits. Nested EPT support is still
work in progress.
L1 hypervisor (HW generation)    VirtualBox  VirtualBox  VMware  VMware  Xen   Xen   KVM
                                   (DBT)       (HV)      (DBT)   (HV)    (PV)  (HV)  (HV)
VirtualBox (DBT), software           ×           ×          ×       ×      ×     ×     ×
VMware (DBT), software               ×           ×          ×       ×      ✓     ×     ×
Xen (PV), software                   ×           ×          ×       ×      ×     ×     ×
VirtualBox (HV), 1st gen.            ×           ×          ∼       ×      ∼     ×     ×
VirtualBox (HV), 2nd gen.            ✓           ×          ∼       ×      ✓     ×     ×
VMware (HV), 1st gen.                ✓           ×          ✓       ×      ✓     ×     ×
VMware (HV), 2nd gen.                ✓           ×          ✓       ×      ✓     ×     ×
Xen (HV), 1st gen.                   ×           ×          ×       ×      ✓     ×     ×
Xen (HV), 2nd gen.                   ✓           ×          ×       ×      ✓     ×     ×
KVM (HV), 1st gen.                   ×           ×          ✓       ×      ✓     ×     ×
KVM (HV), 2nd gen.                   ✓           ×          ✓       ×      ✓     ×     ×

Table 5.13: Overview of all nesting setups. The rows indicate the L1 hypervisor (and, for the hardware supported L1 hypervisors, the processor generation); the columns indicate the L2 hypervisor.
CHAPTER 6
Performance results
This chapter elaborates on the performance of the working setups for nested virtualization on x86 architectures. Chapter 5 showed that there was one working setup
for nested virtualization when using dynamic binary translation as the L1 hypervisor technique. There were also ten working setups when using a L1 hypervisor
based on hardware support with a processor that contains the second generation
hardware extensions for virtualization on x86 architectures. The performance in a
normal virtual machine is compared to the performance in a nested virtual machine
in order to get an idea about the performance degradation between virtualization
and nested virtualization.
The performed tests measure the processor, memory and I/O performance.
These are the three most important components of a computer system. The evolution of hardware support for virtualization on x86 architecture also shows that the
processor, the memory management unit and I/O are important components, see
chapter 3. The first generation hardware support focusses on the processor, second
generation hardware support concentrates on a hardware supported MMU and the
newer generation provides support for directed I/O. The benchmarks used for the
tests are sysbench (http://sysbench.sourceforge.net/), iperf (http://iperf.sourceforge.net/) and iozone (http://www.iozone.org/). sysbench was used for the processor, memory and file I/O performance. iperf was used for network performance and iozone was used as a second benchmark for file I/O.
The rest of this chapter is organized using these three components. The first
section elaborates on the performance of the processor in nested virtualization. The
next section evaluates the memory performance of the nested virtual machines and
the third section shows the performance of I/O in a nested setup. The last section
gives an overall conclusion on the performance of nested virtualization.
Whenever a test ran directly on the host operating system, without any virtualization, the test is labeled with the word “native”. If the label is a name of a single
virtualization product, the test ran inside a L1 guest with the indicated hypervisor
as L1 hypervisor. The “DBT” suffix indicates that the L1 hypervisor uses the dynamic binary translation technique. All “HV” tests use the hardware support of the
processor for virtualization (all performance tests were executed on an Intel Core i7-860 processor, which provides second generation hardware support for x86 virtualization). A label of the form "L1 hypervisor - L2 hypervisor" shows
the result of a performance test executed in a L2 guest using the given L2 hypervisor
and L1 hypervisor. For example, “KVM (HV) - VirtualBox (DBT)” indicates the
setup where KVM is used as L1 hypervisor and VirtualBox is used as L2 hypervisor
based on dynamic binary translation. All nested setups use the hardware support
of the processor in the L1 hypervisor, except for “VMware (DBT) - Xen (PV)”.
The latter uses VMware as the L1 hypervisor based on dynamic binary translation
and uses Xen as L2 hypervisor based on paravirtualization. The L2 hypervisor is
never based on hardware support as can be seen in chapter 5. Thus, VirtualBox and
VMware are always based on dynamic binary translation and Xen is always based
on paravirtualization, when used as L2 hypervisor.
6.1 Processor performance
The experiment used to measure the performance of the processor consists of a
sysbench test which calculates prime numbers. It calculates the prime numbers
up to a set maximum and does this a given number of times. The number of threads
that will calculate the prime numbers can also be modified prior to running the
test. In the executed tests, the maximum number for the primes was 150000 and
all prime numbers until 150000 were calculated 10000 times spread over 10 threads.
The measured unit of the test was the duration in seconds.
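The command line below is a sketch of how such a test can be invoked with sysbench; the option names follow the sysbench 0.4 series and may differ in other versions, but the values mirror the parameters described above:

    # prime calculation up to 150000, 10000 requests in total, spread over 10 threads
    sysbench --test=cpu --cpu-max-prime=150000 --max-requests=10000 --num-threads=10 run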
Figure 6.1 shows the first results of the performance test for the processor. The
left bar is the result on the computer system without virtualization and the other
bars are the results of the tests in L1 guests. The figure shows a serious gap between
the native performance and the performance in a virtual machine. The reason for
this big gap in performance is the use of only one core inside the virtual machine
while the host operating system can use four cores. The tests were executed in
virtual machines with only one core so that the comparison between the different
virtualization software would be fair.
Figure 6.1: CPU performance for native with four cores and L1 guest with one core (lower is better).

In order to get an indication of the real performance degradation, the same test was executed in a VMware guest that can use four cores and in a "VMware (HV) - VMware (DBT)" nested guest that can use four cores. The results of these tests are given in figure 6.2. The figure shows that the performance degradation between a virtual machine and a nested virtual machine is less than the performance degradation between a native platform and a virtual machine. By adding an extra level of virtualization, one expects a certain overhead, but this shows that the performance degradation for the extra level is promising. The performance overhead is linear and does not increase exponentially, which is encouraging because the latencies of VMEntry and VMExit instructions (see section 3.3) do not have to be improved dramatically in order to get acceptable performance in the nested guest.

Figure 6.2: CPU performance for native, L1 and L2 guest with four cores (lower is better).
The results of the tests on virtual machines and nested virtual machines are
shown in figure 6.3. The performance between L1 guests with “HV” is about the
same since the L1 hypervisors use hardware support for virtualization. The L1 guest
that is virtualized using dynamic binary translation, “VMware (DBT)”, was able to
perform equally well. The results of the L2 guests vary heavily between the different
setups and are higher than the results of the L1 guests. However, the performance
degradation is not problematic, except for one outlier which uses dynamic binary
translation for the L1 hypervisor. With a duration of 496.83 seconds, the “VMware
(DBT) - Xen (PV)" setup performs much worse than other nested setups.

Figure 6.3: CPU performance for L1 and L2 guests with one core (lower is better).
6.2 Memory performance
In this section, the performance degradation of the memory management unit is
evaluated. In section 5.3 we explained that the hardware supported L1 hypervisors
use the hardware supported MMU of the processor and the L2 hypervisors use a
software technique for maintaining the page tables of their guests. In the “VMware
(DBT) - Xen (PV)” setup, the L1 hypervisor maintains shadow tables and the L2
hypervisor provides paravirtual interfaces to its guests.
The performed memory tests evaluate the read and write throughput. The tests read or write a total of 2 GB from or to memory in blocks of 256 bytes. The tests were done in two variants: one that reads or writes in sequential order and one that reads or writes in random order. Figure 6.4 presents the results of the
memory tests for the native platform, L1 guests and L2 guests. Several observations
for nested virtualization can be made from the results.
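A sketch of the corresponding sysbench invocation (option names as in the sysbench 0.4 series; run it four times, varying --memory-oper and --memory-access-mode, to obtain the four variants shown in the figure):

    # write 2 GB in 256-byte blocks, sequential access
    sysbench --test=memory --memory-block-size=256 --memory-total-size=2G \
             --memory-oper=write --memory-access-mode=seq run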
A first observation is that the duration of the tests increases greatly when using
virtualization. The L1 guests needed approximately 10 seconds to read or write 2 GB, while the test on the native platform took about 1.5 seconds. Most L2 guests took
more than 128 seconds to pass the test. For nested virtualization the performance
degradation of the memory is more significant than the performance degradation of
the processor.
A second observation is that nesting Xen avoids the performance degradation for the memory, except for the setup with dynamic binary translation
as L1 hypervisor. While other nested setups took more than 128 seconds, the nested
setups with Xen as L2 hypervisor took 10 seconds which is the same as in the L1
guests. Paravirtualization appears to be a promising technique for the L2 hypervisor
to minimize the performance overhead of the memory. The reason why “Xen (PV)”
does not minimize the performance overhead of the memory when compared to
native is unclear.
The figure also shows that the “VMware (DBT)” setup performs poorly compared to the other L1 setups and that “VMware (DBT) - Xen (PV)” did not take
advantage of the paravirtualization in the L2 hypervisor. The nested setup “VMware (DBT) - Xen (PV)” is not the worst of all nested setups, but the duration
still increases despite the use of paravirtualization as the L2 hypervisor technique.
Therefore, a L1 hypervisor that uses second generation hardware support for virtualization on x86 architectures performs better for memory management.
The thread test stresses many divergent paths through the hypervisor, such
as system calls, context switching, creation of address spaces and injection of page
faults [11]. Figure 6.5 summarizes the results of a thread test. The test created 1000
threads and 4 mutexes. Each thread locks a mutex, yields the CPU and unlocks
the mutex afterwards. These actions are performed in a loop, so there is contention on each mutex. The results of the test mirror the results of the memory performance test, which indicates that the thread test depends heavily on memory management.

Figure 6.5: Threads performance for native, L1 guests and L2 guests with sysbench benchmark (lower is better).
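A sketch of the thread test invocation (sysbench 0.4-style option names; the lock, yield and unlock loop is what the threads test performs internally):

    # 1000 threads contending for 4 mutexes
    sysbench --test=threads --num-threads=1000 --thread-locks=4 run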
6.3 I/O performance
We evaluate the I/O performance in this section. There are many I/O devices so
we selected two major devices. The first test measured the network throughput and
the second test measured the reads and writes from and to the hard disk. The first
subsection elaborates on the results of the network test and the second subsection
presents the results of the disk I/O.
6.3.1 Network I/O
The network throughput was tested using the iperf benchmark, which measures the TCP throughput over a period of 10 seconds. Figure 6.6 shows that there is little or no
performance degradation between native and L1 guests. The bottleneck in these
tests was the 100 Mbit/s network card and not the virtualization. The results can
be different for a network card with a higher throughput. The performance overhead
for L2 guests heavily depends on which setup is used. The lowest performance was
measured for VirtualBox on top of Xen.
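A sketch of the iperf invocation (which machine played the server role is an assumption here, and 192.168.1.1 is a placeholder address):

    # on the receiving side, e.g. another machine on the 100 Mbit/s LAN:
    iperf -s
    # in the guest under test: 10-second TCP throughput measurement
    iperf -c 192.168.1.1 -t 10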
The nine nested setups can clearly be divided into three groups of three. The nested
setups in the first group perform rather poorly for network I/O with a throughput
of less than 50 Mbit/s. The second group achieved reasonable performance. They
are the nested setups with a throughput between 50 Mbit/s and 80 Mbit/s. The
last group has a good performance with a network throughput of more than 80
Mbit/s, nearing native performance. This group shows little performance degradation compared to L1 guests and native, taking into account that the network card is the bottleneck.

Figure 6.6: Network performance for native, L1 guests and L2 guests (higher is better).
6.3.2 Disk I/O
We measured the disk I/O performance using two tests. The first test is a file I/O
test of sysbench. In the preparation stage, it creates a specified number of files with
a specified total size. During the test, each thread performs specified I/O operations
on this set of files.
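A sketch of such a sysbench file I/O run (the total size and test mode shown here are illustrative values, not the exact parameters used in these experiments):

    # create the set of test files, run a combined random read/write workload, clean up
    sysbench --test=fileio --file-total-size=2G prepare
    sysbench --test=fileio --file-total-size=2G --file-test-mode=rndrw run
    sysbench --test=fileio --file-total-size=2G cleanup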
In figure 6.7, we can observe that the setups with virtualization perform much
better than in the native case. These results are unusual since we would expect
that the virtualization layer adds a certain performance overhead. The test in the
L1 VMware Workstation guest took 0.5 seconds while the same test on the native
platform took 37.7 seconds. This suggests that some optimization provides a speedup for the disk performance. The optimization is not a feature of the hard disk
because it is not possible for a virtual machine to read from a disk faster than a
native machine. The documentation of the iozone benchmark suggests that the
processor cache and buffer cache are helping out for smaller files. It advises running the benchmark with a maximum file size that is larger than the available memory
on the computer system. The results of these tests are shown in figure 6.8. In the
figure we can clearly see that optimizations are obtained by the use of the caches.
The theoretical speed of the hard disk is marked on the graphs. The throughput of
the I/O exceeds this theoretical speed, indicating that the measured values are not
the real I/O performance.
The real I/O performance can be found when using larger files. In figure 6.8(a),
the measured performance for the L1 VMware Workstation guest is lower than in
the native test for larger files. The performance in the L2 VMware Workstation
guest is higher than in the L1 guest, but the test stopped at 2 GB files since the hard disk of the nested guest was not large enough. The iozone tests showed that the sysbench tests were inaccurate due to caching and suggest that real I/O performance can be measured by writing and reading large files. In order to obtain
good performance results, these tests should be conducted for larger files than the
tests we ran.
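A sketch of an iozone run along the advised lines (the 8G maximum is an illustrative value that should simply exceed the available memory of the machine under test):

    # automatic mode over a range of file sizes; -i 0 selects the write test, -i 1 the read test
    iozone -a -g 8G -i 0 -i 1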
Figure 6.7: File I/O performance for native, L1 guests and L2 guests with sysbench benchmark (lower is better).

Figure 6.8: File I/O performance for native, L1 guests and L2 guests with iozone benchmark: (a) write test, (b) read test.
6.4 Conclusion
Performance overhead for nested virtualization is linear for CPU performance and
exponential for memory performance, except for the memory performance of nested
setups with paravirtualization as L2 hypervisor technique. Paravirtualization minimizes the performance degradation. For CPU performance in nested virtualization,
the setup that uses dynamic binary translation as the L1 hypervisor was the only outlier; the other setups performed adequately. For memory performance, the setups that use paravirtualization as the L2 hypervisor performed as well as the L1 guests. The results for the I/O performance were split into network and disk performance. The network performance could be divided into three groups. The first group had near native performance, the second group performed acceptably and the last group
performed rather poorly. The results for disk performance were not accurate enough
since real disk I/O was difficult to measure due to caching. More testing is required
for disk I/O performance to reach an accurate conclusion.
CHAPTER 7
Conclusions
This chapter concludes the work of this thesis. In the first section, we elaborate on
the results and conclusions of the previous chapters. The last section proposes some
future work for nested virtualization.
7.1 Nested virtualization and performance results
Nested virtualization on the x86 architecture can be a useful tool for the development
of test setups for research purposes, creating test frameworks for hypervisors, etc. In
chapter 5, we investigated which techniques are the most suitable choice for the L1
and L2 hypervisors. The most suitable L1 hypervisor technique is a hardware based
virtualization solution. When comparing the results of the setups that use software
solutions for the L1 hypervisor with the results of the setups that use hardware
support as technique, we saw that the latter resulted in more working setups. The
hardware support of the processor is preferably the second generation hardware
support for virtualization. The use of EPT or NPT improves the performance for
the memory management and releases the L1 hypervisor from maintaining shadow
tables. These shadow tables form a problem for certain nested setups when used in
the L1 hypervisor. Hypervisors take shortcuts in order to improve the performance
of the software based memory management. This can lead to failures in nesting
virtual machines. The second generation hardware support avoids these problems
by providing a hardware supported MMU and appears to be the most advisable
choice for the L1 hypervisor. Table 5.13 shows an overview of the results of all
nested setups.
The best technique for the L2 hypervisor is paravirtualization. The only working
setup in section 5.1 used paravirtualization for the L2 hypervisor and, except for one, all nested setups with paravirtualization as the L2 hypervisor worked for a L1
hypervisor based on hardware support. Dynamic binary translation also performed
well on top of hardware support when the processor provided the hardware supported
MMU. Without the use of EPT or NPT, dynamic binary translation as the L2
hypervisor results in two levels of shadow tables which does not work very well.
The performance results in chapter 6 support the decision that paravirtualization
is the most suitable choice for the L2 hypervisor. The processor performance is
comparable to other nested setups and the memory performance is comparable to a
single layer of virtualization.
Nested hardware support is the great absentee in the whole nesting story. None
of the hypervisors provided the hardware extensions for virtualization to their guests.
This prevented the installation of a L2 hypervisor based on hardware support within
these guests. KVM and Xen are working on nested hardware support. KVM already
released nested SVM support but the implementation is still in its infancy. The
development of nested VMX support takes more time because of the differences
between AMD SVM and Intel VT-x. Xen is focussing on nested hardware support
for VMX and a proof of concept has been made that can successfully boot a nested
virtual machine to an early stage.
For the performance results, we observed that the processor performance degradation was linear for nested virtualization with hardware support for the L1 hypervisor. The memory performance decreased greatly for nested virtualization. The
only exception is the use of paravirtualization for the L2 hypervisor. In these experiments, no memory overhead was introduced in the nested setups. The other
nested setups suffered from a significant memory overhead and had memory access
times that were on average 128 times slower when compared to native. The I/O
performance results were not accurate and need more work to gain an accurate view
of the I/O performance degradation for nested virtualization.
7.2 Future work
One area of future work is testing the nested hardware support when KVM and
Xen release their updated versions. The release of nested hardware support makes it possible to test nested setups other than those tested in this thesis and might provide new results.
An extra task could be to check whether other virtualization software vendors have
started to develop nested hardware support.
Throughout this thesis, we focussed on software solutions and hardware support
as techniques for a hypervisor. The first generation and second generation hardware
support were compared to see what impact the hardware supported MMU had.
Lately, hardware vendors are working on directed I/O for virtualization and more
specifically, Intel is working on its Intel VT-d and AMD on its IOMMU. Another
area of future work would be to investigate whether directed I/O can be useful for nested virtualization and what the performance impact of this new generation of hardware support is.
Bibliography
[1] S. Nanda and T. Chiueh, “A survey on virtualization technologies,” tech. rep.,
Stony Brook University, 2005.
[2] VMware, “Virtualization History.” http://www.vmware.com/virtualization/history.html. Last accessed on May 19, 2010.
[3] VMware, “VMware: Virtualization overview.” http://www.vmware.com/pdf/virtualization.pdf. Last accessed on May 19, 2010.
[4] S. Adabala, V. Chadha, P. Chawla, R. Figueiredo, J. Fortes, I. Krsul, A. Matsunaga, M. Tsugawa, J. Zhang, M. Zhao, L. Zhu, and X. Zhu, “From virtualized
resources to virtual computing grids: the In-VIGO system,” Future Generation
Computer Systems, vol. 21, no. 6, pp. 896–909, 2005.
[5] J. E. Smith and R. Nair, Virtual Machines: Versatile Platforms for Systems
and Processes. The Morgan Kaufmann Series in Computer Architecture and
Design, San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2005.
[6] W. Stallings, Operating Systems: Internals and Design Principles. Prentice
Hall, 5th ed., 2004.
[7] J. E. Smith and R. Nair, “The architecture of virtual machines,” Computer,
vol. 38, pp. 32–38, May 2005.
[8] G. J. Popek and R. P. Goldberg, “Formal requirements for virtualizable third
generation architectures,” Commun. ACM, vol. 17, no. 7, pp. 412–421, 1974.
[9] Intel Corporation, Intel® 64 and IA-32 Architectures Software Developer’s Manual, Dec. 2009. http://www.intel.com/products/processor/manuals/. Last accessed on May 19, 2010.
[10] VMware, “Understanding Full Virtualization, Paravirtualization, and Hardware Assist.” http://www.vmware.com/files/pdf/VMware_paravirtualization.pdf, Sept. 2007. Last accessed on May 19, 2010.
[11] K. Adams and O. Agesen, “A comparison of software and hardware techniques
for x86 virtualization,” in ASPLOS-XII: Proceedings of the 12th international
conference on Architectural support for programming languages and operating
systems, (New York, NY, USA), pp. 2–13, ACM, Oct. 2006.
[12] E. Witchel and M. Rosenblum, “Embra: fast and flexible machine simulation,”
in SIGMETRICS ’96: Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, (New
York, NY, USA), pp. 68–79, ACM, 1996.
[13] J. D. Gelas, “Hardware Virtualization: the Nuts and Bolts.” http://it.
anandtech.com/printarticle.aspx?i=3263, Mar. 2008. Last accessed on
May 19, 2010.
[14] A. Vasudevan, R. Yerraballi, and A. Chawla, “A high performance Kernel-Less Operating System architecture,” in ACSC ’05: Proceedings of the Twenty-eighth Australasian conference on Computer Science, (Darlinghurst, Australia), pp. 287–296, Australian Computer Society, Inc., 2005.
[15] K. Onoue, Y. Oyama, and A. Yonezawa, “Control of system calls from outside of virtual machines,” in SAC ’08: Proceedings of the 2008 ACM symposium on Applied computing, (New York, NY, USA), pp. 2116–2121, ACM, 2008.
[16] J. Sugerman, G. Venkitachalam, and B.-H. Lim, “Virtualizing I/O devices on VMware Workstation’s hosted virtual machine monitor,” in Proceedings of the General Track: 2002 USENIX Annual Technical Conference, (Berkeley, CA, USA), pp. 1–14, USENIX Association, 2001.
[17] J. Fisher-Ogden, “Hardware Support for Efficient Virtualization.”
[18] Y. Dong, J. Dai, Z. Huang, H. Guan, K. Tian, and Y. Jiang, “Towards high-quality I/O virtualization,” in SYSTOR ’09: Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, (New York, NY, USA), pp. 1–8, ACM, 2009.
[19] Advanced Micro Devices, Inc., “AMD-V Nested Paging,” tech. rep., Advanced Micro Devices, Inc., July 2008. http://developer.amd.com/assets/
NPT-WP-1%201-final-TM.pdf. Last accessed on May 19, 2010.
[20] VMware, “Software and Hardware Techniques for x86 Virtualization.” http://
www.vmware.com/files/pdf/software_hardware_tech_x86_virt.pdf. Last
accessed on May 19, 2010.
[21] A. Whitaker, M. Shaw, and S. D. Gribble, “Denali: Lightweight Virtual Machines for Distributed and Networked Applications,” in Proceedings of the USENIX Annual Technical Conference, 2002.
[22] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer,
I. Pratt, and A. Warfield, “Xen and the art of virtualization,” SIGOPS Oper.
Syst. Rev., vol. 37, no. 5, pp. 164–177, 2003.
[23] J. R. Santos, Y. Turner, G. Janakiraman, and I. Pratt, “Bridging the gap between software and hardware techniques for I/O virtualization,” in ATC’08: USENIX 2008 Annual Technical Conference, (Berkeley, CA, USA), pp. 29–42, USENIX Association, 2008.
[24] VMware, “A Performance Comparison of Hypervisors.” http://www.vmware.
com/pdf/hypervisor_performance.pdf, Feb. 2007. Last accessed on May 19,
2010.
[25] Intel Corporation, “Intel® Virtualization Technology for Directed I/O,” tech. rep., Intel Corporation, Sept. 2008. http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf. Last accessed on May 19, 2010.
[26] Intel Corporation, “A superior hardware platform for server virtualization,” tech. rep., Intel Corporation, 2009. http://download.intel.com/
business/resources/briefs/xeon5500/xeon_5500_virtualization.pdf.
Last accessed on May 19, 2010.
[27] Intel Corporation, “Intel® Virtualization Technology Specification for the IA-32 Intel® Architecture,” tech. rep., Intel Corporation, Apr. 2005. http://dforeman.cs.binghamton.edu/~foreman/552pages/Readings/intel05virtualization.pdf. Last accessed on May 19, 2010.
[28] Advanced Micro Devices, Inc., AMD64 Virtualization Codenamed “Pacifica” Technology: Secure Virtual Machine Architecture Reference Manual, May 2005. http://www.mimuw.edu.pl/~vincent/lecture6/sources/
amd-pacifica-specification.pdf. Last accessed on May 19, 2010.
[29] G. Neiger, A. Santoni, F. Leung, D. Rodgers, and R. Uhlig, “Intel® Virtualization Technology: Hardware Support for Efficient Processor Virtualization,” Intel® Technology Journal, vol. 10, pp. 167–178, Aug. 2006.
[30] G. Gerzon, “Intel® Virtualization Technology: Processor Virtualization Extensions and Intel® Trusted Execution Technology.” http://software.intel.com/file/1024. Last accessed on May 19, 2010.
[31] VirtualBox, “VirtualBox.” http://www.virtualbox.org. Last accessed on May 19, 2010.
[32] VirtualBox, “VirtualBox architecture.” http://www.virtualbox.org/wiki/VirtualBox_architecture. Last accessed on May 19, 2010.
[33] VMware, “VMware.” http://www.vmware.com. Last accessed on May 19, 2010.
[34] Xen, “Xen.” http://www.xen.org/. Last accessed on May 19, 2010.
[35] Xen, “Dom0 - Xen Wiki.” http://wiki.xensource.com/xenwiki/Dom0. Last accessed on May 19, 2010.
[36] KVM, “KVM.” http://www.linux-kvm.org. Last accessed on May 19, 2010.
[37] A. Shah, “Kernel-based virtualization with KVM,” Linux Magazine, vol. 86, pp. 37–39, 2008.
[38] F. Bellard, “QEMU, a fast and portable dynamic translator,” in ATEC ’05: Proceedings of the annual conference on USENIX Annual Technical Conference, (Berkeley, CA, USA), pp. 41–41, USENIX Association, 2005.
[39] I. Habib, “Virtualization with KVM,” Linux J., vol. 2008, no. 166, p. 8, 2008.
[40] H. C. Lauer and D. Wyeth, “A recursive virtual machine architecture,” in Proceedings of the workshop on virtual computer systems, (New York, NY, USA),
pp. 113–116, ACM, 1973.
[41] G. Belpaire and N.-T. Hsu, “Formal properties of recursive virtual machine architectures,” in SOSP ’75: Proceedings of the fifth ACM symposium on Operating systems principles, (New York, NY, USA), pp. 89–96, ACM, 1975.
[42] M. Baker and R. Buyya, “Cluster computing at a glance.” Cluster Computing,
Chapter 1, 1999.
[43] M. Baker, “Cluster computing white paper,” Tech. Rep. Version 2.0, IEEE Computer Society Task Force on Cluster Computing (TFCC), December 2000.
[44] M. L. Bote-Lorenzo, Y. A. Dimitriadis, and E. Gomez-Sanchez, “Grid characteristics and uses: a grid definition,” in Proceedings of the First European
Across Grids Conference (AG’03), vol. 2970 of Lecture Notes in Computer Science, (Heidelberg), pp. 291–298, Springer-Verlag, 2004.
[45] I. Foster, “What is the grid? a three point checklist,” 2002.
[46] I. Foster, Y. Zhao, I. Raicu, and S. Lu, “Cloud computing and grid computing
360-degree compared,” in Proceedings of the Grid Computing Environments
Workshop, 2008. GCE ’08, pp. 1–10, Nov. 2008.
[47] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, “Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering
computing as the 5th utility,” Future Gener. Comput. Syst., vol. 25, no. 6,
pp. 599–616, 2009.
[48] A. Kivity, “Avi Kivity’s blog.” http://avikivity.blogspot.com/. Last accessed on May 19, 2010.
[49] A. Graf and J. Roedel, “Add support for nested svm (kernel).” http://thread.
gmane.org/gmane.comp.emulators.kvm.devel/21119. Last accessed on May
19, 2010.
[50] Xen, “Xen Summit Asia at Intel 2009.” http://www.xen.org/xensummit/xensummit_fall_2009.html. Last accessed on May 19, 2010.
Appendices
Appendix A: Virtualization software
A.1 VirtualBox
VirtualBox is a hypervisor that performs full virtualization. It started as proprietary
software but currently comes under a Personal Use and Evaluation License (PUEL).
The software is free of charge for personal and educational use. VirtualBox was initially created by Innotek. This company was later purchased by Sun Microsystems,
which in turn was recently purchased by Oracle Corporation.
This section contains information that is almost completely extracted from the VirtualBox website; see [32] for more information. It is presented here to give extra information about the internals of VirtualBox.
The host operating system runs each VirtualBox virtual machine as an application. VirtualBox takes control of a large part of the computer, executing a complete OS with its own guest processes, drivers, and devices inside this virtual machine process. The host OS does not notice much of this; it only sees that an extra process is started. A virtual machine is thus just another process in the host operating system, which makes this implementation an example of a hosted hypervisor. Initially, VirtualBox used dynamic binary translation as the implementation approach for its hypervisor. However, with the release of hardware support for virtualization, it also provides Intel VT-x and AMD SVM support.
Upon starting VirtualBox, one extra process gets started: the VirtualBox “service” process VBoxSVC. This service runs in the background to keep track of all the processes involved, i.e. it keeps track of which virtual machines are running and what state they are in. It is automatically started by the first GUI process.
The guts of the VirtualBox implementation are hidden in a shared library, VBoxVMM.dll (VBoxVMM.so on Linux). This library contains all the complicated and messy details of virtualizing the x86 architecture. It can be considered a static “backend”, or black box. Around this backend, many frontends can be written without having to mess with the gory details of x86 virtualization. VirtualBox already comes with several frontends: the Qt GUI, a command-line utility VBoxManage, a “plain” GUI based on SDL, and remote interfaces.
The host operating system does not need much tweaking to support virtualization. A ring 0 driver is loaded in the host operating system for VirtualBox to work. This driver does not mess with the scheduling or process management of the host operating system. The entire guest OS, including its own hundreds of processes, is only scheduled when the host OS gives the VM process a timeslice. The ring 0 driver only performs a few specific tasks: allocating physical memory for the VM, saving and restoring CPU registers and descriptor tables when a host interrupt occurs while a guest's ring 3 code is executing (e.g. when the host OS wants to reschedule), switching from host ring 3 to guest context, and enabling or disabling VT-x etc. support.
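The ring 0 driver is not named in the description above; on typical Linux hosts it is exposed as the vboxdrv kernel module. The module name is an assumption on our part and not taken from [32], but its presence can be checked with a small sketch like the following:

# Sketch: check whether the VirtualBox ring 0 support driver is loaded on a
# Linux host. The module name "vboxdrv" is an assumption, not taken from [32].
if lsmod | grep -q "^vboxdrv"; then
    echo "VirtualBox ring 0 driver (vboxdrv) is loaded"
else
    echo "vboxdrv not loaded; try: sudo modprobe vboxdrv" >&2
fi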
When running a virtual machine, the computer can be in one of several states, from the processor's point of view:
1. The CPU can be executing host ring 3 code (e.g. from other host processes),
or host ring 0 code, just as it would be if VirtualBox was not running.
2. The CPU can be emulating guest code (within the ring 3 host VM process).
Basically, VirtualBox tries to run as much guest code natively as possible. But
it can (slowly) emulate guest code as a fallback when it is not sure what the
guest system is doing, or when the performance penalty of emulation is not
too high. The VirtualBox emulator is based on QEMU and typically steps in
when:
• guest code disables interrupts and VirtualBox cannot figure out when
they will be switched back on (in these situations, VirtualBox actually
analyzes the guest code using its own disassembler);
• for execution of certain single instructions; this typically happens when
a nasty guest instruction such as LIDT has caused a trap and needs to
be emulated;
• for any real mode code (e.g. BIOS code, a DOS guest, or any operating
system startup).
3. The CPU can be running guest ring 3 code natively (within the ring 3 host
VM process). With VirtualBox, we call this ”raw ring 3”. This is, of course,
the most efficient way to run the guest, and hopefully we don’t leave this mode
too often. The more we do, the slower the VM is compared to a native OS,
because all context switches are very expensive.
4. The CPU can be running guest ring 0 code natively. Here is where things
get tricky: the guest only thinks it’s running ring 0 code, but VirtualBox has
fooled the guest OS to instead enter ring 1 (which is normally unused with
x86 operating systems).
The guest operating system is thus manipulated to actually execute its ring 0
code in ring 1. This causes a lot of additional instruction faults, as ring 1 is not
allowed to execute any privileged instructions. With each of these faults, the hypervisor must step in and emulate the code to achieve the desired behavior. While this
normally works perfectly well, the resulting performance would be very poor since
CPU faults tend to be very expensive and there will be thousands and thousands of
them per second. To make things worse, running ring 0 code in ring 1 causes some
nasty occasional compatibility problems. Because of design flaws in the x86 architecture that were never addressed, some system instructions that should cause faults
when called in ring 1 unfortunately do not. Instead, they just behave differently. It
is therefore imperative that these instructions be found and replaced.
To address these two issues, VirtualBox has come up with a set of unique techniques that they call ”Patch Manager” (PATM) and ”Code Scanning and Analysis
Manager” (CSAM). Before executing ring 0 code, the code is scanned recursively
to discover problematic instructions. In-place patching is then performed, i.e. replacing the instruction with a jump to hypervisor memory where an integrated code
generator has placed a more suitable implementation. In reality, this is a very complex task as there are lots of odd situations to be discovered and handled correctly.
So, with its current complexity, one could argue that PATM is an advanced in-situ
recompiler.
In addition, every time a fault occurs, the fault’s cause is analyzed to determine
if it is possible to patch the offending code to prevent it from causing more expensive
faults in the future. This turns out to work very well, and it can reduce the faults
caused by their virtualization to a rate that performs much better than a typical
recompiler, or even VT-x technology, for that matter.
Appendix B: Details of the nested virtualization in practice
This chapter gives detailed information about the error messages and warnings that occurred during the tests in chapter 5. For each setup, information about the operating system and the hypervisor version is given. The setups are grouped into sections, each containing the setups that share the same bottom-layer hypervisor technique. Note that the setups with a nested hypervisor based on hardware support for x86 virtualization are left out because the nested hypervisor could not be installed.
B.1 Dynamic binary translation
B.1.1 VirtualBox
VirtualBox within VirtualBox
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit
The L1 guest booted and ran correctly using dynamic binary translation. The
image of the L2 guest was copied to the L1 guest. The L2 guest tried to start but
did not show any sign of activity. The guest showed the following output:
Boot from (hd0,0) ext3
Starting up ...
<HDD-ID>
This screen remained for several hours without aborting or continuing.
VMware Workstation within VirtualBox
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 6.5.3 build-185404
L2 guest: Ubuntu 9.04 32bit
The L1 guest booted and ran correctly using dynamic binary translation. The
image of the L2 guest was copied to the L1 guest. The L2 guest attempted to start
but the L1 guest crashed and showed the following error:
A critical error has occurred while running the virtual machine and the machine execution has been stopped.
For help, please see the Community section on http://www.virtualbox.org or your support contract. Please provide the contents of the log file VBox.log and the image file VBox.png, which you can find in the /home/olivier/.VirtualBox/Machines/vbox-vmware/Logs directory, as well as a description of what you were doing when this error happened. Note that you can also access the above files by selecting Show Log from the Machine menu of the main VirtualBox window.
Press OK if you want to power off the machine or press Ignore if you want to leave it as is for debugging. Please note that debugging requires special knowledge and tools, so it is recommended to press OK now.
The log of the L1 guest showed:
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0152610
PATM: patmR3RefreshPatch: succeeded to refresh patch at c013f250
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0502650
PATM: Disabling IDT ef patch handler c01052f0
PATM: patmR3RefreshPatch: succeeded to refresh patch at c01052f0
PIIX3 ATA: Ctl#0: RESET, DevSel=0 AIOIf=0 CmdIf0=0x20 (-1 usec ago) CmdIf1=0x00 (-1 usec ago)
PIIX3 ATA: Ctl#0: finished processing RESET
PATM: patmR3RefreshPatch: succeeded to refresh patch at c015b7a0
PIIX3 ATA: Ctl#1: RESET, DevSel=0 AIOIf=0 CmdIf0=0x00 (-1 usec ago) CmdIf1=0x00 (-1 usec ago)
PIIX3 ATA: Ctl#1: finished processing RESET
PATM: Disable block at c0745963 - write c07459b5-c07459b9
PATM: Disable block at c0745bab - write c0745c0f-c0745c13
PATM: Disable block at c0746d22 - write c0746d8f-c0746d93
PATM: Disable block at c0763a90 - write c0763aed-c0763af1
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0184590
PCNet#0: Init: ss32=1 GCRDRA=0x36596000[32] GCTDRA=0x36597000[16]
fatal error in recompiler cpu: triple fault
Xen within VirtualBox
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L2 hypervisor: xen-3.0-x86 32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475
The L1 guest booted and crashed almost immediately. The L1 guest showed the
following error:
A critical error has occurred while running the virtual machine and the machine execution has been stopped.
For help, please see the Community section on http://www.virtualbox.org or your support contract. Please provide the contents of the log file VBox.log and the image file VBox.png, which you can find in the /home/olivier/.VirtualBox/Machines/vbox-vmware/Logs directory, as well as a description of what you were doing when this error happened. Note that you can also access the above files by selecting Show Log from the Machine menu of the main VirtualBox window.
Press OK if you want to power off the machine or press Ignore if you want to leave it as is for debugging. Please note that debugging requires special knowledge and tools, so it is recommended to press OK now.
The log output was similar to the “VirtualBox within VMware Workstation”
output (see subsection B.1.2).
B.1.2 VMware Workstation
VirtualBox within VMware Workstation
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VMware Workstation 7.0.1 build-227600
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit
The L1 guest booted and ran correctly using dynamic binary translation. The
image of the L2 guest was copied to the L1 guest. When starting the L2 guest, the
bootloader showed
Boot from (hd0,0) ext3
Starting up ...
<HDD-ID>
and afterwards VirtualBox aborted the start and showed the following message:
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0500670
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0500560
PATM: Failed to refresh dirty patch at c013f370. Disabling it.
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0154bc0
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0109d00
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0122d50
PATM: patmR3RefreshPatch: succeeded to refresh patch at c05006d0
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0109c80
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0180ac0
PATM: patmR3RefreshPatch: succeeded to refresh patch at c015a040
PATM: patmR3RefreshPatch: succeeded to refresh patch at c013f980
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0154880
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0154a00
PATM: patmR3RefreshPatch: succeeded to refresh patch at c01323c0
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0151600
PATM: patmR3RefreshPatch: succeeded to refresh patch at c017ed00
PATM: Failed to refresh dirty patch at c013f370. Disabling it.
PATM: patmR3RefreshPatch: succeeded to refresh patch at c01051d0
PIT: mode=0 count=0x10000 (65536) - 18.20 Hz (ch=0)

!!Assertion Failed!!
Expression: pOrgInstrGC
Location  : /home/vbox/vbox-3.1.6/src/VBox/VMM/PATM/VMMAll/PATMAll.cpp(159) void PATMRawLeave(VM*, CPUMCTXCORE*, int)
VMware Workstation within VMware Workstation
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VMware Workstation 7.0.1 build-227600
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 6.5.3 build-185404
L2 guest: Ubuntu 9.04 32bit
The L1 guest booted and ran correctly using dynamic binary translation. The
image of the L2 guest was copied to the L1 guest. The L2 guest tried to start but
did not show any activity. The screen stayed black and nothing happened, even
after several hours.
Xen within VMware Workstation
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VMware Workstation 7.0.1 build-227600
L2 hypervisor: xen-3.0-x86 32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475
The L1 guest booted and ran the Xen hypervisor correctly using dynamic binary translation. The image of the L2 guest was copied to the L1 guest. The paravirtualized L2 guest booted rather slowly, but a user could log in and use the nested guest.
B.2 Paravirtualization
None of the setups with paravirtualization as the bottom layer worked. The configurations of the setups are given, but as explained in section 5.1.2, the problem is that nested hypervisors need modifications.
VirtualBox within Xen
L1 hypervisor: xen-3.0-x86 32p
Domain 0 (L1): openSUSE 11.3 build 0475
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
VMware Workstation within Xen
L1 hypervisor: xen-3.0-x86 32p
Domain 0 (L1): openSUSE 11.3 build 0475
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 6.5.3 build-185404
Xen within Xen
L1 hypervisor: xen-3.0-x86 32p
Domain 0 (L1): openSUSE 11.3 build 0475
L2 hypervisor: xen-3.0-x86 32p
B.3 First generation hardware support
In this section, the results are given for the setups with an L1 hypervisor based on hardware support for x86 virtualization. All tests were conducted on a processor that only has first generation hardware support, namely an Intel® Core™ 2 Quad Q9550. The following subsections combine the setups that use the same nested hypervisor technique.
B.3.1 Dynamic binary translation
VirtualBox within VirtualBox
Host operating system: Ubuntu 9.10 64bit or Ubuntu 9.04 32bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit
The L2 guest hung in this setup. It did not crash and showed no new information in the log for several hours; it was probably just very slow.
VMware Workstation within VirtualBox
Host operating system: Ubuntu 9.10 64bit or Ubuntu 9.04 32bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 6.5.3 build-185404
L2 guest: Ubuntu 9.04 32bit
This setup was rather unstable; only one configuration allowed booting the nested guest. The setup was tested on Ubuntu 9.10 64bit and Ubuntu 9.04 32bit host operating systems, each with a graphical user interface and with a text based environment. The test on the Ubuntu 9.04 32bit operating system with the graphical user interface for the L1 and L2 guest was the only one that worked.
VirtualBox within VMware Workstation
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VMware Workstation 6.5.3 build-185404
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit
Both L1 and L2 guest were able to boot and run correctly. However, in order to
get the L2 guest started, one must append the line
monitor_control.restrict_backdoor = "TRUE"
to the configuration file (.vmx) of the L1 guest. This allowed running a hypervisor
within the L1 guest. Upon starting the L2 guest, the L1 guest displayed the following
warning:
The virtual machine's operating system has attempted to enable promiscuous mode on adapter Ethernet0. This is not allowed for security reasons. Please go to the Web page http://vmware.com/info?id=161 for help enabling the promiscuous mode in the virtual machine.
One can get around this message by starting VirtualBox as root instead of a
normal user.
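The two workarounds above can be summarized in the following sketch; the .vmx path is a hypothetical example and must be adjusted to the actual configuration file of the L1 guest.

# On the host, before powering on the L1 guest: allow a hypervisor to run
# inside it (the .vmx path below is a hypothetical example).
echo 'monitor_control.restrict_backdoor = "TRUE"' >> ~/vmware/L1-guest/L1-guest.vmx

# Inside the L1 guest: start VirtualBox as root so that the promiscuous mode
# warning does not block the L2 guest.
sudo VirtualBox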
VMware Workstation within VMware Workstation
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VMware Workstation 6.5.3 build-185404
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 6.5.3 build-185404
L2 guest: Ubuntu 9.04 32bit
Both L1 and L2 guest were able to boot and run correctly. However, in order to
get the L2 guest started, one must append the line
monitor_control.restrict_backdoor = "TRUE"
to the configuration file (.vmx) of the L1 guest. This allowed running a hypervisor
within the L1 guest. Upon starting the L2 guest, the L1 guest displayed the following
warning:
The virtual machine's operating system has attempted to enable promiscuous mode on adapter Ethernet0. This is not allowed for security reasons. Please go to the Web page http://vmware.com/info?id=161 for help enabling the promiscuous mode in the virtual machine.
One can get around this message by starting the L2 VMware Workstation as
root instead of a normal user.
VirtualBox within Xen
L1 hypervisor: xen-3.0-x86 32p
Domain 0 (L1): openSUSE 11.3 build 0475
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit
The L2 guest hung while booting inside the Xen guest. It was probably just very slow, since it stayed unresponsive for several hours.
VMware Workstation within Xen
L1 hypervisor: xen-3.0-x86 32p
Domain 0 (L1): openSUSE 11.3 build 0475
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 7.0.1 build-227600
L2 guest: Ubuntu 9.04 32bit
The L2 guest did not start because VMware Workstation checks whether there
is an underlying hypervisor. VMware Workstation noticed that there was a Xen
hypervisor running and displayed the following message:
You a r e r u n n i n g VMware W o r k s t a t i o n v i a an i n c o m p a t i b l e h y p e r v i s o r . You may
not power on a v i r t u a l machine u n t i l t h i s h y p e r v i s o r i s d i s a b l e d .
VirtualBox within KVM
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: kvm-kmod 2.6.32-9
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit
The L1 guest booted and ran correctly. Upon starting the L2 guest, the following
error was shown:
Boot from (hd0,0) ext3
Starting up ...
<HDD-ID>
...
[    3.982884] Kernel panic - not syncing: Attempted to kill init!
VMware Workstation within KVM
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: kvm-kmod 2.6.32-9
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 6.5.3 build-185404
L2 guest: Ubuntu 9.04 32bit
The setup worked for newer versions of KVM. In older versions and in the developer version (kvm-88), the L2 guest hung during start-up, while with the newer version the L2 guest booted and ran successfully. Note that newer versions of VMware Workstation check whether there is an underlying hypervisor and refuse to boot a virtual machine if there is. With the newest version, VMware Workstation 7.0.1 build-227600, this setup therefore no longer worked.
B.3.2 Paravirtualization
All four setups could successfully nest a paravirtualized guest inside the L1 guest.
However, the setup where Xen is nested inside VirtualBox was not very stable.
Xen within VirtualBox
Host operating system: Ubuntu 9.10 64bit or Ubuntu 9.04 32bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L2 hypervisor: xen-3.0-x86 32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475
Sometimes several segmentation faults occurred during the start-up of domain 0. Domain 0 was able to boot and run successfully, but the creation of another paravirtualized guest was sometimes impossible: Xen reported that the guest was created, but it did not show up in the list of virtual machines, indicating that the guest crashed immediately. The setup worked most of the time on the Ubuntu 9.04 32bit host operating system; on the Ubuntu 9.10 64bit host, there were always segmentation faults.
Xen within VMware Workstation
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VMware Workstation 7.0.1 build-227600
L2 hypervisor: xen-3.0-x86 32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475
Both L1 and L2 guest were able to boot and run correctly.
Xen within Xen
L1 hypervisor: xen-3.0-x86 32p
Domain 0 (L1): openSUSE 11.3 build 0475
L2 hypervisor: xen-3.0-x86 32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475
Both L1 and L2 guest were able to boot and run correctly.
Xen within KVM
Host operating system: Ubuntu 9.10 64bit or Ubuntu 9.04 32bit
L1 hypervisor: kvm-kmod 2.6.32-9
L2 hypervisor: xen-3.0-x86 32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475
Both L1 and L2 guest were able to boot and run correctly.
B.4 Second generation hardware support
This section summarizes the results of the setups with an L1 hypervisor based on hardware support for x86 virtualization. The tests were conducted on a newer processor with second generation hardware support, namely an Intel® Core™ i7-860. Only the setups whose outcome differed from section B.3 are given; these are the setups that worked with second generation hardware support but did not work with the first generation. All other setups produced the same output.
B.4.1 Dynamic binary translation
VirtualBox within VirtualBox
Host operating system: Ubuntu 9.10 64bit or Ubuntu 9.04 32bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit
Both L1 and L2 guest were able to boot and run correctly. In the test result of
section B.3, the L2 guest hung.
VirtualBox within Xen
L1 hypervisor: xen-3.0-x86 32p
Domain 0 (L1): openSUSE 11.3 build 0475
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit
Both L1 and L2 guest were able to boot and run correctly. With only the use of
first generation hardware support (section B.3), the L2 guest hung.
VirtualBox within KVM
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: kvm-kmod 2.6.32-9
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit
Both L1 and L2 guest were able to boot and run correctly. The L2 guest showed
a kernel panic message when only using first generation hardware support, as shown
in section B.3.
B.4.2 Paravirtualization
Xen within VirtualBox
Host operating system: Ubuntu 9.10 64bit or Ubuntu 9.04 32bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L2 hypervisor: xen-3.0-x86 32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475
Both L1 and L2 guest were able to boot and run correctly. In the test result of
section B.3, domain 0 displayed some segmentation faults.
B.5 KVM’s nested SVM support
Host operating system: Ubuntu 9.04 server 64bit
L1 hypervisor: kvm-kmod 2.6.33
L1 guest: Ubuntu 9.04
L2 hypervisor: kvm-kmod 2.6.33
L2 guest: Ubuntu 9.04
After installing the L1 hypervisor, its kernel module must be loaded with the argument "nested=1"; a sketch of how this can be done is given below.
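The following is a minimal sketch, assuming an AMD host with the kvm-kmod 2.6.33 modules installed and the nesting parameter exposed under /sys as in mainline KVM:

# Sketch: reload the KVM modules with nested SVM enabled (assumes an AMD host
# with the kvm-kmod 2.6.33 modules installed).
sudo modprobe -r kvm_amd kvm
sudo modprobe kvm
sudo modprobe kvm_amd nested=1

# Verify that nesting is enabled; this is expected to print 1 (or Y,
# depending on the module version).
cat /sys/module/kvm_amd/parameters/nested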
The L1 guest booted and ran perfectly. The installation of the L2 hypervisor within
the L1 guest was successful. No special actions were required for installing the L2
hypervisor. When booting the L2 hypervisor, the L1 guest showed the following
messages:
[   16.712047] handle_exit: unexpected exit_ini_info 0x80000008 exit_code 0x60
[   31.432032] handle_exit: unexpected exit_ini_info 0x80000008 exit_code 0x60
[   34.468058] handle_exit: unexpected exit_ini_info 0x80000008 exit_code 0x60
Patches that fix these messages exist but were not yet released because they need more testing.¹
¹ http://www.mail-archive.com/[email protected]/msg31096.html
Appendix C: Details of the performance tests
This chapter gives some detailed information about the performance tests that were
executed in chapter 6. The benchmarks used for the tests are sysbench, iperf
and iozone. Each section lists the tests that were executed for the corresponding
benchmark.
C.1 sysbench
#!/bin/bash
#
# The sysbench tests
#
# @author Olivier Berghmans
#

if [ $# -ne 1 ]; then
    echo "Usage: $0 <prefix>"
    echo "with <prefix> the prefix for the output files"
    exit
fi

PREFIX=$1
OUTPUT="###"

date
echo ${OUTPUT} "sysbench CPU Test"
sysbench --num-threads=10 --max-requests=10000 --test=cpu --cpu-max-prime=150000 run > ${PREFIX}_cpu.txt
date
echo ${OUTPUT} "sysbench Memory Test Write"
sysbench --num-threads=10 --test=memory --memory-block-size=256 --memory-total-size=2G --memory-scope=local --memory-hugetlb=off --memory-oper=write --memory-access-mode=seq run > ${PREFIX}_mem_write_seq.txt
sysbench --num-threads=10 --test=memory --memory-block-size=256 --memory-total-size=2G --memory-scope=local --memory-hugetlb=off --memory-oper=write --memory-access-mode=rnd run > ${PREFIX}_mem_write_rnd.txt
date
echo ${OUTPUT} "sysbench Memory Test Read"
sysbench --num-threads=10 --test=memory --memory-block-size=256 --memory-total-size=2G --memory-scope=local --memory-hugetlb=off --memory-oper=read --memory-access-mode=seq run > ${PREFIX}_mem_read_seq.txt
sysbench --num-threads=10 --test=memory --memory-block-size=256 --memory-total-size=2G --memory-scope=local --memory-hugetlb=off --memory-oper=read --memory-access-mode=rnd run > ${PREFIX}_mem_read_rnd.txt
date
echo ${OUTPUT} "sysbench Thread Test"
sysbench --num-threads=1000 --max-requests=10000 --test=threads --thread-yields=10000 --thread-locks=4 run > ${PREFIX}_threads.txt
date
echo ${OUTPUT} "sysbench Mutex Test"
sysbench --num-threads=5000 --max-requests=10000 --test=mutex --mutex-num=4096 --mutex-locks=500000 --mutex-loops=10000 run > ${PREFIX}_mutex.txt
date
echo ${OUTPUT} "sysbench File io Test"
sysbench --num-threads=16 --test=fileio --file-num=256 --file-block-size=16K --file-total-size=512M --file-test-mode=rndrw --file-io-mode=sync prepare
sysbench --num-threads=16 --test=fileio --file-num=256 --file-block-size=16K --file-total-size=512M --file-test-mode=rndrw --file-io-mode=sync run > ${PREFIX}_fileio.txt
sysbench --num-threads=16 --test=fileio --file-num=256 --file-block-size=16K --file-total-size=512M --file-test-mode=rndrw --file-io-mode=sync cleanup
date
echo ${OUTPUT} "sysbench MySQL Test"
sysbench --num-threads=4 --max-requests=10000 --test=oltp --oltp-table-size=1000000 --mysql-table-engine=innodb --mysql-user=root --mysql-password=root --oltp-test-mode=complex prepare
sysbench --num-threads=4 --max-requests=10000 --test=oltp --oltp-table-size=1000000 --mysql-table-engine=innodb --mysql-user=root --mysql-password=root --oltp-test-mode=complex run > ${PREFIX}_oltp.txt
sysbench --num-threads=4 --max-requests=10000 --test=oltp --oltp-table-size=1000000 --mysql-table-engine=innodb --mysql-user=root --mysql-password=root --oltp-test-mode=complex cleanup
date
echo ${OUTPUT} "packing Results"
tar czf ${PREFIX}.tgz ${PREFIX}*.txt
rm -f ${PREFIX}*.txt
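As a usage sketch, assuming the script above is saved as sysbench_tests.sh and that sysbench and a MySQL server with the credentials used above are available, a run on the native platform could look as follows:

# Hypothetical invocation; "native" is the prefix for the output files.
chmod +x sysbench_tests.sh
./sysbench_tests.sh native    # produces native_cpu.txt, native_mem_*.txt, ... packed into native.tgz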
C.2 iperf
This benchmark consists of a server and a client. The server runs on a separate computer in the network with the command:
iperf -s
The test machines connect with the server by running the command:
iperf -c <hostname>
C.3 iozone
The iozone benchmark tests the performance of writing and reading a file. The commands used for running the benchmark natively, in an L1 guest and in a nested guest are the following:
native$  iozone -a -g 16G -i 0 -i 1
L1guest$ iozone -a -g 4G -i 0 -i 1
L2guest$ iozone -a -g 2G -i 0 -i 1
The “-g” option specifies the maximum file size used in the tests. The test on the native platform uses 16 GB since the physical memory of the computer system was 8 GB. The physical memory of the L1 guest was 2 GB and that of the L2 guest was 512 MB.