Nesting Virtual Machines in Virtualization Test Frameworks
Dissertation submitted in May 2010 to the Department of Mathematics and Computer Science of the Faculty of Sciences, University of Antwerp, in partial fulfillment of the requirements for the degree of Master of Science.

Author: Olivier Berghmans
Supervisor: Prof. Dr. Jan Broeckhove
Co-supervisor: Dr. Kurt Vanmechelen
Mentors: Sam Verboven & Ruben Van den Bossche
Research Group Computational Modelling and Programming

Contents

List of Figures
List of Tables
Nederlandstalige samenvatting
Preface
Abstract

1 Introduction
  1.1 Goals
  1.2 Outline

2 Virtualization
  2.1 Applications
  2.2 Taxonomy
    2.2.1 Process virtual machines
    2.2.2 System virtual machines
  2.3 x86 architecture
    2.3.1 Formal requirements
    2.3.2 The x86 protection level architecture
    2.3.3 The x86 architecture problem

3 Evolution of virtualization for the x86 architecture
  3.1 Dynamic binary translation
    3.1.1 System calls
    3.1.2 I/O virtualization
    3.1.3 Memory management
  3.2 Paravirtualization
    3.2.1 System calls
    3.2.2 I/O virtualization
    3.2.3 Memory management
  3.3 First generation hardware support
  3.4 Second generation hardware support
  3.5 Current and future hardware support
  3.6 Virtualization software
    3.6.1 VirtualBox
    3.6.2 VMware
    3.6.3 Xen
    3.6.4 KVM
    3.6.5 Comparison between virtualization software

4 Nested virtualization
  4.1 Dynamic binary translation
  4.2 Paravirtualization
  4.3 Hardware supported virtualization

5 Nested virtualization in practice
  5.1 Software solutions
    5.1.1 Dynamic binary translation
    5.1.2 Paravirtualization
    5.1.3 Overview software solutions
  5.2 First generation hardware support
    5.2.1 Dynamic binary translation
    5.2.2 Paravirtualization
    5.2.3 Hardware supported virtualization
    5.2.4 Overview first generation hardware support
  5.3 Second generation hardware support
    5.3.1 Dynamic binary translation
    5.3.2 Paravirtualization
    5.3.3 Hardware supported virtualization
    5.3.4 Overview second generation hardware support
  5.4 Nested hardware support
    5.4.1 KVM
    5.4.2 Xen

6 Performance results
  6.1 Processor performance
  6.2 Memory performance
  6.3 I/O performance
    6.3.1 Network I/O
    6.3.2 Disk I/O
  6.4 Conclusion

7 Conclusions
  7.1 Nested virtualization and performance results
  7.2 Future work

Appendices
Appendix A Virtualization software
  A.1 VirtualBox
Appendix B Details of the nested virtualization in practice
  B.1 Dynamic binary translation
    B.1.1 VirtualBox
    B.1.2 VMware Workstation
  B.2 Paravirtualization
  B.3 First generation hardware support
    B.3.1 Dynamic binary translation
    B.3.2 Paravirtualization
  B.4 Second generation hardware support
    B.4.1 Dynamic binary translation
    B.4.2 Paravirtualization
  B.5 KVM's nested SVM support
Appendix C Details of the performance tests
  C.1 sysbench
  C.2 iperf
  C.3 iozone

List of Figures

2.1 Implementation layers in a computer system.
2.2 Taxonomy of virtual machines.
2.3 The x86 protection levels.
3.1 Memory management in x86 virtualization using shadow tables.
3.2 Execution flow using virtualization based on Intel VT-x.
3.3 Latency reductions by CPU implementation [30].
4.1 Layers in a nested virtualization setup with hosted hypervisors.
4.2 Memory architecture in a nested situation.
5.1 Layers for nested paravirtualization in dynamic binary translation.
5.2 Layers for nested Xen paravirtualization.
5.3 Layers for nested dynamic binary translation in paravirtualization.
5.4 Layers for nested dynamic binary translation in a hypervisor based on hardware support.
5.5 Layers for nested paravirtualization in a hypervisor based on hardware support.
5.6 Nested virtualization architecture based on hardware support.
5.7 Execution flow in nested virtualization based on hardware support.
6.1 CPU performance for native with four cores and L1 guest with one core.
6.2 CPU performance for native, L1 and L2 guest with four cores.
6.3 CPU performance for L1 and L2 guests with one core.
6.4 Memory performance for L1 and L2 guests.
6.5 Threads performance for native, L1 guests and L2 guests with the sysbench benchmark.
6.6 Network performance for native, L1 guests and L2 guests.
6.7 File I/O performance for native, L1 guests and L2 guests with the sysbench benchmark.
6.8 File I/O performance for native, L1 guests and L2 guests with the iozone benchmark.

List of Tables

3.1 Comparison between a selection of the most popular hypervisors.
5.1 Index table indicating the subsections in which information about a certain nested setup can be found.
5.2 The nesting setups with dynamic binary translation as the L1 hypervisor technique.
5.3 The nesting setups with paravirtualization as the L1 hypervisor technique.
5.4 Overview of the nesting setups with a software solution as the L1 hypervisor technique.
5.5 The nesting setups with first generation hardware support as the L1 hypervisor technique and DBT as the L2 hypervisor technique.
5.6 The nesting setups with first generation hardware support as the L1 hypervisor technique and PV as the L2 hypervisor technique.
5.7 The nesting setups with first generation hardware support as the L1 and L2 hypervisor technique.
5.8 Overview of the nesting setups with first generation hardware support as the L1 hypervisor technique.
5.9 The nesting setups with second generation hardware support as the L1 hypervisor technique and DBT as the L2 hypervisor technique.
5.10 The nesting setups with second generation hardware support as the L1 hypervisor technique and PV as the L2 hypervisor technique.
5.11 The nesting setups with second generation hardware support as the L1 and L2 hypervisor technique.
5.12 Overview of the nesting setups with second generation hardware support as the L1 hypervisor technique.
5.13 Overview of all nesting setups.
Nederlandstalige samenvatting (Dutch summary)

Virtualization has grown into a widespread technology used to abstract, combine or partition computing resources. Requests for these resources thereby depend only minimally on the underlying physical layer. The x86 architecture was not designed with virtualization in mind and contains a number of non-virtualizable instructions. Several software solutions and hardware support have provided an answer to this. The growing number of applications means that users increasingly want to adopt virtualization. Among other things, the need for complete physical setups for research purposes can be avoided by using virtualization. To virtualize components that may themselves use virtualization, it must be possible to nest virtual machines inside one another. Little information about nested virtualization was available, and this dissertation examines in depth what is possible with current techniques. We test the nesting of hypervisors based on the different virtualization techniques: dynamic binary translation, paravirtualization and hardware support. For hardware support, a distinction was made between first generation and second generation hardware support. Successful nested setups use software solutions for the second hypervisor and hardware support for the first hypervisor. Only one working nested setup uses a software solution for both. Benchmarks were run to determine whether working nested setups perform well. Processor, memory and I/O performance were tested and compared across the different levels of virtualization. We found that nested virtualization works for certain setups, especially with a software solution on top of a hardware-supported hypervisor. Setups with hardware support for the upper hypervisor are not yet possible. Nested hardware support will become available soon, but for now the only option is to use a software solution for the upper hypervisor. The benchmark results showed that the performance of nested setups is promising.

Preface

In this section I will give some insight into the creation of this thesis. It was submitted in partial fulfillment of the requirements for a Master's degree in Computer Science. I have always been fascinated by virtualization, and during the presentation of open thesis subjects I stumbled upon the subject of nested virtualization. Right from the start I found the subject very interesting, so I made an appointment for more information and I eventually got it! I had already used some virtualization software but I did not know much about the underlying techniques. During the first semester I followed a course on virtualization, which helped me to learn the fundamentals. It took time to become familiar with the installation and use of the different virtualization packages. At first, it took a long time to test one nested setup and it seemed that all I was doing was installing operating systems in virtual machines. Predefined images can save a lot of work, but I had to find this out the hard way!
But even with these predefined images, a nested setup can take a long time to test and re-test, since there are so many possible configurations. After the first series of tests, I was quite disappointed with the results. Due to some setbacks in December and January, I also fell behind schedule, leading to a hard second semester. It was hard combining this thesis with other courses and with extracurricular responsibilities during this second semester. I am pleased that I got back on track and finished the thesis on time! This would not have been possible without the help of the people around me. I want to thank my girlfriend Anneleen Wislez for supporting me, not only during this year but during the last few years. She also helped me with creating the figures for this thesis and reading the text. Further, I would like to show appreciation to my mentors Sam Verboven and Ruben Van den Bossche for always pointing me in the right direction and for their help during this thesis. Additionally, I also want to thank my supervisor Prof. Dr. Jan Broeckhove and co-supervisor Dr. Kurt Vanmechelen for giving me the opportunity to write this thesis. A special thank you goes out to all my fellow students, and especially to Kristof Overdulve, for the interesting conversations and the laughter during the past years. And last but not least I want to thank my parents and sister for supporting me throughout my education; my dad for offering support by buying his new computer earlier than planned and lending it to me so I could do a second series of tests on a new processor, and my mom for the excellent care and interest in what I was doing.

Abstract

Virtualization has become a widespread technology that is used to abstract, combine or divide computing resources to allow resource requests to be described and fulfilled with minimal dependence on the underlying physical delivery. The x86 architecture was not designed with virtualization in mind and contains certain non-virtualizable instructions. This has resulted in the emergence of several software solutions and has led to the introduction of hardware support. The expanding range of applications ensures that users increasingly want to use virtualization. Among other things, the need for entire physical setups for research purposes can be avoided by using virtualization. For components that already use virtualization, executing a virtual machine inside a virtual machine is necessary; this is called nested virtualization. There has been little related work on nested virtualization, and this thesis elaborates on what is possible with current techniques. We tested the nesting of hypervisors based on the different virtualization techniques. The techniques that were used are dynamic binary translation, paravirtualization and hardware support. For hardware support, a distinction was made between first generation and second generation hardware support. Successful nested setups use a software solution for the inner hypervisor and hardware support for the bottom layer hypervisor. Only one working nested setup uses software solutions for both hypervisors. Performance benchmarks were conducted to find out if the performance of working nested setups is reasonable. The performance of the processor, the memory and I/O was tested and compared across the different levels of virtualization. We found that nested virtualization on the x86 architecture works for certain setups, especially with a software solution on top of a hardware supported hypervisor.
Setups with hardware support for the inner hypervisor are not yet possible. Nested hardware support will become available soon, but until then the only option is the use of a software solution for the inner hypervisor. The results of the performance benchmarks showed that the performance of the nested setups is promising.

CHAPTER 1 Introduction

Within the research surrounding grid and cluster computing there are many developments at different levels that make use of virtualization. Virtualization can be used for all, or a selection of, the components in grid or cluster middleware. Grids or clusters also use virtualization to run separate applications in a sandbox environment. Both developments bring advantages concerning security, fault tolerance, legacy support, isolation, resource control, consolidation, etc. Complete test setups are not available or desirable for many development and research purposes. If certain performance limitations do not pose a problem, virtualization of all components in a system can avoid the need for physical grid or cluster setups. This thesis focusses on the latter: the consolidation of several physical cluster machines by virtualizing them on a single physical machine. The virtualization of cluster machines that use virtualization themselves leads to a combination of the above-mentioned levels.

1.1 Goals

The goal of this thesis is to find out whether different levels of virtualization are possible with current virtualization techniques. The research question is whether nested virtualization works on the x86 architecture. In cases where nested virtualization works, we want to find out what the performance degradation is compared to a single level of virtualization or to a native solution. For cases where nested virtualization does not work, we search for the reasons for the failure and for what needs to change in order for it to work. The experiments are conducted with some of the most popular virtualization software to find an answer to the posed question.

1.2 Outline

The outline of this thesis is as follows. Chapter 2 contains an introduction to virtualization: a brief history of virtualization is given, followed by a few definitions and a taxonomy of virtualization in general. The chapter ends with the formal requirements needed for virtualization on a computer architecture and how the x86 architecture compares to these requirements. Chapter 3 describes the evolution of virtualization for the x86 architecture. Virtualization software first used software techniques; at a later stage processor vendors provided hardware support for virtualization. The last section of the chapter provides an overview of a selection of the most popular virtualization software. Chapter 4 provides a theoretical view of the requirements for nested virtualization on the x86 architecture. For each technique described in chapter 3, a detailed explanation of the theoretical requirements gives more insight into whether nested virtualization can work for the given technique. Chapter 5 investigates the actual nesting of virtual machines using some of the most popular virtualization software solutions. The different virtualization techniques are combined to get an overview of which nested setup works best. Chapter 6 presents performance results of the working nested setups from chapter 5. System benchmarks are executed on each setup and the results are compared. Chapter 7 summarizes the results of this thesis and gives directions for future work.
CHAPTER 2 Virtualization

In recent years virtualization has become a widespread technology that is used to abstract, combine or divide computing resources to allow resource requests to be described and fulfilled with minimal dependence on the underlying physical delivery. The first traces of virtualization date back to the 1960s [1, 2], in research projects that provided concurrent, interactive access to mainframes. Each virtual machine (VM) gave the user the illusion of working directly on a physical machine. By partitioning the system into virtual machines, multiple users could use the system concurrently, each within their own operating system. The projects provided an elegant way to enable time- and resource-sharing on expensive mainframes. Users could execute, develop and test applications within their own virtual machine without interfering with other users. At that time, virtualization was used to reduce the cost of acquiring new hardware and to improve productivity by letting more users work simultaneously. In the late 1970s and early 1980s virtualization became unpopular because of the introduction of cheaper hardware and multiprocessing operating systems. The popular x86 architecture lacked the power to run multiple operating systems at the same time, but since this hardware was so cheap, a dedicated machine was used for each separate application. The use of these dedicated machines led to a decrease in the use of virtualization. The ideas of virtualization became popular again in the late 1990s with the emergence of a wide variety of operating systems and hardware configurations. Virtualization was used for executing a series of applications, targeted at different hardware or operating systems, on a given machine. Instead of buying dedicated machines and operating systems for each application, the use of virtualization on one machine offers the ability to create virtual machines that are able to run these applications. Virtualization concepts can be used in many areas of computer science. Large variations in the abstraction level and underlying architecture lead to many definitions of virtualization. In "A survey on virtualization technologies", S. Nanda and T. Chiueh define virtualization by the following relaxed definition [1]:

Definition 2.1 Virtualization is a technology that combines or divides computing resources to present one or many operating environments using methodologies like hardware and software partitioning or aggregation, partial or complete machine simulation, emulation, time-sharing, and many others.

The definition mentions the aggregation of resources, but in this context the focus lies on the partitioning of resources. Throughout the rest of this thesis, virtualization provides infrastructure used to abstract lower-level, physical resources and to create multiple independent and isolated virtual machines.

2.1 Applications

The expanding range of computer applications and their varied requirements for hardware and operating systems increase the need for users to start using virtualization. Most people will already have used virtualization without realizing it, because there are many applications where virtualization can be used in some form. This section elaborates on some practical applications where virtualization can be used. S. Nanda and T. Chiueh enumerate some of these applications in "A survey on virtualization technologies", but the list is not complete and one can easily think of other applications [1].
A first practical application that benefits from using virtualization is server consolidation [3]. It allows system administrators to consolidate the workloads of multiple under-utilized machines onto a few powerful machines. This saves hardware, management, administration of the infrastructure, space, cooling and power. A second application that also involves consolidation is application consolidation. A legacy application might require faster and newer hardware but might also require a legacy operating system. The need for such legacy applications can be served well by virtualizing the newer hardware. Virtual machines can be used to provide secure, isolated environments for running foreign or less-trusted applications. This form of sandboxing can help build secure computing platforms. Besides sandboxing, virtualization can also be used for debugging purposes. It can help debug complicated software such as operating systems or device drivers by letting the user execute them on an emulated PC with full software controls. Moreover, virtualization can help produce arbitrary test scenarios that are hard to produce in reality, and thus eases the testing of software. Virtualization provides the ability to capture the entire state of a running virtual machine, which creates new management possibilities. Saving the state of a virtual machine, also called a snapshot, offers the user the capability to roll back to the saved state when, for example, a crash occurs in the virtual machine. The saved state can also be used to package an application together with its required operating system; this is often called an "appliance". This eases the installation of that application on a new server, lowering the entry barrier for its use. Another advantage of snapshots is that the user can copy the saved state to other physical servers and use the new instance of the virtual machine without having to install it from scratch. This is useful for migrating virtual machines from one physical server to other physical servers when needed. Another practical application is the use of virtualization within distributed network computing systems [4]. Such a system must deal with the complexity of decoupling local administration policies and configuration characteristics of distributed resources from the quality of service expected by end users. Virtualization can simplify or eliminate this complex decoupling because it offers functionality like consolidation of physical resources, security and isolation, flexibility and ease of management. It is not difficult to see that the practical applications given in this section are just a few examples of the many possible uses for virtualization. The number of possible advantages that virtualization can provide continues to rise, making it more and more popular.

2.2 Taxonomy

Virtual machines can be divided into two main categories, namely process virtual machines and system virtual machines. In order to describe the differences, this section starts with an overview of the different implementation layers in a computer system, followed by the characteristics of process virtual machines. Finally, the characteristics of system virtual machines are explained. Most information in this section is derived from the book "Virtual machines: Versatile platforms for systems and processes" by J. E. Smith and R. Nair [5].

Figure 2.1: Implementation layers in a computer system.
The complexity in computer systems is tackled by the division into levels of abstraction separated by well-defined interfaces. Implementation details at lower levels are ignored or simplified by introducing levels of abstraction. In both the hardware and the software of a computer system, the levels of abstraction correspond to implementation layers. A typical computer system architecture consists of several implementation layers. Figure 2.1 shows the key implementation layers in a typical computer system. At the base of the computer system we have the hardware layer, consisting of all the different components of a modern computer. Just above the hardware layer, we find the operating system layer, which exploits the hardware resources to provide a set of services to system users [6]. The libraries layer allows application calls to invoke various services available on the system, including those provided by the operating system. At the top, the application layer consists of the applications running on the computer system. Figure 2.1 also shows the three interfaces between the implementation layers – the instruction set architecture (ISA), the application binary interface (ABI), and the application programming interface (API) – which are especially important for virtual machine construction [7]. The division between hardware and software is marked by the instruction set architecture. The ISA consists of two interfaces, the user ISA and the system ISA. The user ISA includes the aspects visible to the libraries and application layers. The system ISA is a superset of the user ISA which also includes those aspects visible to supervisor software, such as the operating system. The application binary interface provides a program or library with access to the hardware resources and services available in the system. This interface consists of the user ISA and a system call interface which allows application programs to interact with the shared hardware resources indirectly. The ABI allows the operating system to perform operations on behalf of a user program. The application programming interface allows a program to invoke various services available on the system and is usually defined with respect to a high-level language (HLL). An API enables applications written to the API to be ported easily to other systems that support the same API. The interface consists of the user ISA and of HLL library calls. Using the three interfaces, virtual machines can be divided into two main categories: process virtual machines and system virtual machines. A process VM runs a single program, supporting only an individual process. It provides a user application with a virtual ABI or API environment. The process virtual machine is created when the corresponding process is created and terminates when the process terminates. System virtual machines provide a complete system environment in which many processes can coexist. System VMs do this by virtualizing the ISA layer.

2.2.1 Process virtual machines

Process virtual machines virtualize the ABI or API and can run only a single user program. Each virtual machine thus supports a single process, possibly consisting of multiple threads. The most common process VM is an operating system. It supports multiple user processes running simultaneously by time-sharing the limited hardware resources. The operating system provides a replicated process VM for each executing program, so that each program thinks it has its own machine.
Program binaries that are compiled for a different instruction set are also supported by process VMs. There are two approaches to emulating the instruction set. Interpretation is a simple but slow approach: an interpreter fetches, decodes and emulates each individual instruction. A more efficient approach is dynamic binary translation, which is explained in section 3.1. Emulation between different instruction sets provides cross-platform compatibility only on a case-by-case basis and requires considerable programming effort. Designing a process-level VM together with an HLL application development environment is an easier way to achieve full cross-platform portability. The HLL virtual machine does not correspond to any real platform, but is designed for ease of portability. The Java programming language is a widely used example of an HLL VM.

2.2.2 System virtual machines

System virtual machines provide a complete system environment by virtualizing the ISA layer. They allow a physical hardware system to be shared among multiple, isolated guest operating system environments simultaneously. The layer that provides the hardware virtualization is called the virtual machine monitor (VMM) or hypervisor. It manages the hardware resources so that multiple guest operating system environments and their user programs can execute simultaneously. The subdivision is centered on the supported ISAs of the guest operating systems, i.e. whether virtualization or emulation is used. Virtualization can be further subdivided based on the location where the hypervisor is executed: native or hosted. The following two paragraphs clarify the subdivision according to the supported ISAs.

Emulation: Guest operating systems with a different ISA from the host ISA can be supported through emulation. The hypervisor must emulate both the application and the operating system code by translating each instruction to the ISA of the physical machine. The translation is applied to each instruction, so the hypervisor can easily manage all hardware resources. When emulation is used for guest operating systems with the same ISA as the host ISA, performance will be severely lower than with virtualization.

Virtualization: When the ISA of the guest operating system is the same as the host ISA, virtualization can be used to improve performance. It treats non-privileged instructions and privileged instructions differently. A privileged instruction is an instruction that traps when executed in user mode instead of in kernel mode; this is discussed in more detail in section 2.3. Non-privileged instructions are executed directly on the hardware without intervention of the hypervisor. Privileged instructions are caught by the hypervisor and translated in order to guarantee correct results. When guest operating systems primarily execute non-privileged instructions, the performance is comparable to near native speed. Thus, when the ISA of the guest and the host are the same, the best performing technique is virtualization. It improves performance in terms of execution speed by running non-privileged instructions directly on the hardware. If the ISA of the guest and the host are different, emulation is the only way to execute the guest operating system. The subdivision of virtualization based on the location of the hypervisor is clarified in the next two paragraphs.

Native, bare-metal hypervisor: A native, bare-metal hypervisor, also referred to as a Type 1 hypervisor, is the first layer of software installed on a clean system.
The hypervisor runs in the most privileged mode, while all the guests run in a less privileged mode. It runs directly on the hardware and executes the intercepted instructions directly on the hardware. According to J. E. Smith and R. Nair, a bare-metal hypervisor is more efficient than a hosted hypervisor in many respects, since it has direct access to hardware resources, enabling greater scalability, robustness and performance [5]. There are some variations of this architecture in which a privileged guest operating system handles the intercepted instructions. The disadvantage of a native, bare-metal hypervisor is that a user must remove the existing operating system in order to install the hypervisor.

Hosted hypervisor: An alternative to a native, bare-metal hypervisor is the hosted or Type 2 hypervisor. It runs on top of a standard operating system and supports the broadest range of hardware configurations [3]. The installation of the hypervisor is similar to the installation of an application within the host operating system. The hypervisor relies on the host OS for device support and physical resource management. Privileged instructions cannot be executed directly on the hardware but are modified by the hypervisor and passed down to the host OS.

The implementation specifics of Type 1 and Type 2 hypervisors can be separated into several categories: dynamic binary translation, paravirtualization and hardware assisted virtualization. These approaches are discussed in more detail in chapter 3, which elaborates on virtualization within system virtual machines. An overview of the taxonomy of virtual machines is shown in figure 2.2.

Figure 2.2: Taxonomy of virtual machines.

2.3 x86 architecture

The taxonomy given in the previous section provides an overview of the different virtual machines and the different implementation approaches. This section gives detailed information about the requirements associated with virtualization and the problems that occur when virtualization technologies are implemented on the x86 architecture.

2.3.1 Formal requirements

In order to provide insight into the problems and solutions for virtualization on top of the x86 architecture, the formal requirements for a virtualizable architecture are given first. These requirements describe what is needed in order to use virtualization on a computer architecture. In "Formal requirements for virtualizable third generation architectures", G. J. Popek and R. P. Goldberg defined a set of formal requirements for a virtualizable computer architecture [8]. They divided the ISA instructions into several groups. The first group contains the privileged instructions:

Definition 2.2 Privileged instructions are all the ISA instructions that only work in kernel mode and trap when executed in user mode instead of in kernel mode.

Another important group of instructions that has a big influence on the virtualizability of a particular machine are the sensitive instructions. Before defining sensitive instructions, the notions of behaviour sensitive and control sensitive are explained.

Definition 2.3 An instruction is behaviour sensitive if the effect of its execution depends on the state of the hardware, i.e. upon its location in real memory, or on the mode.

Definition 2.4 An instruction is control sensitive if it changes the state of the hardware upon execution, i.e. it attempts to change the amount of resources available or affects the processor mode without going through the memory trap sequence.
With these notions, instructions can be separated into sensitive instructions and innocuous instructions.

Definition 2.5 Sensitive instructions are the instructions that are either control sensitive or behaviour sensitive.

Definition 2.6 Innocuous instructions are the instructions that are not sensitive.

According to Popek and Goldberg, there are three properties of interest when any arbitrary program is executed while the control program (the virtual machine monitor) is resident: efficiency, resource control, and equivalence.

The efficiency property: All innocuous instructions are executed by the hardware directly, with no intervention at all on the part of the control program. The hypervisor should not intervene for instructions that do no harm. These instructions do not change the state of the hardware and should be executed by the hardware directly in order to preserve performance. The more instructions are executed directly, the better the performance of the virtualization will be. This property highlights the contrast between emulation, where every single instruction is analyzed, and virtualization.

The resource control property: It must be impossible for that arbitrary program to affect the system resources, i.e. memory, available to it; the allocator of the control program is to be invoked upon any attempt. The hypervisor is in full control of the hardware resources. A virtual machine should not be able to access the hardware resources directly. It should go through the hypervisor to ensure correct results and isolation from other virtual machines.

The equivalence property: Any program K executing with a control program resident, with two possible exceptions, performs in a manner indistinguishable from the case when the control program did not exist and K had whatever freedom of access to privileged instructions that the programmer had intended. A program running on top of a hypervisor should behave identically to the case where the program runs on the hardware directly. As mentioned, there are two exceptions: timing and resource availability problems. The hypervisor will occasionally intervene, and instruction sequences may take longer to execute. This can lead to incorrect results in assumptions about the running time of the program. The second exception, the resource availability problem, might occur when the hypervisor does not satisfy a particular request for space. The program may then be unable to function in the same way as if the space were made available. This problem can easily occur, since the virtual machine monitor itself and other possible virtual machines take up space as well. A virtual machine environment can be seen as a "smaller" version of the actual hardware: logically the same, but with a lesser quantity of certain resources.

Given the categories of instructions and the properties, they define the hypervisor and a virtualizable architecture as:

Definition 2.7 We say that a virtual machine monitor, or hypervisor, is any control program that satisfies the three properties of efficiency, resource control and equivalence. Then functionally, the environment which any program sees when running with a virtual machine present is called a virtual machine. It is composed of the original real machine and the virtual machine monitor.

Definition 2.8 For any conventional third generation computer, a virtual machine monitor may be constructed, i.e. it is a virtualizable architecture, if the set of sensitive instructions for that computer is a subset of the set of privileged instructions.
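Popek and Goldberg's condition can be written down compactly. The fragment below is a condensed restatement in set notation added for this transcription, not a quotation from [8]; the symbols S and P are chosen here for illustration.

```latex
% S = set of sensitive instructions, P = set of privileged instructions
\[
  S \subseteq P
  \;\Longrightarrow\;
  \text{an efficient, resource-controlling, equivalent VMM can be constructed.}
\]
% The classic trap-and-emulate argument: if every sensitive instruction
% traps in user mode, the VMM regains control exactly when needed, while
% all innocuous instructions run natively (the efficiency property).
```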
2.3.2 The x86 protection level architecture

The x86 architecture recognizes four privilege levels, numbered from 0 to 3 [9]. Figure 2.3 shows how the privilege levels can be interpreted as rings of protection. The center ring, ring 0, is reserved for the most privileged code and is used for the kernel of an operating system. When the processor is running in kernel mode, the code is executing in ring 0. Rings 1 and 2 are less privileged and are intended for operating system services. These two are rarely used, but some virtualization techniques run the guests inside ring 1. The outermost ring is used for applications and has the least privileges. The code of applications running in user mode executes in ring 3.

Figure 2.3: The x86 protection levels.

These rings are used to prevent a program operating in an outer, less privileged ring from accessing routines that belong to a more privileged ring. A call gate is used to allow an outer ring to access an inner ring's resources in a predefined manner.

2.3.3 The x86 architecture problem

A computer architecture can support virtualization if it meets the formal requirements described in subsection 2.3.1. The x86 architecture, however, does not meet these requirements. The x86 instruction set architecture contains sensitive instructions that are non-privileged, called non-virtualizable instructions. In other words, these instructions do not trap when executed in user mode, yet they depend on or change the hardware state. This is not desirable because the hypervisor cannot simulate the effect of the instruction: the current hardware state could belong to another virtual machine, producing an incorrect result for the current virtual machine. The non-virtualizable instructions make virtualization on the x86 architecture more difficult, and virtualization techniques need to deal with these instructions. Applications will only run at near native speed when they contain a minimal amount of non-virtualizable instructions. Approaches that overcome the limitations of the x86 architecture are discussed in the next chapter.
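To make the problem concrete, the short program below (an illustrative example added here, not taken from the thesis) executes SMSW, one of the x86 instructions that is sensitive but not privileged: it reveals privileged machine state, yet in ring 3 it completes without trapping, so a trap-and-emulate hypervisor never gets the chance to intervene. It assumes a GCC-style compiler on x86 and an operating system that has not enabled UMIP, a later hardware feature added precisely to make such instructions trap in user mode.

```c
#include <stdio.h>

int main(void)
{
    unsigned long msw = 0;

    /* SMSW stores the machine status word (the low bits of control
     * register CR0) into a register.  The value describes privileged
     * processor state, so the instruction is sensitive, but it is not
     * privileged: executed here in ring 3 it simply succeeds instead
     * of trapping to the (hypothetical) hypervisor. */
    __asm__ volatile("smsw %0" : "=r"(msw));

    printf("machine status word (low bits of CR0): %#lx\n", msw);
    return 0;
}
```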
CHAPTER 3 Evolution of virtualization for the x86 architecture

Developers of virtualization software did not wait until processor vendors solved the x86 architecture problem. They introduced software solutions like binary translation and, when virtualization became more popular, paravirtualization. Processor vendors then introduced hardware support to solve the design problem of the x86 architecture and, at a later stage, to improve performance. The next generation of hardware support was introduced to improve the performance of memory management. This chapter gives an overview of the evolution towards hardware supported virtualization on the x86 architecture. Dynamic binary translation, a software solution that tries to circumvent the design problem of the x86 architecture, is explained in the first section. The second section explains paravirtualization, a software solution which tries to improve on the binary translation concept; it has some advantages and disadvantages compared to dynamic binary translation. The third section gives details on the first generation hardware support and its advantages and disadvantages compared to the software solutions. In many cases the software solutions outperform the hardware support. The next generation hardware support tries to further close the performance gap by eliminating major sources of virtualization overhead. This second generation hardware support focusses on memory management and is discussed in the fourth section. The last section gives an overview of VirtualBox, KVM and Xen, which are virtualization products, and VMware, a company providing multiple virtualization products.

3.1 Dynamic binary translation

In full virtualization, the guest OS is not aware that it is running inside a virtual machine and requires no modifications [10]. Dynamic binary translation is a technique that implements full virtualization. It requires no hardware-assisted or operating-system-assisted support, while other techniques, like paravirtualization, need modifications to either the hardware or the operating system. Dynamic binary translation is a technique which works by translating code from one instruction set to another. The word "dynamic" indicates that the translation is done on the fly and is interleaved with execution of the generated code [11]. The word "binary" indicates that the input is binary code and not source code. To improve performance, the translation is mostly done on blocks of code instead of single instructions [12]. A block of code is defined by a sequence of instructions that ends with a jump or branch instruction. A translation cache is used to avoid retranslating code blocks multiple times. In x86 virtualization, dynamic binary translation is not used to translate between different instruction set architectures. Instead, the translation is done from x86 instructions to x86 instructions. This makes the translation a lot lighter than previous binary translation technologies [13]. Since it is a translation within the same ISA, a copy of the original instructions often suffices. In other words, generally no translation is needed and the code can be executed as is. In particular, whenever the guest OS is executing code in user mode, no translation is carried out and the instructions are executed directly, which is comparable in performance to executing the code natively. Code that the guest OS wants to execute in kernel mode is translated on the fly and saved in the translation cache. Even when the guest OS is running kernel code, most of the time no translation is needed and the code is copied as is. Only in some cases does the hypervisor need to translate instructions of the kernel code to guarantee the integrity of the guest. The kernel of the guest is executed in ring 1 instead of ring 0 when using software virtualization. As explained in section 2.3, the x86 instruction set architecture contains sensitive instructions that are non-privileged. If the kernel of the guest operating system wants to execute privileged instructions or one of these non-virtualizable instructions, dynamic binary translation translates the instructions into a safe equivalent. The safe equivalent will not harm other guests or the hypervisor. For example, if access to the physical hardware is needed, the translation ensures that the code uses the virtual hardware instead. In these cases, the translation also ensures that the safe code is less costly than the code with privileged instructions: the code with privileged instructions would trap when running in ring 1 and the hypervisor would have to handle these traps. Dynamic binary translation thus avoids these traps by replacing the privileged instructions, so that there are fewer interrupts and the safe code is less costly. The translation of code into safer equivalents is less costly than letting the privileged instructions trap, but the translation itself should also be taken into account. Fortunately, the translation overhead is rather low and decreases over time, since translated pieces of code are cached in order to avoid retranslation in case of loops in the code.
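The control loop of such a translator can be sketched as follows. This is a minimal illustration written for this text, not the actual VMware or VirtualBox implementation; the helpers lookup_tcache, translate_block and insert_tcache are hypothetical placeholders for the translation cache and the code generator.

```c
#include <stdint.h>
#include <stddef.h>

/* A translated basic block: the guest address it starts at, plus a
 * pointer to generated host code that returns the next guest PC. */
struct tblock {
    uint64_t guest_pc;
    uint64_t (*run)(void *vcpu);
};

/* Hypothetical helpers supplied by a real translator. */
struct tblock *lookup_tcache(uint64_t guest_pc);         /* cache hit?  */
struct tblock *translate_block(void *vcpu, uint64_t pc); /* decode+emit */
void insert_tcache(struct tblock *tb);                   /* remember it */

/* Dispatch loop: translate on demand, cache, execute. */
void dbt_run(void *vcpu, uint64_t guest_pc)
{
    for (;;) {
        struct tblock *tb = lookup_tcache(guest_pc);
        if (tb == NULL) {
            /* First visit: copy the block up to the next branch,
             * replacing sensitive or privileged instructions with
             * safe equivalents that operate on the virtual state. */
            tb = translate_block(vcpu, guest_pc);
            insert_tcache(tb);
        }
        /* Loops hit the cache, so translation is paid once per block. */
        guest_pc = tb->run(vcpu);
    }
}
```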
Yet, dynamic binary translation has a few cases it cannot fully solve: system calls, I/O, memory management and complex code. The latter is code that, for example, modifies itself or has indirect control flow. Such code is complex to execute even on an operating system that runs natively. The other cases are described in more detail in the next subsections.

3.1.1 System calls

A system call is a mechanism used by processes to access the services provided by the operating system. It involves a transition to the kernel, where the required function is then performed [6, 14]. The kernel of an operating system is also a process, but it differs from other processes in that it has privileged access to processor instructions. The kernel does not execute continuously; it runs only when it receives an interrupt from the processor or a system call from another process running in the operating system. There are many different techniques for implementing system calls. One way is to use a software interrupt and trap, but for x86 a faster technique was chosen [13, 15]. Intel and AMD introduced the instructions SYSCALL/SYSENTER and SYSRET/SYSEXIT for a process to perform a system call. These instructions transfer control to the kernel without the overhead of an interrupt. In software virtualization the kernel of the guest runs inside ring 1 instead of ring 0. This implies that the hypervisor should intercept a SYSENTER (or SYSCALL), translate the code and hand over control to the kernel of the guest. This kernel then executes the translated code and executes a SYSEXIT (or SYSRET) to return control to the process that requested the service of the kernel. Because the kernel of the guest is running inside ring 1, it does not have the privilege to perform the SYSEXIT. This causes an interrupt at the processor and the hypervisor has to emulate the effect of this instruction. System calls therefore cause a significant amount of overhead when using software virtualization. In a virtual machine, a system call costs about ten times the cycles needed for a system call on a native machine. In "A comparison of software and hardware techniques for x86 virtualization", the authors measured that a system call on a 3.8 GHz Pentium 4 takes 242 cycles [11]. On the same machine, a system call in a virtual machine, virtualized with dynamic binary translation and the kernel running in ring 1, takes 2308 cycles. In an environment where virtualization is used there will most likely be more than one virtual machine on a physical machine, and the overhead of system calls can then become a significant part of the virtualization overhead. As we will see later, hardware support for virtualization offers a solution for this.
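Cycle counts of this kind are obtained with small micro-benchmarks. The sketch below, assumed here for illustration and not the harness used in [11], times a getpid() system call with the processor's time-stamp counter on Linux; run natively and inside a guest, the difference in the reported number gives a rough feel for the system-call overhead discussed above.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <x86intrin.h>          /* __rdtsc() */

#define ITERATIONS 100000

int main(void)
{
    unsigned long long start, end;
    long i;

    start = __rdtsc();
    for (i = 0; i < ITERATIONS; i++)
        syscall(SYS_getpid);    /* force a real kernel entry each time */
    end = __rdtsc();

    /* A rough average; no serialization or warm-up, so treat the
     * result as an indication rather than a precise measurement. */
    printf("approx. cycles per system call: %llu\n",
           (end - start) / ITERATIONS);
    return 0;
}
```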
3.1.2 I/O virtualization

When creating a virtual machine, not only the processor needs to be virtualized but also all the essential hardware, such as memory and storage. Each I/O device type has its own characteristics and needs to be controlled in its own special way [5]. There are often a large number of devices for each I/O device type, and this number continues to rise. The strategy consists of constructing a virtual I/O device and then virtualizing the I/O activity that is directed at the device. Every access to this virtual hardware must be translated to the real hardware. The hypervisor must intercept all I/O operations issued by the guest operating system and it must emulate these instructions using software that understands the semantics of the specific I/O port accessed [16]. The I/O devices are emulated because of the ease of migration and the multiplexing advantages [17]. Migration is easy because the virtual device exists in memory and can easily be transferred. The hypervisor can present a virtual device to each guest while performing the multiplexing. Emulation has the disadvantage of poor performance. The hypervisor must perform a significant amount of work to present the illusion of a virtual device. The great number of physical devices makes the emulation of I/O devices in the hypervisor complex: the hypervisor needs drivers for every physical device in order to be usable on different physical systems. A hosted hypervisor has the advantage that it can reuse the device drivers provided by the host operating system. Another problem is that the virtual I/O device is often a device model which does not match the full power of the underlying physical devices [18]. This means that optimizations implemented by specific devices can be lost in the process of emulation.
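In code, device emulation boils down to a dispatch routine that the hypervisor calls for every intercepted port access. The sketch below is an invented, heavily reduced example (a serial-port-like device at the conventional 0x3F8/0x3FD ports); a real hypervisor registers many such device models and must implement their full semantics.

```c
#include <stdint.h>
#include <stdio.h>

#define UART_DATA   0x3F8       /* data register of the modelled port */
#define UART_STATUS 0x3FD       /* line status register               */
#define TX_READY    0x20        /* "transmitter ready" status bit     */

/* Called by the hypervisor when a guest OUT to an emulated port traps. */
static void emulate_outb(uint16_t port, uint8_t value)
{
    switch (port) {
    case UART_DATA:
        /* The guest believes it wrote to real hardware; the device
         * model simply forwards the byte to the VM's console log. */
        fputc(value, stderr);
        break;
    default:
        /* Writes to unknown ports are silently ignored. */
        break;
    }
}

/* Called for intercepted IN instructions. */
static uint8_t emulate_inb(uint16_t port)
{
    switch (port) {
    case UART_STATUS:
        return TX_READY;        /* always report "ready to transmit"  */
    default:
        return 0xFF;            /* behave like a floating bus         */
    }
}

int main(void)
{
    /* A guest "OUT 0x3F8, 'H'" intercepted by the hypervisor: */
    emulate_outb(UART_DATA, 'H');
    return emulate_inb(UART_STATUS) == TX_READY ? 0 : 1;
}
```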
3.1.3 Memory management

In an operating system, every application has the illusion that it is working with a piece of contiguous memory, whereas in reality the memory used by applications can be dispersed across the physical memory. The application works with virtual addresses that are translated to physical addresses. The operating system manages a set of tables to translate the virtual memory addresses to physical addresses. The x86 architecture provides hardware support for paging. Paging is the process that translates the virtual addresses of a process to system physical addresses. The hardware that translates virtual addresses to physical addresses is called the memory management unit or MMU. The page table walker performs address translation using the page tables and uses a hardware page table pointer, the CR3 register, to start the page walk [19]. It traverses several page table entries, each of which points to the next level of the walk. The memory hierarchy is traversed many times when the page walker performs address translation. To keep this overhead within limits, a translation look-aside buffer (TLB) is used. The most recent translations are saved in this buffer. The processor first checks the TLB to see whether the translation is located in the cache. When the translation is found in the buffer, it is used; otherwise a page walk is performed and the result is saved in the TLB. The operating system and the processor must cooperate to ensure that the TLB stays consistent. Inside a virtual machine the guest operating system manages its own page tables. The task of the hypervisor is not only to virtualize the memory, but also to virtualize the virtual memory so that the guest operating system can use virtual memory [20]. This introduces an extra level of translation, which maps physical addresses of the guest to real physical addresses of the system. The hypervisor must manage the address translation on the processor using software techniques. It derives a shadow version of the page table from the guest page table, which holds the translations from virtual guest addresses to real physical addresses. This shadow page table is used by the processor when the guest is active, and the hypervisor manages this shadow table to keep it synchronized with the guest page table. The guest does not have access to these shadow page tables and can only see its own guest page tables, which run on an emulated MMU. It has the illusion that it can translate the virtual addresses to real physical ones; in the background, the hypervisor deals with the real translation using the shadow page tables.

Figure 3.1: Memory management in x86 virtualization using shadow tables.

Figure 3.1 shows the translations needed to translate a virtual guest address into a real physical address. Without the shadow page tables, the virtual guest memory (orange area) is translated into physical guest memory (blue area) and the latter is translated into real physical memory (white area). The shadow page tables avoid the double translation by immediately translating the virtual guest memory (orange) into real physical memory (white), as shown by the red arrow. In software, several techniques can be used to keep the shadow page tables and guest page tables consistent. These techniques use the page fault exception mechanism of the processor: the processor throws an exception when a page fault occurs, which allows the hypervisor to update the current shadow page table. This introduces extra page faults due to the shadow paging. The shadow page tables introduce an overhead because of the extra page faults and the extra work needed to keep the shadow tables up to date. The shadow page tables also consume additional memory. Maintaining shadow page tables for SMP guests introduces a further overhead. Each processor in the guest can use the same guest page table instance. The hypervisor could maintain a separate shadow page table instance for each virtual processor, which results in memory overhead; another possibility is to share the shadow page table between the virtual processors, which leads to synchronization overhead.
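The synchronization logic can be summarized by the page-fault path sketched below. It is a deliberately simplified, single-level illustration assumed for this text (real x86 paging is multi-level and the helper functions are hypothetical); it only shows the idea of lazily filling shadow entries from the guest page table and reflecting genuine faults back into the guest.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t gva_t;   /* guest virtual address      */
typedef uint64_t gpa_t;   /* guest "physical" address   */
typedef uint64_t hpa_t;   /* real host physical address */

/* Hypothetical helpers provided by the hypervisor. */
bool  guest_pt_lookup(gva_t va, gpa_t *gpa, uint32_t *flags); /* walk guest PT    */
hpa_t gpa_to_hpa(gpa_t gpa);                                  /* hypervisor's map */
void  shadow_pt_set(gva_t va, hpa_t hpa, uint32_t flags);     /* fill shadow PTE  */
void  inject_page_fault(gva_t va);                            /* reflect to guest */

/* Invoked when the hardware faults while running on the shadow table. */
void handle_shadow_fault(gva_t fault_va)
{
    gpa_t gpa;
    uint32_t flags;

    if (!guest_pt_lookup(fault_va, &gpa, &flags)) {
        /* The guest has no mapping either: this fault belongs to the
         * guest OS, so it is injected into the virtual CPU. */
        inject_page_fault(fault_va);
        return;
    }

    /* Hidden fault: the guest mapping exists but the shadow entry is
     * missing or stale.  Combine the two translations (guest virtual
     * to guest physical to host physical) and install the result; the
     * guest never notices that this extra work happened. */
    shadow_pt_set(fault_va, gpa_to_hpa(gpa), flags);
}
```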
The advantages of successful paravirtualization are a simpler hypervisor implementation and a smaller performance degradation compared to the physical system. Better performance is achieved because many unnecessary traps into the hypervisor are eliminated. The hypervisor provides hypercall interfaces for critical kernel operations such as memory management, interrupt handling and time keeping [10]. The guest operating system is adapted so that it is aware of the virtualization. The kernel is modified to replace non-virtualizable instructions with hypercalls that communicate directly with the hypervisor. The binary translation overhead is completely eliminated since the modifications are made in the operating system at design time. The implementation of the hypervisor is much simpler because it does not contain the binary translator.

3.2.1 System calls

Paravirtualization also reduces the overhead of system calls. The dynamic binary translation technique intercepts each SYSENTER/SYSCALL instruction and translates it to hand control over to the kernel of the guest operating system. Afterwards, the guest operating system's kernel executes a SYSEXIT/SYSRET instruction to return to the application. This instruction is again intercepted and translated by the dynamic binary translation. The paravirtualization technique allows guest operating systems to install a handler for system calls, permitting direct calls from an application into its guest OS and avoiding indirection through the hypervisor on every call [22]. This handler is validated before installation and is accessed directly by the processor without indirection via ring 0.

3.2.2 I/O virtualization

Paravirtualization software mostly uses a different approach to I/O virtualization than the emulation used with dynamic binary translation. The guest operating system utilizes a paravirtualized driver that operates on a simplified abstract device model exported by the hypervisor [23]. The real device driver can reside in the hypervisor, but often resides in a separate device driver domain which has privileged access to the device hardware. The latter is attractive since the hypervisor does not need to provide the device drivers; the drivers of a legacy operating system can be used. Separating the address space of the device drivers from guest and hypervisor code also prevents buggy device drivers from causing system crashes. The paravirtualized drivers remove the need to emulate devices. They free up processor time and resources which would otherwise be needed to emulate hardware. Since there is no emulation of the device hardware, the overhead is significantly reduced. In Xen, well known for its use of paravirtualization, the real device drivers reside in a privileged guest known as domain 0. A description of Xen can be found in subsection 3.6.3. However, Xen is not the only hypervisor that uses paravirtualization for I/O. VMware has a paravirtualized I/O device driver, vmxnet, that shares data structures with the hypervisor [10]. “A Performance Comparison of Hypervisors” states that with the paravirtualized vmxnet network driver, network I/O intensive datacenter applications can be run with very acceptable network performance [24].

3.2.3 Memory management

Paravirtual interfaces can be used by both the hypervisor and guest to reduce hypervisor complexity and overhead in virtualizing x86 paging [19].
When using a paravirtualized memory management unit, the guest operating system page tables are registered directly with the MMU [22]. To reduce the overhead and complexity associated with the use of shadow page tables, the guest operating system has readonly access to the page tables. A page table update is passed to Xen via a hypercall and validated before being applied. Guest operating systems can locally queue page table updates and apply the entire batch with a single hypercall. This minimizes the number of hypercalls needed for the memory management. 3.3 First generation hardware support In the meantime, processor vendors noticed that virtualization was becoming increasingly popular and they created a solution that solves the virtualization problem on the x86 architecture by introducing hardware assisted support. Hardware support for processor virtualization enables simple, robust and reliable hypervisor software [25]. It eliminates the need for the hypervisor to listen, trap and execute certain instructions for the guest OS [26]. Both Intel and AMD provide these hardware extensions in the form of Intel VT-x and AMD SVM respectively [11, 27, 28]. The first generation hardware support introduces a data structure for virtualization, together with specific instructions and a new execution flow. In AMD SVM, the data structure is called the virtual machine control block (VMCB). The VMCB combines control state with the guest’s processor state. Each guest has its own VMCB with its own control state and processor state. The VMCB contains a list of which instructions or events in the guest to intercept, various control bits and the guest’s processor state. The various control bits specify the execution environment of the guest or indicate special actions to be taken before running guest code. The VMCB is accessed by reading and writing to its physical address. The execution environment of the guest is referred to as guest mode. The execution environment of the hypervisor is called host mode. The new VMRUN instruction transfers control from host to guest mode. The instruction saves the current processor state and loads the corresponding guest state from the VMCB. The processor now runs the guest code until an intercept event occurs. This results in a #VMEXIT at which point 3.3. FIRST GENERATION HARDWARE SUPPORT 20 the processor writes the current guest state back to the VMCB and resumes host execution at the instruction following the VMRUN. The processor is then executing the hypervisor again. The hypervisor can retrieve information from the VMCB to handle the exit. When the effect of the exiting operation is emulated, the hypervisor can execute VMRUN again to return to guest mode. Although Intel has implemented their own version of hardware support, it has many similarities with the implementation of AMD although the terminology is somewhat different. Intel uses a virtual machine control structure (VMCS) instead of a VMCB. A VMCS can be manipulated by the new instructions VMCLEAR, VMPTRLD, VMREAD and VMWRITE which clears, loads, reads from, and writes to a VMCS respectively. The hypervisor runs in “VMX root operation“ and the guest in ”VMX non-root operation“ instead of host and guest mode. Software enters the VMX operation by executing the VMXON instruction. From then on, the hypervisor can use a VMEntry to transfer control to one of its guest. There are two instructions available for triggering a VMEntry: VMLAUNCH and VMRESUME. As with AMD SVM, the hypervisor regains control using VMExits. 
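The resulting control flow is essentially the same for both vendors: enter the guest, let the hardware run it until an intercepted event, service the exit, re-enter. The sketch below is a toy rendering of that service loop; vm_enter() stands in for VMRUN or VMLAUNCH/VMRESUME, and the exit reasons are hypothetical placeholders for the many intercepts real VT-x and SVM define.

```c
#include <stdio.h>

/* Hypothetical exit reasons; real VT-x/SVM define many more. */
enum exit_reason { EXIT_IO, EXIT_CPUID, EXIT_HLT };

struct vcpu { int halted; };

/* Stand-in for VMRUN / VMLAUNCH+VMRESUME: load guest state from the
 * VMCB/VMCS, run until an intercept fires, save guest state, and return
 * the reason for the exit.  Here a fixed script fakes three exits.       */
static enum exit_reason vm_enter(struct vcpu *v)
{
    static const enum exit_reason script[] = { EXIT_CPUID, EXIT_IO, EXIT_HLT };
    static unsigned i;
    (void)v;
    return script[i++ % 3];
}

static void run_guest(struct vcpu *v)
{
    while (!v->halted) {
        switch (vm_enter(v)) {          /* guest runs until a #VMEXIT    */
        case EXIT_CPUID: puts("emulate CPUID, advance guest RIP");  break;
        case EXIT_IO:    puts("emulate the I/O port access");       break;
        case EXIT_HLT:   puts("guest halted"); v->halted = 1;       break;
        }
    }
}

int main(void) { struct vcpu v = {0}; run_guest(&v); return 0; }
```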
Eventually, the hypervisor can leave the VMX operation with the instruction VMXOFF. Figure 3.2: Execution flow using virtualization based on Intel VT-x. The execution flow of a guest, virtualized by hardware support, can be seen in figure 3.2. The VMXON instruction starts and the VMXOFF stops the VMX operation. The guest is started using a VMEntry which loads the VMCS of the guest into the hardware. The hypervisor regains control using a VMExit when a guest tries to execute a privileged instruction. After intervention of the hypervisor, a VMEntry transfers control back to the guest. In the end, the guest can shut down and control is handed back to the hypervisor with a VMExit. The basic idea behind the first generation hardware support is to fix the problem that the x86 architecture cannot be virtualized. The VMExit forces a transition from guest to hypervisor, which is based on the trap all exceptions and privileged instructions philosophy. Nevertheless, each transition between the hypervisor and a 3.3. FIRST GENERATION HARDWARE SUPPORT 21 virtual machine requires a fixed amount of processor cycles. When the hypervisor has to handle a complex operation, the overhead is relatively low. However, for a simple operation the overhead of switching from guest to hypervisor and back is relatively high. Creating processes, context switches, small page table updates are all simple operations that will have a large overhead. In these cases, software solutions like binary translation and paravirtualization perform better than hardware supported virtualization. The overhead can be improved by reducing the number of processor cycles required for a transition between guest and hypervisor. The exact number of extra processor cycles depends on the processor architecture. For Intel, the format and layout of the VMCS in memory is not architecturally defined, allowing implementationspecific optimizations to improve performance in VMX non-root operation and to reduce the latency of a VMEntry and VMExit [29]. Intel and AMD are improving these latencies in their next processors, as you can see for Intel in figure 3.3. Figure 3.3: Latency reductions by CPU implementation [30]. System calls are an example of complex operations having a low transition overhead. System calls do not automatically transfer control from the guest to the hypervisor in hardware supported virtualization. A hypervisor intervention is only needed when the system call contains critical instructions. The overhead when a system call requires intervention is relatively low since a system call is rather complex and already requires a lot of processor cycles. First generation hardware support does not include support for I/O virtualization and memory management unit virtualization. Hypervisors that use the first generation hardware extensions will need to use a software technique for virtualizing the I/O devices and the MMU. For the MMU, this can be done using shadow tables or paravirtualization of the MMU. 3.4. SECOND GENERATION HARDWARE SUPPORT 22 3.4 Second generation hardware support First generation hardware support has made the x86 architecture virtualizable, but only in some cases an improvement in performance can be measured [11]. Maintaining the shadow tables can be an intensive task, as was pointed out in subsection 3.1.3. The next step of the processor vendors was to provide hardware MMU support. 
This second generation hardware support adds memory management support so the hypervisor does not have to maintain the integrity of the shadow page table mappings [17]. The shadow page tables remove the need to translate the virtual memory of the process to the guest OS physical memory and then translate the latter into the real physical memory, as can be seen in figure 3.1. It provides the ability to immediately translate the virtual memory of the guest process into real physical memory. On the other hand, the hypervisor must do the bookkeeping to keep the shadow page table up to date when an update occurs to the guest OS page table. In existing software solutions like binary translation, this bookkeeping introduces overhead which was even worse for first generation hardware support. The hypervisor must maintain the shadow page tables and every time a guest tries to translate a memory address, the hypervisor must intervene. In software solutions this intervention is an extra page fault, but in the first generation hardware support this will result in a VMExit and VMEntry roundtrip. As shown in figure 3.3, the latencies of such a roundtrip are improving but the second generation hardware support removes the need for the roundtrip. Intel and AMD introduced their own hardware MMU support. Like the first generation hardware support, this results in two different implementation but with similar characteristics. Intel proposed the extended page tables (EPT) and AMD proposed their nested page tables (NPT). In Intel’s EPT, the page tables translate from virtual memory to guest physical addresses while a separate set of page tables, the extended page tables, translate from guest physical addresses to the real physical addresses [29]. The guest can modify its page tables without hypervisor intervention. The new extended page tables remove the VMExits associated with page table virtualization. AMD’s nested paging also use additional page tables, the nested page tables (nPT), to translate guest physical addresses to real physical addresses [19]. The guest page tables (gPT) map the virtual memory addresses to guest physical addresses. The gPT are set up by the guest and the nPT by the hypervisor. When nested paging is enabled and a guest attempts to reference memory using a virtual address, the page walker performs a two dimensional walk using the gPT and nPT to translate the guest virtual address to the real physical address. Like Intel’s EPT, nested paging removes the overheads associated with software shadow paging. Another feature introduced by both Intel and AMD in the second generation hardware support is tagged TLBs. Intel uses Virtual-Processor Identifiers (VPIDs) that allow a hypervisor to assign a different identifier to each virtual processor. The zero VPID is reserved for the hypervisor itself. The processor then uses the VPIDs 3.5. CURRENT AND FUTURE HARDWARE SUPPORT 23 to tag translations in the TLB. AMD calls these identifiers the Address Space IDs (ASIDs). During a TLB lookup, the VPID or ASID value of the active guest is matched against the ID tag in the TLB entry. In this way, TLB entries belonging to different guests and to the hypervisor can coexist without causing incorrect address translations. The tagged TLBs eliminate the need for TLB flushes on every VMEntry and VMExit, furthermore it eliminates the impact of those flushes on performance. The tagged TLBs are an improvement compared to the other virtualization techniques. 
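A toy sketch of this two-dimensional walk is given below (single-level tables and hypothetical frame numbers). The point it illustrates is that even the guest page table itself sits at a guest-physical address, so reading it already requires a nested translation.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define NFRAMES    16

/* "Host physical memory": each frame can hold one toy page table, i.e.
 * an array of guest-physical frame numbers.                             */
static uint64_t host_mem[NFRAMES][NFRAMES];

/* Nested/extended page table, maintained by the hypervisor.             */
static uint64_t npt[NFRAMES];                 /* gPA frame -> hPA frame  */

static uint64_t npt_translate(uint64_t gpa_frame) { return npt[gpa_frame]; }

/* One-level guest walk.  Even reading the guest page table requires a
 * nested translation first, because the guest's CR3 holds a guest-
 * physical address: this is what makes the walk two-dimensional.        */
static uint64_t nested_translate(uint64_t guest_cr3_gpa, uint64_t gva)
{
    uint64_t off      = gva & ((1u << PAGE_SHIFT) - 1);
    uint64_t pt_hpa   = npt_translate(guest_cr3_gpa);        /* dimension 2 */
    uint64_t gpa      = host_mem[pt_hpa][gva >> PAGE_SHIFT]; /* dimension 1 */
    uint64_t data_hpa = npt_translate(gpa);                  /* dimension 2 */
    return (data_hpa << PAGE_SHIFT) | off;
}

int main(void)
{
    npt[2] = 4;             /* guest page table at gPA frame 2 -> hPA frame 4  */
    npt[7] = 11;            /* guest data page at gPA frame 7 -> hPA frame 11  */
    host_mem[4][3] = 7;     /* guest PT entry: gVA page 3 -> gPA frame 7       */
    printf("gva 0x3008 -> hpa 0x%llx\n",
           (unsigned long long)nested_translate(/*cr3=*/2, 0x3008));
    return 0;
}
```

In real hardware each of the four guest table levels triggers such nested lookups; the older software techniques (shadow paging under binary translation or paravirtualization) avoid this second dimension but pay for it elsewhere.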
These techniques need to flush the TLB every time a guest switches to the hypervisor or back. The drawback of the extended page tables or nested paging is that a TLB miss has a larger performance hit for guests because it introduces an additional level of address translation. This is rectified by making the TLBs much larger than before. Previous techniques like shadow page tables immediately translate the virtual guest address to the real physical address eliminating the additional level of address translation. The second generation hardware support is completely focussed on the improvement of the memory management. It eliminates the need for the hypervisor to maintain the shadow tables and eliminates the TLB flushes. The EPT and NPT help to improve performance for memory intensive workloads. 3.5 Current and future hardware support Intel and AMD are still working on support for virtualization. They are improving the latencies of the VMEntry and VMExit instructions, but are also working on new hardware techniques for supporting virtualization on the x86 architecture. The first generation hardware support for virtualization was based primarily on the processor and the second generation focusses on the memory management unit. The final component required next to CPU and memory virtualization is device and I/O virtualization [10]. Recent techniques are Intel VT-d and AMD IOMMU. There are three general techniques for I/O virtualization. The first technique is emulation and is described in subsection 3.1.2. The second technique, explained in subsection 3.2.2, is paravirtualization. The last technique is direct I/O. The device is not virtualized but assigned directly to a guest virtual machine. The guest’s device drivers are used for the dedicated device. In order to improve the performance for I/O virtualization, Intel and AMD are looking at allowing virtual machines to talk to the device hardware directly. With Intel VT-d and AMD IOMMU, hardware support is introduced to support assigning I/O devices to virtual machines. In such cases, the ability to multiplex the I/O device is lost. Depending on the I/O device, this does not need to be an issue. For example, network card interfaces can easily be added to the hardware in order to provide a NIC for each virtual machine. 3.6. VIRTUALIZATION SOFTWARE 24 3.6 Virtualization software There are many different virtualization implementations. This section gives an overview of some well-known virtualization software. Each implementation can be placed in the categories explained throughout the previous sections. 3.6.1 VirtualBox VirtualBox is a hosted hypervisor that performs full virtualization. It started as proprietary software but currently comes under a Personal Use and Evaluation License (PUEL). The software is free of charge for personal and educational use. VirtualBox was initially created by Innotek and was released as an Open Source Edition on January 2007. The company was later purchased by Sun Microsystems, which in turn was recently purchased by Oracle Corporation. VirtualBox software runs on Windows, Linux, Mac OS X and Solaris hosts. In depth information can be found on the wiki of their site [31], more specifically in the technical documentation [32]. Appendix A.1 presents an overview of VirtualBox, which is largely based on the technical documentation. A short summary is given in the following paragraph. VirtualBox started as a pure software solution for virtualization. 
The hypervisor used dynamic binary translation to fix the problem of virtualization in the x86 architecture. With the arrival of hardware support for virtualization, VirtualBox now also supports Intel VT-x and AMD SVM. The host operating system runs each VirtualBox virtual machine as an application, i.e. just another process in the host operating system. A ring 0 driver needs to be loaded in the host OS for VirtualBox to work. It only performs a few tasks: allocating physical memory for the virtual machine, saving and restoring CPU registers and descriptor tables, switching from host ring 3 to guest context and enabling or disabling hardware support. The guest operating system is manipulated to execute its ring 0 code in ring 1. This could result in poor performance since there is a possibility of generating a large amount of additional instruction faults. To address these performance issues, VirtualBox has come up with a Patch Manager (PATM) and Code Scanning and Analysis Manager (CSAM). The PATM will scan code recursively and replace problematic instructions with a jump to hypervisor memory where a more suitable implementation is placed. Every time a fault occurs, the CSAM will analyze the fault’s cause and determine if it is possible to patch the offending code to prevent it from causing more expensive faults. 3.6.2 VMware VMware [33] provides several virtualization products. The company was founded in 1998 and they released their first product, VMware Workstation, in May 1999. In 2001, they also entered the server market with VMware GSX Server and VMware ESX Server. Currently, VMware provides a variety of products for datacenter and desktop solutions together with management products. VMware software runs on Windows and Linux, and since the introduction of VMware Fusion it also runs on Mac OS X. Like VirtualBox, VMware started with a software only solution 3.6. VIRTUALIZATION SOFTWARE 25 for their hypervisors. In contrast with VirtualBox, VMware does not release the source code of their products. VMware now supports both full virtualization with binary translation and hardware assisted virtualization, and has a paravirtualized I/O device driver, vmxnet, that shares data structures with the hypervisor [10]. VMware Server is a free product based on the VMware virtualization technology. It is a hosted hypervisor that can be installed in Windows or Linux hosts. A webbased user interface provides a simple way to manage virtual machines. Another free datacenter product is VMware ESXi. It provides the same functionality but uses a native, bare-metal architecture for its hypervisor. VMware ESXi needs a dedicated server but has better performance. VMware makes these products available at no cost in order to help companies of all sizes experience the benefits of virtualization. The desktop product is VMware Player. It is free for personal non-commercial use and allows users to create and run virtual machines on a Windows or Linux host. It is a hosted hypervisor since this is common practice for desktop products. If users need developer-centric features, they can upgrade to VMware Workstation. 3.6.3 Xen Xen [34] is an open source example of virtualization software that uses paravirtualization. It is a native, bare-metal hypervisor for the x86 architecture and was initially created by the University of Cambridge Computer Laboratory in 2003 [22]. Xen is designed to allow multiple commodity operating systems to share conventional hardware. 
In 2007, Citrix Systems acquired the source of Xen and intended to freely license to all vendors and projects that implement the Xen hypervisor. Since 2010, the Xen community maintains and develops Xen. The Xen hypervisor is licensed under the GNU General Public License. After installation of the Xen hypervisor, the user can boot into Xen. When the hypervisor is started, it automatically boots a guest, domain 0, that has special management privileges and direct access to the physical hardware [35]. I/O devices are not emulated but Xen exposes a set of clean and simple device abstractions. There are two possibilities to run device drivers. In the first one, domain 0 is responsible for running the device drivers for the hardware. It will run a BackendDriver which queues requests from other domains and relays them to the real hardware driver. Each domain communicates with domain 0 through the FrontendDriver to access the devices. To the applications and the kernel, this driver looks like a normal device. The other possibility is that a driver domain has been given the responsibility for a particular piece of hardware. It runs the hardware driver and the backend driver for that device class. When the hardware driver fails, only this domain is affected and all other domains will survive. Apart from running paravirtualized guests, Xen supports Intel VT-x and AMD SVM since version 3.0.0 and 3.0.2 respectively. This allows users to run unmodified guest operating system in Xen. 3.6.4 KVM KVM [36], short for Kernel-based Virtual Machine, is a virtualization product that uses hardware support exclusively. Instead of creating major portions of an operating 3.6. VIRTUALIZATION SOFTWARE 26 system kernel, as other hypervisors have done, the KVM developers turned the standard Linux kernel into a hypervisor. By developing KVM as a loadable module, the virtualized environment can benefit from all the ongoing work on the Linux kernel itself and reduce redundancy [37]. KVM uses a driver (”/dev/kvm“) that communicates with the kernel and acts as an interface for userspace virtual machines. The initial version of KVM was released in November 2006 and it was first included in the Linux kernel 2.6.20 on February 2007. The recommended way of installing KVM is through the packaging system of a Linux distribution. The latest version of the KVM kernel modules and supporting userspace can be found on their website. You can find the kernel modules in the kvm-kmod-kernel version releases and the userspace components are found in qemukvm-version. The latter is the stable branch of KVM based on QEMU [38] with the KVM extras on top. QEMU is a machine emulator and can run an unmodified target operating system and all its applications in a virtual machine. The kvmversion releases are development releases but they are outdated. Every virtual machine is a Linux process, scheduled by the standard Linux scheduler [39]. A normal Linux process has two modes of execution: kernel and user mode. KVM adds a third mode of execution, guest mode. Processes that are run from within the virtual machine run in guest mode. Hardware virtualization is used to virtualize the processor, memory management is handled by the host kernel and I/O is handled in user space through QEMU. In this text, KVM is considered as a hosted hypervisor but there are some discussions1 that KVM is more a native, bare-metal hypervisor. 
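Regardless of how that discussion is settled, user space drives KVM through the /dev/kvm interface mentioned above. The sketch below shows the well-known minimal ioctl sequence (create a VM, give it one page of memory containing a single hlt instruction, create a vCPU and run it); error handling and most register setup are trimmed, so it should be read as an outline of the API rather than production code, and it only builds on a Linux host with the KVM headers installed.

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR);
    if (kvm < 0 || ioctl(kvm, KVM_GET_API_VERSION, 0) != 12) {
        perror("/dev/kvm");
        return 1;
    }

    int vm = ioctl(kvm, KVM_CREATE_VM, 0);

    /* One page of guest "RAM" at guest physical address 0, containing a
     * single HLT instruction (0xf4).                                     */
    void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memset(mem, 0xf4, 1);

    struct kvm_userspace_memory_region region = {
        .slot = 0, .guest_phys_addr = 0,
        .memory_size = 0x1000, .userspace_addr = (unsigned long)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    /* Start executing in real mode at guest address 0. */
    struct kvm_sregs sregs;
    ioctl(vcpu, KVM_GET_SREGS, &sregs);
    sregs.cs.base = 0; sregs.cs.selector = 0;
    ioctl(vcpu, KVM_SET_SREGS, &sregs);

    struct kvm_regs regs;
    memset(&regs, 0, sizeof(regs));
    regs.rip = 0; regs.rflags = 0x2;
    ioctl(vcpu, KVM_SET_REGS, &regs);

    /* The vCPU runs in hardware guest mode until it causes an exit. */
    ioctl(vcpu, KVM_RUN, 0);
    printf("exit reason: %d (KVM_EXIT_HLT is %d)\n",
           run->exit_reason, KVM_EXIT_HLT);
    return 0;
}
```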
One side argues that KVM turns Linux into a native, bare-metal hypervisor, because Linux itself becomes the hypervisor and runs directly on the hardware. The other side argues that KVM runs on top of Linux and should be considered a hosted hypervisor. Regardless of what type of hypervisor KVM actually is, this text will consider KVM to be a hosted hypervisor.

1 http://virtualizationreview.com/Blogs/Mental-Ward/2009/02/KVM-BareMetal-Hypervisor.aspx

3.6.5 Comparison between virtualization software

A high-level comparison is given in table 3.1. All virtualization products in the table, except Xen, are installed within a host operating system. Xen is installed directly on the hardware. Most products provide two techniques for virtualization on x86 architectures. Hardware support for virtualization on x86 architectures is supported by all virtualization software in the table.

                             VirtualBox   VMware Workstation   Xen                  KVM
Hypervisor type              Hosted       Hosted               Native, bare-metal   Hosted
Dynamic binary translation   X            X                    -                    -
Paravirtualization           -            -                    X                    -
Hardware support             X            X                    X                    X

Table 3.1: Comparison between a selection of the most popular hypervisors.

CHAPTER 4 Nested virtualization

The focus of this thesis lies with nested virtualization on x86 architectures. Nested virtualization is executing a virtual machine inside a virtual machine. In the case of multiple nesting levels, one can also speak of recursive virtual machines. In 1973 and 1975, initial research was published on the properties of recursive virtual machine architectures [40, 41]. These works refer to the virtualization used in mainframes so that users could work simultaneously on a single mainframe. Multiple use cases come to mind for nested virtualization.

• A possible use case for nested x86 virtualization is the development of test setups for research purposes. Research in cluster1 and grid2 computing requires extensive test setups, which might not be available. The latest developments in the research of grid and cluster computing make use of virtualization at different levels. Virtualization can be used for all, or certain, components of a grid or cluster. It can also be used to run applications within the grid or cluster in a sandbox environment. If certain performance limitations are not an issue, virtualizing all components of such a system can eliminate the need to acquire the entire test setup. Because these virtualized components, e.g. Eucalyptus3 or OpenNebula4, might use virtualization for running applications in a sandbox environment, two levels of virtualization are used. Nesting the physical machines of a cluster or grid as virtual machines on one physical machine can offer security, fault tolerance, legacy support, isolation, resource control, consolidation, etc.

1 A cluster is a group of interconnected computers working together as a single, integrated computer resource [42, 43].
2 There is no strict definition of a grid. In [44], Bote-Lorenzo et al. listed a number of attempts to create a definition. Ian Foster created a three point checklist that combines the common properties of a grid [45].
3 http://www.eucalyptus.com
4 http://www.opennebula.org

• A second possible use case is the creation of a test framework for hypervisors. As virtualization allows testing and debugging an operating system by deploying the OS in a virtual machine, nested virtualization allows testing and debugging a hypervisor inside a virtual machine.
It eliminates the need for a separate physical machine where a developer can test and debug a hypervisor. • Another possible use case is the use of virtual machines inside a server rented from the cloud5 . Such a server is virtualized on its own so that the cloud vendor can make optimal use of its resources. For example, Amazon EC26 offers virtual private servers which are virtual machines using the Xen hypervisor. Hence, if a user wants to use virtualization software inside this server, nested x86 virtualization is needed in order to make that setup work. As explained in chapter 2, virtualization on the x86 architecture is not straightforward. This has resulted in the emergence of several techniques that are given in chapter 3. These different techniques produce many different combinations to nest virtual machines. A nested setup can consist of the same technique for both hypervisors, but it can also consist of a different technique for either the first level hypervisor or the nested hypervisor. Hence, if we divide the techniques in three major groups: dynamic binary translation, paravirtualization and hardware support, there are nine possible combinations for nesting a virtual machine inside another virtual machine. In the following sections, the theoretical possibilities and requirements for each of these combinations are given. The results of nested virtualization on x86 architectures are given in chapter 5. Figure 4.1: Layers in a nested virtualization setup with hosted hypervisors. To prevent confusion about which hypervisor or guest is meant, some terms are introduced. In a nested virtualization setup, there are two levels of virtualization, see 5 6 Two widely accepted definitions of the term cloud can be found in [46] and [47]. http://aws.amazon.com/ec2/ 4.1. DYNAMIC BINARY TRANSLATION 30 figure 4.1. The first level, referred to as L1, is the layer of virtualization that is used in a non-nested setup. Thus, this level is the virtualization layer that is closest to the hardware. The terms L1 or bottom layer indicate the first level of virtualization, e.g. the L1 hypervisor is the hypervisor that is used in the first level of virtualization. The second level, referred to as L2, is the new layer of virtualization, introduced by the nested virtualization. Hence, the terms L2, nested or inner indicate the second level of virtualization, e.g. the L2 hypervisor is the hypervisor that will be installed inside the L1 guest. 4.1 Dynamic binary translation This section focusses on L1 hypervisors that use dynamic binary translation for nested virtualization on x86 architectures. This can be in the host operating system or directly on the hardware. The hypervisor can be VirtualBox (see subsection 3.6.1), a VMware product (see subsection 3.6.2) or any other hypervisor using dynamic binary translation. The nested hypervisor can be any hypervisor, resulting in three major combinations. Each combination uses a nested hypervisor that allows virtualization through a different technique. The nested hypervisor will be installed in a guest virtualized by the L1 hypervisor. The first combination is again a hypervisor using dynamic binary translation. In the second combination a hypervisor using paravirtualization is installed in the guest. The last combination is a nested hypervisor that uses hardware support. It should be theoretically possible to nest virtual machines using dynamic binary translation as L1. 
When using dynamic binary translation, no modifications are needed to the hardware or to the operating system, as pointed out in section 3.1. Code running in ring 0 will actually run in ring 1, but the guest is not aware of this. Dynamic binary translation: The first combination nests a L2 hypervisor inside a guest virtualized by a L1 hypervisor where both hypervisors are based on dynamic binary translation. The L2 hypervisor will be running in guest ring 0. Since the hypervisor will not be aware that its code is actually running in ring 1, it should be possible to run a hypervisor in this guest. The nested hypervisor will have to take care of the memory management in the L2 guest. It will have to maintain the shadow page tables for its guests, see subsection 3.1.3. The hypervisor uses these shadow page tables to translate the L2 virtual memory addresses to, what it thinks to be, real memory equivalents. But actually these translated addresses are in the virtual memory range of the L1 guest and can be converted to real memory addresses by the shadow page tables maintained by the L1 hypervisor. The memory architecture in a nested setup is illustrated in figure 4.2. For a L1 guest, there are two levels of address translation as shown in figure 3.1. A nested guest has three levels of address translation resulting in the need for shadow tables in the L2 hypervisor. Paravirtualization: The second combination uses paravirtualization as technique for the L2 hypervisor. This situation is the same as the situation with dynamic 4.1. DYNAMIC BINARY TRANSLATION 31 Figure 4.2: Memory architecture in a nested situation. binary translation for the L2 hypervisor. The hypervisor using paravirtualization will be running in guest ring 0 and is not aware that it is actually running in ring 1. This should make it possible to nest a L2 hypervisor based on paravirtualization within a guest virtualized by a L1 hypervisor using dynamic binary translation. Hardware supported virtualization: The virtualized processor that is available to the L1 guest is based on the x86 architecture in order to allow current operating system to work in the virtualized environment. However, are the extensions (see section 3.3 and 3.4) for virtualization on x86 architectures also included? In order to use a L2 hypervisor based on hardware support within the L1 guest, the L1 hypervisor should virtualize or emulate the virtualization extensions of the processor. A virtualization product that is based on hardware supported virtualization needs these extra extensions. If the extensions are not available, the hypervisor cannot be installed or activated. If the L1 hypervisor provides these extensions, chances are that it requires a physical processor with the same extensions. It might be possible for hypervisors based on dynamic binary translation to provide the extensions without having a processor that supports the hardware virtualization. However, all current processors have these extensions. Therefore it is very unlikely that developers will incorporate functionality that provides the hardware support to the guest without a processor with hardware support for x86 virtualization. Memory management in the L2 guest based on hardware support is not possible because the second generation hardware support only provides two levels of address translation. 
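The chain of mappings involved can be sketched as follows (toy single-level tables, with the naming of figure 4.2): an L2 shadow entry must collapse three mappings, and the middle one is itself only reachable through the L1 hypervisor's own shadow mapping, because the memory the L2 hypervisor treats as physical is in fact virtual memory of the L1 guest.

```c
#include <stdint.h>
#include <stdio.h>

#define NPAGES 16

/* Toy single-level mappings, one per translation level in figure 4.2.    */
static uint64_t l2_guest_pt[NPAGES]; /* L2 virtual  -> L2 "physical" (= L1 virtual) */
static uint64_t l1_guest_pt[NPAGES]; /* L1 virtual  -> L1 "physical"                */
static uint64_t host_map[NPAGES];    /* L1 physical -> real physical                */

/* What the L1 hypervisor's shadow table caches for its guest.            */
static uint64_t l1_shadow(uint64_t l1_virt_page)
{
    return host_map[l1_guest_pt[l1_virt_page]];
}

/* What the L2 hypervisor's shadow table must cache for the nested guest:
 * it composes the L2 guest's table with what it believes to be physical
 * memory, which is in turn virtualized by the layer below.               */
static uint64_t l2_shadow(uint64_t l2_virt_page)
{
    return l1_shadow(l2_guest_pt[l2_virt_page]);
}

int main(void)
{
    l2_guest_pt[1] = 6;  l1_guest_pt[6] = 3;  host_map[3] = 12;
    printf("L2 virtual page 1 -> real frame %llu\n",
           (unsigned long long)l2_shadow(1));
    return 0;
}
```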
The L1 hypervisor should provide the EPT or NPT functionality to the guest together with the first generation hardware support, but it will have to use a software technique for the implementation of the MMU.

4.2 Paravirtualization

The situation for nested virtualization is quite different when using paravirtualization as the bottom layer hypervisor. The most popular example of a hypervisor based on paravirtualization is Xen (see subsection 3.6.3). There are again three combinations. A nested hypervisor can be the same as the bottom layer hypervisor, based on paravirtualization. The second combination is the case where a dynamic binary translation based hypervisor is used as the nested hypervisor. In the last combination a hypervisor based on hardware support is nested in the paravirtualized guest. The main difference is that the L1 guest is aware of the virtualization.

Dynamic binary translation and paravirtualization: The paravirtualized guest is aware of the virtualization and should use the hypercalls provided by the hypervisor. The guest's operating system should be modified to use these hypercalls; thus all code in the guest that runs in kernel mode needs these modifications in order to work in the paravirtualized guest. This has major consequences for a nested virtualization setup. A nested hypervisor can only work in a paravirtualized environment if it is modified to work with these hypercalls. A native, bare-metal hypervisor should be adapted so that all ring 0 code is changed. For a hosted hypervisor this means that the module that is loaded into the kernel of the host operating system must be modified to work in the paravirtualized environment. Hence, companies that develop virtualization products need to actively make their hypervisors compatible with running inside a paravirtualized guest. Memory management of the L2 guests is done by the nested hypervisor. The page tables of the L1 guests are directly registered with the MMU, so the nested hypervisor can use the hypercalls to register its page tables with the MMU. A nested hypervisor based on paravirtualization might allow a L2 guest to register its page tables directly with the MMU, while a nested hypervisor based on dynamic binary translation will maintain shadow tables.

Hardware supported virtualization: Hardware support for x86 virtualization is, for a paravirtualized bottom layer too, a special case. The L1 hypervisor should provide the extensions for the hardware support to the guests, probably by means of hypercalls. Modified hypervisors based on hardware support can then use the hardware extensions. Second generation hardware support can likewise only be used if it is provided by the L1 hypervisor, together with first generation hardware support.

In conclusion, nested virtualization with paravirtualization as the bottom layer needs modifications to the nested hypervisor, whereas nested virtualization with dynamic binary translation as the bottom layer does not need these changes. On the other hand, the guests know that they are virtualized, which might influence the performance of the L2 guests in a positive way. The nested virtualization will not work unless support is actively introduced. There is a low likelihood that virtualization software developers are willing to incorporate these modifications in their hypervisors, since the benefits do not outweigh the implementation cost.
HARDWARE SUPPORTED VIRTUALIZATION 33 4.3 Hardware supported virtualization The last setup is to use a hypervisor based on hardware support for x86 virtualization as the bottom layer. For this configuration a processor is required that has the extensions for hardware support. KVM (see subsection 3.6.4) is a popular example of such a hypervisor but the latest versions of VMware, VirtualBox and Xen can also use hardware support. As with the previous configurations, there are three combinations. The combination using a hypervisor that is based on the same technique as the L1 hypervisor. A combination where a hypervisor based on dynamic binary translation is used and the last combination where a paravirtualization based hypervisor is the nested hypervisor. Dynamic binary translation and paravirtualization: These combinations are similar to the combinations where a hypervisor based on dynamic binary translation is used as bottom layer. A guest or its operating system does not need modifications, hence it should in theory be possible to nest virtual machines in a setup where the bottom layer hypervisor is based on hardware support. The nested hypervisor thinks its code is running in ring 0, but actually it is running in the guest mode of the processor, which is a result of a VMRUN or VMEntry instruction. The memory management depends on whether the processor supports the second generation hardware support. If the processor does not support this, the L1 hypervisor uses a software technique for virtualizing the MMU. In this case, memory management will be the same as with dynamic binary translation where both L1 and L2 hypervisor maintain shadow tables for virtualizing the MMU. Whereas, if the processor does support the hardware MMU, then the L1 hypervisor does not need to maintain these shadow tables which can improve the performance. Hardware supported virtualization: As for the other configurations, hardware support for nested hypervisors is a special case. The virtualized processor that is provided to the L1 guest is based on the x86 processor but needs to contain the hardware extensions for virtualization if the nested hypervisor uses hardware support. If the L1 hypervisor does not provide these hardware extensions to its guests, only the combination with a nested hypervisor that uses dynamic binary translation or paravirtualization can work. KVM and Xen are doing research and work to provide hardware extensions for virtualization on the x86 architecture to the guests. More details are given in section 5.4. The hardware support for EPT or NPT (see section 3.4) in the guest, which can also be referred to as nested EPT or nested NPT, deserves special attention according to Avi Kivity [48]. Avi Kivity is a lead developer and maintainer of KVM and posted some interesting information about nested virtualization on his blog. Nested EPT or nested NPT can be critical for obtaining reasonable performance. The guest hypervisor needs to trap and service context switches and writes to guest tables. A trap in the guest hypervisor is multiplied by quite a large factor into KVM traps. Since the hardware only supports two levels of address translation, nested EPT or NPT should be implemented in software. CHAPTER 5 Nested virtualization in Practice The previous chapter gave some insight in the theoretical requirements of nested x86 virtualization. The division into three categories resulted in nine combinations. This chapter presents how nested x86 virtualization behaves in practice. 
Each of the nine combinations is tested and performance tests are executed on the working combinations. The results of these tests are discussed in the following chapter. The combinations that fail to run are analyzed in order to find the reason for the failure. A selection of the currently popular virtualization products is tested. These products are VirtualBox, VMware Workstation, Xen and KVM, as discussed in section 3.6. Table 3.1 shows a summary of these hypervisors and the supported virtualization techniques. There are seven different hypervisors if we consider that the products with multiple techniques consist of different hypervisors. Each of these seven hypervisors can be nested inside each of the seven, itself included. Thus, nesting these hypervisors results in 49 different setups, which will be described in the following sections. Details of the tests are given in appendix B. It lists the configuration used for each setup together with version information of the hypervisors and the result of the setup. The subsection in which each nested setup can be found is summarized in table 5.1. The columns of the table represent the L1 hypervisors, grouped by virtualization technique, and the rows represent the L2 hypervisors, i.e. the hypervisor represented by the row is nested inside the hypervisor represented by the column. For example, information about the nested setup where VirtualBox based on dynamic binary translation is nested inside Xen using paravirtualization can be found in subsection 5.1.2. For setups with a L1 hypervisor based on hardware support, the cell first lists the subsection for the setup tested on a processor with first generation hardware support, followed by the subsection for the setup tested on a processor with second generation hardware support.

L2 hypervisor                       L1: DBT                 L1: PV    L1: HV, 1st gen / 2nd gen
                                    (VirtualBox, VMware)    (Xen)     (VirtualBox, VMware, Xen, KVM)
DBT (VirtualBox, VMware)            5.1.1                   5.1.2     5.2.1 / 5.3.1
PV (Xen)                            5.1.1                   5.1.2     5.2.2 / 5.3.2
HV (VirtualBox, VMware, Xen, KVM)   5.1.1                   5.1.2     5.2.3 / 5.3.3

Table 5.1: Index table indicating in which subsection information about each nested setup can be found.

5.1 Software solutions

5.1.1 Dynamic binary translation

In this subsection, we give the results of actually nesting virtual machines inside a L1 hypervisor based on dynamic binary translation, as discussed in section 4.1. The nested hypervisors should not need modifications. Only the nested hypervisors based on hardware support for virtualization need a virtual processor that contains the hardware extensions. The L1 hypervisors are VirtualBox and VMware Workstation using dynamic binary translation for virtualizing guests. Since we test two L1 hypervisors, this subsection describes 14 setups. These setups are described in the following paragraphs, categorized by the technique of the L2 hypervisor. The first paragraph elaborates on the setups that use dynamic binary translation on top of dynamic binary translation.
The next paragraph presents the setups that use paravirtualization as the L2 hypervisor, followed by a paragraph that presents the setups that use hardware support as the L2 hypervisor. The last paragraph concludes this subsection with an overview. Dynamic binary translation: Each setup that used dynamic binary translation as both the L1 and L2 hypervisor resulted in failure. The setups either hung or crashed when starting the inner guest. In the two setups where VMware Workstation was nested inside VMware Workstation and VirtualBox was nested inside VirtualBox, the L2 guest became unresponsive when started. After a few hours, the nested guests were still trying to start, so these setups could be marked as failures. In both setups the L1 and L2 hypervisors were the same, the developers know what instructions and functionality is used by the nested hypervisor and may have foreseen this situation. However, the double layer of dynamic binary translation seems to be inoperative or too slow for a working nested setup with the same hypervisor for both L1 and L2 hypervisors. The other two setups, where VMware Workstation is nested in VirtualBox and VirtualBox is nested in VMware Workstation, resulted in a crash. In the former setup the L1 VirtualBox guest crashed which indicates that the L2 guest tried to use functionality that is not fully supported by VirtualBox. This can be functionality that was left out in order to improve performance or a simple bug. In the other setup, with VMware Workstation as the L1 hypervisor and VirtualBox as the L2, the VirtualBox guest crashed but the VMware Workstation guest stayed operational. The L2 guest noticed that some conditions are not met and crashes with an assertion failure. In both setups, it seems that the L2 guest does not see a fully virtualized environment and one of the guests, in particularly VirtualBox, reports a crash. More information about the reported crash is given in section B.1. A possible reason that in both cases VirtualBox reports the crash is that VirtualBox is open source and can allow more information to be viewed by its users. Paravirtualization: Of the two setups that use paravirtualization on top of dynamic binary translation, one worked and the other crashed. Figure 5.1 shows the layers of these setups, where the L1 guest and the L2 hypervisor are represented by the same layer. The setup with VMware Workstation as the L1 hypervisor allowed 5.1. SOFTWARE SOLUTIONS 37 Figure 5.1: Layers for nested paravirtualization in dynamic binary translation. a Xen guest to boot successfully. In the other setup, using VirtualBox, the L1 guest crashed and reported a message similar to the setup with VMware Workstation inside VirtualBox (see section B.1). The result, one setup that works and one that does not, gives some insight in the implementation of VMware Workstation and VirtualBox. The latter contains one or more bugs which make the L1 guest crash when a nested hypervisor starts a guest. The functionality could be left out deliberately because such a situation might not be very common. Leaving out these exceptional situations allows developers to focus on more important functionality for allowing virtualization. On the other hand, VMware Workstation does provide the functionality and could be considered more mature for nested virtualization using dynamic binary translation as the L1 hypervisor. Hardware supported virtualization: VirtualBox and VMware Workstation do not provide the x86 virtualization processor extensions to their guests. 
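One way to confirm this from inside an L1 guest is to query CPUID directly: leaf 1, ECX bit 31 is set when running under a hypervisor, leaf 1, ECX bit 5 reports Intel VMX, and leaf 0x80000001, ECX bit 2 reports AMD SVM. A small sketch using GCC's cpuid.h:

```c
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;
    printf("running under a hypervisor: %s\n", (ecx >> 31) & 1 ? "yes" : "no");
    printf("Intel VT-x (VMX) exposed:   %s\n", (ecx >> 5)  & 1 ? "yes" : "no");

    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
        printf("AMD SVM exposed:            %s\n", (ecx >> 2) & 1 ? "yes" : "no");
    return 0;
}
```

Inside a VirtualBox or VMware Workstation guest of the versions tested here, the two virtualization bits should therefore read as absent.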
This means that there is no hardware support available in the guests, neither for the processor, nor the memory management. Since four of the hypervisors are based on hardware support, there are eight setups that contain such a hypervisor. The lack of hardware support causes the failure of these eight setups. Implementing the hardware support in the L1 hypervisor using software, without underlying support from the processor, could result in bad performance. However, if performance is not an issue, such a setup could be useful to simulate a processor with hardware support on an incompatible processor. Only one out of 14 setups worked with dynamic binary translation as the L1 hypervisor. The successful combination is the Xen hypervisor using paravirtualization within the VMware Workstation hypervisor. Other setups hung or crashed and VirtualBox reported the most crashes. VirtualBox seems to contain some bugs that VMware Workstation does not have, resulting in crashes in the guest being virtualized by VirtualBox. Hardware support for virtualization is not present in the L1 guest using VMware Workstation or VirtualBox, which eliminates the eight setups with a nested hypervisor that needs the hardware extensions. Table 5.2 gives a summary of the setups described in this subsection. The columns represent the 5.1. SOFTWARE SOLUTIONS VirtualBox VMware Xen KVM 38 VirtualBox VMware DBT DBT DBT × × HV × × DBT × × HV × × PV × X HV × × HV × × Table 5.2: The nesting setups with dynamic binary translation as the L1 hypervisor technique. DBT stands for dynamic binary translation, PV for paravirtualization and HV for hardware virtualization. L1 hypervisors and the rows represent the L2 hypervisors. 5.1.2 Paravirtualization Previous subsection described the setups that use dynamic binary translation as the L1 hypervisor. The following paragraphs elaborate on the use of a L1 hypervisor based on paravirtualization. In section 4.2, we concluded that nested virtualization with paravirtualization as a bottom layer needs modifications to the nested hypervisor. The L1 hypervisor used for the tests is Xen. In all the nested setups, the L2 hypervisor should be modified to use the paravirtual interfaces offered by Xen instead of executing ring 0 code. We discuss the problems for each hypervisor technique in the following paragraphs, together with what the setup would look like if the nested virtualization works. The last paragraph summarizes the setups described in this subsection. Paravirtualization: The paravirtualized guest does not allow the start of a Xen hypervisor within the guest. The kernel loaded in the paravirtualized guest is a kernel adapted for paravirtualization. The Xen hypervisor is not adapted to use the provided interface and hence the paravirtualized guest removes the other kernels from the bootloader. The complete setup, see figure 5.2, consists of Xen as the L1 hypervisor which automatically starts domain 0. This domain 0 is a L1 privileged guest. Another domain would run the nested hypervisor, which in turn would run its automatically started domain 0 and a nested virtual machine. Dynamic binary translation: The hypervisor of VMware Workstation and VirtualBox based on dynamic binary translation could not be loaded in the paravirtualized guest. The reason is that the ring 0 code is not adapted for the paravirtualization. In practice this expresses itself as the inability to compile the driver or module that needs to be loaded. It should be compiled against the kernel headers 5.1. 
SOFTWARE SOLUTIONS 39 Figure 5.2: Layers for nested Xen paravirtualization. but fails to compile since it does not recognize the version of the adapted kernel and its headers. The setup for dynamic binary translation as technique for the nested hypervisor (see figure 5.3) differs from the previous setup (figure 5.2) in that the L2 hypervisor is on top of a guest operating system. Xen is a native, bare-metal hypervisor which runs directly on the hardware, i.e. in this case the virtual hardware. VMware Workstation and VirtualBox are hosted hypervisors and do need an operating system between the hypervisor and the virtual hardware. Figure 5.3: Layers for nested dynamic binary translation in paravirtualization. Hardware supported virtualization: The other four setups, where a nested hypervisor based on hardware support is used, have the same problem. None of the hypervisors are modified to run in a paravirtualized environment. In addition, the virtualization extensions are not provided in the paravirtualized guest. Even if the hypervisors were adapted for the paravirtualization, they would still need these extensions. These setups look like figure 5.2 or figure 5.3, depending on whether the nested hypervisor is hosted or native, bare-metal. None of the seven setups with paravirtualization as bottom layer worked. The results of the setups are shown in table 5.3. The column with the header “Xen” represents the L1 hypervisor. The main problem is the adaptation of the hypervisors. 5.2. FIRST GENERATION HARDWARE SUPPORT 40 XEN PV VirtualBox VMware Xen KVM DBT × HV × DBT × HV × PV × HV × HV × Table 5.3: The nesting setups with paravirtualization as the L1 hypervisor technique. Unless these hypervisors are modified, paravirtualization is not a good choice as L1 hypervisor technique. It will always depend on the adaptation of the hypervisor and one could only use that hypervisor. When using paravirtualization, the best one could do is hope that developers adapt their hypervisors or modify the hypervisor oneself. 5.1.3 Overview software solutions Previous subsections explain the results of nested virtualization with software solutions for the bottom layer hypervisor. This subsection gives an overview of all the possible setups described in the previous subsections. All these setups are gathered in table 5.4. The columns of the table represent the setups belonging to the same L1 hypervisor. The rows in the table indicate a different nested hypervisor, i.e. the hypervisor represented by the row is nested inside the hypervisor represented by the column. Nested x86 virtualization using a L1 hypervisor based on a software solution is not successful. Out of the 21 setups that were tested, only one setup allows to successfully boot a L2 guest: nesting Xen inside VMware Workstation. Note that 12 setups are unsuccessful simply because hardware support for x86 virtualization is not available in the L1 guest. 5.2 First generation hardware support The setups with a bottom layer hypervisor based on hardware support are described in this section. The theoretical possibilities and requirements needed for these setups are discussed in section 4.3. The conclusion was that it should be possible to nest virtual machines without modifying the guest operating systems, given that the physical processor provides the hardware extensions for x86 virtualization. In 5.2. 
FIRST GENERATION HARDWARE SUPPORT VirtualBox VMware XEN DBT DBT PV 5.1.1 5.1.1 5.1.2 DBT × × × HV × × × DBT × × × HV × × × PV × X × HV × × × HV × × × Subsection VirtualBox VMware Xen KVM 41 Table 5.4: Overview of the nesting setups with a software solution as the L1 hypervisor technique. chapter 3, the hardware support for x86 virtualization was divided into the first generation and second generation hardware support. The second generation hardware support adds a hardware supported memory management unit so that the hypervisor does not need to maintain shadow tables. The original research was done on a processor1 that did not have second generation hardware support. Detailed information about the hypervisor versions is listed in section B.3. To make a comparison between first generation and second generation hardware support for x86 virtualization, the setups were also tested on a newer processor2 that does provide the hardware supported MMU. The results of the tests on the newer processor are given in section 5.3. The tested L1 hypervisors using the hardware extensions for virtualization are VirtualBox, VMware Workstation, Xen and KVM. We nested the seven hypervisors (see table 3.1) within these four hypervisors, resulting in 28 setups. In the first subsection the nested hypervisor is based on dynamic binary translation. The second subsection described the setups with Xen paravirtualization as the L2 hypervisor. The last subsection handles the setups with a nested hypervisor based on hardware support for x86 virtualization. 1 Setups with a L1 hypervisor based on first generation hardware support for x86 virtualization R were tested on an Intel CoreTM 2 Quad Q9550 processor. 2 Setups with a L1 hypervisor based on second generation hardware support for x86 virtualization R were tested on an Intel CoreTM i7-860 processor. 5.2. FIRST GENERATION HARDWARE SUPPORT 42 5.2.1 Dynamic binary translation Using dynamic binary translation as the nested hypervisor technique, there are eight setups. Three of these setups are able to successfully boot and run a nested virtual machine. The layout of these setups can be seen in figure 5.4 where the L1 hypervisor is based on hardware support and the L2 hypervisor is based on dynamic binary translation. When Xen is used as the L1 hypervisor, the host OS layer can be left out and a domain 0 is started next to VM1, which still uses hardware support for its virtualization. Figure 5.4: Layers for nested dynamic binary translation in a hypervisor based on hardware support. VirtualBox: When VirtualBox based on hardware support is used as the bottom layer hypervisor, none of the setups worked. Nesting VirtualBox inside VirtualBox resulted in the L2 guest becoming unresponsive. The same result happened when VirtualBox was nested in VirtualBox but used dynamic binary translation for both levels. When trying to nest a VMware Workstation guest inside VirtualBox, the configuration of that setup is very unstable so that each minor change resulted in a setup that refuses to start the L2 guest. There was one working configuration which we listed in section B.3. VMware Workstation: If the L1 hypervisor in figure 5.4 is VMware Workstation, the setups were successful in nesting virtual machines. Both VirtualBox and VMware Workstation as nested hypervisors based on dynamic binary translation were able to start the L2 guest which booted and ran correctly. Xen: VMware Workstation3 checks whether there is an underlying hypervisor running. 
It noticed that Xen was running and refused to start a nested guest. This prevents an L2 VMware guest from starting within a Xen guest. In the other setup, where VirtualBox is used as the inner hypervisor, the L2 guest again became unresponsive after starting. There was no crash, error message or warning; possibly the L2 guest was simply booting at a very slow pace.

KVM: The third and last working setup for nesting a hypervisor based on dynamic binary translation within one based on hardware support is nesting VMware Workstation inside KVM. In newer versions of VMware Workstation (7.0.1 build-227600 and newer), a check for an underlying hypervisor noticed that KVM was running and refused to boot a nested guest. The setup with VirtualBox as the nested hypervisor crashed while booting: the L2 guest showed an error indicating a kernel panic because it could not synchronize, and it became unresponsive after displaying the message.

                      VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
  VirtualBox (DBT)    ×                 ✓             ×          ×
  VMware (DBT)        ∼                 ✓             ×          ✓

Table 5.5: The nesting setups with first generation hardware support as the L1 hypervisor technique and DBT as the L2 hypervisor technique (✓ = working, × = not working, ∼ = unstable).

Table 5.5 gives a summary of the eight setups discussed in this subsection. VMware Workstation is the best option: it allows nesting other hypervisors based on dynamic binary translation, and it is also the most likely to work when used as the nested hypervisor based on dynamic binary translation. In comparison to nesting inside a software solution, VirtualBox is able to nest within VMware Workstation when hardware support is used for the L1 hypervisor. VirtualBox is still not able to nest within KVM, Xen or itself, while VMware Workstation is able to nest within KVM and itself. It is unfortunate that VMware Workstation checks for an underlying hypervisor other than VMware itself, preventing the use of VMware Workstation within other hypervisors.

5.2.2 Paravirtualization

In this subsection, we discuss the setups that nest a paravirtualized guest inside a guest virtualized using hardware support. Figure 5.5 shows the layers in these setups. The main differences with the setups in the previous subsection are that the L1 guest and the L2 hypervisor are represented by the same layer and that Xen automatically starts domain 0.

Figure 5.5: Layers for nested paravirtualization in a hypervisor based on hardware support.

There are just four setups tested in this subsection, since only Xen is nested within the four hypervisors based on hardware support. All four setups could successfully nest a paravirtualized guest inside the L1 guest. However, the setup where Xen is nested inside VirtualBox was not very stable. Sometimes several segmentation faults occurred during the start-up of the privileged domain. Domain 0 was able to boot and run successfully, but the creation of another paravirtualized guest was sometimes impossible. Xen reported that the guest was created, but it did not show up in the list of virtual machines, indicating that the guest crashed immediately.

             VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
  Xen (PV)   ∼                 ✓             ✓          ✓

Table 5.6: The nesting setups with first generation hardware support as the L1 hypervisor technique and PV as the L2 hypervisor technique.

An overview of the four setups is shown in table 5.6.
It is clear that using paravirtualization as the technique for the nested hypervisor can be recommended. The only setup that does not work completely is the one with VirtualBox. Since the other three setups work, and since previous conclusions were also not in favor of VirtualBox, VirtualBox is probably the cause of the instability.

5.2.3 Hardware supported virtualization

The remaining setups, which attempt to nest a hypervisor based on hardware supported virtualization, are discussed in this subsection. Nesting the four hypervisors based on hardware support within each other results in 16 setups. The layout of the setups is equal to figure 5.4 or figure 5.5, depending on which hypervisor is used. None of the hypervisors provide the x86 virtualization processor extensions to their guests, which means that none of the setups will work. Developers of both KVM and Xen are working on nested hardware support; detailed information can be found in section 5.4. KVM has already released initial patches for nested hardware support on AMD processors and is working on patches for nested support on Intel processors. Xen is also researching the ability to nest the hardware support so that nested virtual machines can use the hardware extensions.

                      VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
  VirtualBox (HV)     ×                 ×             ×          ×
  VMware (HV)         ×                 ×             ×          ×
  Xen (HV)            ×                 ×             ×          ×
  KVM (HV)            ×                 ×             ×          ×

Table 5.7: The nesting setups with first generation hardware support as the L1 and L2 hypervisor technique.

The results of this subsection are summarized in table 5.7. It is unfortunate that currently none of the setups work, because the L1 hypervisors do not yet provide the hardware support for virtualization to their guests. Nonetheless, it is hopeful that KVM and Xen are doing research and development in this area. Their work can motivate managers and developers of other hypervisors to provide these hardware extensions to their guests as well. We would like to note that VMware and VirtualBox guests with a 64 bit operating system need hardware support to execute. If we were to use a 64 bit operating system for the nested guest, the result would be the same as the results in this section, since there is currently no nested hardware support.

5.2.4 Overview first generation hardware support

In this subsection, we summarize the results of the setups described in the previous subsections. All the setups were tested on a processor that has first generation hardware support for virtualization on x86 architectures. The results of all the setups are gathered in table 5.8. The columns indicate the L1 hypervisor and the rows indicate the L2 hypervisor, i.e. the hypervisor represented by the row is nested inside the hypervisor represented by the column.

Nested x86 virtualization using an L1 hypervisor based on hardware support is more successful than using an L1 hypervisor based on software solutions (see section 5.1.3). For nesting dynamic binary translation, the results suggest that VMware Workstation was the best option and that VirtualBox works, although it showed some instabilities. Nesting paravirtualization is the most suitable solution when using an L1 hypervisor based on hardware support on a processor that only supports first generation hardware support. Nested hardware support is not present yet, but KVM and Xen are working on it. The future will tell whether they will be successful or not.
The number of working setups increased when using hardware support in the L1 hypervisor, so the future looks promising for nested hardware support. For now, VMware Workstation is the most suitable choice for the L1 hypervisor, directly followed by KVM, since it can nest three different hypervisors. The advisable choice for the L2 hypervisor is a paravirtualized guest using Xen, since it allows nesting in all the hypervisors. VirtualBox as the L1 hypervisor has two unstable setups, which makes it rather unsuitable for nested virtualization.

                      VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
  VirtualBox (DBT)    ×                 ✓             ×          ×
  VirtualBox (HV)     ×                 ×             ×          ×
  VMware (DBT)        ∼                 ✓             ×          ✓
  VMware (HV)         ×                 ×             ×          ×
  Xen (PV)            ∼                 ✓             ✓          ✓
  Xen (HV)            ×                 ×             ×          ×
  KVM (HV)            ×                 ×             ×          ×

Table 5.8: Overview of the nesting setups with first generation hardware support as the L1 hypervisor technique.

5.3 Second generation hardware support

In section 4.3 we concluded that it should be possible to nest virtual machines without modifying the guest operating system, given that the hardware extensions for virtualization on x86 architectures are provided by the physical processor. In section 5.2, the setups with an L1 hypervisor based on hardware support were tested on a processor that only provides first generation hardware support. In this section, we test the same setups on a newer processor that provides second generation hardware support (an Intel Core i7-860). Comparing the results of both sections gives an insight into the influence of the hardware supported MMU on nested virtualization. Section B.4 lists detailed information about the hypervisor versions.

The second generation hardware support offers a hardware supported MMU, which provides the Extended Page Tables (EPT) for Intel and the Nested Page Tables (NPT) for AMD (see section 3.4). The memory management in nested virtualization needs three levels of address translation, as can be seen in figure 4.2, while the hardware supported MMU only offers two levels of address translation. This problem is solved by reusing the existing code for the shadow tables. The L2 hypervisor maintains shadow tables that translate nested virtual guest addresses to physical guest addresses. These shadow tables are used together with the EPT or NPT to translate a nested virtual guest address into a real physical address. So in a nested setup, the L2 guest maintains its own page tables that translate nested virtual guest addresses into nested physical guest addresses. The L2 hypervisor maintains shadow tables for these page tables that immediately translate nested virtual guest addresses into physical guest addresses. The L1 hypervisor maintains the EPT or NPT that translates the physical guest addresses into real physical addresses.

The setups in this section remain unchanged; VirtualBox, VMware Workstation, Xen and KVM are used as the L1 hypervisor, using the hardware extensions for virtualization on x86 architectures. The first subsection elaborates on the results of nesting an L2 hypervisor based on dynamic binary translation. The second subsection discusses the results of a nested hypervisor based on paravirtualization. The results of nesting a hypervisor based on hardware support within the L1 hypervisor are explained in the third subsection.
Subsection 5.3.4 gives an overview of the results in this section and compares them with the results obtained in section 5.2.

5.3.1 Dynamic binary translation

Eight setups were tested using an L2 hypervisor based on dynamic binary translation. Compared to the three working setups in subsection 5.2.1, there are six working setups when using an L1 hypervisor based on second generation hardware support for virtualization on x86 architectures. The layout of the setups in this subsection is shown in figure 5.4.

VirtualBox: Using VirtualBox based on hardware support as the bottom layer hypervisor results in a different outcome for one of the setups. If the L2 hypervisor is VMware Workstation, the result was very unstable, comparable to subsection 5.2.1. The other setup, which uses VirtualBox as the nested hypervisor, was able to boot and run an L2 guest successfully. In the tests without second generation hardware support, this setup became unresponsive, so the use of the hardware supported MMU affects the outcome of this setup.

VMware Workstation: Nothing has changed for these results. Both setups with VMware Workstation as the L1 hypervisor were still successful in running an L2 guest.

Xen: The setup with VirtualBox as the L2 hypervisor also has a different outcome with Xen based on hardware support as the L1 hypervisor. An L2 guest was able to boot and run successfully, while in the test with first generation hardware support the setup became unresponsive. The setup with VMware Workstation as the L2 hypervisor still does not work because the hypervisor checks for an underlying L1 hypervisor: the Xen hypervisor was detected and VMware Workstation (since version 6.5.3 build-185404) reported that the user should disable the other hypervisor.

KVM: Nesting VirtualBox or VMware Workstation within KVM now worked for both setups. The setup with VMware Workstation as the inner hypervisor already worked without second generation hardware support, although newer versions of VMware Workstation (7.0.1 build-227600 and newer) check for an underlying hypervisor, notice that KVM is running, and refuse to start; with those versions this check prevents the setup from working. The setup using VirtualBox as the L2 hypervisor, which showed a kernel panic in the previous test, now booted and ran successfully.

                      VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
  VirtualBox (DBT)    ✓                 ✓             ✓          ✓
  VMware (DBT)        ∼                 ✓             ×          ✓

Table 5.9: The nesting setups with second generation hardware support as the L1 hypervisor technique and DBT as the L2 hypervisor technique.

The new results are gathered in table 5.9. In subsection 5.2.1, VMware Workstation was the recommended option for both the L1 and the L2 hypervisor. The conclusion is different with the results in this subsection. VMware Workstation now shares the position of most suitable bottom layer hypervisor with KVM. The most suitable choice for the L2 hypervisor is no longer VMware Workstation but VirtualBox, since it could be nested in all setups; the check for an underlying hypervisor in VMware Workstation prevented it from being nested in certain setups. The setup that nests VMware inside VirtualBox is very unstable, which prevents VirtualBox from being the advisable choice for the L1 hypervisor.

5.3.2 Paravirtualization

In this subsection, we replace dynamic binary translation as the L2 hypervisor technique with paravirtualization. The layout of the setups is shown in figure 5.5. There were three setups that worked completely in subsection 5.2.2.
The fourth setup was very unstable because segmentation faults could occur during the start-up of domain 0. With second generation hardware support these segmentation faults disappeared and the fourth setup passed the test successfully. There is little difference with the previous results on a processor with first generation hardware support. The new results are collected in table 5.10. Xen using paravirtualization remains an excellent choice for nesting inside a virtual machine.

             VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
  Xen (PV)   ✓                 ✓             ✓          ✓

Table 5.10: The nesting setups with second generation hardware support as the L1 hypervisor technique and PV as the L2 hypervisor technique.

5.3.3 Hardware supported virtualization

None of the setups in subsection 5.2.3 worked. The results are the same for this subsection, since the problem is not the hardware supported MMU. The layout of the setups with the L1 and L2 hypervisors based on hardware support for virtualization on x86 architectures is similar to figure 5.4 or figure 5.5, depending on which hypervisor is used. The problem is that there is no nested hardware support. There is work in progress on this subject by KVM and Xen, see section 5.4.

                      VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
  VirtualBox (HV)     ×                 ×             ×          ×
  VMware (HV)         ×                 ×             ×          ×
  Xen (HV)            ×                 ×             ×          ×
  KVM (HV)            ×                 ×             ×          ×

Table 5.11: The nesting setups with second generation hardware support as the L1 and L2 hypervisor technique.

For completeness, the results are shown in table 5.11, but they are the same as the results of the tests on a processor without second generation hardware support.

5.3.4 Overview second generation hardware support

The intermediate results of the previous subsections are gathered in this subsection. The setups were tested on a processor that provides second generation hardware support for virtualization. Table 5.12 shows a summary of the results obtained in the previous subsections. The columns indicate the L1 hypervisor and the rows represent the L2 hypervisor, i.e. the hypervisor indicated by the row is nested inside the hypervisor represented by the column.

Nested x86 virtualization with an L1 hypervisor based on hardware support is even more successful when the processor provides second generation hardware support. Both dynamic binary translation and paravirtualization are capable of being nested inside a hypervisor based on hardware support. The two setups that did not work had problems with configuration instability and with the check for an underlying hypervisor in VMware Workstation. KVM and VMware Workstation are the advisable choices for the L1 hypervisor, since all dynamic binary translation and paravirtualization setups worked for these hypervisors. VirtualBox using dynamic binary translation and Xen using paravirtualization are the most suitable choices for the L2 hypervisor.

                      VirtualBox (HV)   VMware (HV)   Xen (HV)   KVM (HV)
  VirtualBox (DBT)    ✓                 ✓             ✓          ✓
  VirtualBox (HV)     ×                 ×             ×          ×
  VMware (DBT)        ∼                 ✓             ×          ✓
  VMware (HV)         ×                 ×             ×          ×
  Xen (PV)            ✓                 ✓             ✓          ✓
  Xen (HV)            ×                 ×             ×          ×
  KVM (HV)            ×                 ×             ×          ×

Table 5.12: Overview of the nesting setups with second generation hardware support as the L1 hypervisor technique.

Many setups that were unresponsive in section 5.2 became responsive when using a hardware supported MMU. The use of EPT or NPT improves the performance of the memory management and relieves the L1 hypervisor from maintaining shadow tables. The maintenance of the shadow tables is done in software and can contain bugs.
It must also be implemented in a performance oriented way, since it is a crucial part. After some research (see the discussion at http://www.mail-archive.com/[email protected]/msg29779.html), it became clear that hypervisors normally take shortcuts in order to improve the performance of the memory management. Thus, the main issue is the shadow tables, which optimize the MMU virtualization but, for performance reasons, do not exactly follow architectural equivalence. Two levels of shadow page tables seemed to be the cause of unresponsiveness in several setups. Replacing the shadow tables in the L1 hypervisor by the use of EPT or NPT removes the inaccurate virtualization of the memory management unit. The second generation hardware support inserts an accurate hardware MMU with two levels of address translation in the L1 hypervisor, allowing L2 hypervisors and L2 guests to run successfully.

5.4 Nested hardware support

Nested hardware support is the support of hardware extensions for virtualization on x86 architectures within a guest. The goal of nested hardware support is mainly to support nested virtualization for L2 hypervisors based on that hardware support. In section 4.3, we concluded that in order to nest a hypervisor based on hardware support, the virtualized processor should provide the hardware extensions. In subsection 5.2.3 and subsection 5.3.3, we noticed that none of the hypervisors provide a virtualized processor with hardware extensions, so none of the setups were able to nest such a hypervisor. Recently, KVM and Xen started research in this domain in order to develop nested hardware support. In the following subsections, the work in progress of both KVM and Xen is presented.

5.4.1 KVM

Nested hardware support was not supported by default in KVM. The virtualized processor provided to the guest is similar to the host processor, but lacks the hardware extensions for virtualization. These extensions are needed in order to use KVM or any other hypervisor based on hardware support. The introduction of nested hardware support should allow these hypervisors to be nested inside a virtual machine.

The first announcement of nested hardware support was made in September 2008 in a blog post by Avi Kivity [48]. He writes about an e-mail from Alexander Graf and Joerg Roedel presenting a patch for nested SVM support [49], i.e. nested hardware support for AMD processors with SVM support, and about the relative simplicity of this patch. More information on AMD SVM itself can be found in section 3.3. Alexander Graf and Joerg Roedel are both developers working on new features for KVM. The patch was eventually included in development version kvm-82 and allows a guest on an AMD processor with hardware extensions for virtualization to run a nested hypervisor based on hardware support. The implementation of the patch stayed relatively simple by exploiting the design of the SVM instruction set.

A year later, in September 2009, Avi Kivity announced that support for nested VMX, i.e. nested hardware support for Intel processors with Intel VT-x extensions, is coming. The bad news is that it will take longer to implement this feature, since nested VMX is more complex than nested SVM. In section 3.3, we explained that Intel VT-x and AMD SVM are very similar but that the terminology is somewhat different. Besides the similarities, there are some fundamental differences in their implementation that make nested VMX support more complex.
A first difference is the manipulation of the data structure used by the hypervisor to communicate with the processor. For Intel VT-x this data structure is called the VMCS; the equivalent in AMD SVM is called the VMCB. Intel uses two instructions, VMREAD and VMWRITE, to manipulate the VMCS, while AMD allows manipulation of the VMCB by reading and writing a memory region. The drawback of the two extra instructions is that KVM must trap and emulate these special instructions. For SVM, KVM can simply allow the guest to read and write the memory region of the VMCB without intervention. A second difference is the number of fields used in the data structure. Intel uses a lot more fields for hypervisor-processor intercommunication: AMD SVM has 91 fields in the VMCB, while Intel VT-x has no less than 144 fields in the VMCS. KVM needs to virtualize all these fields and make sure that the guest running a hypervisor can use those fields in a correct way.

Besides the differences in the implementation of Intel VT-x and AMD SVM, another reason for the longer development time of the nested VMX support is that the patch will immediately support nested EPT. This means that not only the hypervisor in the host can use Extended Page Tables (see section 3.4), but the hypervisor in the guest also benefits from EPT support. As already pointed out in section 4.3, nested EPT or nested NPT could be critical for obtaining reasonable performance. With the VMX support, a KVM guest must support the 32 bit and 64 bit page table formats and the EPT format.

In practice

The nested hardware support was tested on an AMD processor (a quad-core AMD Opteron 2350), since the nested SVM patch was already released. The installation is the same as a regular install, but in order to use the patch one must set a flag when loading the modules. We can do this using the following commands:

# load the KVM core module and the AMD-specific module with nested SVM enabled
modprobe kvm
modprobe kvm-amd nested=1

The flag "nested=1" indicates that we want to use nested SVM. The tested setup was KVM as both the L1 and the L2 hypervisor. After installing and booting the L1 guest, KVM was installed inside the guest in exactly the same way as a normal installation of KVM; the nested hypervisor's modules do not need to be loaded with "nested=1". In subsection 5.2.3 and subsection 5.3.3, we could not install KVM within the guest, so installing KVM within the guest is a promising step towards nested virtualization with KVM, or any other hypervisor based on hardware support, as the nested hypervisor. When starting the L2 guest, either to install an operating system or to boot an existing one, some "handle exit" messages occurred. On KVM's mailing list, Joerg Roedel replied in March 2010 that the messages result from a difference between real hardware SVM and the SVM emulated by KVM (http://www.mail-archive.com/[email protected]/msg31096.html). A patch should fix this issue, but since it still needs more testing, the current setup was not able to boot. Nonetheless, developers are constantly improving nested SVM by means of new patches and tests, so it is just a matter of time before this setup will work.
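Independently of these remaining issues, a quick sanity check, which is not described in the thesis itself, is to confirm that nested SVM is actually active before installing the inner KVM. The commands below are a sketch using standard Linux interfaces (/proc/cpuinfo and the sysfs view of module parameters); the exact paths and output may differ per kernel and KVM version.

# On the host: the kvm_amd module parameter should report that nested SVM is on,
# assuming the kernel exposes module parameters under /sys/module.
cat /sys/module/kvm_amd/parameters/nested

# Inside the L1 guest: the virtualized CPU should now advertise the svm flag,
# otherwise the nested KVM module will refuse to load.
egrep -c '(svm|vmx)' /proc/cpuinfo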
5.4.2 Xen

Xen is also working on nested virtualization, with an emphasis on virtualization based on hardware support. In November 2009, during the Xen Summit in Asia, Qing He presented his work on nested virtualization [50]. Qing He has been working on Xen since 2006 and is a software engineer at the Intel Open Source Technology Center. His work focuses on hardware supported virtualization and more specifically on Intel's VT-x hardware support.

The current progress is a proof of concept for a simple scenario with a single processor and one nested guest. The nested guest is able to boot to an early stage successfully with KVM as the L2 hypervisor. Before the current version can be released, it still needs some stabilization and refinement.

The main target is the virtualization of VMX in order to present a virtualized VMX to the guest. This means that everything of the hardware support must be available in the guest. The guest should be able to use the data structures and the instructions to manipulate the VMCS. The guest should also be able to control the execution flow of VMX with VMEntry and VMExit transitions.

Figure 5.6: Nested virtualization architecture based on hardware support.

The data structures are shown in figure 5.6. The L1 guest has a VMCS that is loaded into the hardware when this guest is running; this VMCS is maintained by the L1 hypervisor. If the L2 guest wants to execute, it needs a corresponding VMCS. That corresponding VMCS is maintained by the L2 hypervisor running in the L1 guest and is called the virtual VMCS, or vVMCS. The L2 hypervisor sees the virtual VMCS as the controlling VMCS of the L2 guest, but it is called virtual because the L1 hypervisor maintains a corresponding shadow VMCS, or sVMCS. This shadow VMCS is not a complete duplicate of the virtual VMCS but contains translations, similar to the shadow tables (see subsection 3.1.3). It is the shadow VMCS that is loaded into the hardware when the L2 guest is running. Thus, each nested guest has a virtual VMCS in the L2 hypervisor and a corresponding shadow VMCS in the L1 hypervisor. The general idea is to treat the L2 guests as guests of the L1 hypervisor using the shadow VMCS.

Figure 5.7 shows the execution flow in a nested virtualization scenario based on hardware support. On the left side of the figure, the L1 guest is running and wants to start a nested guest. The guest does this by executing a VMEntry with the instruction VMLAUNCH or VMRESUME. The virtual VMEntry cannot directly switch to the L2 guest because this is not supported by the hardware: the L1 guest is already using VMX guest mode and can only trigger a VMExit. The VMExit results in a transition to the L1 hypervisor, which intercepts the VMEntry call and tries to switch to the shadow VMCS indicated by the VMEntry. This results in the transition to the L2 guest, and the L2 guest can run from then on.

Similar to a virtual VMEntry, a virtual VMExit will transition to the L1 hypervisor. The L1 hypervisor does not know whether the VMExit is a virtual VMExit or whether the VMExit happened because the L2 guest executed a privileged instruction. When the L2 guest tries to run a privileged instruction, the L1 hypervisor can fix this without having to forward the VMExit to the L2 hypervisor. An algorithm in the L1 hypervisor determines whether this is a virtual VMExit that should be forwarded to the L2 hypervisor, or another type of VMExit that can be handled by the L1 hypervisor itself. For a virtual VMExit, the L1 hypervisor forwards it to the L2 hypervisor and the shadow VMCS of the L2 guest is unloaded; the L1 hypervisor switches the controlling VMCS to the VMCS of the L1 guest.

Figure 5.7: Execution flow in nested virtualization based on hardware support.
In the figure, there are three VMExits that result in a transition to the L1 hypervisor. The first and the last VMExit are forwarded by the L1 hypervisor to the L2 hypervisor, and the second VMExit is handled by the L1 hypervisor itself.

There is no special handling in place for the memory management. Nested EPT, as described in the previous subsection, would also be very helpful in this case because it significantly reduces the number of virtual VMExits. Nested EPT support is still work in progress.

  L2 \ L1            VBox   VMware  Xen   VBox HV    VMware HV  Xen HV     KVM HV
                     (DBT)  (DBT)   (PV)  1st/2nd    1st/2nd    1st/2nd    1st/2nd
  VirtualBox (DBT)   ×      ×       ×     × / ✓      ✓ / ✓      × / ✓      × / ✓
  VirtualBox (HV)    ×      ×       ×     × / ×      × / ×      × / ×      × / ×
  VMware (DBT)       ×      ×       ×     ∼ / ∼      ✓ / ✓      × / ×      ✓ / ✓
  VMware (HV)        ×      ×       ×     × / ×      × / ×      × / ×      × / ×
  Xen (PV)           ×      ✓       ×     ∼ / ✓      ✓ / ✓      ✓ / ✓      ✓ / ✓
  Xen (HV)           ×      ×       ×     × / ×      × / ×      × / ×      × / ×
  KVM (HV)           ×      ×       ×     × / ×      × / ×      × / ×      × / ×

Table 5.13: Overview of all nesting setups (rows: L2 hypervisor, columns: L1 hypervisor; ✓ = working, × = not working, ∼ = unstable). For the L1 hypervisors based on hardware support, the results are given for the first and second generation processors.

CHAPTER 6
Performance results

This chapter elaborates on the performance of the working setups for nested virtualization on x86 architectures. Chapter 5 showed that there was one working setup for nested virtualization when using dynamic binary translation as the L1 hypervisor technique. There were also ten working setups when using an L1 hypervisor based on hardware support with a processor that contains the second generation hardware extensions for virtualization on x86 architectures. The performance in a normal virtual machine is compared to the performance in a nested virtual machine in order to get an idea of the performance degradation between virtualization and nested virtualization.

The performed tests measure the processor, memory and I/O performance. These are the three most important components of a computer system. The evolution of hardware support for virtualization on the x86 architecture also shows that the processor, the memory management unit and I/O are important components, see chapter 3: the first generation hardware support focuses on the processor, the second generation concentrates on a hardware supported MMU and the newest generation provides support for directed I/O.

The benchmarks used for the tests are sysbench (http://sysbench.sourceforge.net/), iperf (http://iperf.sourceforge.net/) and iozone (http://www.iozone.org/). sysbench was used for the processor, memory and file I/O performance, iperf was used for network performance and iozone was used as a second benchmark for file I/O. The rest of this chapter is organized along these three components. The first section elaborates on the performance of the processor in nested virtualization. The next section evaluates the memory performance of the nested virtual machines and the third section shows the performance of I/O in a nested setup. The last section gives an overall conclusion on the performance of nested virtualization.
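To make the measurement procedure concrete, the command lines below illustrate how tests with the parameters described in this chapter could be invoked. They are a sketch based on the documented options of sysbench 0.4 and iperf, not a record of the exact commands used for the measurements; the option names and the server address are assumptions.

# CPU test: prime numbers up to 150000, 10000 calculations spread over 10 threads
sysbench --test=cpu --cpu-max-prime=150000 --max-requests=10000 --num-threads=10 run

# Memory test: read or write 2 GB in 256-byte blocks, sequentially or randomly
sysbench --test=memory --memory-total-size=2G --memory-block-size=256 --memory-oper=read --memory-access-mode=seq run

# Thread test: 1000 threads contending for 4 mutexes
sysbench --test=threads --num-threads=1000 --thread-locks=4 run

# Network test: TCP throughput measured during 10 seconds against an iperf server
iperf -c <server-address> -t 10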
Whenever a test ran directly on the host operating system, without any virtualization, the test is labeled "native". If the label is the name of a single virtualization product, the test ran inside an L1 guest with the indicated hypervisor as the L1 hypervisor. The "DBT" suffix indicates that the L1 hypervisor uses the dynamic binary translation technique, while "HV" tests use the hardware support of the processor for virtualization. All performance tests were executed on an Intel Core i7-860 processor, which provides second generation hardware support for x86 virtualization. A label of the form "L1 hypervisor - L2 hypervisor" shows the result of a performance test executed in an L2 guest using the given L2 and L1 hypervisors. For example, "KVM (HV) - VirtualBox (DBT)" indicates the setup where KVM is used as the L1 hypervisor and VirtualBox is used as the L2 hypervisor based on dynamic binary translation. All nested setups use the hardware support of the processor in the L1 hypervisor, except for "VMware (DBT) - Xen (PV)", which uses VMware as the L1 hypervisor based on dynamic binary translation and Xen as the L2 hypervisor based on paravirtualization. The L2 hypervisor is never based on hardware support, as can be seen in chapter 5; thus VirtualBox and VMware are always based on dynamic binary translation and Xen is always based on paravirtualization when used as the L2 hypervisor.

6.1 Processor performance

The experiment used to measure the performance of the processor consists of a sysbench test which calculates prime numbers. It calculates the prime numbers up to a set maximum and does this a given number of times; the number of threads that calculate the prime numbers can also be set before running the test. In the executed tests, the maximum for the primes was 150000 and all prime numbers up to 150000 were calculated 10000 times, spread over 10 threads. The measured unit of the test is the duration in seconds.

Figure 6.1 shows the first results of the performance test for the processor. The left bar is the result on the computer system without virtualization and the other bars are the results of the tests in L1 guests. The figure shows a serious gap between native performance and the performance in a virtual machine. The reason for this big gap is the use of only one core inside the virtual machine, while the host operating system can use four cores. The tests were executed in virtual machines with only one core so that the comparison between the different virtualization products would be fair.

Figure 6.1: CPU performance for native with four cores and L1 guests with one core (lower is better).

In order to get an indication of the real performance degradation, the same test was executed in a VMware guest that can use four cores and in a "VMware (HV) - VMware (DBT)" nested guest that can use four cores. The results of these tests are given in figure 6.2. The figure shows that the performance degradation between a virtual machine and a nested virtual machine is smaller than the performance degradation between the native platform and a virtual machine. By adding an extra level of virtualization one expects a certain overhead, but this shows that the performance degradation for the extra level is limited. The overhead grows linearly rather than exponentially, which is promising because the latencies of VMEntry and VMExit transitions (see section 3.3) do not have to be improved dramatically in order to get acceptable performance in the nested guest.

The results of the tests on virtual machines and nested virtual machines are shown in figure 6.3. The performance of the L1 guests with "HV" is about the same, since these L1 hypervisors all use hardware support for virtualization.
Figure 6.2: CPU performance for native, L1 and L2 guest with four cores (lower is better).

The L1 guest that is virtualized using dynamic binary translation, "VMware (DBT)", was able to perform equally well. The results of the L2 guests vary heavily between the different setups and are higher than the results of the L1 guests. However, the performance degradation is not problematic, except for one outlier which uses dynamic binary translation for the L1 hypervisor: with a duration of 496.83 seconds, the "VMware (DBT) - Xen (PV)" setup performs much worse than the other nested setups.

Figure 6.3: CPU performance for L1 and L2 guests with one core (lower is better).

6.2 Memory performance

In this section, the performance degradation of the memory management unit is evaluated. In section 5.3 we explained that the hardware supported L1 hypervisors use the hardware supported MMU of the processor, while the L2 hypervisors use a software technique for maintaining the page tables of their guests. In the "VMware (DBT) - Xen (PV)" setup, the L1 hypervisor maintains shadow tables and the L2 hypervisor provides paravirtual interfaces to its guests.

The performed memory tests evaluate the read and write throughput. The tests read or write a total of 2 GB from or to memory in blocks of 256 bytes. Each test was run twice: once reading or writing in sequential order and once in random order. Figure 6.4 presents the results of the memory tests for the native platform, the L1 guests and the L2 guests.

Several observations about nested virtualization can be made from the results. A first observation is that the duration of the tests increases greatly when using virtualization. The L1 guests needed approximately 10 seconds to read or write 2 GB, while the test on the native platform took about 1.5 seconds. Most L2 guests took more than 128 seconds to pass the test. For nested virtualization, the performance degradation of the memory is thus more significant than the performance degradation of the processor.

A second observation is that nesting Xen avoids the performance degradation for the memory, except for the setup with dynamic binary translation as the L1 hypervisor. While other nested setups took more than 128 seconds, the nested setups with Xen as the L2 hypervisor took 10 seconds, which is the same as in the L1 guests. Paravirtualization appears to be a promising technique for the L2 hypervisor to minimize the performance overhead of the memory. The reason why "Xen (PV)" does not minimize the performance overhead of the memory when compared to native is unclear. The figure also shows that the "VMware (DBT)" setup performs poorly compared to the other L1 setups and that "VMware (DBT) - Xen (PV)" did not take advantage of the paravirtualization in the L2 hypervisor.
The nested setup "VMware (DBT) - Xen (PV)" is not the worst of all nested setups, but the duration still increases despite the use of paravirtualization as the L2 hypervisor technique. Therefore, an L1 hypervisor that uses second generation hardware support for virtualization on x86 architectures performs better for memory management.

Figure 6.4: Memory performance for L1 and L2 guests, with sequential and random reads and writes (lower is better).

The thread test stresses many divergent paths through the hypervisor, such as system calls, context switching, creation of address spaces and injection of page faults [11]. Figure 6.5 summarizes the results of the thread test. The test created 1000 threads and 4 mutexes. Each thread locks a mutex, yields the CPU and unlocks the mutex afterwards. These actions are performed in a loop, so there is contention on each mutex. The results of this test are equal to the results of the memory performance test, which indicates that the thread test depends heavily on memory management.

Figure 6.5: Threads performance for native, L1 guests and L2 guests with the sysbench benchmark (lower is better).

6.3 I/O performance

We evaluate the I/O performance in this section. There are many I/O devices, so we selected two major ones. The first test measured the network throughput and the second test measured the reads and writes from and to the hard disk. The first subsection elaborates on the results of the network test and the second subsection presents the results of the disk I/O.

6.3.1 Network I/O

The network throughput was tested using the iperf benchmark, which measures the TCP throughput during 10 seconds. Figure 6.6 shows that there is little or no performance degradation between native and L1 guests. The bottleneck in these tests was the 100 Mbit/s network card and not the virtualization; the results could be different for a network card with a higher throughput. The performance overhead for L2 guests depends heavily on which setup is used. The lowest performance was measured for VirtualBox on top of Xen. The nine nested setups can clearly be divided into groups of three. The nested setups in the first group perform rather poorly for network I/O, with a throughput of less than 50 Mbit/s. The second group achieved reasonable performance: these are the nested setups with a throughput between 50 Mbit/s and 80 Mbit/s.
The last group has good performance, with a network throughput of more than 80 Mbit/s, nearing native performance. This group shows little performance degradation compared to the L1 guests and native, taking into account that the network card is the bottleneck.

Figure 6.6: Network performance for native, L1 guests and L2 guests (higher is better).

6.3.2 Disk I/O

We measured the disk I/O performance using two tests. The first test is the file I/O test of sysbench. In the preparation stage, it creates a specified number of files with a specified total size. During the test, each thread performs the specified I/O operations on this set of files. In figure 6.7, we can observe that the setups with virtualization perform much better than the native case. These results are unusual, since we would expect the virtualization layer to add a certain performance overhead. The test in the L1 VMware Workstation guest took 0.5 seconds, while the same test on the native platform took 37.7 seconds. This suggests that some optimization provides a speedup for the disk performance. The optimization is not a feature of the hard disk, because it is not possible for a virtual machine to read from a disk faster than a native machine.

Figure 6.7: File I/O performance for native, L1 guests and L2 guests with the sysbench benchmark (lower is better).

The documentation of the iozone benchmark suggests that the processor cache and buffer cache are helping out for smaller files. It advises running the benchmark with a maximum file size that is larger than the available memory of the computer system. The results of these tests are shown in figure 6.8. In the figure we can clearly see that optimizations are obtained by the use of the caches. The theoretical speed of the hard disk (SATA2, 3.0 Gbit/s) is marked on the graphs. The throughput of the I/O exceeds this theoretical speed, indicating that the measured values are not the real I/O performance. The real I/O performance can be found when using larger files. In figure 6.8(a), the measured performance for the L1 VMware Workstation guest is lower than in the native test for larger files. The performance in the L2 VMware Workstation guest is higher than in the L1 guest, but the test stopped at 2 GB files since the hard disk of the nested guest was not large enough. The iozone tests showed that the sysbench tests were inaccurate due to caching and suggest that real I/O performance can be measured by writing and reading large files. In order to obtain good performance results, these tests should be conducted with larger files than the tests we ran.
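For completeness, the disk tests could be reproduced with invocations along the following lines. This is a sketch: the option names follow the documented sysbench 0.4 and iozone interfaces, but the concrete file sizes and test modes are assumptions rather than the exact parameters used for the measurements.

# sysbench file I/O: prepare the test files, run a random read/write workload, clean up
sysbench --test=fileio --file-total-size=4G prepare
sysbench --test=fileio --file-total-size=4G --file-test-mode=rndrw run
sysbench --test=fileio --file-total-size=4G cleanup

# iozone write and read tests in automatic mode, with a maximum file size larger than RAM
iozone -a -i 0 -i 1 -g 8G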
Figure 6.8: File I/O performance for native, L1 guests and L2 guests with the iozone benchmark: (a) write test, (b) read test. The theoretical SATA2 speed of 3.0 Gbit/s is marked on both graphs.

6.4 Conclusion

Performance overhead for nested virtualization is linear for CPU performance and exponential for memory performance, except for the memory performance of nested setups with paravirtualization as the L2 hypervisor technique; paravirtualization minimizes that performance degradation. For CPU performance in nested virtualization, the setup that uses dynamic binary translation as the L1 hypervisor was the only outlier; the other setups performed adequately. For memory performance, the setups that use paravirtualization as the L2 hypervisor performed as well as the L1 guests. The results for the I/O performance were split into network and disk performance. The network performance could be divided into three groups: the first group had near native performance, the second group performed acceptably and the last group performed rather poorly. The results for disk performance were not accurate enough, since real disk I/O was difficult to measure due to caching. More testing is required to reach an accurate conclusion on disk I/O performance.

CHAPTER 7
Conclusions

This chapter concludes the work of this thesis. In the first section, we elaborate on the results and conclusions of the previous chapters. The last section proposes some future work for nested virtualization.

7.1 Nested virtualization and performance results

Nested virtualization on the x86 architecture can be a useful tool for the development of test setups for research purposes, for creating test frameworks for hypervisors, etc. In chapter 5, we investigated which techniques are the most suitable choices for the L1 and L2 hypervisors.

The most suitable L1 hypervisor technique is a hardware based virtualization solution. When comparing the results of the setups that use software solutions for the L1 hypervisor with the results of the setups that use hardware support, we saw that the latter resulted in more working setups. The hardware support of the processor is preferably the second generation hardware support for virtualization. The use of EPT or NPT improves the performance of the memory management and relieves the L1 hypervisor from maintaining shadow tables. These shadow tables form a problem for certain nested setups when used in the L1 hypervisor: hypervisors take shortcuts in order to improve the performance of the software based memory management, and this can lead to failures when nesting virtual machines. The second generation hardware support avoids these problems by providing a hardware supported MMU and appears to be the most advisable choice for the L1 hypervisor. Table 5.13 shows an overview of the results of all nested setups.

The best technique for the L2 hypervisor is paravirtualization.
The only working setup in section 5.1 used paravirtualization for the L2 hypervisor, and except for one, all nested setups with paravirtualization as the L2 hypervisor worked for an L1 hypervisor based on hardware support. Dynamic binary translation also performed well on top of hardware support when the processor provided the hardware supported MMU. Without the use of EPT or NPT, dynamic binary translation as the L2 hypervisor results in two levels of shadow tables, which does not work very well. The performance results in chapter 6 support the conclusion that paravirtualization is the most suitable choice for the L2 hypervisor: the processor performance is comparable to the other nested setups and the memory performance is comparable to a single layer of virtualization.

Nested hardware support is the great absentee in the whole nesting story. None of the hypervisors provided the hardware extensions for virtualization to their guests, which prevented the installation of an L2 hypervisor based on hardware support within these guests. KVM and Xen are working on nested hardware support. KVM already released nested SVM support, but the implementation is still in its infancy. The development of nested VMX support takes more time because of the differences between AMD SVM and Intel VT-x. Xen is focusing on nested hardware support for VMX, and a proof of concept has been made that can successfully boot a nested virtual machine to an early stage.

For the performance results, we observed that the processor performance degradation was linear for nested virtualization with hardware support for the L1 hypervisor. The memory performance decreased greatly for nested virtualization. The only exception is the use of paravirtualization for the L2 hypervisor: in these experiments, no memory overhead was introduced in the nested setups. The other nested setups suffered from a significant memory overhead and had memory access times that were on average 128 times slower compared to native. The I/O performance results were not accurate, and more work is needed to gain an accurate view of the I/O performance degradation for nested virtualization.

7.2 Future work

One area of future work is testing the nested hardware support when KVM and Xen release their updated versions. The release of nested hardware support allows testing nested setups other than those tested in this thesis and might provide new results. An extra task could be to check whether other virtualization software vendors have started to develop nested hardware support.

Throughout this thesis, we focused on software solutions and hardware support as techniques for a hypervisor. The first generation and second generation hardware support were compared to see what impact the hardware supported MMU had. Lately, hardware vendors have been working on directed I/O for virtualization; more specifically, Intel is working on Intel VT-d and AMD on its IOMMU. Another area of future work would be to investigate whether directed I/O can be useful for nested virtualization and what the performance impact of this new generation of hardware support is.

Bibliography

[1] S. Nanda and T. Chiueh, "A survey on virtualization technologies," tech. rep., Stony Brook University, 2005.

[2] VMware, "Virtualization History." http://www.vmware.com/virtualization/history.html. Last accessed on May 19, 2010.

[3] VMware, "VMware: Virtualization overview." http://www.vmware.com/pdf/virtualization.pdf. Last accessed on May 19, 2010.

[4] S. Adabala, V. Chadha, P. Chawla, R.
Figueiredo, J. Fortes, I. Krsul, A. Matsunaga, M. Tsugawa, J. Zhang, M. Zhao, L. Zhu, and X. Zhu, "From virtualized resources to virtual computing grids: the In-VIGO system," Future Generation Computer Systems, vol. 21, no. 6, pp. 896–909, 2005.

[5] J. E. Smith and R. Nair, Virtual Machines: Versatile Platforms for Systems and Processes. The Morgan Kaufmann Series in Computer Architecture and Design, San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2005.

[6] W. Stallings, Operating Systems: Internals and Design Principles. Prentice Hall, 5th ed., 2004.

[7] J. E. Smith and R. Nair, "The architecture of virtual machines," Computer, vol. 38, pp. 32–38, May 2005.

[8] G. J. Popek and R. P. Goldberg, "Formal requirements for virtualizable third generation architectures," Commun. ACM, vol. 17, no. 7, pp. 412–421, 1974.

[9] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer's Manual, Dec. 2009. http://www.intel.com/products/processor/manuals/. Last accessed on May 19, 2010.

[10] VMware, "Understanding Full Virtualization, Paravirtualization, and Hardware Assist." http://www.vmware.com/files/pdf/VMware_paravirtualization.pdf, Sept. 2007. Last accessed on May 19, 2010.

[11] K. Adams and O. Agesen, "A comparison of software and hardware techniques for x86 virtualization," in ASPLOS-XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, (New York, NY, USA), pp. 2–13, ACM, Oct. 2006.

[12] E. Witchel and M. Rosenblum, "Embra: fast and flexible machine simulation," in SIGMETRICS '96: Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, (New York, NY, USA), pp. 68–79, ACM, 1996.

[13] J. D. Gelas, "Hardware Virtualization: the Nuts and Bolts." http://it.anandtech.com/printarticle.aspx?i=3263, Mar. 2008. Last accessed on May 19, 2010.

[14] A. Vasudevan, R. Yerraballi, and A. Chawla, "A high performance Kernel-Less Operating System architecture," in ACSC '05: Proceedings of the Twenty-eighth Australasian conference on Computer Science, (Darlinghurst, Australia), pp. 287–296, Australian Computer Society, Inc., 2005.

[15] K. Onoue, Y. Oyama, and A. Yonezawa, "Control of system calls from outside of virtual machines," in SAC '08: Proceedings of the 2008 ACM symposium on Applied computing, (New York, NY, USA), pp. 2116–2121, ACM, 2008.

[16] J. Sugerman, G. Venkitachalam, and B.-H. Lim, "Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor," in Proceedings of the General Track: USENIX Annual Technical Conference, (Berkeley, CA, USA), pp. 1–14, USENIX Association, 2001.

[17] J. Fisher-Ogden, "Hardware Support for Efficient Virtualization."

[18] Y. Dong, J. Dai, Z. Huang, H. Guan, K. Tian, and Y. Jiang, "Towards high-quality I/O virtualization," in SYSTOR '09: Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, (New York, NY, USA), pp. 1–8, ACM, 2009.

[19] Advanced Micro Devices, Inc., "AMD-V Nested Paging," tech. rep., Advanced Micro Devices, Inc., July 2008. http://developer.amd.com/assets/NPT-WP-1%201-final-TM.pdf. Last accessed on May 19, 2010.

[20] VMware, "Software and Hardware Techniques for x86 Virtualization." http://www.vmware.com/files/pdf/software_hardware_tech_x86_virt.pdf. Last accessed on May 19, 2010.

[21] A. Whitaker, M. Shaw, and S. D.
Gribble, "Denali: Lightweight Virtual Machines for Distributed and Networked Applications," in Proceedings of the USENIX Annual Technical Conference, 2002.

[22] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," SIGOPS Oper. Syst. Rev., vol. 37, no. 5, pp. 164–177, 2003.

[23] J. R. Santos, Y. Turner, G. Janakiraman, and I. Pratt, "Bridging the gap between software and hardware techniques for I/O virtualization," in ATC '08: USENIX 2008 Annual Technical Conference, (Berkeley, CA, USA), pp. 29–42, USENIX Association, 2008.

[24] VMware, "A Performance Comparison of Hypervisors." http://www.vmware.com/pdf/hypervisor_performance.pdf, Feb. 2007. Last accessed on May 19, 2010.

[25] Intel Corporation, "Intel Virtualization Technology for Directed I/O," tech. rep., Intel Corporation, Sept. 2008. http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf. Last accessed on May 19, 2010.

[26] Intel Corporation, "A superior hardware platform for server virtualization," tech. rep., Intel Corporation, 2009. http://download.intel.com/business/resources/briefs/xeon5500/xeon_5500_virtualization.pdf. Last accessed on May 19, 2010.

[27] Intel Corporation, "Intel Virtualization Technology Specification for the IA-32 Intel Architecture," tech. rep., Intel Corporation, Apr. 2005. http://dforeman.cs.binghamton.edu/~foreman/552pages/Readings/intel05virtualization.pdf. Last accessed on May 19, 2010.

[28] Advanced Micro Devices, Inc., AMD64 Virtualization Codenamed "Pacifica" Technology: Secure Virtual Machine Architecture Reference Manual, May 2005. http://www.mimuw.edu.pl/~vincent/lecture6/sources/amd-pacifica-specification.pdf. Last accessed on May 19, 2010.

[29] G. Neiger, A. Santoni, F. Leung, D. Rodgers, and R. Uhlig, "Intel Virtualization Technology: Hardware Support for Efficient Processor Virtualization," Intel Technology Journal, vol. 10, pp. 167–178, Aug. 2006.

[30] G. Gerzon, "Intel Virtualization Technology: Processor Virtualization Extensions and Intel Trusted Execution Technology." http://software.intel.com/file/1024. Last accessed on May 19, 2010.

[31] VirtualBox, "VirtualBox." http://www.virtualbox.org. Last accessed on May 19, 2010.

[32] VirtualBox, "VirtualBox architecture." http://www.virtualbox.org/wiki/VirtualBox_architecture. Last accessed on May 19, 2010.

[33] VMware, "VMware." http://www.vmware.com. Last accessed on May 19, 2010.

[34] Xen, "Xen." http://www.xen.org/. Last accessed on May 19, 2010.

[35] Xen, "Dom0 - Xen wiki." http://wiki.xensource.com/xenwiki/Dom0. Last accessed on May 19, 2010.

[36] KVM, "KVM." http://www.linux-kvm.org. Last accessed on May 19, 2010.

[37] A. Shah, "Kernel-based virtualization with KVM," Linux Magazine, vol. 86, pp. 37–39, 2008.

[38] F. Bellard, "QEMU, a fast and portable dynamic translator," in ATEC '05: Proceedings of the annual conference on USENIX Annual Technical Conference, (Berkeley, CA, USA), pp. 41–41, USENIX Association, 2005.

[39] I. Habib, "Virtualization with KVM," Linux J., vol. 2008, no. 166, p. 8, 2008.

[40] H. C. Lauer and D. Wyeth, "A recursive virtual machine architecture," in Proceedings of the workshop on virtual computer systems, (New York, NY, USA), pp. 113–116, ACM, 1973.

[41] G. Belpaire and N.-T.
[42] M. Baker and R. Buyya, “Cluster computing at a glance.” Cluster Computing, Chapter 1, 1999.
[43] M. Baker, “Cluster computing white paper,” Tech. Rep. Version 2.0, IEEE Computer Society Task Force on Cluster Computing (TFCC), December 2000.
[44] M. L. Bote-Lorenzo, Y. A. Dimitriadis, and E. Gomez-Sanchez, “Grid characteristics and uses: a grid definition,” in Proceedings of the First European Across Grids Conference (AG’03), vol. 2970 of Lecture Notes in Computer Science, (Heidelberg), pp. 291–298, Springer-Verlag, 2004.
[45] I. Foster, “What is the grid? A three point checklist,” 2002.
[46] I. Foster, Y. Zhao, I. Raicu, and S. Lu, “Cloud computing and grid computing 360-degree compared,” in Proceedings of the Grid Computing Environments Workshop, 2008. GCE ’08, pp. 1–10, Nov. 2008.
[47] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, “Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility,” Future Gener. Comput. Syst., vol. 25, no. 6, pp. 599–616, 2009.
[48] A. Kivity, “Avi Kivity’s blog.” http://avikivity.blogspot.com/. Last accessed on May 19, 2010.
[49] A. Graf and J. Roedel, “Add support for nested SVM (kernel).” http://thread.gmane.org/gmane.comp.emulators.kvm.devel/21119. Last accessed on May 19, 2010.
[50] Xen, “Xen Summit Asia at Intel 2009.” http://www.xen.org/xensummit/xensummit_fall_2009.html. Last accessed on May 19, 2010.

Appendices

APPENDIX A
Virtualization software

A.1 VirtualBox

VirtualBox is a hypervisor that performs full virtualization. It started as proprietary software but currently comes under a Personal Use and Evaluation License (PUEL); the software is free of charge for personal and educational use. VirtualBox was initially created by Innotek. This company was later purchased by Sun Microsystems, which in turn was recently purchased by Oracle Corporation. This section contains information that is almost completely extracted from the VirtualBox website; see [32] for more information. It is presented here to give extra information about the internals of VirtualBox.

The host operating system runs each VirtualBox virtual machine as an application. VirtualBox takes control of a large part of the computer, executing a complete OS with its own guest processes, drivers, and devices inside this virtual machine process. The host OS does not notice much of this, only that an extra process is started; a virtual machine is just another process in the host operating system. This implementation is therefore an example of a hosted hypervisor. Initially, VirtualBox used dynamic binary translation as the implementation approach for its hypervisor. However, with the release of hardware support for virtualization, it also provides Intel VT-x and AMD SVM support.

Upon starting VirtualBox, one extra process gets started: the VirtualBox “service” process VBoxSVC. This service runs in the background to keep track of all the processes involved, i.e. it keeps track of which virtual machines are running and what state they are in. It is automatically started by the first GUI process. The guts of the VirtualBox implementation are hidden in a shared library, VBoxVMM.dll (VBoxVMM.so on Linux). This library contains all the complicated and messy details of virtualizing the x86 architecture. It can be considered as a static “backend”, or black box. Around this backend, many frontends can be written without having to mess with the gory details of x86 virtualization. VirtualBox already comes with several frontends: the Qt GUI, a command-line utility VBoxManage, a “plain” GUI based on SDL, and remote interfaces.
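As an illustration of the command-line frontend (this example is not taken from the VirtualBox documentation; the VM name, memory size and the choice to disable hardware virtualization are assumptions, and disk and network setup are omitted), a guest could be created and started without a GUI, which is convenient for scripted test setups:

    # Minimal sketch: register a VM, assign memory, turn hardware
    # virtualization (VT-x/AMD-V) off, and start the VM headlessly.
    VBoxManage createvm --name "l1-guest" --register
    VBoxManage modifyvm "l1-guest" --memory 512 --hwvirtex off
    VBoxManage startvm "l1-guest" --type headless

Setting --hwvirtex off forces software virtualization, comparable to the dynamic binary translation configurations used in appendix B; with --hwvirtex on, the hardware extensions are used instead.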
The host operating system does not need much tweaking to support virtualization. A ring 0 driver is loaded in the host operating system for VirtualBox to work, but this driver does not interfere with the scheduling or process management of the host operating system. The entire guest OS, including its own hundreds of processes, is only scheduled when the host OS gives the VM process a timeslice. The ring 0 driver only performs a few specific tasks: allocating physical memory for the VM, saving and restoring CPU registers and descriptor tables when a host interrupt occurs while guest ring-3 code is executing (e.g. when the host OS wants to reschedule), switching from host ring 3 to guest context, and enabling or disabling support for VT-x and similar extensions.

When running a virtual machine, the computer can be in one of several states, from the processor’s point of view:

1. The CPU can be executing host ring 3 code (e.g. from other host processes), or host ring 0 code, just as it would if VirtualBox were not running.

2. The CPU can be emulating guest code (within the ring 3 host VM process). Basically, VirtualBox tries to run as much guest code natively as possible, but it can (slowly) emulate guest code as a fallback when it is not sure what the guest system is doing, or when the performance penalty of emulation is not too high. The VirtualBox emulator is based on QEMU and typically steps in when:
   • guest code disables interrupts and VirtualBox cannot figure out when they will be switched back on (in these situations, VirtualBox actually analyzes the guest code using its own disassembler);
   • certain single instructions need to be emulated; this typically happens when a nasty guest instruction such as LIDT has caused a trap;
   • any real mode code runs (e.g. BIOS code, a DOS guest, or any operating system startup).

3. The CPU can be running guest ring 3 code natively (within the ring 3 host VM process). VirtualBox calls this “raw ring 3”. This is, of course, the most efficient way to run the guest, and ideally the guest leaves this mode as rarely as possible: the more often it does, the slower the VM is compared to a native OS, because all context switches are very expensive.

4. The CPU can be running guest ring 0 code natively. Here is where things get tricky: the guest only thinks it is running ring 0 code, but VirtualBox has fooled the guest OS into entering ring 1 instead (which is normally unused by x86 operating systems). The guest operating system is thus manipulated to actually execute its ring 0 code in ring 1. This causes a lot of additional instruction faults, as ring 1 is not allowed to execute any privileged instructions. With each of these faults, the hypervisor must step in and emulate the code to achieve the desired behavior. While this normally works perfectly well, the resulting performance would be very poor since CPU faults tend to be very expensive and there will be thousands and thousands of them per second. To make things worse, running ring 0 code in ring 1 causes some nasty occasional compatibility problems.
Because of design flaws in the x86 architecture that were never addressed, some system instructions that should cause faults when called in ring 1 unfortunately do not; instead, they just behave differently. It is therefore imperative that these instructions be found and replaced.

To address these two issues, VirtualBox has come up with a set of techniques that it calls the “Patch Manager” (PATM) and the “Code Scanning and Analysis Manager” (CSAM). Before executing ring 0 code, the code is scanned recursively to discover problematic instructions. In-place patching is then performed, i.e. the instruction is replaced with a jump to hypervisor memory where an integrated code generator has placed a more suitable implementation. In reality, this is a very complex task as there are lots of odd situations to be discovered and handled correctly. With its current complexity, one could argue that PATM is an advanced in-situ recompiler. In addition, every time a fault occurs, the fault’s cause is analyzed to determine whether it is possible to patch the offending code to prevent it from causing more expensive faults in the future. This approach turns out to work very well, and it can reduce the faults caused by the virtualization to a rate that performs much better than a typical recompiler, or even VT-x technology, for that matter.

APPENDIX B
Details of the nested virtualization in practice

This chapter gives some detailed information about error messages and warnings that occurred during the tests in chapter 5. For each setup, information about the operating system and the hypervisor version is given. The setups are combined into sections so that each section contains the setups that have the same bottom-layer hypervisor technique. Note that the setups with a nested hypervisor based on hardware support for virtualization on x86 architectures are left out because the nested hypervisor could not be installed.

B.1 Dynamic binary translation

B.1.1 VirtualBox

VirtualBox within VirtualBox
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit

The L1 guest booted and ran correctly using dynamic binary translation. The image of the L2 guest was copied to the L1 guest. The L2 guest tried to start but did not show any sign of activity. The guest showed the following output:

Boot from (hd0,0) ext3
Starting up ...
<HDD-ID>

This screen remained for several hours without aborting or continuing.

VMware Workstation within VirtualBox
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 6.5.3 build-185404
L2 guest: Ubuntu 9.04 32bit

The L1 guest booted and ran correctly using dynamic binary translation. The image of the L2 guest was copied to the L1 guest. The L2 guest attempted to start but the L1 guest crashed and showed the following error:

A critical error has occurred while running the virtual machine and the machine execution has been stopped.
For help, please see the Community section on http://www.virtualbox.org or your support contract. Please provide the contents of the log file VBox.log and the image file VBox.png, which you can find in the /home/olivier/.VirtualBox/Machines/vbox-vmware/Logs directory, as well as a description of what you were doing when this error happened. Note that you can also access the above files by selecting Show Log from the Machine menu of the main VirtualBox window.
Press OK if you want to power off the machine or press Ignore if you want to leave it as is for debugging. Please note that debugging requires special knowledge and tools, so it is recommended to press OK now.
The log of the L1 guest showed:

PATM: patmR3RefreshPatch: succeeded to refresh patch at c0152610
PATM: patmR3RefreshPatch: succeeded to refresh patch at c013f250
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0502650
PATM: Disabling IDT ef patch handler c01052f0
PATM: patmR3RefreshPatch: succeeded to refresh patch at c01052f0
PIIX3 ATA: Ctl#0: RESET, DevSel=0 AIOIf=0 CmdIf0=0x20 (-1 usec ago) CmdIf1=0x00 (-1 usec ago)
PIIX3 ATA: Ctl#0: finished processing RESET
PATM: patmR3RefreshPatch: succeeded to refresh patch at c015b7a0
PIIX3 ATA: Ctl#1: RESET, DevSel=0 AIOIf=0 CmdIf0=0x00 (-1 usec ago) CmdIf1=0x00 (-1 usec ago)
PIIX3 ATA: Ctl#1: finished processing RESET
PATM: Disable block at c0745963 - write c07459b5-c07459b9
PATM: Disable block at c0745bab - write c0745c0f-c0745c13
PATM: Disable block at c0746d22 - write c0746d8f-c0746d93
PATM: Disable block at c0763a90 - write c0763aed-c0763af1
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0184590
PCNet#0: Init: ss32=1 GCRDRA=0x36596000[32] GCTDRA=0x36597000[16]
fatal error in recompiler cpu: triple fault

Xen within VirtualBox
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L2 hypervisor: xen-3.0-x86_32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475

The L1 guest booted and crashed almost immediately. The L1 guest showed the following error:

A critical error has occurred while running the virtual machine and the machine execution has been stopped.
For help, please see the Community section on http://www.virtualbox.org or your support contract. Please provide the contents of the log file VBox.log and the image file VBox.png, which you can find in the /home/olivier/.VirtualBox/Machines/vbox-vmware/Logs directory, as well as a description of what you were doing when this error happened. Note that you can also access the above files by selecting Show Log from the Machine menu of the main VirtualBox window.
Press OK if you want to power off the machine or press Ignore if you want to leave it as is for debugging. Please note that debugging requires special knowledge and tools, so it is recommended to press OK now.
The log output was similar to the “VirtualBox within VMware Workstation” output (see subsection B.1.2).

B.1.2 VMware Workstation

VirtualBox within VMware Workstation
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VMware Workstation 7.0.1 build-227600
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit

The L1 guest booted and ran correctly using dynamic binary translation. The image of the L2 guest was copied to the L1 guest. When starting the L2 guest, the bootloader showed

Boot from (hd0,0) ext3
Starting up ...
<HDD-ID>

and afterwards VirtualBox aborted the start and showed the following message:

PATM: patmR3RefreshPatch: succeeded to refresh patch at c0500670
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0500560
PATM: Failed to refresh dirty patch at c013f370. Disabling it.
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0154bc0
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0109d00
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0122d50
PATM: patmR3RefreshPatch: succeeded to refresh patch at c05006d0
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0109c80
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0180ac0
PATM: patmR3RefreshPatch: succeeded to refresh patch at c015a040
PATM: patmR3RefreshPatch: succeeded to refresh patch at c013f980
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0154880
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0154a00
PATM: patmR3RefreshPatch: succeeded to refresh patch at c01323c0
PATM: patmR3RefreshPatch: succeeded to refresh patch at c0151600
PATM: patmR3RefreshPatch: succeeded to refresh patch at c017ed00
PATM: Failed to refresh dirty patch at c013f370. Disabling it.
PATM: patmR3RefreshPatch: succeeded to refresh patch at c01051d0
PIT: mode=0 count=0x10000 (65536) - 18.20 Hz (ch=0)

!!Assertion Failed!!
Expression: pOrgInstrGC
Location: /home/vbox/vbox-3.1.6/src/VBox/VMM/PATM/VMMAll/PATMAll.cpp(159) void PATMRawLeave(VM*, CPUMCTXCORE*, int)

VMware Workstation within VMware Workstation
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VMware Workstation 7.0.1 build-227600
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 6.5.3 build-185404
L2 guest: Ubuntu 9.04 32bit

The L1 guest booted and ran correctly using dynamic binary translation. The image of the L2 guest was copied to the L1 guest. The L2 guest tried to start but did not show any activity. The screen stayed black and nothing happened, even after several hours.

Xen within VMware Workstation
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VMware Workstation 7.0.1 build-227600
L2 hypervisor: xen-3.0-x86_32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475

The L1 guest booted and ran the Xen hypervisor correctly using dynamic binary translation. The image of the L2 guest was copied to the L1 guest.
The paravirtualized L2 guest booted rather slowly, but a user could log in and use the nested guest.

B.2 Paravirtualization

None of the setups with paravirtualization as the bottom layer worked. The configurations of the setups are given but, as explained in section 5.1.2, the problem is that nested hypervisors need modifications.

VirtualBox within Xen
L1 hypervisor: xen-3.0-x86_32p
Domain 0 (L1): openSUSE 11.3 build 0475
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338

VMware Workstation within Xen
L1 hypervisor: xen-3.0-x86_32p
Domain 0 (L1): openSUSE 11.3 build 0475
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 6.5.3 build-185404

Xen within Xen
L1 hypervisor: xen-3.0-x86_32p
Domain 0 (L1): openSUSE 11.3 build 0475
L2 hypervisor: xen-3.0-x86_32p

B.3 First generation hardware support

In this section, the results are given of the setups with a L1 hypervisor based on hardware support for x86 virtualization. All tests were conducted on a processor that only has first generation hardware support, namely an Intel® Core™ 2 Quad Q9550. The following subsections combine the setups that use the same nested hypervisor technique.

B.3.1 Dynamic binary translation

VirtualBox within VirtualBox
Host operating system: Ubuntu 9.10 64bit or Ubuntu 9.04 32bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit

The L2 guest hung in this setup. It did not crash and did not write anything to the log for several hours; it was probably just very slow.

VMware Workstation within VirtualBox
Host operating system: Ubuntu 9.10 64bit or Ubuntu 9.04 32bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 6.5.3 build-185404
L2 guest: Ubuntu 9.04 32bit

This setup was rather unstable. The setup was tested on Ubuntu 9.10 64bit and Ubuntu 9.04 32bit host operating systems, each with a graphical user interface and with a text-based environment. Only one configuration allowed booting the nested guest: the test on the Ubuntu 9.04 32bit operating system that used the graphical user interface for the L1 and L2 guest.

VirtualBox within VMware Workstation
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VMware Workstation 6.5.3 build-185404
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit

Both L1 and L2 guest were able to boot and run correctly. However, in order to get the L2 guest started, one must append the line

monitor_control.restrict_backdoor = "TRUE"

to the configuration file (.vmx) of the L1 guest. This makes it possible to run a hypervisor within the L1 guest. Upon starting the L2 guest, the L1 guest displayed the following warning:

The virtual machine's operating system has attempted to enable promiscuous mode on adapter Ethernet0. This is not allowed for security reasons. Please go to the Web page http://vmware.com/info?id=161 for help enabling the promiscuous mode in the virtual machine.

One can get around this message by starting VirtualBox as root instead of as a normal user.
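For completeness, a hedged sketch of how this setting could be added from the host shell before powering on the L1 guest; the path to the .vmx file is an assumption and will differ per installation:

    # Append the backdoor restriction to the (hypothetical) L1 guest
    # configuration and verify it; VMware Workstation reads the .vmx at power-on.
    echo 'monitor_control.restrict_backdoor = "TRUE"' >> ~/vmware/ubuntu904-l1/ubuntu904-l1.vmx
    grep restrict_backdoor ~/vmware/ubuntu904-l1/ubuntu904-l1.vmx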
VMware Workstation within VMware Workstation
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VMware Workstation 6.5.3 build-185404
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 6.5.3 build-185404
L2 guest: Ubuntu 9.04 32bit

Both L1 and L2 guest were able to boot and run correctly. However, in order to get the L2 guest started, one must append the line

monitor_control.restrict_backdoor = "TRUE"

to the configuration file (.vmx) of the L1 guest. This allowed running a hypervisor within the L1 guest. Upon starting the L2 guest, the L1 guest displayed the following warning:

The virtual machine's operating system has attempted to enable promiscuous mode on adapter Ethernet0. This is not allowed for security reasons. Please go to the Web page http://vmware.com/info?id=161 for help enabling the promiscuous mode in the virtual machine.

One can get around this message by starting the L2 VMware Workstation as root instead of as a normal user.

VirtualBox within Xen
L1 hypervisor: xen-3.0-x86_32p
Domain 0 (L1): openSUSE 11.3 build 0475
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit

The L2 guest hung while booting inside the Xen guest. It was probably just very slow, since it stayed unresponsive for several hours.

VMware Workstation within Xen
L1 hypervisor: xen-3.0-x86_32p
Domain 0 (L1): openSUSE 11.3 build 0475
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 7.0.1 build-227600
L2 guest: Ubuntu 9.04 32bit

The L2 guest did not start because VMware Workstation checks whether there is an underlying hypervisor. VMware Workstation noticed that there was a Xen hypervisor running and displayed the following message:

You are running VMware Workstation via an incompatible hypervisor. You may not power on a virtual machine until this hypervisor is disabled.

VirtualBox within KVM
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: kvm-kmod 2.6.32-9
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit

The L1 guest booted and ran correctly. Upon starting the L2 guest, the following error was shown:

Boot from (hd0,0) ext3
Starting up ...
<HDD-ID>
...
[    3.982884] Kernel panic - not syncing: Attempted to kill init!

VMware Workstation within KVM
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: kvm-kmod 2.6.32-9
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VMware Workstation 6.5.3 build-185404
L2 guest: Ubuntu 9.04 32bit

The setup worked for newer versions of KVM; in older versions and in the developer version (kvm-88), the L2 guest hung during start-up. With the newer version, the L2 guest booted and ran successfully. Note that newer versions of VMware Workstation check whether there is an underlying hypervisor and in that case refuse to boot a virtual machine. With the newest version, VMware Workstation 7.0.1 build-227600, this setup no longer worked due to this check.

B.3.2 Paravirtualization

All four setups could successfully nest a paravirtualized guest inside the L1 guest. However, the setup where Xen is nested inside VirtualBox was not very stable.
Xen within VirtualBox
Host operating system: Ubuntu 9.10 64bit or Ubuntu 9.04 32bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L2 hypervisor: xen-3.0-x86_32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475

Sometimes several segmentation faults occurred during the start-up of domain 0. Domain 0 was able to boot and run successfully, but the creation of another paravirtualized guest was sometimes impossible: Xen reported that the guest was created, but it did not show up in the list of virtual machines, indicating that the guest crashed immediately. The setup worked most of the time on the Ubuntu 9.04 32bit host operating system. On the Ubuntu 9.10 64bit operating system, there were always segmentation faults.

Xen within VMware Workstation
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: VMware Workstation 7.0.1 build-227600
L2 hypervisor: xen-3.0-x86_32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475

Both L1 and L2 guest were able to boot and run correctly.

Xen within Xen
L1 hypervisor: xen-3.0-x86_32p
Domain 0 (L1): openSUSE 11.3 build 0475
L2 hypervisor: xen-3.0-x86_32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475

Both L1 and L2 guest were able to boot and run correctly.

Xen within KVM
Host operating system: Ubuntu 9.10 64bit or Ubuntu 9.04 32bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L2 hypervisor: xen-3.0-x86_32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475

Both L1 and L2 guest were able to boot and run correctly.

B.4 Second generation hardware support

This section summarizes the results of the setups with a L1 hypervisor based on hardware support for x86 virtualization. The tests were conducted on a newer processor with second generation hardware support, namely an Intel® Core™ i7-860. Only the setups where the outcome differed from section B.3 are given; these are the setups that worked using the second generation hardware support and did not work using the first generation. All the other setups had the same output.

B.4.1 Dynamic binary translation

VirtualBox within VirtualBox
Host operating system: Ubuntu 9.10 64bit or Ubuntu 9.04 32bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit

Both L1 and L2 guest were able to boot and run correctly. In the test result of section B.3, the L2 guest hung.

VirtualBox within Xen
L1 hypervisor: xen-3.0-x86_32p
Domain 0 (L1): openSUSE 11.3 build 0475
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit

Both L1 and L2 guest were able to boot and run correctly. With only the use of first generation hardware support (section B.3), the L2 guest hung.

VirtualBox within KVM
Host operating system: Ubuntu 9.10 64bit
L1 hypervisor: kvm-kmod 2.6.32-9
L1 guest: Ubuntu 9.04 32bit
L2 hypervisor: VirtualBox 3.1.6 r59338
L2 guest: Ubuntu 9.04 32bit

Both L1 and L2 guest were able to boot and run correctly. The L2 guest showed a kernel panic message when only using first generation hardware support, as shown in section B.3.

B.4.2 Paravirtualization

Xen within VirtualBox
Host operating system: Ubuntu 9.10 64bit or Ubuntu 9.04 32bit
L1 hypervisor: VirtualBox 3.1.6 r59338
L2 hypervisor: xen-3.0-x86_32p
Domain 0 (L2): openSUSE 11.3 build 0475
L2 guest: openSUSE 11.3 build 0475

Both L1 and L2 guest were able to boot and run correctly.
In the test result of section B.3, domain 0 displayed some segmentation faults.

B.5 KVM’s nested SVM support

Host operating system: Ubuntu 9.04 server 64bit
L1 hypervisor: kvm-kmod 2.6.33
L1 guest: Ubuntu 9.04
L2 hypervisor: kvm-kmod 2.6.33
L2 guest: Ubuntu 9.04

After installing the L1 hypervisor, it must be loaded with the argument “nested=1”. The L1 guest booted and ran perfectly. The installation of the L2 hypervisor within the L1 guest was successful; no special actions were required for installing it. When booting the L2 hypervisor, the L1 guest showed the following messages:

[   16.712047] handle_exit: unexpected exit_ini_info 0x80000008 exit_code 0x60
[   31.432032] handle_exit: unexpected exit_ini_info 0x80000008 exit_code 0x60
[   34.468058] handle_exit: unexpected exit_ini_info 0x80000008 exit_code 0x60

Patches fix these messages but were not yet released because they need more testing¹.

¹ http://www.mail-archive.com/[email protected]/msg31096.html
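As an illustration of the loading step mentioned above, the nested flag is passed when the SVM module is (re)loaded. A minimal sketch, assuming an AMD host where KVM is built as the kvm-amd module (module and parameter names follow mainline KVM; the exact invocation may differ for a kvm-kmod build):

    # Reload the AMD SVM module with nested virtualization enabled
    # and check that the parameter took effect (1 = enabled).
    sudo modprobe -r kvm_amd
    sudo modprobe kvm_amd nested=1
    cat /sys/module/kvm_amd/parameters/nested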
APPENDIX C
Details of the performance tests

This chapter gives some detailed information about the performance tests that were executed in chapter 6. The benchmarks used for the tests are sysbench, iperf and iozone. Each section lists the tests that were executed for the corresponding benchmark.

C.1 sysbench

#!/bin/bash
#
# The sysbench tests
#
# @author Olivier Berghmans

if [ $# -ne 1 ]; then
    echo "Usage: $0 <prefix>"
    echo "with <prefix> the prefix for the output files"
    exit
fi

PREFIX=$1
OUTPUT="###"

date
echo ${OUTPUT} "sysbench CPU Test"
sysbench --num-threads=10 --max-requests=10000 --test=cpu --cpu-max-prime=150000 run > ${PREFIX}_cpu.txt
date
echo ${OUTPUT} "sysbench Memory Test Write"
sysbench --num-threads=10 --test=memory --memory-block-size=256 --memory-total-size=2G --memory-scope=local --memory-hugetlb=off --memory-oper=write --memory-access-mode=seq run > ${PREFIX}_mem_write_seq.txt
sysbench --num-threads=10 --test=memory --memory-block-size=256 --memory-total-size=2G --memory-scope=local --memory-hugetlb=off --memory-oper=write --memory-access-mode=rnd run > ${PREFIX}_mem_write_rnd.txt
date
echo ${OUTPUT} "sysbench Memory Test Read"
sysbench --num-threads=10 --test=memory --memory-block-size=256 --memory-total-size=2G --memory-scope=local --memory-hugetlb=off --memory-oper=read --memory-access-mode=seq run > ${PREFIX}_mem_read_seq.txt
sysbench --num-threads=10 --test=memory --memory-block-size=256 --memory-total-size=2G --memory-scope=local --memory-hugetlb=off --memory-oper=read --memory-access-mode=rnd run > ${PREFIX}_mem_read_rnd.txt
date
echo ${OUTPUT} "sysbench Thread Test"
sysbench --num-threads=1000 --max-requests=10000 --test=threads --thread-yields=10000 --thread-locks=4 run > ${PREFIX}_threads.txt
date
echo ${OUTPUT} "sysbench Mutex Test"
sysbench --num-threads=5000 --max-requests=10000 --test=mutex --mutex-num=4096 --mutex-locks=500000 --mutex-loops=10000 run > ${PREFIX}_mutex.txt
date
echo ${OUTPUT} "sysbench File io Test"
sysbench --num-threads=16 --test=fileio --file-num=256 --file-block-size=16K --file-total-size=512M --file-test-mode=rndrw --file-io-mode=sync prepare
sysbench --num-threads=16 --test=fileio --file-num=256 --file-block-size=16K --file-total-size=512M --file-test-mode=rndrw --file-io-mode=sync run > ${PREFIX}_fileio.txt
sysbench --num-threads=16 --test=fileio --file-num=256 --file-block-size=16K --file-total-size=512M --file-test-mode=rndrw --file-io-mode=sync cleanup
date
echo ${OUTPUT} "sysbench MySQL Test"
sysbench --num-threads=4 --max-requests=10000 --test=oltp --oltp-table-size=1000000 --mysql-table-engine=innodb --mysql-user=root --mysql-password=root --oltp-test-mode=complex prepare
sysbench --num-threads=4 --max-requests=10000 --test=oltp --oltp-table-size=1000000 --mysql-table-engine=innodb --mysql-user=root --mysql-password=root --oltp-test-mode=complex run > ${PREFIX}_oltp.txt
sysbench --num-threads=4 --max-requests=10000 --test=oltp --oltp-table-size=1000000 --mysql-table-engine=innodb --mysql-user=root --mysql-password=root --oltp-test-mode=complex cleanup
date
echo ${OUTPUT} "packing Results"
tar czf ${PREFIX}.tgz ${PREFIX}*.txt
rm -f ${PREFIX}*.txt

C.2 iperf

This benchmark consists of a server and a client. The server runs on a separate computer in the network with the command:

iperf -s

The test machines connect with the server by running the command:

iperf -c <hostname>

C.3 iozone

The iozone benchmark tests the performance of writing and reading a file. The commands used for running the benchmark natively, in a L1 guest and in a nested guest are the following:

native$  iozone -a -g 16G -i 0 -i 1
L1guest$ iozone -a -g 4G -i 0 -i 1
L2guest$ iozone -a -g 2G -i 0 -i 1

The “-g” option specifies the maximum file size used in the tests. The test on the native platform uses 16 GB since the physical memory of the computer system was 8 GB. The physical memory of the L1 guest was 2 GB and the physical memory of the L2 guest was 512 MB.
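To show how these benchmarks fit together, the following is a hedged sketch of a single run on one test level; the script file name, result prefix and iperf server address are assumptions for illustration only:

    # Run the sysbench suite with a result prefix, then the network and
    # file system benchmarks for the same level (here: the native platform).
    ./sysbench_tests.sh native              # produces native.tgz with all sysbench results
    iperf -c 192.168.1.100 > native_iperf.txt
    iozone -a -g 16G -i 0 -i 1 > native_iozone.txt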