Virtualization Mechanisms for Mobility, Security and System Administration

Shaya Potter

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
2010

© 2010 Shaya Potter
All Rights Reserved

ABSTRACT

Virtualization Mechanisms for Mobility, Security and System Administration
Shaya Potter

This dissertation demonstrates that operating system virtualization is an effective method for solving many different types of computing problems. We have designed novel systems that make use of commodity software while solving problems that were not conceived when the software was originally written. We show that by leveraging and extending existing virtualization techniques, and introducing new ones, we can build these novel systems without requiring the applications or operating systems to be rewritten.

We introduce six architectures that leverage operating system virtualization. *Pod creates fully secure virtual environments and improves user mobility. AutoPod reduces the downtime needed to apply kernel patches and perform system maintenance. PeaPod creates least-privilege systems by introducing the pea abstraction. Strata improves the ability of administrators to manage large numbers of machines by introducing the Virtual Layered File System. Apiary builds upon Strata to create a new form of desktop security by using isolated persistent and ephemeral application containers. Finally, ISE-T applies the two-person control model to system administration.

By leveraging operating system virtualization, we have built these architectures on Linux without requiring any changes to the underlying kernel or user-space applications. Our results, with real applications, demonstrate that operating system virtualization has minimal overhead. These architectures solve problems with minimal impact on end-users while providing functionality that would previously have required modifications to the underlying system.

Contents

List of Figures
List of Tables
Acknowledgments
1 Introduction
1.1 OS Virtualization Security and User Mobility
1.2 Mobility to Improve Administration
1.3 Isolating Cooperating Processes
1.4 Managing Large Numbers of Machines
1.5 A Desktop of Isolated Applications
1.6 Two-Person Control Administration
1.7 Technical Contributions
2 Overview of Operating System Virtualization
2.1 Operating System Kernel Virtualization
2.2 File System Virtualization
2.3 Related Work
3 *Pod: Improving User Mobility
3.1 *Pod Architecture
3.1.1 Secure Operating System Virtualization
3.2 Using a *Pod Device
3.3 Experimental Results
3.4 Related Work
4 AutoPod: Reducing Downtime for System Maintenance
4.1 AutoPod Architecture
4.2 Migration Across Different Kernels
4.3 Autonomic System Status Service
4.4 AutoPod Examples
4.4.1 System Services
4.4.2 Desktop Computing
4.4.3 Setting Up and Using AutoPod
4.5 Experimental Results
4.6 Related Work
5 PeaPod: Isolating Cooperating Processes
5.1 PeaPod Model
5.2 PeaPod Virtualization
5.2.1 Pea Virtualization
5.2.2 Pea Configuration Rules
5.2.2.1 File System
5.2.2.2 Transition Rules
5.2.2.3 Networking Rules
5.2.2.4 Shared Namespace Rules
5.2.2.5 Managing Rules
5.3 Security Analysis
5.4 Usage Examples
5.4.1 Email Delivery
5.4.2 Web Content Delivery
5.4.3 Desktop Computing
5.5 Experimental Results
5.6 Related Work
6 Strata: Managing Large Numbers of Machines
6.1 Strata Basics
6.2 Strata Usage Model
6.2.1 Creating Layers and Repositories
6.2.2 Creating Appliance Templates
6.2.3 Provisioning and Running Appliance Instances
6.2.4 Updating Appliances
6.2.5 Improving Security
6.3 Virtual Layered File System
6.3.1 Layers
6.3.2 Dependencies
6.3.2.1 Dependency Example
6.3.2.2 Resolving Dependencies
6.3.3 Layer Creation
6.3.4 Layer Repositories
6.3.5 VLFS Composition
6.4 Improving Appliance Security
6.5 Experimental Results
6.5.1 Reducing Provisioning Times
6.5.2 Reducing Update Times
6.5.3 Reducing Storage Costs
6.5.4 Virtualization Overhead
6.6 Related Work
7 Apiary: A Desktop of Isolated Applications
7.1 Apiary Usage Model
7.2 Apiary Architecture
7.2.1 Process Container
7.2.2 Display
7.2.3 File System
7.2.4 Inter-Application Integration
7.3 Experimental Results
7.3.1 Handling Exploits
7.3.1.1 Malicious Files
7.3.1.2 Malicious Plugins
7.3.2 Usage Study
7.3.3 Performance Measurements
7.3.3.1 Application Performance
7.3.3.2 Container Creation
7.3.4 File System Efficiency
7.3.5 File System Virtualization Overhead
7.4 Related Work
8 ISE-T: Two-Person Control Administration
8.1 Usage Model
8.2 ISE-T Architecture
8.2.1 Isolation Containers
8.2.2 ISE-T's File System
8.2.3 ISE-T System Service
8.3 ISE-T for Auditing
8.4 Experimental Results
8.4.1 Software Installation
8.4.2 System Services
8.4.3 Configuration Changes
8.4.4 Exploit
8.5 Related Work
9 Conclusions and Future Work
Bibliography
A Restricted System Calls
A.1 Host-Only System Calls
A.2 Root-Squashed System Calls
A.3 Option-Checked System Calls
A.4 Per-Virtual-Environment System Calls

List of Figures

3.1 *Pod Virtualization Overhead
3.2 *Pod Checkpoint/Restart vs. Normal Startup Latency
4.1 AutoPod Model
5.1 PeaPod Model
5.2 Example of Read/Write Rules
5.3 Protecting a Device
5.4 Directory-Default Rule
5.5 Transition Rules
5.6 Networking Rules
5.7 Namespace Access Rules
5.8 Compiler Rules
5.9 Set of Multiple Rule Files
5.10 Email Delivery Configuration
5.11 Web Delivery Rules
5.12 Desktop Application Rules
5.13 PeaPod Virtualization Overhead
6.1 How Layers, Repositories, and VLFSs Fit Together
6.2 Layer Definition for MySQL Server
6.3 Layer Definition for Provisioned Appliance
6.4 Metadata for MySQL Server Layer
6.5 Metadata Specification
6.6 Storage Overhead
6.7 Postmark Overhead in Multiple VAs
6.8 Kernel Build Overhead in Multiple VAs
6.9 Apache Overhead in Multiple VAs
7.1 Apiary Screenshot
7.2 Usage Study Task Times
7.3 Application Performance with 25 Containers
7.4 Application Startup Time
7.5 Postmark Overhead in Apiary
7.6 Kernel Build Overhead in Apiary
8.1 ISE-T Usage Model

List of Tables

3.1 Per-Device *Pod File System Sizes
3.2 Benchmark Descriptions
3.3 *Pod Checkpoint Sizes
4.1 Application Scenarios
4.2 AutoPod Migration Costs
5.1 Application Benchmarks
6.1 VA Provisioning Times
6.2 VA Update Times
6.3 Layer Repository vs. Static VAs
7.1 Application Benchmarks
7.2 File System Instantiating Times
7.3 Apiary's VLFS Layer Storage Breakdown
7.4 Comparing Apiary's Storage Requirements Against a Regular Desktop
7.5 Update Times for Apiary's VLFSs
8.1 ISE-T Commands
8.2 Administration Tasks

Acknowledgments

My deepest thanks go to my advisor, Jason Nieh, for his continual support and guidance. His constant questioning, demands for explanations, and objective evaluation have helped develop ideas that I would not have been able to reach on my own, while also teaching me skills that I hope remain with me.
I am constantly amazed by how many different studies, projects and papers he is able to juggle while retaining the ability to ask insightful questions. He has provided the model to which I aspire.

There are many people at Columbia who have been a significant part of my graduate experience. My officemates, Dinesh Subhraveti, Dan Phung and Dana Glasner, have been good friends, acted as sounding boards, provided valuable feedback, and, in general, made the graduate experience an enjoyable one. I've worked on many projects together with Ricardo Baratto and Oren Laadan and I am always amazed by their abilities. Stelios Sidiroglou-Douskos, Mike Locasto, Carlo Pérez and Gong Su provided valuable feedback and friendship. I'd also like to thank Angelos Keromytis and Steven M. Bellovin for providing help and guidance in my research. In addition, I'd like to thank Erez Zadok, Gail Kaiser and Chandra Narayanaswami for serving on my Ph.D. committee. I'd be remiss if I did not thank the administrative staff in the Computer Science Department, including Alice Cueba, Twinkle Edwards, Elias Tesfaye and Susan Tritto, for handling many tasks that enabled me to focus on my research.

Finally, I'd like to thank my parents, whose constant support and belief in me has enabled all my accomplishments.

Dedicated in memory of my grandmothers,
יוכבד בת צבי הירש לייב and אלתע מאשא בת חיים יצחק
They were proud of all my accomplishments and were always looking forward to the day when my Ph.D. would be complete. Their memory will be with me always.

Chapter 1
Introduction

Computer use is more widespread today than it was even 10 years ago, but we are still using software designs from 20 or 30 years ago. Although these designs are well tested and understood, they were created to solve the problems of that time. Today's users face difficulties that the original software designers did not imagine. We can redesign the operating system and applications to attempt to address these problems, but this creates new, relatively untested software and designs and may force users and administrators to learn fundamentally new models of usage.

This dissertation demonstrates that many problems can be solved not by redesigning and rewriting the applications, but instead by virtualizing the interfaces through which existing applications interact with the operating system. Virtualization is the creation of a layer of indirection between two entities that previously communicated directly. For example, in hardware virtualization [28, 34, 142, 147], a virtual machine monitor (VMM) places a layer of indirection between an operating system and the underlying hardware. A VMM provides a complete virtualized hardware platform for an operating system, enabling any operating system supporting that platform to run as though on physical hardware. Hardware virtualization has been shown to enable operating systems to take advantage of hardware for which they were not designed. The Disco project [34] demonstrated how to run an operating system not designed for ccNUMA architectures on those architectures by using a VMM.

Operating systems can also be virtualized in multiple ways, most commonly by providing each process with its own virtualized and protected memory mappings.
Instead of letting a process directly access the machine's memory, the operating system, with hardware support, places a layer of indirection between the processes and physical memory, creating a virtualized mapping between the process's memory space and the physical machine's memory space. This provides security, efficiency and flexibility. The processes' memories are isolated from one another, but memory can still be shared among processes.

Memory, however, is not the only operating system interface that can be virtualized. Zap [100] and FiST [152] demonstrated that an operating system's kernel state and file systems can be virtualized as well. Kernel virtualization operates by virtualizing the system call interface, that is, by placing a layer of indirection between processes and the system calls they use to access the operating system kernel's functionality and ephemeral state. Similarly, file system virtualization works by placing a layer of indirection between processes and the underlying physical file systems, the operating system's persistent state. Instead of accessing the machine's kernel and file system directly using built-in system call and file system functions, the application running in the virtualized operating system executes a function within the virtualization layer. The virtualization layer can modify the parameters passed to it, perform work required by the desired virtualization, call built-in kernel and file system functions to perform the desired real work, and modify the return value passed to the calling process.

This dissertation demonstrates that by leveraging different forms of operating system virtualization, we can use commodity operating systems and software in novel ways and solve problems that the original developers could not have anticipated. By virtualizing the interfaces, we do not change the applications or operating system, but instead create specialized environments that enable us to solve problems. Although virtualized environments, from the perspective of processes, look and behave like the system they are virtualizing, they can look and behave very differently to the systems on which they are hosted. This decoupling of execution environment and host environment lets us create tools that run on the host and solve new problems without modifying well-tested operating system and application code. For example, we can create virtual private namespaces for applications distinct from the namespace of the physical computer. To the processes running within the virtualized environment, the environment looks like a regular machine, provides the same application interface, and does not require applications to be rewritten. Similarly, because operating system virtualization only interposes itself between the application and the underlying operating system kernel, the underlying kernel's binary and source code do not have to be modified either.

1.1 OS Virtualization Security and User Mobility

Some forms of operating system virtualization [85, 100] are limited to isolating a single user's processes and are not designed to provide any security constraints. This is especially noticeable for processes that run with elevated privileges, such as those provided to root on Unix systems. Without secure virtualization, operating system virtualization can only solve single-user problems, substantially limiting its use.
To enable secure virtualization, we have enabled each virtualized environment to have a unique set of virtualized users. Virtualizing the set of users gives each environment an isolated set of privileges. However, unlike hardware virtualization, where each virtual machine has a full operating system instance and therefore its own isolated privileged state, an operating system generally has only a single set of privileged state. Therefore, in addition to providing unique sets of virtualized users, we also restrict the abilities of virtualized root users. If the virtualized root users were not restricted, they could be treated equivalently to the root user of the underlying system, enabling them to break the virtualization abstraction. This dissertation demonstrates how operating system virtualization can be used to simply virtualize the set of users while restricting the abilities of the privileged but virtualized root user.

We then show that operating system virtualization can be combined with checkpoint/restart functionality to improve mobile users' computing experience. Many users lug around bulky, heavy computers simply to have access to their data and applications. To solve this problem, we created *Pod devices. A *Pod is a physical storage device, such as a portable hard disk or USB thumb drive, containing a complete application-specific environment, such as a desktop or web environment. *Pod devices run their applications on whatever host computer is available at the user's current location. By storing the entire environment on the portable device, users can move it between computers while retaining a common usage environment. Operating system virtualization, coupled with process migration technology, enables users to move their running processes and data between physical machines, much like a laptop can be suspended and resumed when changing locations. We have built a number of *Pod devices that enable users to carry an application [109, 110, 113] or an entire desktop [114] with them.

1.2 Mobility to Improve Administration

Building on *Pod, we demonstrate how operating system virtualization and checkpoint/restart ability can improve system maintenance, much of which requires taking the machine offline and shutting down all active processes. Among other problems, this prevents the kernel from being patched quickly, as applying a patch requires the machine to be rebooted for the patch to take effect, thereby killing all running processes on the machine. To address this, we developed AutoPod [112], a system that enables unscheduled operating system updates while preserving application service availability. AutoPod leverages *Pod's virtualization abstraction to provide a group of processes and associated users with an isolated, machine-independent virtualized environment decoupled from the underlying operating system instance. This enables AutoPod to run each independent service in its own isolated environment, preventing a security fault in one from propagating to other services running on the same machine. This virtualized environment is integrated with a checkpoint/restart system that allows processes to be suspended, resumed and migrated across operating system kernel versions with different security and maintenance patches.
AutoPod incorporates a system status service to determine when operating system patches need to be applied to the current host, then automatically migrates application services to another host to preserve their availability while the current host is updated and rebooted. AutoPod's ability to migrate processes across kernel versions also increases *Pod's value by making it possible for users to move their *Pod between machines that are not running the exact same kernel version.

1.3 Isolating Cooperating Processes

AutoPod envisions virtual computer usage growing rapidly as users create and use many task-specific virtual computers, as is already occurring with the rise of virtual appliances. But more computers mean more targets for malicious attackers, making it even more important to keep them secure. Operating system virtualization, as in a pod, provides namespaces that isolate processes from the host, enabling a level of least-privilege isolation as single services are constrained to independent pods. Today's services, however, are complex applications with many distinct components. Even within a pod, each component of the service has access to all resources required by every component within the system, which is not a true least-privilege system. To solve this problem, we developed PeaPod [115], which combines the pod with a pea (Protection and Encapsulation Abstraction). As AutoPod demonstrates, pods can be used to isolate services into separate virtual machine environments. The pea is used within a pod to provide finer-grained isolation among application components of a single service while still enabling them to interact. This allows services composed of multiple distinct processes to be constructed more securely. PeaPod enables processes to work together while limiting the resources each process can access to only those needed to perform its job.

1.4 Managing Large Numbers of Machines

Although virtualization provides numerous benefits, such as minimizing the amount of hardware to maintain by putting multiple virtual machines on a single physical host, it can also make it harder for administrators to maintain an increased number of virtual machines. Just as the proliferation of virtual machines affects security, it also significantly increases the administrative burden. Instead of managing a single machine providing a number of services, one manages many independent virtual machines that each provide a single service. When security holes are discovered in core operating system functionality, each virtual machine must be fixed separately.

This dissertation shows that operating system virtualization improves management of large systems. Although virtualization decreases the amount of physical hardware to manage, it does not reduce, and can even increase, the number of machine instances to be managed. Strata improves this situation by introducing the Virtual Layered File System (VLFS). Instead of having independent file systems for each service, the VLFS enables a file system to be divided into a set of shareable layers and combined into a single file system namespace view. This enables many machines to be stored efficiently, because data that is common to more than one machine has to be stored only once. It allows efficient provisioning, because none of the shared files have to be copied into place.
Finally, it improves maintenance, because a patched layer only has to be installed once and is then pulled into all the VLFSs that use that layer.

1.5 A Desktop of Isolated Applications

Once we can manage multiple independent machines efficiently, we can use those machines in novel ways. For instance, Apiary improves the ability to create secure computer desktops. Apiary leverages Strata's VLFS to contain each application in an independent and isolated container. Even if one application is exploited, the exploit will be confined to that application and the rest of the user's data will remain secure. Similarly, because VLFSs allow very quick provisioning, Apiary can run desktop applications ephemerally in addition to their regular persistent execution models. An ephemeral application is an application whose container is provisioned anew for each execution of the application. Once the execution is complete, the container is removed from the system. This means that even if an application executed ephemerally is exploited, the exploit will not persist, because the next ephemeral execution will be within a fresh container. Finally, because independent applications do not provide the integrated feel users expect from their desktops, Apiary enables applications to integrate securely at specific points. Apiary improves on PeaPod for desktop scenarios by enabling applications to be securely isolated without requiring complicated access rules to be designed and written.

1.6 Two-Person Control Administration

Finally, we have leveraged operating system virtualization to provide high-assurance system administration. In a traditional operating system, the administrative user is an all-powerful entity who can perform any task with no record of the changes made on the system and no check on their power. This causes two problems. First, an administrator with malicious intent can subvert the security of the system. Second, changes made by a single user are prone to error. ISE-T [111] changes this model by applying the concept of two-person control to system administration. Two-person control changes system administration in two ways. First, instead of performing administrative actions directly on the machine, the changes are first performed on a sandbox that mirrors the machine being administered. By providing two administrators with their own sandboxes in which to perform the same administrative task, ISE-T can extract their changes, compare them for equivalence, and, if equivalent, commit them to the underlying machine. Second, in cases where the two-person control system is too expensive, ISE-T can extract the changed state and store it in a secure audit log for future verification before committing it to the underlying system. This enables a high-assurance system with little additional administration cost.

1.7 Technical Contributions

This dissertation contributes multiple technical innovations and their associated architectures:

1. We introduce an operating system virtualization platform that provides secure virtual machines without any underlying operating system changes. This is necessary to enable multiple virtual environments to run in parallel on a single machine as well as to enable secure execution of untrusted processes.
2. We introduce a portable storage-based computing environment. By combining our secure operating system virtualization platform, a checkpoint/restart system, and portable storage devices, we created the *Pod architecture to migrate a user's processes between machines securely.

3. We introduce a checkpoint/restart mechanism to enable the migration of processes between machines running different kernels. This is accomplished by saving the checkpoint/restart state in a kernel-independent format so that it can be adapted to the internal data structures of the kernel to which the processes are being migrated. The AutoPod architecture improves system management by allowing administrators to administer machines without terminating processes. It also improves the utility of *Pod by not limiting users to machines running the same kernel version.

4. We introduce the pea process isolation abstraction. Peas allow individual processes in a multi-process system to cooperate while contained in individual resource-restricted compartments. The PeaPod architecture creates least-privilege environments for the multiple processes that constitute services in use today.

5. We introduce the Virtual Layered File System (VLFS). The VLFS improves system administration by enabling system administrators to divide a file system into distinct subset layers and use the layers for multiple simultaneous installations. The VLFS combines traditional package management with unioning file systems in a new way, yielding powerful new functionality. The Strata architecture permits administrators to provision and manage large numbers of virtual machines efficiently.

6. We introduce the concepts of a containerized desktop and ephemeral application execution. In a containerized desktop, each desktop application is fully isolated in its own container with its own file system. This prevents an exploited application from accessing data belonging to other applications. Ephemeral application execution creates a single-use application container and file system for each individual application execution. Ephemeral containers prevent malicious data from having any persistent effect on the system and isolate faults to a single application instance. The Apiary architecture provides a new way to secure desktop applications by isolating each application within its own container, while letting the isolated applications interact in a secure manner through ephemeral execution.

7. We introduce two-person control for system administration to create a high-assurance form of system administration. This helps keep system administration faults from impacting a system. We use the same mechanism to introduce auditable system administration, increasing assurance with little additional cost. The ISE-T architecture enables systems to be administered within this two-person control model.

Chapter 2
Overview of Operating System Virtualization

To understand how operating system virtualization allows us to solve new software problems without requiring the software to be rewritten, we first explain what operating system virtualization is and how it works. Many people are familiar with hardware virtualization, where real operating systems run on virtual hardware provided by a virtualization layer between the host machine and the operating system. Operating system virtualization differs from hardware virtualization in where it places the virtualization layer.
Instead of virtualizing the hardware interfaces, it virtualizes the operating system interfaces to provide virtualized views of the underlying host operating system. Unlike hardware virtualization, where different operating systems can run in parallel, operating system virtualization is restricted to the same operating system as the host. This dissertation explores the benefits of virtualizing the two primary operating system elements that applications leverage: the kernel, which provides the runtime but ephemeral state of a process, and the file system, which provides the long-term stable storage on which processes depend.

2.1 Operating System Kernel Virtualization

Applications depend heavily on kernel state during their runtime, from simple things like process identifiers to more complicated state like inter-process communication (IPC) keys, file descriptors and memory mappings. Some of this state already has an element of virtualization that enables multiple processes to coexist on a single system. For example, each process has its own file descriptor and virtual memory namespaces. On the other hand, state such as process identifiers and IPC keys is shared within a single namespace accessible to all processes. One primary use of operating system kernel virtualization is to create multiple parallel namespaces that are fully isolated from one another [7, 74, 116]. But this requires significant in-kernel modifications.

Operating system kernel virtualization is commonly implemented by virtualizing resource identifiers. Every resource that a virtualized process accesses has a virtual identifier which corresponds to a physical operating system resource identifier. When an operating system resource is created for a virtualized process, such as with IPC key creation, the virtualization layer, instead of returning the corresponding physical name to the process, intercepts the physical name value and returns a virtual name to the process. Similarly, any time a virtualized process passes a virtual identifier to the operating system, the virtualization layer intercepts it, replacing it with the appropriate physical identifier.

This type of operating system kernel virtualization is easily implemented by system call interposition. System call interposition can create a virtualized namespace because all operating system resources are accessed through system calls. By interposing on a system call, the virtualization abstraction can intercept the virtual resource identifier the process passes in with the system call and, if valid, replace it with the correct physical resource identifier. Similarly, whenever a physical resource is created and has its identifier passed back to a process, the virtualization abstraction can intercept the value and replace it with a newly created and mapped virtual identifier. By virtualizing a process so that it can only access virtually named resources, operating system virtualization decouples a process's execution from the underlying namespace of the host machine. Many commodity operating systems, including Solaris [116] and Linux [6], now include this functionality natively.
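To make this translation step concrete, the following user-space sketch keeps a per-environment table that maps virtual process identifiers to the host's physical ones and shows what an interposed kill() would do with it. The pod structure, table layout and helper names here are hypothetical illustrations of the bookkeeping, not the actual *Pod or Zap kernel code; in a real interposition layer the same mapping would be applied inside the wrapped system call rather than in an ordinary program.

    #include <stdio.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define POD_MAX 16

    /* one mapping table per pod: virtual PID n maps to phys[n - 1] */
    struct pod {
        pid_t phys[POD_MAX];
        int used;
    };

    /* physical -> virtual: called when the host hands a new PID back to the pod */
    static pid_t pod_map_new(struct pod *p, pid_t ppid)
    {
        if (p->used == POD_MAX)
            return -1;
        p->phys[p->used++] = ppid;
        return p->used;                 /* the newly assigned virtual PID */
    }

    /* virtual -> physical: called when a pod process passes a PID into the kernel */
    static pid_t pod_to_phys(struct pod *p, pid_t vpid)
    {
        if (vpid < 1 || vpid > p->used)
            return -1;                  /* unknown inside this pod */
        return p->phys[vpid - 1];
    }

    /* what an interposed kill() would do: translate, then call the built-in call */
    static int pod_kill(struct pod *p, pid_t vpid, int sig)
    {
        pid_t ppid = pod_to_phys(p, vpid);
        if (ppid < 0)
            return -1;                  /* would be -ESRCH in the kernel */
        return kill(ppid, sig);
    }

    int main(void)
    {
        struct pod p = { .used = 0 };
        pid_t vself = pod_map_new(&p, getpid());
        printf("host PID %d appears as virtual PID %d inside the pod\n",
               (int)getpid(), (int)vself);
        /* signal 0 performs an existence check through the mapping */
        return pod_kill(&p, vself, 0);
    }

In an actual interposition layer, pod_kill would stand in for the kernel's kill entry point for processes inside the pod, so the translation remains invisible to the application.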
Kernel virtualization is not limited to creating independent and isolated namespaces, but can also change how the kernel behaves. Instead of simply translating resource identifiers, kernel virtualization can change how system calls interact with those identifiers. For instance, it can change the security semantics of system calls. Many system calls have built-in security checks to decide whether a process has permission to execute a specific piece of functionality. Once the kernel is virtualized through the system call interface, the virtualized system calls can allow a process to access a resource it would otherwise have been prevented from accessing, or vice versa.

2.2 File System Virtualization

Kernel virtualization and system call interposition enable virtualization of the ephemeral kernel state, but each process also uses the file system, which provides processes with persistent storage. By virtualizing the file system, we enable processes to have a file system view that is independent of the host machine's file system. For instance, when creating multiple parallel kernel namespaces, one often intends to provide virtual machine environments. To do this, one must also provide a private file system namespace for each environment. If private file system namespaces are inadvertently omitted, the file system is shared and isolation is severely weakened.

In fact, commodity operating systems offer the ability to virtualize the file system in exactly this way, such as by leveraging the chroot facility, which enables a process to be confined to a subset of the underlying machine's file system. But because current commodity operating systems are not built to support multiple namespaces, we must address the security issues this causes. Although chroot can provide processes within a pod a virtualized file system namespace, there are many ways to break out of the standard chrooted environment, especially if one allows the chroot system call to be used by processes within the virtualized file system environment [58]. To provide secure file system virtualization, the virtualization mechanism must enforce the chrooted environment's limitations at all times. We have implemented a barrier directory in the underlying file system that prevents processes within a pod from crossing it. Even if a process is able to break the chrooted virtualized file system view, the process will never be able to access any files outside the virtualized area.

To enforce a barrier, we interpose on the file system's ->permission method, which determines whether a process can access a file or directory. For example, if a process tries to access a file a few directories below the current directory, the permission function is called on each directory in order, as well as on the file itself. If a call determines that the process does not have permission on that directory, the chain of calls ends, because the process must have permission to traverse the directory hierarchy in order to access the file. By interposing on the permission function, we can deny processes within a pod permission to access the barrier directory. Such a process cannot traverse the barrier and so cannot access any file outside the virtualized file system environment.

However, file system virtualization is not limited to the creation of private file system namespaces. Much as the barrier directory is implemented by interposing on the file system's permission function, one can also interpose on all the functionality the file system exposes to the operating system in order to create virtualized file system instances. Just as pod virtualization allows differentiating virtual machine environments without unique machine or operating system instances, file system virtualization permits differentiating each pod's file system namespace in unique ways without requiring each pod to have a unique physical file system. For instance, file system virtualization enables pods to have unique file system security policies. It can even create file system views totally independent of the underlying file system by combining multiple individual file systems into a single view.

In fact, this is exactly how stackable file systems [124, 152] work. Stackable file systems provide a completely virtual file system by interposing on the kernel's file system operations. Instead of interposing directly, as with system call virtualization, stackable file systems create a virtual file system that the kernel uses as a regular file system. But rather than having data stores on block devices of their own, stackable file systems leverage the data stored within other file systems. This enables them to interpose directly on the physical file system by leveraging the operating system's file system interface. Instead of executing the physical file system's functions directly, including using its directory entry and inode structures, the stackable file system interposes on those functions and provides its own set of file system structures that map onto those of the underlying physical file system. By interposing between the kernel and physical file systems, stackable file systems allow easy creation of virtual file systems. The virtual file system is then able to modify operations as appropriate for the needs of the system. For example, a unioning semantic can be implemented with a stackable file system that combines multiple underlying physical directories into a single view by interposing on the ->readdir method. Whenever a program calls that operation, the stackable file system creates a virtualized view by running the operation against all the underlying directories being unioned into a single view and returning the unioned set of data.
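The small program below is a user-space analogue of such an interposed ->readdir: it lists several underlying directories as one view, letting entries from earlier directories occlude identically named entries from later ones. It only illustrates the unioning semantic; it is not code from an actual stackable file system, and the merging policy shown is one reasonable choice among several.

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_ENTRIES 4096
    #define NAME_LEN    256

    /* returns 1 if an earlier (higher-priority) directory already supplied this name */
    static int already_listed(char names[][NAME_LEN], int count, const char *name)
    {
        for (int i = 0; i < count; i++)
            if (strcmp(names[i], name) == 0)
                return 1;
        return 0;
    }

    int main(int argc, char **argv)
    {
        static char names[MAX_ENTRIES][NAME_LEN];
        int count = 0;

        /* each argument names one underlying directory; earlier ones take precedence */
        for (int i = 1; i < argc; i++) {
            DIR *dir = opendir(argv[i]);
            if (dir == NULL)
                continue;
            struct dirent *ent;
            while ((ent = readdir(dir)) != NULL && count < MAX_ENTRIES) {
                if (already_listed(names, count, ent->d_name))
                    continue;          /* same name in a lower layer is occluded */
                snprintf(names[count], NAME_LEN, "%s", ent->d_name);
                count++;
                printf("%s\n", ent->d_name);   /* one entry in the unioned view */
            }
            closedir(dir);
        }
        return 0;
    }

Compiled and invoked as, say, ./union /layer/top /layer/base, it prints the merged listing a process would see through the unioned view.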
2.3 Related Work

Many different systems have been created to enable the virtualization of kernel state. They can be loosely grouped into four categories.

Operating system provided virtualization. This is most notable in operating systems that provide alternate namespaces for the creation of containers, including Solaris's Zones [116], Linux's Vserver [7] and Containers [6], and BSD's Jail mode [74]. Systems in this category are the least flexible, as their techniques are tightly coupled to the underlying system. This prevents them from being leveraged to solve problems for which they were not explicitly designed.

Direct interposition on system calls. This enables code to directly intercept the system call within the kernel. The kernel does not call the built-in system call's function, but instead executes the function provided by the virtualization layer, which in turn calls the built-in one if needed. This very old technique, common in MS-DOS, was used in Terminate and Stay Resident (TSR) programs [99]. In more modern usage, Zap [100] implements its virtualization by interposing directly on the set of system calls it desires to virtualize, as well as by providing a generic interface that enables other virtualization layers to interpose on whatever system calls they desire. The architectures in this dissertation use this approach.

Kernel-based system call trace and trapping. This is most notably provided by the ptrace system call [144], which provides tracing and debugging facilities that enable one process to completely control the execution of another. For instance, a controlling process can be notified whenever the controlled process attempts to execute a system call. Instead of letting the system call run directly, the controlling process chooses to allow or disallow the system call, to change the parameters being passed to the system call, or even to cause a totally separate code path to be executed. This is a very flexible approach because, while the interposition is enforced by the kernel via the ptrace system call, it runs as a regular user space program. However, due to the many context switches between the user space program using the ptrace system call and the kernel, performance suffers.

User space-based system call trace and trapping. Instead of trapping in the kernel, one can provide a user space library that supplies its own system call wrapper functions [1]. Well-behaved programs do not execute system calls directly, but call a library function that wraps the system call, enabling the system call to be virtualized by replacing that library function with one that enforces the virtualization of the kernel state. But this only works for well-behaved applications and cannot be used to enforce security schemes, as any application can execute system calls directly and avoid the library's interposition mechanism.

File system virtualization. Operating systems' file system interfaces have also been virtualized in multiple ways. Modern operating systems provide a Virtual File System (VFS) interface [73]. This enables different types of file systems to be used with the operating system in a manner transparent to all applications. In addition, modern operating systems support network file system shares using protocols such as NFS [135] and SMB [151]. These network file systems provide virtualized access to a remote file system while enabling applications to treat its contents as though they were stored locally.

A common way to create virtualized file system access is through stackable file systems. For example, Plan 9 [104] offered the 9P distributed file system protocol [105] to enable the creation of virtual file systems. HURD [35] and Spring [78] also included extensible file system interfaces. More commonly today, the NFS protocol serves as the basis for other file systems that virtualize and extend the Unix file system via the SFS Toolkit [89]. It exposes the NFS interface to user space programs, allowing them to provide file system functionality safely. But the NFS protocol is very complicated, and user space file systems that depend on it must fully understand it to be implemented correctly. The more usual approach is to leverage kernel functionality to create these virtualized file systems. This is generally easier to implement than an NFS-based approach because the kernel's file system interface is simpler than the one the NFS protocol exposes. It can be implemented via a user space file system framework such as FUSE [137] that provides the necessary kernel hooks. Alternatively, the entire file system can be built as an in-kernel file system that can be dynamically loaded and unloaded, as in FiST [152], which behaves as a native file system. In general, the in-kernel approach yields significantly better performance because fewer context switches are needed. Kernel-based virtualized file systems are known as stackable file systems and have been implemented in many different operating systems [97, 124, 130, 152].

Chapter 3
*Pod: Improving User Mobility

A key problem mobile users face is the lack of a common environment as they switch locations. The computer at the office is configured differently from the one at home, which is again different from the one at the library. Even though mobile users have massive computing power at each location, they cannot easily take advantage of it. These locations can have different sets of software installed, which can make it difficult for a user to complete a task. Moreover, mobile users want consistent access to their files, which is difficult to guarantee as they move around. The current personal computer framework ties a user's data to a single machine.

Laptops are a common attempt to solve the problems posed by mobility. Laptops enable users to carry their data and applications with them wherever they go. But laptops only mask the problem, as they do not leverage the existing infrastructure and suffer from a number of difficulties of their own. First, laptops are not as full-featured as desktop computers. They have less storage and smaller physical features like keyboards and displays. They are slower because cooling and space constraints prevent the fastest processors from being used in a laptop. Even laptops considered to be desktop replacements have speed limitations, tend to be as heavy as 8 or 9 pounds, and are not meant to be extremely mobile. Second, because laptops use small, specialized, and moving parts, they are more fault-prone. This manifests itself in moving parts like a fan or hard disk breaking down, or in an internal connection coming loose, as when memory is unseated from its socket.

To address these mobility and reliability problems posed by laptops, we have designed and built the *Pod architecture. *Pod leverages operating system virtualization to enable the creation of application-specific portable devices that decouple a user's application environment from any one physical machine. Depending on the mobile user's needs, the *Pod architecture lets users carry a single application or a large set of applications, as well as large sets of data.

For instance, many users do most of their computing work through a web browser. They read email through webmail interfaces, interact with friends on social networking websites, and even use word processors and spreadsheets without leaving the web browser. But while a web browser is available on every Internet-connected computer, it will not necessarily be configured according to their needs. For instance, helper applications, browser plugins, bookmarks and cookies will not move with them between machines. For these users, we leveraged the *Pod architecture to create a WebPod device that contains a web browser, plugins and the helper applications needed within the web environment.

Many mobile users, however, require a more full-featured computing environment. They do not want to store all their data on the Internet, nor to be limited to the applications available via the web. They expect the traditional desktop experience. Although they have access to powerful computers at many locations, these computers are not configured correctly for their work. For these users, we leveraged the *Pod architecture to create a DeskPod device containing all of the desktop applications a user requires, along with their data.

The *Pod architecture enables mobile users to obtain the same persistent, personalized computing experience at any computer. *Pod takes advantage of commodity storage devices that can easily fit in a user's pocket yet store large amounts of data. These devices range from flash memory sticks that can hold 64 GB of data to portable hard disks, such as an Apple iPod, that can hold 120 GB of data. These devices can hold a user's entire computing environment, including applications and all their data. The *Pod architecture allows a user to decouple their computing session from the underlying computer, so that it can be suspended to a portable storage device, carried around easily, and resumed from the storage device on a completely different computer. Users have ubiquitous access to computing power, at work, home, school, the library or even an Internet cafe, and the *Pod architecture enables them to continue working, even in the face of faulty components, simply by moving their *Pod-based environment to a new host machine. *Pod provides this functionality without modifying, recompiling or relinking any applications or the operating system kernel, and with only a negligible impact on performance.

The *Pod architecture does have limitations, as shown by our MediaPod and GamePod devices. These devices enable users to carry with them a multimedia player and a game playing environment, respectively. Although they are very flexible in what media formats and games they can play, they do not provide any computing capabilities of their own. Moreover, although they allow users to move their environment among computers, they do not let them make use of the environment on the go, when they have no access to a computer. This is in contrast to devices such as Apple's iPod and Nintendo's Game Boy and DS portable devices, which can only play a limited number of formats but provide their own computing ability and are therefore usable on the go, without a powerful computer. These devices are popular with users on the move, so MediaPod and GamePod are less likely to replace them.

3.1 *Pod Architecture

*Pod operates by encapsulating a user's computing session within a virtualized execution environment and storing all state associated with the session on the portable storage device. *Pod also leverages THINC [27] to virtualize the display so that the application session can be scaled to different display resolutions as a user moves among computers. This enables a computing session to run the same way on any host despite different operating system environments and display hardware. These virtualization mechanisms enable *Pod to isolate and protect the host from untrusted applications that a user may run as part of their *Pod session. *Pod virtualization also prevents other applications outside of the computing session that may be running on the host from accessing any of the session's data, protecting the user's privacy. We have combined *Pod's virtualization with Zap's checkpoint/restart mechanism [100], allowing users to suspend the entire computing session to the portable storage device so that it can be migrated between physical computers by simply moving the storage device to a new computer and resuming the session there.
*Pod preserves on the portable device the file system state and process execution state associated with the computing session. A limitation of this approach is that Zap only supports homogeneous migration, so it can migrate only between machines running the exact same kernel. In Chapter 4, we demonstrate heterogeneous migration, thereby removing this limitation. As a result, *Pod enables users to maintain a common environment no matter what computer they are using. Devices built upon the *Pod architecture are also less prone to problems, because they do not contain a complete operating system, only the programs needed for one specific application environment. Various operating system services that a normal machine depends on are not needed, so maintenance is simpler.

To the user, a *Pod-based device appears no different from a private computer, even though it runs on a host that may be running other applications. Those applications run outside the session provided by the *Pod device and are not visible to a user within the *Pod session. To provide strong security, the *Pod can store the session on an encrypted file system. If the *Pod device is lost or stolen, an attacker will only be able to use it as his own personal storage device.

3.1.1 Secure Operating System Virtualization

In order to enable a *Pod device to be used on computers that are not controlled by the *Pod user, we must securely isolate the *Pod device from the underlying machine. Previous operating system virtualization techniques either are not designed to provide secure isolation, and therefore do not protect the host machine from rogue processes running within the device's context, or require significant changes to the underlying operating system. For example, pods as introduced by Zap [100] provide a level of isolation and enable multiple pods to coexist on a single system, but they were not designed to be secure. Zap's virtualization operates by providing each environment with its own virtual private namespace. A pod contains its own host-independent view of operating system resources such as PID/GID, IPC, memory, file system and devices. The namespace is the only means for the processes to access the underlying operating system. Zap introduces this namespace to decouple processes from the host's operating system. But Zap assumes that the person using the pod already has privileged access to the machine, and therefore is not directly concerned with a user breaking out of the abstraction. Without protecting the host, no one would allow *Pod devices to use their systems. Therefore, we leverage operating system virtualization at both the kernel and file system levels to create the secure pod abstraction, enabling untrusted *Pod devices to be used securely.

Protecting the host from rogue processes requires a complete virtualization abstraction that totally confines the process and prevents it from breaking the abstraction and effecting change to the host machine. The secure pod abstraction achieves this in two ways. First, it prevents processes within it from accessing any file system outside of the *Pod device. Second, while it lets processes in the *Pod device context run with privilege, it prevents any privileged action that could break the abstraction.

Many previous operating system virtualization architectures relied on the chroot functionality to provide a private file system namespace in which processes run.
While chroot can give a set of processes a virtualized file system namespace, there are many ways to break out of the standard chrooted environment, especially if one allows the chroot system call to be used by the virtualized processes. To prevent this, the secure pod abstraction virtualizes the file system interface and implements a barrier, thereby enforcing the chrooted environment even while allowing the chroot system call. We can implement a barrier easily because file systems provide a ->permission method that determines if a process can access a file. For example, if a process tries to access a file a few directories below the current directory, the file system’s ->permission method is called on each directory as well as the file itself, in order. If any call determines that the process does not have permission on a directory, the chain of calls ends. Even if the ->permission method were to determine that the process has access to the file itself, it must have permission to traverse the directory hierarchy to reach the file. We implemented a barrier simply by stacking a small virtual file system on top of the staging directory that virtualized the underlying ->permission method to prevent the virtualized processes from accessing the parent directory of the Chapter 3. *Pod: Improving User Mobility 26 staging directory. This effectively confines *Pod processes to the *Pod’s file system by preventing a rogue process from ever walking past the *Pod’s file system root. The secure pod abstraction also takes advantage of the user identifier (UID) security model in traditional file systems to support multiple security domains on the same system running on the same operating system kernel. For example, since each secure pod has its own private file system, it has its own /etc/passwd file that determines its list of users and their corresponding UIDs. In traditional Unix file systems, the UID of a process determines what permissions it has when accessing a file. This means that since the *Pod’s file system is separate from the host file system, a *Pod process is effectively running in a separate security domain from another process with the same UID that is running directly on the host system. Although both processes have the same UID, the *Pod process is only allowed to access files in its own file system namespace. Similarly, this model allows multiple secure pods on a single system to contain independent processes running with the same UID. This UID model supports an easy-to-use migration model when a user may be using a *Pod device on a host in one administrative domain and then moves the *Pod device to another. Even if the user has computer accounts in both administrative domains, it is unlikely that the user will have the same UID in both domains if they are administratively separate. Nevertheless, the secure pod abstraction enables the user to run the same *Pod device with access to the same files in both domains. Suppose the user has UID 100 on a machine in administrative domain A and starts a pod connecting to a file server residing in domain A. Suppose that all virtualized processes are then running with UID 100. When the user moves to a machine in administrative domain B where they have UID 200, they can migrate their *Pod device to the new machine and continue running its processes. Those processes can continue to run as UID 100 and continue to access the same set of files on the *Pod Chapter 3. 
*Pod: Improving User Mobility 27 file system, even though the user’s real UID has changed. This works even if there is a regular user on the new machine with a UID of 100. Whereas this example considers the case of a *Pod device with all processes running with the same UID, it is easy to see that the secure pod abstraction supports running processes with many different UIDs. However, this only works for regular processes, however, because they do not have special privileges. But because the root UID 0 is privileged and treated specially by the operating system kernel, the secure pod virtualization abstraction treats UID 0 processes within a secure pod specially as well. We must do this to prevent processes with privilege from breaking the virtualization abstraction, accessing resources on the host, and harming it. The secure pod abstraction does not disallow UID 0 processes, as this would limit the range of application services that can be virtualized. Instead, it restricts such processes to ensure that they function correctly when virtualized. While a process is running in user space, its UID does not have any effect on process execution. Its UID only matters when it tries to access the underlying kernel via one of the kernel entry points, namely devices and system calls. Since the secure pod abstraction already provides a virtual file system that includes a virtual /dev with a limited set of secure devices, the device entry point is already secured. Furthermore, the secure pod abstraction disallows device nodes on the *Pod device’s file system. The only system calls of concern are those that could allow a root process to break the virtualization abstraction. Only a small number of system calls can be used for this purpose. These system calls are listed and described in further detail in Appendix A. Secure pod virtualization classifies these system calls into three classes. The first class of system calls are those affecting only the host system and serving no purpose within a virtualized context. Examples of these system calls include those that load and unload kernel modules (create module, delete module) or that reboot Chapter 3. *Pod: Improving User Mobility 28 the host system (reboot). Since they only affect the host, they would break the secure pod abstraction by allowing processes within it to make administrative changes to the host. System calls that are part of this class are therefore made inaccessible by default to virtualized processes. The second class of system calls are those forced to run unprivileged. Just as NFS, by default, squashes root on a client machine to act as user nobody, secure pod virtualization forces privileged processes to act as the nobody user when they execute some system calls. Examples of these system calls include those that set resource limits and ioctl system calls. Because some system calls, such as setrlimit and nice, can allow a privileged process to increase its resource limits beyond predefined limits imposed on virtualized processes, privileged virtualized processes are by default treated as unprivileged when executing these system calls. Similarly, the ioctl system call is a multiplexer that effectively allows any driver on the host to install its own set of system calls. It is impossible to audit the large set of system calls, given that a *Pod device may be used on a wide range of machine configurations, so we conservatively treat access to this system call as unprivileged by default. 
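Stepping back briefly to the file system barrier described earlier in this section, its effect can be pictured with a small stand-alone check. The sketch below approximates, in user-space C, what the stacked ->permission method enforces: any path that resolves to a location outside the pod's staging directory is refused, so a process cannot walk past the pod's file system root even using symlinks or ".." tricks. This is only an analogy (the real barrier is enforced inside the kernel on every directory traversal), and the names POD_ROOT and pod_path_allowed, as well as the example path, are invented for the illustration.

    /*
     * User-space analogy for the barrier: resolve a path and refuse anything
     * that escapes the pod's file system root. The in-kernel barrier instead
     * stacks a thin file system over the staging directory and denies the
     * ->permission check on its parent; this sketch only mirrors the effect.
     */
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define POD_ROOT "/autopod/exim/root"    /* example pod staging directory */

    /* Return 1 if 'path' resolves to a location inside the pod root. */
    static int pod_path_allowed(const char *path)
    {
        char resolved[PATH_MAX];

        /* realpath() follows symlinks and "..", so escape attempts show up
         * in the resolved path; unresolvable paths are simply denied. */
        if (!realpath(path, resolved))
            return 0;

        size_t rootlen = strlen(POD_ROOT);
        if (strncmp(resolved, POD_ROOT, rootlen) != 0)
            return 0;
        /* Reject sibling directories such as /autopod/exim/root-evil. */
        return resolved[rootlen] == '\0' || resolved[rootlen] == '/';
    }

    int main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
            printf("%-40s %s\n", argv[i],
                   pod_path_allowed(argv[i]) ? "inside pod" : "denied");
        return 0;
    }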
The final class of system calls are those that are required for regular applications to run, but that have options that will give the processes access to the underlying host resources, breaking the isolation provided by the secure pod abstraction. Since these system calls are required by applications, the secure pod virtualization checks all their options to ensure that they are limited to resources to which the *Pod device has access, making sure they do not break the secure pod abstraction. For example, the mknod system call can be used by privileged processes to make named pipes or files in certain application services. It is therefore desirable to make it available to virtualized processes. But it can also be used to create device nodes that provide access to the underlying host resources. The secure pod’s kernel virtualization mechanism checks Chapter 3. *Pod: Improving User Mobility 29 the options of the system call and only allows it to continue if it is not trying to create a device. 3.2 Using a *Pod Device A user starts a *Pod device simply by plugging it in to a computer. The computer detects the device and automatically tries to restart the *Pod session. The user is asked for a password. Authentication can also be done without a password by using built-in fingerprint readers available on some USB drives [11]. Once a user is authorized, the *Pod device mounts its file system, restarts its desktop computing session, and attaches a *Pod viewer to the session, making the associated set of applications available and visible to the user. Applications running in a *Pod session appear to the underlying operating system just like other applications that may be running on the host machine, and they use the host’s network interface in the same manner. Once the *Pod is started, the user can use the applications available in the computing environment. When the user wants to leave the computer, they simply close the *Pod viewer. The *Pod session is quickly checkpointed to the *Pod storage device, which can then be unplugged and carried around by the user. When the user is ready to use another computer, they simply plug in the *Pod device and the session restarts exactly where it was suspended. With a *Pod-based device, the user does not need to manually launch applications and reload documents. The *Pod’s integrated checkpoint/restart functionality maintains a user’s computing session persistently as a user moves from one computer to another, even including ephemeral states such as copy/paste buffers. If the host machine crashes, it takes down the current *Pod session with it. But since *Pod devices do not provide their own operating system, Chapter 3. *Pod: Improving User Mobility 30 one can simply plug it into a new host machine and start a fresh *Pod session. The only data lost is that not committed to disk when the host machine crashes. In addition, the *Pod device’s file system is automatically backed up when connected to the user’s primary computer. This enables quick recovery if the device is lost. The user can replicate the file system on a new device and continue working. 3.3 Experimental Results We have implemented four *Pod devices: WebPod [113], DeskPod [114], MediaPod [109] and GamePod [110]. Each *Pod device contains three components: a simple viewer application for accessing the *Pod session, an unmodified XFree86 4.3 display server with THINC’s virtual display device driver, and a loadable kernel module in Linux that requires no changes to the Linux kernel. 
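Returning to the system call treatment described in Section 3.1.1, the sketch below shows one way the three classes could be expressed: a classification table that denies host-only calls, forces calls such as setrlimit, nice and ioctl to run unprivileged, and admits calls such as mknod only after their arguments are vetted. It is a stand-alone illustration, not the prototype's kernel-module interposition code; the enum values, function names and the particular table entries are choices made for this sketch.

    /*
     * Illustration of the three system call classes: deny host-only calls,
     * "squash" privilege for calls such as setrlimit/nice/ioctl, and vet the
     * arguments of calls such as mknod so they cannot create device nodes.
     * Stand-alone sketch; names and the classification table are hypothetical.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    enum pod_action {
        POD_DENY,          /* class 1: affects only the host, never allowed  */
        POD_SQUASH_ROOT,   /* class 2: run as an unprivileged user           */
        POD_CHECK_ARGS,    /* class 3: allowed only after argument checking  */
        POD_ALLOW
    };

    static enum pod_action classify(const char *syscall_name)
    {
        if (!strcmp(syscall_name, "reboot") ||
            !strcmp(syscall_name, "create_module") ||
            !strcmp(syscall_name, "delete_module"))
            return POD_DENY;
        if (!strcmp(syscall_name, "setrlimit") ||
            !strcmp(syscall_name, "nice") ||
            !strcmp(syscall_name, "ioctl"))
            return POD_SQUASH_ROOT;
        if (!strcmp(syscall_name, "mknod"))
            return POD_CHECK_ARGS;
        return POD_ALLOW;
    }

    /* mknod argument check: named pipes and regular files are fine, but
     * block or character device nodes would pierce the pod and are refused. */
    static int mknod_args_ok(mode_t mode)
    {
        return S_ISFIFO(mode) || S_ISREG(mode);
    }

    int main(void)
    {
        printf("reboot    -> class %d\n", classify("reboot"));
        printf("setrlimit -> class %d\n", classify("setrlimit"));
        printf("mknod     -> class %d\n", classify("mknod"));
        printf("mknod of a FIFO allowed?        %d\n", mknod_args_ok(S_IFIFO | 0600));
        printf("mknod of a char device allowed? %d\n", mknod_args_ok(S_IFCHR | 0600));
        return 0;
    }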
The kernel module provides the secure pod's operating system virtualization layer and Zap's process migration mechanism. We present experimental results using our Linux prototype to quantify the overhead of using the *Pod device on various applications. Experiments were conducted on three IBM PC machines, each with a 933 MHz Intel Pentium-III CPU and 512 MB RAM. The machines each had a 100 Mbps NIC and were connected to one another via 100 Mbps Ethernet and a 3Com Superstack II 3900 switch. Two machines were used as hosts for running the *Pod device and the third was used as a web server for measuring web benchmark performance. To demonstrate *Pod's ability to operate across different operating system distributions, each host machine was configured with a different Linux distribution: one ran Debian 3.0 (“Woody”) and the other Debian 3.1 (“Sarge”), both with a Linux 2.4.18 kernel. We used a 40 GB Apple iPod as the *Pod portable storage device, although a much smaller USB memory drive would have sufficed. Both PCs used FireWire to connect to the iPod.

We built an unoptimized *Pod file system by bootstrapping a Debian GNU/Linux installation onto the iPod and installing the appropriate set of applications. In all cases, we included a simple KDE 2.2.2 environment. WebPod additionally included the Konqueror 2.2.2 web browser. GamePod included Quake 2, Tetris and Solitaire. DeskPod added on top of WebPod the entire KDE Office Suite, with all the desktop applications a user needs. Finally, MediaPod added on top of DeskPod multiple media-related applications, including video, DVD and music players, with their related codecs. We removed the extra packages needed to boot a full Linux system, as *Pod is just a lightweight application environment, not a full operating system.

          WebPod   GamePod   DeskPod   MediaPod
    Size  163 MB   283 MB    418 MB    633 MB

    Table 3.1 – Per-Device *Pod File System Sizes

As can be seen in Table 3.1, the various *Pod devices we built all have minimal storage requirements, enabling them to be stored on many portable devices with ease. In addition, our unoptimized *Pod file systems could be even smaller if the file system were built from scratch instead of by installing programs and libraries as needed.

To measure the cost of *Pod's virtualization, we took a range of benchmarks that represent various operations that occur in a normal application environment and measured their performance on both our Linux *Pod prototype and a vanilla Linux system. We used a set of micro-benchmarks that represent operations executed by real applications as well as a real web browsing application benchmark. Table 3.2 shows the 6 benchmarks we used along with their performance on a vanilla Linux system in which all benchmarks were run from a local disk.

    Name       Description                                                      Linux
    getpid     average getpid runtime                                           350 ns
    ioctl      average runtime for the FIONREAD ioctl                           427 ns
    semaphore  IPC semaphore variable is created and removed                    1370 ns
    fork-exit  process forks and waits for child, which calls exit immediately  44.7 µs
    fork-sh    process forks and waits for child to run /bin/sh to run a
               program that prints “hello world” then exits                     3.89 ms
    iBench     measures the average time it takes to load a set of web pages    826 ms

    Table 3.2 – Benchmark Descriptions
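To give a concrete sense of how a system call micro-benchmark from Table 3.2 can be timed, the sketch below measures getpid using the processor's timestamp counter. The iteration count, the use of the __rdtsc() compiler intrinsic, and the conversion from cycles to nanoseconds via the 933 MHz clock rate are assumptions of this sketch, not details of the original measurement harness.

    /*
     * Sketch of a TSC-based system call micro-benchmark: time a tight getpid
     * loop in cycles and convert to nanoseconds. The iteration count and use
     * of the __rdtsc() intrinsic are choices made for this sketch.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <x86intrin.h>          /* __rdtsc() with GCC/Clang on x86 */

    #define ITERATIONS 100000
    #define CPU_MHZ    933.0        /* Pentium-III clock used in these experiments */

    int main(void)
    {
        volatile pid_t pid = 0;
        unsigned long long start, end;

        start = __rdtsc();
        for (int i = 0; i < ITERATIONS; i++)
            pid = getpid();         /* the system call being measured */
        end = __rdtsc();

        /* cycles per call divided by cycles per nanosecond (= GHz) gives ns */
        double ns_per_call = (double)(end - start) / ITERATIONS / (CPU_MHZ / 1000.0);
        printf("getpid: %.0f ns per call (last pid %d)\n", ns_per_call, (int)pid);
        return 0;
    }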
These benchmarks were then run for comparison purposes in the *Pod portable storage environment. To obtain accurate, repeatable results, we rebooted the system between measurements. Additionally, the system call micro-benchmarks directly used the TSC register available on Pentium CPUs to record timestamps at the significant measurement events; each timestamp's average cost was 58 ns. The files for the benchmarks were stored on the *Pod's file system. All of these benchmarks were performed in a *Pod environment running on the PC machine that ran Debian Unstable with a Linux 2.4.18 kernel.

    [Figure 3.1 – *Pod Virtualization Overhead. Performance of each benchmark (getpid, ioctl, semaphore, fork-exit, fork-sh, iBench) under *Pod, normalized to plain Linux at 1.0.]

Figure 3.1 shows the results of running our benchmarks under both configurations, with the vanilla Linux configuration normalized to 1. A smaller number is better for all benchmark results. Figure 3.1 shows that *Pod virtualization overhead is small. *Pod incurs less than 10% overhead for most of the micro-benchmarks and less than 4% overhead for the iBench application workload. The overhead for the simple system call getpid benchmark is only 7% compared to vanilla Linux, reflecting the fact that *Pod virtualization for these kinds of system calls only requires an extra procedure call and a hash table lookup. The most expensive benchmark for *Pod is the semaphore benchmark (semget+semctl), which took 51% longer than vanilla Linux. The cost reflects the fact that our untuned *Pod prototype needs to allocate memory and do a number of namespace translations. Kernel semaphores are widely used by web browsers such as Mozilla and Konqueror to perform synchronization. The ioctl benchmark also shows relatively high overhead because *Pod performs 12 separate assignments to protect the call against malicious processes, which is substantial compared to the simple FIONREAD ioctl, which performs only a single dereference; in absolute terms, however, this adds only about 200 ns of overhead to any ioctl. There is minimal overhead for functions such as fork and the fork/exec combination. This is indicative of what happens when the web browser loads a plugin such as Adobe Acrobat, where the web browser runs the acroread program in the background.

Figure 3.1 shows that *Pod has low virtualization overhead for real applications as well as micro-benchmarks. This is illustrated by the performance on the iBench benchmark, which is a modified version of the Web Text Page Load test from the Ziff-Davis iBench 1.5 benchmark suite. It consists of a JavaScript-controlled load of a set of web pages from the web benchmark server. iBench also uses JavaScript to measure how long it takes to download and process each web page, then determines the average download time per page. The pages contain both text and bitmap graphics, with pages varying in the proportions of text and graphics. The graphics are embedded images in GIF and JPEG formats. Our results show that running the iBench benchmark in the *Pod environment incurs virtually no performance overhead versus running in vanilla Linux from local SCSI storage.

To measure the cost of checkpointing and restarting *Pod sessions, as well as demonstrate *Pod's ability to improve the way a user works with various applications, we migrated multiple *Pod sessions containing different sets of applications.
For WebPod, we migrated multiple sessions containing different numbers of open browser windows between the two machines described above. For DeskPod, we migrated a session containing the KWrite word processor, the KSpread spreadsheet and the Konqueror web browser, each displaying a document, in addition to a Konsole terminal application; this is indicative of a regular desktop environment. For MediaPod, we migrated multiple sessions containing different sets of running desktop and multimedia applications: first, a MediaPod using the Totem media player playing an XviD-encoded version of a DVD; second, a MediaPod using Ogle playing a straight DVD image copied to the MediaPod; third, a MediaPod playing an mp3 file using the mpg123 program.

    [Figure 3.2 – *Pod Checkpoint/Restart vs. Normal Startup Latency. Checkpoint, restart and plain startup times in seconds (log scale) for each session: 1 Browser, 10 Browsers, Desktop, Totem, Ogle, mpg123, Solitaire, Tetris and Quake, grouped by WebPod, DeskPod, MediaPod and GamePod.]

Figure 3.2 shows how long it takes to checkpoint to disk and warm cache restart the multiple *Pod sessions described above. We compared this to how long it would take to warm cache startup each session independently. Figure 3.2 shows that, in general, it is significantly faster to checkpoint and restart *Pod sessions than it is to start the same kind of session from scratch. Checkpointing and restarting a *Pod, even with many browser windows opened, takes under a second. A *Pod user can disconnect, plug in to another machine, and start using their session again very quickly. Many tasks have a large startup time, such as Ogle, which iterates through all the files on the DVD image to determine if they have to be decrypted and calculates the decryption key. Furthermore, these experiments were run across two different machines with two different operating system environments, demonstrating that *Pod can indeed work across different software environments.

In contrast, Figure 3.2 shows that starting the applications the traditional way is much slower in all cases. For instance, starting a browsing session takes 12 seconds when opening the browser windows with actual web content. Even starting a web browsing session by opening a single browser window takes more than a second. In the mpg123 case, the mpg123 application appears to start faster than *Pod can restart it, but this is not a direct comparison: for plain startup, all we are doing is starting the small 136 KB mpg123 application, while for *Pod restart, we are restarting the entire KDE desktop environment as well. It should be noted that *Pod's approach to restarting applications is fundamentally different than plain restarting, as *Pod returns the application sessions to where they were executing when they were suspended. For instance, a restarted MediaPod session will continue playing the file from where it was, while a WebPod session will show the web browser's content, even if the content on the web has changed in the meantime. Similarly, restarting a *Pod device requires restarting all applications associated with it, including the desktop environment, as opposed to starting a plain application that uses the desktop environment that is already running.

Table 3.3 shows the amount of storage needed to store the checkpointed sessions using *Pod for each of the *Pod devices and sessions described. The results reported show checkpointed image sizes without applying any compression techniques to reduce the image size.

          WebPod          DeskPod   MediaPod                 GamePod
          1 Web   10 Web            Totem   Ogle    mpg123   Solitaire   Tetris   Quake
    Size  25 MB   46 MB   50 MB     44 MB   27 MB   17 MB    44 MB       22 MB    50 MB

    Table 3.3 – *Pod Checkpoint Sizes
These results show that the checkpointed state that needs to be saved is very modest and easy to store on any portable storage device. Given the modest size of the checkpointed images, there is no need for any additional compression, which would reduce the minimal storage demands, but add additional latency due to the need to compress and decompress the checkpointed images. The checkpointed image size in all cases was 50 MB or less. 3.4 Related Work *Pod builds upon our previous work on MobiDesk [26], which provides a hosted desktop infrastructure that improves management by enabling the desktop sessions to be migrated between the back-end infrastructure machines. *Pod differs from MobiDesk in two fundamental ways. First, it builds upon MobiDesk by coupling its compute session migration with portable storage to improve users’ mobility. Second, MobiDesk is limited to a single administrative domain. Unlike *Pod devices, which can be moved Chapter 3. *Pod: Improving User Mobility 37 between machines managed by different users and organizations, MobiDesk sessions can only exist within a single organization and therefore do not require the secure operating system virtualization abstraction. Given the ubiquity of web browsers on modern computers, many traditional applications are becoming web-enabled, allowing the mobile user to use them from any computer. Common applications such as email [2, 5], instant messaging [20], and even word processing and spreadsheet applications [3] have been ported to a web services environment that is usable within a simple web browser. The advantage of this approach is that users effectively store their data on centrally managed servers accessible from any networked computer. But even the web user relies on various applications, such as Adobe Acrobat Reader, to be available on whatever computer they are using at the moment. If the application is already installed on the host, the web browser can use it, but otherwise, the user is unable to complete the task at hand. Some web-based applications have been created to fill these gaps, such as one that converts PDF files to simple image files viewable from any web browser. These approaches, however, are application-specific and often quite limited. For instance, converting PDF files to simple image files cuts out useful features of the native application, such as the ability to search the PDF. Similarly, items like cookies and bookmarks allow a user to work more efficiently, but do not travel with the user as they move between web browsers on different machines. Another solution that solves many of the above problems is the use of thin-client computing [14, 27, 51]. The thin-client approach provides several significant advantages over traditional desktop computing. Clients can be essentially stateless appliances that do not need to be backed up or restored, require almost no maintenance or upgrades, and do not store any sensitive data that can be lost or stolen. Server resources can be physically secured in protected data centers and centrally adminis- Chapter 3. *Pod: Improving User Mobility 38 tered, with all the attendant benefits of easier maintenance and cheaper upgrades. Computing resources can be consolidated and shared across many users, resulting in more effective utilization of computing hardware. 
Moreover, the ability of thin clients to decouple display from application execution over a network offers a myriad of other benefits, including graphical remote access to a persistent session from anywhere, screen sharing for remote collaboration, and instant technical support. A number of solutions resembling the thin-client approach have sprung up in the past. The model has come and gone many times, however, whether in mainframe dumb terminals, X terminals, or network computers, without being able to displace the desktop computer. No matter how fast the network connection is, the connection between the computer and the local video device will be significantly faster. For example, one would need gigabit ethernet to transfer a decoded DVD across the network, while even 10-year-old PCs have enough video bandwidth to do this. Although gigabit ethernet is becoming more common today, we are transitioning to high-definition video streams which require many times more bandwidth. Similarly, many applications, especially 3D-oriented ones, need to transfer large amounts of data quickly, and have been shown to use as much bandwidth as possible, as they are what is pushing the state of the art in hardware graphic devices. The emergence of cheap, portable storage devices has led to the development of web browsers for USB drives, including Stealth Surfer [10] and Portable Firefox [8]. These approaches only provide the ability to run a web browser on a USB drive. Unlike *Pod, they do not provide a complete application environment. The various programs and plugins that make the user’s experience more comfortable do not work within this environment. The U3 platform [12] has attempted to provide a standard way to enable applications to store data and launch applications. But it has not gained any traction in the marketplace and, unlike *Pod, does not address mobile users’ need Chapter 3. *Pod: Improving User Mobility 39 for persistent application sessions that can be easily moved between locations. Systems like SoulPad [37] and the Collective [41] provide a solution similar to *Pod, but are based on using a bootable Linux distribution like Knoppix and VMware [142] on a USB drive. For these systems, Knoppix provides a Linux operating system that can boot from a USB drive for certain hardware platforms. VMware provides a virtual machine monitor (VMM) that enables an entire operating system environment and its applications to be suspended and resumed from disk. They are designed to take over the host computer they are plugged into by booting their own operating system. They then launch a VMware VM that runs the migratable operating system environment. Unlike *Pod, they do not rely on any software installed on the host. However, they require minutes to start up given the need to boot and configure an entire operating system for the specific host being used. *Pod does not need to provide an entire operating system instance for the virtual machine to run, and so is much more lightweight. *Pod requires less storage, so it can operate on smaller USB drives, and does not require rebooting the host into another operating system, so it starts up much faster. However, unlike these systems, *Pod is limited to the same operating system interface as the host machine, and requires a secure operating system virtualization layer to be written for every operating system it is to be used with. 
Moka5 [94] attempts to optimize the management and distribution of these portable hardware virtual machine-based devices by storing the virtual machine on the network and only requiring the user to carry a small cache storage device. This cache storage device provides a base host operating system and the ability to page in the necessary parts of the virtual machine on demand. Whereas today’s storage devices can easily hold the entire virtual machine, this cache architecture improves management. It allows the virtual machines to be upgraded on the server by a central administrator, Chapter 3. *Pod: Improving User Mobility 40 with updates pulled into the cache when the machine is rebooted. In general, providing virtualization and checkpoint/restart capabilities using a VMM such as VMware represents an interesting alternative to the *Pod operating system virtualization approach. VMMs virtualize the underlying machine hardware while *Pod virtualizes the operating system. VMMs can checkpoint and restart an entire operating system environment. However, unlike *Pod, VMMs cannot checkpoint and restart applications without also checkpointing and restarting the operating system. *Pod virtualization operates at a finer granularity than virtual machine approaches by virtualizing individual sessions instead of complete operating system environments. Using VMMs can be more space- and time-intensive because the operating system must be included on the portable storage device. Chapter 4 AutoPod: Reducing Downtime for System Maintenance A key problem many organizations face is keeping their computer services available while the underlying machines are maintained. These services run on increasingly networked computers, which are frequent targets of attacks that attempt to exploit vulnerable software they could be running. To prevent these attacks from succeeding, software vendors frequently release patches to address security and maintenance issues. But for these patches to be effective, they must be applied to the machines. System maintenance, however, commonly results in a system service being unavailable. For example, patching an operating system may mean that the whole system is down for a length of time. If system administrators fix an operating system security problem immediately, they risks upsetting their users because of loss of data. If the underlying hardware has to be replaced, the machine will have to be shut down. The system administrators must schedule downtime in advance and in cooperation with users, leaving the computer vulnerable until repaired. If the operating system is patched successfully, downtime may be limited to just a few minutes during Chapter 4. AutoPod: Reducing Downtime for System Maintenance 42 the reboot. Even then, users incur additional inconvenience and delays in starting applications again and attempting to restore their sessions. If the patch is not successful, downtime can extend for many hours while the problem is diagnosed and solved. Downtime due to security and maintenance problems is costly as well as inconvenient. Therefore, it is not uncommon for systems to continue running unpatched software long after a security exploit is well-known [123]. To address these problems, we have designed and built AutoPod, a system that provides an easy-to-use autonomic infrastructure [77] for operating system self-maintenance. 
AutoPod is unique because it enables unscheduled operating system updates of commodity operating systems while preserving application service availability during system maintenance. AutoPod functions without modifying, recompiling, or relinking applications or operating system kernels. We have done this by combining three key mechanisms: a lightweight operating system virtualization isolation abstraction that can be used at the level of individual applications, a checkpoint/restart mechanism that operates across operating system versions with different security and maintenance patches, and an autonomic system status service that monitors the system for faults and security updates.

AutoPod combines *Pod's secure pod abstraction with a novel checkpoint/restart mechanism that uniquely decouples processes from the underlying system and maintains process state semantics, allowing processes to migrate across different machines with different operating system versions. The checkpoint/restart mechanism introduces a platform-independent intermediate format for saving the states associated with processes and AutoPod virtualization. AutoPod combines this format with higher-level functions for saving and restoring process states to yield a degree of portability impossible with previous approaches. This checkpoint/restart mechanism relies on the same kind of operating system semantics that allow applications to function correctly across operating system versions with different security and maintenance patches.

AutoPod combines these mechanisms with an autonomous system status service. The service monitors the system for faults and security updates. When the service detects new security updates, it downloads and installs them automatically. If the update requires a reboot, the service uses AutoPod's checkpoint/restart capability to save the AutoPod's state, reboot the machine into the newly repaired environment, and restart the processes within the AutoPod without data loss. This permits fast recovery from downtime even when other machines are not available to run application services. Alternatively, if another machine is available, the AutoPod can be migrated to the new machine while the original machine is maintained and rebooted, further reducing application service downtime. This allows security patches to be applied to operating systems in a timely manner with minimal impact on application service availability. Once the original machine is updated, applications can continue to execute even though the underlying operating system has changed. Similarly, if the service detects an imminent system fault, AutoPod can checkpoint the processes, migrate, and restart them on a new machine before the fault causes their execution to fail.

4.1 AutoPod Architecture

The AutoPod architecture is based on *Pod's secure pod abstraction. As shown in Figure 4.1, AutoPod permits server consolidation by allowing multiple pods to run on a single machine while enabling automatic machine status monitoring.

    [Figure 4.1 – AutoPod Model. Multiple pods (Pod A, Pod B) and the AutoPod system monitor run on top of the AutoPod virtualization layer, which sits above the host operating system and host hardware.]

As each pod provides a complete secure virtual machine abstraction, it is able to run any server application that would run on a regular machine. By consolidating multiple machines
into distinct pods running on a single server, the administrator has fewer physical hardware and operating system instances to manage. Similarly, when kernel security holes are discovered, server consolidation minimizes the number of machines to be upgraded and rebooted. The AutoPod system monitor further improves manageability by constantly monitoring the host system for stability and security problems.

By leveraging the secure pod abstraction, AutoPod is able to securely isolate multiple independent services running on a single machine. Operating system virtualization restricts what operating system resources are accessible to processes within it simply by not providing identifiers to certain resources within its namespace. An AutoPod can then be constructed to provide access only to resources needed for its service. An administrator configures the AutoPod in the same way a regular machine is configured and installs applications within it. The secure pod abstraction enforces secure isolation to prevent exploited services from attacking the host or other services on it. Similarly, secure isolation allows running multiple services from different organizations, with different sets of users and administrators on a single host, while retaining the semantics of multiple distinct and individually managed machines.

Multiple services that previously ran on multiple machines can now run on a single machine. For example, a web server pod is easily configured to contain only the files the web server needs to run and the content it is to serve. The web server pod could have its own IP address, decoupling its network presence from that of the underlying system. Using a firewall, the pod's network access is limited to client-initiated connections. Connections to the pod's IP address are limited to the ports served by the application running within this pod. If multiple isolated web servers are required, multiple pods can be set up on a single machine. If one web server application is compromised, its pod limits further harm to the system, because the only resources the compromised pod can access are those explicitly needed by its service. Because this web server pod does not need to initiate connections to other hosts, it is easy to firewall it to prevent it from directly initiating connections to other systems. This limits an attacker's ability to use the exploited service as a launching point for other attacks. Furthermore, there is no need to disable other network services commonly enabled by the operating system to guard against the compromised pod because those services, and the operating system itself, reside outside the pod's context.

4.2 Migration Across Different Kernels

AutoPod complements the secure pod virtualization abstraction with a cross-kernel checkpoint/restart system that improves the mobility of services within a data center. Checkpoint/restart provides the glue that permits a pod to checkpoint services, migrate the state to a new machine, and restart them across other computers with different hardware and operating system kernels. AutoPod's migration is limited to machines with a common CPU architecture, and that run “compatible” operating systems.
Compatibility is determined by the extent to which they differ in their API and internal semantics. Minor versions are normally limited to maintenance and security patches, without affecting the kernel’s API. Major versions carry significant changes, like modifying the application’s execution semantics or introducing new functionality, that may break application compatibility. Nevertheless, they are usually backward compatible. For instance, the Linux kernel has two major versions, 2.4 and 2.6, with over 30 minor versions each. Linux 2.6 significantly differs from 2.4 in how threads behave, and also introduces various new system calls. This implies that migration across minor versions is generally not restricted, but migration between major versions is only feasible from older to newer. To support migration across different kernels, AutoPod’s checkpoint/restart mechanism employs three key design principles: storing operating system state in adequate abstract representation, converting between the abstract representation and operating system-specific state using specialized filters, and using well-established native kernel interfaces to access and alter the state. AutoPod’s checkpoint/restart mechanism relies on an intermediate abstract format to represent the state to be saved. While the low-level details maintained by the operating system may change radically between different kernels, the high-level properties are unlikely to change since they reflect the actual semantics upon which the application relies. AutoPod describes the state of a process in terms of this higher-level semantic information rather than kernel-specific data. To illustrate this, consider the data that describe inter-process relationships, e.g., parent, child, siblings Chapter 4. AutoPod: Reducing Downtime for System Maintenance 47 and threads. The operating system normally optimizes for speed by keeping multiple data structures to reflect these relationships. But this format has limited portability across different kernels; in Linux, the exact technique did indeed change between 2.4 and 2.6. Instead, AutoPod uses a tree structure to capture a high-level representation of the relationships, mirroring its semantics. The same holds for other resources, e.g., communication sockets, pipes, open files and system timers. AutoPod extracts the relevant state the way it is encapsulated in the operating system’s API, rather than in the details of its implementation. Doing so maximizes portability across kernel versions by adopting properties that are considered highly stable. To accommodate differences in semantics that inevitably occur between kernel versions, AutoPod uses specialized conversion filters. The checkpointed state data is saved and restored as a stream. The conversion filters manipulate the contents of this stream. Although typically they are designed to translate between different representations, they can be used to perform other operations such as compression and encryption. Their main advantages are extreme flexibility and being executed like regular helper applications. Building on the example above, because the thread model changes between Linux 2.4 and 2.6, a filter can easily be designed to make the former abstract data adhere to the new semantics. Additional filters can be built if semantic changes occur in the future. This is a very robust and powerful solution. 
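As a rough illustration of the filter idea, the sketch below treats the checkpoint image as a stream of tagged, length-prefixed records and copies it from input to output, with a hook where a record whose semantics changed between kernel versions (such as the process-tree/thread record) would be rewritten. The record layout, tag values and function names are hypothetical, not AutoPod's actual on-disk format, but they show why a filter can be run like an ordinary helper program in a pipeline.

    /*
     * Skeleton of a checkpoint conversion filter: read tagged, length-prefixed
     * records from stdin, rewrite the ones whose semantics changed between
     * kernel versions, and write the stream to stdout. The format and tags
     * are hypothetical, not AutoPod's actual checkpoint representation.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct rec_hdr {
        uint32_t tag;                /* which kind of state the record holds */
        uint32_t len;                /* payload length in bytes              */
    };

    #define TAG_PROC_TREE  1         /* parent/child/thread relationships    */
    #define TAG_OPEN_FILES 2
    #define TAG_SOCKETS    3

    /* Hypothetical rewrite of a 2.4-style thread record into 2.6 semantics. */
    static void convert_proc_tree(unsigned char *buf, uint32_t len)
    {
        (void)buf;
        (void)len;                   /* a real filter would edit fields here */
    }

    int main(void)
    {
        struct rec_hdr h;

        while (fread(&h, sizeof(h), 1, stdin) == 1) {
            unsigned char *buf = malloc(h.len ? h.len : 1);
            if (!buf || fread(buf, 1, h.len, stdin) != h.len) {
                fprintf(stderr, "truncated checkpoint stream\n");
                return 1;
            }
            if (h.tag == TAG_PROC_TREE)
                convert_proc_tree(buf, h.len);
            fwrite(&h, sizeof(h), 1, stdout);
            fwrite(buf, 1, h.len, stdout);
            free(buf);
        }
        return 0;
    }

Because such a filter only transforms one stream into another, it can be composed with other helpers, such as compression or encryption, in the same checkpoint or restart pipeline.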
AutoPod leverages high-level native kernel services in order to transform the intermediate representation of the checkpointed image into the complete internal state required by the target kernel during restart. Continuing with the previous example, AutoPod restores the structure of the process tree by exploiting the native fork system call. According to the abstract process tree data, a sequence of fork calls is issued to replicate the original relationships. This avoids dealing with any internal kernel details. Moreover, high-level primitives of this sort remain virtually unchanged Chapter 4. AutoPod: Reducing Downtime for System Maintenance 48 across minor or major kernel changes. Finally, these services are available for use by loadable kernel modules, enabling AutoPod to perform cross-kernel migration without requiring modifications to the kernel. To eliminate possible dependencies on low-level kernel details, AutoPod’s checkpoint/restart mechanism requires processes to be suspended before being checkpointed. Suspending processes creates a quiescent state necessary to guarantee the correctness of the checkpointed image, and substantially reduces the amount of information that needs to be saved by avoiding transient data. For example, consider a checkpoint started while one of the processes is executing the exit system call. It would take tremendous effort and detail to ensure a proper and consistent capture of such a transient state. Instead, by first suspending all processes, such ongoing activities are either completed or interrupted. AutoPod uses this property to guarantee a consistent and static state during the checkpoint. Finally, we must ensure that changes in system call interfaces are properly handled. AutoPod has a virtualization layer that employs system call interposition to maintain namespace consistency. It follows that a change in the semantics for any system call intercepted could raise an issue in migrating across such differences. Fortunately, such changes are rare, and when they occur, they are hidden by standard libraries from the application level lest they break the applications. Consequently, AutoPod is protected just as legacy applications are protected. On the other hand, the addition of new system calls to the kernel requires that the encapsulation be extended to support them. Moreover, it restricts migration back to older versions. For instance, an application that invokes the new waitid system call in Linux 2.6 cannot be migrated back to 2.4 unless an emulation layer exists there. AutoPod uses two techniques to save and restore device-specific states, depending on the device class. Some devices provide standard interfaces for applications to read Chapter 4. AutoPod: Reducing Downtime for System Maintenance 49 and set their state. Common sound cards and the Intel MMX processor extensions are two notable examples. With these, it is possible to easily inquire the device with regard to state prior to checkpoint, and reestablish it during restart. However, many device drivers maintain internal state that is practically inaccessible from the outside. AutoPod ensures that processes within its session only have access to such devices through the virtual device drivers provided by the AutoPod device. This makes it simple to checkpoint the device-specific data associated with the processes. 
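The use of the native fork interface to rebuild the process tree, described earlier in this section, can be sketched as follows. The tree encoding and function names are invented for the example, and the children here only print a message and recurse, where the real restart logic would go on to restore each process's saved state.

    /*
     * Minimal sketch of recreating a saved process tree with fork(): walk the
     * abstract tree and issue a fork for each child, so the parent/child
     * relationships are rebuilt through the kernel's own interface. The tree
     * encoding here is hypothetical.
     */
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    struct saved_proc {
        int vpid;                    /* checkpointed (virtual) PID          */
        int parent_vpid;             /* -1 marks the root of the pod's tree */
    };

    /* Example tree: root 1 has children 2 and 3; process 3 has child 4. */
    static const struct saved_proc tree[] = {
        { 1, -1 }, { 2, 1 }, { 3, 1 }, { 4, 3 },
    };
    #define NPROCS (sizeof(tree) / sizeof(tree[0]))

    static void rebuild(int parent_vpid)
    {
        for (size_t i = 0; i < NPROCS; i++) {
            if (tree[i].parent_vpid != parent_vpid)
                continue;
            pid_t pid = fork();
            if (pid < 0) {
                perror("fork");
                return;
            }
            if (pid == 0) {
                /* Child: a real restart would now restore registers, memory
                 * and file state for this process; the sketch just recurses. */
                printf("recreated vpid %d as host pid %d\n",
                       tree[i].vpid, (int)getpid());
                rebuild(tree[i].vpid);
                _exit(0);
            }
            waitpid(pid, NULL, 0);   /* keep the demo's output ordered */
        }
    }

    int main(void)
    {
        rebuild(-1);                 /* fork the root, then its descendants */
        return 0;
    }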
For instance, the AutoPod display system is built using its own virtual display device driver which is not tied to any specific hardware device, and keeps its entire state in regular memory. As a result, its state can be readily checkpointed as a simple matter of saving that process similarly to others. After the process is restarted, the AutoPod viewer on the host reconnects to the virtual display driver to display the complete session. 4.3 Autonomic System Status Service AutoPod provides a generic autonomic framework for managing system state. The framework can monitor multiple sources for information and use this information to make autonomic decisions about when to checkpoint pods, migrate them to other machines, and restart them. Although there are many items that can be monitored, our service monitors two in particular. First, it monitors the vendor’s software security update repository to ensure that the system stays up to date with the latest security patches. Second, it monitors the underlying hardware of the system to ensure that an imminent fault is detected before the fault occurs and corrupts application state. By monitoring these two sets of information, the autonomic system status service is able to reboot or shut down the computer while checkpointing or migrating the processes. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 50 This helps to ensure that data is not lost or corrupted because of a forced reboot or hardware fault propagating into the running processes. Many operating system vendors enable users to automatically check for and install system updates. Example of these include Microsoft’s Windows Update service and Debian’s security repositories. These updates are guaranteed genuine through cryptographic signed hashes that verify that the contents come from the vendors. But some of these updates require reboots. In the case of Debian GNU/Linux, this is limited to kernel upgrades. We provide a simple service that monitors security update repositories. The autonomic service downloads all security updates and uses AutoPod’s checkpoint/restart mechanism to enable the updates that need reboots without disrupting running applications and causing them to lose state. Commodity systems also provide information about the current state of the system that can indicate if the system has an imminent failure on its hands. Subsystems, such as a hard disk’s Self-Monitoring Analysis Reporting Technology (S.M.A.R.T.) [46] let an autonomic service monitor the system’s hardware state. S.M.A.R.T. provides diagnostic information, such as temperature and read/write error rates, on the hard drives in the system that can indicate if the hard disk is nearing failure. Many commodity computer motherboards also have the ability to measure CPU and case temperature, as well as the speeds of the fans that regulate those temperatures. If temperature in the machine rises too high, hardware in the machine can fail catastrophically. Similarly, if the fans fail and stop spinning, the temperature will likely rise out of control. Our autonomic service monitors these sensors. If it detects an imminent failure, it will attempt to migrate a pod to a cooler system, and shut down the machine to prevent the hardware from being destroyed. Many administrators use an interruptible power supply to avoid data loss or corruption during a power loss. Although one can shut down a computer when the Chapter 4. 
AutoPod: Reducing Downtime for System Maintenance 51 battery backup runs low, most applications are not written to save their data in the presence of a forced shutdown. AutoPod, on the other hand, monitors UPS status. If the battery backup becomes low, it can quickly checkpoint the pod’s state to avoid any data loss when the computer is forced to shut down. Similarly, the operating system kernel on the machine monitors the state of the system, and if irregularities occur, such as DMA timeouts or resetting the IDE bus, it logs them. Our autonomic service monitors the kernel logs to discover these irregular conditions. When the hardware monitoring systems or the kernel logs provide information about possible pending system failures, the autonomic service checkpoints the pods running on the system and migrates them to a new system to be restarted. This ensures that state is not lost and informs administrators that maintenance is needed. Many policies can be implemented to determine to which system a pod should be migrated when a machine needs maintenance. Our autonomic service allows a pod to be migrated within a specified set of clustered machines. The autonomic service gets reports at regular intervals from the other machines’ autonomic services that report each machine’s load. If the autonomic service decides that it must migrate a pod, it chooses the machine in its cluster with the lightest load. 4.4 AutoPod Examples We give two brief examples to illustrate how AutoPod can be used to improve application availability for system services such as email delivery and desktop computing. In both cases we describe the architecture of the system and show how it can be run within AutoPod, enabling administrators to reduce downtime in the face of machine maintenance. We also discuss how a system administrator can set up and use AutoPod. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 4.4.1 52 System Services Administrators like to run many services on a single machine. By doing this, they are able to benefit from improved machine utilization, but this gives each service access to many resources not needed to perform their job. A classic example of this is email delivery. Email delivery services such as Exim and Sendmail are often run on the same system as other Internet services to improve resource utilization and simplify system administration through server consolidation. But these services, Sendmail in particular, have been exploited many times because they have access to system resources, such as a shell program, that they do not need to perform their job. For email delivery, AutoPod can isolate email delivery to provide a significantly higher level of security in light of the many attacks on mail transfer agents. Consider isolating an Exim service installation, the default Debian mail transfer agent. Using AutoPod, Exim can execute in a resource-restricted pod that isolates email delivery from other services on the system. Since AutoPod allows migrating a service between machines, the email delivery pod is migratable. If a fault is discovered in the underlying host machine, the email delivery service can be moved to another system while the original host is patched, keeping the email service available. With this email delivery example, a simple system configuration can prevent the common buffer overflow exploit of getting the privileged server to execute a local shell. 
By simply removing shells from within the Exim pod, we are limiting the amateur attacker’s ability to exploit flaws, while requiring very little additional knowledge about how to configure the service. AutoPod can further automatically monitor system status and checkpoint the Exim pod if a fault is detected to ensure that no data is lost or corrupted. Similarly, in the event that a machine has to be rebooted, the service can automatically be migrated to a new machine to avoid downtime. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 53 A common problem system administrators face is that forced machine downtime, e.g., for reboots, can make a service unavailable. A usual way to avoid this is to throw multiple machines at the problem. By providing the service through a cluster of machines, system administrators can upgrade the individual machines in a rolling manner. This enables system administrators to upgrade the systems while keeping the service available. But more machines increase management complexity and cost. AutoPod, in conjunction with hardware virtual machine monitors, improves this situation immensely. Using a virtual machine monitor to provide two virtual machines on a single host, AutoPod can then run a pod within a virtual machine to enable a single node maintenance scenario that decreases costs as well as management complexity. During regular operation, all application services run within the pod on one virtual machine. To upgrade the operating system on the running virtual machine, bring the second virtual machine online and migrate the pod to the new virtual machine. Once the initial virtual machine is upgraded and rebooted, migrate the pod back to it. Only one physical machine is needed, reducing costs. Only one virtual machine is in use for the majority of the time, reducing management complexity. Because AutoPod runs unmodified applications, any application service that can be installed can take advantage of its general single node maintenance. 4.4.2 Desktop Computing As personal computers have become ubiquitous in large corporate, government, and academic organizations, the cost of owning and maintaining them is growing unmanageable. These computers are increasingly networked, which only complicates matters. They must be constantly patched and upgraded to protect them and their data from the myriad of viruses and other attacks commonplace on today’s networks. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 54 To solve this problem, many organizations have turned to thin-client solutions such as Microsoft’s Windows Terminal Services and Sun’s Sun Ray. Thin clients allow administrators to centralize many of their administrative duties because only a single computer or cluster of computers needs to be maintained in a central location, while stateless client devices are used to access users’ desktop computing environments. Although thin-client solutions lower some administrative costs, this comes at the loss of semantics that users normally expect from a private desktop. For instance, users who use their own private desktop expect to be isolated from their coworkers. However, in a shared thin-client environment, users share the same machine. There may be many shared files, and a user’s computing behavior can impact the performance of other users on the system. 
Although a thin-client environment minimizes the number of machines, the centralized servers still need to be administered, and since they are more heavily utilized, management becomes more difficult. For instance, on a private system, one only has to schedule system maintenance with a single user. However, in a thin-client environment, one has to schedule maintenance with all the users on the system to avoid data loss. AutoPod enables system administrators to solve these problems by allowing each user to run a desktop session within a pod. Instead of users sharing a single file system, AutoPod provides each pod with three file systems: a shared read-only file system of all the regular system files users expect in their desktop environments, a private writeable file system for a user’s persistent data, and a private writeable file system for a user’s temporary data. By sharing common system files, AutoPod provides centralization benefits that simplify system administration. By providing private writeable file systems for each pod, AutoPod provides each user with privacy benefits similar to a private machine. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 55 Coupling pod virtualization and isolation mechanisms with a migration mechanism can provide scalable computing resources for the desktop and improve desktop availability. If a user needs access to more computing resources, for instance while doing complex mathematical computations, AutoPod can migrate that user’s session to a more powerful machine. If maintenance needs to be done on a host machine, AutoPod can migrate the desktop sessions to other machines without scheduling downtime and without forcibly terminating any programs users are running. 4.4.3 Setting Up and Using AutoPod To demonstrate how simple it is to set up a pod to run within the AutoPod environment, we provide a step-by-step walkthrough on how one would create a new pod that can run the Exim mail transfer agent. Setting up AutoPod to provide the Exim pod on Linux is straightforward and leverages the same skill set and experience system administrators already have on standard Linux systems. AutoPod is started by loading its kernel module into a Linux system and using its user-level utilities to set up and insert processes into a pod. Creating a pod’s file system is the same as creating a chroot environment. Administrators with experience creating a minimal environment containing only the application they want to isolate do not need to do any extra work. However, many administrators do not have experience creating such an environment and therefore need an easy way to create an environment in which to run their application. These administrators can take advantage of Debian’s debootstrap utility that allows a user to quickly set up an environment equivalent to a base Debian installation. An administrator would do a debootstrap stable /autopod to install the most recently released Debian system into the /autopod directory. While this also includes many Chapter 4. AutoPod: Reducing Downtime for System Maintenance 56 packages that are not required by the installation, it provides a small base to work from. To configure Exim, an administrator edits the appropriate configuration files within the /autopod/etc/exim4/ directory. To run Exim in a pod, an administrator does mount -o bind /autopod /autopod/exim/root to loopback-mount the pod directory onto the staging area directory, where the pod expects it to be. 
autopod add exim is used to create a new pod named exim which uses /autopod/exim/root as the root for its file system. Finally, autopod addproc exim /usr/sbin/exim4 is used to start Exim within the pod by executing the /usr/sbin/exim4 program, which is located at /autopod/exim/root/usr/sbin/exim4. AutoPod isolates the processes running within a pod from the rest of the system, which helps contain intrusions if they occur. But since a pod does not have to be maintained by itself, but can be maintained in the context of a larger system, one can also prune down the environment and remove many programs that an attacker could use against the system. For instance, if an Exim pod does not need to run shell scripts, there is no reason to leave programs such as /bin/bash, /bin/sh, and /bin/dash within the environment. But these programs will be necessary in the future if the administrator wants to upgrade the package using normal Debian methods. Because it is easy to recreate the environment, one approach would be to remove all the programs that are not wanted within the environment and recreate the environment when an upgrade is needed. Another would be to move those programs outside the pod, perhaps by creating a /autopod-backup directory. To upgrade the pod using normal Debian methods, the programs can be moved back into the pod’s file system. If an administrator wants to manually reboot the system without killing the processes within the Exim pod, they can first checkpoint the pod to disk by running autopod checkpoint exim -o /exim.ck, which tells AutoPod to checkpoint the Chapter 4. AutoPod: Reducing Downtime for System Maintenance 57 processes associated with the Exim pod to the file /exim.ck. The system can then be rebooted, potentially with an updated kernel. Once it comes back up, the pod can be restarted from the /exim.ck file by running autopod restart exim -i /exim.ck. These mechanisms are the same as those used by the AutoPod system status service for controlling the checkpointing and migration of pods. Standard Debian facilities can be used for running other services within a pod. Once the base environment is set up, an administrator can chroot into this environment to continue setup. By editing the /etc/apt/sources.list file appropriately and running apt-get update, an administrator will be able to install any Debian package into the pod. In the Exim example, Exim does not need to be installed since it is the default mail transfer agent (MTA) and is already included in the base Debian installation. If one wanted to install another MTA, such as Sendmail, one could run apt-get install sendmail, which will download Sendmail and all the packages needed to run it. This will work for any service available within Debian. An administrator can also use the dpkg --purge option to remove packages that are not required by a given pod. For instance, in running an Apache web server in a pod, one can remove the default Exim mail transfer agent because Apache does not need it. 4.5 Experimental Results We implemented AutoPod as a loadable kernel module in Linux, which requires no changes to the kernel, as well as a user space system status monitoring service. We present some experimental results using our Linux prototype to quantify the overhead of using AutoPod on various applications. Experiments were conducted on three IBM Netfinity 4500R machines, each with a 933Mhz Intel Pentium-III CPU, 512MB RAM, Chapter 4. 
9.1 GB SCSI HD, and 100 Mbps Ethernet connected to a 3Com Superstack II 3900 switch. One of the machines was used as an NFS server from which directories were mounted to construct the virtual file system for the pod on the other client systems. One client ran Debian Stable with a Linux 2.4.5 kernel, and the other ran Debian Unstable with a Linux 2.4.18 kernel.
To measure the cost of AutoPod migration and demonstrate the ability of AutoPod to migrate real applications, we migrated three application scenarios: an email delivery service using Exim and Procmail, a web content delivery service using Apache and MySQL, and a KDE desktop computing environment. Table 4.1 describes the configurations of the application scenarios we migrated and shows the time it takes to start up on a regular Linux system.

Name     Applications                                                    Normal Startup
Email    Exim 3.36                                                       504 ms
Web      Apache 1.3.26 and MySQL 4.0.14                                  2.1 s
Desktop  Xvnc – VNC 3.3.3r2 X Server                                     19 s
         KDE – Entire KDE 2.2.2 environment, including window manager,
               panel and assorted background daemons and utilities
         SSH – openssh 3.4p1 client inside a KDE Konsole terminal
               connected to a remote host
         Shell – The Bash 2.05a shell running in a Konsole terminal
         KGhostView – A PDF viewer with a 450KB 16-page PDF file loaded
         Konqueror – A modern standards-compliant web browser that is
               part of KDE
         KOffice – The KDE word processor and spreadsheet programs

Table 4.1 – Application Scenarios

To demonstrate our AutoPod prototype's ability to migrate across Linux kernels with different minor versions, we checkpointed each application workload on the 2.4.5 kernel client machine and restarted it on the 2.4.18 kernel machine. For these experiments, the workloads were checkpointed to and restarted from a local disk.
Table 4.2 shows the time to checkpoint and restart each application workload. Migration time also has to take into account network transfer time. As this is dependent on the transport medium, we include the uncompressed and compressed checkpoint image sizes.

Case     Checkpoint   Restart   Size     Compressed
Email    11 ms        14 ms     284 KB   84 KB
Web      308 ms       47 ms     5.3 MB   332 KB
Desktop  851 ms       942 ms    35 MB    8.8 MB

Table 4.2 – AutoPod Migration Costs

In all cases, checkpoint and restart times were significantly faster than the regular startup times listed in Table 4.1, taking less than a second for both operations, even when performed on separate machines or across a reboot. Moreover, a number of techniques have since been pioneered to further minimize downtime, including pre-copy/incremental checkpointing [43,81,141] and intelligent quiescing [81]. Pre-copy/incremental checkpointing minimizes the amount of time the services will be unavailable by taking partial checkpoints during the service's execution and only saving what has changed since the last checkpoint was taken. Intelligent quiescing minimizes the time checkpointing takes by keeping the services available until the entire service is ready to checkpoint.
We also show that the actual checkpoint images saved were modestly sized for complex workloads. For example, the Desktop pod had over 30 different processes running, including the KDE desktop applications, substantial underlying window system infrastructure, inter-application sharing, and a rich desktop interface managed by a window manager.
Even with all these applications running, they checkpoint to a very reasonable 35 MB uncompressed for a full desktop environment. Additionally, if checkpoint images must be transferred over a slow link, Table 4.2 shows that they can be compressed very well with bzip2. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 4.6 60 Related Work Virtual machine monitors (VMMs) have been used to provide secure isolation [28, 142, 147] and to migrate an entire operating system environment [128]. Unlike AutoPod, standard VMMs decouple processes from the underlying machine hardware, but tie them to an instance of an operating system. As a result, VMMs cannot migrate processes apart from that operating system instance and cannot continue running those processes if the operating system instance goes down, such as during security upgrades. In contrast, AutoPod decouples process execution from the underlying operating system, allowing it to migrate processes to another system when an operating system instance is upgraded. VMMs have been proposed to support online maintenance of systems [87] by having a microvisor that supports at most two virtual machines running on the machine at the same time, effectively giving each physical machine the ability to act as its own hot spare. This proposal, however, explicitly depends on AutoPod’s heterogeneous migration without providing this functionality itself. Many systems have been proposed to support process migration [22, 24, 40, 42, 54, 85, 95, 106, 119, 120, 125, 129], but they do not allow migration across independent machines running different operating system versions. TUI [131] provides support for process migration across machines running different operating systems and hardware architectures. Unlike AutoPod, TUI has to compile applications on each platform using a special compiler and does not work with unmodified legacy applications. AutoPod builds on Zap [100] to support transparent migration across systems running the same kernel version. AutoPod goes beyond Zap in providing transparent migration across minor kernel versions, which is essential for making applications available during operating system security upgrades. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 61 Replication in clustered systems can provide the ability to do rolling upgrades. By leveraging many nodes, individual nodes can be taken down for maintenance without significantly impacting the load that the cluster can handle. For example, web content is commonly delivered by multiple web servers behind a front end manager. This front end manager enables an administrator to bring down back end web servers for maintenance by directing requests only to the active web servers. This simple solution is effective because it is easy to replicate web servers to serve the same content. Although this model works fine for web server loads, as the individual jobs are very short, it does not work for long-running jobs, such as a user’s desktop. In the web server case, replication and upgrades are easy to do because only one web server is used to serve any individual request and any web server can be used to serve any request. For long-running stateful applications, such as a user’s desktop, requests cannot be arbitrarily redirected to any desktop computing environment because each user’s desktop session is unique. 
Although specialized hardware support could be used to keep replicas synchronized by having all of them process all operations, this is prohibitively expensive for most workloads and does not address how to resynchronize the replicas in the presence of rolling upgrades. Another possible solution is allowing the kernel to be hot-pluggable. Although micro-kernels are not prevalent, they are able to upgrade their parts on the fly. More commonly, many modern monolithic kernels have kernel modules that can be inserted and removed dynamically. This can allow upgrading parts of a monolithic kernel without requiring reboots. The Nooks [136] system extends this concept by enabling kernel drivers and other kernel functionality, such as file systems, to be isolated in their own domains to help isolate faults in kernel code and provide a more reliable system. However, in all of these cases, there is still a base kernel on the machine that cannot be replaced without a reboot. If that part must be replaced, all data is lost. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 62 The K42 operating system can be dynamically updated [29], enabling software patches to be applied to a running kernel even in the presence of data structure changes. But it requires a completely new operating system design and does not work with any commodity operating system. Even on K42, it is not yet possible to upgrade the kernel while running realistic application workloads. Chapter 5 PeaPod: Isolating Cooperating Processes A key problem faced by today’s computers is that they are difficult to secure due to the numerous complex services they run. If a single service is exploited, an attacker is able to access all the resources available to the machine it is running on. To prevent this from occurring, it is important to design systems with security principles [126] in mind to limit the damage that can occur when security is breached. One of the most important principles is ensuring that one operates in a Least-Privilege environment. Least-Privilege environments require that a user or a program has access only to the resources that are required to complete their job. Even if the user’s or service’s environment is exploited, the attacker will be constrained. For a system with many distinct users and uses, designing a least-privilege system can prove to be very difficult, as many independent application systems can be used in many different and unknown ways. A common approach to providing least-privilege environments is to separate each individual service into its own sandbox container environment, such as provided by Chapter 5. PeaPod: Isolating Cooperating Processes 64 AutoPod. Many sandbox container environments have been developed to isolate untrusted applications [7, 60, 65, 74, 86, 118, 144]. However, many of these approaches have suffered from being too complex and too difficult to configure to use in practice, and have often been limited by an inability to work seamlessly with existing system tools and applications. Virtual machine monitors (VMMs) offer a more attractive approach by providing a much easier-to-use isolation model of virtual machines, which look like separate and independent systems apart from the underlying host system. However, because VMMs need to run an entire operating system instance in each virtual machine, the granularity of isolation is very coarse, enabling malicious code in a virtual machine to use the entire set of operating system resources. 
Multiple operating instances also need to be maintained, adding administrative overhead. A primary problem with a sandbox container that attempts to isolate a single service is that many services are composed of many interdependent and cooperating programs. Each individual application that makes up the service has its own set of access requirements. However, since they all run within the same sandbox container, each individual application ends up with access to the superset of resources that are needed by all the programs that make up the service, thereby negating the least-privilege principle. One cannot divide the programs into distinct sandbox container environments since many programs are interdependent and expect to work from within a single context. We leveraged operating system virtualization to design and build PeaPod to enable the ability to sandbox complete services, while also enabling its interdependent and cooperating components to be restricted into least-privilege environments. PeaPod combines two key virtualization abstractions in its virtualization layer. First, it leverages the secure pod abstraction to provide a sandbox container for entire services to run within. Second, it introduces the pea (Protection and Encapsulation Abstrac- Chapter 5. PeaPod: Isolating Cooperating Processes 65 tion). A pea is an easy-to-use least-privilege mechanism that enables further isolation among application components that need to share limited system resources within a single pod. It can prevent compromised application components from attacking other components within the same pod. A pea provides a simple resource-based model that restricts access to other processes, IPC, file system and network resources available to the pod as a whole. PeaPod improves upon previous approaches by not requiring any operating system modifications, as well as avoiding the time of check, time of use (TOCTOU) race conditions that affect many of them [145]. For instance, unlike other approaches that perform file system security checks at the system call level and therefore do not check the actual file system object that the operating system uses, PeaPod leverages file system virtualization to integrate directly into the kernel’s file system security framework. PeaPod is designed to avoid the time of check, time of use race conditions that affect previous approaches by performing all file system security checks within the regular file system security paths and on the same file system objects that the kernel itself uses. 5.1 PeaPod Model The PeaPod model combines the previously introduced operating system virtualization secure pod abstraction with a new abstraction called peas. The secure pod abstraction, as shown in AutoPod, is useful for separating distinct application services into separate machine environments. Peas are used in a pod to provide finegrained isolation among application components that may need to interact within a single machine environment, such as using interprocess communication mechanisms, including signals, shared memory, IPC messages and semaphores, and process forking Chapter 5. PeaPod: Isolating Cooperating Processes 66 Figure 5.1 – PeaPod Model and execution. Figure 5.1 shows how pods and peas work together. Each pod, and the resources contained with it, is fully independent from each other pods, but each pod can each have an arbitrary number of peas associated with them to apply extra security restrictions amongst their cooperating processes. 
A pea is an abstraction that can contain a group of processes, restrict those processes in interacting with processes outside of the pea, and limit their access to only a subset of system resources. Unlike the secure pod abstraction, which achieves isolation by controlling what resources are located within the namespace, a pea achieves isolation levels by controlling what system resources within a namespace its processes are allowed to access and interact with. For example, a process in a pea can see file system resources and processes available to other peas within a single pod, but can be restricted from accessing them. Unlike processes in separate pods, processes in separate peas in a single pod share the same namespace and can be allowed to inter- Chapter 5. PeaPod: Isolating Cooperating Processes 67 act using traditional interprocess communication mechanisms. Processes can also be allowed to move between peas in the same pod. However, by default, a processes in a pea cannot access any resource that is not made available to that pea, be it a process pid, IPC key or file system entry. Peas can support a wide range of resource restriction policies. By default, processes contained in a pea can only interact with other processes in the same pea. They have no access to other resources, such as file system and network resources or processes outside of the pea. This provides a set of fail safe defaults, as any extra access has to be explicitly allowed by the administrator. The pea abstraction allows for processes running on the same system to have varying levels of isolation by running in separate peas. Many peas can be used side by side to provide flexibility in implementing a least-privilege system for programs that are composed of multiple components that must work together, but do not all need the same level of privilege. One usage scenario would be to have a severely resource limited pea in which a privileged process executes. The process is, howerver, allowed to use traditional Unix semantics to work with less privileged programs that are in less resource restricted peas. For example, peas can be used to allow a web server appliance the ability to serve dynamic content via CGI in a more secure manner. Since the web server and the CGI scripts need separate levels of privilege, and have different resource requirements, they should not have to run within the same security context. By configuring two separate peas for a web service, one for the web server to run within, and a separate one for the specific CGI programs it wants to execute, one limits the damage that can occur if a fault is discovered within the web server. If one manages to execute malicious code within the context of the web server, one can only use resources that are allocated to the web server’s pea, as well as only execute the specific programs that are needed Chapter 5. PeaPod: Isolating Cooperating Processes 68 as CGIs. Since the CGI programs will also only run within their specific security context, the ability for malicious code to do harm is severely limited. Peas and pods together provide secure isolation based on flexible resource restriction for programs as opposed to restricting access based on users. Peas and pods also do not subvert underlying system restrictions based on user permissions, but instead complement such models by offering additional resource control based on the environment in which a program is executed. 
Instead of allowing programs with root privileges to do anything they want to a system, PeaPod allows a system to control the execution of such programs to limit their ability to harm a system even if they are compromised. 5.2 PeaPod Virtualization To support the PeaPod virtualization abstraction design of secure and isolated namespaces on commodity operating systems, we leveraged the secure pod virtualization architecture described in Chapter 3.1.1. For example, if one had a web server that just serves static content, one can easily set up a web server pod to contain only the files the web server needs to run and the content it wants to serve. The web server pod could have its own IP address, decoupling its network presence from the underlying system. It could also limit network access to client-initiated connections. If the web server application gets compromised, the pod limits the ability of an attacker to further harm the system since the only resources the attacker has access to are the ones explicitly needed by the service. Furthermore, there is no need to carefully disable other network services commonly enabled by the operating system that might be compromised, as only the single service is running within the pod. Chapter 5. PeaPod: Isolating Cooperating Processes 5.2.1 69 Pea Virtualization Peas are supported using virtualization mechanisms that label resources and enforce a simple set of configurable permission rules to impose levels of isolation among processes running within a single pod. For example, when a process is created via the fork() and clone() system calls, its process identifier is tagged with the identifier of the pea in which it was created. Peas leverage the pod’s shadow pod process identifier and also place it in the same pea as its parent process. A process’s ability to access pod resources is then dictated by the set of access permissions rules associated with its pea. Like pod virtualization, the key pea operating system virtualization mechanisms are system call interposition and file system stacking. Pea virtualization employs system call interposition to virtualize the kernel and wrap existing system calls. Kernel virtualization enables peas to enforce restrictions on process interactions by controlling access to process and IPC virtual identifiers. Since each resource is labeled with the pea in which it was created, the kernel virtualization mechanism checks if the pea labels of the calling process and the resource to be accessed are the same. When a process in one pea tries to send a signal to a process in a separate pea by using the kill system call, the system returns an error value of EPERM, as the process exists, but this process has no permission to signal it. However, a parent process is able to use the wait system call to clean up a terminated child process’s state, even if that child process is running within a separate pea, since wait does not modify a process by affecting its execution. This is analogous to a regular user being able to list the metadata of a file, such as owner and permission bits, even if the user has no permission to read from or write to the file. When a new process is created, it executes in the pea security domain of its parent. However, when the process executes a new program, the security domain of Chapter 5. PeaPod: Isolating Cooperating Processes 70 the parent might not be the appropriate security domain to execute the new program in. 
Therefore, one wants the ability to explicitly transition the process from one pea security domain to another on new program execution. To support this, peas provide a single type of pea transition rule that lets a pea determine how a process can transition from its current pea to another. This transition rule is specified by a program filename and pea identifier. A pea is able to have multiple pea access transition rules of this type. The rule specifies that a process should be moved into the pea specified by the pea identifier if it executes the program specified by the given filename. This is useful when it is desirable to have that new program execution occur in an environment with different resource restrictions. For example, an Apache web server running in a pea may want to execute its CGI child processes in a pea with different restrictions. Pea transitioning is supported by interposing on the exec system call and transitioning peas if the process to be executed matches a pea access transition rule for the current pea. Note that pea access transition rules are one-way transitions that do not allow a process to return to its previous pea unless its new pea explicitly provides for such a transition. Kernel virtualization is used to control network access inside the pea. Peas provide two networking access rule types. One allow processes in the pea to make outgoing network connections on a pod’s virtual network adapters, while the other allows processes in the pea to bind to specific ports on the adapter to receive incoming connections. Pea network access rules can allow complete access to a pod network adapter, or only allow access on a per-port basis. Since any network access occurs through system calls, peas simply check the options of the networking system call, such as bind and connect, to ensure that it is allowed to perform the specified action. Pea virtualization employs a set of file system access rules and file system virtualization to provide each pea with its own permission set on top of the pod file Chapter 5. PeaPod: Isolating Cooperating Processes 71 system. To provide a least-privilege environment, processes should not have access to file system privileges they do not need. For example, while Sendmail has to write to /var/spool/mqueue, it only has to read its configuration from /etc/mail and should not need to have write permission on its configuration. To implement such a least-privilege environment, peas allow files to be tagged with additional permissions that overlay the respective underlying file permissions. File system permissions determine access rights based on the user identity of the process while pea file permission rules determine access rights based on the pea context in which a process is executed. Each pea file permission rule can selectively allow or deny the use of the underlying read, write and execute permissions of a file on a per-pea basis. The underlying file permission is always enforced, but pea permissions can further restrict whether the underlying permission is allowed to take effect. The final permission is achieved by performing a bitwise and operation on both the pea and file system permissions. For example, if the pea permission rule allowed for read and execute, the permission set of r-x would be triplicated to r-xr-xr-x for the three sets of Unix permissions and the bitwise and operation would mask out any write permission that the underlying file system allows. This prevents any process in the pea from opening the file to modify it. 
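To make this masking step concrete, the following C sketch shows one way the combination could be computed: the pea's rwx bits are replicated across the owner, group, and other permission sets and then ANDed with the file's mode bits. The PEA_* constants and the pea_mask_mode function are illustrative names for this sketch, not PeaPod's actual code.

#include <sys/stat.h>

/* Illustrative pea permission bits (read, write, execute). */
#define PEA_READ  04
#define PEA_WRITE 02
#define PEA_EXEC  01

/* Replicate the pea's rwx bits across the user, group and other
 * permission sets and AND them with the file's mode, so a pea can
 * only remove rights the file system already grants, never add any. */
static mode_t pea_mask_mode(mode_t file_mode, unsigned int pea_bits)
{
    mode_t mask = (mode_t)((pea_bits << 6) | (pea_bits << 3) | pea_bits);

    /* keep non-permission bits, mask only the rwx portion */
    return (file_mode & ~(mode_t)0777) | (file_mode & 0777 & mask);
}

For example, a pea rule allowing only read and execute (r-x) reduces a file whose underlying mode is 0644 to an effective 0444 within that pea, so no process in the pea can open it for writing.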
Enforcing on-disk labeling of every single file, such as supported through access control lists provided by many modern file systems, is inflexible if a single underlying file system is going to be used for multiple disparate pods and peas. As each pea in each pod can use the same files with different permission schemes, storing the pea's permission data on disk is not feasible. Instead, peas support the ability to dynamically label each file within a pod's file system based on two simple path-matching permission rules: path-specific permission rules and directory-default permission rules. A path-specific permission matches an exact path on the file system. For instance, if there is a path-specific permission for /home/user/file, only that file will be matched with the appropriate permission set. On the other hand, if there is a directory-default permission for the directory /home/user/, then any file under that directory in the directory tree can match it and inherit its permission set.
Given a set of path-specific and directory-default permissions for a pea, the algorithm for determining which permission applies to a given path starts with the complete path and walks up the path to the root directory until it finds a matching permission rule. The algorithm can be described in four simple steps:

1. If the specific path has a path-specific permission, return that permission set.
2. Otherwise, choose the path's directory as the current directory to test.
3. If the directory being tested has a directory-default permission, return that permission set.
4. Otherwise, set its parent as the current directory to test and go back to step 3.

If there is no path-specific permission, the closest directory-default permission to the specified path becomes the permission set for that path. By default, peas give the root directory "/" a directory-default permission denying all permissions; thus, the default for every file on the system, unless otherwise specified, is deny. This ensures that peas have a fail-safe default setup and do not allow access to any files unless specified by the administrator.
The semantics of pea file permissions are based on file path names. If a file has more than one path name, such as via a hard link, both have to be protected by the same permission; otherwise, depending on the order in which the file is accessed, the permission set it gets will be determined simply by the path name that was accessed first. This issue only occurs when creating the initial set of pea file access permissions. Once the pea is set up, any hard links that are created will obey the regular file system permissions. For instance, one is not allowed to create a hard link to a file that one does not have permission to access. On the other hand, if one does have permission to access the file, a path-specific permission rule will be created for the newly created file that corresponds to the permission of the path name it was linked to.
The pea architecture uses file system virtualization to integrate the pea file system namespace restrictions into the regular kernel permission model, thereby avoiding TOCTOU race conditions. It accomplishes this by virtualizing the file system's ->lookup method, which fills in the respective file's inode structure, and the ->permission method, which uses the stored permission data to make simple permission determinations.
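The four-step walk above can be sketched in a few lines of user-space C. The struct pea_rule layout and the pea_find_rule name are assumptions made for illustration rather than PeaPod's actual in-kernel data structures, and rule paths are assumed to be stored without trailing slashes.

#include <string.h>
#include <stdbool.h>
#include <stddef.h>

struct pea_rule {
    const char  *path;        /* e.g. "/home/user" or "/home/user/file"          */
    bool         dir_default; /* true for directory-default, false for specific  */
    unsigned int bits;        /* allowed rwx bits                                */
};

/* Find the rule governing 'path': first an exact path-specific match,
 * then the closest directory-default match walking up toward "/".     */
static const struct pea_rule *
pea_find_rule(const struct pea_rule *rules, size_t n, const char *path)
{
    char buf[4096];
    size_t i;

    /* Step 1: exact path-specific match. */
    for (i = 0; i < n; i++)
        if (!rules[i].dir_default && strcmp(rules[i].path, path) == 0)
            return &rules[i];

    /* Steps 2-4: walk up the directory tree looking for a directory default. */
    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (;;) {
        char *slash = strrchr(buf, '/');
        if (slash == NULL)
            break;
        if (slash == buf)            /* reached the root directory */
            buf[1] = '\0';
        else
            *slash = '\0';
        for (i = 0; i < n; i++)
            if (rules[i].dir_default && strcmp(rules[i].path, buf) == 0)
                return &rules[i];
        if (buf[1] == '\0')          /* already tested "/": stop   */
            break;
    }
    return NULL;  /* with the default deny rule on "/", this means deny */
}

In the actual system this decision is reached inside the stacked file system rather than at the system call boundary, so it is applied to the same file system objects the kernel itself uses.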
A file system's ->permission method is a standard part of the operating system's security infrastructure, so no kernel changes are necessary.

5.2.2 Pea Configuration Rules

5.2.2.1 File System

Many system resources in Unix, including normal files, directories, and system devices, are accessed via files, so controlling access to the file system is crucial. Each pea must be restricted to those files used by its component processes. This control is important for security, because processes that work together do not necessarily need the same access rights to files. All file system access is controlled by path-specific and directory-default rules, which specify a file or directory and an access right.
The access right values for file rules are read, write, and execute, similar to standard Unix permissions. For convenience, we also define allow and deny, which are aliases for all three of read, write and execute, and cannot be combined with other access values in the same rules. When a path-specific or directory-default rule gives access to a directory entry, it implicitly gives execute, but not read or write, access to all parent directories of the file, up to the root directory. On the other hand, if a separate path-specific rule denies access to a directory, then access to both the directory and its contents will be denied. This occurs even if a separate directory-default rule would give access to subdirectories or files, as the path-specific rule is a better match.
Consider the case of the Sendmail mail daemon and the newaliases command with regard to the system-wide aliases file. Sendmail runs as the root user and needs to be able to read the aliases file in order to know where it should forward mail or otherwise redirect it. newaliases is a symbolic link to sendmail and typically also runs as the root user in order to update the aliases file and convert it into the database format used by the Sendmail daemon. In our example, newaliases runs in its own pea and is able to read from /etc/mail/aliases and read from and write to /etc/mail/aliases.db. Meanwhile, sendmail runs in another pea and is able to read both files, but not write to them. We use two path-specific rules per pea to express these access rules, as described in Figure 5.2.

pod mailserver {
    pea sendmail {
        path /etc/mail/aliases       read
        path /etc/mail/aliases.db    read
    }
    pea newaliases {
        path /etc/mail/aliases       read
        path /etc/mail/aliases.db    read,write
    }
}

Figure 5.2 – Example of Read/Write Rules

Similar rules can protect a device like /dev/dsp. When a user logs into a system locally, via the console, they are typically given control of local devices, such as the physical display and the sound card. Any application that the user runs has access to read from and write to these local devices, even though this privilege is not necessary. For example, we want to restrict playing and recording of sound files to the play and rec applications, which are part of SoX [9]. Figure 5.3 describes the rules that provide the appropriate access to the device.

pod music {
    pea play {
        path /dev/dsp    write
    }
    pea rec {
        path /dev/dsp    read
    }
}

Figure 5.3 – Protecting a Device

The other file system rule is the directory-default rule. It uses the same access values as path-specific rules, but it is used to specify the default access for files below a directory.
Any file or sub-directory will inherit the same access flags, since access is determined by matching the longest possible path prefix. Unlike path-specific rules, directory-default rules can deny access to a directory in general while still allowing access to specific files. Figure 5.4 describes a pea that denies access to all files in /bin, while only allowing access to /bin/ls.

pod fileLister {
    pea onlyLs {
        dir-default /bin    deny
        path /bin/ls        allow
    }
}

Figure 5.4 – Directory-Default Rule

5.2.2.2 Transition Rules

When Sendmail and Procmail are used together to deliver mail to local users, the sendmail process creates a new process and executes the procmail program to deliver the mail to the user's spool. Procmail needs different security settings, so it must transition from a Sendmail pea to a Procmail pea. Rules must be defined that state to which pea a process will transition upon execution. When a process calls the execve system call, we examine the file name to be executed and perform a longest prefix match on all the transition rules. For instance, by specifying a directory for a transition, PeaPod will cause a pea transition to occur for any program executed that is located in that directory, unless there is a more specific transition rule available. Figure 5.5 creates a pea for Sendmail and one for Procmail, and specifies that a process should transition when the procmail program is executed.

pod mailserver {
    pea sendmail {
        transition /usr/bin/procmail procmail
    }
    pea procmail {
    }
}

Figure 5.5 – Transition Rules

PeaPod does not provide the ability for a process to transition to another pea except by executing a new program. If it could, a process could open an allowed file in one pea and then transition to another pea where access to that file was not allowed, and thus circumvent the security restrictions.

5.2.2.3 Networking Rules

PeaPod provides two rules that define the network capabilities a pea exposes to the processes running within it. First, peas are able to restrict a process from instantiating an outgoing connection. Second, peas are able to limit what ports a process can bind to and listen on for incoming connections. By default, peas do not let processes make any outgoing connections or bind to any port. Whereas a full network firewall is an important part of any security architecture, it is orthogonal to the goals of PeaPod and therefore belongs in its own security layer.
Continuing the simplified Sendmail/Procmail usage case, an administrator would want to easily confine the network presence of processes running within the Sendmail and Procmail peas, as shown in Figure 5.6. By allowing sendmail to make outgoing connections, to enable it to send messages, as well as to bind to port 25, the standard port for receiving messages, Sendmail can continue to work normally. However, processes running within the procmail pea, which is otherwise less restricted, are not allowed to bind to any port, while they are allowed to initiate outgoing network connections. This allows programs, such as spam filters that require checking network-based information, to continue to work.

pod mailserver {
    pea sendmail {
        outgoing allow
        bind tcp/25
    }
    pea procmail {
        outgoing allow
    }
}

Figure 5.6 – Networking Rules
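Since the networking rules are checked by examining the arguments of networking system calls such as bind and connect, the enforcement itself is small. The following C sketch illustrates what such a check might look like for bind; the pea_net_rules structure and the function names are hypothetical, the sketch handles IPv4 only, and it is not PeaPod's actual interposition code.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

struct pea_net_rules {
    bool            allow_outgoing;  /* set by an "outgoing allow" rule     */
    const uint16_t *bind_ports;      /* ports listed in "bind" rules        */
    size_t          n_bind_ports;    /* zero means binding is never allowed */
};

/* Decide whether a bind() to addr should be permitted for a process
 * running in a pea governed by these rules (fail-safe default: deny). */
static bool pea_bind_allowed(const struct pea_net_rules *r,
                             const struct sockaddr *addr)
{
    const struct sockaddr_in *in;
    uint16_t port;
    size_t i;

    if (addr->sa_family != AF_INET)
        return false;                /* sketch: IPv4 only */

    in = (const struct sockaddr_in *)addr;
    port = ntohs(in->sin_port);

    for (i = 0; i < r->n_bind_ports; i++)
        if (r->bind_ports[i] == port)
            return true;
    return false;
}

/* The outgoing rule is all-or-nothing for this sketch. */
static bool pea_connect_allowed(const struct pea_net_rules *r)
{
    return r->allow_outgoing;
}

The corresponding connect check is simpler still: if the pea has no outgoing allow rule, the call is denied regardless of destination.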
5.2.2.4 Shared Namespace Rules

PeaPod provides a single namespace rule for allowing processes to access the pod's virtual private identifiers that do not belong to their own pea. PeaPod enables peas to be configured to only have access to resources tagged with specific pea identifiers or with the special global pea identifier that enables access to every virtual private resource in the pod. This rule is used to create a global pea with access to all the resources of a pod, for instance to allow a process to start up and shut down services running within a resource-restricted pea. Figure 5.7 describes a pod that has a pea, global_access, that is able to access every resource in the pod, as well as a pea, test1, that is able to access the resources created within one of its sibling peas, test2.

pod service {
    pea global_access {
        namespace global
    }
    pea test1 {
        namespace test2
    }
    pea test2 {
    }
}

Figure 5.7 – Namespace Access Rules

5.2.2.5 Managing Rules

To make it simpler for administrators to create peas in a pod, we allow groups of rules to be saved to a file and included in the main configuration file for a given PeaPod configuration. These groups of rules would typically describe the minimum resources necessary for a single application. Application packagers can include rule group files in their package and administrators can share rule groups with each other.
A rule group, seen in Figure 5.8 for a compiler, would be stored in a central location. An administrator uses an include rule to reference the external file as part of a development PeaPod. Figure 5.9 contains the tools necessary to build a Linux kernel from source; it permits access to the source code itself and a writable directory for the binaries.

path /usr/bin/gcc               read,execute
dir-default /usr/lib/gcc-lib    read,execute
path /usr/bin/cpp               read,execute
path /usr/lib/libiberty.a       read
path /usr/bin/ar                read,execute
path /usr/bin/as                read,execute
path /usr/bin/ld                read,execute
path /usr/bin/ranlib            read,execute
path /usr/bin/strip             read,execute

Figure 5.8 – Compiler Rules

pod workstation {
    pea kernel-development {
        include "stdlibs"
        include "compiler"
        include "tar"
        include "bzip2"
        dir-default /usr/local/src/      read
        dir-default /scratch/binaries    allow
    }
}

Figure 5.9 – Set of Multiple Rule Files

These management rules demonstrate PeaPod's ability to isolate the specific resource needs of individual programs from the local policy an administrator defines. The knowledge needed to build a set of rules that provides the specific set of resources a program service needs to execute is not always readily available to users of security systems. However, this knowledge is available to the authors and distributors of the system. PeaPod's management rules allow the creation and distribution of rule files that define the specific set of resources needed to execute a program service, while enabling the local administrator to further define the resource-restriction policy.

5.3 Security Analysis

Saltzer and Schroeder [126] describe several principles for designing and building secure systems. These include:

• Economy of mechanism: Simpler and smaller systems are easier to understand and ensure that they do not allow unwanted access.

• Fail safe defaults: Systems must choose when to allow access as opposed to choosing when to deny.
• Complete mediation: Systems should check every access to protected objects. • Least-privilege: A process should only have access to the privileges and resources it needs to do its job. • Psychological acceptability: If users are not willing to accept the requirements that the security system imposes, such as very complex passwords that the users are forced to write down, security is impaired. Similarly, if using the system is too complicated, users will misconfigure it and end up leaving it wide open. • Work factor : Security designs should force an attacker to have to do extra work to break the system. The classic quantifiable example is when one adds a single bit to an encryption key, one doubles the key space an attacker has to search. PeaPod is designed to satisfy these six principles. PeaPod provides economy of mechanism using a thin virtualization layer, based on system call interposition for Chapter 5. PeaPod: Isolating Cooperating Processes 81 kernel virtualization and file system stacking for file system virtualization, that only adds a modest amount of code to a running system. The largest part of the system is due to the use of a null stackable file system with 7000 lines of C code, but this file system was generated using a simple high-level file system language [152], and only 50 lines of code were added to this well-tested file system to implement PeaPod’s file system security. Furthermore, PeaPod changes neither applications nor the underlying operating system kernel. The modest amount of code to implement PeaPod makes the system easier to understand. As the PeaPod security model provides only resources that are explicitly stated, it is relatively easy to understand the security properties of resource access provided by the model. PeaPod provides fail safe defaults by only allowing access to resources that have been explicitly given to peas and pods. If a resource is not created within a pea, or explicitly made available to that pea, no process within that pea will be allowed to access it. Whereas a pea can be configured to enable access to all resources of the pod, this is an explicit action an administrator has to take. PeaPod provides for complete mediation of all resources available on the host machine by ensuring that all resource access occur through the pod’s virtual namespace. Unless a file, process or other operating system resource was explicitly placed in the pod by the administrator or created within the pod, PeaPod’s virtualization will not allow a process within a pod to access the resource. PeaPod provides a least-privilege environment by enabling an administrator to only include the data necessary for each service. PeaPod can provide separate pods for individual services so that separate services are isolated and restricted to the appropriate set of resources. Even if a service is exploited, PeaPod will limit the attacker to the resources the administrator provided for that service. While one can achieve similar isolation by running each individual service on a separate machine, Chapter 5. PeaPod: Isolating Cooperating Processes 82 this leads to inefficient use of resources. PeaPod maintains the same least-privilege semantic of running individual services on separate machines, while making efficient use of machine resources at hand. For instance, an administrator could run MySQL and Sendmail mail transfer services on a single machine, but within different pods. 
If the Sendmail pod gets exploited, the pod model ensures that the MySQL pod and its data will remain isolated from the attacker. Furthermore, PeaPod's peas are explicitly designed to enable least-privilege environments by restricting programs to an environment that can be easily limited to provide the least amount of access the encapsulated program needs to do its job.
PeaPod provides psychological acceptability by leveraging the knowledge and skills system administrators already use to set up system environments. Because pods provide a virtual machine model, administrators can use their existing knowledge and skills to run their services within pods. Furthermore, peas use a simple resource-based model that does not require a detailed understanding of any underlying operating system specifics. This differs from other least-privilege architectures that force an administrator to learn new principles or complicated configuration languages that require a detailed understanding of operating system principles.
As with least-privilege, PeaPod increases the work factor it would take to compromise a system simply by not making available the resources that attackers depend on to harm a system once they have broken in. For example, because PeaPod can provide selective access to which programs are included within a pea's view, it would be very difficult to get a root shell on a system that does not have access to any shell program. While removing the shell does not create a complete least-privilege system, it is a simple change that creates a lesser-privilege system and therefore increases the work factor that would be required to compromise the system.

5.4 Usage Examples

We briefly describe three examples that help illustrate how the PeaPod virtualization layer can be used to improve computer security and application availability for different application scenarios. The application scenarios are email delivery, web content delivery, and desktop computing. In the following examples we make extensive use of PeaPod's ability to compose rule files in order to simplify the rules. Instead of listing every file and library necessary to execute a program, we isolate them into a separate rule file to keep the focus on the actual management of the service that the pea is trying to protect.

5.4.1 Email Delivery

For email delivery, PeaPod's virtualization layer can isolate different components of email delivery to provide a significantly higher level of security in light of the many attacks on Sendmail vulnerabilities that have occurred [15,16,83,88]. Consider isolating a Sendmail installation that also provides mail delivery and filtering via Procmail. Email delivery services are often run on the same system as other Internet services to improve resource utilization and simplify system administration through server consolidation. However, this can provide additional resources to services that do not need them, potentially increasing the damage that can be done to the system if it is attacked.
As shown in Figure 5.10, using PeaPod's virtualization layer, both Sendmail and Procmail can execute in the same pod, which isolates email delivery from other services on the system. Furthermore, Sendmail and Procmail can be placed in separate peas, which allows the necessary interprocess communication mechanisms between them while improving isolation.
This pod is a common example of a privileged service that has child helper applications.

pod mail-delivery {
    pea sendmail {
        include "stdlibs"
        include "sendmail"
        dir-default /etc                 read
        dir-default /var/spool/mqueue    allow
        dir-default /var/spool/mail      allow
        dir-default /var/run             allow
        path /usr/bin/procmail           read,execute
        transition /usr/bin/procmail     procmail
        bind tcp/25
        outgoing allow
    }
    pea procmail {
        dir-default /    allow
        outgoing allow
    }
}

Figure 5.10 – Email Delivery Configuration

In this case, the Sendmail pea is configured with full network access to receive email, but only with access to the files necessary to read its configuration and to send and deliver email. Sendmail would be denied write access to file system areas such as /usr/bin to prevent modification of those executables, and would only be allowed to transition a process to the Procmail pea if it is executing Procmail, the only new program its pea allows it to execute. On mail delivery, Sendmail would then exec Procmail, which transitions the process into the Procmail pea. The Procmail pea is configured with a more liberal access permission, namely allowing access to the pod's entire file system, enabling it to run other programs, such as SpamAssassin. Although an administrator could configure programs Procmail executes, such as SpamAssassin, to run within their own peas, this example keeps them all within a single pea to demonstrate a simple configuration. As a result, the Sendmail/Procmail pod can provide full email delivery service while isolating Sendmail such that even if Sendmail is compromised by an attack, such as a buffer overflow, the attacker would be contained in the Sendmail pea and would not even be able to execute programs, such as a root shell, to further compromise the system.

5.4.2 Web Content Delivery

For web content delivery, PeaPod's virtualization layer can isolate different components of web content delivery to provide a significantly higher level of security in light of common web server attacks that may exploit CGI script vulnerabilities. Consider isolating an Apache web server front end, a MySQL database back-end, and CGI scripts that interface between them. Although one could run Apache and MySQL in separate pods, because they are providing a single service, it makes sense to run them within a single pod that is isolated from the rest of the system. However, because both Apache and MySQL are within the pod's single namespace, if an exploit is discovered in Apache, it could be used to perform unauthorized modifications to the MySQL database.
To provide greater isolation among different web content delivery components, Figure 5.11 describes a set of three peas in a pod: one for Apache, a second for MySQL, and a third for the CGI programs. Each pea is configured to contain the minimal set of resources needed by the processes running within the respective pea. The Apache pea includes the apache binary, configuration files and the static HTML content, as well as a transition permission to execute all CGI programs into the CGI pea. The CGI pea contains the relevant CGI programs as well as access to the MySQL daemon's named socket, allowing interprocess communication with the MySQL daemon to perform the relevant SQL queries. The MySQL pea contains the mysql daemon binary, configuration files and the files that make up the relevant databases.
pod web-delivery {
    pea apache {
        include "stdlibs"
        path /usr/sbin/apache          read,execute
        path /usr/sbin/apachectl       read,execute
        dir-default /var/www           read,execute
        transition /var/www/cgi-bin    cgi
        bind tcp/80
    }
    pea cgi {
        include "stdlibs"
        include "perl"
        dir-default /var/www/data    allow
        path /tmp/mysql.sock         allow
    }
    pea mysql {
        include "stdlibs"
        path /usr/sbin/mysqld           read,execute
        path /tmp/mysql.sock            allow
        dir-default /usr/share/mysql    read
        dir-default /var/lib/mysql      allow
    }
}

Figure 5.11 – Web Delivery Rules

As Apache is the only program exposed to the outside world, it is the process that is most likely to be directly exploited. However, if an attacker is able to exploit it, the attacker is limited to a pea that is able only to read or write specific Apache files, as well as execute specific CGI programs into a separate pea. As the only way to access the database is through the CGI programs, the only access to the database an attacker would have is what is allowed by those programs. By writing the CGI programs carefully to sanitize the inputs passed to them, one can protect these entry points. Consequently, the ability of an attacker to cause serious harm to such a web content delivery system running with PeaPod's virtualization layer is significantly reduced.

5.4.3 Desktop Computing

For desktop computing, PeaPod's virtualization layer enables desktop computing environments to run multiple desktops from different security domains within multiple pods. Peas can also be used within the context of such a desktop computing environment to provide additional isolation. Many applications used on a daily basis, such as mp3 players [64] and web browsers [67], have had bugs that turn into security holes when maliciously created files are viewed by them. These holes allow attackers to execute malicious code or gain access to the entire local file system. Figure 5.12 describes a set of PeaPod rules that can contain a small set of desktop applications being used by a user with the /home/spotter home directory.

pod desktop {
    pea firefox {
        include "firefox"
        dir-default /home/spotter/.mozilla     allow
        dir-default /home/spotter/tmp          allow
        dir-default /home/spotter/download     allow
        transition /usr/bin/mpg123      mpg123
        transition /usr/bin/acroread    acroread
    }
    pea mp3 {
        include "stdlibs"
        path /usr/bin/mpg123               read,execute
        path /dev/dsp                      write
        dir-default /home/spotter/tmp      allow
        dir-default /home/spotter/music    allow
    }
    pea acroread {
        include "stdlibs"
        include "acroread"
        dir-default /home/spotter/tmp    allow
    }
}

Figure 5.12 – Desktop Application Rules

To secure an mp3 player, a pea can be created within the desktop computing pod that restricts the mp3 player's use of files outside of a special mp3 directory. As most users store their music within its own subtree, this is not a serious restriction. Most mp3 content should not be trusted, especially if one is streaming mp3s from a remote site. By running the mp3 player within this fully restricted pea, a malicious mp3 cannot compromise the user's desktop session. This mp3 player pea is simply configured with four file system permissions. First, a path-specific permission that provides access to the mp3 player itself is required to load the application. Second, a directory-default permission that provides access to the entire mp3 directory subtree is required to give the process access to the mp3 file library.
Third, a directory-default permission to a directory meant to store temporary files is required so the mp3 player can be used as a helper application. Finally, a path-specific permission that provides access to the /dev/dsp audio device is required to allow the process to play audio.
To secure a web browser, a pea can be created within a desktop computing pod that restricts the web browser's access to system resources. Consider the Mozilla Firefox web browser as an example. A Firefox pea would need to have all the files Firefox needs to run accessible from within the pea. Mozilla dynamically loads libraries and stores them along with its plugins within the /usr/lib/firefox directory. By providing a directory-default permission that provides access to that directory, as well as another directory-default permission that provides access to the user's .mozilla directory, the Firefox web browser can run normally within this special Firefox pea. Users also want the ability to download and save files, as well as launch viewers, such as for postscript or mp3 files, directly from the web browser. This involves a simple reconfiguration of Firefox to change its internal application.tmp dir variable to be a directory that is writable within the Mozilla pea. By creating such a directory, such as download within the user's home directory, and providing a directory-default permission allowing access, we allow one to explicitly save files, as well as implicitly save them when one wants to execute a helper application.
Similarly, just as Mozilla is configured to run helper applications for certain file types, one could configure the Mozilla pea to execute those helper applications within their respective peas. As shown in Figure 5.12, for an mp3 player, configuring such a pea for these processes is fairly simple. The only addition one would have to make is to provide an additional pea transition permission to the Mozilla pea that tells PeaPod's virtualization layer to transition the process to a separate pea on execution of programs such as the mpg123 mp3 player or the Acrobat Reader PDF viewer. However, this desktop computing example is also the most complicated, and shows the difficulty that can occur in trying to secure a complex desktop. In this example we only attempt to secure a simplified desktop and isolate three applications, and yet it is the largest rule set. Many desktop environments are made up of many applications, and each application would need its own set of rules. To avoid the need to create rules for each individual application, we created Apiary, described in Chapter 7, to specifically address desktop security.

5.5 Experimental Results

We implemented PeaPod's virtualization layer as a loadable kernel module in Linux that requires no changes to the Linux kernel source code or design. We present experimental results using our Linux prototype to quantify the overhead of using PeaPod on various applications. Experiments were conducted on two IBM Netfinity 4500R machines, each with a 933Mhz Intel Pentium-III CPU, 512MB RAM, 9.1 GB SCSI HD and a 100 Mbps Ethernet connected to a 3Com Superstack II 3900 switch. One of the machines was used as an NFS server from which directories were mounted
to construct the virtual file system for the PeaPod on the other client system. The client ran Debian stable with a 2.4.21 kernel.

To measure the cost of PeaPod's virtualization layer, we used a range of microbenchmarks and real application workloads and measured their performance on our Linux PeaPod prototype and a vanilla Linux system. Table 5.1 shows the five microbenchmarks and four application benchmarks we used to quantify PeaPod's virtualization overhead. To obtain accurate measurements, we rebooted the system between measurements. The files for the benchmarks were stored on the NFS server. All of these benchmarks were performed in a chrooted environment on the NFS client machine running Debian Unstable.

Name        Description
getpid      average getpid runtime
ioctl       average runtime for the FIONREAD ioctl
semaphore   IPC semaphore variable is created and removed
fork-exit   process forks and waits for child that calls exit immediately
fork-sh     process forks and waits for child to run /bin/sh to run a program that prints "hello world" then exits
Postmark    use the Postmark benchmark to simulate Sendmail performance
Apache      runs Apache 1.3 under load and measures average request time
Make        Linux kernel 2.4.21 compile with up to 10 processes active at one time
MySQL       "TPC-W like" interactions benchmark that uses Tomcat 4 and MySQL 4

Table 5.1 – Application Benchmarks

Figure 5.13 shows the results of running the benchmarks under both configurations, with the vanilla Linux configuration normalized to one. Since all benchmarks measure the time to run the benchmark, a smaller number is better for all benchmark results. The results in Figure 5.13 show that PeaPod's pea virtualization layer, as expected, imposes negligible overhead over the already existing pod virtualization.

[Figure 5.13 – PeaPod Virtualization Overhead: normalized performance of plain Linux and PeaPod for each benchmark]

This is because, to enforce resource isolation, all PeaPod has to do is compare the resource's pea attribute against the process trying to access it. For PIDs and IPC keys, it is a single equality test, which is minimal extra work beyond looking up the virtualized mapping in a hash table. On the other hand, for file system entries it might have to iterate through a small set of rules. Furthermore, this only matters for file system operations that care about permissions, such as open; for all other file system operations, such as read and write, there is no extra pea overhead. Therefore, just like *Pod, PeaPod incurs less than 10% overhead for most of the micro-benchmarks and less than 4% overhead for the application workloads.

For the system call microbenchmarks, PeaPod has to do very little extra work to restrict each process to its pea context. Similarly, PeaPod has to do very little work to virtualize the file system. This is apparent from both the Postmark benchmark and a set of real applications. Postmark was configured to operate on files between 512 and 10K bytes in size, representative of the individual files on a mail server queue, with an initial set of 20,000 files, and to perform 200,000 transactions. PeaPod exhibits very little overhead in the Postmark benchmark as it does not require any additional I/O operations.
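To make this cost model concrete, the following is a minimal user-space sketch, not the in-kernel implementation, of the kind of rule lookup the pea layer performs on permission-checking operations such as open. The rule kinds mirror the path and dir-default permissions of Figures 5.11 and 5.12; the class and function names are hypothetical.

from dataclasses import dataclass

@dataclass
class Rule:
    kind: str        # "path" (exact match) or "dir-default" (subtree match)
    prefix: str      # file or directory the rule applies to
    perms: set       # e.g. {"read", "execute"} or {"allow"}

class Pea:
    """A small set of file system rules attached to a group of processes."""
    def __init__(self, name, rules):
        self.name = name
        self.rules = rules

    def check(self, path, wanted):
        # Iterate through the (small) rule list; an exact path rule or an
        # enclosing dir-default rule must grant the requested permission.
        for rule in self.rules:
            if rule.kind == "path" and path == rule.prefix:
                return wanted in rule.perms or "allow" in rule.perms
            if rule.kind == "dir-default" and path.startswith(rule.prefix.rstrip("/") + "/"):
                return wanted in rule.perms or "allow" in rule.perms
        return False  # default deny: no rule grants access

# Example: the mp3 pea from Figure 5.12
mp3 = Pea("mp3", [
    Rule("path", "/usr/bin/mpg123", {"read", "execute"}),
    Rule("path", "/dev/dsp", {"write"}),
    Rule("dir-default", "/home/spotter/tmp", {"allow"}),
    Rule("dir-default", "/home/spotter/music", {"allow"}),
])

assert mp3.check("/home/spotter/music/song.mp3", "read")
assert not mp3.check("/etc/passwd", "read")

As the sketch suggests, the per-operation work is a short linear scan over a handful of rules, which is consistent with the small measured overhead.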
In addition, PeaPod exhibited a maximum overhead of 4% in real application benchmarks. This overhead was measured using the http load benchmark [108] to place a parallel fetch load on an Apache 1.3 server by simulating 30 parallel clients continuously fetching a set of files and measuring the average request time for each HTTP session. Similarly, we tested MySQL as part of a web commerce scenario outlined by TPC-W with a bookstore servlet running on top of Tomcat 4 with a MySQL 4 back-end. The PeaPod overhead for this scenario was less than 2% versus vanilla Linux. 5.6 Related Work A number of other approaches have explored the idea of virtualizing the operating system environment to provide application isolation. FreeBSD’s Jail mode [74] provides a chroot-like environment that processes cannot break out of. However, Jail is limited, such as the fact that it does not allow IPC within a jail [57], and therefore many real-world application will not work. More recently, Linux Vserver [7] and Containers [6], and Solaris Zones [116] offer a VM abstraction similar to PeaPod’s pods, but require substantial in-kernel modifications to support the abstraction. Although these systems share the simplicity of the Pod abstraction, they do not provide finer-grained isolation as provided with peas. Similarly, VMMs have been used to provide secure isolation [28, 142, 147]. Unlike PeaPod’s virtualization layer, VMMs decouple processes from the underlying machine hardware, but tie them to an instance of an operating system. As a result, VMMs provide an entire operating system instance and namespace for each VM and lack the ability to isolate components within an operating system. If a single process in a VM is exploitable, malicious code can use it to access the entire set of operating system Chapter 5. PeaPod: Isolating Cooperating Processes 93 resources. As PeaPod’s virtualization layer decouples processes from the underlying operating system and its resulting namespace, it is natively able to limit the separate processes of a larger system to the appropriate resources needed by them. Furthermore, VMMs require more administrative overhead due to requiring administration of multiple full OS instances, as well as imposing higher memory overhead due to the requirements of the underlying operating system. Many systems have been developed to isolate untrusted applications. NSA’s Security Enhanced Linux [86], which is based upon the Flask Architecture [133], implements a policy language that is used to implement models that enforce privilege separation. The policy language is very flexible but also very complex to use. The example security policy is over 80 pages long. There is research into creating tools to make policy analysis tractable [21], but the language’s complexity makes it difficult for the average end user to construct an appropriate policy. System call interception has been used by systems such as Janus [59, 144], Systrace [118], MAPbox [17], Software Wrappers [80] and Ostia [60]. These systems can enable flexible access controls per system call, but they have been limited by the difficulty of creating appropriate policy configurations. TRON [30], SubDomain [49] and Alcatraz [84] also operate at the system call level but focus on limiting access to the underlying file system. TRON allows transitions between different isolation units but requires application modifications to use this feature, while SubDomain supports an implicit transition on execution of a new child process. 
These systems provide a model somewhat similar to the file system approach used by PeaPod peas. However, the pea’s file system virtualization is based on a full-fledged stackable file system that integrates fully with regular kernel security infrastructure and provides much better performance. Similarly, the PeaPod’s kernel virtualization layer provides a complete process-isolation solution that is not just limited to file system protection. Chapter 5. PeaPod: Isolating Cooperating Processes 94 Safer languages and runtime environments, most notably Java, have been developed to prevent common software errors and isolate applications in language-based virtual machine environments. These solutions require applications to be rewritten or recompiled, often with some loss in performance. Other language-based tools [25, 48] have also been developed to harden applications against common attacks, such as buffer overflow attacks. PeaPod’s virtualization layer complements these approaches by providing isolation of legacy applications without modification. Chapter 6 Strata: Managing Large Numbers of Machines A key problem organizations face is how to efficiently provision and maintain the large number of machines deployed throughout their organizations. This problem is exemplified by the growing adoption and use of virtual appliances (VAs). VAs are pre-built software bundles run inside virtual machines (VMs). For example, one VM might be tailored to be a web server VA, while another might be tailored to be a desktop computing VA. Since VAs are often tailored to a specific application, these configurations can be smaller and simpler, potentially resulting in reduced resource requirements and more secure deployments. VAs simplify application deployment. Once an application is installed in a VA, it is easily deployed by end users with minimal hassle because both the software and its configuration have already been set up in the VA. A new VA can be easily created by cloning an existing VA that already contains a base installation of the necessary software, then modifying it by adding applications and changing the system configuration. There is no need to set up the common components from scratch. Chapter 6. Strata: Managing Large Numbers of Machines 96 But while virtualization and VAs decrease the cost of hardware, they can tremendously increase the human cost of administering these machines. As VAs are cloned and modified, creating an ever-increasing sprawl of different configurations, organizations that once had a few hardware machines to manage now find themselves juggling many more VAs with diverse system configurations and software installations. For instance, in the past, an organization would have run services such as web, mail, databases, file storage and shell access on a single machine because these services share many common files. By dividing these services into separate VAs, instead of a single machine with five services, one now has five independent VAs. This causes many management problems. First, because these VAs share a lot of common data, they are inefficient to store, as there are multiple copies of many common files. Although storage is cheap, the bandwidth available to write data to the disks is not. Copying the VA into place is extremely time-consuming and negatively impacts the performance of the other VAs running on the system. Second, by increasing the number of systems in use, we increase the number of systems needing security updates. 
Although software patches are released for security threats, constantly deploying patches and upgrading software creates a management nightmare as the number of VAs in the enterprise continues to rise. Many VAs may be turned off, suspended, or not even actively managed, making patch deployment before a security problem hits even more difficult. This problem is exacerbated by the large number of VAs in a large organization. Although the management of any one VA may not be difficult, the need to manage many different VAs results in a huge scaling problem for large organizations as management overhead grows linearly with the number of VAs needing maintenance. Instead of a single update for a machine running five services, the administrator now must apply the update five separate times. Chapter 6. Strata: Managing Large Numbers of Machines 97 Finally, as VAs are increasingly networked, the management problem only grows, given the myriad viruses and other attacks commonplace today. Security problems can wreak havoc on an organization’s virtual computing infrastructure. While virtualization can improve security via isolation, the sprawl of machines increases the number of hiding places for an attacker. Instead of a single actively used machine to monitor for malicious changes, administrators now have to monitor many less used machines. Furthermore, as VAs are designed to be dropped in place and not actively managed, administrators might not even know what VAs have been put into use by their end users. Many approaches have been used to address these problems, including traditional package management systems [4, 56], copy-on-write disks [91] and new VM storage formats [41, 103]. Unfortunately, these approaches suffer from various drawbacks that limit their utility and effectiveness in practice. They either incur management overheads that grow with the number of VAs, or require all VAs to have the same configuration, eliminating the main advantages of VAs. The fundamental problem with previous approaches is that they are based on a monolithic file system or block device. These file systems and block devices address their data at the block layer and are simply used as a storage entity. They have no direct concept of what the file system contains or how it is modified. However, managing VAs is essentially done by making changes to the file system. As a result, any upgrade or maintenance operation needs to be done to each VA independently, even when they all need the same maintenance. We have built Strata, a novel system that integrates file system unioning with package management semantics, by introducing the Virtual Layered File System (VLFS) and using it to solve VA management problems. Strata makes VA creation and provisioning fast. It simplifies the regular maintenance and upgrades that must Chapter 6. Strata: Managing Large Numbers of Machines 98 be performed on provisioned VA instances. Finally, it improves the administrator’s ability to detect and recover from security exploits. Strata achieves this by providing three architectural components: layers, layer repositories and Virtual Layered File Systems. A layer is a file hierarchy of related files that are typically installed and upgraded as a unit. Layers are analogous to software packages in package management systems. Like software packages, a layer may require other layers to function correctly, just as applications often require various system libraries to run. 
Strata associates dependency information with each layer that defines relationships among layers. Unlike software packages, which must be installed into each VA’s file system, layers can be shared directly among multiple VAs. Layer repositories are used to store layers centrally within a virtualization infrastructure, enabling them to be shared among multiple VAs. Layers are updated and maintained in the layer repository. For example, if a new version of an application becomes available, a new layer is added to the repository. If a patch for an application is issued, the corresponding layer is patched by creating a new layer with the patch. Different versions of the same application may be available through different layers in the layer repository. The layer repository is typically stored in a shared storage infrastructure accessible by the VAs, such as a Storage Area Network (SAN). Storing layers on the SAN does not impact VA performance because a SAN is where a traditional VA’s monolithic file system is stored. The VLFS is the file system for a VA. Unlike a traditional monolithic file system, it is a collection of individual layers dynamically composed into a single view. This is analogous to a traditional file system managed by a package manager that is composed of many packages extracted into it. Each VA has its own VLFS, which typically consists of a private read-write layer and a set of read-only layers shared through the layer repository. The private read-write layer is used for all file system Chapter 6. Strata: Managing Large Numbers of Machines 99 modifications private to the VA that occur during runtime, such as modifying user data. The shared read-only layers allow VAs with very different system configurations and applications to share common layers representing software components common across VAs. Layer changes to shared layers only need be done once in the repository and are then automatically propagated to all VLFSs, resulting in management overhead independent of the number of VAs. By dynamically building a VLFS out of discrete layers, Strata introduces file system unioning as the package management semantic. This provide a number of management benefits. First, Strata is able to create and provision VAs more quickly and easily. To create a template VA, an administrator just selects the applications and tools of interest from the layer repository. The template VA’s VLFS automatically unions the selected layers together with a read-write layer and incorporates any additional layers needed to resolve any necessary dependencies. This template VA then becomes the single image end users in an enterprise will use when they want to use this service. End users can instantly create provisioned instances of this template VA because no copying or on-demand paging is needed to instantiate its file system, as all the layers are accessed from the shared layer repository. The layer repository allows easy identification of the applications and tools of interest, and the VLFS automatically resolves dependencies on other layers, so provisioning VAs is relatively easy. Because VAs are just defined by their associated sets of layers, Strata also offers a new way to build VAs simply by combining existing ones. Second, Strata simplifies upgrades and maintenance of provisioned VAs. If a layer contains a bug to be fixed, the administrator creates a replacement layer with the fix and updates the template VA. 
This informs the provisioned VAs to incorporate the layer into their VLFS’s namespace view. Traditional VAs, which are provisioned and updated by replacing their file system [41, 103], have to be rebooted in order to Chapter 6. Strata: Managing Large Numbers of Machines 100 incorporate changes by making use of a new block device. Strata, however, allows online upgrades like a traditional package management system. Unlike package management system upgrades, in which a significant amount of time is spent deleting the existing files and copying the new files into place, upgrades in a VLFS are atomic, preventing the file system from ever being in an inconsistent state. Finally, this semantic allows VAs managed by Strata to easily recover from security exploits. VLFSs distinguish between files installed via its package manager, which are stored in a shared read-only layer, and the changes made over time, which are stored in the private read-write layer. If a VA is compromised and an attacker installs new malware or modifies an existing application, these changes will be separated from the deployed system’s initial state and isolated to the read-write layer. Such changes are easier to identify and remove, returning the VA to a clean state. 6.1 Strata Basics Figure 6.1 shows Strata’s three architectural components: layers, layer repositories and VLFSs. A layer is a distinct self-contained set of files that corresponds to a specific functionality. Strata classifies layers into three categories: software layers with selfcontained applications and system libraries, configuration layers with configuration file changes for a specific VA, and private layers allowing each provisioned VA to be independent. Layers can be mixed and matched, and may depend on other layers. For example, a single application or system library is not fully independent, but depends on the presence of other layers, such as those that provide needed shared libraries. Strata enables layers to enumerate their dependencies on other layers. This dependency scheme allows automatic provisioning of a complete, fully consistent file system by selecting the main features desired within the file system. Chapter 6. Strata: Managing Large Numbers of Machines 101 Figure 6.1 – How Layers, Repositories, and VLFSs Fit Together Layers are provided through layer repositories. As Figure 6.1 shows, a layer repository is a file system share containing a set of layers made available to VAs. When an update is available, the old layer is not overwritten. Instead, a new version of the layer is created and placed within the repository, making it available to Strata’s users. Administrators can also remove layers from the repository, e.g., those with known security holes, to prevent them from being used. Layer repositories are generally stored on centrally managed file systems, such as a SAN or NFS, but they can also be provided by protocols such as FTP and HTTP and mirrored locally. Layers from multiple layer repositories can form a VLFS as long as they are compatible with one another. This allows layers to be provided in a distributed manner. Layers provided by different maintainers can have the same layer names, causing a conflict. This, however, is no different from traditional package management systems as packages Chapter 6. Strata: Managing Large Numbers of Machines 102 with the same package name, but different functionality, can be provided by different package repositories. 
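As a rough illustration of how a repository can be enumerated, the sketch below scans repository directories for layer units and builds an index of the versions available for each layer name, flagging the cross-repository name collisions just described. It assumes each layer unit is stored as a directory named by layer name and version, separated by an underscore, in the spirit of the on-disk naming described later in Section 6.3.1; the paths and function name are hypothetical.

import os
from collections import defaultdict

def index_repositories(repo_dirs):
    """Map layer name -> {version: (repository, path)} across repositories."""
    index = defaultdict(dict)
    for repo in repo_dirs:
        for entry in sorted(os.listdir(repo)):
            path = os.path.join(repo, entry)
            if not os.path.isdir(path):
                continue
            # Assumed "name_version" directory naming, e.g. "mysql-server_5.0.51a-3".
            name, _, version = entry.rpartition("_")
            if not name:
                continue
            if version in index[name]:
                # The same layer unit offered by two repositories; like package
                # repositories, the administrator must decide which one to use.
                raise ValueError("duplicate layer unit %s %s" % (name, version))
            index[name][version] = (repo, path)
    return index

# Example (hypothetical paths): one local SAN repository and one local mirror.
# layers = index_repositories(["/srv/strata/main", "/srv/strata/mirror"])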
As Figure 6.1 shows, a VLFS is a collection of layers from layer repositories that are composed into a single file system namespace. The layers making up a particular VLFS are defined by the VLFS’s layer definition file, which enumerates all the layers that will be composed into a single VLFS instance. To provision a VLFS, an administrator selects software layers that provide the desired functionality and lists them in the VLFS’s layer definition file. Within a VLFS, layers are stacked on top of another and composed into a single file system view. An implication of this composition mechanism is that layers on top can obscure files on layers below them, only allowing the contents of the file instance contained within the higher level to be used. This means that files in the private or configuration layers can obscure files in lower layers, such as when one makes a change to a default version of a configuration file located within a software layer. However, to prevent an ambiguous situation from occurring, where the file system’s contents depend on the order of the software layers, Strata prevents software layers that contain the same file names from being composed into a single VLFS. 6.2 Strata Usage Model Strata’s usage model is centered around the usage of layers to quickly create VLFSs for VAs as shown in Figure 6.1. Strata allows an administrator to compose together layers to form template VAs. These template VAs can be used to form other template appliances that extend their functionality, as well as to provide the VA that end users will provision and use. Strata is designed to be used within the same setup as a traditional VM architecture. This architecture includes a cluster of physical machines Chapter 6. Strata: Managing Large Numbers of Machines 103 that are used to host VM execution as well as a shared SAN that stores all the VM disk images that can be executed. However, instead of storing disk images on the SAN, Strata stores the layers that will be used by the VMs it manages. 6.2.1 Creating Layers and Repositories Layers are first created and stored in layer repositories. Layer creation is similar to the creation of packages in a traditional package management system, where one builds the software, installs it into a private directory, and turns that directory into a package archive, or in Strata’s case, a layer. For instance, to create a layer that contains the MySQL SQL server, the layer maintainer would download the source archive for MySQL, extract it, and build it normally. However, instead of installing it into the system’s root directory, one installs it into a virtual root directory that becomes the file system component of this new layer. The layer maintainer then defines the layer’s metadata, including its name (mysql-server in this case) and an appropriate version number to uniquely identify this layer. Finally, the entire directory structure of the layer is copied into a layer repository, making the layer available to users of that repository. 6.2.2 Creating Appliance Templates Given a layer repository, an administrator can then create template VAs. Creating a template VA involves: 1. Creating the template VA with an identifiable name and the VLFS it will use. 2. Determining what repositories are available to it. 3. Selecting a set of layers that provide the functionality desired. Chapter 6. 
Strata: Managing Large Numbers of Machines 104 For example, to create a template VA that provides a MySQL SQL server, an administrator creates an appliance/VLFS named sql-server and selects the layers needed for a fully functional MySQL server file system, most importantly, the mysql-server layer. Strata composes these layers together into the VLFS in a readonly manner along with a read-write private layer, making the VLFS usable within a VM. The administrator boots the VM and makes the appropriate configuration changes to the template VA, storing them within the VLFS’s private layer. Finally, the private layer belonging to the template appliance’s VLFS is frozen and becomes the template’s configuration layer. As another example, to create an Apache web server appliance, an administrator creates an appliance/VLFS named web-server, and selects the layers required for an Apache web server, most importantly, the layer containing the Apache httpd program. Strata extends this template model by allowing multiple template VAs to be composed together into a single new template. For example, an administrator can create a new template VA/VLFS, sql+web-server, composed of the MySQL and Apache template VAs. The resulting VLFS has the combined set of software layers from both templates, both of their configuration layers, and a new configuration layer containing the configuration state that integrates the two services together, for a total of three configuration layers. 6.2.3 Provisioning and Running Appliance Instances Given templates, VAs are efficiently and quickly provisioned and deployed by end users by cloning the available templates. Provisioning a VA involves: 1. Creating a virtual machine container with a network adapter and an virtual disk. Chapter 6. Strata: Managing Large Numbers of Machines 105 2. Using the network adapter’s MAC address as the machine’s identifier for identifying the VLFS created for this machine. 3. Forming the VLFS by referencing the already existing template VLFS and combining the template’s read-only software and configuration layers with a readwrite private layer provided by the VM’s virtual disk. As each VM managed by Strata does not have a physical disk off which to boot, Strata network boots each VM. When the VM boots, its BIOS discovers a network boot server which provides it with a boot image, including a base Strata environment. The VM boots this base environment, which then determines which VLFS should be mounted for the provisioned VM using the MAC address of the machine. Once the proper VLFS is mounted, the machine transitions to using it as its root file system. 6.2.4 Updating Appliances Strata upgrades provisioned VAs efficiently using a simple three-step process. First, an updated layer is installed into a shared layer repository. Second, administrators are able to modify the template appliances under their control to incorporate the update. Finally, all provisioned VAs based on that template will automatically incorporate the update as well. Note that updating appliances is much simpler than updating generic machines, as appliances are not independently managed machines. This means that extra software that can conflict with an upgrade will not be installed into a centrally managed appliance. Centrally managed appliance updates are limited to changes to their configuration files and what data files they store. Strata’s updates propagate automatically even if the VA is not currently running. 
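A toy model of this propagation, using hypothetical names and ignoring dependency resolution and the on-disk formats described later, might look like the following:

class Template:
    """A template appliance: an ordered mapping of shared layer -> version."""
    def __init__(self, layers):
        self.layers = dict(layers)

    def update(self, layer, new_version):
        # Step 2: the administrator points the template at the updated layer.
        self.layers[layer] = new_version

class ProvisionedVA:
    """A provisioned appliance: a reference to a template plus a private layer."""
    def __init__(self, template, private_layer):
        self.template = template
        self.private_layer = private_layer

    def compose(self):
        # Step 3: the VLFS is composed afresh from the template's current
        # layer list, so the update is picked up with no per-VA work.
        return list(self.template.layers.items()) + [self.private_layer]

sql = Template([("mysql-server", "5.0.51a-3")])
vas = [ProvisionedVA(sql, ("private-%d" % i, "rw")) for i in range(3)]
sql.update("mysql-server", "5.0.51a-4")   # hypothetical fixed layer version
assert all(("mysql-server", "5.0.51a-4") in va.compose() for va in vas)

Because each provisioned VLFS is recomposed from the template rather than copied from it, the administrator's single change reaches every appliance that references the template.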
If a VA is shut down, the VA will compose whatever updates have been applied to its templates automatically, never leaving the file system in a vulnerable state, because Chapter 6. Strata: Managing Large Numbers of Machines 106 it composes its file system afresh each time it boots. If it is suspended, Strata delays the update to when the VA is resumed, as updating layers is a quick task. Updating is significantly quicker than resuming, so this does not add much to its cost. Furthermore, VAs are upgraded atomically, as Strata adds and removes all the changed layers in a single operation. This is not like a traditional package management system which, when upgrading a package, first uninstalls it before reinstalling the newer version. The traditional method leaves the file system in an inconsistent state for a short period of time because it is possible that files needed for program execution may not be available. For instance, when the libc package is upgraded, its contents are first removed from the file system before being replaced. Any application that tries to execute during the interim will fail to dynamically link because the main library on which it depends is not present within the file system at that moment. 6.2.5 Improving Security Strata makes it much easier to manage VAs that have had their security compromised. By dividing a file system into a set of shared read-only layers and storing all file system modifications inside the private read-write layer, Strata separates changes made to the file system via layer management from regular runtime modifications. This enables Strata to easily determine when system files have been compromised as the changes will be readily visible in the private layer. This allows Strata to not rely on tools like Tripwire [79] or maintain separate databases to determine if files have been modified from their installed state. Similarly, this check can be run external to the VA, as it just needs access to the private layer, thereby preventing an attacker from disabling it. This reduces management load due to not requiring any external databases be kept in sync with the file system state as it changes. Chapter 6. Strata: Managing Large Numbers of Machines 107 This segregation of modified file system state also enables quick recovery from a compromised system. By simply replacing the VA’s private layer with a fresh private layer, the compromised system is immediately fixed by returning it to its default freshly provisioned state. However, unlike reinstalling a system from scratch, replacing the private layer does not require throwing away the contents of the old private layer. Strata enables the layer to be mounted within the file system, enabling administrators to have easy access to the files located within it to move the uncompromised files back to their proper places. 6.3 Virtual Layered File System Strata introduces the concept of a virtual layered file system in place of traditional monolithic file systems. Strata’s VLFS allows file systems to be created by composing layers together into a single file system namespace view. Strata allows these layers to be shared by multiple VLFSs in a read-only manner or to remain read-write and private to a single VLFS. Every VLFS is defined by a layer definition file (LDF), which specifies what software layers should be composed together. An LDF is a simple text file that lists the layers and their respective repositories. 
The LDF’s layer list syntax is repository/layer version and can be preceded by an optional modifier command. When an administrator wants to add or remove software from the file system, instead of modifying the file system directly, they modify the LDF by adding or removing the appropriate layers. Figure 6.2 contains an example LDF for a MySQL SQL server template appliance. The LDF lists each individual layer included in the VLFS along with its corresponding repository. Each layer also has a number indicating which version will be composed Chapter 6. Strata: Managing Large Numbers of Machines 108 into the file system. If an updated layer is made available, the LDF is updated to include the new layer version instead of the old one. If the administrator of the VLFS does not want to update the layer, they can hold a layer at a specific version, with the = syntax element. This is demonstrated by the mailx layer in Figure 6.2, which is being held at the version listed in the LDF. Strata allows an administrator to explicitly select only the few layers corresponding to the exact functionality desired within the file system. Other layers needed in the file system are implicitly selected by the layers’ dependencies as described in Section 6.3.2. Figure 6.2 shows how Strata distinguishes between explicitly and implicitly selected layers. Explicitly selected layers are listed first and separated from the implicitly selected layers by a blank line. In this case, the MySQL server has only one explicit layer, mysql-server, but has 21 implicitly selected layers. These include utilities such as Perl and TCP Wrappers (tcpd), as well as libraries such as OpenSSL (libssl). It also includes a layer providing a shared base common to all VLFSs. Strata distinguishes explicit layers from implicit layers to allow future reconfigurations to remove one implicit layer in favor of another if dependencies need to change. When an end user provisions an appliance by cloning a template, an LDF is created for the provisioned VA. Figure 6.3 shows an example introducing another syntax element, @, that instructs Strata to reference another VLFS’s LDF as the basis for this VLFS. This lets Strata clone the referenced VLFS by including its layers within the new VLFS. In this case, because the user wants only to deploy the SQL server template, this VLFS LDF only has to include the single @ line. In general, a VLFS can reference more than one VLFS template, assuming that layer dependencies allow all the layers to coexist. Chapter 6. Strata: Managing Large Numbers of Machines 109 main/mysql-server 5.0.51a-3 main/base 1 main/libdb4.2 4.2.52-18 main/apt-utils 0.5.28.6 main/liblocale-gettext-perl 1.01-17 main/libtext-charwidth-perl 0.04-1 main/libtext-iconv-perl 1.2-3 main/libtext-wrapi18n-perl 0.06-1 main/debconf 1.4.30.13 main/tcpd 7.6-8 main/libgdbm3 1.8.3-2 main/perl 5.8.4-8 main/psmisc 21.5-1 main/libssl0.9.7 0.9.7e-3 main/liblockfile1 1.06 main/adduser 3.63 main/libreadline4 4.3-11 main/libnet-daemon-perl 0.38-1 main/libplrpc-perl 0.2017-1 main/libdbi-perl 1.46-6 main/ssmtp 2.61-2 =main/mailx 3a8.1.2-0.20040524cvs-4 Figure 6.2 – Layer Definition for MySQL Server @main/sql-server Figure 6.3 – Layer Definition for Provisioned Appliance 6.3.1 Layers Strata’s layers are composed of three components: metadata files, the layer’s file system and configuration scripts. The metadata files define the information that describes the layer. This includes its name, version and dependency information. 
This information is important to ensure that a VLFS is composed correctly. The metadata file contains all the metadata that is specified for the layer. Figure 6.4 shows an example metadata file. Figure 6.5 shows the full metadata syntax. The metadata file has a single field per line with two elements, the field type and the field contents. In general, the metadata file’s syntax is Field Type: value, where value Chapter 6. Strata: Managing Large Numbers of Machines 110 can be either a single entry or a comma-separated list of values. The layer’s file system is a self-contained set of files providing a specific functionality. The files are the individual items in the layer that are composed into a larger VLFS. There are no restrictions on the types of files that can be included. They can be regular files, symbolic links, hard links or device nodes. Similarly, each directory entry can be given whatever permissions are appropriate. A layer can be seen as a directory stored on the shared file system that contains the same file and directory structure that would be created if the individual items were installed into a traditional file system. On a traditional UNIX system, the directory structure would typically contain directories such as /usr, /bin and /etc. Symbolic links work as expected between layers since they work on path names, but one limitation is that hard links cannot exist between layers. The layer’s configuration scripts are run when a layer is added or removed from a VLFS to allow proper integration of the layer within the VLFS. Although many layers are just a collection of files, other layers need to be integrated into the system as a whole. For example, a layer that provides MP3 file playing capability should register itself with the system’s MIME database to allow programs contained within the layer to be launched automatically when a user wants to play an MP3 file. Similarly, if the layer were removed, it should remove the programs contained within itself from the MIME database. Strata supports four types of configuration scripts: pre-remove, post-remove, preinstall and post-install. If they exist in a layer, the appropriate script is run before or after a layer is added or removed. For example, a pre-remove script can be used to shut down a daemon before it is actually removed, while a post-remove script can be used to clean up file system modifications in the private layer. Similarly, a pre-install script can ensure that the file system is as the layer expects, while the post-install Chapter 6. Strata: Managing Large Numbers of Machines 111 Layer: mysql-server Version: 5.0.51a-3 Depends: ..., perl (>= 5.6), tcpd (>= 7.6-4),... Figure 6.4 – Metadata for MySQL Server Layer Layer: Layer Name Version: Version of Layer Unit Conflicts: layer1 (opt. constraint), ... Depends: layer1 (...), layer2 (...) | layer3, ... Pre-Depends: layer1 (...), ... Provides: virtual_layer, ... Figure 6.5 – Metadata Specification script can start daemons included in the layer. The configuration scripts can be written in any scripting language. The layer must include the proper dependencies to ensure that the scripting infrastructure is composed into the file system in order to allow the scripts to run. Layers are stored on disk as a directory tree named by the layer’s name and version. For instance, version 5.0.51a of the MySQL server, with a Strata layer version of 3, would be stored under the directory mysql-server 5.0.51a-3. 
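A minimal reader for this "Field Type: value" format, assuming only the fields shown in Figure 6.5, treating comma-separated fields as lists, and leaving "|" alternatives within an element unsplit, could be sketched as follows (the function name is hypothetical):

LIST_FIELDS = {"Conflicts", "Depends", "Pre-Depends", "Provides"}

def parse_metadata(text):
    """Parse a layer metadata file of 'Field: value' lines (see Figure 6.5)."""
    meta = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        field, _, value = line.partition(":")
        field, value = field.strip(), value.strip()
        if field in LIST_FIELDS:
            meta[field] = [v.strip() for v in value.split(",") if v.strip()]
        else:
            meta[field] = value
    return meta

# Abbreviated example based on Figure 6.4 (only the entries shown there).
example = """Layer: mysql-server
Version: 5.0.51a-3
Depends: perl (>= 5.6), tcpd (>= 7.6-4)
"""
meta = parse_metadata(example)
assert meta["Layer"] == "mysql-server"
assert "perl (>= 5.6)" in meta["Depends"]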
Within this directory, Strata defines a metadata file, a filesystem directory and a scripts directory corresponding to the layer’s three components. 6.3.2 Dependencies A key Strata metadata element is enumeration of the dependencies that exist between layers. Strata’s dependency scheme is heavily influenced by the dependency scheme in Linux distributions such as Debian and Red Hat. In Strata, every layer composed into Strata’s VLFS is termed a layer unit. Every layer unit is defined by its name and version. Two layer units that have the same name but different layer versions are Chapter 6. Strata: Managing Large Numbers of Machines 112 different units of the same layer. A layer refers to the set of layer units of a particular name. Every layer unit in Strata has a set of dependency constraints placed within its metadata. There are four types of dependency constraints: • dependency • pre-dependency • conflict • provide Dependency and Pre-Dependency: Dependency and pre-dependency constraints are similar in that they require another layer unit to be integrated at the same time as the layer unit that specifies them. They differ only in the order the layer’s configuration scripts are executed to integrate them into the VLFS. A regular dependency does not dictate order of integration. A pre-dependency dictates that the dependency has to be integrated before the dependent layer. Figure 6.4 shows that the MySQL layer depends on TCP Wrappers (tcpd) because it dynamically links against the shared library libwrap.so.0 provided by TCP Wrappers. MySQL cannot run without this shared library, so the layer units that contain MySQL must depend on a layer unit containing an appropriate version of the shared library. These constraints can also be versioned to further restrict which layer units satisfy the constraint. For example, shared libraries can add functionality that breaks their application binary interface (ABI), breaking in turn any applications that depend on that ABI. Since MySQL is compiled against version 0.7.6 of the libwrap library, the dependency constraint is versioned to ensure that a compatible version of the library is integrated at the same time. Chapter 6. Strata: Managing Large Numbers of Machines 113 Conflict: Conflict constraints indicate that layer units cannot be integrated into the same VLFS. This generally occurs because the layer units depend on exclusive access to the same operating system resource. This can be a TCP port in the case of an Internet daemon, or two layer units that contain the same file pathnames and therefore would obscure each other. For this reason, two layer units of the same layer are by definition in conflict because they will contain some of the same files. An example of this constraint occurs when the ABI of a shared library changes without any source code changes, generally due to an ABI change in the tool chain that builds the shared library. Because the ABI has changed, the new version can no longer satisfy any of the previous dependencies. But because nothing else has changed, the file on disk will usually not be renamed either. A new layer must then be created with a different name, ensuring that the library with the new ABI is never used to satisfy an old dependency on the original layer. Because the new layer contains the same files as the old layer, it must conflict with the older layer to ensure that they are not integrated into the same file system. Provide: Provide dependency constraints introduce virtual layers. 
A regular layer provides a specific set of files, but a virtual layer indicates that a layer provides a particular piece of general functionality. Layer units that depend on a certain piece of general functionality can depend on a specific virtual layer name in the normal manner, while layer units that provide that functionality will explicitly specify that they do. For example, layer units that provide webmail or content management software depend on the presence of a web server, but which one is not important. Instead of depending on a particular web server, they depend on the virtual layer name httpd. Similarly, layer units containing a web server, such as Apache or Boa, are defined to provide the httpd virtual layer name and therefore satisfy those dependencies. Unlike regular layer units, virtual layers are not versioned.

6.3.2.1 Dependency Example

Figure 6.2 shows how dependencies can affect a VLFS in practice. This VLFS has only one explicit layer, mysql-server, but 21 implicitly selected layers. The mysql-server layer itself has a number of direct dependencies, including Perl, TCP Wrappers, and the mailx program. These dependencies in turn depend on the Berkeley DB library and the GNU dbm library, among others. Using its dependency mechanism, Strata is able to automatically resolve all the other layers needed to create a complete file system by specifying just a single layer.

Returning to Figure 6.4, this example defines a subset of the layers that the mysql-server layer requires to be composed into the same VLFS to allow MySQL to run correctly. More generally, Figure 6.5 shows the complete syntax for the dependency metadata. Provides is the simplest, with only a comma-separated list of virtual layer names. Conflicts adds an optional version constraint to each conflicted layer to limit the layer units that are actually in conflict. Depends and Pre-Depends add a boolean OR of multiple layers in their dependency constraints to allow multiple layers to satisfy the dependency.

6.3.2.2 Resolving Dependencies

To allow an administrator to select only the layers explicitly desired within the VLFS, Strata automatically resolves dependencies to determine which other layers must be included implicitly. To allow dependency resolution, Strata first provides a database of all the available layer units' locations and metadata. The collection of layer units can be viewed as three sets: the set of layer units themselves, the set of dependency relations for each individual layer unit, and the set of conflict relations (C) that define which layer units cannot be integrated into the same file system. This collection can be viewed as a directed dependency graph connecting layer units to the layer units on which they depend. A layer unit can be integrated into the VLFS when two principles hold. First, there must be a set of layer units (I) that fulfills total closure of all the dependencies, that is, every layer unit in the set has every dependency filled. Second, I × I ∩ C = ∅ must hold, meaning that none of the layer units in I can conflict with each other. Determining when these principles hold is a problem that has been shown to be polynomial time reducible to 3-SAT [47, 139]. Because 3-SAT is NP-complete, this could be very difficult to solve naively, but an optimized Davis-Putnam SAT solver [52] can be used to solve it efficiently [47].
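The two principles can be checked directly for a candidate set of layer units. The sketch below validates a proposed set I against dependency closure and the conflict set C; it is only a checker over unversioned names, not the SAT-based resolution referred to above, and the helper names and example units are hypothetical.

def satisfies(candidate_set, provides, depends, conflicts):
    """Check the two integration principles for a set I of layer units.

    provides: unit -> set of names it satisfies (its layer name plus virtual layers)
    depends:  unit -> list of dependency names the unit requires
    conflicts: set of frozenset({a, b}) pairs that may not coexist (the set C)
    """
    # Principle 1: total closure -- every dependency of every unit in I
    # is satisfied by some unit in I.
    available = set()
    for unit in candidate_set:
        available |= provides[unit]
    for unit in candidate_set:
        for need in depends[unit]:
            if need not in available:
                return False
    # Principle 2: (I x I) intersect C is empty -- no two units in I conflict.
    for a in candidate_set:
        for b in candidate_set:
            if a != b and frozenset((a, b)) in conflicts:
                return False
    return True

# Example with hypothetical units: mysql-server requires perl and tcpd.
provides = {"mysql-server": {"mysql-server"}, "perl": {"perl"}, "tcpd": {"tcpd"}}
depends = {"mysql-server": ["perl", "tcpd"], "perl": [], "tcpd": []}
assert satisfies({"mysql-server", "perl", "tcpd"}, provides, depends, set())
assert not satisfies({"mysql-server", "perl"}, provides, depends, set())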
Even when a layer unit can be integrated into the VLFS, however, there will often be many sets of implicitly selected layer units that allow this. Strata therefore has to evaluate which of those sets is the best. Linux distributions already face this problem and tools have been developed to address it, such as Apt [36] and Smart [98]. Strata leverages Smart and adopts the same metadata database format that Debian uses for packages for its own layers, as Smart already knows how to parse it. When Smart is used with a regular Linux distribution, administrators request that it install or remove packages and Smart determines whether the operation can succeed and what is the best set of packages to add or remove to achieve that goal. In Strata, when an administrator requests that a layer be added to or removed from a template appliance, Smart also evaluates if the operation can succeed and what is the best set of layers to add or remove. Instead of acting directly on the contents of the file system, however, Strata only has to update the template’s LDF with the set of layers to be composed into the file system. Chapter 6. Strata: Managing Large Numbers of Machines 6.3.3 116 Layer Creation Strata allows layers to be created in two ways. First, .deb packages used by Debianderived distributions and the .rpm packages used by RedHat-derived distributions can be directly converted into layers. Strata converts packages into layers in two steps. First, the relevant metadata from the package is extracted, including its name and version. Second, the package’s file contents are extracted into a private directory that will be the layer’s file system components. When using converted packages, Strata leverages the underlying distribution’s tools to run the configuration scripts belonging to the newly created layers correctly. Instead of using the distribution’s tools to unpack the software package, Strata composes the layers together and uses the distribution’s tools as though the packages have already been unpacked. Although Strata is able to convert packages from different Linux distributions, it cannot mix and match them because they are generally ABI incompatible with one another. More commonly, Strata leverages existing packaging methodologies to simplify the creation of layers from scratch. In a traditional system, when administrators install a set of files, they copy the files into the correct places in the file system using the root of the file system tree as their starting point. For instance, an administrator might run make install to install a piece of software compiled on the local machine. In Strata, layer creation is a three-step process. First, instead of copying the files into the root of the local file system, the layer creator installs the files into their own specific directory tree. That is, they make a blank directory to hold a new file system tree that is created by having the make install copy the files into a tree rooted at that directory, instead of the actual file system root. Second, the layer maintainer extracts programs that integrate the files into the underlying file system and creates scripts that run when the layer is added to and Chapter 6. Strata: Managing Large Numbers of Machines 117 removed from the file system. Examples of this include integration with GNOME’s GConf configuration system, creation of encryption keys, or creation of new local users and groups for new services that are added. 
This leverages skills that package maintainers in a traditional package management world already have. Finally, the layer maintainer needs to set up the metadata correctly. Some elements of the metadata, such as the name of the layer and its version, are simple to set, but dependency information can be much harder. But because package management tools have already had to address this issue, Strata is able to leverage the tools they have built. For example, package management systems have created tools that infer dependencies using an executable dynamically linking against shared libraries [117]. Instead of requiring the layer maintainer to enumerate each shared library dependency, we can programmatically determine which shared libraries are required and populate the dependency fields based on those versions of the library currently installed on the system where the layer is being created. 6.3.4 Layer Repositories Strata provides local and remote layer repositories. Local layer repositories are provided by locally accessible file system shares made available by a SAN. They contain layer units to be composed into the VLFS. This is similar to a regular virtualization infrastructure in which all the virtual machines’ disks are stored on a shared SAN. Each layer unit is stored as its own directory; a local layer repository contains a set of directories, each of which corresponds to a layer unit. The local layer repository’s contents are enumerated in a database file providing a flat representation of the metadata of all the layer units present in the repository. The database file is used for making a list of what layers can be installed and their dependency information. By storing Chapter 6. Strata: Managing Large Numbers of Machines 118 the shared layer repository on the SAN, Strata lets layers be shared securely among different users’ appliances. Even if the machine hosting the VLFS is compromised, the read-only layers will stay secure, as the SAN will enforce the read-only semantic independently of the VLFS. Remote layer repositories are similar to local layer repositories, but are not accessible as file system shares. Instead, they are provided over the Internet, by protocols such as FTP and HTTP, and can be mirrored into a local layer repository. Instead of mirroring the entire remote repository, Strata allows on-demand mirroring, where all the layers provided by the remote repository are accessible to the VAs, but must be mirrored to the local mirror before they can be composed into a VLFS. This allows administrators to store only the needed layers while maintaining access to all the layers and updates that the repository provides. Administrators can also filter which layers should be available to prevent end users from using layers that violate administration policy. In general, an administrator will use these remote layer repositories to provide the majority of layers, much as administrators use a publicly managed package repository from a regular Linux distribution. Layer repositories let Strata operate within an enterprise environment by handling three distinct yet related issues. First, Strata has to ensure that not all end users have access to every layer available within the enterprise. For instance, administrators may want to restrict certain layers to certain end users for licensing, security or other corporate policy reasons. Second, as enterprises get larger, they gain levels of administration. 
Strata must support the creation of an enterprise-wide policy while also enabling small groups within the enterprise to provide more localized administration. Third, larger enterprises supporting multiple operating systems cannot rely on a single repository of layers because of inherent incompatibilities among operating systems. Chapter 6. Strata: Managing Large Numbers of Machines 119 By allowing a VLFS to use multiple repositories, Strata solves these three problems. First, multiple repositories let administrators compartmentalize layers according to the needs of their end users. By providing end users with access only to needed repositories, organizations prevent their end users from using the other layers. Second, by allowing sub-organizations to set up their own repositories, Strata lets a sub-organization’s administrator provide the layers that end users need without requiring intervention by administrators of global repositories. Finally, multiple repositories allow Strata to support multiple operating systems, as each distinct operating system has its own set of layer repositories. Strata supports multiple layer repositories by providing a directory of layer repositories that can contain multiple subdirectories, each of which serves as a mount point for a layer repository file system share, or as a location to store the layers themselves locally. This enables administrators to use regular file system share controls to determine which layer repositories users can access. 6.3.5 VLFS Composition To create a VLFS, Strata has to solve a number of file system-related problems. First, Strata has to support the ability to combine numerous distinct file system layers into a single static view. This is equivalent to installing software into a shared read-only file system. Second, because users expect to treat the VLFS as a normal file system, for instance, by creating and modifying files, Strata has to let VLFSs be fully modifiable. Similarly, users must also be able to delete files that exist on the read-only layer. By basing the VLFS on top of unioning file systems [102, 150], Strata solves all these problems. Unioning file systems join multiple layers into a single namespace. Unioning file systems have been extended to apply attributes such as read-only and Chapter 6. Strata: Managing Large Numbers of Machines 120 read-write to their layers. The VLFS leverages this property to force shared layers to be read-only, while the private layer remains read-write. If a file from a shared read-only layer is modified, it is copied-on-write (COW) to the private read-write layer before it is modified. For example, LiveCDs use this functionality to provide a modifiable file system on top of the read-only file system provided by the CD. Finally, unioning file systems use whiteouts to obscure files located on lower layers. For example, if a file located on a read-only layer is deleted, a whiteout file will be created on the private read-write layer. This file is interpreted specially by the file system and is not revealed to the user while also preventing the user from seeing files with the same name. However, Strata has to solve two additional problems. First, Strata must maintain the usage semantic that users can recover deleted system files by reinstalling or upgrading the layer that contains them. For example, in a traditional monolithic file system managed by a package management system, reinstalling a package will replace any files that might have been deleted. 
However, if the VLFS only used a traditional union file system, the whiteouts stored in the private layer would persist and continue to obscure the file even if the shared layer was replaced. To solve this problem, Strata provides a VLFS with additional writeable layers associated with each read-only shared layer. Instead of containing file data, as does the topmost private writeable layer, these layers just contain whiteout marks that will obscure files contained within their associated read-only layer. The user can delete a file located in a shared read-only layer, but the deletion only persists for the lifetime of that particular instance of the layer. When a layer is replaced during an upgrade or reinstall, a new empty whiteout layer will be associated with the replacement, thereby removing any preexisting whiteouts. In a similar way, Strata handles the case where a file belonging to a shared read-only layer is modified and therefore copied to the Chapter 6. Strata: Managing Large Numbers of Machines 121 VLFS’s private read-write layer. Strata provides a revert command that lets the owner of a file that has been modified revert the file to its original pristine state. While a regular VLFS unlink operation would have removed the modified file from the private layer and created a whiteout mark to obscure the original file, revert only removes the copy in the private layer, thereby revealing the original below it. Second, Strata supports adding and removing layers dynamically without taking the file system offline. This is equivalent to installing, removing or upgrading a software package while a monolithic file system is online. While some upgrades, specifically of the kernel, will require the VA to be rebooted, most should be able to occur without taking the VA offline. However, if a layer is removed from a union, its data is effectively removed as well because unions operate only on file system namespaces and not on the data the underlying files contain. If an administrator wants to remove a layer from the VLFS, they must take the VA offline, because layers cannot be removed while in use. To solve this problem, Strata emulates a traditional monolithic file system. When an administrator deletes a package containing files in use, the processes that are currently using those files will continue to work. This occurs by virtue of unlink’s semantic of first removing a file from the file system’s namespace, and only removing its data after the file is no longer in use. This lets processes continue to run because the files they need will not be removed until after the process terminates. This creates a semantic in which a currently running program can be using versions of files no longer available to other programs. Existing package managers use this semantic to allow a system to be upgraded online, and it is widely understood. Strata applies the same semantic to layers. When a layer is removed from a VLFS, Strata marks the layer as unlinked, removing it from the file system namespace. Although this layer is no longer part of the file system Chapter 6. Strata: Managing Large Numbers of Machines 122 namespace and thus cannot be used by any operations such as open that work on the namespace, it does remain part of the VLFS, enabling data operations such as read and write to continue working correctly for previously opened files. 6.4 Improving Appliance Security In today’s world, machines are continually attacked and administrators work hard to deflect the attacks. 
But even with an administrator’s best efforts, attacks still succeed from time to time. A main problem in dealing with possibly compromised machines is detecting whether they have indeed been compromised. Just because an attack is detected does not mean that the attacker was able to change the machine in a persistent way. Many administrators employ additional tools such as Tripwire [79] to aid in this effort, but this creates an added burden. There are extra tools and databases to be maintained and possibly neglected. This leaves the administrators not always knowing what, or if, the attacker modified. A clean reinstall is often the best option, but this causes two problems: downtime and lost data. Although an administrator can back up the system before it is reinstalled, this further adds to the time lost to repairs. To address these problems, Strata not only manages appliances, but also keeps them more secure, improves compromise detection, and makes it easier to fix compromised machines. Strata does this in three fundamental ways. First, many machines are exploited because they provide functionality that is not needed and therefore not maintained appropriately. Strata improves auditing by allowing an administrator to examine each VLFS configuration to determine if unneeded layers, and therefore pieces of software, are being included. As opposed to a traditional monolithic file system, where files can become hidden among their peers, a VLFS enables an Chapter 6. Strata: Managing Large Numbers of Machines 123 administrator to determine easily which layers are included and isolate file system modifications stored in the private read-write layer. Similarly, in the face of an attempted compromise, the VLFS lets an administrator determine quickly if the file system has been compromised simply by checking the file system’s private layer. Because any changes made to the file system cause a change to the private read-write layer, an administrator can see if any system binaries or libraries have been copied up to the private layer. If this has occurred, the administrator knows that the system has been maliciously modified. The attacker has no ability to modify the shared read-only layers because the layer repository’s file system share enforces the read-only access to the shared contents. To modify the contents in the shared layer repositories, an attacker would have to find a way to attack the file system share itself. Although the attacker can still modify the appliance’s file system, administrators can easily tell that this has happened by noticing the system files stored within the VLFS’s private read-write layer. Administrators can detect these modifications without relying on external databases that have to be maintained separately and updated whenever the file system is changed. Second, by leveraging Strata’s layer concept, an administrator can deploy fixes to all of the machines more quickly, without having to worry about machines not currently running or forgotten altogether. When a layer update is available to fix a security hole, an administrator needs only to import it into the local layer repository. Systems managed by Strata will detect that the layer repository has been updated and identify that updates are available for a layer that is being used in the local VLFS. Strata will automatically include the new layer into the VLFS’s namespace while removing the old one. Finally, with a VLFS, it is simple to recreate a fresh system. 
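Before turning to recovery, the detection step just described can be sketched concretely: because every change to a Strata-managed appliance lands in its private read-write layer, checking for tampering reduces to scanning that single layer for files that shadow or add to the paths supplied by the shared read-only layers. The directory layout, the notion of "system prefixes," and the function name below are assumptions for illustration, not part of Strata's tooling.

```python
import os

SYSTEM_PREFIXES = ("bin/", "sbin/", "lib/", "usr/")   # illustrative notion of "system paths"

def suspicious_private_entries(private_root, shared_paths):
    """Report files in the private read-write layer that shadow or add system files.

    private_root : directory backing the VLFS's private layer
    shared_paths : set of relative paths provided by the shared read-only layers
    """
    findings = []
    for dirpath, _, filenames in os.walk(private_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), private_root)
            if rel.startswith(SYSTEM_PREFIXES):
                reason = ("copied up and modified" if rel in shared_paths
                          else "new system file not provided by any layer")
                findings.append((rel, reason))
    return findings

# Any hit means a system binary or library was changed or injected after deployment,
# detected without maintaining an external checksum database.
```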
By replacing the compromised private layer with a fresh layer, the system is instantly cleaned. This is Chapter 6. Strata: Managing Large Numbers of Machines 124 equivalent to deploying a new virtual appliance, as the private layer is what distinguishes virtual appliance clones. As opposed to physical systems, where reinstalling the system can require overwriting the compromised system, cleaning a system with Strata does not require losing the contents of the compromised machine. Because cleaning the system does not require getting rid of the compromised private layer, an administrator need not waste time backing it up and can make it available within the appliance’s file system as a regular directory without it being composed into the normal file system view. This can puts the system back online quickly while also allowing easy import of data to be preserved from the compromised system. Quickly fixing compromised systems is useful, but often results in discarding the authorized configuration changes made to that system. Until now, we have described a single VLFS containing multiple read-only layers shared among appliances and one read-write layer containing the virtual appliance’s private data. But the appliance’s private data need not be limited to a single layer. An end user of a deployed appliance can create their own configuration layers to lock in whatever persistent configuration changes they desire. Regular configuration layers are read-only and shared between appliances, but this configuration layer is read-only and accessible only to the local appliance. In practice, the end user will initially create a VLFS as described above that has only one read-write layer for private data. Configuration changes are usually done at the outset and remain static for an extended period, so static configuration changes can be confined to this private layer. When the user is satisfied with the configuration, they convert the read-write private layer to a read-only configuration layer to lock it in, while adding a new private layer to contain the file system changes that occur during regular usage. If the machine’s configuration is corrupted due to system compromise or an administrator’s authorized changes, the user can quickly revert back to the locked down configuration, kept as it is on a read-only layer. Chapter 6. Strata: Managing Large Numbers of Machines 6.5 125 Experimental Results We have implemented Strata as a loadable kernel module on an unmodified Linux 2.6 series kernel. The loadable kernel module implements Strata’s VLFS as a stackable file system. We present experimental results using our Linux prototype to manage various VAs, demonstrating its ability to reduce management costs while incurring only modest performance overhead. Experiments were conducted on VMware ESX 3.0 running on an IBM BladeCenter with 14 IBM HS20 eServer blades with dual 3.06 GHz Intel Xeon CPUs, 2.5 GB RAM, and a Q-Logic Fibre Channel 2312 host bus adapter connected to an IBM ESS Shark SAN with 1 TB of disk space. The blades were connected by a gigabit Ethernet switch. This is a typical virtualization infrastructure in an enterprise computing environment where all virtual machines are centrally stored and run. We compare plain Linux VMs with a virtual block device stored on the SAN and formatted with the Ext3 file system to VMs managed by Strata with the layer repository also stored on the SAN. 
By storing both the plain VM's virtual block device and Strata's layers on the SAN, we eliminate any differences in performance due to hardware architecture. To measure management costs, we quantify the time taken by two common tasks, provisioning and updating VAs. We quantify the storage and time costs for provisioning many VAs and the performance overhead for running various benchmarks using the VAs. We ran experiments on five VAs: an Apache web server, a MySQL database server, a Samba file server, an SSH server providing remote access, and a remote desktop server providing a complete GNOME desktop environment. While the server VAs have relatively few layers, the desktop VA has very many layers. This enables the experiments to show how VLFS performance scales as the number of layers increases. To provide a basis for comparison, we provisioned these VAs using both the normal VMware virtualization infrastructure with plain Debian package management tools, and Strata. To make a conservative comparison to plain VAs and to test larger numbers of plain VAs in parallel, we minimized the disk usage of the VAs. The desktop VA used a 2 GB virtual disk, while all others used a 1 GB virtual disk.

6.5.1 Reducing Provisioning Times

Table 6.1 shows how long it takes Strata to provision VAs versus regular and COW copying. To provision a VA using Strata, Strata copies a default VMware VM with an empty sparse virtual disk and provides it with a unique MAC address. It then creates a symbolic link on the shared file system from a file named by the MAC address to the layer definition file that defines the configuration of the VA. When the VA boots, it accesses the file denoted by its MAC address, mounts the VLFS with the appropriate layers, and continues execution from within it; a sketch of this flow appears below. To provision a plain VA using regular methods, we use QEMU's qemu-img tool to create both raw copies and COW copies in the QCOW2 disk image format. Our measurements for all five VAs show that using COW copies and Strata takes about the same amount of time to provision VAs, while creating a raw image takes much longer. Creating a raw image for a VA takes 3 to almost 6 minutes and is dominated by the cost of copying data to create a new instance of the VA. For larger VAs, these provisioning times would only get worse. In contrast, Strata provisions VAs in only a few milliseconds because a null VMware VM has essentially no data to copy. Layers do not need to be copied, so copying overhead is essentially zero. While COW images can be created in a similar amount of time, they do not provide any of the management benefits of Strata, as each new COW image is independent of the base image from which it was created.

         Apache   MySQL    Samba    SSH      Desktop
Plain    184s     179s     183s     174s     355s
Strata   0.002s   0.002s   0.002s   0.002s   0.002s
QCOW2    0.003s   0.003s   0.003s   0.003s   0.003s

Table 6.1 – VA Provisioning Times
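The Strata provisioning flow just described amounts to copying an essentially empty VM plus creating one symbolic link. The sketch below illustrates it; the paths, the MAC-generation helper, and the by-mac link directory are invented for this example and do not reflect the actual prototype's layout.

```python
import os, subprocess, uuid

SHARED_FS = "/srv/strata"   # hypothetical shared file system holding templates and layer definitions

def random_mac():
    # locally administered MAC address; the generation scheme is purely illustrative
    return "02:" + ":".join(f"{b:02x}" for b in uuid.uuid4().bytes[:5])

def provision_va(layer_definition, template="null-vm"):
    """Provision a Strata-managed VA: copy an empty template VM, give it a unique
    MAC address, and point that MAC address at a layer definition file."""
    mac = random_mac()
    vm_dir = os.path.join(SHARED_FS, "vms", mac.replace(":", ""))
    # 1. copy the default VM with an empty sparse virtual disk (essentially no data to copy)
    subprocess.run(["cp", "-a", os.path.join(SHARED_FS, "templates", template), vm_dir], check=True)
    # 2. record the MAC -> configuration mapping as a symbolic link on the shared file system
    os.symlink(os.path.join(SHARED_FS, "definitions", layer_definition),
               os.path.join(SHARED_FS, "by-mac", mac))
    return mac

# At boot the VA reads the file named by its own MAC address, mounts a VLFS composed of
# the layers listed there, and continues execution from within it; no layers are copied.
```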
6.5.2 Reducing Update Times

Table 6.2 shows how long it takes to update VAs using Strata versus traditional package management. We provisioned ten VA instances each of Apache, MySQL, Samba, SSH and Desktop for a total of 50 provisioned VAs. All were kept in a suspended state. When a security patch [146] became available for the tar package, which was installed in all the VAs, we updated them. Strata simply updates the layer definition files of the VM templates, which it can do even when the VAs are not active. When a VA is later resumed during normal operation, it automatically checks whether its layer definition file has been updated and updates the VLFS namespace view accordingly, an operation that is measured in microseconds. To update a plain VA using normal package management tools, each VA instance must be resumed and must acquire a network address. An administrator or script must then ssh into each VA, fetch and install the updated packages, and finally re-suspend the VA.

         Wake     Network   Update   Suspend   Total
Plain    14.66s   43.72s    10.22s   3.96s     73.2s
Strata   NA       NA        1.041s   NA        1.041s

Table 6.2 – VA Update Times

Table 6.2 shows the average time to update each VA using traditional methods versus Strata. We break down the update time into the times to resume the VM, get access to the network, actually perform the update, and re-suspend the VA. The measurements show that the cost of performing an update is dominated by the management overhead of preparing the VAs to be updated. Preparation is itself dominated by getting an IP address and becoming accessible on a busy network. While this cost is not excessive on a quiet network, on a busy network it can take a significant amount of time for the client to get a DHCP address, and for the ARP tables on the machine controlling the update to find the target machine. In our test, the average total time to update each plain VA is about 73 seconds. In contrast, Strata takes only about a second to update each VA. As this is an order of magnitude shorter even than resuming the VA, Strata can defer the update until the VA is next resumed from standby, without impacting its ability to respond quickly. Strata provides over 70 times faster updates than traditional package management when managing even a modest number of VAs, and this advantage only grows as the number of VAs being managed increases.

6.5.3 Reducing Storage Costs

Figure 6.6 shows the total storage space required for different numbers of VAs stored with raw and COW disk images versus Strata. We show the total storage space for 1 Apache VA, for 5 VAs corresponding to an Apache, MySQL, Samba, SSH, and Desktop VA, and for 50 VAs corresponding to ten instances of each of the five VAs. As expected, for raw images, the total storage space required grows linearly with the number of VA instances. In contrast, the total storage space using COW disk images and Strata is relatively constant and largely independent of the number of VA instances. For one VA, the storage space required for the disk image is less than the storage space required for Strata, as the layer repository used contains more layers than those used by any one of the VAs. In fact, to run a single VA, the layer repository could be trimmed down to the same size as the traditional VA.

[Figure 6.6 – Storage Overhead: total storage in MB (log scale) for 1, 5 and 50 VMs, comparing plain VM disk images with Strata.]

For larger numbers of VAs, however, Strata provides a substantial reduction in the storage space required, because VAs share layers and do not require duplicate storage. For 50 VAs, Strata reduces the storage space required by an order of magnitude over the raw disk images. Table 6.3 shows that there is much duplication among the VAs, as the layer repository of 405 distinct layers needed to build the different VLFSs for multiple services is basically the same size as the largest service.
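A back-of-the-envelope calculation using only the figures already reported (1 GB virtual disks for the server VAs, 2 GB for the desktop VA, and a 1.8 GB layer repository) shows why the gap widens with scale. Per-VA private layers are assumed to be negligible and are ignored; the numbers below are illustrative, not additional measurements.

```python
server_disk_gb, desktop_disk_gb = 1, 2   # raw virtual disk sizes used in the experiments
repo_gb = 1.8                            # shared layer repository size (Table 6.3)

# 50 VAs = 10 instances each of Apache, MySQL, Samba and SSH (1 GB) plus 10 Desktop VAs (2 GB)
raw_total_gb = 40 * server_disk_gb + 10 * desktop_disk_gb   # 60 GB of duplicated raw images
strata_total_gb = repo_gb                                   # layers stored once and shared by every VA

print(raw_total_gb / strata_total_gb)   # roughly 30x, consistent with the order-of-magnitude gap in Figure 6.6
```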
Although initially Strata does not have a significant storage benefit over COW disk images, each COW disk image is independent of the image from which it was created and must therefore be managed independently. This increases storage usage over time, as the same updates must be applied separately to many independent disk images. While other mechanisms, such as deduplication, can help with storage usage, they increase overhead due to the effort required to find duplicates. Moreover, deduplication does not help with the management of the individual VAs, as updates still have to be applied to each system independently.

Layer repository (all layers): 1.8GB

           Size     # Layers   Shared    Unique
Apache     217MB    43         191MB     26MB
MySQL      206MB    23         162MB     44MB
Samba      169MB    30         152MB     17MB
SSH        127MB    12         123MB     4MB
Desktop    1.7GB    404        169MB     1.6GB

Table 6.3 – Layer Repository vs. Static VAs

6.5.4 Virtualization Overhead

To measure the virtualization cost of Strata's VLFS, we used a range of microbenchmarks and real application workloads to measure the performance of our Linux Strata prototype, then compared the results against vanilla Linux systems within a virtual machine. The virtual machine's local file system was formatted with the Ext3 file system and given read-only access to a SAN partition also formatted with Ext3. We performed all benchmarks in every scenario described above. To demonstrate the effect that Strata's VLFS has on system performance, we performed a number of benchmarks. Postmark [76] is a synthetic test that measures how the system would behave if used as a mail server. Our Postmark test operated on files between 512 bytes and 10 KB to simulate the mail server's spool directory, with an initial set of 20,000 files, and performed 200,000 transactions. Postmark is very intensive on a few specific file system operations such as lookup(), create() and unlink(), because it is constantly creating, opening and removing files. Figure 6.7 shows that running this benchmark within a traditional VA is significantly faster than running it in Strata. This is because Strata composes multiple file system namespaces together, which places significant overhead on namespace operations such as lookup().

[Figure 6.7 – Postmark Overhead in Multiple VAs: Postmark completion time in seconds for each VA, comparing a plain VM with Strata.]

To demonstrate that Postmark's results are not indicative of performance in real-life scenarios, we ran two application benchmarks to measure the overhead Strata imposes in desktop and server VA scenarios. First, we timed a multi-threaded build of the Linux 2.6.18.6 kernel with two concurrent jobs using the VM's two CPUs. In all scenarios, we added the layers required to build a kernel to the layers needed to provide the service, generally adding 8 additional layers to each case. Figure 6.8 shows that while Strata imposes a slight overhead on the kernel build compared to the underlying file system it uses, the cost is minimal, under 5% at worst.

[Figure 6.8 – Kernel Build Overhead in Multiple VAs: kernel build time in seconds for each VA, comparing a plain VM with Strata.]

Second, we measured the number of HTTP transactions completed per second by an Apache web server placed under load. We imported the database of a popular guitar tab search engine and used the http_load [108] benchmark to continuously perform a set of 20 search queries on the database for 60 seconds.
For each case that did not already contain Apache, we added the appropriate layers to the layer definition file to make Apache available. Figure 6.9 shows that Strata imposes a minimal overhead of only 5%.

[Figure 6.9 – Apache Overhead in Multiple VAs: HTTP fetches per second for each VA, comparing a plain VM with Strata.]

6.6 Related Work

The most common way to provision and maintain machines today is to use the package management system built into the operating system [4, 56]. Package managers view the file system into which they install packages as a simple container for files, not as a partner in the management of the machine. This causes them to suffer from a number of flaws when managing large numbers of VAs. They are not space- or time-efficient, as each provisioned VA needs an independent copy of each package's files and requires time-consuming copying of many megabytes or gigabytes into each VA's file system. These inefficiencies affect both provisioning and updating of a system, because a lot of time is spent downloading, extracting and installing the individual packages into the many independent VAs. Because the package manager does not work in partnership with the file system, the file system is unable to distinguish the different types of files it contains. A file installed from a package and a file modified or created in the course of usage are indistinguishable. Specialized tools are needed to traverse the entire file system to determine whether a file belongs to a package or was created or modified after the package was installed. For instance, to determine if a VA has been compromised, an administrator must determine whether any system files have been modified. Finally, package management systems work in the context of a running system and modify the file system directly. These standard tools often do not work outside the context of a running system, for example, for a VA that is suspended or turned off.

For local scenarios, the space and time efficiency of provisioning a VA can be improved by using copy-on-write (COW) disks, such as QEMU's QCOW2 [91] format. These enable VAs to be provisioned quickly, as little data has to be written to disk immediately due to the COW property. However, once provisioned, each COW copy is fully independent of the original, is equivalent to a regular copy, and therefore suffers from all the same maintenance problems as a regular VA. Even if the original disk image is updated, the changes would be incompatible with the cloned COW images. This is because COW disks operate at the block level. As files get modified, they use different blocks on their underlying device. Therefore, it is likely that the original and cloned COW images address the same blocks for different pieces of data. For similar reasons, COW disks do not help with VA creation, as multiple COW disks cannot be combined into a single disk image. Both the Collective [41] and Ventana [103] attempt to solve the VA maintenance problem by building upon COW concepts. Both systems enable VAs to be provisioned quickly by performing a COW copy of each VA's system file system.
However, they suffer from the fact that they manage this file system at either the block device or monolithic file system level, providing users with only a single file system. While ideally an administrator could supply a single homogeneous shared image for all users, in practice, users want access to many heterogeneous images that must be maintained independently and therefore increase the administrator’s work. The same is true for VAs provisioned by the end user, while they both enable the VAs to maintain a separate disk from the shared system disk that persists beyond upgrades. Mirage [121] attempts to improve the disk image sprawl problem by introducing a new storage format, the Mirage Index Format (MIF), to enumerate what files belong to a package. However, it does not help with the actual image sprawl in regard to machine maintenance, because each machine reconstituted by Mirage still has a fully independent file system, as each image has its own personal copy. Although each provisioned machine can be tracked, they are now independent entities and suffer from the same problems as a traditional VA. Stork [38] improves on package management for container-based systems by enabling containers to hard link to an underlying shared file system so that files are only Chapter 6. Strata: Managing Large Numbers of Machines 135 stored once across all containers. By design, it cannot help with managing independent machines, virtual machines, or VAs, because hard links are a function internal to a specific file system and not usable between separate file systems. Union file systems [102, 150] provide the ability to compose multiple different file system namespaces into a single namespace view. Unioning file systems are commonly used to provide a COW file system from a read-only copy, such as with LiveCDs. However, unioning file system by themselves do not directly help with VA management, as the underlying file system has to be maintained using regular tools. Strata builds upon and leverages this mechanism by improving its ability to handle deleted files as well as managing the layers that belong to the union. This allows Strata to provide a solution that enables efficient provisioning and management of VAs. Chapter 7 Apiary: A Desktop of Isolated Applications In today’s world of highly connected computers, desktop security and privacy are major issues. Desktop users interact constantly with untrusted data they receive from the Internet by visiting new websites, downloading files and emailing strangers. All these activities use information whose safety the user cannot verify. Data can be constructed maliciously to exploit bugs and vulnerabilities in applications, enabling attackers to take control of users’ desktops. For example, a major flaw was recently discovered in Adobe Acrobat products that enables an attacker to take control of a desktop when a maliciously constructed PDF file is viewed [18]. Adobe’s estimate to release a fix was nearly a month after the exploit was released into the wild. Even in the absence of bugs, untrusted data can be constructed to invade users’ privacy. For example, cookies are often stored when visiting websites that allow advertisers to track user behavior across multiple websites. The prevalence of untrusted data and buggy software makes application fault containment increasingly important. Many approaches have been proposed to isolate Chapter 7. 
Apiary: A Desktop of Isolated Applications 137 applications from one another using mechanisms such as process containers [7, 116] or virtual machines [147]. For instance, in Chapter 5, we introduced PeaPod to leverage process containers to isolate the components of a single application. Faults are confined so that if an application is compromised, only that application and the data it can access are available to an attacker. By having only one application percontainer, each individual container becomes a simpler system, making it easier to determine if unwanted processes are running within it. However, existing approaches to isolating applications suffer from an unresolved tension between ease of use and degree of fault containment. Some approaches [72,92] provide an integrated desktop feel but only partial isolation. They are relatively easy to use, but do not prevent vulnerable applications from compromising the system itself. Other approaches [122, 143] have less of an integrated desktop feel but fully isolate applications into distinct environments, typically by using separate virtual machines. These approaches effectively limit the impact of compromised applications, but are harder to use because users are forced to manage multiple desktops. Virtual machine (VM) approaches also require managing multiple machine instances and incur high overhead to support multiple operating system instances, making them too expensive to allow more than a couple of fault containment units per-desktop. To address these problems, we introduce Apiary, which provides strong isolation for robust application fault containment while retaining the integrated look, feel and ease of use of a traditional desktop environment. Apiary accomplishes this by using well-understood technologies like thin clients, operating system containers and unioning file systems in novel ways. It does this using three key mechanisms. First, it decomposes a desktop’s applications into isolated containers. Each container is an independent software appliance that provides all system services an application needs to execute. To retain traditional desktop semantics, Apiary integrates Chapter 7. Apiary: A Desktop of Isolated Applications 138 these containers in a controlled manner at the display and file system. Apiary’s containers prevent an exploit from compromising the user’s other applications. For example, by having separate web browser and personal finance containers, any compromise from web browsing would not be able to access personal financial information. At the same time, Apiary makes the web browser and personal finance containers look and feel like part of the same integrated desktop, with all normal windowing functions and cut-and-paste operations operating seamlessly across containers. Second, it introduces the concept of ephemeral containers. Ephemeral containers are execution environments with no access to user data that are quickly instantiated from a clean state for only a single application execution. When the application terminates, the container is archived, but never used again. Apiary uses ephemeral containers as a fundamental building block of the integrated desktop experience while preventing contamination across containers. For example, users often expect to view PDF documents from the web, but need separate web browser and PDF viewer containers for fault containment. 
If a user always views PDF documents in the same PDF viewer container, a single malicious document could exploit the container and have access to future documents the user wants to keep private, like bills and bank statements. Instead, Apiary enables the web browser to automatically instantiate a new ephemeral PDF viewer container for each individual PDF document. Even if the PDF file is malicious, it will have no effect on other PDF files because the container instance it exploited will never be used again. As illustrated by this PDF example, ephemeral containers have three benefits. First, they prevent compromises, because exploits, even if triggered, cannot persist. Second, they protect users from compromised applications. Even when an application has been compromised, a new ephemeral container running that application in parallel will remain uncompromised because it is guaranteed to start from a clean state. Chapter 7. Apiary: A Desktop of Isolated Applications 139 Third, they help protect user privacy when using the Internet. For example, while cookies must be accepted to use many websites, web browsers in separate ephemeral containers can be used for different websites to prevent cookies from tracking user behavior across websites. Apiary’s third mechanism is Strata’s VLFS. Apiary leverages the VLFS to allow the many application containers used in Apiary to be efficiently stored and instantiated. Since each container’s VLFS will share the layers that are common to them, Apiary’s storage requirements are the same as a traditional desktop. Similarly, since no data has to be copied to create a new VLFS instance, Apiary is able to quickly instantiate ephemeral containers for a single application execution. Apiary’s approach differs markedly from the approach taken by PeaPod in Chapter 5. In PeaPod, we isolate the different process components of a single larger application, such as an email server. These applications contain processes that require access to large amounts of the same data, but with differing levels of privilege and therefore they cannot be fully isolated. Furthermore, in many of these applications, the security model is well understood and therefore simple sets of rules can be created to isolate each component. However, desktop security is much more complicated. As can be seen in Chapter 5.4.3, just isolating one small portion of the desktop involved the creation of the largest set of rules. In Apiary, we enable the isolation of desktop applications without any rules. 7.1 Apiary Usage Model Figure 7.1 shows the Apiary desktop. It looks and works like a regular desktop. Users launch programs from a menu or from within other programs, switch among launched programs using a taskbar, interact with running programs using the keyboard and Chapter 7. Apiary: A Desktop of Isolated Applications 140 Figure 7.1 – Apiary screenshot showing a desktop session. At the the topmost left is (1), an application menu that provides access to all available applications. Just below it, the window list (2) allows users to easily switch among running applications. (3) is the composite display view of all the visible running applications. mouse, and have a single display with an integrated window system and clipboard functionality that contains all running programs. Although Apiary provides a look and feel similar to a regular desktop, it provides fault containment by isolating applications into separate containers. Containers enforce isolation so that applications running inside cannot get out. 
Apiary isolates individual applications, not individual programs. An application in Apiary can be understood as a software appliance made up of multiple programs used together in a single environment to accomplish a specific task. For instance, a user’s web browser and word processor would be considered separate applications and isolated from one another. The software appliance model means that users can install separate isolated Chapter 7. Apiary: A Desktop of Isolated Applications 141 applications containing many or all of the same programs, but used for different purposes. For example, a banking application contains a web browser for accessing a bank’s website, while a web surfing application also contains a web browser, but for general web browsing. Both appliances make use of the same web browser program, but are listed as different applications in the application menu. Apiary provides two types of containers: ephemeral and persistent. Ephemeral containers are created fresh for each application execution. Persistent containers, like a traditional desktop, maintain their state across application executions. Apiary lets users select whether an application should launch within an ephemeral or a persistent container. Windows belonging to ephemeral applications are, by default, given distinct border colors so that users can quickly identify based on appearance in which mode an application is executing. Ephemeral containers provide a powerful mechanism for protecting desktop security and user privacy when running common desktop operations, such as viewing untrusted data, that do not require storing persistent states. Users will typically run multiple ephemeral containers at the same time, and, in some cases, multiple ephemeral containers for the same application at the same time. They provide important benefits for a wide range of uses. Ephemeral containers prevent compromises because exploits cannot persist. For example, a malicious PDF document that exploits an ephemeral PDF viewer will have no persistent effect on the system because the exploit is isolated in the container and will disappear when the container finishes executing. Ephemeral containers protect user privacy when using the Internet. For example, many websites require cookies to function, but also store advertisers’ cookies to track user behavior across websites and compromise privacy. Apiary makes it easy to use multiple ephemeral web browser containers simultaneously, each with separate Chapter 7. Apiary: A Desktop of Isolated Applications 142 cookies, making it harder to track users across websites. Ephemeral containers protect users from compromises that may have already occurred on their desktop. If a web browser has been compromised, parallel and future uses of the web browser will allow an attacker to steal sensitive information when the user accesses important websites (e.g., for banking). Ephemeral containers are guaranteed to launch from a clean slate. By using a separate ephemeral web browser container for accessing a banking site, Apiary ensures that an already exploited web browser installation cannot compromise user privacy. Ephemeral containers allow applications to launch other applications safely. For example, users often receive email attachments such as PDF documents that they wish to view. To avoid compromising an email container, Apiary creates a separate ephemeral PDF viewer container for the PDF. 
Even if it is malicious, it will have no effect on the user’s desktop, as it only affects the isolated ephemeral container. Similarly, ephemeral word processor or spreadsheet containers will be created for viewing these email attachments to prevent malicious files from compromising the system. In general, Apiary allows applications to cause other applications to be safely launched in ephemeral containers by default to support scenarios that involve multiple applications. Isolated persistent containers are necessary for applications that maintain state across executions to prevent a single application compromise from affecting the entire system. Users typically run one persistent container per-application to avoid needing to track which persistent application container contains which persistent information. Some applications only run in persistent containers, while others may run in both types of containers. For example, an email application is typically used in a persistent container to maintain email state across executions. On the other hand, a web browser will be used both in a persistent container, to access a user’s trusted websites, and in Chapter 7. Apiary: A Desktop of Isolated Applications 143 an ephemeral container, to view untrusted websites. Similarly, a browser may be used in a persistent container to remember browsing history, plugins and bookmarks, but may also be used in an ephemeral container when accessing untrusted websites. Note that files stored in both kinds of containers are private by default and not accessible outside their container. Apiary’s containers work together to provide a security system that differs fundamentally from common security schemes that attempt to lock down applications within a restricted-privilege environment. In Apiary, each application container is an independent entity that is entirely isolated from every other application container on the Apiary desktop. One does not have to apply any security analysis or complex isolation rules to determine which files a specific application should be able to access. Also, in most other schemes, an application, once exploited, will continue to be exploited, even if the exploited application is restricted from accessing other applications’ data. Apiary’s ephemeral containers, however, prevent an exploit from persisting between application execution instances. Apiary provides every desktop with two ways to share files between containers. First, containers can use standard file system share concepts to create directories that can be seen by multiple containers. This has the benefit of allowing any data stored in the shared directory to be automatically available to the other containers that have access to the share. Second, Apiary supplies every desktop with a special persistent container with a file explorer. The explorer has access to all of the user’s containers and can manage all of the user’s files, including copying them between containers. This is useful if a user decides they want to preserve a file from an ephemeral container, or move a file from one persistent container to another, as, for instance, when emailing a set of files. The file explorer container cannot be used in an ephemeral manner, its functionality cannot be invoked by any other application on Chapter 7. Apiary: A Desktop of Isolated Applications 144 the system, and no other application is allowed to execute within it. This prevents an exploited container from using the file explorer container to corrupt others. 
Note that both of these mechanisms break the isolation barrier that exists between containers. File system shares can be used by an exploited container as a vector to infect other containers, while a user can be tricked into moving a malicious file between containers. However, this is a tension that will always exist in security systems that are meant to be usable by a diverse crowd of users. 7.2 Apiary Architecture To support its container model, Apiary must have four capabilities. First, Apiary must be able to run applications within secure containers to provide application isolation. Second, Apiary must provide a single integrated display view of all running applications. Third, Apiary must be able to instantiate individual containers quickly and efficiently. Finally, for a cohesive desktop experience, Apiary must allow applications in different containers to interact in a controlled manner. Apiary does this by using a virtualization architecture that consists of three main components: an operating system container that provides a virtual execution environment, a virtual display system that provides a virtual display server and viewer and the VLFS. Additionally, Apiary provides a desktop daemon that runs on the host. This daemon instantiates containers, manages their lifetimes and ensures that they are correctly integrated. 7.2.1 Process Container Apiary’s containers are essential to Apiary’s ability to isolate applications from one another. By providing isolated containers, individual applications can run in parallel Chapter 7. Apiary: A Desktop of Isolated Applications 145 within separate containers, and have no conception that there are other applications running. This enforces fault containment, as an exploited process will only have access to whatever files are available within its own container. Apiary’s containers leverage features such as Solaris’s zones [116], FreeBSD’s jails [74] and Linux’s containers [6] to create isolated execution environments. Each container has its own private kernel namespace, file system and display server, providing isolation at the process, file system and display levels. Programs within separate containers can only interact using normal network communication mechanisms. In addition, each container has an application control daemon that enables the virtual display viewer to query the container for its contents and interact with it. 7.2.2 Display Apiary’s virtual display system is crucial to complete process isolation and a cohesive desktop experience. If containers were to share a single display directly, malicious applications could leverage built-in mechanisms in commodity display architectures [61, 93] to insert events and messages into other applications that share the display, enabling the malicious application to remotely control the others, effectively exploiting them as well. Many existing commodity security systems do not isolate applications at the display level, providing an easy vector for attackers to further exploit applications on the desktop. But although independent displays isolate the applications from one another, they do not provide the single cohesive display users expect. This cohesive display has two elements. First, the display views have to be integrated into a single view. Second, Apiary has to provide the normal desktop metaphors that users want, including a single menu structure for launching applications and an integrated task switcher that Chapter 7. 
Apiary: A Desktop of Isolated Applications 146 allows the user to switch among all running applications. Apiary’s virtual display system incorporates both of these elements. First, Apiary’s virtual display provides each container with its own virtual display similar to existing systems [14, 27, 51, 140]. This virtual display operates by decoupling the display state from the underlying hardware and enabling the display output to be redirected anywhere. Second, Apiary enables these independent displays to be integrated into a single display view. While a regular remote framework provides all the information needed to display each desktop, it assumes that there is no other display in use, and therefore expects to be able to draw the entire display area. In Apiary, where multiple containers are in use, this assumption does not hold. Therefore, to enable multiple displays to be integrated into a single view, the Apiary viewer composes the display together using the Porter-Duff [107] over operation. Apiary’s viewer provides an integrated menu system that lists all the applications users are able to launch. Apiary leverages the application control daemon running within each container to enumerate all the applications within the container, much like a regular menu in a traditional desktop. Instead of providing the menu directly in the screen, however, it transmits the collected data back to the viewer, which then integrates this information into its own menu, associating the menu entry with the container it came from. When a user selects a program from the viewer’s menu, the viewer instructs the correct daemon to execute it within its container. Similarly, to manage running applications effectively, Apiary provides a single taskbar with which the user can switch between all applications running within the integrated desktop. Apiary leverages the system’s ability to enumerate windows and switch applications [63] by having the daemon enumerate all the windows provided by its container and transmit this information to the viewer. The viewer then integrates Chapter 7. Apiary: A Desktop of Isolated Applications 147 this information into a single taskbar with buttons corresponding to application windows. When the user switches windows using the taskbar, the viewer communicates with the daemon and instructs it to bring the correct window to the foreground. Note that by stacking the independent displays, the windowing semantic is changed slightly from a traditional desktop. In a traditional desktop, when one brings a window to the foreground, only that window will be brought up. In Apiary, each display can feature multiple windows, each of which can be raised to the foreground. However, in Apiary, bringing up a window also brings its entire display layer to the foreground. Consequently, all other windows in the display will be raised above the windows provided by all other displays. 7.2.3 File System Apiary requires containers to be efficient in storage space and instantiation time. Containers must be storage-efficient to allow regular desktops to support the large number of application containers used within the Apiary desktop. Containers must be efficiently instantiated to provide fast interactive response time, especially for launching ephemeral containers. Both of these requirements are difficult to meet using traditional independent file systems for each container. 
Each container’s file system would be using its own storage space, which would be inefficient for a large number of containers, as it means many duplicated files. More important, the desktop becomes much harder to maintain because each independent file system must be updated individually. Similarly, instantiating the container requires copying the file system, which can include many megabytes or gigabytes of storage space. Copying time prevents the container from being instantiated quickly. Although file systems that support a branching semantic [32, 103] can be used to quickly provision a new container’s file Chapter 7. Apiary: A Desktop of Isolated Applications 148 system from a template image, each template image will still be independent and therefore inefficient with regard to space, maintenance and upgrades. Apiary leverages Strata’s Virtual Layered File System to meet these requirements. The VLFS enables file systems to be created by composing layers together into a single file system namespace view. VLFSs are built by combining a set of shared software layers together in a read-only manner with a per-container private read-write layer. Multiple VLFSs providing multiple applications are as efficient as a single regular file system because all common files are stored only once in the set of shared layers. Therefore, Apiary is able to store efficiently the file systems its containers need. This also allows Apiary to manage its containers easily. To update every VLFS that uses a particular layer, the administrator need only replace the single layer containing the files that need updating. The VLFS also lets Apiary instantiate each container’s file system efficiently. No data has to be copied into place because each of the software layers is shared in a read-only manner. The instantiation is transparent to the end user and nearly instantaneous. 7.2.4 Inter-Application Integration Apiary provides independent containers for fault containment, but must also ensure that they do not limit effective use of the desktop. For instance, if Firefox is totally isolated from the PDF viewer, how does one view a PDF file? The PDF viewer could be included within the Firefox container, but this violates the isolation that should exist between Firefox and an application viewing untrusted content. Similarly, users could copy the file from the Firefox container to the PDF viewer container, but this is not the integrated feel that users expect. Apiary solves this problem by enabling applications to execute specific applications Chapter 7. Apiary: A Desktop of Isolated Applications 149 in new ephemeral containers. Every application used within Apiary is preconfigured with a list of programs that it enables other applications to use in an ephemeral manner. Apiary refers to these as global programs. For instance, a Firefox container can specify /usr/bin/firefox and a Xpdf container can specify /usr/bin/xpdf as global programs. Program paths marked global exist in all containers. Apiary accomplishes this by populating a single global layer, shared by all the container’s VLFSs, with a wrapper program for each global program. This wrapper program is used to instantiate a new ephemeral container and execute the requested program within it. Apiary only allows for the execution in a new ephemeral container and not in a preexisting persistent or ephemeral container, as that would break Apiary isolation constraints and cannot be done without risk to the preexisting container. 
When executed, the wrapper program determines how it was executed and what options were passed to it. It connects over the network to the Apiary desktop daemon on the same host and passes this information to it. The daemon maintains a mapping of global programs to containers and determines which container is being requested to be instantiated ephemerally. This ensures that only the specified global programs’ containers will be instantiated, preventing an attacker from instantiating and executing arbitrary programs. Apiary is then able to instantiate the correct fresh ephemeral container, along with all the required desktop services, including a display server. The display server is then automatically connected to the viewer. Finally, the daemon executes the program as it was initially called in the new container. To ensure that ephemeral containers are discarded when no longer needed, Apiary’s desktop daemon monitors the process executed within the container. When it terminates, Apiary terminates the container. Similarly, as the Apiary viewer knows which containers are providing windows to it, if it determines that no more windows are being provided by the container, it instructs the desktop daemon to terminate Chapter 7. Apiary: A Desktop of Isolated Applications 150 the container. This ensures that an exploited process does not continue running in the background. Merely running a new program in a fresh container, however, is not enough to integrate applications correctly. When Firefox downloads a PDF and executes a PDF viewer, it must enable the viewer to view the file. This will fail because Firefox and ephemeral PDF viewer containers do not share the same file system. To enable this functionality, Apiary enables small private read-only file shares between a parent container and the child ephemeral container it instantiates. Because well-behaved applications such as Firefox, Thunderbird and OpenOffice only use the system’s temporary file directory to pass files among them, Apiary restricts this automatic file sharing ability to files located under /tmp. To ensure that there are no namespace conflicts between containers, Apiary provides containers with their own private directory under /tmp to use for temporary files, and they are preconfigured to use that directory as their temporary file directory. But providing a fully shared temporary file directory allows an exploited container to access private files that are placed there when passed to an ephemeral container. For instance, if a user downloads a malicious PDF and a bank statement in close succession, they will both exist in the temporary file directory at the same time. To prevent this, Apiary provides a special file system that enhances the read-only shares with an access control list (ACL) that determines which containers can access which files. By default, these directories will appear empty to the rest of the containers, as they do not have access to any of the files. This prevents an exploited container from accessing data not explicitly given to it. A file will only be visible within the directories if the Apiary desktop daemon instructs the file system to reveal that file by adding the container to the file’s ACL. This occurs when a global program’s wrapper is executed and the daemon determines that a file was passed to it as an option. The Chapter 7. Apiary: A Desktop of Isolated Applications 151 daemon then adds the ephemeral container to the file’s ACL. 
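The wrapper-to-daemon handshake described above can be summarized in a short sketch. The socket path, message format, and helper names (instantiate_ephemeral, reveal_file, run_in) are invented for illustration; they are not the actual prototype's interfaces.

```python
#!/usr/bin/env python3
"""Illustrative global-program wrapper: installed at, e.g., /usr/bin/xpdf on the shared
global layer of every container, it forwards the invocation to the Apiary desktop daemon
instead of running the program locally."""
import json, os, socket, sys

DAEMON_SOCKET = "/var/run/apiary.sock"        # assumed rendezvous point with the desktop daemon

def main():
    request = {
        "program": os.path.basename(sys.argv[0]),   # which global program was invoked
        "args": sys.argv[1:],                       # e.g. a /tmp path to a downloaded PDF
        "caller": os.environ.get("APIARY_CONTAINER", "unknown"),
    }
    with socket.socket(socket.AF_UNIX) as s:
        s.connect(DAEMON_SOCKET)
        s.sendall(json.dumps(request).encode())

# Daemon side (sketch): map the program to its application container, start a fresh
# ephemeral instance, reveal only the file that was passed, then run the program there.
def handle_request(req, program_to_container, instantiate_ephemeral, reveal_file, run_in):
    container = program_to_container[req["program"]]    # only registered global programs allowed
    instance = instantiate_ephemeral(container)          # fresh VLFS and display server, no copying
    for arg in req["args"]:
        if arg.startswith("/tmp/"):
            reveal_file(instance, arg)                   # add the ephemeral container to the file's ACL
    run_in(instance, req["program"], req["args"])

if __name__ == "__main__":
    main()
```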
Because the directory structure is consistent between containers, simply executing the requested program in the new ephemeral container with the same options is sufficient. Apiary enables the file explorer container discussed in Section 7.1 in a similar way. The file explorer container is set up like all other containers in Apiary. It is fully isolated from the rest of the containers and users interact with it via the regular display viewer. It differs from the rest of the containers in that other containers are not fully isolated from it. This is necessary as users can store their files in multiple locations, most notably, the container’s /tmp directory and the user’s home directory. Apiary’s file explorer provides read-write access to each of these areas as a file share within the file explorer’s FS namespace. Apiary prevents any executable located within these file systems from executing with the file explorer container to prevent malicious programs from exploiting it. Users are able to use normal copy/paste semantics to move files among containers. While this is more involved than a normal desktop with only a single namespace, users generally do not have to move files among containers. The primary situation in which users might desire to move files between containers is when interacting with an ephemeral container, as a user might want to preserve a file from there. For instance, a user can run their web browser in an ephemeral container to maintain privacy, but also download a file they want to keep. While the ephemeral container is active, a user can just use the file explorer to view all active containers. To avoid situations where the user only remembers after terminating the ephemeral container that it had files they wanted to keep, Apiary archives all newly created or modified non-hidden files that are accessible to the file explorer when the ephemeral container terminates. This allows a user to gain access to them even after the ephemeral container has terminated. Apiary automatically trims this archive if Chapter 7. Apiary: A Desktop of Isolated Applications 152 no visible data was stored within the ephemeral container, such as in the case of an ephemeral web browser that the user only used to view a web page, and did not save a specific file. Similarly, Apiary provides the user the ability to trim the archive to remove ephemeral container archives that do not contain data they need. Apiary also turns the desktop viewer into an inter-process communication (IPC) proxy that can enable IPC states to be shared among containers in a controlled and secure manner. This means that only an explicitly allowed IPC state is shared. For example, one of the most basic ways desktop applications share state is via the shared desktop clipboard. To handle the clipboard, each container’s desktop daemon monitors the clipboard for changes. Whenever a change is made to one container’s clipboard, this update is sent to the Apiary viewer and then propagated to all the other containers. The Apiary viewer also keeps a copy of the clipboard so that any future container can be initialized with the current clipboard state. This enables users to continue to use the clipboard with applications in different containers in a manner consistent with a traditional desktop. This model can be extended to other IPC states and operations. 7.3 Experimental Results We have implemented a remote desktop Apiary prototype system for Linux desktop environments. 
The prototype consists of a virtual display driver for the X window system that provides a virtual display for individual containers based on MetaVNC [140], a set of user space utilities that enable container integration and a loadable kernel module for the Linux 2.6 kernel that provides the ability to create and mount VLFSs. Apiary uses a Linux container-like mechanism to provide the isolated containers [100] and the VLFS. Chapter 7. Apiary: A Desktop of Isolated Applications 153 Our prototype’s VLFS layer repository contained 214 layers created by converting the set of Debian packages needed by the set of applications we tested into individual layers. Using these layers, we are able to create per-application appliances for each individual application by simply selecting which high level applications we want within the appliance, such as Firefox, with the dependencies between the layers ensuring that all the required layers are included. Using these appliances, we are able to instantly provision persistent and ephemeral containers for the applications as needed. Using this prototype, we used real exploits to evaluate Apiary’s ability to contain and recover from attacks. We conducted a user study to evaluate Apiary’s ease of use compared to a traditional desktop. We also measured Apiary’s performance with real applications in terms of runtime overhead, startup time and storage efficiency. For our experiments, we compared a plain Linux desktop with common applications installed to an Apiary desktop that has applications available to be used in persistent and ephemeral containers. The applications we used are the Pidgin instant messenger, the Firefox web browser, the Thunderbird email client, the OpenOffice.org office suite, the MPlayer media player and the Xpdf PDF viewing program. Experiments were conducted on an IBM HS20 eServer blade with dual 3.06 GHz Intel Xeon CPUs and 2.5 GB RAM. All desktop application execution occurred on the blade. Participants in the usage study connected to the blade via a Thinkpad T42p laptop with a 1.8 GHz Intel Pentium-M CPU and 2GB of RAM running the MetaVNC viewer. 7.3.1 Handling Exploits We tested two scenarios that illustrate Apiary’s ability to contain and recover from a desktop application exploit, as well as explore how different decisions can affect the security of Apiary’s containers. Chapter 7. Apiary: A Desktop of Isolated Applications 7.3.1.1 154 Malicious Files Many desktop applications have been shown to be vulnerable to maliciously created files that enable an attacker to subvert the target machine and destroy the data. These attacks are prevalent on the Internet, as many users will download and view whatever files are sent to them. To demonstrate this problem, we use two malicious files [62, 64] that exploit old versions of Xpdf and mpg123 respectively. The mpg123 program was stored within the MPlayer container. The mpg123 exploit works by creating an invalid mp3 file that triggers a buffer overflow in old versions of mpg123, enabling the exploit to execute any program it desires. The Xpdf exploit works by exploiting a behavior of how Xpdf launched helper programs, that is, by passing a string to sh -c. By including a back-tick (‘ ‘) string within a URL embedded in the PDF file, an attacker could get Xpdf to launch unknown programs. Both of these exploits are able to leverage sudo to perform privileged tasks, in this case, deleting the entire file system. 
Sudo is exploited because popular distributions require users to use it to gain root privileges and have it configured to run any applications. Additionally, sudo, by default, caches the user’s credentials to avoid needing to authenticate the user each time it needs to perform a privileged action. However, this enables local exploits to leverage the cached credentials to gain root privileges. In the plain Linux system, recovering from these exploits required us to spend a significant amount of time reinstalling the system from scratch, as we had to install many individual programs, not just the one that was exploited. Additionally, we had to recover a user’s 23GB home directory from backup. Reinstalling a basic Debian installation took 19 minutes. However, reinstalling the complete desktop environment took a total of 50 minutes. Recovering the user’s home directory, which included multimedia files, research papers, email and many other assorted files, took Chapter 7. Apiary: A Desktop of Isolated Applications 155 an additional 88 minutes when transferred over a Gbps LAN. Apiary protected the desktop and enabled easier recovery. It protected the desktop by letting the malicious files be viewed within an ephemeral container. Even though the exploit proceeded as expected and deleted the container’s entire file system, the damage it caused is invisible to the user, because that ephemeral container was never to be used again. Even when we permitted the exploit to execute within a persistent container, Apiary enabled significantly easier recovery from the exploit. As shown in Table 7.2, Apiary can provision a file system in just a few milliseconds. This is nearly 6 orders of magnitude faster than the traditional method of recovering a system by reinstallation. Furthermore, Apiary’s persistent containers divide up home directory content between them, eliminating the need to recover the entire home directory if one application is exploited. This also shows how persistent containers can be constructed in a more secure manner to prevent exploits from harming the user. As a large amount of the above user’s data, such as media files, is only accessed in a read-only manner, the data can be stored on file system shares. This enables the user to allow the different containers to have different levels of access to the share. The file explorer container can access it in a read-write manner, enabling a user to manage the contents of the file system share, while the actual applications that view these files can be restricted to accessing them in a read-only manner, protecting the files from exploits. 7.3.1.2 Malicious Plugins Applications are also exploited via malware that users are tricked into downloading and installing. This can be an independent program or a plugin that integrates with an already-installed application. For example, malicious attackers can try to convince users to download a “codec” they need to view a video. Recently, a malicious Firefox Chapter 7. Apiary: A Desktop of Isolated Applications 156 extension was discovered [31] that leverages Firefox’s extension and plugin mechanism to extract a user’s banking username and password from the browser when the user visits their bank’s website and sends the information to the attacker. These attacks are common because users are badly conditioned to allow a browser to install what it needs when it asks to install something. In a traditional environment, this malicious extension persists until its discovered and removed. 
As it does not affect regular use of the browser, there is very little to alert users that they have been attacked. As this exploit is not readily available to the public, we simulated its presence with the non-malicious Greasemonkey Firefox extension. Much like the malicious file example, Apiary prevented the extension from persisting when installed into an ephemeral container. Even when a user allowed the installation of the extension, it did not persist to future executions of Firefox. However, this exploit poses a significant risk if it enters the user’s persistent web browser container. While one might expect Firefox extensions to be uninstallable through Firefox’s extension manager, this is only true of extensions that are installed through it. If an extension is installed directly into the file system, it cannot be uninstalled this way. Although it can be disabled, it must later be removed from the file system. This applies equally to Apiary and traditional machines. While users can quickly recreate the entire persistent Firefox container, that requires knowing that the installation was exploited. Apiary handles this situation more elegantly by allowing the user to use Firefox in multiple web browsing containers. In this case, we created a general-purpose web browsing container for regular use, as well as a financial web browsing container for the bank website only. Apiary refused to install any addons in the financial web browsing container, keeping it isolated and secure even when the general-purpose web browsing container was compromised. Apiary enables the creation of multiple independent application containers, each Chapter 7. Apiary: A Desktop of Isolated Applications 157 containing the same application, but performing different tasks, such as visiting a bank website. Because the great majority of the VLFS’s layers are shared, the user incurs very little cost for these multiple independent containers. This approach can be extended to other related but independent tasks, for instance, using a media player to listen to one’s personal collection of music, as opposed to listening to Internet radio from an untrusted source. This scenario also reveals a problem with how plugins and other extensions are currently handled. When the browser provides its own package management interface independent of the system’s built-in package manager, this affects impacts Apiary, because certain application extensions might be needed in an ephemeral container, but if they are not known to the package manager, they cannot be easily included. Even today, however, many plugins and browser extensions are globally installable and manageable via the package manager itself in systems like Debian. In these systems, this yields the benefit that when multiple users wish to use an extension, it only has to be installed once. In Apiary, it additionally provides the benefit that it can become part of the application container’s definition, making it available to the ephemeral container without requiring it to be manually installed by the user on each ephemeral execution. Similarly, one can create containers with functionality provided by other containers. A LATEX paper writing container can provide Emacs, LATEX and a PDF viewer. This PDF viewer is separate from the primary PDF container and its ephemeral instances. This demonstrates how application containers can be designed to deliver a specific functionality even when it overlaps with that of other parts of the system. 
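As a purely illustrative aside, the composition of such a container can be thought of as a short list of top-level layers plus a dependency closure over the shared layer repository; the sketch below uses invented layer names and dependency data, not Apiary's actual metadata format.

    # Hypothetical appliance descriptions: each names only its top-level layers;
    # everything else is pulled in from the shared layer repository.
    REPOSITORY_DEPS = {          # layer -> layers it depends on (invented excerpt)
        "xpdf": ["libpoppler", "libx11"],
        "texlive": ["libc6"],
        "emacs": ["libx11"],
        "libpoppler": ["libc6"],
        "libx11": ["libc6"],
        "libc6": [],
    }

    APPLIANCES = {
        "pdf-viewer": ["xpdf"],
        "latex-writing": ["emacs", "texlive", "xpdf"],  # bundles its own PDF viewer
    }

    def resolve(top_layers):
        """Return the full, de-duplicated layer set that backs a VLFS."""
        layers, stack = set(), list(top_layers)
        while stack:
            layer = stack.pop()
            if layer not in layers:
                layers.add(layer)
                stack.extend(REPOSITORY_DEPS[layer])
        return sorted(layers)

    print(resolve(APPLIANCES["latex-writing"]))
    # Shared layers (libc6, libx11, xpdf, ...) are stored once in the repository,
    # so adding the PDF viewer to the latex-writing appliance costs no extra space.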
A user would want to include the PDF viewer within the LATEX container, as it is a primary component of the paper-writing process, and not just a helper application to be isolated. But as this copy of Xpdf is not made into a global program, no application will call into this container. Because the layers are shared between containers, it costs nothing to include it in the LATEX container. If Xpdf were not in the LATEX container, users would have to go through multiple steps of copying the generated PDF files to the PDF container to view them, as papers are not generally kept in the /tmp directory.

7.3.2 Usage Study

We performed a usage study that evaluated the ability of users to use Apiary's containerized application model with our prototype environment, focusing on their ability to execute applications from within other programs. Participants were mostly recruited from within our local university, including faculty, staff and students. All of the users were experienced computer users, including many experienced Linux users. 24 participants took part in the study.

For our study, we created three distinct environments. The first was a plain Linux environment running the Xfce4 desktop. It provided a normal desktop Linux experience with a background of icons for files and programs and a full-fledged panel application with a menu, task switcher, clock and other assorted applets. Second was a full Apiary environment. It provided a much sparser experience, as the current Apiary prototype only provides a set of applications and not a full desktop environment. Finally, we supplied a neutered Apiary environment that differs from the full environment in not launching any child applications within ephemeral containers.

The three environments enable us to compare the participants' experience along two axes. First, we can compare the plain Linux environment, where each application is only installed once and always run from the same environment, to the neutered Apiary environment, where each application is also only installed once and run from the same environment. This allows us to measure the cost of using the Apiary viewer, with its built-in taskbar and application menu, against plain Linux, where the taskbar and application menu are regular applications within the environment. Second, the full and neutered Apiary desktops enable us to isolate the actual and perceived cost to the participants of instantiating ephemeral containers for application execution.

We presented the environments to the participants in random order and iterated through all 6 permutations equally. We timed the participants as they performed a number of specific multi-step tasks in each environment that were designed to measure the overhead of using multiple applications that needed to interact with one another. In summary, the tasks were: (1) download and view a PDF file with Firefox and Xpdf and follow a link embedded in the PDF back to the web; (2) read an email in Thunderbird that contains an attachment that is to be edited in OpenOffice and returned to the sender; (3) create a document in OpenOffice that contains text copied and pasted from the web and sent by email as a PDF file; (4) create a "Hello World" web page in OpenOffice and preview it in Firefox; and (5) launch a link received in the Pidgin IM client in Firefox.
As Figure 7.2 shows, the average time to complete each task, when averaged over all the users doing tasks in random order, only differed by a few seconds in any direction for all tasks in all environments.

[Figure 7.2 – Usage Study Task Times: time in seconds to complete Tasks 1–5 in the plain Linux, persistent, and ephemeral configurations]

Figure 7.2 shows that, in all cases, users performed their tasks quicker in the neutered Apiary environment than in the plain Linux environment. This indicates that Apiary's simpler environment is actually faster to use than the plain Linux environment with its bells and whistles like application launchers and applets running within taskbar panels. While this may seem strange initially, it is perfectly understandable. Many environments that are simple to use with minimal distractions, for example, the command line, are faster, but less user-friendly, than others. Moreover, even though users were a little slower in the full Apiary environment than in the neutered version, they were still generally faster than in the plain Linux environment. This indicates that while the full Apiary environment has a small amount of overhead, in practice, users are just as effective there as in the plain Linux environment.

We also asked the participants to rate their perceived ease of use of each environment. Most users perceived the prototype environments to be as easy to use as the plain Linux environment. While some users preferred the polish of the plain Linux environment, more preferred the simplicity of the environment provided by Apiary. Most users could not determine a difference between the full and neutered Apiary desktops.

We also asked the participants a number of questions, including whether they could imagine using the Apiary environment full-time, and whether they would prefer to do so if it would keep their desktop more secure. All of the participants expressed a willingness to use this environment full-time, and a large majority indicated that they would prefer to use Apiary over the plain Linux environment if it would keep their applications more secure. The majority of those who would not prefer Apiary expressed concern with bugs they perceived in the prototype. In addition, a few expressed interest in the system, but said that their preference would depend on the level of security they expected from the computer they were using.

7.3.3 Performance Measurements

7.3.3.1 Application Performance

To measure the performance overhead of Apiary on real applications, we compared the runtime performance of a number of applications within the Apiary environment against their performance in a traditional environment. Table 7.1 lists our application tests.

Table 7.1 – Application Benchmarks
  Untar:  Untar a compressed Linux 2.6.19 kernel source code archive
  Gzip:   Compress a 250 MB Linux kernel source tar archive
  Octave: Octave 3.0.1 (MATLAB 4 clone) running a numerical benchmark [68]
  Kernel: Build the 2.6.19 kernel

We focus mostly on file system benchmarks, as others have shown [27, 100] that display and operating system virtualization have little overhead. The untar tests file creation and throughput, while the gzip tests file system throughput and computation. The Octave benchmark is a pure computation benchmark.
The kernel build benchmark tests computation as well as stressing the file system, because of the large number of lookups that occur due to the large size of the kernel source tree and the repeated execution of the preprocessor, compiler and linker.

To stress the system with many containers and provide a conservative performance measure, each test was run in parallel with 25 instances. To avoid out-of-memory conditions, as the Octave benchmark requires 100-200 MB of memory at various points during its execution, we ran the benchmarks staggered 5 seconds apart to ensure they kept their high memory usage areas isolated and avoided the benchmark's being killed by Linux's out-of-memory handler.

[Figure 7.3 – Application Performance with 25 Containers: time in seconds for the Untar, Gzip, Octave and Kernel benchmarks under plain Linux and Apiary]

As is shown in Figure 7.3, Apiary imposes almost no overhead in most cases, with about 10% overhead in the kernel build case, where the VLFS's constant need to perform lookups on the file system incurs an extra cost. This demonstrates that Apiary is able to scale to a large number of concurrent containers with minimal overhead.

7.3.3.2 Container Creation

For ephemeral containers to be useful, container instantiation must be quick. We measured this cost in two ways: first, how long it takes to instantiate its VLFS, and second, how long the application takes to start up within the container. We quantify how long it takes to instantiate a container and compare Apiary to other common approaches. We compare how long it takes to set up a VLFS against how long it takes to set up a container file system using Debian's traditional bootstrapping tools (Create), how long it would take to extract the same file system from a tar archive (Extract), and how long it takes a file system with a snapshot operation to create a new snapshot and branch of a preexisting file system namespace (FS-Snap), as shown in Table 7.2. To minimize network effects with the bootstrapping tools, we used a Debian mirror on the local 100 Mbps campus network, and were able to saturate the connection while fetching the packages to be installed.

Table 7.2 – File System Instantiation Times
            Pidgin   Firefox  T-Bird   OOffice  Xpdf     MPlayer
  Create    317 s    276 s    294 s    365 s    291 s    294 s
  Extract   82 s     86 s     87 s     150 s    81 s     81 s
  FS-Snap   .016 s   .015 s   .016 s   .020 s   .009 s   .010 s
  Apiary    .005 s   .005 s   .005 s   .005 s   .005 s   .005 s

Table 7.2 shows that Apiary instantiates containers with a VLFS composed of nearly 200 layers nearly instantaneously. This compares very positively with traditional ways of setting up a system. Table 7.2 shows that it takes a significant amount of time to create a file system for the application container using Debian's bootstrapping tool, and even extracting a tar archive takes a significant amount of time as well. This discourages creating ephemeral application containers, as users will not want to wait minutes for their applications to start. Tar archives also suffer from their need to be actively maintained and rebuilt whenever they need fixes. Therefore, the amount of administrative work increases linearly with the number of applications in use. As Apiary creates the file system nearly instantaneously, it is able to support the creation of ephemeral application containers with no noticeable overhead to the users.
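The Create and Extract baselines in Table 7.2 can be approximated with a small timing harness along the following lines. This is a sketch of how such a comparison could be scripted, not the harness used for the measurements above; the mirror URL and tarball path are placeholders, and the Apiary column comes from instantiating the VLFS itself, for which stock tools have no equivalent.

    import subprocess, time

    def timed(label, cmd):
        """Time one provisioning command (wall clock). The commands below assume
        root on a Debian host; all target paths are illustrative."""
        start = time.time()
        subprocess.run(cmd, shell=True, check=True)
        print(f"{label}: {time.time() - start:.1f}s")

    # "Create": bootstrap a fresh Debian file system from a (placeholder) local mirror.
    timed("create",
          "debootstrap --arch=amd64 stable /tmp/fs-create http://mirror.example/debian")

    # "Extract": unpack a previously prepared tarball of the same file system.
    timed("extract",
          "mkdir -p /tmp/fs-extract && tar -xf /srv/appliance.tar -C /tmp/fs-extract")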
While Table 7.2 shows that file systems (in this case Btrfs) with a snapshot and branch operation can also perform it quickly, the user would have to manage each of the application's independent file systems separately.

To quantify startup time, we measured how long it takes for the application to open and then be automatically closed. In the case of Firefox, Xpdf and OpenOffice.org, this includes the time it takes to display the initial page of a document, while Pidgin, MPlayer and Thunderbird are only loading the program. For ephemeral containers, we measure the total time it takes to set up the container and execute the application within it. Ephemeral containers differ from persistent containers only in the time it takes to set up the new ephemeral container, which is never a cold-cache operation because the system is already in use. We compare these results to cold and warm cache application startup times for both plain Linux and Apiary's persistent containers. We include cold cache results for benchmarking purposes and warm cache results to demonstrate the results users would normally see.

[Figure 7.4 – Application Startup Time: cold (C) and warm (W) cache startup times in seconds for plain Linux and persistent containers, plus ephemeral containers, for each application]

As Figure 7.4 shows, while running within a container induced some overhead on startup, it is generally under 25% in both cold and warm cache scenarios. This overhead is mostly due to the added overhead of opening the many files needed by today's complex applications. The most complex application, OpenOffice, incurs the most overhead, while the least complex application, Xpdf, is almost equivalent to the plain Linux case. In addition, while the maximum absolute extra time spent in the cold cache case was nearly 5 seconds for OpenOffice, in the warm cache case it dropped to under 0.5 seconds.

In addition, ephemeral containers provide an interesting result. Even though they have a fresh new file system and would be thought to be equivalent to a cold cache startup, they are nearly equivalent to the warm cache case. This is because their underlying layers are already cached by the system due to their use by other containers. The ephemeral case has a slightly higher overhead due to the need to create the container and execute a display server inside of it in addition to regular application startup. However, as this takes under 10 milliseconds, it adds only a minimal amount to the ephemeral application startup time.

7.3.4 File System Efficiency

To support a large number of containers, Apiary must store and manage its file system efficiently. This means that storage space should not significantly increase with an increasing number of instantiated containers and should be easily manageable in terms of application updates. For each application's VLFS, Table 7.3 shows its size, its number of layers, the amount of state shared with the other application VLFSs, and the amount of state unique to it. For instance, the 129 layers that make up Firefox's VLFS require 353 MB, of which 330 MB are shared with other applications and 23 MB are unique to the Firefox VLFS. In general, as Table 7.3 shows, there is a lot of duplication among the containers, as the layer repository of 214 distinct layers needed to build the different VLFSs for the different applications is of the same magnitude as the largest application.
Table 7.3 – Apiary's VLFS Layer Storage Breakdown
            Size     # Layers  Shared   Unique
  Repo      743 MB
  Pidgin    394 MB   147       322 MB   72 MB
  Firefox   353 MB   129       330 MB   23 MB
  T-Bird    367 MB   125       335 MB   32 MB
  OOffice   645 MB   186       329 MB   316 MB
  Xpdf      339 MB   130       330 MB   9 MB
  MPlayer   355 MB   162       326 MB   29 MB

Table 7.4 – Comparing Apiary's Storage Requirements Against a Regular Desktop
  Single FS: 743 MB    Multiple FSs: 2.1 GB    VLFSs: 743 MB

Table 7.5 – Update Times for Apiary's VLFSs
  Avg. time, Traditional: 18 s    Avg. time, Apiary: 0.12 s

Table 7.4 shows that using individual VLFSs for each application container consumes approximately the same amount of file system space as a regular desktop file system containing all the applications, because each layer only has to be stored once. This is in comparison to the traditional method of provisioning multiple independent file systems for each application container, which consumes a significantly larger amount of disk space. Similarly, if multiple desktops are provided on a server, the VLFS usage would remain constant with the size of the repository, while the other cases would grow linearly with the number of desktops.

To demonstrate how Apiary allows users to maintain their many containers efficiently, we instantiated one container for each of the five applications previously mentioned. When a security update was necessary [146], we applied the update to each container. Table 7.5 shows the average times for the five application container file systems. This demonstrates that while individual updates by themselves do not take long, when there are multiple container file systems for each individual user, the amount of time to apply common updates will rise linearly, and as the traditional method is two orders of magnitude greater than Apiary, it will be impacted to a much greater extent.

7.3.5 File System Virtualization Overhead

To measure the virtualization cost of VLFS in the Apiary operating system virtualization environment, we re-ran the benchmarks from Chapter 6. These benchmarks differ from Chapter 6 in that they are not run within a hardware virtual machine, but rather within an operating system virtualization namespace, and that instead of the backing store of the VLFS being on a fast SAN device, they are on the slower host machine disks.

[Figure 7.5 – Postmark Overhead in Apiary: Postmark completion time in seconds for a plain desktop and for each application VLFS]

Figure 7.5 shows that Postmark runs faster within a plain Linux environment than when run within the VLFS. However, it should be noted that these results show significantly less overhead than those in Chapter 6. This is because even though the disks are slower, as indicated by the plain Linux results, the operating system virtualization overhead is minimal compared to the overhead imposed by the virtual machine monitor in Chapter 6. Most notable is a decrease in memory pressure which enables the VLFS to operate more efficiently because more data can remain cached.

[Figure 7.6 – Kernel Build Overhead in Apiary: kernel build time in seconds for a plain desktop and for each application VLFS]
Figure 7.6 shows similar results with the multi-threaded build of the Linux 2.6.18.6 kernel. In Chapter 6, the VLFS showed a 5% overhead; here, overhead is essentially zero. Even though the SAN’s file system, used for the tests in Chapter 6, is significantly faster than the blade’s file system, the results here are much faster overall. This again indicates the amount of overhead imposed by virtual machine monitors over operating system virtualization. Chapter 7. Apiary: A Desktop of Isolated Applications 7.4 169 Related Work Isolation mechanisms such as VMs [143, 147] and OS containers [7, 116] are commonly used to increase the security of applications. However, if used for desktop applications, this isolation prevents an integrated desktop experience. Products like VMware’s Unity [143] attempt to solve part of this issue by combining the applications from multiple VMs into a single display with a single menu and taskbar, as well as providing file system sharing between host and VMs. The applications, however, are still fully isolated from one another, preventing them from leveraging other applications installed into separate VMs. While VMs provide superior isolation, they suffer higher overhead due to running independent operating systems. This impacts performance and makes them less suited for ephemeral usage on account of their long startup times. However, Apiary can leverage them if one does not want to trust a single operating system kernel. Tahoma [122] is similar to Apiary in that it creates fully isolated application environments that remain part of a single desktop environment. Tahoma creates browser applications that are limited to certain resources, such as certain URLs, and that are fully isolated from each other. Tahoma is similar to Apiary in that it enables the creation of isolated application environments. However, it only provides these isolated application environments for web browsers. It does not provide any way to integrate these isolated environments and does not provide ephemeral application environments. Google’s Chrome web browser [66] builds upon some of these ideas to isolate web browser pages within a single browser. But the browser as a whole does not offer any isolation from the system. While its multiple-process model uses OS mechanisms to isolate separate web pages that are concurrently viewed, it does not provide any isolation from the system itself. For instance, any plugin that is executed Chapter 7. Apiary: A Desktop of Isolated Applications 170 has the same access to the underlying system as does the user running the browser. Modern web browsers improve privacy by providing private browsing modes that prevent browser state from being committed to disk. While they serve a similar purpose to ephemeral containers, private browsing is fundamentally different. First, it has to be written into the program itself. Many different types of programs have privacy modes to prevent them from recording state and this model requires them to implement it independently. Second, it only provides a basic level of privacy. For instance, it cannot prevent a plugin from writing state to disk. Furthermore, it makes the entire browser and any helper program or plugin that it executes part of the trusted computing base (TCB). This means that the user’s entire desktop becomes part of the TCB. If any of those elements gets exploited, no privacy guarantees can be enforced. 
Apiary’s ephemeral containers make the entire execution private and support any application with a state a user desires to remain private without any application modifications. It also keeps the TCB much smaller, by only requiring that the underlying OS kernel and the minimal environment of Apiary’s system daemon be trusted. Lampson’s Red/Green isolation [82] and WindowBox [23] resemble Apiary’s ability to run multiple applications in parallel. These isolation schemes involve users running two or more separate environments, for instance, a red environment for regular usage and a green environment for actions requiring a higher level of trust. However, unlike Apiary’s ephemeral containers, if an exploit can enter the green container, it will persist. Furthermore, by requiring two separate virtual machines, one increases the amount of work a user has to do to manage their machines. Apiary, by leveraging the VLFS, minimizes the overhead required required to manage multiple machines. Storage Capsules [33] also attempt to mitigate this problem by securely running the applications requiring trust in the same operating system environment as the un- Chapter 7. Apiary: A Desktop of Isolated Applications 171 trusted applications, while keeping their data isolated from one another. However, this involves significant startup and teardown costs for each execution within a secure storage capsule. File systems and block devices with branching or COW semantics [32,103,128] can be used to create a fresh file system namespace for a new container quickly. However, these file systems do not help to manage the large number of containers that exist within Apiary. Because each container has a unique file system with different sets of applications, administrators must create individual file systems tailored to each application. They cannot create a single template file system with all applications because applications can have conflicting dependency requirements or desire to use the same file system path locations. Furthermore, if all applications are in a single file system, they are not isolated from each other. This results in a set of space-inefficient file systems, as each file system has an independent copy of many common files. This inefficiency also makes management harder. When security holes are discovered and fixed, each individual file system must be updated independently. Many systems have been created that attempt to provide security through isolation mechanisms [17, 30, 49, 84, 86, 118, 144]. All these systems differ from Apiary in that they try to isolate the many different components that make up a standard fully-integrated single system using sets of rules to determine which of the machine’s resources the application should be able to access. This often results in one of two outcomes. First, a policy is created that is too strict and does not let the application run correctly. Second, a policy is created that is too lenient and lets an exploited application interact with data and applications it should not be able to access. Apiary, on the other hand, forces each component to be fully isolated within its own container before determining on which levels it should be integrated. As each container provides all the resources that the application needs to execute in an isolated environment, no Chapter 7. Apiary: A Desktop of Isolated Applications 172 complicated rule sets have to be created to determine what it can access. 
Solitude [72] provides isolation via its Isolation File System (IFS), which a user can throw away. This is similar to Apiary’s ephemeral containers. However, the IFSs are not fully isolated. First, Solitude does not create a new IFS for each application execution. Second, the IFS is built on top of a base file system with which it can share data, breaking the isolation. To handle this, Solitude implements taint tracking on files shared with the underlying base file system. This helps determine post facto what other applications may have been corrupted by a maliciously constructed file. Similarly, Solitude only provides isolation at the file system level. Because each application still shares a single display, malicious and exploited applications can leverage built-in mechanisms in commodity display architectures [61, 93] to insert events and messages into other applications sharing the display. Chapter 8 ISE-T: Two-Person Control Administration All organizations that rely on system administrators to manage their machines must prevent accidental and malicious administrative faults from entering their systems. As systems become more complex, it gets easier for administrators to make mistakes. From a security perspective, these complex systems create an environment where it is easier for a rogue user, whether an insider or outsider, to hide their attacks. For example, Robert Hanssen, an FBI agent who was a Soviet spy, was able to evade detection because he was the administrator of some of the FBI’s counterintelligence computer systems [149]. He could see whether the FBI had identified his drop sites and if he was being investigated [45]. Most approaches to insider attacks involve intrusion detection or role separation, both of which are ineffective against rogue system administrators who can replace the system module that enforces the separation or performs the intrusion detection. This attack vector was described over thirty years ago by Karger and Schell [101] and still remains a serious problem. Even if administrators can be trusted, they must deal Chapter 8. ISE-T: Two-Person Control Administration 174 with very complicated software, and it is hard to catch mistakes before they cause problems. If a mistake takes down an important service, the machine may not be usable or administratable, and malicious attackers can act with impunity. There are several ways to address faults, including partitioning, restore points and peer review. One highly effective approach is two-person control [13], for example, two pilots in an airplane, two keys for a safe deposit box, or running two or more computations in parallel and comparing the results. We believe this concept can be extended to problems in system administration by using virtualization to create duplicate environments. Toward this end, we created the “I See Everything Twice” [70] (ISE-T, pronounced “ice tea”) architecture. ISE-T provides a general mechanism to clone execution environments, independently execute computations to modify the clones, and compare how the resulting modified clones have diverged. The system can be used in a number of ways, such as performing the same task in two initially identical clones, or executing the same computation in the same way in clones with some differences. By providing clones, ISE-T creates a system where computation actions can be “seen twice,” applying the concept used for fault-tolerant computing to other forms of twoperson control systems. 
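As a toy rendering of this idea, the control flow is simply clone, modify independently, compare, and commit only on agreement. The names below are invented and a dictionary stands in for a file system; the real mechanisms, containers and layered file systems, are described later in this chapter.

    import copy

    def clone(base):
        # Stand-in for creating an isolated administration clone of the system.
        return copy.deepcopy(base)

    def changes(base, modified):
        # Stand-in for isolating the file system changes captured in a clone.
        return {path: data for path, data in modified.items() if base.get(path) != data}

    def administer(base, session_a, session_b):
        clone_a, clone_b = clone(base), clone(base)
        session_a(clone_a)                     # first administrator works alone
        session_b(clone_b)                     # second administrator works alone
        delta_a, delta_b = changes(base, clone_a), changes(base, clone_b)
        if delta_a == delta_b:                 # the computation was "seen twice"
            base.update(delta_a)               # commit either (equivalent) set
            return "committed"
        return f"discrepancy: {delta_a} vs {delta_b}"

    system = {"/etc/hostname": "old-name"}
    print(administer(system,
                     lambda fs: fs.update({"/etc/hostname": "new-name"}),
                     lambda fs: fs.update({"/etc/hostname": "new-name"})))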
There is, however, a crucial difference between our use of replicas and that of fault-tolerant computing. We test for equivalence between two replicas that may not be identical, rather than simply running two identical replicas in lockstep and ensuring they remain identical. By applying the ISE-T architecture to system administration, we are able to introduce the two-person control concept to system administration. As ISE-T allows a system to be easily cloned into multiple distinct execution domains, we can create separate cloned environments for multiple administrators. ISE-T can then compare the sets of changes produced by each administrator to determine if equivalent changes Chapter 8. ISE-T: Two-Person Control Administration 175 were made. ISE-T allows administration to proceed in both a fail-safe and an auditable manner. ISE-T forces administrative acts to be performed multiple times before they are considered correct. Current systems give full access to the machine to individual administrators. This means that one person can accidentally or maliciously break the system. ISE-T offers a new way to avoid this problem. ISE-T does not allow any administrator to modify the underlying system directly, but instead creates individual clones for two administrators to work on independently. ISE-T is then able to compare the changes each administrator performs. If the changes are equivalent, ISE-T has a high assurance that the changes are correct and will commit them to the base system. But if it detects discrepancies between the two sets of changes, it will notify the administrators so that they can resolve the problem. This enables fail-safe administration by catching accidental errors, while also preventing a single administrator from maliciously damaging the system. ISE-T leverages both virtualization and unioning file systems to produce the clones. ISE-T uses both operating system virtualization, as in Solaris Zones [116] and Linux VServer [7], and hardware virtualization as in VMware [142], to provide each administrator with an isolated environment. ISE-T builds upon DejaView [81] and Strata, using union file systems to yield a layered file system that provides the initial file system namespace in one layer, while capturing all the system administrator’s file system changes in a separate layer. This allows easy isolation of changes, simplifying equivalence testing. ISE-T’s requiring everything to be installed twice blocks many real attacks. A single malicious system administrator cannot create an intentional back door, weaken firewall rules, or create unauthorized user accounts. ISE-T is admittedly an expensive solution, too expensive for many commercial sites. For high-risk situations, such as in the financial, government, and military sec- Chapter 8. ISE-T: Two-Person Control Administration 176 tors, the added cost may be acceptable if risk is reduced. In fact, two-person controls are already routine in those environments, ranging from checks that require two signatures to requiring two people for nuclear weapons work. But we also demonstrate how ISE-T can be used in a less expensive manner by introducing a form of auditable system administration. Instead of requiring two system administrators at all times, ISE-T can save all the changes performed by the system administrator to a log, which is audited to provide a higher level of assurance that the administrator is behaving properly. In a similar manner, ISE-T can be extended to train less experienced system administrators. 
First, ISE-T allows a junior system administrator to perform tasks in parallel with a more senior system administrator. While only the senior administrator's solution will be committed to the system, the junior system administrator can learn from how their solution differs from the senior system administrator's. Second, ISE-T can be extended to provide an approval mode, in which a junior system administrator is given tasks to complete, but instead of being committed immediately, they will be presented for the senior system administrator to approve or disapprove.

8.1 Usage Model

Systems managed by ISE-T are used by two classes of users, privileged and unprivileged. ISE-T does not change how regular users interact with the machine. They are able to install any program into their personal space and run any program on the system, including regular programs and setuid UNIX programs, such as passwd, that raise the privileges of the process on execution.

However, ISE-T fundamentally changes the way system administrators interact with the machine. In regular systems, when administrators need to perform maintenance on the machine, they use their administrative privilege to run arbitrary programs, for example, by executing a shell or using sudo. In these systems, administrators can modify the system directly.

As ISE-T prevents system administrators from executing arbitrary programs with administrative privileges, the above model will not work with ISE-T. Instead, ISE-T provides a new approach as shown in Figure 8.1.

[Figure 8.1 – ISE-T Usage Model: two administrative clones feed into the ISE-T service, which commits changes to the underlying system]

Instead of administering a system directly, ISE-T creates administration clones. Each clone is fully isolated from others and from the base system. ISE-T instantiates a clone for each administrator. Once both administrators are finished making changes, ISE-T compares the clones for equivalence and commits the changes if they pass the test. As opposed to a regular system, where the administrator can interleave file system changes with program execution, in ISE-T only file system changes are committed to the underlying system.

Therefore ISE-T requires administrators to use other methods if they require file system changes and program execution to be interleaved on the actual system, such as for rotating log files or exploratory changes to diagnose a subtle system malfunction. To allow this, ISE-T provides a new ise-t command that is used in a manner similar to su. Instead of spawning a shell on the existing system, ise-t spawns a new isolated container for that administrator. This container contains a clone of the underlying file system. Within this clone, the administrators can perform generic administrative actions, as on a regular system, but the changes will be isolated to this new container. When the administrators are finished with their changes, they exit the new container's shell, much as they would exit a root shell; the container itself is terminated, while its file system persists. ISE-T then compares the changes each administrator performed for equivalence. ISE-T performs this task automatically after the second administrator exits their administration session and notifies both of the administrators of the results. If the changes are equivalent, ISE-T automatically commits the changes to the underlying base system.
Otherwise, ISE-T notifies the administrators of the file system discrepancies that exist between the two administration environments, allowing the administrators to correct them. Command ise-t new ise-t enter ise-t done ise-t diff Description Create an administration environment Enter administration environment Ready for equivalence testing Results of a failed equivalence test Table 8.1 – ISE-T Commands Because ISE-T only looks at file system changes, this can prevent it from performing administrative actions that affect only the runtime of the system. To address this, ISE-T provides a raw control mechanism via the file system, and allows itself Chapter 8. ISE-T: Two-Person Control Administration 179 to be integrated with configuration management systems. First, ISE-T’s raw control mechanism is implemented via a specialized file system namespace where an administrator can write commands. For instance, if the administrators want to kill a process, stop a service or reboot the machine, those actions performed directly within their administration container will have no effect on the base system. Some actions can be inferred directly from the file system. For instance, if the system’s set of startup programs is changed, ISE-T can infer that the service should be started, stopped or restarted when the changes are committed to the underlying system. But this approach only helps when the file system is being changed. Sometimes administrators want to stop or restart services without modifying the file system. ISE-T therefore provides a defined method for killing processes, stopping and starting services, and rebooting the machine using files stored on the local file system. ISE-T provides each administrator with a special /admin directory for performing these predefined administrative actions. For example, if the administrator wants to reboot the machine, they create an empty reboot file in the /admin directory. If both administrators create the file, the system will reboot itself after the other changes are committed. Similarly, the administrators can create a halt file to halt the machine. In addition, the /admin directory has kill and services subdirectories. To kill a process, administrators create individual files with the names of the process identifiers of processes running on the base system that they want to kill. Similarly, if a user desires to stop, start, or restart a init.d service, they create a file named by that service prefixed with stop, start, or restart, such as stop.apache or restart.apache within the services directory. ISE-T performs the appropriate actions when the changes are committed to the base system. The files created within the /admin directory are not committed to the base system; they are only used for performing runtime changes to the system. Chapter 8. ISE-T: Two-Person Control Administration 180 Many systems already exist to manage systems and perform these types of tasks, namely, configuration management systems such as lcfg [19]. At a high level, configuration management systems work by storing configuration information on a centralized policy server that controls a set of managed clients. In general, the policy server will contain a set of template configuration files that it uses to create the actual configuration file for the managed clients based on information contained in its own configuration. 
Configuration management systems also generally support the ability to run predefined programs and scripts and execute predefined actions on the clients they are managing. When ISE-T is integrated with any configuration management system, it no longer manages the individual machines. Instead of the managed clients being controlled by ISE-T, the configuration policy server is managed by ISE-T and the clients are managed by the configuration management system. This offers a number of benefits. First, it simplifies the comparison of two different systems, as ISE-T can focus on the single configuration language of the configuration management system. Second, configuration system already have tools to manage the runtime state of their client machines, such as stopping and starting services and restarting them when the configuration changes. Third, many organizations are already accustomed to using configuration management systems. By implementing ISE-T on the server side, they can enforce the two-person control model in a more centralized manner. 8.2 ISE-T Architecture To implement the two-person administrative control semantic, ISE-T provides three architectural components. First, because the two administrators cannot administer the system directly, they must be provided with isolated environments in which they Chapter 8. ISE-T: Two-Person Control Administration 181 can perform their administrative acts. To ensure isolation, ISE-T provides container mechanisms that allow ISE-T to create parallel environments based on the underlying system to be administered. This allows ISE-T to fully isolate each administrator’s clone environment from each other and from the base system. Second, we note that any persistent administrative action must involve a change to the file system. If the file system is not affected, the action will not survive a reboot. While some administrative acts only affect the ephemeral runtime state of the machine, the majority are more persistent. The file system is therefore a central component in ISE-T’s two-person administrative control. ISE-T provides a file system that can create branches of itself as well as isolate the changes made to it. This allows for easy creation of clone containers and comparison of the changes performed to both environments. Finally, ISE-T provides the ISE-T System Service. This service instantiates and manages the lifetimes of the administration environments. It is able to compare the two separate administration environments for equivalence to determine if the changes performed to them should be committed to the base system. ISE-T’s System Service performs this via an equivalence test that compares the two administration environment’s file system modifications for equivalence. If the two environments are equivalent, the changes will be committed to the underlying base system. Otherwise, the ISE-T System Service will notify the two administrators of the discrepancies and allow them to fix their environments appropriately. 8.2.1 Isolation Containers ISE-T can leverage multiple types of container environments depending on administrative needs. In general, the choice will be between hardware virtual machine Chapter 8. ISE-T: Two-Person Control Administration 182 containers and operating system containers. Hardware virtual machines such as VMware [142] provide a virtualized hardware platform with a separate operating system kernel, yielding a complete operating system instance. 
Operating system containers such as Solaris Zones [116], however, are just isolated kernel namespaces running on a single machine. For ISE-T, there are two main differences between these containers. First, hardware virtual machines allow the administrators to install and test new operating system kernels, as each container will be running its own kernel. Operating system containers, on the other hand, prevent the administrators from testing the underlying kernel, as there is only one kernel running, that of the underlying host machine. Second, as hardware virtual machines require their own kernel and a complete operating system instance, they make it time-consuming to create administration clones. Operating system containers, however, can be created almost instantly. As both types of containers have significant benefits for different types of administrative acts, ISE-T supports both. For most actions, administrators will prefer operating system containers, but they can still use a complete hardware virtual machine to test kernel changes. When ISE-T is integrated with a configuration management system, ISE-T does not have to use any isolation container mechanism at all, as the configuration management system already isolates the administrators from the client system. Instead, ISE-T simply provides each administrator with their own configuration management tree and lets both administrators perform the changes. Chapter 8. ISE-T: Two-Person Control Administration 8.2.2 183 ISE-T’s File System To support its file system needs, ISE-T leverages the branching ability of some file systems. Unlike a regular file system, a branchable file system can be snapshot at some point in time and branched for future use. This allows ISE-T to quickly clone the file system of the machine being managed. Because each file system branch is independent, ISE-T can capture any file system changes in the newly created branch by comparing the branch’s state to the initial file system’s state. Similarly, ISE-T can then compare the sets of file system changes from both administration clones to one another. Although a classical branchable file system allows changes to be captured, it does not make it possible to determine efficiently what has changed, because the branch is a complete file system namespace. Iterating through the complete file system can take a significant amount of time, place a large strain on the file system, and decrease system performance. Two features allow ISE-T to use a file system efficiently. First, it must be able to duplicate the file system to provide each administrator with their own independent file system on which to make changes. Second, it must allow easy isolation of each administrator’s changes to test them for equivalence. To meet these requirements, ISE-T creates layered file systems for each administration environment. Multiple file systems can be layered together into a single file system namespace for each environment. This enables each administration environment to have a layered file system composed of two layers, a single shared layer that is the file system of the machine they are administrating, as well as a layer containing all the changes the administrator makes on the file system. Chapter 8. ISE-T: Two-Person Control Administration 8.2.3 184 ISE-T System Service ISE-T’s System Service has a number of responsibilities. First, it manages the lifetimes of each administrator’s environment. When administration is required, it has to set up the environments quickly. 
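A minimal sketch of how such a two-layer clone can be assembled on a stock Linux kernel is shown below; it uses overlayfs as a stand-in for the UnionFS-based prototype, the paths are illustrative, and the mounts require root.

    import os, subprocess

    def make_admin_clone(name, base="/"):
        """Compose an administration clone: a read-only view of the base system
        plus a private writable layer that captures every change the
        administrator makes."""
        root = f"/srv/ise-t/{name}"
        upper, work, mnt = f"{root}/upper", f"{root}/work", f"{root}/mnt"
        for d in (upper, work, mnt):
            os.makedirs(d, exist_ok=True)
        subprocess.run(
            ["mount", "-t", "overlay", "overlay",
             "-o", f"lowerdir={base},upperdir={upper},workdir={work}", mnt],
            check=True)
        return upper   # the change layer: exactly the files this administrator touched

    layer_a = make_admin_clone("admin-a")
    layer_b = make_admin_clone("admin-b")
    # Equivalence testing only needs to walk layer_a and layer_b; the shared base
    # layer is never modified, so nothing outside these directories is examined.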
Similarly, when the administration session has been completed and the changes committed to the underlying system, it removes them from the system and frees up their space. Third, it evaluates the two environments for equivalence by running a number of equivalence tests to determine if the two administrators performed the same set of modifications. Finally, it has to either notify the administrators of the discrepancies between their two environments or commit the equivalent environment’s changes to the underlying base system. ISE-T’s layered file system allows the system service to easily determine which changes each administrator made, as each administrator’s changes are confined to their personal layer of the layered file system. To determine if the changes are equivalent, ISE-T first isolates the files that will not be committed to the base system, that is, the administrator’s personal files in their branch, such as shell history. Instead of merely removing them, ISE-T saves them for archival and audit purposes. ISE-T then iterates through the files in each environment, comparing the file system contents and files directly to one another. If each administrator’s branch has an equivalent set of file system changes, ISE-T can then simply commit a set to the base system. On the other hand, if the files contained within each branch are not equivalent, ISE-T flags the differences and reports them. The administrators then confer to ensure that they perform the same steps to create the same set of files to commit to the base system. Ways of determining equivalence can vary based on the type of file and what is considered to be equivalent in context. For instance, a configuration file modified by both administrators with different text editors can visually appear equivalent, Chapter 8. ISE-T: Two-Person Control Administration 185 but can differ if one uses spaces and another uses tabs. These files are equivalent insofar as applications parse them the same way, but are different on a character by character level. However, there are some languages (e.g., Python) where the amount of white space matters and can have a great effect on how the script executes. On the other hand, two files that have exactly the same file contents can have varying metadata associated with the file, such as permissions, extended attributes, or time data. Similarly, some sets of files need not be compared for equivalence, such as the shell history that records the steps the administrators take in their respective environments, and, in general, the home directory contents of the administrator in his administration environment. ISE-T removes these files from the comparison, and never commits them to the underlying system. Taking this into consideration, ISE-T’s prototype comparison algorithm determines these sets of differences. 1. Directory entries which do not exist in both sets of changes are different. 2. Directory entries with different UIDs, GIDs, or permission sets are different. 3. Directory entries of different file types (Regular File, Symbolic Link, Directory, Device Node, or Named Pipe) are different. For directory entries of the same type, ISE-T performs the appropriate comparison. • Device nodes must be of the same type. • Symbolic links must contain the exact same path. • Regular files must have the same size and the exact same contents. Chapter 8. ISE-T: Two-Person Control Administration 186 There are two major problems with this approach. First, this comparison takes place at a very low semantic level. 
It does not take into account simple differences between files that make no difference in practice. However, without writing a parser for each individual configuration language, one will not easily be able to compare equivalence. Second, there are certain files, such as encryption keys, that will never be generated identically, even though equivalent actions were taken to create them. This can be important, as some keys are known to be weaker and a malicious administrator can construct one by hand. Both of these problems can be solved by integrating ISE-T with a configuration management system and teaching ISE-T the configuration management system’s language. First, these systems simplify the comparison by enabling it to focus on the configuration management system’s language. Even though most configuration management systems work by creating template configuration files for the different applications, these files are not updated regularly and can be put through the stricter exact comparison test. On the other hand, when ISE-T understands the language of the configuration management system, it can rely on a more relaxed equivalence test. Second, configuration management systems already deal with dynamic files like encryption keys. A common way configuration management systems deal with these types of files is by creating them directly on the managed client machines. Because ISE-T understands the configuration management system’s language, the higher level semantics that instruct the system to create the file will be compared for equivalence instead of the files themselves. However, a potential weakness of ISE-T is in dealing with files that cannot easily be created on the fly and will differ between two system administration environments, such as databases. For instance, two identical database operations can result in different databases due to different timestamps or reordering of updates on the database server. Chapter 8. ISE-T: Two-Person Control Administration 8.3 187 ISE-T for Auditing Although the two-person control is useful for providing high assurance that faults are not going to creep into the system, its expense can make it impractical in many situations. For example, since the two-person control model requires the concurrence of two system administrators on all actions, it can prevent time-sensitive actions if only a single administrator is available. Similarly, while the two-person control model provides a very high degree of assurance for a price, it would be useful if organizations could get a somewhat higher degree of assurance at a lower price. To achieve these goals, we can combine ISE-T’s mechanisms with audit trail principles to create an auditable system administration semantic. In auditable system administration, every system administration act is logged to a secure location for review. The ISE-T System Service creates cloned administration environments for the two administrators and can capture the state they change in order to compare for equivalence. For auditable system administration, ISE-T’s mechanism can also be used. The audit system prevents the single system administrator from modifying the system directly, instead requiring the creation of a cloned administration environment where the administrator can perform the changes before they are committed to the underlying system. 
Instead of comparing for equivalence against a second system administrator, the changes are logged so that they can be examined at some time in the future, while being immediately committed to the underlying system.

Audit systems are known to increase assurance against malicious changes, as the would-be perpetrator knows there is a good chance their actions will be caught. Similarly, depending on the frequency and number of audits performed, auditing can help prevent administration faults from persisting in the system for long periods of time. However, it does not provide as much assurance as two-person control, because the administrator can use the fact that his changes are committed immediately to create back doors in the system that will not be discovered until later.

Auditable system administration needs to be tied directly to an issue-tracking service. This allows an auditor to associate an administrative action with its intended result. Every time an administrator invokes ISE-T to administer the system, an issue-tracking number is passed into the system to tie that action to the issue in the tracker. This allows the auditor to compare the actual results with what the auditor expects to have occurred. In addition, auditable system administration can be used in combination with the two-person control system when only a single administrator is available and immediate action is needed. With auditing, the action can be performed by the single administrator, but can be immediately audited when the second administrator becomes available.

8.4 Experimental Results

To test the efficacy of ISE-T's layered file system approach, we recruited 9 experienced computer users with varying levels of system administration experience, though all were familiar with managing their own machines. We provided each user with a VMware virtual machine running Debian GNU/Linux 3.0. Each VM was configured to create an ISE-T administration environment that would allow the users to perform multiple administration tasks isolated from the underlying base system. Our ISE-T prototype uses UnionFS [150] to provide the layered file system needed by ISE-T. We asked the users to perform the eleven administration tasks listed in Table 8.2. The user study was conducted in virtual machines running on an IBM HS20 eServer blade with dual 3.06 GHz Intel Xeon CPUs and 2.5 GB RAM running VMware Server 1.0. These tasks were picked to be representative of common administration tasks, and included a common way for a malicious administrator to create a back door in the system. Each task was performed in a separate ISE-T container, so that each administration task was isolated from the others, and none of the tasks depended on the results of a previous task.

Category                 Description                                  Result          Desired
Software Installation    Install official rdesktop package            Equivalent      Yes
                         Compile & install rdesktop from source       Equivalent      Yes
                         Install all pending security updates         Equivalent      Yes
System Services          Install SSH daemon from package              Not Equivalent  No
                         Remove PPP using package manager             Equivalent      Yes
Configuration Changes    Edit machine's persistent hostname           Equivalent      Yes
                         Edit the inetd.conf to enable a service      Not Equivalent  No
                         Add a daily run cron job                     Equivalent      Yes
                         Remove an hourly run cron job                Equivalent      Yes
                         Change the time of a cron job                Equivalent      Yes
Exploit                  Create a backdoor setuid root shell          Not Equivalent  Yes

Table 8.2 – Administration Tasks
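To make the comparison concrete before turning to the per-task results, the sketch below walks two administrators' branches and applies the directory-entry rules described earlier in this chapter: entries missing from one branch, differing ownership or permissions, differing file types, and type-specific content differences. It is only an approximation of the prototype; the function names and the prune list are illustrative, and the prototype's actual pruning and whiteout handling are described in the next section.

```python
#!/usr/bin/env python3
"""Illustrative sketch of an ISE-T-style branch comparison (not the actual prototype).

Walks two administrators' branches, skips trees that are never committed
(illustrative prune list), and reports entries that differ by existence,
ownership/permissions, file type, or type-specific content.
UnionFS whiteout files are not handled in this sketch.
"""
import filecmp
import os
import stat

PRUNE = {"root", "tmp"}  # illustrative; the prototype's full rules are given below

def entry_kind(st):
    """Map an lstat result to a coarse file type name."""
    for name, test in (("dir", stat.S_ISDIR), ("file", stat.S_ISREG),
                       ("symlink", stat.S_ISLNK), ("chardev", stat.S_ISCHR),
                       ("blockdev", stat.S_ISBLK), ("fifo", stat.S_ISFIFO)):
        if test(st.st_mode):
            return name
    return "other"

def collect(branch):
    """Return {relative path: lstat result} for every entry, minus pruned trees."""
    entries = {}
    for dirpath, dirnames, filenames in os.walk(branch):
        if os.path.relpath(dirpath, branch) == ".":
            dirnames[:] = [d for d in dirnames if d not in PRUNE]
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            entries[os.path.relpath(path, branch)] = os.lstat(path)
    return entries

def compare(branch_a, branch_b):
    """Yield (path, reason) for every difference between the two branches."""
    a, b = collect(branch_a), collect(branch_b)
    for path in sorted(set(a) ^ set(b)):
        yield path, "exists in only one branch"
    for path in sorted(set(a) & set(b)):
        sa, sb = a[path], b[path]
        if (sa.st_uid, sa.st_gid, stat.S_IMODE(sa.st_mode)) != \
           (sb.st_uid, sb.st_gid, stat.S_IMODE(sb.st_mode)):
            yield path, "ownership or permissions differ"
        elif entry_kind(sa) != entry_kind(sb):
            yield path, "file types differ"
        elif stat.S_ISLNK(sa.st_mode):
            if os.readlink(os.path.join(branch_a, path)) != \
               os.readlink(os.path.join(branch_b, path)):
                yield path, "symbolic link targets differ"
        elif stat.S_ISREG(sa.st_mode):
            if sa.st_size != sb.st_size or not filecmp.cmp(
                    os.path.join(branch_a, path),
                    os.path.join(branch_b, path), shallow=False):
                yield path, "regular file contents differ"

if __name__ == "__main__":
    import sys
    for path, reason in compare(sys.argv[1], sys.argv[2]):
        print(f"{path}: {reason}")
```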
We used ISE-T to capture the changes each user performed for each task in its own file system. We were then able to compare each user against the others for each of the eleven tasks to see where their modifications differed. For every test, ISE-T prunes the changes to remove files that would not affect equivalence, as described in Chapter 8.2.3. Notably, in our prototype, ISE-T prunes the /root directory, which is the home directory of the root user, and therefore would contain differences in files such as .bash history, among others, that are particular to each user’s approach to the task. Similarly, ISE-T prunes the /var subtree to remove any files that are not equivalent. For instance, depending on what tools an administrator uses, different files are created. A cache of packages might be downloaded and installed via the apt-get tool instead of manually. The reasoning behind this pruning is that the /var tree is meant as a read-write file system for Chapter 8. ISE-T: Two-Person Control Administration 190 per-system usage. Tools will modify it; if different tools are used, different changes will be made. The entire directory tree cannot be pruned, however, because there are files or directories within it that are necessary for runtime use and those changes have to be committed to the underlying file system. Therefore, only those changes that are equivalent are committed, while those that are different were ignored. ISE-T also prunes the /tmp directory, as the contents of this directory would also not be committed to the underlying disk. Finally, due to the UnionFS implementation used for these experiments, ISE-T also prunes the whiteout files created by UnionFS if there is no equivalent file on the underlying file system. In many cases, temporary files with random names will be created; when they are deleted, UnionFS will create a whiteout file, even if there is no underlying file to whiteout. As this whiteout file does not have an impact on the underlying file system, it is ignored. On the other hand, whiteout files that do correspond to underlying files and therefore indicate that the file was deleted are not ignored. 8.4.1 Software Installation In the software installation category, we had the users perform three separate tests to demonstrate that when multiple users install the same piece of software, as long as they install it in the same general way, the two installations will be equivalent. To demonstrate this, the users were first instructed to install the rdesktop program from its Debian package. Users had multiple ways of installing the package, including downloading and installing it by hand via dpkg, using apt-get to download it and any unfulfilled dependencies, as well as using the aptitude front end to apt-get. Most users decided to install the package via apt-get, but even those who did not made equivalent changes. The only differences were in pruned directories, demon- Chapter 8. ISE-T: Two-Person Control Administration 191 strating that installing a piece of pre-packaged software using regular tools results in an equivalent system. Second, the users were instructed to build the rdesktop program from source code and install it into the system. In this case, multiple differences could have occurred. First, if the compiler were to create a different binary each time the source code is compiled, even without any changes, it would be difficult to check for equivalence. 
Second, programs generally can be installed in different areas of the file system, such as /usr versus /usr/local. In this case, all the testers decided to install the program into the default location, avoiding the latter problem, while also demonstrating that as long as a the same source code is compiled by the same tool chain, it will result in the same binary. However, some program source code, such as the Linux kernel, will dynamically modify its source during build, for example to define when the program was built. In these cases, we would expect equivalence testing to be more difficult, as each build will result in a different binary. A simple solution would be to patch the source code to avoid this behavior. A more complicated solution would involve evaluating the produced binary’s code and text sections with the ability to determine that certain text section modifications are inconsequential. Again, in this case, the only differences were in pruned directories, notably the /root home directory, to which the users downloaded the source for rdesktop. Finally, we instructed the users to install all the pending security updates. This is more complicated than the first test, as many packages were upgraded. Although differences existed between the environments of the users, the differences were confined to the /var file system tree and depended on how they performed the upgrade. This is because Debian provides multiple ways to do an upgrade of a complete system and those cause different log files to be written. As they all installed the same set of packages, the rest of the file system, as expected, contained no differences. Chapter 8. ISE-T: Two-Person Control Administration 8.4.2 192 System Services Our second set of tests involved adding and removing services. Users were instructed to install SSH and remove PPP. These tests were an extension of the previous package installation tests and demonstrated how one would automatically start and stop services, as well as a demonstration of files we knew would fail equivalence testing. For the first test, we instructed the users to install the SSH daemon. This test sought to demonstrate that ISE-T can detect when a new service is installed and therefore enable it when the changes are committed. In Linux systems, a System-V init script has to be added to the system to allow it to be started each time the machine boots. If the user’s administration environment contains a new init script, ISE-T automatically determines that the service should be started when this set of administration changes is committed to the base system. This test also sought to demonstrate that certain files are always going to be different between users if created within their private environment. The SSH host key for each environment is different because it is created based on the kernel’s random entropy pool, which is different for each user and therefore will never be the same if created in a separate environment. A way around this would be not to create it within the private branch of each user, but instead to create it after the equivalent changes are committed, for example, the first time the service’s init script is executed. For the second test, we instructed the users to remove the PPP daemon. This test demonstrated that there are multiple ways to remove a package in a Debian system, and depending on the way the package is removed, the underlying file system will be different. Specifically, a package can either be removed or purged. 
When a package is removed, files marked as configuration files are left behind, allowing the packages to be reinstalled and have the configuration remain the same. On the other hand, Chapter 8. ISE-T: Two-Person Control Administration 193 when a package is purged, the package manager will remove the package and all the configuration files associated with it. In this case, the users chose different ways to remove the package, and ISE-T was able to determine the differences for those that chose to remove or purge it. 8.4.3 Configuration Changes Our third set of tests involved modifications to configuration files on the system and included five separate tests in two categories. The first category was composed of simple file configuration changes. We first instructed the users to modify the host name of the machine persistently from debian to iset, which is accomplished by editing the /etc/hostname file. As expected, as this configuration change is very simple. All users modified the system’s hostname in the exact same manner, allowing ISE-T to determine that all the systems were equivalent. Next, we instructed the users to modify the /etc/inetd.conf to enable the discard service. In this case, as the file is more free-form, their changes were not exact, and many were not equivalent. For example, some users enabled it for both TCP and UDP, while others enabled it for TCP alone. Also, some added a comment, while others did not. Whereas the first change is not equivalent, the second change should be considered equivalent, but this cannot be determined by a simple diff. One needs to parse the files correctly to determine that they are equivalent, an ability lacking in our ISE-T prototype. However, ISE-T was able to clearly report the differences that existed between the users environments. The second category of tests involved setting up and removing cron jobs and was composed of three tests. First, we provided the users with a script in /root that we instructed them to install in the system in a manner so that it will be executed Chapter 8. ISE-T: Two-Person Control Administration 194 daily. In Debian there are two ways to have a cron job execute daily. The best way is to simply place the script in the /etc/cron.daily directory, as all the programs located within it are executed once a day. On the other hand, a user could use the /etc/crontab file and instruct it exactly what time each day to execute the program. We were looking for users to place the file in /etc/cron.daily which the large majority did; but one user who perhaps was not used to Debian modified /etc/crontab directly. Again, ISE-T was able to detect the difference when it existed and report the discrepancy. Similarly, we instructed users to remove a cron job that ran hourly. The cron job’s script was already located in /etc/cron.hourly and not in crontab. All users deleted it from the /etc/cron.hourly directory, even the user who modified /etc/crontab in the previous test. Finally, we had an entry in /etc/crontab that executed a cron job at 4:35 AM every day, and instructed the users to change it to 5:35 AM. As expected, all users performed the same set of changes by modifying the /etc/crontab file. 8.4.4 Exploit In our final test, we asked the users to create a back door shell into the system. This test demonstrates that ISE-T will detect when one of the two system administrators makes a rogue administration change to the system, as the other administrator will not have made that same administration change. 
We did not specify how or where the shell should be created, and left it up to the user. The users created setuid shells in many different locations, some as “dot files” hidden from a normal ls listing, and some just performed setuid on existing shells. In all cases, as none of the users made this rogue modification in the same manner, ISE-T was able to show the rogue changes as different in every environment. Chapter 8. ISE-T: Two-Person Control Administration 8.5 195 Related Work Two-person control mechanisms are known to provide high levels of assurance [134]. Multiple examples exist with regard to nuclear weapons. For instance, to launch a nuclear weapon, two operators must separately confirm that launch orders are valid and must turn their launch keys together to launch the missiles. In fact, every sensitive action concerning nuclear weapons must be performed by two people with the same training and authority [39, Chapter 2]. The same notion is applied in many financial settings: banks will require two people to be involved in certain tasks, such as opening a safe deposit box [148], and companies can require two people to sign a check [55] over a certain amount. This makes it much more difficult for a single person to commit fraud. As far as we know, this mechanism has never been applied directly to system administration. In the Compartmented Mode Workstation (CMW), the system administration job is split into roles, so that many traditional administration actions require more than one user’s involvement [138]. This demarcation of roles was first pioneered in Multics at MIT [75]. Similarly, the Clark-Wilson model was designed to prevent unauthorized and improper modifications to a system to ensure its integrity [44]. All these systems simply divided the administrators’ actions among different users who performed different actions. This differs fundamentally from the traditional notion of two-person control where both people do the same exact action. More recently, many products have been created to help prevent and detect when accidental mistakes occur in a system. SudoSH [69] is able to provide a higher level of assurance during system administration as it records all keystrokes entered during a session and is able to replay the session. However, while sudosh can provide an audit log of what the administrator did, it does not provide the assurances provided Chapter 8. ISE-T: Two-Person Control Administration 196 by the two-person control model. Even if one were to audit the record or replay it, one is not guaranteed to get the same result. Although auditing this record can be useful for detecting accidental mistakes, it cannot detect malicious changes. For instance, a file fetched from the Internet can be modified. If the administrators can control which files are fetched, they can manipulate them before and after the sudosh session. ISE-T, on the other hand, does not care about the steps administrators take to accomplish a task, only the end result as it appears on the file system. Part of the reason accidental mistakes occur is that knowledge is not easily passed between experienced and inexperienced system administrators. Although systems like administration diaries and wikis can help, they do not easily associate specific administration actions with specific problems. Trackle [50] attempts to solve this by combining an issue tracker with a logged console session. 
Issues can be annotated, edited and cross-referenced while the logged console session logs all actions taken and file changes and stores them with the issue, improving institutional memory. Although this allows less experienced system administrators to see the exact steps a previous administrator took to fix a similar or equivalent issue, it does not actually prevent mistakes from entering and remaining in the system, nor does it prevent a malicious administrator from operating. ISE-T’s notion of file system views was first explored in Plan 9 [104]. In Plan 9, it is a fundamental part of the system’s operation. As Plan 9 does not view manipulating the file system view as a privileged operation, each process can craft the namespace view it or its children will see. A more restricted notion of file system views is described by Ioannidis [71]. There, its purpose is to overlay a different set of permissions on an existing file system. Finally, a common way to make a system tolerant of administration faults is to use file system versioning, which allows rolling back to a configuration file’s previous Chapter 8. ISE-T: Two-Person Control Administration 197 state if an error is made. Operating systems such as Tops-20 [53] and VMS [90] include native operating system support for versioning as a standard feature of their file systems. These operating systems employ a copy-on-write semantic that involves versioning a file each time a process changes it. Other file systems, such as VersionFS [96], ElephantFS [127] and CVFS [132], have been created to provide better control of the file system versioning semantic. Chapter 9 Conclusions and Future Work This dissertation demonstrates that many different types of modern computing problems can be solved in a relatively simple manner with different forms of operating system virtualization. First, we presented *Pod. *Pod decouples a user’s computing experience from a single machine while providing them with the same persistent, personalized computing session they expect from a regular computer. *Pod allows different types of applications to be stored on a small portable storage device that can be easily carried on a key chain or in a user’s pocket, thereby allowing the user increased mobility. *Pod uses operating system and display virtualization to decouple the computing session from the host on which it is currently running. It combines this virtualization mechanism with a checkpoint/restart system that lets *Pod users suspend their computing session, move around, and resume their session at any computer. Second, we presented AutoPod. AutoPod expands on *Pod by enabling isolated applications running within a pod to be transparently migrated across machines running different operating system kernel versions. This lets maintenance occur promptly, as system administrators do not have to take down all applications running on a ma- Chapter 9. Conclusions and Future Work 199 chine when it needs maintenance. Instead, the applications are migrated to a new machine where they can continue execution. As AutoPod enables this across different kernel versions, security patches can be applied to operating systems in a timely manner with minimal impact on the availability of application services. Third, we presented PeaPod, an operating system virtualization layer that enables secure isolation of legacy applications. The virtualization layer leverages pods and introduces peas to encapsulating processes. 
Pods provide an easy-to-use lightweight virtual machine abstraction that can securely isolate individual applications without the need to run an operating system instance in the pod. Peas provide a fine-grained least-privilege mechanism that can further isolate application components within pods. PeaPod's virtualization layer can isolate untrusted applications, preventing them from being used to attack the underlying host system or other applications even if they are compromised.

Fourth, we presented Strata, which improves the way system administrators manage the virtual appliances (VAs) under their control by introducing the virtual layered file system. By addressing contents by file location instead of block address, VLFSs allow Strata to quickly and simply provision VAs, as no data needs to be copied into place. Strata provides improved management, as file system modifications are isolated and upgrades can be stored centrally and applied atomically. It also allows Strata to create new VLFSs and VAs by composing together smaller base VLFSs and VAs that provide core components. Strata significantly reduces the amount of disk space required for multiple VAs, allows them to be provisioned almost instantaneously, and allows them to be quickly updated no matter how many are in use. The research into Strata's VLFS also enabled DejaView's ability to provide a time-traveling desktop [81]. By layering a blank layer over the file system snapshot, DejaView was able to quickly recreate a fully writable file system view.

Fifth, we presented Apiary, which introduces a new compartmentalized application desktop paradigm. Instead of running one's applications in a single environment with complex rules to isolate the applications from each other, Apiary allows them to be easily and completely isolated while retaining the integrated feel users expect from their desktop computer. The key innovations that make this possible are the use of virtual layered file systems and the ephemeral application execution environments they enable. The VLFS allows the multiple containers to be stored as efficiently as a single regular desktop, while also allowing containers to be instantiated almost instantly. This functionality enables the creation of the ephemeral containers that provide an always fresh and clean environment for applications to run in.

Apiary's usage model of fully isolating each application works well in many scenarios, but can cause complications in others. For instance, as each application's file system is fully isolated, if one wanted to send a file as an email attachment, one could not create a new email message and attach the file to it; the email program might not have access to the file system containing the file. Although Apiary provides a method for users to copy files between containers, this can have an impact on users' ability to use the system efficiently. Applying Apiary's principles to non-desktop environments, such as smartphones and tablets, where user interface paradigms are not as ingrained as on the desktop, can enable user interface metaphors that behave seamlessly without compromising Apiary's application isolation.

Apiary also raises a number of interesting follow-up questions, as it only explores the benefits of applications that can run in total isolation. There are smaller applications, such as browser plugins, that cannot run in total isolation, but must remain part of a larger environment.
An interesting follow-up question would be to try to see how Apiary’s concepts apply to multiple components of a single application, where the components cannot be run independently. Chapter 9. Conclusions and Future Work 201 The ephemeral execution model introduced by Apiary provides multiple avenues for follow-up. For instance, many network-facing services, such as mail and web services, continuously run based on untrusted input they receive from the network. These services have also been consistently exploited due to flaws in their programs. However, the ephemeral execution model, as presented by Apiary, is not a perfect fit for these services as they need some level of “write” access to the underlying system that will be persistent. An interesting area of research would be to understand how these services operate and how ephemeral execution could be leveraged to provide more security while still allowing the persistent data storage that these services require. Finally, we presented ISE-T, which enables and applies the two-person controller model to system administration. In administration, this model requires two administrators to perform the same administrative act with equivalent results for the administrative changes to be allowed to affect the system that is being modified. ISE-T creates multiple parallel environments for the administrators to perform their administrative changes independently. ISE-T then compares the results of the administrative changes for equivalence. When the results are equivalent, there is a high assurance that system administration faults have not been introduced into the system, be they malicious or accidental in nature. ISE-T’s application of the two-person controller model is just an element of a larger vision of applying this dual control model to solving computing problems. In particular, we want to explore how the ability to create dual environments can provide improved systems management and security of systems in general. For system management, patching a system is critical to ensure that it remains secure. However, many patches can introduce new bugs as well. By being able to create two environments that run in parallel, one can test the known working system against a patched Chapter 9. Conclusions and Future Work 202 system to ensure that the patch does not introduce any new faults. Similarly, it can improve security as we can create two parallel environments that differ randomly in areas such as their process’s address space layout and stacks. As code injection attacks are directly tied to these layouts, by running two systems in parallel with different layouts, an attack will result in fundamentally different results on the two systems, allowing one to detect that an attack is occurring. Bibliography [1] Fakeroot. http://fakeroot.alioth.debian.org/. [2] Gmail. https://gmail.google.com. [3] Google Docs. https://docs.google.com. [4] he RPM Package Manager. http://www.rpm.org/. [5] Hotmail. http://www.hotmail.com. [6] Linux Containers. http://lxc.sourceforge.net/. [7] Linux VServer Project. http://www.linux-vserver.org/. [8] Portable Firefox. http://johnhaller.com/jh/mozilla/portable_firefox/. [9] SoX - Sound eXchange. http://sox.sourceforge.net. [10] Stealth Surfer. http://www.stealthsurfer.biz/. [11] Trek Thumbdrive TOUCH. http://www.thumbdrive.com/p-thumbdrive. php?product=tdswipecrypto. [12] U3 Platform. http://www.u3.com. 
Bibliography 204 [13] US DoD Joint Publication 1-02, DOD Dictionary of Military and Associated Terms (as amended through 9 June 2004). [14] Virtual Network Computing. http://www.realvnc.com/. [15] Sendmail v.5 Vulnerability. Technical Report CA-1995-08, CERT Coordination Center, August 1995. [16] MIME Conversion Buffer Overflow in Sendmail Versions 8.8.3 and 8.8.4. Technical Report CA-1997-05, CERT Coordination Center, January 1997. [17] Anurag Acharya and Mandar Raje. MAPbox: Using Parameterized Behavior Classes to Confine Applications. In The 9th USENIX Security Symposium, Denver, CO, August 2000. [18] Adobe Systems Incorporated. Buffer Overflow Issue in Versions 9.0 and Earlier of Adobe Reader and Acrobat. http://www.adobe.com/support/security/ advisories/apsa09-01.html, February 2009. [19] Paul Anderson. LCFG: A Practical Tool for System Configuration. Usenix Association, August 2008. [20] http://www.aim.com/get_aim/express/. [21] Myla Archer, Elizabeth Leonard, and Matteo Pradella. Towards a Methodology and Tool for the Analysis of Security-Enhanced Linux. Technical Report NRL/MR/5540—02-8629, NRL, August 2002. [22] Yeshayahu Artsy, Hung-Yang Chang, and Raphael Finkel. Interprocess Communication in Charlotte. IEEE Software, 4(1):22–28, January 1987. Bibliography 205 [23] Dirk Balfanz and Daniel R. Simon. WindowBox: A Simple Security Model for the Connected Desktop. In The 4th USENIX Windows Systems Symposium, Seattle, WA, August 2000. [24] Amnon Barak and Richard Wheeler. MOSIX: An Integrated Multiprocessor UNIX. In The 1989 USENIX Winter Technical Conference, pages 101–112, San Diego, CA, February 1989. [25] Arash Baratloo, Navjot Singh, and Timothy Tsai. Transparent Run-Time Defense Against Stack Smashing Attacks. In The 2000 USENIX Annual Technical Conference, San Diego, CA, June 2000. [26] Ricardo Baratto, Shaya Potter, Gong Su, and Jason Nieh. MobiDesk: Mobile Virtual Desktop Computing. In The 10th Annual ACM International Conference on Mobile Computing and Networking, Philadelphia, PA, September 2004. [27] Ricardo A. Baratto, Leonard N. Kim, and Jason Nieh. THINC: A Virtual Display Architecture for Thin-Client Computing. In The 20th ACM Symposium on Operating Systems Principles, Brighton, United Kingdom, October 2005. [28] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauery, Ian Pratt, and Andrew Warfield. Xen and the Art of Virtualization. In The 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, October 2003. [29] Andrew Baumann, Jonathan Appavoo, Dilma Da Silva, Jeremy Kerr, Orran Krieger, and Robert W. Wisniewski. Providing Dynamic Update in an Operating System. In 2005 USENIX Annual Technical Conference, pages 279–291, Anaheim, CA, April 2005. Bibliography 206 [30] Andrew Berman, Virgil Bourassa, and Erik Selberg. TRON: Process-specific File Protection for the UNIX Operating System. In The 1995 USENIX Winter Technical Conference, pages 165–175, New Orleans, LA, January 1995. [31] bitdefender. Trojan.pws.chromeinject.b. http://www.bitdefender.com/ VIRUS-1000451-en--Trojan.PWS.ChromeInject.B.html, November 2008. [32] Jeff Bonwick and Bill Moore. ZFS: The Last Word In File Systems. http:// opensolaris.org/os/community/zfs/docs/zfs_last.pdf, November 2005. [33] Kevin Borders, Eric Vander Weele, Billy Lau, and Atul Prakash. Protecting Confidential Data on Personal Computers with Storage Capsules. In The 18th USENIX Security Symposium, Montreal. Canada, August 2009. [34] Ed Bugnion, Scott Devine, and Mendel Rosenblum. 
Disco: Running Commodity Operating Systems on Scalable Multiprocessors. In The 16th ACM Symposium on Operating Systems Principles, pages 143–156, Saint Malo, France, December 1997. [35] Thomas Bushnell. The HURD: Towards a New Strategy of OS Design. http: //www.gnu.org/software/hurd/hurd-paper.html, 1994. [36] Bruce Byfield. An Apt-Get Primer. http://www.linux.com/articles/40745, December 2004. [37] Ramón Cáceres, Casey Carter, Chandra Narayanaswami, and Mandayam Raghunath. Reincarnating PCs with Portable SoulPads. In The 3rd International Conference on Mobile Systems, Applications, and Services, pages 65–78, Seattle, WA, June 2005. ACM. Bibliography 207 [38] Justin Capps, Scott Baker, Jeremy Plichta, Duy Nyugen, Jason Hardies, Matt Borgard, Jeffry Johnston, and John H. Hartman. Stork: Package Management for Distributed VM Environments. In The 21st Large Installation System Administration Conference, Dallas, TX, November 2007. [39] Ashton B. Carter, John D. Steinbruner, and Charles A. Zraket, editors. Managing Nuclear Operations. The Brookings Institution, Washington, DC, 1987. [40] Jeremy Casas, Dan Clark, Rabi Konuru, Steve Otto, Robert Prouty, and Jonathan Walpole. MPVM: A Migration Transparent Version of PVM. Computing Systems, 8(2):171–216, 1995. [41] Ramesh Chandra, Nickolai Zeldovich, Constantine Sapuntzakis, and Monica S. Lam. The Collective: A Cache-Based System Management Architecture. In The 2nd Symposium on Networked Systems Design and Implementation, pages 259–272, Boston, MA, April 2005. [42] David R. Cheriton. The V Distributed System. Communications of the ACM, 31(3):314–333, March 1988. [43] Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. Live Migration of Virtual Machines. In The 2nd Symposium on Networked Systems Design and Implementation, pages 273–286, Boston, MA, April 2005. [44] David D. Clark and David R. Wilson. A Comparison of Commercial and Military Computer Security Policies. IEEE Symposium on Security and Privacy, 0:184, April 1987. Bibliography 208 [45] Commission for Review of FBI Security Programs, William Webster, chair. Webste Report: A Review of FBI Security Programs, March 2002. [46] Small Form Factors Committee. Specification for Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.). Technical Report SFF-8035, Technical Committee T13 AT Attachment, April 1996. [47] Roberto Di Cosmo, Berke Durak, Xavier Leroy, Fabio Mancinelli, and Jérôme Vouillon. Maintaining Large Software Distributions: New Challenges from the FOSS Era. EASST Newsletter, 12:7–20, 2006. [48] Crispan Cowan, Calton Pu, Dave Maier, Jonathan Walpole, Peat Bakke, Steve Beattie, Aaron Grier, Perry Wagle, Qian Zhang, and Heather Hinton. StackGuard: Automatic Adaptive Detection and Prevention of Buffer-Overflow Attacks. In The 7th USENIX Security Conference, pages 63–78, San Antonio, TX, January 1998. [49] Crispin Cowan, Steve Beattie, Greg Kroah-Hartman, Calton Pu, Perry Wagle, and Virgil Gligor. SubDomain: Parsimonious Server Security. In 14th USENIX Systems Administration Conference, New Orleans, LA, December 2000. [50] Daniel S. Crosta, Matthew J. Singleton, and Benjamin A. Kuperman. Fighting Institutional Memory Loss: The Trackle Integrated Issue and Solution Tracking System. In The 20th Large Installation System Administration Conference, pages 287–298, Washington, DC, December 2006. [51] B.C. Cumberland, G. Carius, and A. Muir. 
Microsoft Windows NT Server 4.0, Terminal Server Edition: Technical Reference. Microsoft Press, Redmond, WA, August 1999. Bibliography 209 [52] Martin Davis and Hilary Putnam. A Computing Procedure for Quantification Theory. Journal of the ACM, 7(3):201–215, July 1960. [53] Digital Equipment Corporation. TOPS-20 User’s Guide, January 1980. [54] Fred Douglis and John Ousterhout. Transparent Process Migration: Design Alternatives and the Sprite Implementation. Software - Practice and Experience, 21(8):757–785, August 1991. [55] Michael Sack Elmaleh. Nonprofit Fraud Prevention. http://www. understand-accounting.net/Nonprofitfraudprevention.html, 2007. [56] Javier Fernandez-Sanguino. Debian GNU/Linux FAQ - Chapter 8 - The Debian Package Management Tools. http://www.debian.org/doc/FAQ/ ch-pkgtools.en.html. [57] FreeBSD Project. Developer’s Handbook. http://www.freebsd.org/doc/en_ US.ISO8859-1/books/developers-handbook/secure-chroot.html. [58] Steve Friedl. Best Practices for UNIX chroot() Operations. http://unixwiz. net/techtips/chroot-practices.html, January 2002. [59] Tal Garfinkel. Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools. In The 10th Annual Network and Distributed Systems Security Symposium, San Diego, CA, February 2003. [60] Tal Garfinkel, Ben Pfaff, and Mendel Rosenblum. Ostia: A Delegating Architecture for Secure System Call Interposition. In The 1st Network and Distributed Systems Security Symposium, February 2004. [61] James Gettys and Robert W. Scheifler. Xlib - C Language X Interface. X Consortium, Inc., 1996. p. 224. Bibliography 210 [62] Martyn Gilmore. 10Day CERT Advisory on PDF Files. http://seclists. org/fulldisclosure/2003/Jun/0463.html, June 2003. [63] Gnome.org. Libwnck Reference Manual. http://library.gnome.org/devel/ libwnck/. [64] GOBBLES Security. Local/Remote Mpg123 Exploit. http://www.opennet. ru/base/exploits/1042565884_668.txt.html, January 2003. [65] L. Gong and R. Schemers. Implementing Protection Domains in the Java Development Kit 1.2. In The 1998 Internet Society Symposium on Network and Distributed System Security, pages 125–134, San Diego, CA, 1998. [66] Google. Google Chrome - Features. http://www.google.com/chrome/intl/ en/features.html. [67] GreyMagic Security Research. Reading Local Files in Netscape 6 and Mozilla. http://sec.greymagic.com/adv/gm001-ns/, April 2002. [68] Philippe Grosjean. Speed Comparison of Various Number Crunching Packages (Version 2). http://www.sciviews.org/benchmark/, March 2003. [69] Douglas Hanks. Sudosh. http://sourceforge.net/projects/sudosh/. [70] Joseph Heller. Catch-22. Simon and Schuster, 1961. [71] Sotiris Ioannidis, Steven M. Bellovin, and Jonathan Smith. Sub-Operating Systems: A New Approach to Application Security. In SIGOPS European Workshop, Saint-Emilion, France, September 2002. Bibliography 211 [72] Shvetank Jain, Fareha Shafique, Vladan Djeric, and Ashvin Goel. Applicationlevel Isolation and Recovery with Solitude. In The 3rd ACM European Conference on Computer Systems, pages 95–107, Glasgow, Scotland, April 2008. [73] Michael K. Johnson. Linux Kernel Hackers’ Guide. The Linux Documentation Project, 1997. [74] Poul-Henning Kamp and Robert N. M. Watson. Jails: Confining the Omnipotent Root. In The 2nd International SANE Conference, MECC, Maastricht, The Netherlands, May 2000. [75] Paul Karger. Personal Communication, May 2009. [76] Jeffrey Katcher. PostMark: A New File System Benchmark. Technical Report TR3022, Network Appliance, Inc., October 1997. [77] Jeffry O. 
Kephart and David M. Chess. The Vision of Autonomic Computing. IEEE Computer, pages 41–50, January 2003. [78] Yousef A. Khalidi and Michael N. Nelson. Extensible File Systems in Spring. In The 14th ACM Symposium on Operating Systems Principles, pages 1–14, Asheville, NC, December 1993. ACM. [79] Gene Kim and Eugene Spafford. Experience with Tripwire: Using Integrity Checkers for Intrusion Detection. In The 1994 System Administration, Networking, and Security Conference, April 1994. [80] Calvin Ko, Timothy Fraser, Lee Badger, and Douglas Kilpatrick. Detecting and Countering System Intrusions Using Software Wrappers. In The 9th USENIX Security Symposium, Denver, CO, August 2000. Bibliography 212 [81] Oren Laadan, Ricardo Baratto, Dan Phung, Shaya Potter, and Jason Nieh. DejaView: A Personal Virtual Computer Recorder. In The 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007. [82] Butler Lampson. Accountability and Freedom. http://research.microsoft. com/en-us/um/people/blampson/slides/accountabilityandfreedom.ppt, September 2005. [83] Jeffrey P. Lanza and Shawn V. Hernan. Remote Buffer Overflow in Sendmail. Technical Report CA-2003-07, CERT Coordination Center, March 2003. [84] Zhenkai Liang, V.N. Venkatakrishnan, and R. Sekar. Isolated Program Execution: An Application Transparent Approach for Executing Untrusted Programs. In 19th Annual Computer Security Applications Conference, Las Vegas, NV, December 2003. [85] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System. Technical Report 1346, University of Wisconsin Madison Computer Sciences, April 1997. [86] Peter Loscocco and Stephen Smalley. Integrating Flexible Support for Security Policies into the Linux Operating System. In The FREENIX Track: 2001 USENIX Annual Technical Conference, Boston, MA, June 2001. [87] David E. Lowell, Yasushi Saito, and Eileen J. Samberg. Devirtualizable Virtual Machines Enabling General, Single-node, Online Maintenance. In The 11th International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, MA, October 2004. Bibliography 213 [88] Art Manion, Shawn V. Hernan, and Jeffery P. Lanza. Buffer overflow in sendmail. Technical Report CA-2003-12, CERT Coordination Center, March 2003. [89] David Mazières. A Toolkit for User-Level File Systems. In The 2001 USENIX Annual Technical Conference, pages 261–274, Boston, MA, June 2001. [90] Kirby McCoy. VMS File System Internals. Digital Press, 1990. [91] Mark McLoughlin. QCOW2 Image Format. http://www.gnome.org/~markmc/ qcow-image-format.htm, September 2008. [92] Microsoft. Microsoft Application Virtualization. http://www.microsoft.com/ systemcenter/appv/default.mspx. [93] Microsoft Corp. SendMessage Function. http://msdn.microsoft.com/en-us/ library/ms644950(VS.85).aspx. [94] Moka5. Moka5 Technology Overview. http://www.moka5.com/node/381, November 2006. [95] Sape J. Mullender, Guido Van Rossum, Andrew S. Tanenbaum, Robert van Renesse, and Hans Van Staveren. Amoeba: A Distributed Operating System for the 1990s. IEEE Computer, 23(5):44–53, May 1990. [96] Kiran-Kumar Muniswamy-Reddy, Charles P. Wright, Andrew Himmer, and Erez Zadok. A Versatile and User-Oriented Versioning File System. In The 3rd USENIX Conference on File and Storage Technologies, pages 115–128, San Francisco, CA, March/April 2004. [97] Rajeev Nagar. Filter Drivers. In Windows NT File System Internals: A Developer’s Guide. 
O’Reilly, September 1997. Bibliography 214 [98] Gustavo Niemeyer. Smart Package Manager. http://labix.org/smart. [99] Peter Norton, Peter Aitken, and Richard Wilton. The Peter Norton PC Programmer’s Bible: The Ultimate Reference to the IBM PC and Compatible Hardware and Systems Software. Microsoft Press, 1993. [100] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh. The Design and Implementation of Zap: A System for Migrating Computing Environments. In The 5th Symposium on Operating Systems Design and Implementation, Boston, MA, December 2002. [101] Paul A. Karger and Roger R. Schell. Multics Security Evaluation: Vulnerability Analysis, Volume II. Technical Report ESD-TR-74-193, HQ Electronic Systems Division: Hanscom AFB, MA, June 1974. [102] Jan-Simon Pendry and Marshall Kirk McKusick. Union Mounts in 4.4BSD-lite. In The 1995 USENIX Technical Conference, New Orleans, LA, January 1995. [103] Ben Pfaff, Tal Garfinkel, and Mendel Rosenblum. Virtualization Aware File Systems: Getting Beyond the Limitations of Virtual Disks. In 3rd Symposium of Networked Systems Design and Implementation, San Jose, CA, May 2006. [104] Rob Pike, David L. Presotto, Ken Thompson, and Howard Trickey. Plan 9 from Bell Labs. In The 1990 Summer UKUUG Conference, pages 1–9, London, United Kingdom, July 1990. UKUUG. [105] Rob Pike and Dennis M. Ritchie. The Styx Architecture for Distributed Systems. Bell Labs Technical Journal, 4(2):146–152, 1999 1999. Bibliography 215 [106] James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent Checkpointing under Unix. In The 1995 USENIX Winter Technical Conference, pages 213–223, New Orleans, LA, January 1995. [107] Thomas Porter and Tom Duff. Compositing Digital Images. Computer Graphics, 18(3):253–259, July 1984. [108] Jef Poskanzer. http://www.acme.com/software/http_load/. [109] Shaya Potter, Ricardo Baratto, Oren Laadan, Leonard Kim, and Jason Nieh. MediaPod: A Personalized Multimedia Desktop In Your Pocket. In The 11th IEEE International Symposium on Multimedia, pages 219–226, San Diego, CA, December 2009. [110] Shaya Potter, Ricardo Baratto, Oren Laadan, and Jason Nieh. GamePod: Persistent Gaming Sessions on Pocketable Storage Devices. In The 3rd International Conference on Mobile Ubiquitous Computing, Systems, Services, and Technologies, Sliema, Malta, October 2009. [111] Shaya Potter, Steven M. Bellovin, and Jason Nieh. Two Person Controller Administration: Preventing Administrative Faults through Duplication. In The 23rd Large Installation System Administration Conference, Baltimore, MD, November 2009. [112] Shaya Potter and Jason Nieh. Reducing downtime due to system maintenance and upgrades. In The 19th Large Installation System Administration Conference, pages 47–62, San Diego, CA, December 2005. Bibliography 216 [113] Shaya Potter and Jason Nieh. WebPod: Persistent Web Browsing Sessions with Pocketable Storage Devices. In The 14th International World Wide Web Conference, Chiba, Japan, May 2005. [114] Shaya Potter and Jason Nieh. Highly Reliable Mobile Desktop Computing in Your Pocket. In The 2006 IEEE Computer Society Signature Conference on Software Technology and Applications, September 2006. [115] Shaya Potter, Jason Nieh, and Matt Selsky. Secure Isolation of Untrusted Legacy Applications. In The 21st conference on Large Installation System Administration Conference, pages 117–130, Dallas, TX, November 2007. [116] Daniel Price and Andrew Tucker. Solaris Zones: Operating System Support for Consolidating Commercial Workloads. 
In 18th Large Installation System Administration Conference, November 2004. [117] Debian Project. DDP Developers’ Manuals. http://www.debian.org/doc/ devel-manuals. [118] Niels Provos. Improving Host Security with System Call Policies. In The 12th USENIX Security Symposium, Washington, DC, August 2003. [119] Jim Pruyne and Miron Livny. Managing Checkpoints for Parallel Programs. In The 2nd Workshop on Job Scheduling Strategies for Parallel Processing, Honolulu, HI, April 1996. [120] Richard F. Rashid and George G. Robertson. Accent: A Communication Oriented Network Operating System Kernel. In The 8th ACM Symposium on Operating System Principles, pages 64–75, Bretton Woods, NH, December 1984. Bibliography 217 [121] Darrell Reimer, Arun Thomas, Glenn Ammons, Todd Mummert, Bowen Alpern, and Vasanth Bala. Opening Black Boxes: Using Semantic Information to Combat Virtual Machine Image Sprawl. In The 2008 ACM International Conference on Virtual Execution Environments, Seattle, WA, March 2008. [122] Charles Reis and Steven D. Gribble. Isolating Web Programs in Modern Browser Architectures. In The 4th ACM European Conference on Computer Systems, Nuremberg, Germany, March 2009. [123] Eric Rescorla. Security Holes... Who Cares? In The 12th USENIX Security Conference, Washington, D.C., August 2003. [124] David Rosenthal. Evolving the Vnode Interface. In The 1990 USENIX Summer Technical Conference, pages 107–118, June 1990. [125] Marc Rozier, Vadim Abrossimov, François Armand, I. Boule, Michel Gien, Marc Guillemont, F. Herrman, Claude Kaiser, S. Langlois, P. Léonard, and W. Neuhauser. Overview of the Chorus Distributed Operating System. In The Workshop on Micro-Kernels and Other Kernel Architectures, pages 39–70, Seattle, WA, 1992. [126] Jerome H. Saltzer and Michael D. Schroeder. The Protection of Information in Computer Systems. In The 4th ACM Symposium on Operating System Principles, Yorktown Heights, NY, October 1973. [127] Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Alistair C. Veitch, Ross W. Carton, and Jacob Ofir. Deciding When to Forget in the Elephant File System. In The 17th ACM Symposium on Operating Systems Principles, Charleston, SC, December 1999. Bibliography 218 [128] Constantine P. Sapuntzakis, Ramesh Chandra, Ben Pfaff, Jim Chow, Monica S. Lam, and Mendel Rosenblum. Optimizing the Migration of Virtual Computers. In The 5th Symposium on Operating Systems Design and Implementation, Boston, MA, December 2002. [129] Brian K. Schmidt. Supporting Ubiquitous Computing with Stateless Consoles and Computation Caches. PhD thesis, Computer Science Department, Stanford University, August 2000. [130] Glenn C. Skinner and Thomas K. Wong. ”Stacking” Vnodes: A Progress Report. In The 1993 USENIX Summer Technical Conference, pages 1–27, Cincinnati, Ohio, June 1993. [131] Peter Smith and Norman C. Hutchinson. Heterogeneous Process Migration: The Tui System. Software – Practice and Experience, 28(6):611–639, 1998. [132] Craig A. N. Soules, Garth R. Goodson, John D. Strunk, and Gregory R. Ganger. Metadata Efficiency in a Comprehensive Versioning File System. In The 2nd USENIX Conference on File and Storage Technologies, San Francisco, CA, March 2003. [133] Ray Spencer, Stephen Smalley, Peter Loscocco, Mike Hibler, David Andersen, and Jay Lepreau. The Flask Security Architecture: System Support for Diverse Security Policies. In The 8th USENIX Security Symposium, Washington, DC, August 1999. [134] Peter Stein and Peter Feaver. Assuring Control of Nuclear Weapons. 
University Press of America, 1987. Bibliography 219 [135] Sun Microsystems, Inc. NFS: Network File System Protocol Specification. Technical Report RFC 1094, Internet Engineering Task Force, March 1989. [136] Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Improving the Reliability of Commodity Operating Systems. In The 19th ACM Symposium on Operating Systems Principles, pages 207–222, Bolton Landing, NY, USA, October 2003. ACM Press. [137] Miklos Szeredi. Filesystem in Userspace. http://fuse.sourceforge.net/. [138] Johnny S. Tolliver. Compartmented Mode Workstation (CMW) Comparisons. In 17th DOE Computer Security Group Training Conference, Milwaukee, WI, May 1995. [139] Anthony Towns. Checking Installability is an NP-Complete Prob- lem. http://www.mail-archive.com/[email protected]/ msg03311.html, November 2007. [140] Satoshi Uchino. MetaVNC - A Window Aware VNC. http://metavnc. sourceforge.net/. [141] Inc. VMWare. VMware VMotion for Live Migration of Virtual Machines. http: //www.vmware.com/products/vi/vc/vmotion.html. [142] VMware, Inc. http://www.vmware.com. [143] VMware Inc. VMware Worksation 6.5 Release Notes. http://www.vmware. com/support/ws65/doc/releasenotes_ws65.html, October 2008. [144] David Wagner. Janus: An Approach for Confinement of Untrusted Applications. Master’s thesis, University of California, Berkeley, 1999. Bibliography 220 [145] Robert N. M. Watson. Exploiting Concurrency Vulnerabilities in System Call Wrappers. In The 1st USENIX Workshop on Offensive Technologies, Boston, MA, August 2007. [146] Florian Weimer. DSA-1438-1 Tar – Several Vulnerabilities. http://www.ua. debian.org/security/2007/dsa-1438, December 2007. [147] Andrew Whitaker, Marianne Shaw, and Steven D. Gribble. Scale and Performance in the Denali Isolation Kernel. In The 5th Symposium on Operating Systems Design and Implementation, Boston, MA, December 2002. [148] Wilshire State Bank. Safe Deposit Boxes. https://www.wilshirebank.com/ public/additional_safedeposit.asp, 2008. [149] David Wise. Spy: The Inside Story of how the FBI’s Robert Hanssen Betrayed America. Random House, 2002. [150] Charles P. Wright, Jay Dave, Puja Gupta, Harikesavan Krishnan, David P. Quigley, Erez Zadok, and Mohammad Nayyer Zubair. Versatility and Unix Semantics in Namespace Unification. ACM Transactions on Storage, 2(1):1–32, February 2006. [151] X/Open, editor. Protocols for X/Open PC Interworking: SMB, Version 2. X/Open Company Ltd, 1992. [152] Erez Zadok and Jason Nieh. FiST: A Language for Stackable File Systems. In The 2000 USENIX Annual Technical Conference, pages 55–70, San Diego, CA, June 2000. Appendix A Restricted System Calls To securely isolate regular Linux processes, we interpose on a number of additional system calls beyond what is necessary for other forms of virtualization. Below is a complete list of the few system calls that require more than plain virtualization on Linux. We give the reasoning for the interposition, where it is not self-explanatory, and note what functionality was changed from the base system call. Most system calls do not require more than simple virtualization to ensure isolation because virtualization of the resources itself isolates them. For example, the kill system call cannot signal a process outside the virtualized environment because the virtualized namespace will not map it, so the system call cannot reference the process. A.1 Host-Only System Calls These system calls are generally not needed in a virtualized environment and are therefore not allowed. 1. 
mount – If a user within a virtualized environment were able to mount a file system, they could mount a file system with device nodes already present and would thus be able to access the underlying system directly in a manner not controlled by the virtualization architecture. Any file systems that need to be mounted within the virtualized environment must be mounted by the host.

2. stime, adjtimex, settimeofday – Allow a privileged process to adjust the host's clock.

5. acct – Sets the file on the host to which BSD process accounting information should be written.

6. swapon, swapoff – Control swap space allocation.

8. reboot – Causes the system to reboot or changes the Ctrl-Alt-Delete functionality.

9. ioperm, iopl – Allow a privileged process to gain direct access to underlying hardware resources.

11. create_module, init_module, delete_module, query_module – Insert and remove kernel modules.

15. nfsservctl – Enables a privileged process inside a virtual environment to change the host's internal NFS server.

16. bdflush – Controls the kernel's buffer-dirty-flush daemon.

17. sysctl – A deprecated system call that enables runtime setting of kernel parameters.

18. clock_settime – Sets the realtime clock and is only usable by processes with privilege on a regular system.

A.2 Root-Squashed System Calls

These system calls are, in general, useful within a virtualized environment, but treat the privileged root user in a manner that breaks the virtualization abstraction. They can, however, be used without giving the root user any special privilege.

1. nice, setpriority, sched_setscheduler, sched_setparam – These system calls let a process change its priority. If a process is running as root (UID 0), it can increase its priority and freeze out other processes on the system. Therefore, we prevent any virtualized process from increasing its priority.

5. ioctl – This system call is a system call demultiplexer that allows kernel device drivers and subsystems to add their own functions that can be called from user space. But because functionality can be exposed that allows root to access the underlying host, all calls beyond a limited, audited safe set are squashed to user nobody, much as NFS does.

6. setrlimit – This system call allows processes running as UID 0 to raise their resource limits beyond what was preset, thereby allowing them to disrupt other processes on the system by using too many resources. We therefore prevent virtualized processes from using this system call to increase the resources available to them.

7. mlock, mlockall – These system calls allow a privileged process to pin an arbitrary amount of memory, thereby allowing a virtualized process to lock all available memory and starve all other processes on the host. We therefore squash a privileged process to user nobody when it attempts to call these system calls and treat it like an unprivileged process.

A.3 Option-Checked System Calls

These are system calls that are used within a virtualized environment, but can be used in a way that can break the virtualization. Therefore, the options passed to them are checked to ensure they are valid options for the virtualized environment.

1. mknod – This system call allows a privileged user to create special files, such as pipes, sockets, devices, and even regular files.
Because a privileged process needs to use this functionality, the system call cannot be disabled. However, if the process could create a device, the device would be an access point to the underlying host system. Therefore, when a virtualized process uses this system call, the options are checked to prevent it from creating a device special file, while allowing the other types; a small illustrative sketch of this check appears at the end of this appendix.

2. ustat – This system call returns information about a mounted file system, specifically how much free space remains. This can be useful for a process within a virtualized environment, but it can also provide information about host file systems that are not accessible to the processes within the virtualized environment. Therefore, the options passed to this system call are checked to ensure that they match the device of a file system available only within the virtualized environment.

3. quotactl – This system call sets a limit on the amount of space individual users can use on a given file system. Virtualized processes are only able to call it for file systems available within their environment.

A.4 Per-Virtual-Environment System Calls

These system calls are virtualized on top of the IPC, shared memory, and process namespace virtualization provided by Zap [100].

1. sethostname, gethostname, setdomainname, getdomainname, uname, newuname, olduname – These system calls read and write the name of the underlying host. We wrap these system calls to read and write a virtual-environment-specific name and allow each virtual environment to set the name independently.

8. socketcall – This system call provides access to the multitude of socket system calls available in the kernel. Because a secure virtualized environment provides each environment with its own network namespace, this system call is restricted to operating only on the namespace that belongs to the virtualized environment.

9. keyctl, add_key, request_key – These system calls affect the key management provided by the kernel. Because keys can be associated with user and group identifiers, they must be virtualized to a per-virtualized-environment namespace.

12. mq_open, mq_unlink, mq_timedsend, mq_timedreceive, mq_notify, mq_getsetattr – These system calls provide access to the kernel's POSIX message queues. Because they are used by name, they have to be virtualized on a per-environment basis.
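As a concrete illustration of the option checks in Section A.3, the following sketch shows the policy applied to mknod: named pipes, sockets, and regular files are permitted, while device special files are rejected. The real check is performed inside the kernel interposition layer; this user-space version, including the function name, is purely illustrative.

```python
#!/usr/bin/env python3
"""Illustrative sketch of the mknod option check described in Section A.3.

The actual check runs in the kernel's system call interposition layer; this
user-space version only demonstrates the policy: allow FIFOs, sockets, and
regular files, but reject device special files inside a virtualized environment.
"""
import stat

def mknod_allowed(mode: int) -> bool:
    """Return True if a virtualized process may create a node with this mode."""
    if stat.S_ISCHR(mode) or stat.S_ISBLK(mode):
        return False  # device nodes would expose the underlying host
    return stat.S_ISFIFO(mode) or stat.S_ISSOCK(mode) or stat.S_ISREG(mode)

if __name__ == "__main__":
    print(mknod_allowed(stat.S_IFIFO | 0o644))  # True: named pipes are allowed
    print(mknod_allowed(stat.S_IFCHR | 0o600))  # False: character devices are blocked
```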